Monitoring

Infrastructure servers at IHEP, NTU and NUU are set up to cooperate automatically, sending and receiving backup tarballs and monitoring each other's operation. The servers involved are outlined in Backups Overview.

Monitoring means reacting to the notification emails that a large number of cron-invoked scripts send when they detect error conditions or monitoring irregularities. The objective is to keep the scripts operating continuously, day after day.

Issues normally occur following IHEP server reboots, after which Qiumei needs to be notified and helped to get the scripts back into operation. These common issues are detailed in SSH Setup For Automated transfers.

Debugging and scripting skills, together with dogged persistence, are needed to chase down the causes of problems and to perform remote system administration by proxy. IHEP rules prevent root access from being granted to foreign collaborators, so debugging tends to be a laborious process conducted through email exchanges with Qiumei.

crontabs

The collection of crontabs from the collaborating nodes provides the best reference for the tasks being performed. The typical first action on receiving a notification email is to examine the cron logs.
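A first pass over a cron log directory can be partly automated. The following is a minimal sketch, not part of the env tooling: the function name `scan_cron_logs`, the 26 hour staleness threshold and the list of error markers are all assumptions made for illustration, since the log formats of the individual scripts are not described here.

```python
import os
import time

# Hypothetical markers: the real scripts' log formats are not documented here.
ERROR_MARKERS = ("error", "traceback", "permission denied", "connection refused")


def scan_cron_logs(log_dir, max_age_hours=26.0, now=None):
    """Return (name, age_hours, is_stale, hits) for each *.log file in log_dir.

    is_stale is True when the log was last written more than max_age_hours
    ago, which for a daily job suggests that cron did not run it.
    hits is a list of (line_number, text) for lines containing a marker.
    """
    now = time.time() if now is None else now
    report = []
    for name in sorted(os.listdir(log_dir)):
        path = os.path.join(log_dir, name)
        if not (os.path.isfile(path) and name.endswith(".log")):
            continue
        age_hours = (now - os.path.getmtime(path)) / 3600.0
        hits = []
        with open(path, "r", errors="replace") as fh:
            for lineno, line in enumerate(fh, start=1):
                lowered = line.lower()
                if any(marker in lowered for marker in ERROR_MARKERS):
                    hits.append((lineno, line.rstrip()))
        report.append((name, age_hours, age_hours > max_age_hours, hits))
    return report
```

Calling `scan_cron_logs("/home/blyth/cronlog")` on a node would list each log with its age and any suspicious lines. Hourly jobs would need a much smaller `max_age_hours` than daily ones.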

blyth@cms01.phys.ntu.edu.tw

[blyth@cms01 cronlog]$ crontab -l
#  see roots crontab
SHELL=/bin/bash
HOME=/home/blyth
ENV_HOME=/home/blyth/env
CRONLOG_DIR=/home/blyth/cronlog
DAILY_SCRIPTS=/data/env/local/dyb/trunk/daily/scripts
MAILTO=blyth@hep1.phys.ntu.edu.tw
PATH=/home/blyth/env/bin:/data/env/system/python/Python-2.5.1/bin:/usr/bin:/bin
LD_LIBRARY_PATH=/data/env/system/python/Python-2.5.1/lib
#
# dybsvn altbackup (using scp) [NB another cron job at IHEP uses other arguments with the altbackup script]
30 15 * * * ( . $ENV_HOME/env.bash ; env- ; python- source ; ssh-- ; $ENV_HOME/scm/altbackup.sh $HOME/cronlog/altbackup.log dump check_target ) > $CRONLOG_DIR/altbackup_.log 2>&1

# offline_db backup monitoring
08 09 * * * ( . $ENV_HOME/env.bash ; env- ; python- source ; db- ; db-backup-recover offline_db dybdb1.ihep.ac.cn ; db-test ) > $CRONLOG_DIR/db-backup-recover-offline_db-dybdb1.log 2>&1
40 05 * * * ( . $ENV_HOME/env.bash ; db- ; db-backup-rsync-monitor )                                                      > $CRONLOG_DIR/db-backup-rsync-monitor.log 2>&1

# planting of daily symbolic links
15 18 * * * ( cd /data/env/local/dyb/trunk ; python installation/trunk/dybinst/scripts/slvmgr.py --diabolic dybinst  )  > $CRONLOG_DIR/diabolic.log 2>&1

# env repo monitoring
42 * * * * ( valmon.py -s envmon rec rep mon ) > $CRONLOG_DIR/envmon.log 2>&1

# disk space monitoring
52 * * * * ( valmon.py -s diskmon rec rep mon ) > $CRONLOG_DIR/diskmon.log 2>&1
20 * * * * ( valmon.py -s diskmon_slash rec rep mon ) > $CRONLOG_DIR/diskmon_slash.log 2>&1

# channelquality_db backup monitoring
05 13 * * * ( valmon.py -s dbsrvmon rec rep mon ) > $CRONLOG_DIR/dbsrvmon.log 2>&1
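When reading these listings it helps to separate the schedule from the command. The sketch below is an illustration only: `describe_schedule` is a hypothetical helper that handles just the forms appearing in these crontabs (a literal minute, a literal or `*` hour, and `*` in the day, month and weekday fields), not the full crontab syntax.

```python
def describe_schedule(line):
    """Summarise the schedule of one crontab line, or return None.

    Handles only the forms seen in these listings; ranges, lists and
    step values from the full crontab syntax are deliberately ignored.
    """
    stripped = line.strip()
    if not stripped or stripped.startswith("#"):
        return None  # blank line or comment, including '##' disabled jobs
    fields = stripped.split(None, 5)
    if len(fields) < 6:
        return None  # e.g. NAME=value environment settings
    minute, hour, dom, month, dow, _command = fields
    if (dom, month, dow) != ("*", "*", "*"):
        return None  # schedule form not covered by this sketch
    if not minute.isdigit():
        return None
    if hour == "*":
        return "hourly at minute %02d" % int(minute)
    if hour.isdigit():
        return "daily at %02d:%02d" % (int(hour), int(minute))
    return None
```

For example, the `30 15 * * *` altbackup entry is summarised as "daily at 15:30" and the `42 * * * *` envmon entry as "hourly at minute 42". Cron normally interprets these times in the local time zone of the node running the job.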

root@cms01.phys.ntu.edu.tw

[blyth@cms01 cronlog]$ sudo crontab -l
SHELL = /bin/bash
# avoid huge logs from the daily recovery clogging the disk
50 18 * * * ( cd /var/log/mysql ; echo root-cron-truncate-$(date) > log )
40 17 * * * /usr/sbin/ntpdate pool.ntp.org

root@cms02.phys.ntu.edu.tw

[root@cms02 log]# crontab -l
SHELL = /bin/bash
#
# backup and offbox transfers of env+aberdeen+.. SVN/Trac instances
31 15 * * *  ( export HOME=/root ; export NODE=cms02 ; export MAILTO=blyth@hep1.phys.ntu.edu.tw ; export ENV_HOME=/home/blyth/env ; . /home/blyth/env/env.bash ; env-  ; scm-backup- ; scm-backup-nightly ) >  /var/scm/log/scm-backup-nightly-$(date +"\%d").log 2>&1
31 16 * * *  ( export HOME=/root ; export NODE=cms02 ; export MAILTO=blyth@hep1.phys.ntu.edu.tw ; export ENV_HOME=/home/blyth/env ; . /home/blyth/env/env.bash ; env-  ; scm-backup- ; scm-backup-tgzmon ) >  /var/scm/log/scm-backup-tgzmon-$(date +"\%d").log 2>&1
#
# monitoring for an out-of-memory issue that strikes every few months
50 * * * * ( export HOME=/root ; /home/blyth/env/db/valmon.py -s oomon rec mon ; ) > /var/scm/log/oomon.log 2>&1
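The two scm-backup entries above redirect their output to a file named with `$(date +"\%d")`, the two-digit day of the month. The backslash is needed because cron treats an unescaped `%` in the command field as a newline. The sketch below illustrates the resulting naming scheme; `nightly_log_path` is a hypothetical helper written for this example, with the directory taken from the listing.

```python
import datetime


def nightly_log_path(job, day, log_dir="/var/scm/log"):
    """Return the log path that a job writes on the given date.

    Mirrors the shell expression job-$(date +"%d").log from the crontab:
    only the day of the month appears in the file name.
    """
    return "%s/%s-%s.log" % (log_dir, job, day.strftime("%d"))


def distinct_log_names(job, year):
    """Collect every log name the job would produce during one year."""
    day = datetime.date(year, 1, 1)
    names = set()
    while day.year == year:
        names.add(nightly_log_path(job, day))
        day += datetime.timedelta(days=1)
    return names
```

Because only the day of the month is used, `distinct_log_names` never yields more than 31 names: each log is overwritten when the same day number comes round again, so the directory holds a rolling window of roughly one month without any separate cleanup job.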

blyth@dayabay.ihep.ac.cn

[dayabay] /home/blyth > crontab -l
SHELL=/bin/bash
HOME=/home/blyth
ENV_HOME=/home/blyth/env
CRONLOG_DIR=/home/blyth/cronlog
NODE_TAG_OVERRIDE=WW
#
# backup of Trac+SVN tarballs to NTU
00 13 * * * ( . $ENV_HOME/env.bash ; env- ; python- source ; ssh-- ; $ENV_HOME/scm/altbackup.sh $HOME/cronlog/altbackup.log dump check_source transfer purge_target ) > $CRONLOG_DIR/altbackup_.log 2>&1
#
# checking the ssh-agent, the usual cause of
21 14 * * * ( . $ENV_HOME/env.bash ; env- ; python- source ; ssh-- ; ssh--agent-monitor root ) > $CRONLOG_DIR/ssh--agent-monitor.log 2>&1
#
# former backup approach, no longer used as rsync is too susceptible to network gnome blockages
##01 04 * * * ( . $ENV_HOME/env.bash ; env- ; python- source ; scm-backup- ; scm-backup-checkscp ; scm-backup-rsync ; scm-backup-rls  ) > $CRONLOG_DIR/scm-backup-rsync.log 2>&1
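Since problems commonly follow server reboots and the crontab above includes an ssh-agent check, a quick manual test of the agent is often a useful starting point. The sketch below is an assumption about what such a check could look like; it is not the implementation of `ssh--agent-monitor`, and `agent_socket_status` is a hypothetical name. It only inspects the `SSH_AUTH_SOCK` environment variable that ssh clients use to locate the agent.

```python
import os
import stat


def agent_socket_status(environ=None):
    """Classify the ssh-agent socket named by SSH_AUTH_SOCK.

    Returns one of 'unset', 'missing', 'not-a-socket' or 'ok'.
    'ok' only means that a socket file exists; it does not prove that
    the agent is alive or that it holds any keys.
    """
    environ = os.environ if environ is None else environ
    sock = environ.get("SSH_AUTH_SOCK")
    if not sock:
        return "unset"
    try:
        mode = os.stat(sock).st_mode
    except OSError:
        return "missing"
    return "ok" if stat.S_ISSOCK(mode) else "not-a-socket"
```

A result of 'unset' or 'missing' in the cron environment after a reboot would suggest that the agent needs restarting. Running `ssh-add -l` is a natural follow-up when the result is 'ok', as it reports whether the agent can be contacted and which identities it holds.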

root@dybdb1.ihep.ac.cn managed by Qiumei

  1. offline_db backup and rsync

root@dybdb2.ihep.ac.cn managed by Qiumei

  1. offline_db backup and rsync
  2. channelquality_db backup and rsync