Other aop/dbrepair/lessons

Table Of Contents

Previous topic

Corruption incident tmp_ligs_offline_db : timeline and next steps

Next topic

NuWa Python API

This Page

Daya Bay Links

Content Skeleton

Lessons from MySQL corruption incident

Such rare events of DB corruption may recur no matter what we do. The improvements we implement in response to this should focus on preventive measures, mitigating the pain/disruption if it does recur and defining and documenting procedures for such incidents.

Preventive Avoidance

  • avoid writing more that we need to, the DqChannelStatus tables use a ruinously profligate schema (by a factor of 125) they are currently ~2350 MB (~14 months) they could be ~19 MB with no loss of information. As demonstrated by the size of DqChannelPacked.

    The probability of corruption probably scales with the time/volume of writes so it is no surprise that DQ tables are the first to see corruption.

  • disk space monitoring at least daily with email notifications on all critical nodes especially dybdb1.ihep.ac.cn and dybdb2.ihep.ac.cn, reorganization/purchases to avoid tight disk space

  • queue draining procedures for DB writing jobs

Large DB very different to small ones

Tools and procedures developed to handle DB of tens of megabytes mostly not applicable to multi GB databases. Required creation of new tools and optimised procedures.

Mitigating Pain/Disruption of corruption recovery

  • automatic daily backups + remote tarball transfers + operation monitor for all critical databases,

    • offline_db has been for many years,
    • channelquality_db recently implemented partial backup, operational and validated but still under close scrutiny

    Replication is not a replacement for backups as “drop database” gets propagated along the chain within seconds.

    The servers are claimed to have disk level backups. However these do not lock the DB during the backup and without regular tests that backups are recoverable from I do not trust them. The backups of offline_db are recovered onto an NTU node daily.

  • processes that perform critical tasks such as DB writing need logfile monitoring with email notifications when things go awry

TODO: Investigate DBI response to crashed payload OR validity table

Probably just need all unattended DB writing to check the written SEQNO is non-zero:

wseqno = wrt.Close()  # closing DybDbi writer returns SEQNO written OR zero on error
assert wseqno, wseqno

But checking DBI behavior when writing into a “crashed” payload table might suggest some DBI improvements beyond this usage change. Clearly the observered behavior of continuing to write into the validity table after the payload was corrupted is not appropriate.

A rapid job abort would have given a clean cut, and hopefully notified us of the problem sooner.

I have the crashed table in a tarball, so I can now reproduce DBI writing into a crashed table and observe the current error handling and see where improvements need to be made. Either in DBI/DybDBI or its usage/documentation.

SOP for MySQL corruption

As soon as corruption is detected, presumbaly from failed KUP jobs

  • stop writing to corrupted tables and other related tables until recovery is done and resumption is agreed by DB managers, sys admins, KUP job admins and table experts

    • could enforce readonly by appropiate GRANT... statements if controlling KUP jobs is problematic
  • check the state of the automated backups by a remote node restoration and comparison with source tables that remain acccessible, transfered to remote node via mysqldump OR mysqlhotcopy

    • perform mysqldump or mysqlhotcopy (possibly with some SEQNO excluded) and transfer to a remote node in which the they are recovered from
  • coordinate with table experts to decide on appropriate SEQNO cutoffs

  • repairs could be attempted but as long as the backup system was operational there is no need to rely on that working

Comparing tables sizes of uncompressed and compressed tables

Table DqChannelStatus contains the same information as DqChannelPacked:

mysql> select concat(table_schema,".",table_name),table_type, engine, round((data_length+index_length-data_free)/1024/1024,2) as MB  from information_schema.tables where substr(table_name,1,2) = 'Dq' ;
+------------------------------------------+------------+--------+---------+
| concat(table_schema,".",table_name)      | table_type | engine | MB      |
+------------------------------------------+------------+--------+---------+
| tmp_ligs_offline_db_0.DqChannelStatus    | BASE TABLE | MyISAM | 2265.14 |
| tmp_ligs_offline_db_0.DqChannelStatusVld | BASE TABLE | MyISAM |   20.24 |
| tmp_ligs_offline_db_1.DqChannelStatus    | BASE TABLE | MyISAM | 2349.86 |
| tmp_ligs_offline_db_1.DqChannelStatusVld | BASE TABLE | MyISAM |   20.24 |
| tmp_offline_db.DqChannelPacked           | BASE TABLE | MyISAM |   18.61 |
| tmp_offline_db.DqChannelPackedVld        | BASE TABLE | MyISAM |   18.87 |
+------------------------------------------+------------+--------+---------+
6 rows in set (0.01 sec)

mysql> select max(SEQNO) from tmp_offline_db.DqChannelPacked ;
+------------+
| max(SEQNO) |
+------------+
|     323000 |
+------------+
1 row in set (0.04 sec)

mysql> select max(SEQNO) from tmp_ligs_offline_db_1.DqChannelStatus ;
+------------+
| max(SEQNO) |
+------------+
|     340817 |
+------------+
1 row in set (0.06 sec)

mysql> select 2349.86/18.61 ;
+---------------+
| 2349.86/18.61 |
+---------------+
|    126.268673 |
+---------------+
1 row in set (0.00 sec)