Mozilla DB News, Fri 7 Sep – checksums

Well, it is getting closer and closer to the end of the 3rd quarter of 2012, and so we are starting to hunker down and make sure our goals get met. One of those goals is to start running regular checksums on our databases, so that when doing a switchover (a manual “failover” for maintenance purposes) or when taking backups we can ensure we have maintained the same level of data integrity that the master has. And, of course, monitor the output of those checksums with Nagios. So this week has seen a lot of fiddling with pt-table-checksum to get the appropriate values….without further ado, here’s the list of what the database team did this week:

  • Found out the hard way that slow_query_log=ON is silently ignored by MySQL, and that the syntax to turn on the slow query logs is slow_query_log=1 (even when log_output and slow_query_log_file are set properly).
  • Updated the Graphs database to be informed about Talos/Mountain Lion.
  • Fixed the problem where a getpersonas table locked up intermittently. The table locked up because it is MyISAM, but it had a FULLTEXT index on it. As it turns out, that FULLTEXT index was replaced by Elastic Search, so there was no problem dropping the FULLTEXT index and converting the table to InnoDB.
  • Fixed a problem where cron was not running on one of our backup servers.
  • Researched and debugged a problem with thimble having lost data. Unfortunately, that data was not recoverable.
  • Created development and stage database environments for the Bouncer application.
  • Converted all the Affiliates tables that had a default character set of latin1 to have a default character set of utf8 in dev, stage and production (character_set_server=utf8 has been set on that server for a while, but it was after those tables were created).
  • Deleted some spam comments from a Bugzilla bug.
  • As mentioned above, got checksums working, specifically for that Bugzilla database cluster.
  • Downloaded, updated, tested and submitted a patch back to the owners for PalominoDB’s Nagios check for pt-table-checksum table results, and queued the Nagios check up to be put into monitoring next week.
  • Rebuilt a metrics server for staging purposes.
  • Converted a development server to use innodb_file_per_table.
  • Checked the size of the Tinderboxpushlog database after last month’s defragmentation to see how much larger it became.
  • Used pt-query-digest to parse 40G (about 1/3 of what we have total) of general and slow query logs from one day of Bugzilla logging to come up with some metrics for which Bugzilla queries we are going to optimize in Q4 2012.
  • Tuned the innodb buffer pool on an auxiliary database for Addons database cluster to fix a bug.

These next few weeks will be exciting and fast-paced!

Well, it is getting closer and closer to the end of the 3rd quarter of 2012, and so we are starting to hunker down and make sure our goals get met. One of those goals is to start running regular checksums on our databases, so that when doing a switchover (a manual “failover” for maintenance purposes) or when taking backups we can ensure we have maintained the same level of data integrity that the master has. And, of course, monitor the output of those checksums with Nagios. So this week has seen a lot of fiddling with pt-table-checksum to get the appropriate values….without further ado, here’s the list of what the database team did this week:

  • Found out the hard way that slow_query_log=ON is silently ignored by MySQL, and that the syntax to turn on the slow query logs is slow_query_log=1 (even when log_output and slow_query_log_file are set properly).
  • Updated the Graphs database to be informed about Talos/Mountain Lion.
  • Fixed the problem where a getpersonas table locked up intermittently. The table locked up because it is MyISAM, but it had a FULLTEXT index on it. As it turns out, that FULLTEXT index was replaced by Elastic Search, so there was no problem dropping the FULLTEXT index and converting the table to InnoDB.
  • Fixed a problem where cron was not running on one of our backup servers.
  • Researched and debugged a problem with thimble having lost data. Unfortunately, that data was not recoverable.
  • Created development and stage database environments for the Bouncer application.
  • Converted all the Affiliates tables that had a default character set of latin1 to have a default character set of utf8 in dev, stage and production (character_set_server=utf8 has been set on that server for a while, but it was after those tables were created).
  • Deleted some spam comments from a Bugzilla bug.
  • As mentioned above, got checksums working, specifically for that Bugzilla database cluster.
  • Downloaded, updated, tested and submitted a patch back to the owners for PalominoDB’s Nagios check for pt-table-checksum table results, and queued the Nagios check up to be put into monitoring next week.
  • Rebuilt a metrics server for staging purposes.
  • Converted a development server to use innodb_file_per_table.
  • Checked the size of the Tinderboxpushlog database after last month’s defragmentation to see how much larger it became.
  • Used pt-query-digest to parse 40G (about 1/3 of what we have total) of general and slow query logs from one day of Bugzilla logging to come up with some metrics for which Bugzilla queries we are going to optimize in Q4 2012.
  • Tuned the innodb buffer pool on an auxiliary database for Addons database cluster to fix a bug.

These next few weeks will be exciting and fast-paced!