Mozilla IT News, Thu 18 October

It has been several weeks since the last time I did a DB roundup, mostly due to travel. I spoke at Nagios World Conference 2012 at the end of September in St. Paul, Minnesota, and then gave a keynote and 2 talks at MySQL Connect, which was followed by Oracle OpenWorld and a Web Operations Team Meetup that the DB team was invited to participate in. Since I was in San Francisco anyway, I figured I’d stay for the meetup and the millionaire’s bacon (it was epic).

We have been busy, and getting some great and productive stuff done.

  • Upgraded the databases behind our critical bouncer application.
  • Put the bouncer databases into puppet.
  • Started a regular run of pt-table-checksum on the bouncer, addons and support databases, complete with monitoring using our updated version of PalominoDB’s check_table_checksums Nagios check.
  • Updated data in the datazilla database.
  • Eliminated one of our legacy puppet MySQL modules.
  • Started using pt-config-diff to see differences between MySQL’s running configuration and its configuration file. Right now we e-mail if there is a difference, but the next step is to make it into a Nagios check (there is an example invocation after this list).
  • Created a database for an internal repository related to desktop encryption.
  • Updated our developer tools master/master cluster to use auto_increment_increment and auto_increment_offset properly, via puppet.
  • Added a new node to the graphs database. Also added the nodes for the panda board project.
  • Debugged a problem with the plugin check databases due to our Kohana ORM issuing unnecessary SHOW COLUMNS queries upon every connection.
  • Completed the cleanup of the MySQL and PostgreSQL ACLs to ensure there were no legacy ACLs from our data center move.
  • Gave Selena and Paula (a metrics dashboard implementer) access to Postgres databases, as well as access for a django role account.
  • Dealt with two of our backup systems starting to run out of space; we are spinning up new VMs to accommodate the increased disk footprint.
  • Did our quarterly crash-stats purge.
  • Applied a fix to all our postgres databases that corrected a possible data corruption issue.
  • Created a database and users for the moztrap test case system.
  • Increased max_allowed_packet on the Bugzilla databases when a mass update of 500 bugs caused the application to display the “Got a packet bigger than ‘max_allowed_packet’ bytes” error.
  • Manually refreshed the graphs staging database with production data.
  • Fixed a bug in a script that caused a file to copy over the script itself.
  • Added an ACL in puppet for the crash-stats dev database.
  • Ran a manual backfill on the crash-stats database when a weekly cron job failed.
  • Added an index to a table in Tinderbox pushlog in dev and production, so that data purges went more quickly.
  • Completed upgrading the drivers for all of our Fusion-io SSDs to make them faster with MySQL.
  • Upgraded our stage database to MySQL 5.1, and converted it to be innodb_file_per_table (almost everything else in the Mozilla ecosystem is innodb_file_per_table).
  • Did our monthly Tinderbox pushlog purge (sadly we use foreign keys so we cannot use partitioning and automation here).
  • Upgraded half of the webdev database cluster to MySQL 5.1 and put it under puppet control.
  • Created an external database for our Vidyo installation to use, and imported data from our existing embedded Vidyo database.
  • Put the checksumming wrapper script into puppet, which has made it much easier to deploy checksumming to more systems.
  • Resynced a slave cluster of addons that is only used for checking versions. Ran into a tricky problem with puppet where each replicate_wild_do_table and replicate_do_table option had to be its own line in the my.cnf, such as:
    replicate_do_table=foo
    replicate_do_table=bar
    replicate_do_table=baz

    But we solved the tricky config problem and made an array that gets split into individual entries. Yay puppet!
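
    For the curious, here is a minimal sketch of the template trick (the variable name and table names are hypothetical, not our actual puppet module):

    # my.cnf.erb fragment; @replicate_do_tables is an array such as
    # ['db1.foo', 'db1.bar', 'db1.baz'] set in the node definition
    <% @replicate_do_tables.each do |tbl| -%>
    replicate_do_table=<%= tbl %>
    <% end -%>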

  • Also resync’d the addons staging cluster.
  • Fixed some problems with malformed UTF-8 in a few Bugzilla bugs.
  • Lowered the innodb_buffer_pool_size on a cluster so the cluster was not continually swapping.
  • Retired “Rock Your Firefox” (no link needed, it’s retired!)
  • Added permissions to the new Bugzilla fields for metrics.
  • Removed over 9,000 buildbot builds from a MySQL-based scheduling queue that had gotten stale.
  • Started to create documentation that is linked from Nagios alerts, to better streamline our responses to pages: in particular, actionable steps our site reliability engineers (SREs, who are on call) can take until a DBA can get to a terminal.
  • Dealt with interesting undo log corruption on our dev database cluster, which left a large number of tables in a curious state: they could be read from and written to, but they could not be dropped (and some could be truncated while others could not).
  • Worked on getting the Percona Toolkit installed on our puppetized machines.
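
As promised in the pt-config-diff item above, here is a sketch of the kind of invocation we run (the config path and DSN are generic examples, not our actual hosts):

pt-config-diff /etc/my.cnf h=localhost

pt-config-diff exits non-zero when the configurations differ, which makes it easy to hook into a cron job that e-mails now, and into a Nagios check later.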

I have tomorrow off for some personal fun (a sheep and wool festival), so I figured I would publish this today, lest I go another week without publishing this!

Percona Live: MySQL Conference and Expo Call for Papers Extended!

The call for papers for Percona Live: MySQL Conference and Expo 2013 has been extended through October 31st. The conference will be held in Santa Clara, California from Monday, April 22nd through Thursday April 25th (and this year it’s not during Passover!).

Why You Should Submit
Percona Live is a meeting of the minds – not just the April Santa Clara conference, but all the Percona Live conferences. If you get a proposal accepted, you get free admission to the conference!

There is no cost to submit, and you do not have to tell anyone you submitted. I have submitted to conferences and been rejected – it stinks. But there is no reason not to submit. Submit a few presentations on different topics, because the presentation you have in mind might be submitted by other people too. If you have a presentation about backups rejected, it might be that someone else is doing a presentation on backups. Try to find something unique that nobody else is doing – I have not seen anything close to my talk on White-Hat Google Hacking, for example.

Submitting Your Project
I am not on the conference content committee, but I have been on content committees for the former O’Reilly MySQL Conference, OSCon, and several other conferences. If you are submitting a proposal about a project you have, or a product your company makes, it is not too hard to make a presentation that is worthy of acceptance. All you need to do is give a talk where the audience learns, even if they have no interest in your product.

Let me go a bit more into this: you have this project/product related to MySQL. It solves a problem – usually that problem is a missing feature in MySQL. An introductory presentation about your project/product should talk about traditional ways to solve the problem within MySQL, and then for 10-15 minutes at the end, talk about your project/product. This way, even someone who will never use your product/project will learn something.

Let’s say you have a NoSQL solution. If you have a key-value store, talk about the problems of using unstructured or semi-structured data in MySQL, including how it can be done with MySQL (having big tables with 2 fields, keys and values). Or if you have a document or graph storage solution, talk about the problems of storing blobs of text or navigating through graphs while trying to use MySQL. That should take up 30 of the 45 minutes in a session, and the last 10 minutes can be explaining how your solution makes those problems easier to solve (with 5 minutes at the end for questions).

Why I Am Not Submitting
The main reason I am not submitting to Percona Live: MySQL Conference and Expo is a protest of the conference itself. This specific Percona Live conference was born out of Percona being unwilling to work together with members of the IOUG MySQL Council along with every other major vendor in the MySQL Space (Oracle, Monty Program, SkySQL, Continuent, Tokutek, and more) to make a conference that would be useful for everyone. Percona refused to work with us (the core team being myself, Giuseppe Maxia and Sarah Novotny) and made their own conference. By calling it “Percona Live” they ensured that Oracle would not be able to send representatives, because Percona is a direct competitor to Oracle (it can be argued that they called it “Percona Live” because it’s a continuation of their conference series; however, the fact remains that the name ensures Oracle cannot send employees). Percona refused to change the branding of the event, which was the one block Oracle had against sending their employees.

Percona Live is not the place to see MySQL engineers within Oracle, which has more software engineers working on MySQL than Percona and Monty Program combined. Roadmap discussions are rare, and only happen when a community member decides to do the best they can and figure out the roadmap. Therefore, Percona Live: MySQL Conference and Expo is a deliberate move by Percona to fracture the community.

But there are plenty of other very big reasons I am not submitting to Percona Live:


  • I do not need to go. For several years I worked in consulting firms, and Percona Live is a great place to meet potential new customers; now that I am not consulting, I do not have that need.

  • A big reason I speak is to help people learn. I am doing that in many ways – blogging, publishing the weekly OurSQL podcast, and next year from January through the end of March, teaching MySQL beginners through the free MySQL Marinate program.

  • I am also submitting to speak at conferences I have never spoken at (or not spoken at in several years) such as SCALE (the Southern California Linux Expo), Confoo and some Oracle conferences such as the annual one held by the New Zealand Oracle User Group. And those conferences are just for the first quarter of 2013.

  • Percona Live is full of fantastic speakers. Other conferences, like the ones listed in the previous bullet point, do not traditionally have the same level of MySQL expertise at them. By speaking at those conferences, I can bring in something that’s missing. Speaking at Percona Live, I am one of many MySQL experts. I feel I have more value to give the other conferences.

  • All these projects are a lot of work. By not speaking at Percona Live: MySQL Conference and Expo, I can hopefully restore some of the energy that I use for the free MySQL Marinate program, the weekly OurSQL podcast, and the other conferences I am (or hope to be) speaking at.

At the end of the day, my personal mission is to help people, and not speaking at Percona Live: MySQL Conference and Expo really does not change that mission. Hopefully this post explains my reasons for not submitting, so that 1) the community does not think my proposals were rejected when my name does not appear on the speaker list and 2) if the conference committee was wondering why they had not seen proposals by me, now they know why, and that they will not be seeing any from me.

The Ops Benefit of the Cloud

Last week, Baron wrote a great post entitled “What’s the benefit of the cloud?” The post was short and made the point that “the benefit of the cloud” is “less ops, more dev.” But Baron is coming from the point of view of a developer; from the point of view of an ops person, there is not necessarily “less ops”.

Some commenters made points along the lines of, “you can just rent rack space in some datacenter for that.” And I agree. There are some ops benefits that Amazon adds, such as easier monitoring and backups, but for the most part, there is not *less* work from an operations standpoint when you are in a cloud environment – my time doing remote DBA work at Pythian and PalominoDB certainly taught me that!

There are still operating systems to install, maintain and upgrade. There are still compatibility issues, and software and configurations to upgrade and maintain. There are software-as-a-service (SaaS) technologies like Amazon’s RDS (which provides MySQL as a service), where the benefit is not having to worry about configuration or upgrades. There are Amazon machine images (AMIs) that folks share, so that the initial operating system installation requires little knowledge.

The cloud is really useful if you need a machine up and running very quickly. I totally understand that developers want to use the cloud instead of waiting for a machine to arrive, and even in IT it is useful to have another machine for a while. For example, if you wanted to test MySQL 5.6 but do not have a spare machine, you can spin up an instance of a machine in the cloud.

From a production or staging perspective though, there is still a LOT of work to be done to architect a system. The ops benefit of the cloud is NOT “less ops”. The ops benefit of the cloud is actually thanks to how Amazon built its cloud – it was built as a cloud computing platform. The Amazon cloud was built to provide extra CPU cycles (“Elastic Compute Cloud”). In the days before persistent data stores with the Elastic Block Store (EBS), many developers and system administrators lost time when an instance would reboot and all their work was gone – not just the development work, but their setup work: the operating system users, any software packages they had installed, etc. Running any important system in the cloud, these days, means having some kind of installation and configuration management in place, so that if an instance reboots or if another instance is needed, the rebooted/new instance can be brought back to a working state as quickly as possible.

Some folks get their environment set up how they want it, and take a snapshot that can be used as an Amazon machine image (AMI). This works great, until you need to update any software or make any changes in configuration, whether it’s operating system configuration (like adding a user), or software configuration (like a my.cnf file).

The benefit of the cloud from an ops side is that it forces us to do what we should be doing anyway – running installation and configuration management. At Mozilla, we have a plan to move some services to “the cloud”, but we are already using kickstart for installation management and Puppet for configuration management, so we are already set with those benefits.

MySQL Connect Liveblog: Big Data is a Big Scam: Most of the Time

I was very excited about this session at MySQL Connect by Daniel Austin of PayPal. I have been talking about this session for a few weeks, on 2 different podcasts and in 2 different blog posts. But I was a bit nervous, because the description was fantastic but the talk itself could have fallen apart.

After seeing the keynote, I knew the talk would be fantastic. I was not disappointed.

Big myths about Big Data:
PayPal problem – “How do we manage reliable distribution of data across geographical distances?”

The first thing people think of when they think of “big data” is “NoSQL”. NoSQL provides a solution that relaxes many of the common RDBMS constraints – too slow, requires complex data management like Sarbanes-Oxley (SOX), costly to maintain, slow to change and adapt, intolerant of CAP models.

NoSQL systems are non-relational models, usually (not always) key-value stores. They may be batched or streaming, and they are not necessarily distributed geographically (but they are at PayPal).

Big data myth #1 – Big data = nosql
Big data refers to a common set of problems – large volumes, high rates of change – data/data models/presentation and output.
Often big data isn’t just big, it’s that it needs to be FAST too. Things like near real-time analytics, or mapping complex structures.

3 kinds of big data systems:
1) Columnar key-value systems – Hadoop, HBase, Cassandra, PNUTS
2) Document-based – MongoDB, Terracotta
3) Graph-based – Voldemort, FlockDB. These are the more interesting ones; the other 2 are a bit more “brute force” according to Daniel.

big data hype slide

The CAP theorem (Daniel Abadi added latency)
The nice sound bite is:
“You can’t really trade availability for consistency, because if it’s not available you have no idea if it’s consistent or not.”

Do you need a big data system?
What’s your problem – one of 2:
1) I have too much data and it’s coming in too fast to handle with any RDBMS
(e.g. sensor data)

2) I have a lot of data distributed geographically and need to be able to read and write from anywhere in near real-time. (PayPal’s problem)

If you have one of those 2 problems, you may have a problem that can be solved with NoSQL.

Another myth busted: Big Data and NoSQL are not new ideas. DNS was the first and most successful such system, created in 1983 [Sheeri says: memcached is NoSQL – a key/value store]:

big data/nosql is not new

YESQL: A counter example. The mission – develop a globally distributed db for user-related data. Here are the constraints:
– Must not fail
– Must not lose data (it’s your MONEY!!)
– Must support transactions
– Must support (some) SQL
– Must write, then read, a 32-bit integer globally in 1000ms (1 sec)
– Max data volume: 100 TB
– Must scale linearly with costs

Speed constraints:
Max lightspeed distance on earth’s surface – 67ms. Target – data available worldwide in 1000ms

They chose to use MySQL Cluster because:
– True HA by design
– …with fast recovery
– Supports (some) transactions
– Relational model
– In-memory architecture, which translates to high performance
– Disk storage available for non-indexed data
– APIs to make things easier. Can’t just use ODBC or JDBC for this, need high performance APIs.

There are cons to MySQL cluster:
– some semantic limitations on fields (already lifted, but weren’t when PayPal was looking for a solution)
– Size constraints (about 2 TB) – this was back when Cluster couldn’t do 64-bit, so it is resolved now.
– Hardware constraints
– Higher cost/byte
– Requires reasonable data partitioning
– Higher complexity

They use circular replication/failover with cluster. They have 4 nodes, talking to each other, keeping themselves in sync. If node C fails, node B can talk to node D – that’s what this pic shows:

circular replication/failover with mysql cluster

When C comes back up you have to move it back to the *end* of the replication flow chain so it can catch up.

Availability defined – availability of the entire system is: [formula on slide]

Built this in Amazon Web Services (AWS)
– Why AWS? Cheap and easy infrastructure-in-a-box – or so they thought!
Services used:
– EC2, CentOS 5.3, small instances for the management (mgm) and query nodes, XL instances for data (4×4 with 24G each; each “tile” is 96G RAM)
– Elastic IPs/ELB
– EBS volumes (they used to have to use dd to move images from one AWS data center to another)
– S3
– Cloudwatch for monitoring

Architectural tiles – developed in a paper with Donald Knuth. Picture on this slide:
– Never separate NDB and SQL
– 2 NDB (aka data) nodes for every SQL node, for every 1 management node
– For scaling, bring up a new tile, not just a new machine – they use a RightScale template
– Failover first to the nearest availability zone, then to the nearest data center
– At least one replica for every availability zone
– No shared nodes
– Some management nodes are redundant, that’s OK
– AWS is network-bound at 250 Mb per second!
– Need specific ACL across availability zone boundaries
– AZs are not uniform
– No GSLB – global server load balancing
– Dynamic IPs
– ELB sticky sessions are unreliable – this is fixed now in AWS
– con: have to upgrade the whole tile at once

Other tech considered:
– Paxos – elegant-but-complex consensus-based messaging protocol. Used in Google Megastore, Bing metadata
– Java Query caching – queries as serialized objects – but not yet working
– Multiple ring architectures, but those are even more complicated.

System r/w performance:
– 23- and 256-byte char fields
– reads/writes/query speed vs. volume
– data replication speeds

Results:
– global replication in under 350 ms
– 256 bytes read in under 1000 ms worldwide.

Data models and query optimization
– network latency (obvious issue)
– data model requires all segments present in each geo-region
– parameterized (linked) joins – the adaptive query localization (SIP) technique from Clustra (see Frazer Clement’s blog for details)

They went around the international date line the wrong way at first… commit ordering matters!
The order in which you do writes vs. reads is important! Writes don’t always complete at the moment you start them.

Be careful:
– with “eventual consistency”-related concepts
– ACID/CAP are not really as well-defined considering how often we invoke them
– MySQL Cluster is good, b/c it has real HA, real SQL. Notable limits around fields, datatypes, but successfully competes with NoSQL for many use cases, often is better
– NoSQL has relatively low maturity, MySQL Cluster is much more mature.
– Don’t be a victim of Technological Fashion!

Future directions:
– Alternatives using Pacemaker, Heartbeat (using InnoDB, Yves Trudeau at Percona)
– Implement the memcached plugin – add simple connection-based persistence to preserve connections during failover
– better monitoring
– better data node distribution

Summing up on “YESQL 0.85”:
– it works, better than expected!
– very fast
– very reliable
– reduced complexity since 0.7
– AWS poses challenges that private data centers might not have

Only use big data solutions when you have a REAL big data problem. Not all big data solutions are created equal. Think about which tradeoffs are important – consistency, fault tolerance, etc.
– You can achieve high performance and availability without giving up on relational models…

Maynard Keynes on “NoSQL Databases”: In the long run, we are all dead (eventually consistent).

My Thoughts About MySQL 5.6

If you are reading this blog post, you are probably not at MySQL Connect. You may have heard about today’s new release – MySQL 5.6.7. This is a release candidate quality release, and if Oracle treats MySQL like the rest of its software, that means that there will very likely be a 5.6 GA by the end of 2012.

That all being said, is MySQL 5.6 worth upgrading to, once it’s GA? Probably the most compelling reason to upgrade is InnoDB online DDL – including online add/drop indexes (including foreign keys) and online add/drop/rename columns.

There are some great InnoDB performance enhancements, which you can read about if you are inclined to look further into it. Those are interesting, but it’s hard to say how much improvement any one organization will get until they actually test their system. So I won’t go into it too much until I have had time to see if Mozilla would benefit from it. Similarly, the fact that MySQL can now support parallel threading up to 48 cores is also great – Oracle tested on a 96-core server and got 48 cores working in parallel.

One of the most commonly used SQL extensions has gotten lots of new features added – EXPLAIN. In MySQL 5.6 you can now use EXPLAIN on SELECT, UPDATE and DELETE queries. There is also a visual EXPLAIN output and the output can be stored in JSON format. Here is a simple example of the new syntax and format:

mysql> EXPLAIN FORMAT=JSON DELETE FROM dup_index WHERE id=1\G
*************************** 1. row ***************************
EXPLAIN: {
   "query_block": {
     "select_id": 1,
     "table": {
       "delete": true,
       "table_name": "dup_index",
       "access_type": "range",
       "possible_keys": [
         "id",
         "id_2"
       ],
       "key": "id",
       "key_length": "5",
       "rows": 1,
       "filtered": 100,
       "attached_condition": "(`version`.`dup_index`.`id` = 1)"
     }
   }
}
1 row in set (0.00 sec)

Personally, I am pretty excited about the new security features of MySQL 5.6. The biggest one, which is a pretty big change to watch out for when upgrading, is that secure_auth defaults to on, unless you specify skip-secure-auth in the configuration. This means that when you upgrade, any user in the old password format (the password hash is 16 characters) will be blocked.
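
One way to find the accounts that would be blocked, before you upgrade, is to look for those 16-character hashes in the grant tables:

mysql> SELECT user, host FROM mysql.user WHERE LENGTH(password) = 16;

Any accounts this returns should get their passwords reset in the new format first.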

Other security features have to do with passwords – in MySQL 5.6 you can force a user to do a password change the next time they log in (great for first-time logins, and no other commands will work until the password is changed), you can set a password expiration, and you can set a password strength that has to be met.
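
For example, forcing a password change and requiring password strength look like this in 5.6 (the account name is a made-up example; validate_password is the plugin that ships with the server):

mysql> ALTER USER 'appuser'@'localhost' PASSWORD EXPIRE;

and in the configuration file:

plugin-load-add=validate_password.so
validate_password_policy=MEDIUM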

MySQL will also warn you when you set a replication password without using SSL, or when it is stored in cleartext. For example, the normal setting of replication’s username and password will generate the following 2 notes:

mysql> SHOW WARNINGS\G
*************************** 1. row ***************************
Level: Note
Code: 1759
Message: Sending passwords in plain text without SSL/TLS is extremely insecure.
*************************** 2. row ***************************
Level: Note
Code: 1760
Message: Storing MySQL user name or password information in the master.info repository is not secure and is therefore not recommended. Please see the MySQL Manual for more about this issue and possible alternatives.
2 rows in set (0.00 sec)

In MySQL 5.6, you can now store replication information in a table, not just in master.info.
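
A minimal sketch of the my.cnf settings that switch to table-based repositories:

master_info_repository=TABLE
relay_log_info_repository=TABLE

With these set, the information lives in the mysql.slave_master_info and mysql.slave_relay_log_info tables (which need to be InnoDB to be crash-safe).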

I am also excited about having checksums in replication. Using pt-table-checksum can get tedious, and it only finds inconsistencies after the fact; it doesn’t prevent the inconsistencies or give an error exactly when an inconsistency occurs.

Another really nice replication change is that you can control row-based binary logging so it only logs a change in a row, not the entire changed row itself. This reduces overhead in row-based replication by a lot.
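
Both the replication checksums and the row image behavior are controlled by new server variables; a minimal my.cnf sketch (the values shown are my starting points, not universal recommendations):

binlog_checksum=CRC32
master_verify_checksum=ON
binlog_row_image=minimal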

There are some nice little touches that show that Oracle is going in the right direction with MySQL – for example, in MySQL 5.6, innodb_file_per_table is enabled by default. And there is a new feature that warns you with a “note” if you create a duplicate index:

mysql> ALTER TABLE dup_index ADD INDEX(id);
Query OK, 0 rows affected, 1 warning (0.01 sec)
Records: 0 Duplicates: 0 Warnings: 1

mysql> show warnings\G
*************************** 1. row ***************************
Level: Note
Code: 1831
Message: Duplicate index 'id_2' defined on the table 'version.dup_index'. This is deprecated and will be disallowed in a future release.
1 row in set (0.00 sec)

This note only appears if you make an index with the same fields as another index; if you create an index that’s a prefix subset of another index, there is no warning (e.g. if you have an index on (a,b) and create an index on (a), there is no warning). Still, it is a good step in the right direction.

By default, sql_mode is no longer blank:
mysql> show variables like 'sql_mode'\G
*************************** 1. row ***************************
Variable_name: sql_mode
Value: NO_ENGINE_SUBSTITUTION
1 row in set (0.00 sec)

If you use statements like UPDATE...LIMIT x, you can fill up your error logs with messages that the transaction is “unsafe”. There is now a warning suppression system: after 50 warnings in 50 seconds, the warnings are aggregated into a summary of “X warnings in Y seconds”.

Other neat features I think I will make use of are:
– sync_binlog is less resource-intensive
– transportable tablespaces
– being able to specify locations for .ibd files
– multiple InnoDB buffer pools

All in all, MySQL 5.6 is a release to look forward to. I have not covered every change in MySQL 5.6, but the major ones that I am looking forward to. Others may have different priorities and reasons for wanting to move to MySQL 5.6. You can see the full MySQL 5.6.7 changelog, or read about the major changes in MySQL 5.6.

Liveblog: Nagios and Another Layer of Indirection

John Sellens presents Nagios and Another Layer of Indirection at the Nagios World Conference. PDF slides are here.

“All problems in computer science can be solved by another level of indirection” – David Wheeler

Nagios Constitution: Separation of Core and State

There are separate components and interfaces, and they’re well-defined. This separation allows us to subvert how they’re supposed to be used and do whatever we want.

Nagios is well-documented – that’s one of its major strengths!

Where is there indirection in Nagios?

His favorite plugin is the negate plugin – it’s in the official Nagios plugins.

Remote checking adds another layer – between a local plugin and the nagios server – check_by_ssh, check_nrpe, check_snmp

Another layer of indirection – graphing. Apan was the original nagios grapher, but now there’s performance data and plugins and event brokers that will get the data.

How can we implement indirection?
“Unix Philosophy: Write programs that do one thing and do it well” – Doug McIlroy.

Add a plugin timeout with timeout (a Unix program). Write a wrapper around an existing plugin. You can do multi-stage checks, e.g. is at least one interface up? Is at least one web server up? You can use expect for interactions, or use webform posting tools.
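
A minimal sketch of the timeout-wrapper idea (the plugin path and the 25-second limit are made-up examples):

#!/bin/sh
# run an existing plugin under timeout(1) so a hung check cannot stall things
timeout 25 /usr/local/libexec/nagios/check_http "$@"
status=$?
# timeout(1) exits 124 when it kills the command; map that to UNKNOWN (3)
if [ "$status" -eq 124 ]; then
    echo "UNKNOWN: check_http timed out after 25s"
    exit 3
fi
exit "$status"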

You can make a “pervasive wrapper” – e.g. it changes *everything* – e.g. change the value of $USER1$ to be /usr/local/mywrapper /usr/local/libexec/nagios.

Custom object variables in the environment –
_web_regexp SomeRegExp
or
NAGIOS__HOSTWEB_REGEXP

Environment macros mean your plugins can know everything from the external and internal environment.

Try to avoid per-machine configs, and where you need them, keep them simple. For example, to add a new webhost or db machine, you want to make the changes as small as possible. Use hostgroups, services, etc. so all you need to do is add a host definition, including setting critical and warning variable values.

Smarter plugins
Make wrappers that change based on time of day – e.g. if it’s off-hours, report good, because it doesn’t matter. You could use a timeperiod for that, or you could hard-code it into the plugin. John made a plugin that says “what storage do you have, and is it good.” So you don’t have to make a separate /data check or whatever. Or, the plugin can assume that the first observed state is “normal” and complain if it changes.

Principle: derived thresholds – dynamically adjust thresholds based on time of day, based on trends and past experience, based on other current state/activity.

Let machines make the configurations, not you.

How else can we use these principles?
Define exec commands in snmpd.conf – check_snmpexec gets a table of everything that’s available, looks for the number of the description that matches (“mysql”), and uses that. Then you don’t need to know the number:

check_snmpexec host snmpcomm execname
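
On the monitored host, the snmpd.conf side is just an exec line (plugin path is made up); each exec line shows up in the SNMP exec table with its name, output, and exit status:

# snmpd.conf
exec mysql /usr/local/libexec/nagios/check_procs -C mysqld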

Get Service: check_winservices – uses files to know what should be running, and gets the running services, and complains if the files don’t match.

mbdivert – for a certain machine, route this way (e.g. hop through this ssh server)

…and more examples – check the slides. John’s a really smart guy and knows how to use and abuse Nagios, in the good ways!

Monitoring with “Non-Obvious” Nagios

John Sellens of SYONEX presents Non-Obvious Nagios.

PDF slides are available

John’s viewpoints and religion (you may not agree):
Monitoring is Exceptions, Trending, History
UNIX philosophy: Effective tools, not kitchen sink – Choose the best tool(s) for the job
SNMP is Your Friend – Use it whenever you can
Solve any problem in computer science with another level of indirection

Nagios has:
Discrete components
Well-defined interfaces
Great documentation
Nagios core just schedules and executes
It’s just an engine – that’s the simple genius of it

Well-defined and simple interfaces between all the parts – that’s part of the brilliance.

He talks about compiling Nagios, and goes on to say most folks use packages. Then John mentions monitoring status.dat, maybe with check_file_age, to make sure that Nagios is still running. And then he talks about the basics of configuration that anyone who uses Nagios is familiar with.

Plugins:
You can use the official Nagios Plugins, or just write one – it’s not a hard thing to do, and there are helpers like Nagios::Plugin in Perl. Then there’s more talk about plugin basics, how they work, and performance data.

Put secrets in an --extra-opts file (for plugins built with --enable-extra-opts) – they’re then hidden from ps on the server.
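
For example (file path and section name are made up):

# /usr/local/nagios/etc/plugins.ini – readable only by the nagios user
[check_mysql]
username=nagios
password=s3kr1t

# call the plugin without the password on the command line:
check_mysql --extra-opts=check_mysql@/usr/local/nagios/etc/plugins.ini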

If a plugin does something you want, but you want a little more, you can write a wrapper around an existing script.

He goes on to talk about plugin basics, there are lots available (Nagios Exchange – 3rd party plugins), explains the difference between local/remote checks, and active/passive checks. John doesn’t use NRPE, he relies on SNMP, but that’s his self-admitted religion.

Configuration:
Talks about a bit more of the basics, especially the files themselves, that variables are case-sensitive, etc.

check_result_reaper_frequency – how often (by default, every 10 seconds) Nagios reaps results from forked plugins to see if they’re finished. If you set this lower, there’s no harm [says John], and a child waiting to be reaped is part of the CPU load count on a unix machine – so it’s worth it, and things will finish quicker.

There’s also a variable for the frequency of checking the contents of a directory [missed which], again this can be lowered.

Set config_dir to point to where the object definitions are, and then leave it alone and put the files in there – as opposed to every time you add/remove a file, updating the cfg_file variable. The order and location of things in config files are irrelevant.

The last value of a directive/variable (same thing) in a definition is used, so you can define a directive more than once, but only the last one takes effect. You can also inherit from templates, including multi-template inheritance (comma-separated list). Templates are usually specified with register=0. You can also use a registered object as a template if you want, but it’s a bad idea – the point is that an object is either a template or an instantiation, and you can probably abstract a template out if you’re in that situation.
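
A minimal sketch of a template and a host that inherits from it:

define host {
    name                 generic-host   ; template name, not a real host
    register             0              ; template only
    max_check_attempts   3
    contact_groups       sysadmin
}

define host {
    use        generic-host
    host_name  db1
    address    db1.example.com
}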

“use” is evaluated first, no matter where in the directive it appears.

Nagios object template inheritance is a directed acyclic graph. Slide 25 is pretty useful, shows precedence and inheritance. Of course if you arrange templates such that the order doesn’t matter, then you’re golden no matter what.

You can append to an inherited value with +, for instance “directive +value”. There’s no subtraction, but with host groups and hosts you can use ! for exclusion, e.g. mysql_slaves,!db3

Custom directives start with an underscore (_) and are case-insensitive. For example, you might want one for your SNMP community string. In your generic host template, define _snmp_community, and if there’s a different one for a specific host, override it in the host definition. Then instead of hard-coding “-c password” in the check command, you do “-c $_HOSTSNMP_COMMUNITY$”.

Custom object variables: define a variable called “_load_warn” as 3 and “_load_crit” as 5, and have the default check use them. On big machines, change them in the host definition. Refer to them as macros or env variables, e.g.:

macro $_HOSTBLOOP$
environment variable NAGIOS__HOSTBLOOP
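
So a host override and a check command using those custom variables might look like this (names are made up):

define host {
    use         generic-host
    host_name   bigdb1
    address     bigdb1.example.com
    _LOAD_WARN  10   ; override the template default of 3
    _LOAD_CRIT  20
}

define command {
    command_name  check_custom_load
    command_line  $USER1$/check_load -w $_HOSTLOAD_WARN$ -c $_HOSTLOAD_CRIT$
}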

Implied Inheritance
Nagios will sometimes assume a value from a related object
Service objects will inherit from the associated host: contact_groups, notification_interval, notification_period
Hostescalations and serviceescalations will similarly inherit as well, except notification_period becomes escalation_period

Timeperiods are nice b/c you can exclude time periods (like holidays). There are lots of examples

Command/service definition vs. quoting – command-line quoting is sometimes challenging, so try to avoid special characters in your arguments. John’s advice: put quotes in the service definition only.

host notification options:
Don’t include “u” (unreachable) in the notification options for things that page you; that way, when a firewall goes down you get paged for the firewall itself, not for every unreachable host behind it. Also take off the “s” unless you want to be paged when scheduled downtime happens. John uses d,r.

Escalations are to a contactgroup.

Host definitions: host has a host_name and address (IP or FQDN). An IP address avoids alerts if DNS fails, but is harder to maintain. John recommends using FQDNs and having locally cached DNS on the Nagios box.

check_command is used to see if a host is up, usually it’s a ping of some kind, and it’s only checked if a service on the host fails. parents are a list of routers and gateways between Nagios and the servers. One tip is to define a “google” machine, and if Nagios can’t get to Google, all heck has broken loose b/c the whole network is gone.

hostgroups are useful for admin grouping of hosts.

Services:
In Nagios terms, a “service” could be an aspect of a running system, like disk capacity, or memory utilization. A “service” needn’t be offered externally to a device

Nagios tests services based on:
max_check_attempts — how many times to check a service before concluding it is actually down – e.g. maybe a mail queue peaks and that’s OK, but not for more than 30 min. Used in conjunction with the next 2.

normal_check_interval — how many “time units” to wait between regular service checks

retry_check_interval – how many “time units” to wait before checking a service that is not “OK”

contact_groups — who to complain to in case of a problem

You can use the notes, notes_url, and action_url directives to attach documentation to an object. Interesting – so if you have a frequent problem, you could put runbook URLs here.

dependencies & escalations are a good thing. With escalations, only add contactgroups, so that the oncall person doesn’t think it’s fixed when the manager starts getting paged.

avoid repetition:
General rule: anywhere you can list a host_name or hostgroup_name you can:
– use a comma-separated list of hosts/groups
– exclude with !
– use a wildcard host_name of “*”, meaning “all hosts”, to have it apply (or not) to multiple hosts

e.g. a service definition for the HTTP service might include hostgroup_name webservers to cause the service to be defined for all hosts in the webservers hostgroup.
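
Written out as a full object definition, that might look like this (template name assumed):

define service {
    use                  generic-service
    hostgroup_name       webservers
    service_description  HTTP
    check_command        check_http
}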

Generic Notification Author (genoa) – uses environment variables.

Most macros are added to the environment (e.g. NAGIOS_SERVICESTATE), including any custom variables.

“On-Demand Macros” allow you to refer to values from other config settings e.g.
$SERVICESTATEID:novellserver:DS Database$

e.g. db1 is down but db2 is still up.

“On-Demand Group Macros” get you a comma-separated list of all values in a host, service or contact group e.g.
$HOSTSTATEID:hg1:,$

Documentation contains “Theory of Operation” information – read it

Then he reviews soft vs. hard state basics and state types.

Stalking – if enabled, stalking logs any change in plugin output, even when there is no state change, for later review/analysis. It also turns off acknowledgement – you need to ack again, because it’s a new problem.
– e.g. RAID check was “1 disk dead” and is now “2 disks dead”

Volatile services – Something that resets to OK after each check. For things that need attention every time there is a problem. Notification and event handler happen once per failure – e.g. intrusion detection system, you want to know about every time.

If you define your topology (e.g. parents) it’s easier to find the root cause of stuff.

Dependencies: Host and service dependencies define operational requirements
e.g. Web server can’t work unless file server is working

execution_failure_criteria and notification_failure_criteria determine what we do if something we depend on fails – e.g. if the file server is down, don’t execute the web check and don’t notify me about the web problem.
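
A sketch of such a dependency (host and service names are made up):

define servicedependency {
    host_name                      filer1
    service_description            NFS
    dependent_host_name            web1
    dependent_service_description  HTTP
    execution_failure_criteria     c,u   ; skip the web check if NFS is critical/unknown
    notification_failure_criteria  c,u   ; and suppress notifications too
}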

Set inherits_parent to inherit dependencies in definitions

Cached checks
Can cache and re-use host or service check results

Used only for “On-Demand Checks” – e.g. Checking that host is up if a service fails, Checking topological reachability, For “predictive dependency checks”, Checking for “collateral damage”. Lower overhead, good results. You should enable and tune the cache.

Event Handlers:
In a perfect world, nothing would ever go wrong. In a semi-perfect world, problems would fix themselves. Event handlers are one of Nagios’ ways of moving closer to perfection.

An event handler is a command that is run in response to a state change. Canonical example: restart httpd if WWW service fails
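
A sketch of such a handler (paths are made up); Nagios passes the state, state type, and attempt number as arguments:

#!/bin/sh
# args: $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$
if [ "$1" = "CRITICAL" ] && [ "$2" = "HARD" ]; then
    # only act once the problem is confirmed as a hard state
    sudo /usr/sbin/service httpd restart
fi
exit 0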

But you could do things like open a trouble ticket on failure. You can have Global and specific host and service event handlers.

Complications: runs as the nagios user, on the nagios server.

External Commands

The Nagios server maintains a named pipe in the file system for accepting various commands from other processes. External commands are used most often by the web interface to record information and modify Nagios’ behaviour, but you can do lots of things from shell scripts (a sketch follows the list). Some of the available functionality:
– Add/delete host or service comments
– Schedule downtime, enable/disable notifications
– Reschedule host or service checks
– Submit passive service check results
– Restart or stop the Nagios server
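
For example, a shell sketch that schedules an hour of fixed downtime by writing to the command pipe (host name and pipe path are made up; the path varies by install):

now=$(date +%s)
echo "[$now] SCHEDULE_HOST_DOWNTIME;db1;$now;$((now+3600));1;0;3600;admin;patching" > /usr/local/nagios/var/rw/nagios.cmd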

Nagios can accept service check results from other programs. Since Nagios did not initiate the check, these are called “passive service checks”. Useful for embedded Nagios, asynchronous events, results from other, existing programs.

Nagios supports distributed monitoring of a certain style. Remote Nagios servers are essentially probe engines, submitting their results to a central server with passive service check results. The configuration on the remote servers is a subset of the central configuration. The central server is configured to notice if the passive results stop coming from the remote server.

The “central aggregation” approach is used by a number of more recent tools, such as Nagios Fusion, Thruk (slide 104), MNTOS (slide 104), and Multisite (slide 104).

Adaptive Monitoring – you can change things during runtime via external commands – e.g. schedule changes, or from an event handler. I wonder if this could be useful for oncall rotation.

Can change:
– Check commands and arguments
– Check interval, max attempts, timeperiod
– Event handler commands and arguments

Scaling Up
– Nagios can handle a lot without much effort
– As you get larger, advanced features are more important
– Use parent/child and host/service dependencies
– More efficient for humans and machines
– You will need to be more rigorous in your configuration
– Consistency, completeness, tuning
– Version 3 adds scalability and tuning features

Lots more cores now, so parallelism with plugins is more automatic.

oncall rotation in Nagios.

Tuning for Performance
– Lots of tunable configuration parameters
– Keep performance graphs of Nagios – MRTG, nagiostats, etc.
– Disable environment macros
- Use passive checks if you can (not John’s favorite idea)
– Avoid interpreted plugins, or offload checks
– Use Fast Startup Options – pre-cache configs – can save time on startup for new configs.

Tips/tricks:
– Use the parent/child topology
– Pre Nagios 3, host checks are not parallelized
– Host checks of a down segment can block all other checks
– Be consistent and use templates and groups
– Make it easy to add another similar host
– Make it easy to add a service to a group of hosts
– Smarter plugins make life (configuration) easier (e.g. default thresholds)

With multiple Nagios servers, use allow_empty_hostgroup_assignment=1 – you can define machine types as common hostgroups, even if you don’t have every type on every Nagios server. So if nagios1 doesn’t have a web server b/c it’s in an office, Nagios won’t refuse to start just because a hostgroup is empty.

Organize Your Config Files
– Put files in different directories
– One host per config file
– Generate configs from other information you already have
– Or use a script to generate from a list
– Take advantage of your naming convention
– Wildcards in host names based on FQDNs

Contacts: sysadmin, sysadmin-email, sysadmin-page for different levels of contacting.

check_allstorage plugin – made by John – no need to set limits in the nagios config. It gets the list of filesystems from the device, caches it in a /tmp dir, and estimates thresholds based on current usage. Nice. His Nagios checks are on his resources page.

Web server monitoring hack – Got a visible web server that can run PHP or CGI? Set up a “hidden” web page to run your check. Use Auth or allow/deny rules to limit access. Use check_http to look for a regular expression. Get remote status over port 80.
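
e.g. a sketch (URL, credentials, and regex are made up):

check_http -H www1.example.com -u /secret/dbcheck.php -a nagios:s3kr1t -r "DB OK"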

Hosts don’t actually have to exist – you can make up a service check like “mail” that will hit a generic MX record.

Negate check – e.g. page me if ssh ever gets turned on, on this machine.

This was a great talk, I can see why it’s usually a 3-hour tutorial, and I want to take it and learn about more ideas/examples.

Liveblog: Managing Your Heroes: The People Aspect of Monitoring

At Nagios World Conference North America, Alex Solomon of PagerDuty talked about Managing Your Heroes: The People Aspect of Monitoring.

First he goes over some acronyms:
SLA – service level agreement
MTTR – mean time to resolution – avg time it takes to fix a problem
– also mean time to response, a subset of MTTR – avg time it takes to respond.
MTBF – mean time between failures

How can we prevent outages?
– look for single points of failure (SPOFs) – engineer them to be redundant, if you can’t live with the downtime.
– a complex, monolithic system means that a failure in one part can fail another part. e.g. if your reporting tool is heavily loaded, that will affect your customers who want to buy your product, if your reporting tool and sales system go against the same machines.
– systems that change a lot are prone to more outages
– Outages WILL happen

Failure lifecycle:
– monitoring -> alert -> investigate -> fix -> root-cause analysis, and from here it could go back to any of the other stages. The line between fix and investigate is blurry, because you might try something in the course of investigation and it might actually fix it.

Why monitor everything?
Metrics, metrics, metrics. “If it’s easy, just monitor it. You can’t have too many metrics.”

Tools – for internal, behind the firewall – Nagios, Splunk. External – New Relic, Pingdom. Metrics – Graphite, Datadog.

Severities – based on business impact
sev1 – large scale business loss (critical)
sev2 – small to medium business loss (critical)
sev3 – no immediate business loss, customer may be impacted
sev4 – no immediate business loss, no customers impacted

Each severity level should have its own standard operating procedure (SOP)/SLA:
sev1 – major outage, all hands on deck. Notify the team via phone/sms, response time 5 min
sev2 – critical issue – notify the oncall person/team via phone/sms, response time 15 min
sev3/4 – non-critical issue – notify the on-call person via e-mail, response time next business day.

Severities can be downgraded/upgraded.

Alert *before* systems fail completely.

Oncall best practices – have a cellphone for phone calls and SMS. You might want to get a pager, but the paging system isn’t necessarily reliable either. A smart phone is better, because you can then handle the problem from the phone. 4G/3G internet – like a 4g hotspot, a USB modem, or tethering.

Set up your system so it pages multiple times until you respond. Escalate to different phones as needed. Get a vibrating bluetooth bracelet if you sleep with someone else and they don’t want to be disturbed.

Don’t send calls to the whole team if one person can handle it: you wake everyone up, and the issue could be ignored by everyone or duplicated by everyone.

Follow-the-sun paging, oncall schedules.

Measure on-call performance, measure MTTR, % of issues that were escalated, set up policies to encourage good performance. Managers should be in the on-call chain, and you can pay people extra to do on-call. Google pays per on-call shift, so people actually volunteer to be on-call.

NOCs reduce the MTTR drastically. Expensive (staffed 24×7 with multiple ppl). But you can train your NOC staff to fix a good percentage of the issues. As you scale, you might want a hybrid on-call approach – NOC handles some, teams directly handle others.

Automate fixes and/or add more fault tolerance.

You need the right tools (monitoring tools were mentioned before). Soft tools:
voice – conference bridge / skype / google hangout
chat – hipchat, campfire [we use IRC at mozilla]

Best practice: have an incident commander who provides leadership and is in charge of the situation. This prevents analysis paralysis, makes clear who should do what, etc.

Alerting on MySQL at NagiosWorld Slides and Links

In this presentation, I talk about the reason I developed a new Nagios check for MySQL while at PalominoDB, what the plugin does, how it works, how to extend it, and what its limitations are. There are some supplemental links that folks might want to know, so I am posting them here:

Liveblog: Ethan Galstad Nagios Keynote

Ethan Galstad, “father of Nagios”, keynoted NagiosWorld’s first day.

For those who are curious, Ethan pronounces Nagios as “NAH-ghee-os” – emphasis on the first syllable, the “g” is a “hard g” like in the word “good”, and a long “o” as in “ghost”.

Nagios Core – been around since 1999. Extendable FLOSS monitoring and alerting engine. What people used to just call “nagios”. Usually you have to be a technical person to install and configure.

Nagios XI – Commercial Nagios solution, built on top of Nagios core – has dashboards, web configuration, auto-discovery, advanced reports. Unlike Nagios core, it’s much easier to use.

Nagios Fusion – Centralized dashboard, can login to one place and see all Nagios XI and Nagios Core servers in your infrastructure. There’s also centralized authentication so you only have to login once and you’re logged into the underlying system. Each user can set up their own customized dashboards. [I think this is true for Nagios XI too]

Recent Developments
Community Project Development – 700 new projects, 3500 total projects, Nagios Exchange site.

6 new community sites – Brazil (started by Roderigo Faria), Chile, Panama, Dominican Republic, Spain (started by Poloma Galan), Italy.

Nagios Core 4 – coming soon.
Worker processes added so there’s built-in distributed monitoring, cleaning up internal data structures for faster performance – committed mostly by Andreas Ericsson.

Nagios XI 2012 – Bulk management, capacity planning, scheduled reporting, audit logging, advanced reporting, more wizards. [author’s note OMG CAPACITY PLANNING!]

Nagios Fusion 2012 – Centralized reports, new visualizations, fused graphs.

Nagios Mobile – Remote access for XI, Core. You can ack problems, schedule downtime, disable notifications, etc.

NSTI (nagios snmp trap interface) – Manage and handle SNMP traps.

NRDS (nagios remote data sender) – how do you monitor remote devices? Roaming devices? update remote configs? install remote plugins? Built on NRDP, uses its http communication. Once you’ve installed the agent, the agent sends monitoring data to the Nagios server (passive monitoring), and every time it checks in, it looks for new config files.

Nagios reflector – monitors remote/roaming devices even though Nagios is behind a firewall. Nagios reflector is a hosted service that can monitor remote apps even though they can’t directly connect to your nagios. Uses NRDP or NSCA. Close to being released, beta testing with customers.

NPush – how to push out an agent to the Windows servers you monitor? How do you modify remote configurations? NPush fixes this problem. Close to being released, beta testing with customers.

New commercial products being developed:
Nagios Incident Manager – lightweight ticketing system. Has a callback API. Integrates with Nagios Core and XI, for dealing with Nagios problems.

Nagios Network Analyzer – NetFlow and sFlow monitoring solution. Who’s using the network and for what? Usage graphs, custom queries, integrates with Nagios XI and Core.

Nagios Reactor – Event handler and workflow management. Event handlers are already in Nagios, they can perform an action based on an event (e.g. if IIS or Apache fails, restart it. Can even cycle power on boxes). Nagios Reactor will be able to execute an event chain – not just one event – of actions to take. Can use logic, so basically it’s like having an automated flow chart, with steps and conditions.

This stuff is all pretty neat! Looks like the Nagios folks are doing a great job. I am here because I am talking about Alerting MySQL, and I am actually excited to see the rest of the sessions (I have been using Nagios since 2001).
