At Nagios World Conference North America, Alex Solomon of PagerDuty talked about "Managing Your Heroes: The People Aspect of Monitoring."
First he goes over some acronyms:
SLA: service level agreement.
MTTR: mean time to resolution, the average time it takes to fix a problem; also mean time to response, a subset of MTTR, the average time it takes to respond. (A quick calculation sketch follows this list.)
MTBF: mean time between failures.
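To make the definitions concrete, here is a minimal sketch (my own, not from the talk) that computes mean time to respond, mean time to resolution, and MTBF from a couple of invented incident timestamps:

    from datetime import datetime, timedelta

    # Invented incident records: (detected, responded, resolved) timestamps.
    incidents = [
        (datetime(2013, 10, 1, 2, 0), datetime(2013, 10, 1, 2, 10), datetime(2013, 10, 1, 2, 45)),
        (datetime(2013, 10, 5, 14, 0), datetime(2013, 10, 5, 14, 5), datetime(2013, 10, 5, 14, 20)),
    ]

    def avg(deltas):
        return sum(deltas, timedelta()) / len(deltas)

    # Mean time to response: detection -> first response.
    mttr_response = avg([responded - detected for detected, responded, _ in incidents])
    # Mean time to resolution: detection -> fix.
    mttr_resolution = avg([resolved - detected for detected, _, resolved in incidents])
    # MTBF: average gap between the starts of consecutive incidents.
    starts = sorted(detected for detected, _, _ in incidents)
    mtbf = avg([later - earlier for earlier, later in zip(starts, starts[1:])])

    print(mttr_response, mttr_resolution, mtbf)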
How can we prevent outages?
Look for single points of failure (SPOFs) and engineer them to be redundant if you can't live with the downtime (see the failover sketch after this list).
A complex, monolithic system means that a failure in one part can take down another part. For example, if your reporting tool and your sales system hit the same machines, a heavily loaded reporting tool will affect the customers who want to buy your product.
Systems that change a lot are prone to more outages.
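As a toy illustration of removing a SPOF (mine, not the speaker's), a client can fail over between redundant replicas instead of depending on a single backend; the hostnames here are made up:

    import urllib.request

    # Hypothetical redundant endpoints for the same service.
    REPLICAS = ["http://reports-1.example.com/health",
                "http://reports-2.example.com/health"]

    def fetch_with_failover(urls=REPLICAS, timeout=2):
        last_error = None
        for url in urls:
            try:
                with urllib.request.urlopen(url, timeout=timeout) as resp:
                    return resp.read()          # first replica that answers wins
            except OSError as err:              # timeout, connection refused, ...
                last_error = err                # try the next replica
        raise RuntimeError("all replicas failed") from last_error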
Outages WILL happen
Failure lifecycle:
monitoring -> alert -> investigate -> fix -> root-cause analysis, and from any stage it can go back to an earlier one. The line between investigate and fix is blurry, because you might try something in the course of investigating and it might actually fix the problem.
Why monitor everything?
Metrics, metrics, metrics. If it's easy, just monitor it. You can't have too many metrics.
Tools: internal, behind the firewall: Nagios, Splunk. External: New Relic, Pingdom. Metrics: Graphite, Datadog.
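For example, Graphite's carbon daemon accepts metrics over a simple plaintext protocol (one "name value timestamp" line per metric, usually on TCP port 2003). A minimal sketch, with placeholder host and metric names:

    import socket
    import time

    def send_metric(name, value, host="graphite.example.com", port=2003):
        # Graphite plaintext protocol: "metric.path value unix_timestamp\n"
        line = "%s %f %d\n" % (name, value, int(time.time()))
        with socket.create_connection((host, port), timeout=5) as sock:
            sock.sendall(line.encode("ascii"))

    send_metric("shop.checkout.latency_ms", 123.4)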
Severities are based on business impact:
sev1: large-scale business loss (critical)
sev2: small to medium business loss (critical)
sev3: no immediate business loss, but customers may be impacted
sev4: no immediate business loss, no customers impacted
Each severity level should have its own standard operating procedure (SOP)/SLA (the sketch after this list encodes these as data):
sev1: major outage, all hands on deck; notify the team via phone/SMS; response time 5 minutes.
sev2: critical issue; notify the on-call person/team via phone/SMS; response time 15 minutes.
sev3/4: non-critical issue; notify the on-call person via e-mail; response time next business day.
Severities can be downgraded/upgraded.
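A sketch (my own, with hypothetical field names) of encoding the per-severity SOP above as data, so alert-routing code can look up how to notify and how fast a response is expected:

    SOP = {
        "sev1": {"notify": "phone/SMS to whole team", "response": "5 min"},
        "sev2": {"notify": "phone/SMS to on-call",    "response": "15 min"},
        "sev3": {"notify": "email to on-call",        "response": "next business day"},
        "sev4": {"notify": "email to on-call",        "response": "next business day"},
    }

    def route(severity):
        policy = SOP[severity]
        print("notify via %s, response expected within %s"
              % (policy["notify"], policy["response"]))

    route("sev2")   # notify via phone/SMS to on-call, response expected within 15 min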
Alert *before* systems fail completely.
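For instance, a Nagios-style check can go WARNING well before it goes CRITICAL, so you get paged while there is still headroom. A hedged sketch (thresholds are examples only; Nagios plugins signal state by exit code: 0 = OK, 1 = WARNING, 2 = CRITICAL):

    #!/usr/bin/env python3
    import shutil
    import sys

    WARN_FREE_PCT = 20.0   # warn while there is still time to react
    CRIT_FREE_PCT = 5.0    # critical: nearly out of space

    usage = shutil.disk_usage("/")
    free_pct = 100.0 * usage.free / usage.total

    if free_pct < CRIT_FREE_PCT:
        print("DISK CRITICAL - %.1f%% free" % free_pct)
        sys.exit(2)
    elif free_pct < WARN_FREE_PCT:
        print("DISK WARNING - %.1f%% free" % free_pct)
        sys.exit(1)
    print("DISK OK - %.1f%% free" % free_pct)
    sys.exit(0)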
On-call best practices: have a cellphone for phone calls and SMS. You might want to get a pager, but the paging network isn't necessarily reliable either. A smartphone is better, because you can then handle the problem from the phone. Have 3G/4G internet available, like a 4G hotspot, a USB modem, or tethering.
Set up your system so it pages multiple times until you respond, and escalate to different phones as needed (a toy escalation loop is sketched below). Get a vibrating Bluetooth bracelet if you sleep with someone else and they don't want to be disturbed.
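A toy sketch of that behaviour (my own, not PagerDuty's implementation; send_page() and is_acknowledged() are stand-in stubs):

    import time

    ESCALATION_CHAIN = ["primary-oncall", "secondary-oncall", "manager"]

    def send_page(responder, incident_id):
        print("paging %s about incident %s" % (responder, incident_id))

    def is_acknowledged(incident_id):
        return False   # stub: a real system would ask the alerting service

    def page_until_ack(incident_id, retries_per_level=3, wait_seconds=300):
        for responder in ESCALATION_CHAIN:
            for _ in range(retries_per_level):
                send_page(responder, incident_id)
                time.sleep(wait_seconds)
                if is_acknowledged(incident_id):
                    return responder          # someone took the incident
        raise RuntimeError("incident %s was never acknowledged" % incident_id)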
Don't send calls to the whole team if one person can handle it: you wake everyone up, and the issue could be ignored by everyone or worked on in duplicate.
Follow-the-sun paging, on-call schedules.
Measure on-call performance: MTTR, the percentage of issues that were escalated, and so on, and set up policies to encourage good performance. Managers should be in the on-call chain, and you can pay people extra to do on-call; Google pays per on-call shift, so people actually volunteer to be on-call.
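A tiny sketch of the "% of issues escalated" measurement (the incident records are invented for illustration):

    incidents = [
        {"id": 1, "escalated": False},
        {"id": 2, "escalated": True},
        {"id": 3, "escalated": False},
        {"id": 4, "escalated": False},
    ]

    escalated = sum(1 for i in incidents if i["escalated"])
    print("escalated: %d/%d (%.0f%%)"
          % (escalated, len(incidents), 100.0 * escalated / len(incidents)))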
NOCs reduce MTTR drastically, but they are expensive (staffed 24/7 with multiple people). You can train your NOC staff to fix a good percentage of the issues. As you scale, you might want a hybrid on-call approach: the NOC handles some issues, the teams directly handle others.
Automate fixes and/or add more fault tolerance.
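One common way to automate a fix is a Nagios event handler that tries a restart when a check goes CRITICAL and stays there. A hedged sketch, assuming the handler is passed the usual $SERVICESTATE$ and $SERVICESTATETYPE$ macros and that restarting httpd is the right remediation for this particular service:

    #!/usr/bin/env python3
    import subprocess
    import sys

    # Arguments as conventionally passed from the Nagios command definition:
    # $SERVICESTATE$ $SERVICESTATETYPE$
    state, state_type = sys.argv[1], sys.argv[2]

    if state == "CRITICAL" and state_type == "HARD":
        # Attempt the automated fix; humans still get paged if it stays down.
        subprocess.call(["service", "httpd", "restart"])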
You need the right tools (monitoring tools were mentioned before). Soft tools:
voice: conference bridge / Skype / Google Hangouts
chat: HipChat, Campfire [we use IRC at Mozilla]
Best practice: have an incident commander who provides leadership and is in charge of the situation; this prevents analysis paralysis, since they instruct who should do what, etc.