John Sellens presents "Nagios and Another Layer of Indirection" at the Nagios World Conference. PDF slides are here.
"All problems in computer science can be solved by another level of indirection." (David Wheeler)
Nagios Constitution: Separation of Core and State
There are separate components and interfaces, and they're well-defined. This separation allows us to subvert how they're supposed to be used and do whatever we want.
Nagios is well-documented; that's one of the major strengths!
Where is there indirection in Nagios?
John's favorite plugin is negate, from the official Nagios plugins: it wraps another plugin and inverts its result.
Remote checking adds another layer between a local plugin and the Nagios server: check_by_ssh, check_nrpe, check_snmp.
Another layer of indirection: graphing. Apan was the original Nagios grapher, but now there's performance data, and plugins and event brokers that will get the data.
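To make the negate idea concrete, here is a minimal sketch of a wrapper that runs another plugin and swaps OK and CRITICAL. It is not the real negate plugin, which has more options; the function names `negate_status` and `run_negated` are made up for illustration. The exit codes 0 to 3 are the usual Nagios plugin states.

```python
import subprocess

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def negate_status(code):
    """Swap OK and CRITICAL; leave every other status unchanged."""
    if code == OK:
        return CRITICAL
    if code == CRITICAL:
        return OK
    return code


def run_negated(argv):
    """Run the wrapped plugin, echo its output, and return the inverted status.

    `argv` is the wrapped plugin's command line as a list (hypothetical input).
    """
    result = subprocess.run(argv, capture_output=True, text=True)
    print(result.stdout, end="")
    return negate_status(result.returncode)
```

A caller would pass the return value of `run_negated` to `sys.exit` so Nagios sees the inverted state.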
How can we implement indirection?
Unix Philosophy: "Write programs that do one thing and do it well." (Doug McIlroy)
Add a plugin timeout with timeout (a Unix program). Write a wrapper around an existing plugin. You can do multi-stage checks, e.g. is at least one interface up? Is at least one web server up? You can use expect for interactions, or use webform posting tools.
You can make a pervasive wrapper, i.e. one that changes *everything*, e.g. change the value of $USER1$ to be
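A rough sketch of the wrapper idea, combining a timeout with an "is at least one of these up?" multi-stage check. Instead of the `timeout` program from the talk, this uses the timeout built into Python's `subprocess.run`; the helper names and the choice to map a timeout to UNKNOWN are assumptions, not something from the slides.

```python
import subprocess

# Standard Nagios plugin exit codes used here.
OK, CRITICAL, UNKNOWN = 0, 2, 3


def run_plugin(argv, timeout_s=10):
    """Run one plugin; treat a timeout or a missing executable as UNKNOWN."""
    try:
        return subprocess.run(argv, capture_output=True, timeout=timeout_s).returncode
    except (subprocess.TimeoutExpired, OSError):
        return UNKNOWN


def at_least_one_ok(commands, timeout_s=10):
    """Return OK as soon as any wrapped check succeeds, otherwise CRITICAL.

    `commands` is a list of plugin command lines, one per web server or
    interface (hypothetical input).
    """
    for argv in commands:
        if run_plugin(argv, timeout_s) == OK:
            return OK
    return CRITICAL
```

The wrapped plugins stay untouched; the wrapper only decides how their individual results combine.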
Custom object variables in the environment
Environment macros mean your plugins can know everything from the external and internal environment.
Try to avoid per-machine configs, and where you do need them, keep them simple. E.g. to add a new webhost or db machine, you want to make the changes as small as possible. Use hostgroups, services, etc. so all you need to do is add a host definition, including setting critical and warning variable values.
Make wrappers that change based on time of day, e.g. if it's off-hours, report good, because it doesn't matter. You could use a timeperiod for that, or you could hard-code it into the plugin. John made a plugin that asks "what storage do you have, and is it good?", so you don't have to make a separate /data check or whatever. Or, the plugin can assume that the first observed state is “normal” and complain if it changes.
Principle: derived thresholds. Dynamically adjust thresholds based on time of day, on trends and past experience, and on other current state/activity.
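As a sketch of what that looks like from inside a plugin: with environment macros enabled, Nagios exports macros as `NAGIOS_`-prefixed environment variables, and custom host variables show up with a `_HOST` infix. The custom variables `_WARN` and `_CRIT` below are hypothetical, as are the function name and the default values.

```python
import os


def thresholds_from_env(env=None, default_warn=80.0, default_crit=90.0):
    """Read the host name and per-host thresholds from Nagios environment macros.

    Assumes the host definition sets custom variables _WARN and _CRIT
    (hypothetical names), exported as NAGIOS__HOSTWARN / NAGIOS__HOSTCRIT.
    """
    env = os.environ if env is None else env
    host = env.get("NAGIOS_HOSTNAME", "unknown-host")
    warn = float(env.get("NAGIOS__HOSTWARN", default_warn))
    crit = float(env.get("NAGIOS__HOSTCRIT", default_crit))
    return host, warn, crit
```

With this, the command definition needs no per-host arguments; the host definition alone carries the thresholds.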
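The "first observed state is normal" trick might look roughly like this. It is a sketch under assumptions: the state is anything JSON-serializable, it is remembered in a small state file, and the function name and file layout are invented for illustration.

```python
import json
import os

# Standard Nagios plugin exit codes used here.
OK, CRITICAL = 0, 2


def check_against_baseline(observed, state_path):
    """Record the first observation as the baseline; complain if later ones differ.

    `observed` must be JSON-serializable (e.g. a list of mount points).
    `state_path` is a hypothetical per-check state file.
    """
    if not os.path.exists(state_path):
        with open(state_path, "w") as fh:
            json.dump(observed, fh)
        return OK, "baseline recorded"
    with open(state_path) as fh:
        baseline = json.load(fh)
    if observed == baseline:
        return OK, "matches baseline"
    return CRITICAL, f"changed from baseline: {baseline!r} -> {observed!r}"
```

Deleting the state file resets what counts as "normal", so an intentional change does not require editing any config.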
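One possible way to derive a threshold from past experience is "mean plus k standard deviations" over recent samples. That particular formula is an illustrative choice, not something stated in the talk, and the function name and defaults are assumptions.

```python
import statistics


def derived_threshold(history, k=3.0, floor=0.0):
    """Derive a threshold from past samples: mean + k * sample standard deviation.

    Returns None when there is too little history to estimate a spread.
    `floor` keeps the threshold from dropping below a sane minimum.
    """
    if len(history) < 2:
        return None
    mean = statistics.fmean(history)
    spread = statistics.stdev(history)
    return max(floor, mean + k * spread)
```

Keeping one history per time-of-day bucket would fold in the time-of-day idea as well.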
Let machines make the configurations, not you.
How else can we use these principles?
Define exec commands in snmpd.conf. check_snmpexec gets a table of everything that's available, looks for the number of the description that matches (e.g. mysql), and uses that. Then you don't need to know the number:
check_snmpexec host snmpcomm execname
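The lookup-by-name step can be sketched as follows. This is not John's check_snmpexec: the SNMP fetch is left out, and the two dictionaries stand in for the exec table's name and result columns as already retrieved from the agent. All names here are hypothetical.

```python
UNKNOWN = 3  # Nagios exit code for "could not determine"


def find_exec_index(names_by_index, wanted):
    """Return the table index whose exec name matches `wanted`, or None."""
    for index, name in sorted(names_by_index.items()):
        if name == wanted:
            return index
    return None


def check_exec(names_by_index, results_by_index, wanted):
    """Look up an exec entry by name and return (status, message)."""
    index = find_exec_index(names_by_index, wanted)
    if index is None:
        return UNKNOWN, f"no exec entry named {wanted}"
    status = results_by_index.get(index, UNKNOWN)
    return status, f"{wanted} is entry {index}, result {status}"
```

The indirection is the point: entries can be reordered in snmpd.conf without touching the Nagios side.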
Get Service: check_winservices uses files to know what should be running, gets the running services, and complains if they don't match the files.
mbdivert: for a certain machine, route this way (e.g. hop through this ssh server).
...and more examples. Check the slides. John’s a really smart guy and knows how to use and abuse Nagios, in the good ways!
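The comparison at the heart of such a check could be as simple as a set difference. This sketch assumes the expected services have already been read from the files and the running ones fetched from the host; treating only missing services as a problem is an assumption, as is the function name.

```python
# Standard Nagios plugin exit codes used here.
OK, CRITICAL = 0, 2


def compare_services(expected, running):
    """Return (status, message): CRITICAL if any expected service is not running."""
    expected, running = set(expected), set(running)
    missing = sorted(expected - running)
    if missing:
        return CRITICAL, "not running: " + ", ".join(missing)
    return OK, f"all {len(expected)} expected services running"
```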