- Sun 14 May 2023
- server admin
- Gaige B. Paulsen
- #server admin, #python, #aws, #ansible, #automation, #nagios
Background
Generally speaking, I refresh most of my systems pretty regularly, spurred on by security concerns, general hygiene, a desire to make sure the automation doesn't age out, and certificate expiration.
Although I don't need to refresh systems due to certificate expiration, it has historically been the easiest indicator of systems that are getting a little long in the tooth.
Working on some systems this weekend, I noticed some out-of-date copies of PostgreSQL... really out of date... like close to a year old. This is what sent me off on this weekend's adventure.
What do you mean by refresh and why?
Given our penchant for building everything using Ansible, when I say I'm refreshing a system, that means the old VM gets taken down and a new one is built to then-current specifications as a replacement.
Rob and I have nurtured this workflow for years (ever since moving to Ansible for automation). In all cases, I build staging environments before production, and in most cases there are some reasonable automated tests for that process.
As to why? The answer is mostly one of convenience, although there are security arguments as well, both getting the latest versions of libraries that may contain vulnerabilities and dislodging anything bad that may be sitting on the virtual machines.
Monitoring the fleet age
Based on the recent discovery of some aging systems, I figured that I should find a way to add this process to our monitoring system, the venerable Nagios.
This didn't need to be particularly complex, but I needed the nagios server to reach out to the SmartOS Global Zones in order to get information about the running VMs. Historically, we've done this with captive SSH, using dedicated keys and lines in ~/.ssh/authorized_keys which take advantage of the command= option in order to run a program, potentially with information from the incoming SSH connection. Results are sent as text, but preferably encoded in JSON or similar.
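To make the captive SSH idea concrete, here is a minimal sketch of what the program named in a command= entry might look like on the receiving side. It relies on the fact that OpenSSH places whatever the client asked to run in the SSH_ORIGINAL_COMMAND environment variable when a forced command is in effect. The name pattern, the JSON shape, and the function names are all assumptions made for illustration, not the handler we actually run.

```python
import json
import os
import re

# Hypothetical rule for acceptable VM names; adjust to local naming.
VM_NAME = re.compile(r"[A-Za-z0-9][A-Za-z0-9._-]{0,63}")


def requested_vm(environ):
    """Return the VM name the caller asked for, or raise ValueError."""
    # OpenSSH sets this to the client's requested command when command= is used.
    name = environ.get("SSH_ORIGINAL_COMMAND", "").strip()
    if not VM_NAME.fullmatch(name):
        raise ValueError("invalid or missing VM name")
    return name


def main():
    """Answer one request with a single line of JSON; return an exit status."""
    try:
        name = requested_vm(os.environ)
    except ValueError as exc:
        print(json.dumps({"error": str(exc)}))
        return 1
    # A real handler would look the VM up here and report its creation time;
    # this sketch only echoes the validated request.
    print(json.dumps({"vm": name}))
    return 0
```

Rejecting anything that isn't a plain name keeps the dedicated key from being useful for anything other than the one query it was created for.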
A new Python framework for SSH requests
Most of our previous commands piggy-backed on the check_by_ssh checker, which is a standard nagios plugin. However, that command assumes that we put all of the intelligence at the other end of the line (on the recipient) and basically run the checks there. That could be done, but the need to do date math made coming up with an appropriate one-liner a bit ridiculous, so I decided to go with python.
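The date math in question is simple once it's in Python. The following is a sketch rather than the code we run: it takes a creation timestamp in ISO 8601 form, compares the age against two thresholds, and maps the result onto the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The function name and the 90 and 180 day defaults are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3  # standard Nagios plugin exit codes


def check_age(created_iso, now, warn=timedelta(days=90), crit=timedelta(days=180)):
    """Return (exit_code, message) for a VM created at created_iso.

    `now` must be a timezone-aware datetime. The default thresholds are
    placeholders, not the values used in production.
    """
    try:
        # Older Pythons reject a trailing "Z", so spell it as an offset.
        created = datetime.fromisoformat(created_iso.replace("Z", "+00:00"))
    except (AttributeError, ValueError):
        return UNKNOWN, f"UNKNOWN - cannot parse creation time {created_iso!r}"
    if created.tzinfo is None:
        created = created.replace(tzinfo=timezone.utc)  # assume UTC if unlabelled
    age = now - created
    if age >= crit:
        return CRITICAL, f"CRITICAL - VM is {age.days} days old"
    if age >= warn:
        return WARNING, f"WARNING - VM is {age.days} days old"
    return OK, f"OK - VM is {age.days} days old"
```

Passing `now` in as an argument, instead of calling `datetime.now()` inside the function, keeps the check easy to test.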
The python code was straightforward, and I used my existing poetry-based environment as a starting point, creating a couple of new commands which I'd install on the nagios servers: one for SmartOS and another for AWS. By making use of my existing poetry workflows, I got a number of things for free, including updating release notes, packaging releases in gitlab, etc.
Integrating with nagios
The nagios integration should have been simple, but for one small issue: I needed to parameterize the global zone system so that the command could take place there.
After some digging through the documentation for nagios, I found the section on custom macro variables, which is exactly what I needed in this case. I wanted to add a new variable _GZHOST to my existing host definitions which would indicate which host to query about the underlying VM. I already had this information in the PARENTS field, which I thought I could use as $HOSTPARENTS$, but it turns out that for some reason that's not exposed.
In this case, I was able to use $_HOSTGZHOST$ in my command definition in commands.cfg, resulting in:
define command {
command_name check_smartos_vm_age
command_line /opt/local/bin/ct-smartos-vm -H $_HOSTGZHOST$ $ARG2$ -i $USER5$/smartos-age-check-key $HOSTNAME$
}
With:
- $_HOSTGZHOST$ having the Global Zone host
- $ARG2$ being a placeholder for optional parameters (such as overriding the timelines)
- $USER5$ pointing to our directory for storing ssh keys
- $HOSTNAME$ the name of the VM to check
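For readers wondering what the other end of that command line might look like, here is a rough sketch of an argument parser that accepts the same shape of invocation and turns it into an ssh call. Only -H, -i, and the trailing VM name come from the command definition above; the long option names, the --warn-days and --crit-days overrides, and the function names are hypothetical.

```python
import argparse


def build_parser():
    """Parser for an invocation shaped like the command_line above."""
    parser = argparse.ArgumentParser(prog="ct-smartos-vm")
    parser.add_argument("-H", "--host", required=True, help="Global Zone to query")
    parser.add_argument("-i", "--identity", required=True, help="SSH private key file")
    # Hypothetical threshold overrides, the sort of thing $ARG2$ could carry.
    parser.add_argument("--warn-days", type=int, default=90)
    parser.add_argument("--crit-days", type=int, default=180)
    parser.add_argument("vm", help="name of the VM to check")
    return parser


def ssh_command(args):
    """Build the ssh argv that asks the Global Zone about one VM."""
    return [
        "ssh",
        "-i", args.identity,
        "-o", "BatchMode=yes",  # never prompt for input; fail instead
        args.host,
        args.vm,
    ]
```

Because the optional arguments all have defaults, an empty $ARG2$ simply leaves the thresholds at their default values.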
Results
In the end, I found a few more systems that were out of date than I was expecting, including one I could have sworn I'd refreshed earlier this week. So, I'm pretty happy with the system.