I write scripts to better manage the Unix servers we’re responsible for. Shell scripts, perl scripts, whatever it takes to get the information needed to stay on top of the environment. Be proactive.
Generally, scripts are written when a problem is discovered. Since we're responsible for the support lab and production servers (not the development or engineering labs), and since only the production servers are monitored, many of the scripts duplicate the functions of monitoring. Unlike production, we don't need to respond immediately to a failed device in the support lab, but we do need to respond eventually. Scripts get written to handle that.
The monitoring environment has its own unique issues as well. It's a reactive configuration in general: you can configure it to provide warnings, but it can only warn or alert on issues it's aware of. And there's still the problem that monitoring isn't available in the support lab.
I've been programming in one manner or another (BASIC, C, perl, shell, PHP, etc.) since 1982 or so (Timex/Sinclair). One of the things I always tried to understand and correct was compile-time warnings and errors. Errors obviously had to be taken care of, but to me warnings weren't to be ignored either. As a result, my server management scripts look more like real programs than like the 6 or 7 lines of whatever commands are needed to accomplish the task: variables, comments, white space, indentation, etc. The Inventory program I wrote (Solaris/Apache/MySQL/PHP; the new one is LAMP) reports this:
    Creating list of files in the Code Repository
    Blank Lines:        25687
    Comments:           10121
    Actual Code:       100775
    Total Lines:       136946
    Number of Scripts:    702
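For the curious, here's a minimal sketch of how such a tally might be produced. The repository path, file extensions, and counting rules are assumptions for illustration, not the actual Inventory code:

    #!/bin/sh
    # Tally blank lines, comments, and code across a script repository.
    # REPO is an illustrative path; assumes no whitespace in filenames.
    REPO=/export/code
    blank=0; comments=0; code=0; scripts=0

    for f in $(find "$REPO" -type f \( -name '*.sh' -o -name '*.pl' \)); do
        scripts=$((scripts + 1))
        b=$(grep -c '^[[:space:]]*$' "$f")    # blank lines
        c=$(grep -c '^[[:space:]]*#' "$f")    # comments (shebangs included)
        t=$(wc -l < "$f")
        blank=$((blank + b))
        comments=$((comments + c))
        code=$((code + t - b - c))
    done

    echo "Blank Lines: $blank"
    echo "Comments: $comments"
    echo "Actual Code: $code"
    echo "Total Lines: $((blank + comments + code))"
    echo "Number of Scripts: $scripts"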
That's roughly 136,000 lines of code, of which about 25% are blank lines or comments. The data gathering scripts alone are up to 11,000 or so lines across 72 scripts, plus another 100 or so scripts that aren't managed via the source code repository.
The process generally consists of data captures on each server, which are pulled to a central server; reports are then written and made available. One of the more common methods is a comparison file: a verified file compared against the current file. A difference greater than zero means something's changed and should be looked at. Since some of the data being gathered is fairly fluid, checking for every specific issue would be daunting, so a simple comparison is used instead. This works pretty well overall, but of course there's setup (every server's data file has to be reviewed and confirmed as the baseline), and the files need regular review rather than generating a simple alert.
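The core of it is nothing fancier than diff against a known-good copy. A minimal sketch of the idea, with hypothetical filenames:

    #!/bin/sh
    # Flag a current capture that differs from its verified baseline.
    # Filenames are hypothetical; every gathered data file gets the
    # same treatment against its own baseline.
    BASELINE=/data/verified/disk.list
    CURRENT=/data/current/disk.list

    if ! diff "$BASELINE" "$CURRENT" > /tmp/disk.diff; then
        echo "disk.list changed on $(hostname); review /tmp/disk.diff"
    fi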
Another tool is the log review process. Logs are pulled from each server to a central location, processed and filtered to remove the noise (the filter file is editable, of course), and the final result is concatenated into a single log file that can be reviewed for potential issues. In most cases the log entries are inconsequential, noise, so they're added to the filter file, which keeps the final log file small. This becomes valuable in situations like the lab, where monitoring isn't installed, or where the team doesn't want to be alerted (paged) but does want to be aware.
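A stripped-down sketch of that pipeline, assuming a Solaris-style /var/adm/messages and hypothetical paths for the filter file and server list:

    #!/bin/sh
    # Pull each server's log, drop known noise per the filter file,
    # and append what's left to one reviewable log. Paths and the
    # server list are illustrative; the filter file holds one grep
    # pattern per line.
    FILTER=/data/logs/filter.patterns
    OUT=/data/logs/daily-review.log

    : > "$OUT"
    for host in $(cat /data/servers.list); do
        scp -q "$host:/var/adm/messages" "/data/logs/raw/$host.log"
        grep -v -f "$FILTER" "/data/logs/raw/$host.log" |
            sed "s/^/$host: /" >> "$OUT"
    done

Anything that survives the filter is worth a look; anything that turns out to be noise gets a new pattern in the filter file, so the review gets shorter over time.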
But there’s a question. Are the scripts actually valuable? Is the time spent reviewing the daily output greater than the time fixing problems should they occur? These things start off as good intentions but over time become a monolith of scripts and data. At what point do you step back and think, “there should be a better way”?
What made me step back was being moved to a new team. In the past I wrote the scripts to help, but if I'm the only one looking at their output, is it really helping the team? On the new team I still have access to the tools, but I'm hands off. I can see an error that happened 18 months ago that still hasn't been corrected. I found a failed drive in the support lab 6 months after it died. I found a failed system fan that had been dead for who knows how long.
There was even an attempt to use Nagios as a view into the environment, but it surfaced so many issues that, once again, working through them became overwhelming.
The newest process has a master script check quite a few things on each individual server and present the results to admins when they log in to work on other tasks. Reviewing the output shows over 100,000 checks across 1,200 systems and about 23,000 findings that need to be investigated and corrected.
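The login-time presentation can be as simple as a snippet sourced from the shell profile. A sketch, where the results file and its location are assumptions:

    #!/bin/sh
    # Sourced at login (e.g. from /etc/profile) so an admin logging in
    # to work on something else sees this host's outstanding issues.
    # The results file and its format are assumptions for this sketch.
    ISSUES=/var/tmp/healthcheck/issues.txt

    if [ -s "$ISSUES" ]; then
        count=$(( $(wc -l < "$ISSUES") ))
        echo "*** $count unresolved issue(s) on $(hostname) ***"
        head -10 "$ISSUES"
        [ "$count" -gt 10 ] && echo "... full list in $ISSUES"
    fi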
But is the problem that the scripts aren't well known enough? Did I fail to transition their use to the team? I've certainly passed the knowledge along over time in email notifications (explaining how each failure was detected), and the scripts are documented both internally and on the script wiki.
If a tree falls in the forest and no one is around, does it make a sound?
If a script reports an error and no one looks at the output, is there a problem?
The question then becomes, how do I transfer control to the team? I’ve never said, “don’t touch” and have always said, “make changes, feel free” but I suspect there’s still hesitation to make changes to the scripts I’ve created.
The second question is whether the scripts are actually useful. Just because I found a use for them doesn't mean the signal-to-noise ratio justifies them, just as the time to review the output has to be weighed against the time to research and resolve problems as they surface.
Finally, if the scripts are useful but the resulting data is unwieldy, what's a more useful way of presenting the information? The individual server check scripts seem a better solution than a centralized master report of 23,000+ lines, but a review shows the login-time checks get little use.
Is it time for a meeting to review the scripts? Time to see whether they're used, whether they're valuable, and whether they can be trimmed down? Or is there simply so much to manage that the scripts, while useful, can't get the attention they need given the reduction in staff (3 admins for 1,200 systems)?