I write scripts to better manage the Unix servers we’re responsible for. Shell scripts, perl scripts, whatever it takes to get the information needed to stay on top of the environment. Be proactive.
Generally, scripts are written when a problem is discovered. Since we’re responsible for the support lab and production servers (not the development or engineering labs), and since only production servers are monitored, many of the scripts duplicate functions of the monitoring system. Unlike production, we don’t need to respond immediately to a failed device in the support lab, but we do need to respond eventually. Scripts are written to handle that.
The monitoring environment has its own issues as well. It’s a reactive configuration in general: you can configure it to provide warnings, but it can only warn or alert on issues it’s aware of. And, again, it isn’t available in the support lab.
I’ve been programming in one manner or another (BASIC, C, Perl, shell, PHP, etc.) since 1982 or so (Timex/Sinclair). One of the things I always tried to understand and correct was compile-time warnings and errors. Of course errors needed to be taken care of, but to me, warnings weren’t to be ignored either. As such, my server management scripts are more like code than the 6 or 7 lines of whatever commands are needed to accomplish the task: variables, comments, white space, indentation, and so on. The Inventory program I wrote (Solaris/Apache/MySQL/PHP; the new one is LAMP) looks like this:
Creating list of files in the Code Repository
    Blank Lines: 25687
    Comments: 10121
    Actual Code: 100775
    Total Lines: 136946
    Number of Scripts: 702
That’s roughly 136,000 lines, of which about 25% are blank lines or comments. The data-gathering scripts alone are up to 11,000 or so lines across 72 scripts, plus another 100 or so scripts that aren’t managed in the source code repository.
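The counting itself is nothing fancy. A minimal sketch, assuming the repository sits in a made-up /repo/scripts directory and that the scripts use # for comments:

    #!/bin/sh
    # Rough line-count breakdown for a script repository.
    # The path and the leading-# comment convention are assumptions.
    REPO=/repo/scripts

    total=0 blank=0 comment=0 scripts=0
    for f in "$REPO"/*; do
        [ -f "$f" ] || continue
        scripts=$((scripts + 1))
        total=$((total + $(wc -l < "$f")))
        blank=$((blank + $(grep -c '^[[:space:]]*$' "$f")))
        comment=$((comment + $(grep -c '^[[:space:]]*#' "$f")))
    done

    echo "Blank Lines: $blank"
    echo "Comments: $comment"
    echo "Actual Code: $((total - blank - comment))"
    echo "Total Lines: $total"
    echo "Number of Scripts: $scripts"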
The process generally consists of data captures on each server, which are pulled to a central server, where reports are written and made available. One of the more common methods is a comparison file: a verified file compared to the current file. Any difference means something’s changed and should be looked at. Since some of the data being gathered is fairly fluid, checking for every specific issue would be daunting, so a comparison is used instead. This works pretty well overall, but of course there’s setup (every server’s data file has to be reviewed and confirmed) and regular review of the files rather than just an alert.
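The comparison itself is about as simple as it sounds. A rough sketch of the per-server check, with the directory and file names made up for illustration:

    #!/bin/sh
    # Compare today's capture against the verified baseline for this server.
    # The /var/adm/capture location and the file naming are illustrative only.
    DATA=/var/adm/capture
    HOST=$(hostname)

    if ! diff "$DATA/$HOST.verified" "$DATA/$HOST.current" > "$DATA/$HOST.diff"; then
        # Any difference means something changed since the file was verified;
        # flag it for review rather than paging anyone.
        echo "$HOST: $(wc -l < "$DATA/$HOST.diff") changed lines, review needed"
    fi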
Another tool is the log review process. Logs are pulled from each server to a central location, then processed and filtered to remove the noise (the filter file is editable, of course), and the final result is concatenated into a single log file that can then be reviewed for potential issues. In most cases the log entries are inconsequential. Noise. As such they’re added to the filter file, which reduces the final log file size. This is valuable in situations like the lab, where monitoring isn’t installed, or where the team doesn’t want to get alerted (paged) but does want to be aware.
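A stripped-down version of that filtering step; the directory names and the filter file location here are assumptions:

    #!/bin/sh
    # Filter the pulled logs and concatenate what's left into one review file.
    LOGDIR=/var/log/pulled            # one log file per server (assumed layout)
    FILTER=/usr/local/etc/log.filter  # editable list of known-noise patterns
    REPORT=/var/tmp/daily_review.log

    : > "$REPORT"
    for log in "$LOGDIR"/*.log; do
        # Drop every line matching a noise pattern; keep the rest for review.
        grep -v -f "$FILTER" "$log" >> "$REPORT"
    done

    echo "Review file: $REPORT ($(wc -l < "$REPORT") lines)"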
But there’s a question. Are the scripts actually valuable? Is the time spent reviewing the daily output greater than the time fixing problems should they occur? These things start off as good intentions but over time become a monolith of scripts and data. At what point do you step back and think, “there should be a better way”?
The way I determined it was when I was moved to a new team. In the past I wrote the scripts to help, but if I’m the only one looking at them, are they really helping the team? In moving to the new team, I still have access to the tools but I’m hands off. I can see an error that happened 18 months ago that hasn’t been corrected. I found a failed drive in the support lab 6 months after it failed. I found a failed system fan that had died who knows how long ago.
There was even an attempt to use Nagios as a view into the environment, but there are so many issues that, again, working through them becomes overwhelming.
The newest process is to have a master script check quite a few things on individual servers and present the results to admins when they log in to work on other tasks. Reviewing that shows over 100,000 checks on 1,200 systems and about 23,000 things that need to be investigated and corrected.
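The presentation side is deliberately low-tech; think of a small profile snippet that shows the server’s outstanding findings at login. A sketch, assuming the master check writes its findings to a per-server file (the paths here are made up):

    # Hypothetical /etc/profile.d/chkstatus.sh: show outstanding findings at login.
    # The findings file is assumed to be written by the nightly master check.
    FINDINGS=/var/adm/chkstatus/findings

    if [ -s "$FINDINGS" ]; then
        echo "=== $(wc -l < "$FINDINGS") items need attention on $(hostname) ==="
        head -20 "$FINDINGS"
        echo "=== full list in $FINDINGS ==="
    fi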
But is the problem that the scripts aren’t well known enough? Did I fail to transition their use? Certainly I’ve passed the knowledge along over time in email notifications (how each failure was determined), and the scripts are documented internally as well as on the script wiki.
If a tree falls in the forest and no one is around, does it make a sound?
If a script reports an error and no one looks at the output, is there a problem?
The question then becomes, how do I transfer control to the team? I’ve never said, “don’t touch” and have always said, “make changes, feel free” but I suspect there’s still hesitation to make changes to the scripts I’ve created.
The second question is more about whether the scripts are useful. Just because I found a use for them doesn’t mean the signal-to-noise ratio justifies them, or that the time to review is worth it compared to the time to research and resolve.
Finally, if the scripts are useful but the resulting data is unwieldy, what’s a more useful way of presenting the information? The individual server check scripts seem like a better solution than a centralized master report with 23,000+ lines, but a review shows limited use of the at-login report.
Is it time for a meeting to review the scripts? Time to see whether they’re used, whether they’re valuable, whether they can be trimmed down? Or is there just so much work to manage that the scripts, while useful, can’t be addressed due to the reduction in staff (3 admins for 1,200 systems)?
Man, this one got my thoughts spinning.
You bring up a good question. So, scripts “find issues” and send Email. Unfortunately, Email usually means nothing. An incident ticket means something (why do I always want to capitalize incident?).
We have multiple tools feeding an event manager. If a tool thinks an event is “actionable”, it sends it to the event manager, but only the events the tool thinks are actionable (we don’t need every CPU value it polled, only when it’s above X for Y time). Our event manager de-duplicates the event and calls a system that does a lookup on the event’s keys to determine if it is ticket-worthy, sending back an assignment group if it is; the event manager then cuts an incident record for actionable events. If the event has a high enough severity, it cuts an incident regardless.
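To give a feel for the “above X for Y time” part, here’s a toy version of that send-side suppression; the threshold, the poll count, and logger as a stand-in for the event feed are all placeholders, not how our tools actually do it:

    #!/bin/sh
    # Only raise an event when the 1-minute load average stays above a
    # threshold for N polls in a row. Everything here is a placeholder;
    # the real tools decide what is "actionable" their own way.
    THRESHOLD=8
    REQUIRED=3
    STATE=/var/tmp/load_over_count

    over=$(uptime | awk -F'load average[s]*: ' -v t="$THRESHOLD" \
        '{ split($2, a, ","); if (a[1] + 0 > t + 0) print 1; else print 0 }')

    count=$(cat "$STATE" 2>/dev/null || echo 0)
    if [ "$over" -eq 1 ]; then
        count=$((count + 1))
    else
        count=0
    fi
    echo "$count" > "$STATE"

    if [ "$count" -ge "$REQUIRED" ]; then
        # stand-in for whatever actually feeds the event manager
        logger -t event "$(hostname) load high for $count consecutive polls"
    fi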
The glory here is that all these scripts and tools just “do their thing” and send suspect events up to an event manager for processing. The event manager decides whether to cut incident records to assign out for resolution. That is, centralized control of ticketing, and a configurable interface to determine how to handle lower-severity events.
With this, your scripts can just keep sending data. They can be ignored, or someone can be assigned to resolve an incident. The tool is configured by the team that manages that infrastructure. They can then run reports and determine that your scripts are “x% noise, so you should change them to do y”, or whatever else should be done.
But I get it, I really do. It’s hard to know where to draw that line.
And yes, I did bring up that dirty, four-letter word: ITIL.
Sending email was the old way. After I moved teams, the team asked for the emails to stop. I can’t blame them; a lot of things were being noted, and the diff/check output was mostly minor changes. I was able to filter some of it (the transmit and receive information for interfaces, for example), but some of it was either a bit harder or just got dropped when I moved on.
Now it’s a passive report. You have to go to the servers individually or run the chkstatus script to report on either a list of servers related to a project or all servers.
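The “all servers” run is basically just a loop over a host list. A rough sketch, where the host list path and the chkstatus location are assumptions:

    #!/bin/sh
    # Run the status report across a host list instead of logging in to each
    # box. The host list and the chkstatus path are assumptions.
    HOSTLIST=${1:-/usr/local/etc/hosts.all}

    while read -r host; do
        case "$host" in ""|"#"*) continue ;; esac   # skip blanks and comments
        echo "===== $host ====="
        # -n keeps ssh from eating the rest of the host list on stdin
        ssh -n -o BatchMode=yes "$host" /usr/local/bin/chkstatus
    done < "$HOSTLIST"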
My work with the validation script is pretty much just trying to catch all the problems on the current project I’m on: 100 servers in 2 lab and 2 production sites. Firewall issues and asymmetric routes are the biggest things I can’t fix myself. Since the script is automatically installed on all servers, you can then get the same report everywhere.
Not sure if I mentioned it, but I did have a process for creating tickets via email on a previous project; it was killed due to too many tickets and no one working them.
What sort of event manager are you using? I do have the 20,000 or so errors being logged into the Inventory. You can prioritize errors for all systems on one page, and on the error page you can reprioritize errors for specific servers depending on Service Class, plus take ownership. I could certainly use the prioritization system to create tickets, once that’s available again.
I’ll check out ITIL again. We do have a CMDB. Like 5 of them, as I recall 🙂 My Inventory is about the only really complete set of information, and corporate pulls information from it to populate their databases.
We are using BMC’s TrueSight. We are using their event manager and starting to replace older tools with Patrol (legacy name) for monitoring. We have about 50,000 Windows servers and 30,000 Linux servers in production, plus application performance management tools, 120,000+ network devices, and a gaggle of mainframe and other devices. We get a lot of events.
TrueSight dedups them all, and calls an external system that we wrote for ticketing and assignment groups. All critical and major events get ticketed regardless. From there, off to ServiceNow for an incident. All automated incidents are assigned somewhere; there is no team that “watches a queue and assigns tickets”. All sev1 and sev2 tickets send out a notification.
Sev1 and Sev2 incident notifications must be acknowledged. If there is no ack in 15 minutes, hit primary and secondary. If no ack, hit the manager. It just keeps escalating.
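Conceptually the escalation is just a loop up the contact chain; a toy sketch, where notify and is_acknowledged are placeholders for the ticketing tool’s own workflow:

    #!/bin/sh
    # Toy escalation loop for an unacknowledged sev1/sev2 notification.
    # notify and is_acknowledged are placeholders; the real logic lives in
    # the ticketing/notification tooling, not in a shell script.
    INCIDENT=$1
    WAIT=900   # 15 minutes between escalation steps

    for contact in on-call "primary secondary" manager; do
        notify "$contact" "$INCIDENT"            # placeholder notification
        sleep "$WAIT"
        if is_acknowledged "$INCIDENT"; then     # placeholder ack check
            exit 0
        fi
    done
    echo "$INCIDENT unacknowledged after the full chain" >&2
    exit 1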
Tickets and past-SLA (late) tickets are reported up through management chains to make sure everyone is getting things resolved.
All this seems a bit draconian, but it is the only way to “force” the support teams to get on incidents; once that process is accepted, it is usually followed willingly. Getting there can be painful…
Nobody does all of this “right” – but we try to enforce some standards, and deliver events to incidents efficiently. There are a number of open source and cheaper tools; it all has to do with how each one scales.
Yeah, not so many servers here (1,200 or so, which seems like a lot to me 🙂 ), but when you’re doing things manually, it can be overwhelming.
Humorously, one of my scripts just bubbled up due to disk space issues. I’ve spent the last half hour walking through how the script works, where the documentation is, and what needs to be done to correct the reported problems (this is the message file parser). Logs were running 20 megabytes or smaller across all the servers when I was working in the group; now it’s up to 2.5 gigabytes per day. The team is trying to figure out how to split things up to make it easier to work on issues (or just fix the problems / add the noise to the filters 🙂 ).