One problem with monitoring servers over a WAN is that the WAN itself often goes down during the night. If the WAN is down hard, that is a separate issue. But a host failing to respond to a ping does not mean that the WAN is really down, or that the server is down.
What you really want to know is *are we down*. If you administer the servers and have set yourself up to be paged at night when a server is down, you don’t want to be paged for a WAN blip. Often, a separate group gets paged for WAN issues anyway. We have developed a Perl script that deals with these issues. It reads a file in “hosts” format and pings each host in order.
You can enable monitoring of a host by adding ##M after its entry in the file.
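As a rough illustration of the ##M convention, here is a minimal sketch in Python (not the article's Perl script). It assumes the usual hosts layout of an IP address followed by a name, with ##M trailing the entry. The function name `monitored_hosts` is hypothetical, and the sample addresses other than the one for srv-33 are made up.

```python
def monitored_hosts(hosts_text):
    """Return (ip, name) pairs for hosts-format lines tagged with ##M."""
    result = []
    for line in hosts_text.splitlines():
        entry, marker, _ = line.partition("##M")
        if not marker:
            continue  # no ##M tag: this host is not monitored
        fields = entry.split()
        if len(fields) < 2 or fields[0].startswith("#"):
            continue  # blank or commented-out entry
        result.append((fields[0], fields[1]))
    return result


# Sample data: only the srv-33 address appears in the article; the rest is invented.
sample = """\
127.0.0.1     localhost
10.50.100.59  srv-33   ##M
10.50.100.60  srv-44   ##M
10.50.100.61  srv-45
"""
print(monitored_hosts(sample))
```

Only the two tagged entries come back, so untagged hosts can stay in the file without being pinged.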
If a host fails, our script determines the IP address of that host's router and pings it to see whether the router responds. The script pages only after the host has failed 4 times in a row while the router passes.
Another issue: if a server is down and you already know it, you don’t want to be paged. Our Perl script also lets you schedule once-only, weekly, or monthly downtimes.
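The paging rule can be sketched in Python as follows (again, not the article's Perl code). Two things here are assumptions rather than facts from the article: `guess_router` takes the router to be the first address of the host's /24 subnet, which matches the 10.50.100.59 and 10.50.100.1 pair in the sample output but may not be how the real script finds it, and a failed router check resets the failure count. The names `guess_router` and `FailureTracker` are hypothetical, and the ping results are passed in as booleans so the sketch runs without a network.

```python
import ipaddress

FAILS_BEFORE_PAGE = 4  # from the article: four failures in a row


def guess_router(host_ip, prefix_len=24):
    """Assumption: the router is the first address in the host's subnet."""
    net = ipaddress.ip_network(f"{host_ip}/{prefix_len}", strict=False)
    return str(net.network_address + 1)


class FailureTracker:
    """Counts consecutive host failures seen while the router still answers."""

    def __init__(self):
        self.streak = {}

    def record(self, host, host_ok, router_ok):
        """Record one check; return True when a page should be sent."""
        if host_ok or not router_ok:
            # Host answered, or the router is unreachable too (likely a WAN
            # problem). Resetting the count here is an assumption.
            self.streak[host] = 0
            return False
        self.streak[host] = self.streak.get(host, 0) + 1
        return self.streak[host] >= FAILS_BEFORE_PAGE


tracker = FailureTracker()
results = [tracker.record("srv-33", host_ok=False, router_ok=True) for _ in range(4)]
print(guess_router("10.50.100.59"), results)
```

As written, the tracker keeps returning True on every further failure after the fourth; a real monitor would probably want to suppress repeat pages.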
Here is an example of the script running:
srv-1 : mondo : srv-33 :
*************** Ping Fail : srv-33 ***********
Ping failed... testing srv-33 : 10.50.100.59 with router 10.50.100.1
5/7/2003-5:18 10 pings to server srv-33, but router passed
5/7/2003-5:18
*************** Ping Fail : srv-33 ***********
Ping failed... testing srv-33 : 10.50.100.59 with router 10.50.100.1
5/7/2003-5:19 20 pings to server srv-33, but router passed
5/7/2003-5:19
*************** Ping Fail : srv-33 ***********
Ping failed... testing srv-33 : 10.50.100.59 with router 10.50.100.1
5/7/2003-5:19 30 pings to server srv-33, but router passed
5/7/2003-5:19
*************** Ping Fail : srv-33 ***********
Ping failed... testing srv-33 : 10.50.100.59 with router 10.50.100.1
5/7/2003-5:20 Server down -- Sending Page
After 4 rounds of failed ping tests with the router passing, we were paged. Now that we are awake, let’s add srv-33 to the schedule.txt file, so that we don’t get paged again:
xsrv-49,once 05/26/03 8:50:00 to 06/15/00 19:50:00
xsrv-3,weekly 07/15/03 05:00:00 to 07/15/00 06:00:00
xeverybodydown,once 06/21/03 20:00:00 to 06/22/00 07:00:00
xsrv-34,monthly 04/25/03 7:01:45 to 02/22/01 19:00:01
srv-33,once 05/07/03 3:50:00 to 05/07/03 19:50:00
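To make the schedule check concrete, here is a minimal Python sketch (not the article's Perl code). It assumes each entry sits on its own line in the form `host,kind start to end`, with dates as month/day/two-digit year. It handles only `once` entries, because the article does not say how weekly and monthly windows are evaluated. The names `parse_entry` and `is_scheduled_down` are hypothetical.

```python
from datetime import datetime

FMT = "%m/%d/%y %H:%M:%S"  # assumed date layout, e.g. 05/07/03 3:50:00


def parse_entry(line):
    """Split 'host,kind start to end' into host, kind, start, end."""
    host, rest = line.strip().split(",", 1)
    kind, window = rest.split(None, 1)
    start_text, end_text = window.split(" to ")
    return (host, kind,
            datetime.strptime(start_text.strip(), FMT),
            datetime.strptime(end_text.strip(), FMT))


def is_scheduled_down(host, now, entries):
    """True if a 'once' entry for this host covers the moment 'now'."""
    for line in entries:
        name, kind, start, end = parse_entry(line)
        if name == host and kind == "once" and start <= now <= end:
            return True
    return False


entries = ["srv-33,once 05/07/03 3:50:00 to 05/07/03 19:50:00"]
print(is_scheduled_down("srv-33", datetime(2003, 5, 7, 5, 37), entries))
print(is_scheduled_down("srv-33", datetime(2003, 5, 7, 20, 0), entries))
```

The first call falls inside the scheduled window and the second falls after it, so a monitor using this check would suppress the page only in the first case.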
Now, when the monitor script comes around to srv-33 again:
5/7/2003-5:37 30 pings to server srv-33, but router passed
5/7/2003-5:37
*************** Ping Fail : srv-33 ***********
Ping failed... testing srv-33 : 10.50.100.59 with router 10.50.100.1
5/7/2003-5:37 Server down -- scheduled
5/7/2003-5:38 srv-44 :
Don’t rely on this script for production servers unless you know exactly what you are doing and are sure that it fits your needs. There are other Server Monitoring Tools and Software that can now do all of this automatically for you, but if you want an old-school method, feel free to give it a try.
Do feel free to incorporate bits of the script as you need, or the whole script if you desire. Credit NetAdminTools, though, if you feel like it. 🙂 Please read our terms of use.
There are five parts to this article:
Introduction
Main Routine
Check/Log Routines
Adding Perl Mods
pf and rf routines