Problems with NUT
Wednesday, March 21st, 2007NUT (Network UPS Tools) has been disappointing me pretty much since the start (two years ago) due to strange shutdowns. Today I decided to do a full analysis.
First, let me describe the configuration: I have two Compaq Proliant 3000 servers, named triumph and bmw, both running Fedora Core 5 with kernel 2.6.20-1.2300.fc5smp. I have two APC Smart-UPS, named cchSU2200 and cchSU1250. They are an APC Smart-UPS 2200 and an APC Smart-UPS 1250, respectively.
Server triumph has two power supplies, each one connected to a different UPS. Server bmw has one power supply, connected to UPS cchSU2200. Both UPSs have their signalling cables connected to serial ports on server triumph. Both UPSs have fresh batteries and “official” APC serial cables.
I won’t go into the details of my upsmon configuration (but will if you ask). Suffice to say that each server requires a minimum of one power supply and monitors all other UPSs (there is actually a third UPS several hundred miles away that is also monitored).
Here are the results and narrative of my testing:
The syslog on triumph:
Mar 22 10:28:26 My Action: Start UPS services
Mar 22 10:28:27 triumph upsd[6734]: Connected to UPS [cchSU2200]: apcsmart-ttyS0
Mar 22 10:28:27 triumph upsd[6734]: Connected to UPS [cchSU1250]: apcsmart-ttyS1
Mar 22 10:28:29 triumph upsd[6735]: Startup successful
Mar 22 10:28:29 triumph upsmon[6738]: Startup successful
Mar 22 10:28:29 triumph upsd[6735]: Connection from 127.0.0.1
Mar 22 10:28:29 triumph upsd[6735]: Client monuser@127.0.0.1 logged into UPS [cchSU2200]
Mar 22 10:28:29 triumph upsd[6735]: Connection from 127.0.0.1
Mar 22 10:28:29 triumph upsd[6735]: Client monuser@127.0.0.1 logged into UPS [cchSU1250]
Mar 22 10:28:36 triumph upsd[6735]: Connection from 192.168.1.10
Mar 22 10:28:36 triumph upsd[6735]: Client monuser@192.168.1.10 logged into UPS [cchSU2200]
Mar 22 10:28:36 triumph upsd[6735]: Connection from 192.168.1.10
Mar 22 10:29:09 My Action: Disconnect power to cchSU1250
Mar 22 10:29:10 triumph upsmon[6739]: UPS cchSU1250@localhost on battery
Mar 22 10:29:11 My Action: Connect power to cchSU1250
Mar 22 10:30:30 triumph upsmon[6739]: UPS cchSU1250@localhost on line power
Mar 22 10:30:54 My Action: Disconnect power to cchSU2200
Mar 22 10:30:55 triumph upsmon[6739]: UPS cchSU2200@localhost on battery
Mar 22 10:30:56 My Action: Connect power to cchSU2200
Mar 22 10:31:03 triumph upsmon[6739]: UPS cchSU2200@localhost on line power
Mar 22 10:34:41 My Action: Disconnect power to cchSU1250 and cchSU2200
Mar 22 10:34:42 triumph upsmon[6739]: UPS cchSU2200@localhost on battery
Mar 22 10:34:42 triumph upsmon[6739]: UPS cchSU1250@localhost on battery
Mar 22 10:34:43 My Action: Connect power to cchSU1250 and cchSU2200
Mar 22 10:35:56 triumph upsd[6735]: Client monuser@127.0.0.1 set FSD on UPS [cchSU2200]
Mar 22 10:35:56 triumph upsd[6735]: Client monuser@127.0.0.1 set FSD on UPS [cchSU1250]
Mar 22 10:36:02 triumph upsd[6735]: Host 192.168.1.10 disconnected (read failure)
Mar 22 10:36:02 triumph upsd[6735]: Host 192.168.1.10 disconnected (read failure)
Mar 22 10:36:02 triumph upsmon[6739]: Executing automatic power-fail shutdown
Mar 22 10:36:02 triumph upsmon[6739]: Auto logout and shutdown proceeding
Mar 22 10:36:07 triumph upsd[6735]: Host 127.0.0.1 disconnected (read failure)
Mar 22 10:36:07 triumph upsd[6735]: Host 127.0.0.1 disconnected (read failure)
Mar 22 10:36:07 triumph logger: upsmon.says.shutdwn
The syslog on bmw:
Mar 22 10:28:35 My Action: Start UPS services
Mar 22 10:28:36 bmw upsmon[9927]: Startup successful
Mar 22 10:29:12 bmw upsmon[9928]: UPS cchSU1250@triumph.hapgoods.com on battery
Mar 22 10:30:32 bmw upsmon[9928]: Giving up on the master for UPS [cchSU1250@triumph.hapgoods.com]
Mar 22 10:30:33 bmw upsmon[9928]: UPS cchSU1250@triumph.hapgoods.com on line power
Mar 22 10:30:58 bmw upsmon[9928]: UPS cchSU2200@triumph.hapgoods.com on battery
Mar 22 10:31:04 bmw upsmon[9928]: UPS cchSU2200@triumph.hapgoods.com on line power
Mar 22 10:34:43 bmw upsmon[9928]: UPS cchSU2200@triumph.hapgoods.com on battery
Mar 22 10:34:43 bmw upsmon[9928]: UPS cchSU1250@triumph.hapgoods.com on battery
Mar 22 10:35:57 bmw upsmon[9928]: Giving up on the master for UPS [cchSU2200@triumph.hapgoods.com]
Mar 22 10:35:57 bmw upsmon[9928]: Giving up on the master for UPS [cchSU1250@triumph.hapgoods.com]
Mar 22 10:35:57 bmw upsmon[9928]: Executing automatic power-fail shutdown
Mar 22 10:35:57 bmw upsmon[9928]: Auto logout and shutdown proceeding
Mar 22 10:36:02 bmw logger: ups.says.shutdown
And now the analysis of the results…
- On both servers, everything starts normally. The upsmon on bmw is clearly seen to connect to triumph.
- At 10:29:09, I pulled the power on cchSU1250 and immediately plugged it back in. Immediately, the master upsmon (on triumph) notes the on-battery condition, but I was disappointed to see the amount of time it took to recognize the return to on-line status (over a minute). On bmw, it is worth noting that upsmon “gives up” on the master for cchSU1250 (triumph) right at the end of this wait. This may or may not be a contributing factor to the hideous behavior we are going to observe in a few moments…(ominous background music starts…)
- At 10:30:54, I pulled the power on cchSU2200 and immediately plugged it back in. The behavior on both triumph (the master) and bmw (a slave) seems perfectly normal, with timely observation of both the on-batter and on-line condition.
- (ominous background music builds…) At 10:34:41, I pulled the power to both cchSU2200 & cchSU1250 and immediately plugged it back in to both. The master recognize the on-battery conditions, and it propagates them to the slave nicely. But the master never recognizes the immediate return to an on-line condition. The inevitable results appear about a minute later when the master starts shutting down and the slave follows suit (ominous background music reaches a crescendo, and then… silence).
I should point out that this problem has happened numerous times over the past two years. A simple 2 second power glitch will provoke the shutdown of bmw and sometimes even triumph. During that time, I have kept up-to-date with nut through the Fedora yum distribution and upgraded from Fedora Core 4 to Fedora Core 5. Currently, I am running nut version 2.0.3, release 0.1.fc5. The APC Smart protocol driver claims to be version 1.99.7 with a command table version of 2.0.
Any ideas? I ran some more auxiliary experiments and found that while upsmon seems to take a long time to recognize a return to on-line in cchSU1250, upslog reports the OL condition immediately. So I think the communication between the UPS and upsdrvctl is working.