No fault in NEC's fault tolerance
Redundant design lets this server sidestep cluster problems, and its redundancy is transparent to applications
- By John Breeden II, Carlos A. Soto
- Jun 27, 2002
Many agencies have experimented with server clusters to provide reliable uptime for their users. But clustering has its drawbacks.
For one thing, it's complicated to set up and maintain a cluster with anything beyond the most basic functionality. And a cluster is not fully redundant because a failing server must fully time out before users start to see any backup data from a second source. Cluster hardware and software costs are high, too.
The NEC Express 5800/ft server eliminates these clustering drawbacks by providing redundant, plug-and-play fault tolerance. Actually, it's dual plug-and-play, because the 5800/ft has two power supplies and therefore two plugs, although it can run with only one power supply active.
The Microsoft Windows 2000 operating system does not recognize the server's redundancy. To Win 2000, it seems to be a normal Pentium III server. The GCN Lab's Caw Networks Avalanche test suite put the 5800/ft on a par with nonredundant servers that have the same specifications.
The server's dual nature shows up in its double width: basically, two systems running in one box. It has two CPU and memory modules, two PCI bus modules, two network interface cards, two hard drives and two power supplies, but only one operating system. That means programs need no special coding to be 'cluster-aware' as they do for most clustering servers. If an application runs under Windows 2000 Server, it will run on the 5800/ft.
To test the hardware portion of the server, we connected it to streaming video over the lab network. Then we opened the case and exposed the redundant drive modules. To simulate hardware failure, we removed them one at a time. Each module was shaped like a desk drawer, so removal was easy.
Next we took out a PCI bus module and a CPU module. Then we unplugged one of the two power supplies. In each case, the streaming video did not even flicker. The 5800/ft continued normal operation even when half its guts were spread across the lab floor. Further testing with Caw Networks' Avalanche found no performance degradation.
Putting back the modules was just as easy as taking them out. We didn't need to prep the system at all. As each module slid back inside, it resumed lockstep with its mirrored, and functioning, components.
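The pull-and-replace test can be pictured with a toy model. The Python sketch below is only an illustration and is not NEC's software: the class name, module names and methods are all hypothetical, and it assumes the server stays up as long as one unit of each paired module type is still in place.

```python
from dataclasses import dataclass, field

# Hypothetical module names, based on the paired parts listed in the review.
MODULE_TYPES = ("cpu_memory", "pci_bus", "nic", "disk", "power")


@dataclass
class FaultTolerantServer:
    # Every module type starts with two units installed.
    units: dict = field(default_factory=lambda: {m: 2 for m in MODULE_TYPES})

    def remove(self, module: str) -> None:
        """Pull one unit of the given module type out of the chassis."""
        if self.units[module] == 0:
            raise ValueError(f"no {module} unit left to remove")
        self.units[module] -= 1

    def insert(self, module: str) -> None:
        """Slide one unit of the given module type back in."""
        if self.units[module] == 2:
            raise ValueError(f"both {module} slots are already filled")
        self.units[module] += 1

    def is_running(self) -> bool:
        # Assumption: one surviving unit per pair keeps the server up.
        return all(count >= 1 for count in self.units.values())

    def is_fully_redundant(self) -> bool:
        return all(count == 2 for count in self.units.values())


server = FaultTolerantServer()
pulled = ("disk", "pci_bus", "cpu_memory", "power")
for module in pulled:
    server.remove(module)
print(server.is_running(), server.is_fully_redundant())   # True False

for module in pulled:
    server.insert(module)
print(server.is_running(), server.is_fully_redundant())   # True True
```

In this model, only pulling both units of the same type would take the server down, which is the case the lab test did not try.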
The only time the video paused was when we reinserted the CPU and memory modules. The transfer rate between them is about 83M per second, so it took about three seconds to dump the working memory into the newly inserted memory on our 256M test system. Afterward, the system was completely redundant again.
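The three-second pause follows from simple division. The short Python check below assumes that "M" means megabytes and that the whole 256M of working memory is copied at the quoted rate; the function name is made up for this example.

```python
def resync_seconds(memory_mb: float, rate_mb_per_s: float) -> float:
    """Estimate how long it takes to copy working memory to a reinserted module."""
    if rate_mb_per_s <= 0:
        raise ValueError("transfer rate must be positive")
    return memory_mb / rate_mb_per_s


# Figures from the review: 256M of memory, about 83M per second.
pause = resync_seconds(256, 83)
print(f"{pause:.1f} seconds")   # 3.1 seconds
```

By the same arithmetic, a system with more memory would pause proportionally longer at this transfer rate.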
We could manage the modules locally from an LED status interface on the front of the server. The LED flashed an alert if anything, such as the often-overlooked power supply module, experienced a problem.
But we could also manage the server remotely from a Web interface.
NEC's ESMPRO, a Simple Network Management Protocol agent, lets a remote user administer the server from another machine. NEC's Management Workstation Application complemented the ESMPRO agent and let us turn the server on and off remotely.
The agent machine had to run the Windows NT File System and could not use a 16- or 32-bit File Allocation Table. That meant the agent machine needed either Win 2000 or NT 4.0 with Service Pack 6. Windows XP could not run the agent manager software.
One software problem was that the SNMP public group by default linked to the NvAdmin group rather than to the administrator's account. That made some extra work for the administrator.
Another problem was that setting up and synchronizing the agent machine was fairly difficult. But once we got the services up and running, remote management worked smoothly.