Lustre to battle corruption
Ever have this problem? You want to get a large piece of important data from the server to the storage system. All the systems involved seem to send the data correctly, and it is written to the disk without any error messages. Yet when the data is read later, it has been corrupted. How did that happen?
"It's staggering the risks that can happen along the entire data path," said Peter Bojanic, director of Lustre engineering at Sun Microsystems. He said that although there are plenty of studies about disk reliability, few have been done on the subject of data reliability on the network ' a problem that is increasing.
"In some large deployments, we experienced inexplicable data corruption to files on the disk," he said. In one setup, after a lengthy investigation, the technology staff found that the network cards were flipping bits.
The topic came up while we were talking about the future of the Lustre global file system. A year ago, Sun acquired Cluster File Systems Inc., the former maintainer of Lustre. We thought it would be a good time to catch up with the company to find out what's in store for the file system.
Bojanic said one of the most interesting opportunities Sun sees for Lustre is using it alongside Sun's 128-bit next-generation file system, ZFS, to provide advanced data integrity.
Advanced data integrity means the system as a whole, rather than its individual components, can guarantee that the data has stayed intact. "When you write data to the disk, you know it can be repaired if anything goes wrong," he said.
Lustre is widely used in the high-performance computing community because of its ability to pool massive numbers of storage disks into a single file system. A metadata server keeps track of all file names, directories, permissions and file layouts. A client seeking data consults the metadata server for the location and then retrieves the data directly from the appropriate storage server.
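The two-step read path described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Lustre's actual protocol or API: the class names `MetadataServer` and `StorageServer`, the function `client_read`, and the layout format (a list of server-name and object-ID pairs) are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class StorageServer:
    """Hypothetical stand-in for a storage server: object ID -> bytes."""
    objects: dict = field(default_factory=dict)

    def read(self, object_id):
        return self.objects[object_id]


@dataclass
class MetadataServer:
    """Hypothetical stand-in for the metadata server: path -> file layout.

    A layout here is a list of (server_name, object_id) pairs in file order.
    """
    layouts: dict = field(default_factory=dict)

    def lookup(self, path):
        return self.layouts[path]


def client_read(mds, servers, path):
    # Step 1: ask the metadata server where the file's data lives.
    layout = mds.lookup(path)
    # Step 2: fetch each piece directly from the storage server holding it.
    return b"".join(servers[name].read(obj_id) for name, obj_id in layout)


servers = {
    "oss1": StorageServer({"obj-a": b"hello "}),
    "oss2": StorageServer({"obj-b": b"world"}),
}
mds = MetadataServer({"/data/file": [("oss1", "obj-a"), ("oss2", "obj-b")]})
data = client_read(mds, servers, "/data/file")
```

The point of the split is that the metadata server answers only the small "where is it?" question, while the bulk data moves between the client and the storage servers.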
Six of the top 10 computers in the semiannual Top 500 list of the world's most powerful supercomputers use Lustre, Bojanic said. For instance, the Energy Department's Lawrence Livermore National Laboratory uses Lustre for its BlueGene/L system.
Pairing Lustre with ZFS makes sense. The most recent version of Lustre introduced what Bojanic called over-the-wire checksumming, a quick numerical tally computed at the beginning and again at the end of a data transmission. If the checksum at the end of the journey matches the one at the start, then the data hasn't changed en route.
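As a rough illustration of the compare-at-both-ends idea, here is a small Python sketch. The article does not say which checksum algorithm is involved, so the use of CRC-32 (via the standard library's `zlib.crc32`) is an assumption, and the helpers `send`, `flip_bit` and `verify` are hypothetical names invented for the example.

```python
import zlib


def send(payload: bytes):
    """Sender side: compute the checksum before the data leaves."""
    return payload, zlib.crc32(payload)


def flip_bit(payload: bytes, bit_index: int) -> bytes:
    """Simulate a faulty network card by flipping a single bit."""
    corrupted = bytearray(payload)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


def verify(payload: bytes, expected_checksum: int) -> bool:
    """Receiver side: recompute the checksum and compare with the sender's."""
    return zlib.crc32(payload) == expected_checksum


data, checksum = send(b"important simulation results")
clean_ok = verify(data, checksum)                  # data arrived unchanged
damaged_ok = verify(flip_bit(data, 13), checksum)  # one bit flipped in transit
```

Here `clean_ok` is true and `damaged_ok` is false, which is exactly the signal the receiver needs to ask for the data again instead of quietly writing bad bits to disk.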
One of the nifty features of ZFS is that it runs its own checksum to ensure that the disk controller doesn't change data as it is written. So it would seem like an obvious thing to put the two operations together, which is what Bojanic and his team are doing. Lustre can even pass its checksum to ZFS so the storage system doesn't waste time calculating a new, and presumably unchanged, checksum.
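A minimal sketch of that hand-off might look like the following. It is not how Lustre or ZFS are implemented: `ChecksummedStore` and `receive_and_store` are hypothetical names, CRC-32 is an assumed stand-in for whatever checksum the real systems use, and an in-memory dictionary plays the role of the disk.

```python
import zlib


class ChecksummedStore:
    """Hypothetical block store that keeps a checksum beside each block."""

    def __init__(self):
        self.blocks = {}

    def write(self, key, data, checksum=None):
        # Reuse the caller's checksum when one is supplied;
        # only compute a fresh one if the caller had none.
        if checksum is None:
            checksum = zlib.crc32(data)
        self.blocks[key] = (data, checksum)

    def read(self, key):
        data, checksum = self.blocks[key]
        # Recompute on read so corruption at rest is detected.
        if zlib.crc32(data) != checksum:
            raise IOError(f"checksum mismatch for block {key!r}")
        return data


def receive_and_store(store, key, data, wire_checksum):
    # Verify once on arrival, then pass the same checksum to the store.
    if zlib.crc32(data) != wire_checksum:
        raise IOError("data corrupted in transit")
    store.write(key, data, checksum=wire_checksum)


store = ChecksummedStore()
payload = b"block of file data"
receive_and_store(store, "blk0", payload, zlib.crc32(payload))
round_trip = store.read("blk0")
```

Because the checksum computed by the sender is the one stored next to the block, a single value covers the whole path from the client to the disk and back.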
Bojanic said the Lustre/ZFS combo has been working, in an early alpha state, on Linux, and the team members hope to have a production version soon for Linux and Solaris.
Other new Lustre features
While we had Bojanic on the phone, we asked what else was happening with the file system. Version 2.0 should be released next spring, he said.
That version will attack another growing concern associated with Lustre: using the file system over a wide-area network (WAN). To that end, Version 2.0 will contain full support for the Kerberos network authentication protocol. (Version 1.8 partially implements Kerberos.) Lustre can use Kerberos to check credentials across a network.
Sun is also adding data-replication features, which should ease the process of doing backups, Bojanic said.
The Lustre team is tackling a number of other interesting problems. Chief among them is management. Lustre has long had a reputation for being difficult to maintain. Bojanic admitted that this has been a concern, though he argued that Lustre has improved in that area. The latest version allows users to format file systems in a manner similar to formatting local file systems on a Unix system. And the team is building a new browser-based console to simplify matters even further.
Another area of focus is performance. Lustre is known for its high throughput: It can deliver up to 90 percent of the raw disk bandwidth, or the bandwidth of the connection between the server and the disk, Bojanic said. The development team is exploring ways to use caching to improve the performance even further, at least for datasets that are read from multiple machines.
Posted by Joab Jackson on Oct 07, 2008 at 9:39 AM