NSF researchers test genetic storage technique
- By Joab Jackson
- May 19, 2005
While the merits of evolutionary theory have recently been debated in the courts of Kansas, the ideas of Charles Darwin have proved useful in at least one arena, National Science Foundation-funded researchers have found: distributed data storage systems.
Looking for ways to ensure data stays intact on an experimental network, the group found that genetic algorithms provided a handy shortcut in determining how to distribute data across multiple machines, according to Leana Golubchik, a professor of computer science at the University of Southern California. Golubchik, who is a co-researcher on the project, presented the technology at this year's NSF-sponsored Digital Government Research conference held this week in Atlanta.
'To the best of our knowledge we are the first to apply genetic algorithms to the data assignment problem in many-to-one applications,' read the paper that accompanied Golubchik's presentation.
The findings stem from a data transfer technology called Bistro that the researchers have been working on for the past few years with NSF support. Bistro would allow agencies to accept large amounts of data arriving simultaneously from many sources. Such a service could prove useful to agencies such as the IRS, which sees an influx of material around tax deadlines.
The technology requires end users to upload their data to intermediary computers located around a network, called Bistros. For each submission, the Bistro generates a receipt, a unique checksum that verifies the integrity of a block of data, and sends it to the agency. The agency records the time each receipt is received. Since these checksums are much smaller than the original files, the agency's servers won't be overloaded with traffic as users try to submit their material all at once before a deadline, Golubchik said.
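The article does not say which checksum Bistro uses; as an illustration of the receipt idea, here is a minimal Python sketch using SHA-256 (the function and field names are hypothetical):

```python
import hashlib
import time

def make_receipt(data: bytes) -> dict:
    """Generate a receipt for an uploaded block: a fixed-size digest
    plus a timestamp. The digest is far smaller than the data itself,
    so forwarding only the receipt to the agency is cheap."""
    return {
        "checksum": hashlib.sha256(data).hexdigest(),
        "received_at": time.time(),
    }

submission = b"taxpayer form data ..." * 1000
receipt = make_receipt(submission)
# A multi-kilobyte upload is summarized by a 64-character digest.
print(len(submission), "bytes ->", len(receipt["checksum"]), "characters")
```

Whatever the actual scheme, the key property is that the digest is fixed-size and verifiable, so the agency can timestamp submissions without receiving the full files.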
To keep the data intact even if a number of Bistro computers go offline, the researchers used a technique for breaking up the files in such a way that they can be completely reconstructed even if a few of their pieces are missing. The files then could be distributed throughout the Bistro system.
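The technique described, splitting a file so it survives missing pieces, resembles erasure coding. A minimal sketch using a single XOR parity block (which tolerates one lost piece; production erasure codes such as Reed-Solomon tolerate several) might look like this:

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

def encode(pieces):
    """Append an XOR parity block so any ONE missing piece can be rebuilt.
    All pieces must be the same length."""
    return pieces + [xor_blocks(pieces)]

def reconstruct(blocks):
    """Rebuild the original pieces when exactly one block is None:
    XOR-ing the surviving blocks recovers the missing one."""
    missing = blocks.index(None)
    blocks[missing] = xor_blocks([b for b in blocks if b is not None])
    return blocks[:-1]  # drop the parity block

pieces = [b"AAAA", b"BBBB", b"CCCC"]
stored = encode(pieces)
stored[1] = None                      # one Bistro goes offline
assert reconstruct(stored) == pieces  # the file survives
```

The trade-off mirrors RAID: a little extra storage (the parity block) buys resilience against machine failures without keeping full copies everywhere.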
One of the problems the researchers faced in building Bistro was finding the most efficient way of distributing these file fragments. 'The data assignment problem is important,' Golubchik said, because bad design can slow performance. The researchers found that an approach loosely based on genetic algorithms distributed the data most efficiently.
For this study, they assigned data packets to Bistro computers in three ways reminiscent of evolutionary principles: by utilizing the best computers as much as possible, by swapping data between random pairs of computers and by a process known as mutation, in which a small number of data fragments are randomly assigned across different computers.
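As a rough illustration of those three operations, selection, crossover and mutation, applied to data assignment, here is a toy genetic algorithm in Python. The fitness function, population size and mutation rate are all invented for this example; the researchers' actual objective would also account for load balancing and how widely fragments are dispersed:

```python
import random

def fitness(assignment, reliability):
    """Toy objective: the average reliability of the machines chosen.
    A real objective would also penalize concentrating fragments."""
    return sum(reliability[m] for m in assignment) / len(assignment)

def evolve(n_fragments, reliability, generations=200, pop_size=30):
    machines = list(range(len(reliability)))
    # Each individual is one candidate assignment: fragment -> machine.
    pop = [[random.choice(machines) for _ in range(n_fragments)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda a: fitness(a, reliability), reverse=True)
        survivors = pop[:pop_size // 2]      # selection: keep the fittest half
        children = []
        while len(survivors) + len(children) < pop_size:
            p1, p2 = random.sample(survivors, 2)
            cut = random.randrange(1, n_fragments)  # crossover: swap tails
            child = p1[:cut] + p2[cut:]
            if random.random() < 0.1:               # mutation: reassign one fragment
                child[random.randrange(n_fragments)] = random.choice(machines)
            children.append(child)
        pop = survivors + children
    return max(pop, key=lambda a: fitness(a, reliability))

best = evolve(n_fragments=12, reliability=[0.99, 0.90, 0.75, 0.60])
print("best assignment found:", best)
```

The point of the approach is that the search never enumerates all possible assignments; it keeps improving a small population of candidates, which is why it scales where brute force does not.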
To test the results, the researchers then compared this method with more traditional approaches, such as putting all the data on the most reliable computer, spreading the data evenly across all the computers (a technique used in RAID-based storage devices) or assigning a proportional amount of data to each computer based on how reliable that computer was. They also tried the most precise technique, called the brute-force approach, which would calculate the best location to place each fragment. That approach would be too computationally intensive for use on a complex network.
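The simpler baseline strategies are easy to sketch. Assuming each machine has a reliability score (all names here are illustrative), they might look like this:

```python
def all_on_best(n, rel):
    """Place every fragment on the single most reliable machine."""
    best = max(range(len(rel)), key=rel.__getitem__)
    return [best] * n

def even_spread(n, rel):
    """Round-robin across all machines, as in RAID striping."""
    return [i % len(rel) for i in range(n)]

def proportional(n, rel):
    """Give each machine a share of fragments proportional to its reliability."""
    total = sum(rel)
    counts = [round(n * r / total) for r in rel]
    assignment = [m for m, c in enumerate(counts) for _ in range(c)]
    best = max(range(len(rel)), key=rel.__getitem__)
    while len(assignment) < n:   # rounding can leave a shortfall
        assignment.append(best)
    return assignment[:n]

rel = [0.99, 0.90, 0.75, 0.60]
print(all_on_best(8, rel))
print(even_spread(8, rel))
print(proportional(8, rel))
```

Each baseline is cheap to compute but ignores part of the picture, which is why the genetic-algorithm search can beat them while staying far cheaper than brute-force enumeration.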
Using genetic algorithms 'seems promising in that it seems efficient when compared to the brute-force approach, and more accurate than the simple strategies,' Golubchik said.
Golubchik cautioned that using genetic algorithms only provides a shorthand way of approaching the problem of storing data. 'It is just another approximation method if you think computation of the optimal method is too costly,' she said. 'If you are searching through the solution space, this is just one method to guide the search.'
Joab Jackson is the senior technology editor for Government Computer News.