Donald Becker | The inside story of the Beowulf saga
- By Joab Jackson
- Apr 13, 2005
Donald Becker, Beowulf pioneer
In early 1993, then-NASA employees Donald Becker and Thomas Sterling devised a way to yoke multiple low-cost desktop computers together so they could offer the combined performance of a much higher-cost supercomputer.
Twelve years later, Becker, who has since left NASA and founded cluster software maker Scyld Software, can take a degree of pride in the fact that more than 50 percent of the machines on the Top 500 List of supercomputers are clusters of this sort. And many on the list are built on Becker and Sterling's own specific architecture, the Beowulf cluster.
In addition to co-developing Beowulf, Becker has been a major contributor to Linux, having contributed more than 60 device drivers to the open-source operating system.
Although Becker still stays abreast of the Beowulf community (he regularly attends the Baltimore-Washington Beowulf User Group meetings), he mostly keeps busy as chief scientist for Scyld of Annapolis, Md. Scyld, which Becker formed in 1998, is now a subsidiary of scalable-computer vendor Penguin Computing Inc. of San Francisco.
While you can build your own Beowulf cluster (visit www.beowulf.org), Scyld offers software to manage clusters in an enterprise setting.
Becker has a bachelor's degree in electrical engineering and computer science from the Massachusetts Institute of Technology.
GCN associate writer Joab Jackson interviewed Becker by telephone.
GCN: Prior to creating Beowulf, what did you do at NASA?
Becker: I actually moved over from the Institute for Defense Analyses' Supercomputing Research Center, which was essentially doing research for the National Security Agency. When I wasn't able to successfully start a project there, the co-founder of the Beowulf Project, Thomas Sterling, found funding through NASA. So I moved over to NASA's Center of Excellence in Space Data and Information Sciences, which is run by the Universities Space Research Association. So basically I moved from NSA funding to NASA funding, in both cases through a nonprofit institute.
NASA was interested [in the project] primarily for modeling climate data and processing sensory information. The clusters were to supplement supercomputers.
GCN: How did you and Sterling come up with the idea of a cluster?
Becker: I had worked on parallel processing, especially shared-memory parallel processing, since I was at the Massachusetts Institute of Technology. I worked on distributed computing with tightly coupled shared-memory machines. I felt those machines were expensive and tended to lag on the leading edge. The leading edge was starting to curve towards personal computers. Previously, you would find the best price-performance [ratio] at the very high end, with the supercomputers. But that was becoming increasingly less true. Workstations had the lead for a while, but clearly PCs were starting to offer the best price-performance.
For the very scientific uses, the only thing people cared about was how many computing cycles they could get out of the machine. So then the key element became how to put these machines together to make a more powerful machine.
From a price-performance curve, it was obvious to me, but so many people in the high-performance computing community completely rejected the idea without even considering it. In the government world, there was resistance against even small-scale funding of efforts.
GCN: Why did you choose the name Beowulf (hero of the 11th-century epic poem of the same name)?
Becker: The credit goes to Sterling, who is an Anglophile. Beowulf is the oldest written English. Some translations have the line 'Because my heart is pure, I have the strength of a thousand men.' With the Beowulf project, we were trying to follow the Linux model for development: not just to build one piece of software for one machine, but to build a community effort.
GCN: What is the difference between a Beowulf cluster and other multiprocessor systems, such as failover clusters and symmetric multiprocessing (SMP) machines?
Becker: I think a lot of people don't really have a good definition of what a cluster is. To me, a cluster combines independent machines, machines capable of standalone operation, into a unified system, using a combination of software and networking.
Beowulf clusters are scalable performance machines. Failover clusters offer higher reliability with the unified system. SMP is a machine designed generally within one chassis, which has a number of tightly integrated processors. With a cluster, you have the opportunity to incrementally scale, where an SMP is generally built to a [preconfigured] size.
GCN: Were you surprised by the success of Beowulf?
Becker: I was surprised both by the success of Linux and the success of Beowulf clusters. In both cases, we were trying to influence the world, but not necessarily to directly succeed. If by providing examples, we helped other people to build better computer systems, that would have been enough to call it a success.
GCN: What does Scyld do?
Becker: I founded the company with the goal of taking everything we had learned and designing a new system to make it easy to deploy clusters. Clusters were far too complex for people whose primary goal wasn't to do scientific computation. We wanted a deployment system where every cluster looked the same from an architectural point of view.
At first, we were still thinking in terms of high-performance computing. We were trying to make the perfect Message Passing Interface-based machine. But we recognized that what we were really doing was managing large sets of machines, which is much more general than supercomputing.
Today we are making large sets of machines as easy to administer as a single standalone machine. In a supercomputing setting, you are trying to get all the machines to work on a single job. We have come up with a system that can deploy dozens or hundreds of servers working on independent jobs.
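The distinction Becker draws here, many machines cooperating on one job versus many machines each running an unrelated job, can be illustrated with a minimal Python sketch. It is not Scyld's software or any real cluster API: threads stand in for machines, and the names `NODES`, `partial_sum`, `run_single_job`, and `run_independent_jobs` are hypothetical, chosen only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

NODES = 4  # hypothetical machine count; threads stand in for cluster nodes


def partial_sum(chunk):
    """The work one node does on its slice of a shared job."""
    return sum(x * x for x in chunk)


def run_single_job(data, nodes=NODES):
    """Supercomputing style: every node works on a piece of the same job."""
    # Deal the data out round-robin, one slice per node.
    chunks = [data[i::nodes] for i in range(nodes)]
    with ThreadPoolExecutor(max_workers=nodes) as pool:
        # Combine the per-node partial results into one answer.
        return sum(pool.map(partial_sum, chunks))


def run_independent_jobs(jobs, nodes=NODES):
    """Server-farm style: unrelated jobs, each run on whichever node is free."""
    with ThreadPoolExecutor(max_workers=nodes) as pool:
        futures = {name: pool.submit(fn, *args) for name, (fn, args) in jobs.items()}
        return {name: future.result() for name, future in futures.items()}


if __name__ == "__main__":
    print(run_single_job(list(range(10))))
    print(run_independent_jobs({
        "word_count": (len, ("a b c".split(),)),
        "largest": (max, ([3, 9, 2],)),
    }))
```

In the first function the nodes' results only make sense once combined; in the second, each result stands alone, which is closer to the general server-management case described above.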
People have long complained about the management costs of machines. The cost of the hardware has dropped, but the cost of managing those servers (updating them, tracking what software is running on them, balancing applications and services over them) has increased.
There are many different ad-hoc systems that try to manage Microsoft Windows and Unix systems, but they have a completely different architectural approach: You do a full install on each machine, then try to build a set of scripts to manage those machines. Instead, we have a master machine with the reference software installed, and every other machine will be a client system or an attached server system. And that turns out to be much easier to manage.
GCN: It sounds like you're offering software very similar to what Cassatt Corp. of San Jose, Calif., offers.
Becker: Well, in fact, we came out with our system just before Cassatt announced the technical plans for what it was going to do, in 2001. I had to give a talk [at a conference] just after Steven Oberlin, the chief architect of Cassatt's system. I was able to say, 'Everything they said they were going to do, we can already do.'
But Cassatt has exactly the right idea. Why not manage a large set of machines as a single machine? We won't need to buy expensive SMPs; we can use commodity machines and put a software layer over them that makes them as easy to manage as the big SMPs.
In the beginning, with the first Beowulf clusters, we were just trying to make the machines work. The next [challenge] was how to make these machines communicate. And those problems were essentially solved. So we moved to the next level up: how do we deploy these machines, where the software looks like an overall unified system?
Along each step of the way, there have been people who said it couldn't possibly work. Everybody today says they came up with the idea of clusters, but if you look back then, the people who weren't ignoring it were opposed to the idea.
You can only have that perspective from the inside, of course.