Staging the data in Amazon's Elastic Block Store makes Census information more easily accessible than getting it from the Census Bureau, which offers the material in ZIP files via FTP.
Amazon has reposted a large set of Census Bureau geographic data so its cloud-computing patrons can readily use it.
Users of the online retailer's Elastic Compute Cloud (EC2) can point their virtual machines to copies of the Census Bureau's Topologically Integrated Geographic Encoding and Referencing (TIGER) shapefiles.
The TIGER files are also available from the Census Bureau itself, but only as compressed packages via File Transfer Protocol (FTP). The Amazon copy lets Web applications running in Amazon's cloud pull in the shapefiles as they are needed.
Web development firm Development Seed prepared the TIGER data for the Amazon repository as a by-product of a project it is supporting for the New America Foundation, called the Federal Education Budget Project (FEBP). The shapefiles will be used to make maps to be embedded in Web pages, with each page devoted to statistics about a given school district. There are 14,000 school districts in the country.
In order for the underlying Web application to build each map, the shapefiles had to be copied from the Census site to another location. Census only offers the files bundled into ZIP files via FTP. Pulling them directly into the Web application as they are needed is not feasible.
To make this data available for FEBP, Development Seed unzipped the material and staged it in Amazon's Elastic Block Store (EBS). EBS is the storage back end for Amazon's EC2, which offers a range of operating systems, databases and middleware that can be run on Amazon's servers.
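The unzip-and-stage step can be pictured with a minimal Python sketch. It is an illustration only, not Development Seed's actual tooling: the function name `stage_archives` and both directory arguments are hypothetical, and the sketch assumes the ZIP archives have already been downloaded to a local folder.

```python
import zipfile
from pathlib import Path


def stage_archives(archive_dir: Path, staging_dir: Path) -> list[Path]:
    """Extract every ZIP in archive_dir into its own folder under staging_dir.

    Both directories are hypothetical; the archives are assumed to have been
    downloaded already (for example, from the Census FTP site).
    """
    extracted = []
    for archive in sorted(archive_dir.glob("*.zip")):
        # One folder per archive keeps each shapefile's parts together.
        target = staging_dir / archive.stem
        target.mkdir(parents=True, exist_ok=True)
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(target)
            extracted.extend(target / name for name in zf.namelist())
    return extracted
```

Once the files are unpacked this way, an application can read them directly instead of downloading and decompressing an archive on each request.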
A batch of data on EBS "appears just like an external hard drive when it's mounted to an EC2 instance, which is a virtual machine," said Eric Gundersen, president of Development Seed. "So you can hook up this public virtual disk to your virtual machine and work with the data as if it's local to your virtual machine."
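Because a mounted volume behaves like a local disk, ordinary file operations are enough to work with it. The following Python sketch assumes the volume is already mounted at some path; the function name `find_shapefiles` and the mount point are hypothetical. It lists shapefiles whose `.shx` and `.dbf` companion files are present, since a shapefile needs all three parts to be usable.

```python
from pathlib import Path

# A usable shapefile consists of at least these three sibling files.
REQUIRED = (".shp", ".shx", ".dbf")


def find_shapefiles(mount_point: Path) -> list[Path]:
    """Return .shp paths under mount_point whose companion files also exist."""
    complete = []
    for shp in sorted(mount_point.rglob("*.shp")):
        if all(shp.with_suffix(ext).exists() for ext in REQUIRED):
            complete.append(shp)
    return complete
```

A call such as `find_shapefiles(Path("/mnt/tiger"))` would walk the volume, where `/mnt/tiger` stands in for whatever mount point the instance actually uses.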
TIGER is one of a number of large, public-domain datasets that Amazon has recently posted, free of charge, for its cloud-computing customers. Other public datasets include federal contracts from the Federal Procurement Data Center and Influenza Genome Sequencing from the National Center for Biotechnology Information (NCBI).
Such datasets are made available as an enticement to use Amazon's EC2.
"Money starts changing hands when users create EC2 images ... and transferring and processing the data," said Tom MacWright, Development Seed Web developer. "So Amazon has an interest in keeping this data in the cloud."
Advantages also accrue to the developers in this setup. The data is instantly available, so it doesn't need to be downloaded or uncompressed for EC2 apps. Data transfers from the storage to the EC2 application incur no charge. Large hosted datasets such as the NCBI data are a "huge selling point for people who want to run big computing tasks on this kind of data," MacWright said.
By going with EC2, rather than running an application in-house, a development firm can take advantage of a number of pre-existing virtual machines on EC2 that can streamline the build-out of an application.
For the FEBP project, Development Seed itself used a virtual machine that runs Mapnik, an open-source online map builder. Mapping technologies tend to be rather difficult to set up properly. "Being able to boot up a pre-made free image of an OS and the software significantly reduces the barrier to entry," MacWright noted.