Dividing storage resources into multiple layers offers both cost and performance advantages
[Photo caption] The IBM DS8000 storage system can hold up to 96 petabytes of data. The IBM System Storage DS4000 Series is aimed at the storage needs of small and midsize organizations.
[Photo caption] Two approaches: Network Appliance's NearStore R200, a disk-based secondary storage device, and a five-bay EMC Symmetrix DMX-3.
Storage growth is a major problem facing all types of IT managers. While storage costs per gigabyte are plummeting, the demand for capacity is rising even faster. As a result, agencies face two related but distinct challenges: cutting storage costs and continuing to provide users timely access to their data.
When it comes to access, the Geological Survey's Data Center in Middleton, Wis., is ahead of the game. The center uses 40GB of solid-state disk (SSD) devices from Texas Memory Systems Inc. of Houston to hold its most active databases in RAM.
'The solid-state disks hold the data that is high priority to give to customers fast, or it might be data files that are hot and get hit a lot,' said data center director Harry House. 'If you are I/O bound, SSD is a godsend. You can achieve some real performance breakthroughs with it.'

Breaking ground
But just as important as cutting costs is the long-term preservation of electronic data. The National Archives and Records Administration is leading the way in this area. Last year, NARA awarded a $308 million contract to a team headed by Lockheed Martin Corp. to begin establishing the Electronic Records Archive system.
'NARA's in the business of archiving information for the life of the republic, and the electronic records will continue to grow,' said Clyde Relick, Lockheed Martin's program director for the ERA contract. 'Essentially we are building a system that has to be able to incorporate new technology and be scalable for having unlimited amounts of storage.'
Whether one is concerned with providing enough disks right now or building an archive to last for millennia, proper system design is essential.

Storage vs. archiving
The term 'archiving' is used for two distinctly different purposes. One is as part of a standard backup or disaster recovery program, where data is put on tape for offsite storage. The other is to make the data available for long-term access. In this case, the data can be stored on either disk or tape.
An approach that has grown in popularity of late, multitiered storage is primarily a means of balancing cost and availability. According to San Francisco-based consulting firm the 451 Group, primary SCSI disk storage costs about $2 to $6 per gigabyte, secondary ATA drives about 50 cents per gigabyte, and tape only 12 cents per gigabyte.
From an availability viewpoint, everything would be on disk. From a cost viewpoint, everything should be on tape, but tape doesn't meet the need for availability.
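Those per-gigabyte figures make the economics easy to sketch. The short calculation below uses the 451 Group's prices cited above; the 100TB data set and the 10/20/70 hot/warm/cold split are illustrative assumptions, not figures from the article:

```python
# Rough media-cost comparison for placing 100TB across storage tiers,
# using the 451 Group's per-gigabyte prices (SCSI taken at a $4 midpoint).
COST_PER_GB = {"scsi": 4.00, "ata": 0.50, "tape": 0.12}

def tier_cost(gb_by_tier):
    """Total media cost for a given distribution of data across tiers."""
    return sum(COST_PER_GB[tier] * gb for tier, gb in gb_by_tier.items())

total_gb = 100_000  # 100TB

all_disk = tier_cost({"scsi": total_gb})
tiered = tier_cost({"scsi": 0.10 * total_gb,   # hot data stays on fast disk
                    "ata": 0.20 * total_gb,    # warm data on cheaper disk
                    "tape": 0.70 * total_gb})  # cold data on tape

print(f"All on primary SCSI disk: ${all_disk:,.0f}")
print(f"Tiered 10/20/70 split:    ${tiered:,.0f}")
```

Even with SCSI priced at the $4 midpoint, the tiered layout costs roughly a seventh of keeping everything on primary disk, which is the whole argument for tiering in one line of arithmetic.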
'Tape in general is increasingly being exposed as a substandard medium for backups,' said Simon Robinson, storage research director at the 451 Group. 'Users like it because it's cheap; but apart from that, it's inherently unreliable and delivers poor performance.'
Backup, continuous data protection, mirroring, replication, snapshots and other technologies typically use some form of tiered storage. This approach lets you prioritize stored data and use different storage media for each type: say, mission-critical Tier One data on high-performance SCSI disk, less commonly used data on ATA drives, and rarely used or backup data on tape. Tape is the old standard for backups, but disk technologies are gaining popularity as the prices for the two media converge.
'It's exceptionally difficult to restore from tape (especially if the data you want is very old or is stored off-site), which if you think about it is the whole point of backup,' said Robinson.
'Tape still has its place as a longer-term archive format, but it is being superseded at a rapid rate by disk, especially in larger enterprises,' he added.
The trick is to find the optimum balance between tape and disk. This is where Information Lifecycle Management comes in.
'Archiving is a mechanism to expand your tiering strategies and provide the right-cost component to business needs,' said Robert Stevenson, managing director of storage for The Info Pro in New York. 'Archiving is looking at how data changes over time. You can control consumption of Tier One and Tier Two high-cost storage by moving data to a tertiary tier.'
But in doing so, it is not necessary to archive everything.
'There is a difference between archiving and hoarding,' said Dorian Cougias, CEO of Network Frontiers LLC of Oakland, Calif., and co-author of The Backup Book: Disaster Recovery from Desktop to Data Center (Schaser-Vartan Books). 'Archiving is done to fulfill a compliance obligation, and 95 percent of the data you are storing on the network falls completely outside that scope.'

What's ILM?
Information Lifecycle Management, or ILM, is a strategy for automatically moving data from one storage tier to another to cut the cost of storing less frequently accessed material.
Consider how a bank handles a customer's deposit, Cougias said. For the first few weeks, the teller can produce a copy of the transaction. Then it goes onto a system the bank manager can access. After six months, the data is archived, and the customer has to put in a request and wait to receive a copy. Eventually the data is erased or destroyed.
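The bank example amounts to a simple age-to-tier policy. A minimal sketch follows; the article gives only 'a few weeks' and 'six months', so the exact cutoff values here are illustrative assumptions:

```python
def tier_for_age(age_days):
    """Map a record's age to where the bank keeps it, following the
    lifecycle Cougias describes. Cutoff values are illustrative."""
    if age_days <= 21:          # first few weeks: teller can produce a copy
        return "teller system"
    elif age_days <= 180:       # up to six months: manager-accessible system
        return "manager system"
    elif age_days <= 365 * 7:   # archived: retrieval on request, with a wait
        return "archive"
    else:                       # eventually erased or destroyed
        return "destroyed"

for age in (7, 90, 400, 4000):
    print(age, "days ->", tier_for_age(age))
```

The point of writing it down this way is that the policy is just a lookup from age to tier; the hard part in practice is deciding the cutoffs and the exceptions, not the mechanism.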
Robert Eckstein, assistant network manager for the Ninth Circuit Court of Appeals, manages 400GB of single-tier storage, which is backed up on tape. That system is adequate for now but may not meet future needs. 'We are looking at ILM,' he said.
'We expect that our storage will increase significantly when our court is on the new Case Management/Electronic Case Files system. At that point we will need a much more in-depth storage system.'
The simplest way of implementing ILM is to store all the data initially on Tier One storage and then migrate the material to other, cheaper tiers over time. But this approach doesn't necessarily meet all business needs, so vendors have suggested more complex sets of rules, based on the types of documents being held or how often they have been accessed. This approach, too, has its limitations.
'ILM has ended up being used by the industry to create a notion that data will be created on a certain class of storage, then, based on policies, age or something else, will dynamically migrate to lower-cost storage,' said Manish Goel, vice president and general manager of data protection and retention for Network Appliance Inc. of Sunnyvale, Calif. 'That is such an administratively complex architectural solution and has never really taken off.'
That doesn't mean the basic concept of moving data through different levels of storage is wrong, however, only that a mature approach is needed to define a strategy appropriate to one's own needs.
Databases, for example, need to be handled differently than documents. The Transportation Department's Research and Innovative Technology Administration (RITA) has 15TB of storage for data warehousing of transportation-related databases that are used for analysis and reporting, data collection and processing systems, and Web sites to make that data publicly available.
'At this time, all of our data are maintained on a single tier; we have not archived anything,' said Terry M. Klein, director of the Office of Information Technology and deputy CIO. 'We do, of course, back up all our data to tape.'
The USGS data center also keeps its databases active, but migrates them to different types of storage based on the level of requests for data. Those that receive more requests, or require faster response times, stay on the Tier One SSD devices.
Others with a lower number of I/O requests stay on Tier Two disks. Middle-tier storage consists of several terabytes of databases on Network Appliance storage appliances. These are then backed up to about 10TB of disk storage devices from Excel Meridien Data Inc. of Carrollton, Texas. Finally, the data is archived to tape and moved off-site.
The University of New Mexico's Health Sciences Center, however, does use the traditional ILM concept of migrating documents with age. The university has 10TB of storage for general-purpose information, which largely consists of about 10 million files users have uploaded to the central storage to back up their hard drives.
About two-thirds of that is primary storage. The center uses hierarchical storage management software from CaminoSoft Corp. of Westlake Village, Calif. The software creates a stub file in the primary tier that points to the document's location in the secondary tier.
'We do it on a simple rule,' said IT systems manager Barney D. Metzner. 'If the file creation date and last access date goes longer than, on average, six months, we migrate it, though there are a number of exceptions for files such as databases and Power Points.'
Such systems are complicated, and Metzner said some of his technicians who have to deal with the complexity view ILM as a negative.
'I still weigh it as a positive,' said Metzner. 'It continues to keep us in business as our storage needs grow.'
Beyond moving data to cheaper disks or tape to cut storage costs, archiving for long-term access carries its own set of challenges. To begin with, there is the problem of keeping data accessible despite changing technologies and the deterioration of storage media.

Long-term view
'Government agencies face the same problems everyone does: maintaining secure and cost-effective long-term readability, physically and logically,' said Michael Peterson, the program director of the Storage Networking Industry Association's Data Management Forum. 'Media has to be migrated every three to five years to assure physical readability, and application data formats have to be maintained throughout revision changes, application changes, and reader changes.'
There is also the matter of finding the data once it has been archived. It is not the same as restoring a file from a backup tape.
'Backup is for mass restoration,' said Cougias. 'Archiving is "Give me the needle in the haystack, and I want it in a readable format."'
Before issuing an RFP, therefore, he recommends doing a thorough analysis of one's compliance requirements, defining what a record is and then defining how those records will be used. Typically, only 5 percent of the data actually needs to be archived.
An agency can't just take a cut-and-paste approach to writing an RFP for its own archive, said Grant Stephen, CEO of Tessella Inc. of Newton, Mass. Tessella oversaw the national archives projects in the U.K. and the Netherlands and is part of the ERA project team.
Each organization should have its own policies and procedures for data creation, security and storage. Unless these are examined ahead of time and an actual understanding reached of what needs to be archived and how it will be accessed, the RFP winds up being self-contradictory. (Stephen said he has seen a few of those.)
'The organization has spent millions or billions or hundreds of billions of dollars building the system,' he said. 'The number one thing is to stop thinking about the problem and start dealing with it.'