2 technologies that can make storage a small matter

Data deduplication and thin provisioning can change the storage game, saving a lot of space and money

As the guardians of enterprise data, storage specialists tend to be conservative when it comes to adopting new approaches. However, many storage specialists at federal agencies are moving to adopt two fairly new technologies: data deduplication and thin provisioning.

“This is a very risk-averse crowd,” said Dave Russell, a vice president at Gartner Research. However, when it comes to deduplication and thin provisioning, Russell said, “these two technologies are getting a lot of attention, and they’re getting a lot of deploying, too, and deservedly so.”

“In my 20 years in storage, I’ve never seen a technology go from talked about to being deployed in major data centers this fast,” Russell said of data deduplication.

Russell and other analysts say the primary factor that drives such rapid adoption of the new storage technologies is simple: dollars. Because both technologies reduce the investment required to move and store data, they result in lower equipment costs and savings in power and cooling.

Mike DiMeglio, product manager at FalconStor Software, a provider of deduplication software, estimated that deduplication could easily save as much as 50 percent of an organization’s wide-area network costs.


In principle, deduplication is simple. Most data on a server contains a lot of duplicate data. Users might save 20 different versions of a PowerPoint presentation that has only a single changed slide. Or an e-mail message might go to 40 people with the same image attached. With deduplication, when that data is backed up or archived, only a single copy of the image or the presentation is saved, along with an index so individual files or presentations can be rebuilt, or rehydrated, if necessary.
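The idea can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's implementation: the class name DedupStore, the 4,096-byte block size and the 10-slide sample deck are all assumptions made for the example. Each unique block is stored once, and a per-file list of block hashes serves as the index used to rehydrate a file.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size for this sketch


class DedupStore:
    """Toy block-level deduplicating store (hypothetical, for illustration)."""

    def __init__(self):
        self.blocks = {}   # block hash -> block bytes, each stored once
        self.recipes = {}  # file name -> ordered list of block hashes (the index)

    def put(self, name, data):
        recipe = []
        for off in range(0, len(data), BLOCK_SIZE):
            block = data[off:off + BLOCK_SIZE]
            digest = hashlib.sha256(block).digest()
            self.blocks.setdefault(digest, block)  # keep only the first copy
            recipe.append(digest)
        self.recipes[name] = recipe

    def rehydrate(self, name):
        """Rebuild the original file from its list of block hashes."""
        return b"".join(self.blocks[d] for d in self.recipes[name])

    def stored_bytes(self):
        return sum(len(b) for b in self.blocks.values())


def slide(number, revision=0):
    """One made-up 'slide', padded to exactly one block."""
    return f"slide-{number}-rev-{revision}".encode().ljust(BLOCK_SIZE, b".")


# Twenty versions of a 10-slide deck; each version changes a single slide.
store = DedupStore()
originals = {}
for version in range(20):
    slides = [slide(n) for n in range(10)]
    slides[version % 10] = slide(version % 10, revision=version + 1)
    originals[f"deck-v{version}.ppt"] = b"".join(slides)
    store.put(f"deck-v{version}.ppt", originals[f"deck-v{version}.ppt"])

logical_bytes = sum(len(d) for d in originals.values())
print(f"logical: {logical_bytes} bytes, stored: {store.stored_bytes()} bytes")
```

In this made-up data set, the 200 slide-size blocks written by users reduce to 30 unique blocks on disk, and every version can still be rebuilt exactly.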

The potential savings in storage space are immense, especially for federal agencies and departments, which in many cases are required by law to keep detailed backups for long periods.

With typical office data files, deduplication can reduce the size of backups by 90 percent or more. “We get on the order of 95 percent optimization,” DiMeglio said.

Moving smaller amounts of data not only consumes less power and other resources but also can change the way an organization works.

“The next cascading effect is that now we’ve got less data to move, and we can afford to replicate that data from one location to another," Russell said. "Whereas before, we might have only relied on writing out the physical tape and then either using our own people or hiring a service to come and pick up those tapes every day. Now we can transmit them over the wire.”

Although deduplicating can save greatly on storage requirements and related costs, it can also slow performance.

“Dedup is heavy math,” said Steve Foley, director of federal programs at 3Par, a utility storage vendor. “It requires CPU cycles and the system resources within a primary storage array. If it’s deduping while it is maintaining a Web site, say, it will affect the user experience, and the performance of the Web site would be affected because the system is searching for data that it can dedup.”

Potential performance problems also arise when a user needs to retrieve deduplicated data. “Once it’s deduped, the read/write heads have to do a fair bit of thrashing on reads to retrieve the data blocks,” said Dave Swan, an analyst at Knight Point Systems, a systems integration company.

However, vendors and analysts say most users won’t notice the performance lag caused by deduplication when it is used for backups and archiving. That’s partially because most users are moving from tape systems to faster disk-based backup, and disk backup with deduplication is generally significantly faster than tape backup.

Also, users’ performance expectations are different with backup and archiving than they are with many other applications. “In the backup world, if there’s going to be a performance hit of a second or two, you’d never even blink an eye," Russell said. "The same is true in an archiving situation. Now when we get into highly transactional credit card processing system…even minimal overhead is unacceptable.”

Even when used for backup and archiving, deduplication is more effective with certain kinds of data than others.

“What is kryptonite for deduplication?” Russell asked. “Anything that has been previously compressed. Also, anything that is encrypted.”

Compression algorithms generally remove much of the duplication of data in files. Similarly, encrypted files are encoded in a way that the software cannot recognize duplicated data. Files with large amounts of unique data, such as image files, also are not good candidates for deduplication.
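A small experiment illustrates the compression case. The sketch below is an assumption-laden toy, not a benchmark: it uses Python's zlib module, a 4,096-byte block size and synthetic sample data. It compares how many fixed-size blocks two nearly identical files share before and after each is compressed. Encryption is not shown here.

```python
import random
import zlib

BLOCK = 4096  # assumed block size for this sketch


def blocks(data):
    """Split data into fixed-size blocks."""
    return [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]


def shared_fraction(a, b):
    """Fraction of b's blocks that already exist among a's blocks."""
    known = set(blocks(a))
    b_blocks = blocks(b)
    return sum(blk in known for blk in b_blocks) / len(b_blocks)


rng = random.Random(0)
# Synthetic, compressible sample data: 64 blocks drawn from a 4-letter alphabet.
version1 = bytes(rng.choices(b"abcd", k=64 * BLOCK))
# The "same file" with a small edit at the very start.
version2 = b"Z" * 16 + version1[16:]

raw_shared = shared_fraction(version1, version2)
compressed_shared = shared_fraction(zlib.compress(version1),
                                    zlib.compress(version2))

print(f"uncompressed: {raw_shared:.0%} of blocks already stored")
print(f"compressed:   {compressed_shared:.0%} of blocks already stored")
```

Uncompressed, all but the first block of the second version is a duplicate of something already stored. After compression, a small edit early in the file generally changes the byte stream that follows, so the deduplication software finds far fewer matching blocks.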

Analysts caution that vendors approach deduplication in different ways, and agencies and departments that are considering using deduplication should understand which methods are the best fit for their data.

For starters, deduplication software can employ different chunking methods, or methods of defining blocks of data for comparison. Some systems will examine specified sizes of data chunks on drives, and others will compare only complete files, a method known as single instance storage. The most sophisticated and most CPU-intensive method is named sliding block, in which the software examines and analyzes data streams for repeating patterns.
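The difference between fixed-size chunks and a sliding approach shows up when data shifts. The Python sketch below is a simplified stand-in, not a vendor's algorithm: it uses content-defined chunking with a toy rolling hash to represent the sliding-block idea, and the window size, hash parameters and sample data are all assumptions. It inserts a single byte at the front of a file and counts how many chunks still match the original.

```python
import random

BASE, MOD = 257, (1 << 31) - 1  # parameters of a toy polynomial rolling hash


def fixed_chunks(data, size=1024):
    """Cut the data every `size` bytes, regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def sliding_chunks(data, window=16, mask=0x3FF):
    """Cut wherever the hash of the last `window` bytes ends in ten zero bits."""
    chunks, start, h = [], 0, 0
    top = pow(BASE, window, MOD)  # weight of the byte leaving the window
    for i, byte in enumerate(data):
        h = (h * BASE + byte) % MOD
        if i >= window:
            h = (h - data[i - window] * top) % MOD
        if i >= window - 1 and (h & mask) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


rng = random.Random(1)
original = bytes(rng.randrange(256) for _ in range(64 * 1024))
edited = b"X" + original  # one byte inserted at the front


def shared(chunker):
    """Number of distinct chunks the two versions have in common."""
    return len(set(chunker(original)) & set(chunker(edited)))


shared_fixed = shared(fixed_chunks)
shared_sliding = shared(sliding_chunks)
print(f"fixed-size chunks in common:   {shared_fixed}")
print(f"sliding-window chunks in common: {shared_sliding}")
```

Because the sliding method picks its boundaries from the content itself, the boundaries move along with the data and only the chunk that contains the inserted byte changes. Fixed-size boundaries stay put, so every chunk after the insertion looks new. The price is the extra hashing work done at every byte position.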

Although some deduplication systems can only handle a single logical volume or disk spindle, others can be applied across an entire storage array.

Another major difference between deduplication systems is whether they work their magic during the backup process or after data has been moved to secondary storage, a process some call cache and crush.

Although cache and crush imposes fewer CPU demands on the primary storage network, it requires significantly more disk space in secondary storage because it will need to accommodate the entire dataset until deduplication takes place.
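A back-of-the-envelope calculation shows the trade-off. The Python sketch below assumes a simple model that is not taken from any vendor: inline deduplication writes only the reduced data, while cache and crush must hold the full backup in a landing area alongside the reduced copy. The function name and the 10-terabyte figure are hypothetical; the 10-to-1 ratio corresponds to the 90 percent reduction cited earlier.

```python
def peak_secondary_tb(backup_tb, dedup_ratio, post_process):
    """Peak secondary-storage need for one backup, under the simple model above."""
    deduped_tb = backup_tb / dedup_ratio
    if post_process:
        # Cache and crush: the full backup lands first, then is reduced.
        return backup_tb + deduped_tb
    # Inline: data is reduced before it is written to secondary storage.
    return deduped_tb


inline_tb = peak_secondary_tb(10, 10, post_process=False)
cache_and_crush_tb = peak_secondary_tb(10, 10, post_process=True)
print(f"inline: {inline_tb} TB, cache and crush: {cache_and_crush_tb} TB")
```

Real systems can release landing space as deduplication proceeds, so the actual peak depends on the product and on how backup windows overlap.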

Deduplication systems also vary in a number of potentially important ways. For example, not all of them support Symantec’s Open Storage Technology. And not all vendors offer a Fibre Channel interface. Most, if not all, vendors support Redundant Array of Independent Disks 6 and employ redundant power supplies, but not all systems offer failover capabilities.

One complicating factor is that most storage vendors now bundle deduplication tools with their products, yet many customers are not aware of the differences among those tools. “It has become a very pervasive technology, but it’s not standardized,” Russell said. He said he advises clients to carefully examine their requirements and conduct predeployment testing before making a decision on new storage solutions.

Knight Point’s Swan said vendors might not be helpful in that process. His team was called in to help select a backup system for a new data center in the intelligence community that was to host thousands of servers, some virtualized and some not. The datasets included typical office files and specialized sets of encrypted data.

“When I got on the project, the first thing I noticed was that it was grossly undersized,” Swan said. “I believe they were sizing it based on price, not on requirements. We started talking about the amount of floor space necessary, the power and cooling necessary, the number of racks required to back up that much data. It was clear that this thing was going to consume half of the data center, which was not a good fit. So we said they really should look at deduplication as a means of reducing the physical footprint by eliminating redundant data in the backup.”

Swan talked to a number of vendors. “Each of them pretty much wet their finger and stuck it in the air, decided which direction the wind was blowing and said, ‘I think you have X millions of dollars. Here’s your solution.’ And they pushed it across the table,” Swan said. “Never once did any of them do any math. How much data do you have? How fast do you need to shove it in? How fast do you need to get it out? These are big issues, yet none of them did that until we forced them to do it.”

Ultimately, Swan’s team settled on FalconStor, primarily because it allowed them to deduplicate only the unencrypted data, a capability that some vendors didn't offer. Another factor was that FalconStor could deduplicate data across multiple virtual tape libraries, a capability that the other primary candidate lacked, Swan said.

Russell agreed that agencies and departments should be careful about requirements in considering deduplication offerings. “In the grand scheme of things, it’s still incredibly nascent technology,” he said. “It’s going to evolve. That could suggest purchasing tactically or at least trying a solution in a more limited scope as the market continues to evolve.”

Thin Provisioning

Deduplication primarily aims to save disk space on backup servers. Thin provisioning primarily aims to conserve disk space on primary storage.

As with deduplication, thin provisioning's underlying concept is simple. But instead of compressing data, thin provisioning works by telling a simple lie to applications.

Many applications, such as e-mail servers and databases, need to know how much disk space they have to work with. For example, a systems administrator might reserve a terabyte of disk space for a database, and the database might only need 100G at that time. With thin provisioning, the administrator tells the database that 1T of space is available, but only the space needed is delivered, in this case 100G. That frees 900G for other applications to use.
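The bookkeeping behind that lie can be sketched in Python. This is a minimal model under stated assumptions, not how any particular array works: the ThinPool class and its method names are hypothetical, sizes are tracked in whole gigabytes, and 1T is treated as 1,000G.

```python
class ThinPool:
    """Toy thin-provisioned pool: space is consumed only when data is written."""

    def __init__(self, physical_gb):
        self.physical_gb = physical_gb
        self.volumes = {}  # volume name -> {"advertised": gb, "written": gb}

    def create_volume(self, name, advertised_gb):
        # Nothing is reserved here; the application is simply told this size.
        self.volumes[name] = {"advertised": advertised_gb, "written": 0}

    def write(self, name, gb):
        vol = self.volumes[name]
        if vol["written"] + gb > vol["advertised"]:
            raise ValueError(f"{name} is full")
        if self.used_gb() + gb > self.physical_gb:
            raise RuntimeError("pool is out of physical space")
        vol["written"] += gb

    def used_gb(self):
        return sum(v["written"] for v in self.volumes.values())

    def free_gb(self):
        return self.physical_gb - self.used_gb()

    def advertised_gb(self):
        return sum(v["advertised"] for v in self.volumes.values())


pool = ThinPool(physical_gb=1000)
pool.create_volume("database", advertised_gb=1000)  # the database sees 1T
pool.write("database", 100)                         # but has written only 100G
free_after_database = pool.free_gb()                # 900G left for others

pool.create_volume("mail", advertised_gb=500)       # pool is now oversubscribed
pool.write("mail", 200)
print(pool.advertised_gb(), pool.used_gb(), pool.free_gb())
```

The pool now promises 1,500G against 1,000G of real disk. That works only as long as actual writes stay below the physical limit, which is where the risk discussed below comes in.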

Thin provisioning lets organizations configure applications for the future without needing to come up with the entire capital outlay for hardware. With the cost of storage continually dropping, when the additional hardware is actually needed, it will cost less than if it had been purchased at the outset.

The potential reduction in required storage space is significant. “Some studies in the marketplace say that 75 percent of the storage has been presented but is stranded and is never written to,” 3Par's Foley said. The company was the first storage vendor to offer thin provisioning. “So if you have a 100T system, on average only 25 to 30 terabytes are actually written data.”

When Bryan Gilley, director of management information systems for Skokie, Ill., had to replace some servers, he looked to thin provisioning as a way to stretch his budget. “We were able to create a virtualized environment with VM and a storage-area network at less cost than we had planned to spend on the servers,” Gilley said. “We’re able to grow our storage environment in a more planned way and not have all that cost upfront.”

However, analysts warn that there are some risks involved with thin provisioning. Specifically, to avoid running out of disk space, administrators need to know the real storage requirements of all their applications and provide additional disk space in time.

“It does add the risk that if you don’t properly identify growth as it happens, you could run out of space,” said Andrew Reichman, senior analyst at Forrester Research. “It’s important for any organization that is considering using thin provisioning to be really clear about the reporting tools that you’re going to use to mitigate the risk and the processes that you’re going to use to look at those reports and make decisions and keep it safe.”

Thin provisioning has been adopted by most SAN vendors, and it is generally offered at no extra cost. However, vendors’ offerings tend to differ in their reporting and alarm tools, Reichman said.

Swan said procurement cycles might not match up with an administrator’s need for new storage if growth is not adequately projected. When an administrator realizes there’s a crunch, it could take nine months to get new equipment, he said. “Well, it’s a little late,” he said. “My Oracle just crashed.”
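The kind of check a reporting process might run can be sketched in Python. This is an illustration under simple assumptions, not a feature of any vendor's tools: growth is treated as a straight line between the first and last usage samples, the function names and sample figures are hypothetical, and the nine-month lead time from Swan's example is approximated as 270 days.

```python
def days_until_full(samples, capacity_gb):
    """Estimate days until the pool fills, from (day, used_gb) samples.

    Assumes linear growth between the first and last sample.
    Returns None if usage is flat or shrinking.
    """
    (first_day, first_used), (last_day, last_used) = samples[0], samples[-1]
    growth_per_day = (last_used - first_used) / (last_day - first_day)
    if growth_per_day <= 0:
        return None
    return (capacity_gb - last_used) / growth_per_day


def must_order_now(samples, capacity_gb, lead_time_days):
    """True if the pool is projected to fill before new hardware could arrive."""
    remaining = days_until_full(samples, capacity_gb)
    return remaining is not None and remaining <= lead_time_days


LEAD_TIME_DAYS = 270  # roughly nine months

slow_growth = [(0, 400), (30, 430), (60, 460)]  # about 1G per day
fast_growth = [(0, 400), (60, 520)]             # about 2G per day

print(must_order_now(slow_growth, 1000, LEAD_TIME_DAYS))
print(must_order_now(fast_growth, 1000, LEAD_TIME_DAYS))
```

In the first case the pool has about 540 days left, so there is time. In the second it has about 240 days, less than the procurement cycle, and the order should already have been placed.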

Although most SAN vendors offer thin provisioning, analysts say there are significant differences among those implementations.

“The quality of an implementation, the functionality of an implementation, the ease of use of an implementation, the breadth of applications that can benefit from an implementation are going to vary by vendor and by model,” said Stanley Zaffos, vice president of Gartner Research.

For example, some thin-provisioning systems require administrators to assign all of the storage space in full, while others allow incremental assignments, Zaffos said. In addition, some solutions can span different types of storage arrays, but others cannot.

As Foley said, some thin-provisioning tools can’t recognize when a database administrator deletes files in Oracle and reclaim the disk space. That was a capability only recently added to 3Par’s thin provisioning. “Our array is now able to recognize that and reclaim space at the block level,” he said. “That makes tons of difference. In large environments, there are terabytes and terabytes of data that can be reclaimed and reused if it is removed at the database level.”

Zaffos said one unusual challenge with thin storage is that the answers are rapidly changing. “If you’ve got four or five questions and you’re looking at three or four vendors, this is really not an onerous task,” he said. “The problem is that the answers are changing over time. This is a dynamic landscape. Even if you provide a cheat sheet for your readers today, it’s likely to change.”

Swan said not all applications are good candidates for certain thin-provisioning systems. “A little thin provisioning can be a good idea," he said. "But a lot of thin provisioning is a bad idea. Not all applications are going to be good candidates. And the disk array that is used is going to matter. Does it allocate one block at a time? Or does it allocate a 128M chunklet, like 3Par does? Just be aware of the underlying mechanism and how rapidly it grows. Know how it works under the covers before you deploy it too heavily.”
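Swan's point about allocation size can be made concrete with a short Python calculation. The sketch below is hypothetical: the 128M figure comes from his description, while the 8K block size and the write pattern, 100 small writes scattered one per gigabyte across a volume, are assumptions chosen to show the effect.

```python
KIB, MIB, GIB = 1024, 1024 ** 2, 1024 ** 3


def allocated_bytes(writes, unit):
    """Physical space consumed when storage is handed out in `unit`-size pieces.

    `writes` is a list of (offset, length) pairs, in bytes.
    """
    touched_units = set()
    for offset, length in writes:
        first = offset // unit
        last = (offset + length - 1) // unit
        touched_units.update(range(first, last + 1))
    return len(touched_units) * unit


# 100 writes of 8K each, scattered one per gigabyte across the volume.
scattered_writes = [(i * GIB, 8 * KIB) for i in range(100)]

small_unit = allocated_bytes(scattered_writes, 8 * KIB)
large_unit = allocated_bytes(scattered_writes, 128 * MIB)
print(f"8K units:   {small_unit / KIB:.0f} KiB allocated")
print(f"128M units: {large_unit / GIB:.1f} GiB allocated")
```

The same 800K of data consumes 12.5G of physical disk when each write lands in a different 128M unit. An application that scatters writes this way will make a thin volume grow much faster than its data does, while one that writes sequentially will see little difference.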
