Smarter than smart

How long will your hard drives last? New reports suggest that estimates aren't reliable and that life cycles might not be as long as you think.

For further research:

'Get S.M.A.R.T. for Reliability' (Seagate, 1999)

This is the paper that first defined the benefits of hard drive SMART technology, the industry standard that could be used to predict hard drive failures.

'Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?'

In this paper, two Carnegie Mellon University researchers analyzed how often disk drives were replaced in large data centers. They found that drives were replaced more often than predicted by manufacturers' estimates for how long the drives should last.

'Failure Trends in a Large Disk Drive Population'

Google researchers studied hundreds of thousands of disks on their own server farms and found the SMART technology did little to help predict failures.

Hard drive smarts

Self-Monitoring, Analysis and Reporting Technology (SMART), a feature of new hard drives, reports on hard drive conditions, including the drive, disk heads, surface state and electronics. All the major hard drive manufacturers subscribe to the SMART system, and system and operating systems vendors incorporate various combinations of SMART attributes.

SMART keeps track of dozens of hard drive attributes. Some critical SMART attributes are:

  • Read error rate: The rate of hardware errors when reading data from a disk surface.
  • Reallocated sectors count: The number of sectors reallocated, meaning their data was transferred to another sector after a read/write/verification error.
  • Reallocation event count: The number of attempts to transfer data from reallocated sectors.
  • Current pending sector count: The number of unstable sectors waiting to be remapped.
  • Uncorrectable sector count: The number of uncorrectable errors when reading/writing sectors.
  • Disk shift: The distance the disk has shifted relative to the spindle.

SMART also reports various measures of temperature and the flying height of the heads above the disk.

Edmund X. DeJesus

It's a miracle that hard drives work at all. The read-write heads fly only nanometers above the disk surface, which spins at 7,200 revolutions per minute or faster. If the heads fly a little too high, the magnetic domains become inaccessible and the data is unreadable. If they fly just a little too low, it's crash city.

Despite this precarious state of affairs, agencies continue to entrust their most valuable data to hard drives. It's a breathtaking leap of faith.

And now this faith is being tested as never before, through some studies that show hard drives don't last nearly as long as previously imagined.

Hundreds of millions of hard drives are already in use in agency data centers, and millions more are sold and installed every year. These hard drives handle current data for applications and backed-up data for archiving. Hard drive failure can mean not only temporary data unavailability but also permanent data loss.

With such large numbers of hard drives deployed, managing them consumes a significant part of information technology budget and effort. Agencies need to keep the data stored and flowing. This means anticipating hard drive failure, moving data on risky drives and replacing failing drives before they give up the ghost completely.

And anticipating these failures may be trickier than previously assumed.

Studying failure

Luckily, it's not necessary to rely completely on faith to make such predictions. We can analyze hard drive failures the same way we do for any other electromechanical device.

There are basically two classes of failures: predictable and unpredictable. Unpredictable failures, such as circuits burning out, occur suddenly and randomly. There's no warning or advance notice, so there's no strategy for anticipating them. All you can do is mop up after one occurs.

The situation is more hopeful for predictable failures, which include most mechanical failures. Typically, parts age and wear out gradually. As a drive's performance degrades, or simply changes, over time, we can anticipate its ultimate failure well before it happens.

A landmark study by Seagate in 1999 indicated that some 60 percent of drive failures are predictable. Any handle we can get on such failures will clearly be helpful in managing vast numbers of hard drives.

Fortunately, we have technologies in place to help monitor the condition of hard drives, make repairs and adjustments, and predict the onset of failure. Self-Monitoring, Analysis and Reporting Technology (SMART) reports on hard drive conditions, including the drive, disk heads, surface state and electronics (see sidebar, 'Hard drive smarts').

The goal of SMART is to warn systems administrators of impending drive failure while there's still time to take preventive action, such as copying threatened data to another storage device.

All the major hard drive manufacturers subscribe to SMART, and system and operating systems vendors incorporate various combinations of SMART attributes.

The original version of SMART functioned by monitoring certain online hard drive attributes. The next version included off-line attributes, which gave more information and thus improved failure prediction. The latest version, SMART III, adds the ability to detect and repair sector errors, using more off-line data acquired during periods of hard drive inactivity.

Not so fast

Industry estimates suggest that SMART can predict about 30 percent of hard drive failures. However, two recent independent studies indicate that failure prediction is not so simple, and SMART may not be as helpful as once thought.

The USENIX Conference on File and Storage Technologies in February included two studies of hard drive failures that reached similar conclusions.

The first, 'Disk Failures in the Real World: What Does an MTTF of 1,000,000 Hours Mean to You?' was written by Bianca Schroeder and Garth Gibson of Carnegie Mellon University's computer science department. These researchers looked at about 100,000 hard drives, some over their entire five-year lifetime.

The hard drives in this study had nominal mean times to failure (MTTF) ranging from 1 million hours to 1.5 million hours, which is typical for the industry.

Those levels suggest a failure rate of at most 0.88 percent per year. However, the researchers found that in the field, annual replacement rates (ARR) routinely exceeded 1 percent, with 2 percent to 4 percent common, and some were as high as 13 percent. The weighted average ARR was actually 3.4 times higher than 0.88 percent. They concluded that such failure rates were not what we should expect based on the rated MTTFs.
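
The 0.88 percent figure follows from simple arithmetic, which the short Python sketch below reproduces. It assumes a drive that is powered on around the clock (8,760 hours per year) and uses the common approximation that annual failure rate equals hours per year divided by MTTF. The function name nominal_afr is ours, chosen for illustration, and does not come from the study.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, assuming a drive that is always powered on


def nominal_afr(mttf_hours: float) -> float:
    """Approximate annual failure rate implied by a rated MTTF."""
    return HOURS_PER_YEAR / mttf_hours


# The two ends of the MTTF range quoted for the drives in the study.
for mttf in (1_000_000, 1_500_000):
    print(f"MTTF {mttf:>9,} h -> nominal AFR {nominal_afr(mttf):.2%}")

# The study's weighted average replacement rate was 3.4 times the nominal figure.
observed_arr = 3.4 * nominal_afr(1_000_000)
print(f"Observed weighted average ARR: about {observed_arr:.1%}")
```

Under these assumptions, a 1 million-hour MTTF works out to about 0.88 percent per year and a 1.5 million-hour MTTF to about 0.58 percent, while the observed weighted average comes to roughly 3 percent.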

The researchers used ARR rather than MTTF because in actual practice, administrators replace drives that may not have failed yet. 'Drive replacements include drives that have not failed, commonly referred to as no-trouble-found drives, which can make up as much as 40 percent of that replacement population,' said David Szabados, a spokesman at Seagate Technology.

Zeroing in on specific age ranges of hard drives, the researchers found that for older systems (five to eight years of age), MTTFs underestimated actual replacement rates by a factor of as much as 30. Even for young drives (less than three years), the difference was as large as a factor of 6.

Contrary to industry expectations, they observed that replacement rates grew steadily with age. Clearly, all these results have significance for planning hard drive acquisitions and managing hard drive populations.

What is the lesson here? First, hard drives may require replacement more frequently than stated MTTFs suggest; in fact, several times more frequently. In addition, even after a shakedown year establishes that a drive isn't a dud, you cannot trust that the drive will be good for the next three to four years, as is commonly assumed. Because drive failure increases with time, older drives are more suspect.

Not so smart

The other study from the USENIX Conference was 'Failure Trends in a Large Disk Drive Population' by Eduardo Pinheiro, Wolf-Dietrich Weber and Luiz Andre Barroso, all researchers at Google, an enterprise with considerable experience with hard drives.

In many ways, Google is in the perfect position to study hard drives, because it runs so many of them in its data centers.

Like the Carnegie Mellon work, this study involved more than 100,000 hard drives. The researchers observed annual failure rates (AFRs) from 1.7 percent for drives in their first year to more than 8.6 percent for those at least three years old. These results closely match the Carnegie Mellon outcomes.

But the Google study went one step further. It also included an analysis of SMART parameters and their correlation to hard drive failure.

Although other studies, and common sense, suggest that higher temperature or activity levels would contribute to hard drive failure, this study found little correlation. However, certain SMART attributes were found to have a large impact on failure probability. For example, after their first scan error occurs, hard drives are 39 times more likely to fail within 60 days than drives with no such errors. Similarly, first errors in reallocations, off-line reallocations and probational counts also strongly correlate to higher failure probability.

'Methods of predicting hard drive longevity are mostly age-related, with modifiers such as the four SMART data we found,' said Pinheiro, a software engineer at Google.

Yet perhaps the most remarkable finding of this study was the lack of impact of SMART attributes on most failures. Out of all the failed drives, more than 56 percent of them had no count in any of those four best SMART predictors. So even models based on those good predictors can never anticipate more than half of drive failures. Even including all SMART attributes, more than 36 percent of all failed drives had zero counts. In other words, many failed drives gave no advance warning that they were failing.

The researchers concluded that 'given the lack of occurrence of predictive SMART signals on a large fraction of failed drives, it is unlikely that an accurate predictive failure model can be built based on these signals alone.' Better predictions will need to use more information than SMART provides.

What does this mean? Try not to rely too heavily on SMART. The first occurrence of certain SMART events (namely, scan errors, reallocation counts, off-line reallocation counts and probational counts) can be important clues that hard drive failure is imminent. However, most hard drive failures will occur with no meaningful warning whatsoever. 'The accuracy of longevity prediction is very poor for any single drive,' Pinheiro said.
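
As a rough illustration of how an administrator might act on those four signals, the Python sketch below flags any drive whose count for one of them is nonzero. It is a minimal sketch, not the Google researchers' method: the attribute labels, the function names and the sample readings are all hypothetical, and in practice the counts would have to come from whatever SMART monitoring tool is in use.

```python
# The four signals the article singles out. The key names below are
# hypothetical labels for this sketch, not identifiers from any SMART tool.
STRONG_SIGNALS = (
    "scan_errors",
    "reallocation_count",
    "offline_reallocation_count",
    "probational_count",
)


def warning_signals(counts: dict[str, int]) -> list[str]:
    """Return the strong signals whose raw count is nonzero."""
    return [name for name in STRONG_SIGNALS if counts.get(name, 0) > 0]


def should_retire(counts: dict[str, int]) -> bool:
    """Flag a drive as soon as any strong signal shows its first event."""
    return bool(warning_signals(counts))


# Made-up readings for two drives, purely for illustration.
fleet = {
    "drive-a": {"scan_errors": 0, "reallocation_count": 0},
    "drive-b": {"scan_errors": 2, "reallocation_count": 0, "probational_count": 1},
}

for drive, counts in fleet.items():
    if should_retire(counts):
        print(drive, "flagged:", ", ".join(warning_signals(counts)))
    else:
        print(drive, "no strong signal (this does not mean the drive is safe)")
```

A clean result is not a clean bill of health: as the study found, more than half of the failed drives showed none of these four signals, so a check like this complements backups and redundancy rather than replacing them.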

Industry experts have different perspectives on the Google and Carnegie Mellon studies. For example, Aloke Guha, chief technology officer of storage solution provider COPAN Systems, points out that the results of both papers are based on drives that are always spinning and thus do not provide insights into how AFRs depend on actual power-on-hours (POHs).

'It's no surprise that drives meant to be used in low-duty-cycle or low POH are exhibiting high failure rates when used in transactional storage systems,' Guha said. He suggests that drives using Massive Array of Idle Disks (MAID) technologies experience lower AFRs, starting at about 0.22 percent. COPAN is the exclusive provider of MAID systems.

Similarly, David Lethe, president of diagnostic software vendor SANtools, suggests that proper burn-in testing can weed out bad disks early. 'Some storage and subsystem manufacturers invest millions of dollars developing appropriate testing methodology and algorithms,' Lethe said. He also recommends monitoring SMART attributes continuously, not just on start-up, as many BIOSes do.

Dollar wise

The lessons from both studies, and from industry experts, translate readily into strategies for acquiring and managing hard drives, especially for large agency installations. First, administrators should select the proper drive for the task, not the cheapest drive. 'It costs more (considering downtime, replacement costs, labor and data lost) to buy consumer-class disks and use them for demanding tasks,' Lethe said. 'You get what you pay for.'

Because stated hard drive MTTFs may not accurately reflect the replacement rates of hard drives, administrators may need to requisition more drives than they normally would. Administrators should investigate the wording of current supply arrangements: Are they tied to specific numbers of hard drives or levels of actual storage?

For large-scale hard drive use, agencies may want to formulate burn-in strategies to eliminate dud drives. Because burn-in testing is complex, it may be simpler to work with vendors on their burn-in procedures.

In addition, administrators should not expect that hard drive failures will peak in the newest and oldest drives, with ages in the middle exhibiting lower failure rates, a pattern known as the bathtub curve. Instead, they should anticipate that failure rates will increase with age and schedule spare drives in advance to replace failing drives.
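
One way to turn that advice into a number is to multiply the drive count in each age bracket by an assumed annual failure rate and add up the results. The Python sketch below does this for a hypothetical fleet. The rates for the youngest and oldest brackets echo the Google figures cited earlier in this article (1.7 percent and a floor of 8.6 percent); the middle rate and all the drive counts are placeholders we made up for illustration, so substitute your own data.

```python
def expected_replacements(fleet_by_age: dict[str, int],
                          annual_rate_by_age: dict[str, float]) -> float:
    """Expected number of drives to replace in the coming year."""
    return sum(count * annual_rate_by_age[age]
               for age, count in fleet_by_age.items())


# Hypothetical fleet: number of drives in each age bracket.
fleet_by_age = {
    "under 1 year": 2_000,
    "1 to 3 years": 5_000,
    "3 years or more": 3_000,
}

# Assumed annual failure rates. The first and last values echo the Google
# figures cited in the article; the middle value is a placeholder, not a
# measurement.
annual_rate_by_age = {
    "under 1 year": 0.017,
    "1 to 3 years": 0.05,
    "3 years or more": 0.086,
}

spares = expected_replacements(fleet_by_age, annual_rate_by_age)
print(f"Plan for roughly {spares:.0f} replacement drives next year")
```

Because the rate rises with age in this model, the estimate grows every year the fleet is kept in service, which is the practical consequence of abandoning the bathtub-curve assumption.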

Because several SMART attributes seem to have some predictive powers, administrators should ensure that they are monitoring those attributes. This may require configuring BIOS, operating system, Simple Network Management Protocol reporting, network and management software to pass the necessary indicators along. SMART monitoring should be continuous.

Administrators should take warnings seriously. Hard drives that give a clue to a possible failure should be retired and their data transferred to other devices. 'Strategies for dealing with hard drive failure include moving data, replicating data, masking failures by diverting accesses' and deploying a Redundant Array of Independent Disks, Pinheiro said.

Although most hard drive failures may have no prior warning from SMART, administrators should be on the lookout for trends in their population of hard drives. 'It is important to not only understand the kind of drive being used but the system or environment in which it was placed and its workload,' Szabados said. It may be that your hard drive failures involve the characteristics of your local installation: power, workload, temperature or other conditions.

Regardless of the reason for hard drive failure, administrators should be prepared to restore data from backup and resume operations. No predictive mechanism is ever going to replace the insurance these simple systems can provide.

Hard drive research will continue, and it will be relevant to systems administrators. Hard drive manufacturers may respond to such research with new predictive attributes, different estimates of MTTF or new strategies for dealing with hard drive failure.

Regardless of where the industry goes, agencies should evolve their own strategies for dealing with hard drive failure, and keep the faith.
