Replacing a RAID Drive

By Sergey Nosov

March 3, 2013

If you are new to RAID technology and have just configured your redundant hard drive array, you may be wondering what happens when a drive in the array fails. What should you do next, and what is the appropriate procedure for replacing failed components and repairing the array? If so, this article is for you.

First of all, in this article we will concentrate on hardware RAID controllers, and we will assume that plug-and-play of storage devices is supported, as it is with most modern RAID controllers running under a Windows Server operating system. We will also stay away from vendor-specific terms and try to use general language that should be easy for you to translate into the user interface items of your particular storage controller. As always, studying the vendor’s documentation is highly recommended.

Second, what we are talking about here is the storage subsystem of a single server, such as a typical dedicated web server. Strategies for large SAN/NAS solutions will be somewhat different.

So, you have your server. If it is a 1U server, you may only have space for as few as three hard drives there. A possible configuration is then to create a RAID 1 mirrored array with two hard drives, and use the third hard drive for automated daily incremental backups. When there are no vacant storage bays, your options are somewhat limited. For the most part, you can only remove a failed hard drive, and put a new one in its place.

If you step up to a 2U server, you suddenly get much more freedom with configuration options, with ongoing maintenance, and with how you replace your hard drives. A typical modern 2U server has six or eight hot-swap drive bays. When spare slots are available, I am a big proponent of the hot-spare feature that many modern storage controllers support.

A hot-spare is a hard drive that is normally unused. The storage controller keeps hot-spares in a low-power state and spins one of them up when it is needed to replace a failed drive in a redundant array in the system. In other words, a hot-spare is a standby drive that automatically takes the place of a drive that has fallen out of an array. After the substitution, the array is rebuilt to a healthy state with no human intervention.
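The hot-spare behavior described above can be sketched as a toy model. This is only an illustration, not how any particular controller is implemented; the class name, the state labels, and the drive names are all made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Array:
    """A toy redundant array: member drive names plus idle hot-spares."""
    members: list
    hot_spares: list = field(default_factory=list)
    state: str = "healthy"

    def fail_drive(self, name):
        """Drop a failed member and pull in a hot-spare if one is available."""
        self.members.remove(name)
        if self.hot_spares:
            spare = self.hot_spares.pop(0)    # the controller spins the spare up
            self.members.append(spare)
            self.state = "rebuilding"         # rebuild starts with no human action
        else:
            self.state = "needs replacement"  # someone has to bring a new drive

    def finish_rebuild(self):
        """Mark a running rebuild as complete."""
        if self.state == "rebuilding":
            self.state = "healthy"


array = Array(members=["disk0", "disk1"], hot_spares=["disk2"])
array.fail_drive("disk1")
print(array.members, array.state)  # ['disk0', 'disk2'] rebuilding
array.finish_rebuild()
print(array.state)                 # healthy
```

Without a hot-spare, the same failure leaves the model in the "needs replacement" state until a person intervenes.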

Some administrators argue against hot-spares, saying that the first step before an array is rebuilt should be to perform a backup. The reasoning is that rebuilding may place more stress on an array in critical state than a backup does, so an automatic rebuild may kill the array before there is a chance to make a good fresh backup. In my opinion, if this level of protection is important, you should look into redundant array levels, such as RAID 6, that do not go into critical state with the loss of a single hard drive.

A redundant array is in critical state when the loss of one more hard drive will compromise data.
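As a minimal sketch of this definition, the function below classifies an array from its RAID level and the number of drives it has lost. The fault-tolerance figures are the standard ones for these levels, with RAID 1 assumed to be a two-drive mirror. The function name and the "degraded" label (redundancy reduced but not yet critical) are my own for the example.

```python
# How many drive losses each level survives. RAID 1 is assumed to be a
# two-drive mirror; RAID 10 is left out because its tolerance depends on
# which drives fail.
FAULT_TOLERANCE = {"RAID 1": 1, "RAID 5": 1, "RAID 6": 2}


def array_state(level, failed_drives):
    """Classify an array by how close it is to losing data."""
    tolerance = FAULT_TOLERANCE[level]
    if failed_drives > tolerance:
        return "failed"      # data is already compromised
    if failed_drives == tolerance:
        return "critical"    # one more loss compromises data
    if failed_drives > 0:
        return "degraded"    # redundancy reduced, but not yet critical
    return "healthy"


print(array_state("RAID 1", 1))  # critical
print(array_state("RAID 6", 1))  # degraded
print(array_state("RAID 6", 2))  # critical
```

This is why RAID 6 answers the objection to hot-spares: after one failure it still has a drive's worth of redundancy left while it rebuilds.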

So, you received an alert from your server that its storage array is in critical state. What should you do?

Modern hard drives are resilient creatures. Just because the storage controller decided to kick a hard drive out of the array, it does not necessarily mean that the hard drive is as dead as a brick. If you force a rebuild onto the failed hard drive, it may work for a long time until another failure rears its ugly head. Whether you should do that depends on a few factors.

One such factor is rebuild time. On low to mid-level storage controllers, rebuilding even a relatively small (300 GB) and fast (15k RPM, SAS) drive can take as long as eight hours. So, if you can grab a replacement hard drive from your storage cabinet, or from a friendly nearby computer store on the way to a datacenter that is a few minutes away, then the answer is no. Go replace the failed hard drive with a new one, and then start the rebuild.
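To put the eight-hour figure in perspective, here is a small back-of-the-envelope calculation. It uses decimal units (1 GB = 1000 MB) and assumes, purely for illustration, that the rebuild rate stays constant and scales linearly with capacity. The 1000 GB drive is a hypothetical example, and the function names are my own.

```python
def effective_rebuild_rate_mb_s(capacity_gb, hours):
    """Average rebuild rate in MB/s for a drive rebuilt in the given time."""
    return capacity_gb * 1000 / (hours * 3600)


def rebuild_hours(capacity_gb, rate_mb_s):
    """Hours needed to rebuild a drive at a constant rate."""
    return capacity_gb * 1000 / rate_mb_s / 3600


# The figures from the text: 300 GB rebuilt in eight hours.
rate = effective_rebuild_rate_mb_s(300, 8)
print(round(rate, 1))                       # 10.4

# A hypothetical 1000 GB drive at the same effective rate.
print(round(rebuild_hours(1000, rate), 1))  # 26.7
```

An effective rate of roughly 10 MB/s is far below what such a drive can sustain on its own, which is why the rebuild window is governed mostly by the controller and by the load on the server rather than by the drive.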

If, on the other hand, it will be days or weeks before the server can be tended, forcing the failed drive back into the array may be an appropriate interim measure. It is important to try to understand why the hard drive was forced off-line. Monitor temperatures, run diagnostics if you can, and make sure that with hardware RAID controllers you use only compatible hard drives that implement the time-limited error recovery (TLER), error recovery control (ERC), or command completion time limit (CCTL) feature.

Using desktop-grade drives without TLER/ERC/CCTL may cause you to lose drives out of the array, because they go into recovery or maintenance mode and do not answer the storage controller's requests before its timeout expires.
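The timing problem can be reduced to a single comparison, sketched below. All of the numbers are illustrative assumptions rather than the specifications of any real drive or controller; check your vendor's documentation for actual values.

```python
def drive_stays_in_array(recovery_seconds, controller_timeout_seconds):
    """True if the drive answers before the controller gives up on it."""
    return recovery_seconds < controller_timeout_seconds


CONTROLLER_TIMEOUT = 8  # seconds; an illustrative value, not a vendor figure

# A drive that caps its own error recovery (TLER/ERC/CCTL) answers in time.
print(drive_stays_in_array(7, CONTROLLER_TIMEOUT))   # True

# A drive that retries a bad sector for a minute gets dropped.
print(drive_stays_in_array(60, CONTROLLER_TIMEOUT))  # False
```

The drive in the second case may be perfectly usable once it finishes its recovery, but from the controller's point of view it simply stopped responding.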

But enough of that; let us start swapping drives. The first step is to slow down. Pulling the wrong drive on a critical array can be disastrous for your data.

Run storage management software to verify the state of your logical arrays and physical disks.

If a disk is in failed state, you can remove it and replace it with a new one. Make sure that you are removing the correct disk. The numbering of disks in enclosures is not standard. Disk 0 can be the one in the lower left corner, with Disk 1 to the right of it. It can also be the one in the top left corner, with Disk 1 underneath. Or your server can use some other scheme altogether. Hot-swap bays usually come with numbered stickers; use them.
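To see how easily the two numbering schemes mentioned above point at different bays, consider this small sketch. The enclosure size (two rows of four bays), the scheme names, and the way numbering continues past the first row or column are assumptions made for the example; your enclosure may differ.

```python
def bay_position(disk, rows, cols, scheme):
    """Return (row, col) of a disk, with row 0 at the top and col 0 at the left."""
    if not 0 <= disk < rows * cols:
        raise ValueError("disk number outside the enclosure")
    if scheme == "bottom-left, across":
        # Disk 0 in the lower left corner, Disk 1 to the right of it.
        return rows - 1 - disk // cols, disk % cols
    if scheme == "top-left, down":
        # Disk 0 in the top left corner, Disk 1 underneath it.
        return disk % rows, disk // rows
    raise ValueError("unknown scheme")


for scheme in ("bottom-left, across", "top-left, down"):
    print(scheme, bay_position(1, rows=2, cols=4, scheme=scheme))
# bottom-left, across (1, 1)
# top-left, down (1, 0)
```

The same "Disk 1" sits in two different physical bays depending on the scheme, which is exactly how the wrong drive gets pulled.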

The storage controller may offer you help by marking the failed drive in some way, for example with an indicator light on its bay.

As an extra precaution, consider writing down the serial number of the failed hard drive from the storage management software, then shutting the server down, removing the hard drive, and verifying the serial number on the hard drive's label before turning the server back on.
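If you keep the serial number in a script or an inventory file rather than on paper, a comparison along these lines can help. It is a minimal sketch; the decision to ignore case, spaces, and hyphens is my assumption about how software and printed labels may differ, and the serial numbers shown are made up.

```python
def same_serial(reported, label):
    """Compare two serial numbers, ignoring case, spaces, and hyphens."""
    def normalize(serial):
        return "".join(ch for ch in serial.upper() if ch.isalnum())
    return normalize(reported) == normalize(label)


# Hypothetical serial numbers, for illustration only.
print(same_serial(" z1x-4ab7 ", "Z1X4AB7"))  # True
print(same_serial("Z1X4AB7", "Z1X4AB1"))     # False
```

If the comparison fails, put the drive back and check the bay numbering again before touching anything else.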

That takes care of replacing a failed disk. But what if you want to replace a disk that is still a functioning part of a redundant array? The most radical school of thought on this subject is “you paid for this fancy RAID controller, may as well use it.” Subscribers to this theory would not blink before simply yanking a working drive out of the array and relying on the storage controller to keep your data safe and to rebuild the array once the replacement hard drive is installed.

What you have to realize, though, is that your storage system is a concert of multiple active elements built by different vendors. There are hard drives that can have multiple firmware versions, there is the backplane controller, there is the storage controller that may or may not have the latest firmware, and then there are the operating system drivers. Yes, all this cacophony is designed to keep your data safe, but how well has your particular combination of microcode been tested to work in the exact state that the system may be in at the precise moment you pull the active hard drive out?

This sort of testing is a good exercise for servers in pre-deployment. For production servers, I recommend a gentler approach.

If there are empty hard drive bays in your system, you may opt to install a replacement hard drive before removing the old one. Once the new hard drive is in the system, you can use the management software to tell the storage controller to replace the old hard drive with the new one by copying the data. Just be aware that the replacement process may take a long time, hours for magnetic drives. As such, you may want to schedule removal of the old drive for your next data center visit or make other appropriate arrangements.

If, on the other hand, you are taking the hard drive out of the array without replacing it as described in the previous paragraph, you can tell the storage controller to take the drive off-line. Depending on the type of redundant array you use, this may make the array critical. That is, if you now lose another hard drive, or accidentally remove the wrong hard drive, you will lose your data.

The final step, before physically pulling a hard drive from the enclosure, is to tell the storage controller to prepare the hard drive for removal. This will erase controller information from the hard drive and make the drive ready for use with another controller.
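The replacement paths described in this article can be summarized in a short sketch. It is one reading of the steps above, not a vendor procedure; the function name and the step wording are my own, and your management software will use its own terms.

```python
def replacement_steps(drive_failed, spare_bay_available):
    """Return an ordered list of steps for swapping one drive."""
    steps = ["verify array and disk state in the management software"]
    if drive_failed:
        steps += [
            "confirm which bay holds the failed disk",
            "remove the failed disk",
            "insert the new disk",
            "start the rebuild",
        ]
    elif spare_bay_available:
        steps += [
            "insert the new disk in an empty bay",
            "tell the controller to replace the old disk by copying",
            "wait for the copy to finish",
            "prepare the old disk for removal",
            "remove the old disk",
        ]
    else:
        steps += [
            "take the old disk off-line (the array may become critical)",
            "prepare the old disk for removal",
            "remove the old disk",
            "insert the new disk",
            "start the rebuild",
        ]
    return steps


for step in replacement_steps(drive_failed=False, spare_bay_available=True):
    print("-", step)
```

Only the path with a spare bay never takes a drive out of the array before its replacement is in place.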

As you can see, having extra hot-swap hard drive bays available can make the process of replacing hard drives safer and more controllable.

And if you are in the planning stages for a new server, consider not only how many hard drive bays you need for the server's daily operations but also the following:

Any future storage extension
Your procedures for hard drive replacements or upgrades
Employing hot-spare hard drives

If you are new to the architecture, dedicate some time before server deployment to emulating a hard drive failure and testing hard drive replacement, so that you are familiar with the mitigation process and with your storage controller software.