In a past article I wrote about setting up a software-based RAID1 on Linux. After less than two years one of the disks failed, so I had to replace it – here are my experiences with this procedure.
Symptoms
After a forced reboot (power off), the server did not boot. Rebooting manually, I found that the problem was with the RAID – the system complained about a RAID disk failure and offered to start with the RAID1 array in degraded mode. If this was not chosen, it dropped into a maintenance shell.
Identifying failed disk
Boot with the degraded RAID1 array and check the current RAID status:
cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sdb1[0]
      2930133824 blocks super 1.2 [2/1] [U_]

unused devices: <none>
Clearly the second disk is missing from the RAID1 array; next you should check what's wrong with it:
- check logs and Linux kernel messages, for example:
dmesg | grep sdc
- check the SMART status of the disk (also note down the disk serial for later):
sudo smartctl -a /dev/sdc
and look at attributes like "Raw_Read_Error_Rate" or "Offline_Uncorrectable" for non-zero values
- optionally run a self-test on the disk (takes several hours); a combined sketch of these checks follows this list:
sudo smartctl --test=long /dev/sdc
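Putting it together – a minimal sketch, assuming the suspect disk is /dev/sdc and adding a couple of other commonly watched SMART attributes (Reallocated_Sector_Ct, Current_Pending_Sector) beyond the two mentioned above:

# kernel messages mentioning the suspect disk (I/O errors, resets, link problems)
dmesg | grep sdc

# overall SMART health verdict and the attributes that usually reveal media problems
sudo smartctl -H /dev/sdc
sudo smartctl -A /dev/sdc | grep -E 'Raw_Read_Error_Rate|Offline_Uncorrectable|Reallocated_Sector_Ct|Current_Pending_Sector'

# start the long self-test, then read back its result once it has finished
sudo smartctl --test=long /dev/sdc
sudo smartctl -l selftest /dev/sdc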
Remove failed disk
After booting with the degraded RAID1 array, the problematic disk should already be removed from the RAID group. Check it with:
sudo mdadm --detail /dev/md0
If the problematic disk is still in the RAID group, you have to mark it as failed and remove it from the group:
sudo mdadm --manage /dev/md0 --fail /dev/sdc1
sudo mdadm --manage /dev/md0 --remove /dev/sdc1
Then the disk can be physically removed from the server (double-check the disk serial to make sure you are removing the correct disk).
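To be sure which physical disk to pull, the serial can be read back from the device before shutting down – a quick sketch, assuming the failed disk is still visible as /dev/sdc:

# serial number as reported by the disk itself
sudo smartctl -i /dev/sdc | grep -i serial

# the by-id symlinks also carry the serial in their names
ls -l /dev/disk/by-id/ | grep sdc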
Install replacement disk
Install the new disk and boot with the degraded RAID1 array. Create the same partitions on the new disk (install the gdisk package):
sgdisk -R /dev/sdc /dev/sdb   # careful - not to switch names: this copies the partition table from /dev/sdb to /dev/sdc
sgdisk -G /dev/sdc            # create new GUIDs for the new disk and its partitions
Afterwards you can also change the new partition's name (with parted).
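To double-check that the partition tables of both disks now match (and to rename the partition, if you want to), something along these lines should work – the name "raid1" here is only an example:

# print and compare the partition tables of the old and the new disk
sgdisk -p /dev/sdb
sgdisk -p /dev/sdc

# rename partition 1 on the new disk (GPT partition name, example value)
sudo parted /dev/sdc name 1 raid1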
Then add the new partition to the RAID1 array:
mdadm --manage /dev/md0 --add /dev/sdc1
And check with cat /proc/mdstat that the RAID1 array is synchronizing:
cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sdc1[2] sdb1[0]
      2930133824 blocks super 1.2 [2/1] [U_]
      [>....................]  recovery =  0.1% (3767680/2930133824) finish=349.5min speed=139543K/sec
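The resync takes a while; to follow its progress without re-running the command by hand, a simple watch does the job (just a convenience, not required):

# refresh the raid status every 10 seconds until recovery reaches 100%
watch -n 10 cat /proc/mdstat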
or mdadm --detail:
sudo mdadm --detail /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Fri Mar  8 12:21:09 2013
     Raid Level : raid1
     Array Size : 2930133824 (2794.39 GiB 3000.46 GB)
  Used Dev Size : 2930133824 (2794.39 GiB 3000.46 GB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

    Update Time : Mon Oct 20 21:49:33 2014
          State : clean, degraded, recovering
 Active Devices : 1
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 1

 Rebuild Status : 1% complete

           Name : nas:0  (local to host nas)
           UUID : f0135e24:8557a5bb:c13ac7a7:9a5cfd6a
         Events : 7014

    Number   Major   Minor   RaidDevice State
       0       8       17        0      active sync   /dev/sdb1
       2       8       33        1      spare rebuilding   /dev/sdc1
After synchronization everything should look normal again:
cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sdc1[2] sdb1[0]
      2930133824 blocks super 1.2 [2/2] [UU]
sudo mdadm --detail /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Fri Mar  8 12:21:09 2013
     Raid Level : raid1
     Array Size : 2930133824 (2794.39 GiB 3000.46 GB)
  Used Dev Size : 2930133824 (2794.39 GiB 3000.46 GB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

    Update Time : Tue Oct 21 06:54:02 2014
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           Name : nas:0  (local to host nas)
           UUID : f0135e24:8557a5bb:c13ac7a7:9a5cfd6a
         Events : 8531

    Number   Major   Minor   RaidDevice State
       0       8       17        0      active sync   /dev/sdb1
       2       8       33        1      active sync   /dev/sdc1