In a past article I wrote about setting up a software-based RAID1 on Linux. After less than two years one of the disks failed, so I had to replace it – here are my experiences with this procedure.
Symptoms
After a forced reboot (power off), the server did not boot. Rebooting manually, I found that the problem was with the RAID – the system complained about a RAID disk failure and offered to start with the RAID1 array in degraded mode. If this was not chosen, it dropped into a maintenance shell.
Identifying failed disk
Boot with the degraded RAID1 array and check the current RAID status:
cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sdb1[0]
      2930133824 blocks super 1.2 [2/1] [U_]

unused devices: <none>
Clearly the second disk is missing from the RAID1 array; next you should check what's wrong with it:
- check logs and Linux kernel messages, for example:
dmesg | grep sdc
- check the SMART status of the disk (also note down the disk serial for later):
sudo smartctl -a /dev/sdc
and look at attributes like "Raw_Read_Error_Rate" or "Offline_Uncorrectable" for non-zero values
- optionally run a self-test on the disk (takes several hours); a combined sketch of these checks follows this list:
sudo smartctl --test=long /dev/sdc
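Putting it together – a minimal sketch, assuming the suspect disk is /dev/sdc and adding a couple of other commonly watched SMART attributes (Reallocated_Sector_Ct, Current_Pending_Sector) beyond the two mentioned above:

# kernel messages mentioning the suspect disk (I/O errors, resets, link problems)
dmesg | grep sdc

# overall SMART health verdict and the attributes that usually reveal media problems
sudo smartctl -H /dev/sdc
sudo smartctl -A /dev/sdc | grep -E 'Raw_Read_Error_Rate|Offline_Uncorrectable|Reallocated_Sector_Ct|Current_Pending_Sector'

# start the long self-test, then read back its result once it has finished
sudo smartctl --test=long /dev/sdc
sudo smartctl -l selftest /dev/sdc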
Remove failed disk
After booting with the degraded RAID1 array, the problematic disk should already be removed from the RAID group. Check it with:
sudo mdadm --detail /dev/md0
If the problematic disk is still in the RAID group, you have to mark it as failed and remove it from the group:
sudo mdadm --manage /dev/md0 --fail /dev/sdc1
sudo mdadm --manage /dev/md0 --remove /dev/sdc1
Then the disk can be physically removed from the server (double-check the disk serial to make sure you are removing the correct disk).
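To be sure which physical disk to pull, the serial can be read back from the device before shutting down – a quick sketch, assuming the failed disk is still visible as /dev/sdc:

# serial number as reported by the disk itself
sudo smartctl -i /dev/sdc | grep -i serial

# the by-id symlinks also carry the serial in their names
ls -l /dev/disk/by-id/ | grep sdc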
Install replacement disk
Install the new disk and boot with the degraded RAID1 array. Create the same partitions on the new disk (install the gdisk package):
sgdisk -R /dev/sdc /dev/sdb   # careful - not to switch names: this copies the partition table from /dev/sdb to /dev/sdc
sgdisk -G /dev/sdc            # create new GUIDs for the new disk and its partitions
Afterwards you can also change the new partition's name (with parted).
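To double-check that the partition tables of both disks now match (and to rename the partition, if you want to), something along these lines should work – the name "raid1" here is only an example:

# print and compare the partition tables of the old and the new disk
sgdisk -p /dev/sdb
sgdisk -p /dev/sdc

# rename partition 1 on the new disk (GPT partition name, example value)
sudo parted /dev/sdc name 1 raid1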
Then add the new partition to the RAID1 array:
mdadm --manage /dev/md0 --add /dev/sdc1
And check with cat /proc/mdstat that the RAID1 array is synchronizing:
cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sdc1[2] sdb1[0]
      2930133824 blocks super 1.2 [2/1] [U_]
      [>....................]  recovery =  0.1% (3767680/2930133824) finish=349.5min speed=139543K/sec
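The resync takes a while; to follow its progress without re-running the command by hand, a simple watch does the job (just a convenience, not required):

# refresh the raid status every 10 seconds until recovery reaches 100%
watch -n 10 cat /proc/mdstat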
or mdadm --detail:
sudo mdadm --detail /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Fri Mar  8 12:21:09 2013
     Raid Level : raid1
     Array Size : 2930133824 (2794.39 GiB 3000.46 GB)
  Used Dev Size : 2930133824 (2794.39 GiB 3000.46 GB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

    Update Time : Mon Oct 20 21:49:33 2014
          State : clean, degraded, recovering
 Active Devices : 1
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 1

 Rebuild Status : 1% complete

           Name : nas:0  (local to host nas)
           UUID : f0135e24:8557a5bb:c13ac7a7:9a5cfd6a
         Events : 7014

    Number   Major   Minor   RaidDevice State
       0       8       17        0      active sync   /dev/sdb1
       2       8       33        1      spare rebuilding   /dev/sdc1
After synchronization everything should look normal again:
cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sdc1[2] sdb1[0]
      2930133824 blocks super 1.2 [2/2] [UU]
sudo mdadm --detail /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Fri Mar  8 12:21:09 2013
     Raid Level : raid1
     Array Size : 2930133824 (2794.39 GiB 3000.46 GB)
  Used Dev Size : 2930133824 (2794.39 GiB 3000.46 GB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

    Update Time : Tue Oct 21 06:54:02 2014
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           Name : nas:0  (local to host nas)
           UUID : f0135e24:8557a5bb:c13ac7a7:9a5cfd6a
         Events : 8531

    Number   Major   Minor   RaidDevice State
       0       8       17        0      active sync   /dev/sdb1
       2       8       33        1      active sync   /dev/sdc1