In a past article I wrote about setting up software-based RAID1 on Linux. After less than two years one of the disks failed, so I had to replace it. Here are my experiences with the procedure.
Symptoms
After a forced reboot (power off), the server did not boot. Rebooting manually, I found that the problem was with the RAID: the system complained about a RAID disk failure and offered to start with the RAID1 array in degraded mode. If this was not chosen, it dropped into a maintenance shell.
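If the boot prompt is missed and the system ends up in the maintenance (initramfs) shell, the degraded array can usually be started by hand. A minimal sketch, assuming the array is /dev/md0 as in the rest of this article:
mdadm --assemble --scan --run # assemble and start arrays even if a member is missing
mdadm --run /dev/md0 # or force-start the array if it is already assembled but inactive
exit # continue booting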
Identifying failed disk
Boot with the degraded RAID1 and check the current RAID status:
cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sdb1[0]
2930133824 blocks super 1.2 [2/1] [U_]
unused devices: <none>
Clearly the second disk is missing from the RAID1 array ([2/1] [U_] means only one of the two members is up). Next, check what is wrong with the second disk:
- check the logs and Linux kernel messages, e.g.
dmesg | grep sdc
- check the SMART status of the disk:
sudo smartctl -a /dev/sdc
(also note down the disk serial number for later)
and look at attributes like "Raw_Read_Error_Rate" or "Offline_Uncorrectable" for non-zero values (a short sketch that summarizes the SMART status of both disks follows after this list)
- optionally run a long self-test on the disk:
sudo smartctl --test=long /dev/sdc
(takes several hours)
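To get a quick overview of both disks before digging into individual attributes, a small loop like the following can help. This is only a sketch; it assumes smartmontools is installed and that the RAID members are /dev/sdb and /dev/sdc as in this article:
for d in /dev/sdb /dev/sdc; do
  echo "== $d =="
  sudo smartctl -i "$d" | grep -i serial # serial number, to identify the physical disk later
  sudo smartctl -H "$d" # overall SMART health verdict
  sudo smartctl -A "$d" | grep -Ei 'Raw_Read_Error_Rate|Reallocated_Sector_Ct|Offline_Uncorrectable'
done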
Remove failed disk
After booting with the degraded RAID1, the problematic disk should already be removed from the RAID group. Check it with:
sudo mdadm --detail /dev/md0
If the problematic disk is still in the RAID group, mark it as failed and remove it:
sudo mdadm --manage /dev/md0 --fail /dev/sdc1
sudo mdadm --manage /dev/md0 --remove /dev/sdc1
Then the disk can be physically removed from the server (double-check the disk serial number to make sure you are removing the correct disk; one way to list serials is shown below).
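To map device names to serial numbers from the running system, something like lsblk (from util-linux) can be used; smartctl -i prints the serial as well:
lsblk -d -o NAME,MODEL,SERIAL,SIZE # -d lists whole disks only, not partitions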
Install replacement disk
Install the new disk and boot with the degraded RAID1. Create the same partitions on the new disk (the sgdisk tool comes from the gdisk package):
sgdisk -R /dev/sdc /dev/sdb # careful not to switch names: this copies the partition table FROM /dev/sdb TO /dev/sdc
sgdisk -G /dev/sdc # randomize the disk and partition GUIDs on the new disk so they do not clash with the source
Afterwards you can also change the new partition's name (with parted).
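Before adding the new partition to the array, it is worth a quick sanity check that both disks now have matching layouts, for example:
sgdisk -p /dev/sdb # print partition table of the healthy disk
sgdisk -p /dev/sdc # print partition table of the new disk - sizes and types should match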
Then add the new partition to the RAID1 array:
sudo mdadm --manage /dev/md0 --add /dev/sdc1
Check with cat /proc/mdstat that the RAID1 array is synchronizing:
cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sdc1[2] sdb1[0]
2930133824 blocks super 1.2 [2/1] [U_]
[>....................] recovery = 0.1% (3767680/2930133824) finish=349.5min speed=139543K/sec
or with mdadm --detail:
sudo mdadm --detail /dev/md0
/dev/md0:
Version : 1.2
Creation Time : Fri Mar 8 12:21:09 2013
Raid Level : raid1
Array Size : 2930133824 (2794.39 GiB 3000.46 GB)
Used Dev Size : 2930133824 (2794.39 GiB 3000.46 GB)
Raid Devices : 2
Total Devices : 2
Persistence : Superblock is persistent
Update Time : Mon Oct 20 21:49:33 2014
State : clean, degraded, recovering
Active Devices : 1
Working Devices : 2
Failed Devices : 0
Spare Devices : 1
Rebuild Status : 1% complete
Name : nas:0 (local to host nas)
UUID : f0135e24:8557a5bb:c13ac7a7:9a5cfd6a
Events : 7014
Number Major Minor RaidDevice State
0 8 17 0 active sync /dev/sdb1
2 8 33 1 spare rebuilding /dev/sdc1
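The rebuild of a disk this size takes several hours; to keep an eye on the progress without retyping the command, something like watch (from procps) can be used:
watch -n 10 cat /proc/mdstat # refresh the status every 10 seconds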
After synchronization everything should look normal again:
cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sdc1[2] sdb1[0]
2930133824 blocks super 1.2 [2/2] [UU]
sudo mdadm --detail /dev/md0
/dev/md0:
Version : 1.2
Creation Time : Fri Mar 8 12:21:09 2013
Raid Level : raid1
Array Size : 2930133824 (2794.39 GiB 3000.46 GB)
Used Dev Size : 2930133824 (2794.39 GiB 3000.46 GB)
Raid Devices : 2
Total Devices : 2
Persistence : Superblock is persistent
Update Time : Tue Oct 21 06:54:02 2014
State : clean
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0
Name : nas:0 (local to host nas)
UUID : f0135e24:8557a5bb:c13ac7a7:9a5cfd6a
Events : 8531
Number Major Minor RaidDevice State
0 8 17 0 active sync /dev/sdb1
2 8 33 1 active sync /dev/sdc1
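Optionally, once the array is back in sync, a consistency check can be triggered to verify that both copies really match. This is a generic md feature rather than something specific to this setup; a sketch, assuming the array is md0:
echo check | sudo tee /sys/block/md0/md/sync_action # start a read-and-compare pass over the whole array
cat /proc/mdstat # shows the check progress
cat /sys/block/md0/md/mismatch_cnt # should normally be 0 after the check finishes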