Replacing failed drives in an md array

In a Linux software RAID (md) setup you may have to replace failed drives. As usual, backups are still needed, but plain hardware failures can be absorbed by a redundant RAID level.

  1. Check the array status with cat /proc/mdstat:

    xm3:~# cat /proc/mdstat
    Personalities : [raid1] 
    md0 : active raid1 sdb[0] sdc[2]
          67042304 blocks super 1.2 [2/2] [UU]
    
    unused devices: <none>
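
    If you want to check for a degraded array from a script, a crude check is to look for an underscore in the [UU] field, which marks a missing or failed member:

      grep -q '\[.*_.*\]' /proc/mdstat && echo "md array degraded"
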
  2. Check the serial number of your drives to make sure you are replacing the right one: lsblk -do +VENDOR,MODEL,SERIAL:
    NAME  MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS VENDOR MODEL                 SERIAL
    loop0   7:0    0 108.1M  1 loop /.modloop                
    sda   253:0    0   232G  0 disk             ATA    WDC_WD2500AVVS-73M8B0 WD-WCAV94350152
    sdb   253:16   0   232G  0 disk             ATA    WDC_WD2500AVVS-73M8B0 WD-WCAV94350152
    sdc   253:32   0   232G  0 disk             ATA    WDC_WD2500AVVS-73M8B0 WD-WCAV94350152
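
    If lsblk does not show the serial on your system, you can read it straight from the drive with smartctl (assuming smartmontools is installed; /dev/sdc is just an example):

      smartctl -i /dev/sdc | grep -i serial
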
  3. Remove the faulty drive

    • Hot plugging: if your hardware supports it, you can do this while the system is running.
      • Remove the drive from the system.
      • Run:
        mdadm --manage /dev/md0 --remove failed

        Note that the failed keyword worked for me; other examples mention detached instead.

    • Warm plugging: some hardware requires you to run a few commands first:
      • Mark the drive as faulty:
        mdadm --manage /dev/md0 --fail /dev/sdc
      • Remove the drive from the array:
        mdadm --manage /dev/md0 --remove /dev/sdc
      • Detach the drive from the kernel:
        echo 1 > /sys/block/sdc/device/delete
    • This is what mdstat looks like afterwards:

      xm3:~# cat /proc/mdstat 
      Personalities : [raid1] 
      md0 : active raid1 vdb[0]
            67042304 blocks super 1.2 [2/1] [U_]
      
       unused devices: <none>
      

      When removing the physical drive, if possible, unplug the power cable first and the data cable second.
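
    • The warm-plug sequence condensed into a small script, for convenience (a sketch; adjust the array and device names to your setup):

        FAILED=sdc                                      # the failed member (example name)
        mdadm --manage /dev/md0 --fail   /dev/$FAILED   # mark it as faulty
        mdadm --manage /dev/md0 --remove /dev/$FAILED   # drop it from the array
        echo 1 > /sys/block/$FAILED/device/delete       # detach it from the kernel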

  4. Insert the new drive. When possible, plug the data cable first and the power cable next. Wait 10-15 seconds.
  5. Run:
    # ask each SCSI/SATA host adapter to rescan its bus for new devices
    for file in /sys/class/scsi_host/*/scan; do
      echo "- - -" > "$file"
    done

    This forces a re-scan of the SCSI/SATA buses; it is not always needed.
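
    Afterwards, a quick way to confirm that the new disk showed up and got the device name you expect (output will vary):

      lsblk -do NAME,SIZE,MODEL,SERIAL
      dmesg | tail -n 20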

  6. Add the new drive to the array:
    mdadm --add /dev/md0 /dev/sdc
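
    If you are re-inserting the very same disk after a temporary disconnect, --re-add can be much faster than --add, since it only resyncs the blocks that changed, provided the array has a write-intent bitmap (a sketch):

      mdadm --manage /dev/md0 --re-add /dev/sdc
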
  7. Check that the configuration updated correctly:

    xm3:~# mdadm --detail /dev/md0
    /dev/md0:
               Version : 1.2
         Creation Time : Tue Nov  5 14:44:26 2024
            Raid Level : raid1
            Array Size : 67042304 (63.94 GiB 68.65 GB)
         Used Dev Size : 67042304 (63.94 GiB 68.65 GB)
          Raid Devices : 2
         Total Devices : 2
           Persistence : Superblock is persistent
    
           Update Time : Wed Nov 13 14:03:42 2024
                 State : clean, degraded, recovering 
        Active Devices : 1
       Working Devices : 2
        Failed Devices : 0
         Spare Devices : 1
    
    Consistency Policy : resync
    
        Rebuild Status : 0% complete
    
                  Name : xm3.virtual:0  (local to host xm3.virtual)
                  UUID : 95d17a6b:bee76241:c72a05d1:3cfd7d62
                Events : 28
    
        Number   Major   Minor   RaidDevice State
           0     253       16        0      active sync   /dev/sdb
           2     253       32        1      spare rebuilding   /dev/sdc

    You can monitor the rebuild process with:

    # watch cat /proc/mdstat
    Personalities : [raid1] 
    md0 : active raid1 vdc[2] vdb[0]
          67042304 blocks super 1.2 [2/1] [U_]
          [=========>...........]  recovery = 45.6% (30607360/67042304) finish=3.0min speed=200060K/sec
    
    unused devices: <none>
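
    If the rebuild is being throttled, you can raise the kernel's resync speed limits, which are expressed in KiB/s (the values below are examples; defaults vary by kernel and distribution):

      sysctl -w dev.raid.speed_limit_min=50000
      sysctl -w dev.raid.speed_limit_max=500000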

You should check the output of dmesg:

[  703.836264] md/raid1:md0: Disk failure on sdc, disabling device.
[  703.836264] md/raid1:md0: Operation continuing on 1 devices.
[ 1789.815521] md: recovery of RAID array md0
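
If you want to follow the kernel log live while swapping the drive, recent util-linux versions of dmesg can do that:

# -w waits for new messages, -T prints human-readable timestamps
dmesg -wT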

Similarly, you should check the health status of the drive with:

smartctl -a /dev/sdc
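
Beyond the full attribute dump, you can also run a short self-test on the new drive and check the overall health verdict (again assuming smartmontools is installed):

smartctl -t short /dev/sdc   # start a short self-test; check the result a few minutes later with -l selftest
smartctl -H /dev/sdc         # overall SMART health assessment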

Here I am using raw drives (without partitioning). If you are using partitions, make sure to recreate the partition layout on the new drive (with fdisk or sfdisk) before adding the partition to the array.
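
For example, a sketch that clones the partition table from the surviving disk to the new one and then adds the matching partition (device names are examples; double-check them, as this overwrites the target's partition table):

sfdisk -d /dev/sdb | sfdisk /dev/sdc   # copy the partition layout from sdb to sdc
mdadm --add /dev/md0 /dev/sdc1         # add the partition, not the whole disk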

If you have smartmontools installed and running, restart the smartd daemon so it does not keep warning about the drive you removed.
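
How you restart it depends on your init system; for example (service names may differ slightly between distributions):

systemctl restart smartd     # systemd
rc-service smartd restart    # OpenRC (e.g. Alpine)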
