NDS-4600 - SATA Drive Failures In Linux

Submitted by gpmidi on Tue, 12/01/2020 - 17:01

One issue I recently ran into with a failed SATA drive in one of my NDS-4600 units is that Linux repeatedly tries to recover the drive by resetting the bus. Each reset takes out a few other disks in the same group along with it. The resulting I/O timeouts cause problems for the Ceph OSDs using those disks.

It should be noted that only some types of disk failure cause this. As far as I can tell, the Linux kernel only resets the host bus in some failure cases, and I suspect the failed disk is what causes the errors on the other disks.

Once the "problem" disk is soft removed (via echo 1 > /sys/block/$DEVICE/device/delete), the other disks and the OSDs using them no longer have any problems.
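
As a rough sketch, the soft removal can be wrapped in a small shell function that checks the sysfs node exists before writing to it. The function name soft_remove_disk and the optional sysfs_root argument are my own inventions for illustration (the second argument only exists so the function can be tried against a scratch directory); the only real interface used is the device/delete node mentioned above. It needs to run as root on a real system.

```shell
#!/bin/sh
# Hypothetical helper: soft remove a disk by writing 1 to its sysfs delete node.
# Usage: soft_remove_disk DEVICE [SYSFS_ROOT]
#   DEVICE      kernel block device name, e.g. sdq (not /dev/sdq)
#   SYSFS_ROOT  assumed to be /sys; overridable only for dry runs
soft_remove_disk() {
    device="$1"
    sysfs_root="${2:-/sys}"

    if [ -z "$device" ]; then
        echo "usage: soft_remove_disk DEVICE [SYSFS_ROOT]" >&2
        return 2
    fi

    delete_node="$sysfs_root/block/$device/device/delete"

    # Refuse to continue if the device has no delete node at all.
    if [ ! -e "$delete_node" ]; then
        echo "no delete node for $device at $delete_node" >&2
        return 1
    fi

    # Same write as the one-liner: asks the kernel to drop the device.
    echo 1 > "$delete_node"
}

# Example (as root): soft_remove_disk sdq
```

Make sure the device name really is the failed disk before running it, since the kernel drops the device immediately and any OSD still using it will lose its disk.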

I suspect this only affects SATA disks and not SAS disks, given the high-availability (HA) nature of SAS. But that's only a suspicion.