NDS-4600 - SATA Drive Failures In Linux

One issue I've recently run into: when a SATA drive fails in one of my NDS-4600 units, Linux frequently tries to recover the drive by resetting the bus. This takes out a few other disks in the group with it, and the resulting IO timeouts cause problems for the Ceph OSDs using those disks.

Only some types of disk failure cause this. I think the Linux kernel only resets the host bus in some cases, and I suspect the errors on the other disks are caused by the failed disk.
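To check whether a given failure actually triggered resets, it can help to filter the kernel log for them. This is only a rough sketch: reset_lines is a made-up helper name, and matching on the word "reset" is an assumption, since the exact message text depends on the HBA driver in use.

```shell
# Filter kernel log text on stdin down to lines that mention a reset.
# The "reset" pattern is a guess; adjust it to whatever your driver logs.
reset_lines() {
    # "|| true" keeps the exit status clean when nothing matches.
    grep -i 'reset' || true
}

# Usage (either source of kernel messages should work):
#   dmesg -T | reset_lines
#   journalctl -k --since "1 hour ago" | reset_lines
```

Comparing the timestamps of those lines against the IO timeouts on the neighbouring disks should show whether the two line up.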

"error from slirp4netns while setting up port redirection: map[desc:bad request: add_hostfwd: slirp_add_hostfwd failed]"

I was getting this from podman on a CentOS 8 box: 

"error from slirp4netns while setting up port redirection: map[desc:bad request: add_hostfwd: slirp_add_hostfwd failed]"

It was fixed by killing all podman and /usr/bin/conmon processes belonging to the user I was running the commands as. Note: don't do that as root with killall unless you limit it to your own user.

The underlying cause may have been running out of file descriptors.
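The per-user cleanup described above could look something like the following. It is a sketch rather than the exact commands I ran: podman_cleanup_cmds is a hypothetical helper, and it only prints the pkill commands so they can be reviewed before anything is killed. pkill -u restricts matching to one user's processes, which is what keeps this safe to run as root.

```shell
# Print (not run) the pkill commands that would stop podman and conmon
# for a single user. -u limits matching to that user's processes and
# -x requires an exact process-name match.
podman_cleanup_cmds() {
    # Default to the current user when no user name is given.
    user="${1:-$(id -un)}"
    for name in podman conmon; do
        printf 'pkill -u %s -x %s\n' "$user" "$name"
    done
}

# Review the output first, then pipe it to sh to actually run it:
#   podman_cleanup_cmds "$(id -un)"
#   podman_cleanup_cmds "$(id -un)" | sh
```

Note that this stops every container that user is running, not just the one that failed to start.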

Ceph With Many OSDs

While setting up my Ceph cluster on a set of Dell R710s, one with 60 disks attached to it, I found that I needed to raise fs.aio-max-nr to around 1,000,000. SELinux also needed to be disabled. Once that was done the normal cephadm osd install worked great, even with 60 disks. 

$ cat /etc/sysctl.d/99-osd.conf

# For OSDs
fs.aio-max-nr=1000000
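To see how close a host is to the limit, the kernel exposes the current number of AIO contexts in /proc/sys/fs/aio-nr next to the limit in /proc/sys/fs/aio-max-nr. The helper below is a small sketch (aio_headroom is a made-up name) that reports the remaining headroom from those two values.

```shell
# Print how many more AIO requests can be reserved before hitting the limit.
# Arguments: current usage (fs.aio-nr) and the limit (fs.aio-max-nr).
aio_headroom() {
    used="$1"
    max="$2"
    echo $(( max - used ))
}

# Usage on a live host:
#   aio_headroom "$(cat /proc/sys/fs/aio-nr)" "$(cat /proc/sys/fs/aio-max-nr)"
#
# To load the file above without rebooting:
#   sudo sysctl --system
```

If the headroom is near zero while OSDs are being created, the limit is probably still too low for the number of disks.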
