A few of my Ceph nodes with 60-ish disks each were experiencing frequent reboots. Turns out kernel.nmi_watchdog was rebooting it due to disks holding it up under very high load. By turning it off via `echo "kernel.nmi_watchdog=0" > /etc/sysctl.d/99-watchdog.conf` the problem was solved. Although I suspect there are better ways to tune NMI Watchdog to fix this. I'm being lazy.
- Log in to post comments
Comments
Four Days Later
Four days later and zero issues!