Z Uber Move - Second Draft Of Plans

Submitted by gpmidi on Sat, 02/01/2020 - 14:45

Updated Plans!

First up, the door install is set for Tuesday, February 11th. Once the door is in, two of the libraries (the two working ones) will come inside. I'll probably end up using movers to handle getting the boxes, shelving, tool boxes, and other stuff out to the garage, although I'm not sure yet whether I'll try to get the libraries inside before, during, or after the movers come.

As for my media, I'm planning on using Ceph. The only trick there is that each of the R610s (R620s eventually) will need 30GB of RAM plus 1GB more for every TB of disk. That's 270GB per server assuming 8TB disks with two R610s/R620s per 60-bay DAS (30 disks x 8TB = 240GB, plus the 30GB base). The 60-bay DAS I'm using can split up the disks into either four 15-bay sets, two 30-bay sets, or one 60-bay set.

Since the R610s max out at 192GB, I've maxed four of them out. Eventually I'll get some R620s, move the RAM from the R610s to the R620s, and hook them up to the 60-bay DAS. Until then, each R610 with 192GB of RAM will handle 15 8TB drives plus either 15 2TB or 15 3TB drives. That'll give me four nodes with a good bit of space.
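As a rough check of the RAM rule of thumb above (30GB base plus 1GB per TB of disk), here is a minimal Python sketch. The function name and the layout labels are made up for illustration; the numbers come straight from the figures in this post.

```python
# Rule-of-thumb figures quoted in the post (not official Ceph sizing guidance).
BASE_RAM_GB = 30        # per-server baseline
RAM_PER_TB_GB = 1       # extra RAM per TB of disk
R610_MAX_RAM_GB = 192   # RAM ceiling of an R610


def ram_needed_gb(drive_sets):
    """Estimate RAM for one server; drive_sets is a list of (count, size_tb)."""
    total_tb = sum(count * size_tb for count, size_tb in drive_sets)
    return BASE_RAM_GB + RAM_PER_TB_GB * total_tb


# Hypothetical labels for the layouts mentioned in the post.
layouts = {
    "30 x 8TB": [(30, 8)],
    "15 x 8TB + 15 x 2TB": [(15, 8), (15, 2)],
    "15 x 8TB + 15 x 3TB": [(15, 8), (15, 3)],
}

for name, sets in layouts.items():
    need = ram_needed_gb(sets)
    fits = need <= R610_MAX_RAM_GB
    print(f"{name}: {need}GB needed, fits in {R610_MAX_RAM_GB}GB: {fits}")
```

By this rule the 2TB mix needs 180GB and the 3TB mix needs 195GB, so the 3TB mix lands a few GB over the R610 ceiling. Since the rule is only a rule of thumb, that may or may not matter in practice.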

At that point I'll probably store the main data set on the four nodes using erasure coding with an OSD-level (aka per-disk) failure domain using K=12, M=3. That'll let me survive three disk failures without data loss. However, I'll have to set the CRUSH rules so that it doesn't mind having two (or more) chunks on the same physical server. With 15 chunks spread over four servers, at least one server ends up holding four or more chunks of a given object, which is more than M=3 can cover. That means the system can't stay up if one of the four R610s/R620s dies. But a quick reboot should fix that. Or replacing the box ;)
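The host-failure trade-off above is just arithmetic, sketched here in Python. The function and key names are made up for illustration, and this only computes a lower bound; actual CRUSH placement can put even more chunks on one server.

```python
import math


def ec_summary(k, m, hosts):
    """Summarize an erasure-coded layout with a per-disk failure domain."""
    chunks = k + m
    # Pigeonhole bound: some host must hold at least this many chunks.
    busiest_host_min = math.ceil(chunks / hosts)
    return {
        "chunks": chunks,
        "usable_fraction": k / chunks,
        "min_chunks_on_busiest_host": busiest_host_min,
        # True when losing that host removes more chunks than M can cover.
        "host_loss_exceeds_m": busiest_host_min > m,
    }


summary = ec_summary(k=12, m=3, hosts=4)
for key, value in summary.items():
    print(f"{key}: {value}")
```

For K=12, M=3 on four servers this gives 15 chunks, 80% usable space, and at least four chunks on the busiest server, so a dead server takes the affected data offline until it comes back.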

There are a few options for which erasure code plugin to use. Since I'll have a K of 12, it makes sense to use a plugin that'll help keep restores in check, as Jerasure, ISA, and SHEC all need to read a chunk of data from 12 other OSDs/disks in order to recover one chunk. Both Locally Repairable and CLAY can help with this.

Locally Repairable works by storing extra local parity chunks so that a lost chunk can be rebuilt from a smaller group of OSDs. A super simple view of CLAY is that it reads part of a chunk from all of the good chunks, including "parity" chunks. With CLAY you also get to choose the underlying algorithm: Jerasure, ISA, or SHEC. In my case I'll probably opt for the default, Jerasure.
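To put a rough number on the CLAY savings, here is a small Python sketch. It assumes the repair-read formula given in the Ceph CLAY plugin documentation, as I understand it: with d helper OSDs, a single-chunk repair reads d / (k * (d - k + 1)) of the object's size, where d defaults to k + m - 1. The function name is made up for illustration.

```python
def clay_repair_read_fraction(k, m, d=None):
    """Fraction of an object's size read to repair one lost chunk with CLAY.

    Assumes the formula d / (k * (d - k + 1)) and the constraint
    k + 1 <= d <= k + m - 1, with d defaulting to k + m - 1.
    """
    if d is None:
        d = k + m - 1
    if not (k + 1 <= d <= k + m - 1):
        raise ValueError("d must satisfy k + 1 <= d <= k + m - 1")
    q = d - k + 1  # each helper sends 1/q of its chunk
    return d / (k * q)


# A plain Reed-Solomon repair reads k whole chunks, i.e. the full object size.
reed_solomon_fraction = 1.0
clay_fraction = clay_repair_read_fraction(k=12, m=3)

print(f"Reed-Solomon repair reads {reed_solomon_fraction:.2f}x the object size")
print(f"CLAY repair reads {clay_fraction:.2f}x the object size")
```

Under those assumptions, K=12, M=3 with 14 helpers reads about 0.39x the object size per repaired chunk instead of 1x, at the cost of touching 14 OSDs rather than 12.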

Since space is important, I'll probably just go with CLAY, as Locally Repairable costs extra disk space. A K=12, M=3 setting would result in the same sort of space usage and survivability as a 15-disk RAIDz3 vdev. The cool thing about using CLAY+Jerasure and Ceph is that all 129 remaining OSDs/disks in the cluster will be used for recovery load rather than just 14. That reduces the per-disk load a TON and increases the rate of recovery, as it's basically a full mesh of 129 nodes all reading, writing, and transferring data around during a recovery. Since I'll be using 2x10Gbps connections between all four nodes, that should go pretty fast, although the R610s'/R620s' CPUs will likely be the real limiter in recovery speed.

There are a few options for accessing the stored data: RGW (S3-like API) with a FUSE mount (rclone, for example) that would allow mounting it locally, CephFS, or RBD (with a regular file system on top). Since I need multi-user access, RBD is out unless I also stack GFS on top. So that leaves CephFS, which talks to the OSDs directly, or dropping an S3-like proxy in between.

CephFS isn't without risks though. It'd mean putting all of my eggs in one proverbial basket to some extent. The first challenge is determining whether to use the FUSE client or the kernel driver. The kernel driver is obviously faster but can be less stable and has fewer features. The FUSE client is slower but more stable, better tested, and has the full CephFS feature set.

The next challenge is the data storage: CephFS needs OMAP for the pool that stores the metadata. The data pool doesn't need it. As a result I can store the data on an erasure coded pool but not the metadata. This is fine as far as I'm concerned. The metadata should end up on SSD in any case. 
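The pool split above could look something like the following Python sketch, which only builds and prints the `ceph` commands rather than running them. All pool, file system, and profile names are hypothetical, the PG counts are placeholders, and the erasure code profile is assumed to exist already. It also assumes one particular layout: a small replicated default data pool with the erasure-coded pool added as an extra data pool. That is only one way to do it, and pinning the metadata pool to SSD would need its own CRUSH rule, which isn't shown.

```python
def cephfs_pool_commands(fs_name, meta_pool, default_data_pool,
                         ec_data_pool, ec_profile, pg_num=64):
    """Return the ceph CLI invocations as argument lists (nothing is executed)."""
    pg = str(pg_num)  # placeholder PG count, not a tuned value
    return [
        # Replicated pool for CephFS metadata (needs OMAP, so no erasure coding).
        ["ceph", "osd", "pool", "create", meta_pool, pg, pg],
        # Assumption: a small replicated pool as the file system's default data pool.
        ["ceph", "osd", "pool", "create", default_data_pool, pg, pg],
        # Erasure-coded pool for the bulk data, using an existing profile.
        ["ceph", "osd", "pool", "create", ec_data_pool, pg, pg, "erasure", ec_profile],
        # CephFS needs overwrite support on erasure-coded data pools.
        ["ceph", "osd", "pool", "set", ec_data_pool, "allow_ec_overwrites", "true"],
        ["ceph", "fs", "new", fs_name, meta_pool, default_data_pool],
        ["ceph", "fs", "add_data_pool", fs_name, ec_data_pool],
    ]


# Hypothetical names, for illustration only.
commands = cephfs_pool_commands(
    fs_name="mediafs",
    meta_pool="mediafs_metadata",
    default_data_pool="mediafs_data",
    ec_data_pool="mediafs_data_ec",
    ec_profile="media-ec-profile",
)
for argv in commands:
    print(" ".join(argv))
```

Printing the commands first makes it easy to review them before pasting anything into a shell on a real cluster.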

It's also worth noting that CephFS needs an MDS or three. That's not a huge deal but will require a host or two to run them on. Using the OSD servers seems like a bad idea since they'll be pretty short on RAM during recoveries and very, very low on CPU. So the MDS(s) will probably end up on the compute servers.

That'll be a small two- or three-node Docker Swarm cluster or perhaps K8s; which one will depend on the distro selected for the base OS. Ideally one of the new Red Hat options will have a CentOS-like project I can pull the image from. If not, one of the other minimalistic, static distros may be used.

At the end of the day, the progress here is blocked by the need to keep way too many boxes and other items stored in the space that I need to get the servers moved around in. Once that clears up (i.e. the libraries and boxes are moved), it'll be pretty easy.

One final problem: the 60-bay DAS units are LONG. Very long. The first one fit in the rack without a problem. The second might not fit due to the zero-U PDUs I have. That'll take a quick (if a bit heavy) test to determine.