An i6000, Bacula, & You

Submitted by gpmidi on Mon, 07/08/2019 - 07:45
LTO-6 In An i6000

The i2000 & i6000

Around December 2018 I purchased a used Scalar i2000 and, a few weeks later, a used Scalar i6000 gen1. Both are from Quantum, although ADIC made the previous revisions. IIRC there is a third vendor that also sold them under its own name.

The i2000 and i6000 gen1 are the exact same hardware, just with different software. The i6000 gen2 has a very different picker and other upgraded hardware. You can tell gen2 hardware from gen1 by a cone-shaped plastic cover over the top of the picker; the gen1 hardware is basically flat. Gen2 hardware only supports LTO-1 through LTO-8. Gen1 hardware supports SDLT and LTO up to LTO-5 if the firmware is less than or equal to 617G. If it's above 617G then it's LTO-1 to LTO-8; they dropped SDLT support and added newer LTO revisions.
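The generation and firmware rules above can be written down as a small lookup. This is only a sketch of the rules as I described them: the function name is made up, and it assumes firmware versions always look like three digits plus a letter (617G, 657G) and sort in that order.

```python
import re

def supported_media(generation, firmware):
    """Return the media families a library should accept, per the rules above.

    Assumption: firmware strings are digits followed by one letter, e.g. "617G",
    and compare by number first, then letter.
    """
    lto_1_to_8 = [f"LTO-{n}" for n in range(1, 9)]
    if generation == 2:
        # Gen2 hardware is LTO only.
        return lto_1_to_8

    match = re.fullmatch(r"(\d+)([A-Z])", firmware.upper())
    if match is None:
        raise ValueError(f"unexpected firmware format: {firmware!r}")
    version = (int(match.group(1)), match.group(2))

    if version <= (617, "G"):
        # Older gen1 firmware: SDLT plus LTO-1 through LTO-5.
        return ["SDLT"] + [f"LTO-{n}" for n in range(1, 6)]
    # Newer gen1 firmware: SDLT dropped, newer LTO generations added.
    return lto_1_to_8

print(supported_media(1, "617G"))
print(supported_media(1, "657G"))
```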

The gen1 hardware has a few options on the control unit (that's the first rack, with the MCB, aka main control board, the picker, and the first tape drive). One option is zero to four FC blades. Some models do 1/2 Gbps while others do 1/2/4 Gbps. They're useful for older, slower drives if you want to keep the amount of cabling down and/or reduce the number of FC switch ports you need. But they do require you to map LUN IDs to WWNs on them, since they only expose one WWN on the external ports. My guess is that it was done this way so they didn't have to implement a full FC switch.

Another option is EEBs, aka Ethernet Expander Boards. They hook to newer LTO drives and enable a few features, including library management via the LTO drive's WWN, faster drive firmware updates, and some other things I think. They're expensive to get used so I don't have any right now.

While the i2000 has a few hardware issues, namely the robot is unhappy, the i6000 works great with LTO-5 and below. After months of effort I was finally able to procure a used MCB with a newer revision of firmware. The previous version, 617G, could do SDLT and up to LTO-5. The new firmware, 657G, works great with LTO-1 through LTO-6. Supposedly it works up through LTO-8, although I can't test that at the moment.

The i6000 is attached to my storage nodes via two Cisco FC switches. Since I only had 4/8 Gbps FC SFPs around, I ended up hooking the library's control port, which is 1 Gbps, to an FC blade's input and the FC blade's output to the switch. Then the FC mapping in the i6000's software had to be set to map the control channel to a LUN on the FC blade. The LTO-6 drives are all hooked directly to the FC switch as they support 8 Gbps FC while the FC blades only go up to 4 Gbps. From the switch I have a 2x 8 Gbps trunk back to a second switch, which is then hooked to all of my storage nodes via 2x or 4x 8 Gbps lines.

My i6000 in its current drive configuration (2x LTO-3, 2x LTO-4, 2x LTO-5, 1x LTO-6, & 5x unused drive bays) can hold 720 tapes. Currently I have 362 LTO-6 tapes and 7 cleaning tapes. At LTO-6's native 2.5 TB per tape, that comes to 905 TB of tape. All of the tapes have barcodes, although there is no real pattern to them, as some tapes were purchased with barcodes already in place and others without. Bacula keeps track of what tape holds what files in its PostgreSQL database.
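For anyone checking the math, here is the capacity arithmetic as a quick sketch. It assumes the native (uncompressed) LTO-6 capacity of 2.5 TB per tape and that every tape, cleaning tapes included, takes one slot in the library; the variable names are just for illustration.

```python
LTO6_NATIVE_TB = 2.5   # native, uncompressed capacity of one LTO-6 tape
SLOTS = 720            # tapes the library holds in its current drive configuration
DATA_TAPES = 362
CLEANING_TAPES = 7

total_tb = DATA_TAPES * LTO6_NATIVE_TB
free_slots = SLOTS - DATA_TAPES - CLEANING_TAPES

print(f"{total_tb:.0f} TB of native tape capacity")   # 905 TB
print(f"{free_slots} slots still free")                # 351
```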

Bacula

The backup server currently runs both the Bacula Director (bacula-dir) and the Bacula Storage Daemon (bacula-sd) on the same system. This and all other systems are Linux based - CentOS 7 to be exact.

Currently bacula-fd (the client, Bacula File Daemon) is installed on every system that needs backups. This includes all of the local el7 systems, a few remote el7 systems, and a local Windows 10 desktop.

Since 80% of the storage is directly attached to one system, the bacula-sd is located on that system. This reduces overall network traffic and helps speed up backups for this system. Now that the i6000 is online and will have more tape drives soon, though, it would be possible to run bacula-sd on more nodes to split up the backup load some.

The storage being backed up is either OS configuration files or the big heavy hitter - ZFS backed arrays. These arrays range in size from 25TiB to over 100TiB in usable size and are mostly built from either 8TB or 2TB disks. RAIDz3 is used for VDEVs of 12 disks and above while RAIDz2 is used for lower disk count VDEVs. To help keep the pools flexible to changes, each VDEV is in its own zpool. The pools are then merged via glusterfs or mergerfs depending on requirements.
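The RAIDz rule of thumb above is simple enough to sketch, along with a rough usable size estimate. The function names are made up for illustration, and the estimate is deliberately crude: it just subtracts the parity disks and ignores ZFS metadata, padding, reserved space, and the TB vs TiB difference, so real pools come out smaller.

```python
def raidz_parity(disk_count):
    """Parity level per the rule above: RAIDz3 for 12+ disks, RAIDz2 below that."""
    return 3 if disk_count >= 12 else 2

def approx_usable_tb(disk_count, disk_tb):
    """Very rough usable capacity: data disks times disk size, no ZFS overhead."""
    parity = raidz_parity(disk_count)
    if disk_count <= parity:
        raise ValueError("not enough disks for this parity level")
    return (disk_count - parity) * disk_tb

print(raidz_parity(12), approx_usable_tb(12, 8))  # 3 72
print(raidz_parity(8), approx_usable_tb(8, 2))    # 2 12
```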

Bacula uses both tape and a few TB of disk for backup storage. The quick backups are all small in size and run at least once a day if not every hour in a few special cases. The normal backups run daily or weekly and capture pretty much everything. They go directly to tape. 

Since large-file backups and everything-else backups have very different needs, they've been split into two separate Bacula pools. The everything-else backups go to a small pool of tape, are retained for a long period of time, and occur fairly frequently. The large-file backups do incrementals weekly and fulls every six months.

One cool thing about the large file backups: a single job that does 250+ TiB of data would take days or weeks to run and would be a problem if it failed at 95%. So the large file jobs are split up by the first letter or first two letters of the top level directory name. This results in close to 7k jobs but each job is a reasonable size. Furthermore, the full backups can be staggered. A small Python script was created to generate the required configurations for this automatically. The total size of the generated configurations is over 75MB.
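To give a feel for the approach, here is a minimal sketch of that kind of generator. It is not the actual script. The job naming scheme, the "LargeFileDefaults" JobDefs, and the "LargeStagger" schedule names are all hypothetical and would need to be defined elsewhere in the Director config. It also assumes prefixes are grouped case-insensitively and that directory names are safe to use in Bacula resource names, since nothing here sanitizes them.

```python
from collections import defaultdict

def group_by_prefix(dir_names, prefix_len=1):
    """Group top level directory names by their first one or two characters."""
    groups = defaultdict(list)
    for name in sorted(dir_names):
        groups[name[:prefix_len].lower()].append(name)
    return dict(groups)

def render_job(root, prefix, dirs, client, schedule):
    """Render one FileSet and one Job resource for a single prefix group."""
    name = f"large-{client}-{prefix}"
    files = "\n".join(f'    File = "{root}/{d}"' for d in dirs)
    return f"""FileSet {{
  Name = "{name}-fs"
  Include {{
    Options {{
      signature = MD5
    }}
{files}
  }}
}}
Job {{
  Name = "{name}"
  JobDefs = "LargeFileDefaults"
  Client = "{client}"
  FileSet = "{name}-fs"
  Schedule = "{schedule}"
}}
"""

def generate_configs(root, dir_names, client, n_slots=4, prefix_len=1):
    """Build config text for every group, rotating through n_slots schedules
    so that the full backups do not all land on the same day."""
    groups = group_by_prefix(dir_names, prefix_len)
    parts = []
    for index, (prefix, dirs) in enumerate(sorted(groups.items())):
        schedule = f"LargeStagger{index % n_slots}"  # hypothetical schedule names
        parts.append(render_job(root, prefix, dirs, client, schedule))
    return "\n".join(parts)

# Example with made-up directory and client names.
example = generate_configs("/tank/data", ["alpha", "archive", "Backups", "media"], "store1-fd")
print(example)
```

In practice the directory list would come from something like os.listdir() on each array's mount point, and the output would be written to a file that bacula-dir.conf pulls in. Switching prefix_len to 2 for the busiest letters is what keeps each job a reasonable size.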

All of the Bacula jobs are run automatically by Bacula Director on one of a few configured cadences. This, combined with all tapes being in one library, results in a 100% automatic backup solution.
