Today, a request for code review came across the ZFS developers’ mailing list. Developer George Amanakis has ported and revised a code improvement that makes the L2ARC (OpenZFS’s read cache device feature) persistent across reboots. Amanakis explains:
The last couple of months I have been working on getting L2ARC persistence to work in ZFSonLinux.
This effort was based on previous work by Saso Kiselkov (@skiselkov) in Illumos (https://www.illumos.org/issues/3525), which was later ported by Yuxuan Shui (@yshui) to ZoL (https://github.com/zfsonlinux/zfs/pull/2672), subsequently modified by Jorgen Lundman (@lundman), and rebased to master with multiple additions and changes by me (@gamanakis).
The end result is in: https://github.com/zfsonlinux/zfs/pull/9582
For those unfamiliar with the nuts and bolts of ZFS, one of its distinguishing features is the use of the ARC—Adaptive Replacement Cache—algorithm for read cache. Standard filesystem LRU (Least Recently Used) caches—used in NTFS, ext4, XFS, HFS+, APFS, and pretty much anything else you’ve likely heard of—will readily evict “hot” (frequently accessed) storage blocks if large volumes of data are read once.
By contrast, each time a block is re-read within the ARC, it becomes more heavily prioritized and more difficult to push out of cache as new data is read in. The ARC also tracks recently evicted blocks—so if a block keeps getting read back into cache after eviction, this too will make it more difficult to evict. This leads to much higher cache hit rates—and therefore lower latencies and more throughput and IOPS available from the actual disks—for most real-world workloads.
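To make the difference concrete, here is a minimal Python sketch comparing a plain LRU cache with a toy two-list cache that promotes a block once it has been read a second time. The two-list cache is only an illustration of the general idea, not the real ARC: it has no ghost lists for recently evicted blocks and does no adaptive resizing. The class names, block names, and cache sizes are all invented for the example.

```python
from collections import OrderedDict


class LRUCache:
    """Plain least-recently-used cache: a single list, oldest entry evicted first."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()

    def read(self, block):
        """Return True on a cache hit, False on a miss."""
        hit = block in self.blocks
        if hit:
            self.blocks.move_to_end(block)
        else:
            self.blocks[block] = True
            if len(self.blocks) > self.capacity:
                self.blocks.popitem(last=False)  # evict least recently used
        return hit


class TwoListCache:
    """Toy frequency-aware cache (NOT the real ARC).

    Blocks read once live in `recent`. A second read promotes them to
    `frequent`, which one-time reads never touch.
    """

    def __init__(self, capacity):
        self.recent_cap = capacity // 2
        self.frequent_cap = capacity - self.recent_cap
        self.recent = OrderedDict()
        self.frequent = OrderedDict()

    def read(self, block):
        """Return True on a cache hit, False on a miss."""
        if block in self.frequent:
            self.frequent.move_to_end(block)
            return True
        if block in self.recent:
            # Second read: promote from `recent` to `frequent`.
            del self.recent[block]
            self.frequent[block] = True
            if len(self.frequent) > self.frequent_cap:
                self.frequent.popitem(last=False)
            return True
        self.recent[block] = True
        if len(self.recent) > self.recent_cap:
            self.recent.popitem(last=False)
        return False


def hot_hits_after_scan(cache, hot_blocks, scan_blocks):
    """Warm the cache with hot blocks, run a one-time scan, count surviving hot hits."""
    for _ in range(2):  # read every hot block twice
        for block in hot_blocks:
            cache.read(block)
    for block in scan_blocks:  # large volume of data read exactly once
        cache.read(block)
    return sum(cache.read(block) for block in hot_blocks)


hot = ["hot-%d" % i for i in range(4)]
scan = ["scan-%d" % i for i in range(100)]
lru_hits = hot_hits_after_scan(LRUCache(8), hot, scan)
two_list_hits = hot_hits_after_scan(TwoListCache(8), hot, scan)
print("LRU hot hits after scan:", lru_hits)            # 0
print("two-list hot hits after scan:", two_list_hits)  # 4
```

With the same eight-block budget, the one-time scan flushes every hot block out of the plain LRU, while the hot blocks that were read twice survive in the two-list cache.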
The primary ARC is kept in system RAM, but an L2ARC—Layer 2 Adaptive Replacement Cache—device can be created from one or more fast disks. In a ZFS pool with one or more L2ARC devices, when blocks are evicted from the primary ARC in RAM, they are moved down to L2ARC rather than being thrown away entirely. In the past, this feature has been of limited value, both because indexing a large L2ARC occupies system RAM which could have been better used for primary ARC and because L2ARC was not persistent across reboots.
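The tiering described above can be sketched as a toy two-tier cache. This follows the simplified description in this article only: the RAM tier here is a plain LRU rather than a real ARC, the second tier is unbounded, and real OpenZFS fills the L2ARC with considerably more bookkeeping. `TwoTierCache` and its fields are hypothetical names chosen for illustration.

```python
from collections import OrderedDict


class TwoTierCache:
    """Toy model: blocks evicted from RAM are demoted to a second tier."""

    def __init__(self, ram_capacity):
        self.ram_capacity = ram_capacity
        self.ram = OrderedDict()  # stands in for the primary ARC (plain LRU here)
        self.l2 = {}              # stands in for the L2ARC device (unbounded here)

    def read(self, block):
        """Return where the block was served from: 'ram', 'l2', or 'disk'."""
        if block in self.ram:
            self.ram.move_to_end(block)
            return "ram"
        source = "l2" if block in self.l2 else "disk"
        self.ram[block] = True
        if len(self.ram) > self.ram_capacity:
            evicted, _ = self.ram.popitem(last=False)
            self.l2[evicted] = True  # demote instead of throwing away
        return source


cache = TwoTierCache(ram_capacity=2)
sources = [cache.read(block) for block in ["a", "b", "c", "a"]]
print(sources)  # ['disk', 'disk', 'disk', 'l2']
```

The last read of block `a` misses RAM but is served from the second tier instead of going all the way back to the pool's disks.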
The issue of indexing L2ARC consuming too much system RAM was largely mitigated several years ago, when the L2ARC header (the part for each cached record that must be stored in RAM) was reduced from 180 bytes to 70 bytes. For a 1TiB L2ARC, servicing only datasets with the default 128KiB recordsize, this works out to about 8.4 million cached records and roughly 560MiB of RAM consumed to index the L2ARC.
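The arithmetic is simple enough to check directly. The sketch below plugs in the figures quoted above (a 1TiB device, 128KiB records, and 70-byte versus 180-byte headers) and assumes the device is completely full of records of one size. The function name is made up for this example and the 16KiB case is an added illustration, not a figure from the mailing list post.

```python
KIB = 1024
MIB = 1024 ** 2
TIB = 1024 ** 4


def l2arc_index_ram_bytes(l2arc_bytes, recordsize_bytes, header_bytes):
    """RAM needed to index a full L2ARC holding records of a single size."""
    records = l2arc_bytes // recordsize_bytes
    return records * header_bytes


new_headers = l2arc_index_ram_bytes(1 * TIB, 128 * KIB, 70)
old_headers = l2arc_index_ram_bytes(1 * TIB, 128 * KIB, 180)
small_records = l2arc_index_ram_bytes(1 * TIB, 16 * KIB, 70)

print(new_headers / MIB)    # 560.0  (70-byte headers, 128KiB records)
print(old_headers / MIB)    # 1440.0 (180-byte headers, 128KiB records)
print(small_records / MIB)  # 4480.0 (70-byte headers, 16KiB records)
```

The last line shows why recordsize matters: the index cost scales with the number of records, so a dataset tuned to smaller records costs proportionally more RAM per terabyte of L2ARC.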
Although the RAM constraint problem is largely solved, the value of a large, fast L2ARC was still sharply limited by a lack of persistence. After each system reboot (or other export of the pool), the L2ARC empties. Amanakis’ code fixes that, meaning that many gigabytes of data cached on fast solid state devices will still be available after a system reboot, thereby increasing the value of an L2ARC device. At first blush, this seems mostly important for personal systems that get rebooted often, but it also means far more heavily loaded servers may need much less “babying” while they warm up their caches after a reboot.
This code has not yet been merged into master, but Brian Behlendorf, Linux platform lead of the OpenZFS project, has signed off on it. It is awaiting one more code review before the merge, which is expected sometime in the next few weeks if nothing bad comes up in further review or initial testing.