I spent some time yesterday building out a UEFI server that didn’t have on-board hardware RAID for its system drives. In these situations, I always use Linux’s md RAID1 for the root filesystem (and/or /boot). This worked well for BIOS booting, since the BIOS just transfers control blindly to the MBR of whatever disk it sees (modulo finding a “bootable partition” flag, etc, etc). This means the BIOS doesn’t really care what’s on the drive; it’ll hand over control to the GRUB code in the MBR.
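(For comparison, the BIOS-era boot code really is just the first 512-byte sector of the disk; a quick, purely illustrative peek with standard tools:)
# dd if=/dev/sda bs=512 count=1 status=none | file -
With a BIOS boot loader installed, file reports a “DOS/MBR boot sector” here.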
With UEFI, the boot firmware is actually examining the GPT partition table, looking for the partition marked with the “EFI System Partition” (ESP) UUID. Then it looks for a FAT32 filesystem there, and does more things, like looking at NVRAM boot entries, or just running EFI/BOOT/BOOTX64.EFI from that FAT32. Under Linux, this .EFI code is either GRUB itself, or Shim, which loads GRUB.
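(To see roughly what the firmware sees, you can check the partition flags and the current NVRAM entries with standard tools; output omitted here:)
# parted /dev/sda print
# efibootmgr -v
The ESP shows up in parted with the “esp” flag, and efibootmgr -v lists the NVRAM boot entries along with the loader paths they point at.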
So, if I want RAID1 for my root filesystem, that’s fine (GRUB will read md, LVM, etc), but how do I handle /boot/efi (the UEFI ESP)? In everything I found answering this question, the advice was “oh, just manually make an ESP on each drive in your RAID, copy the files around, add a separate NVRAM entry (with efibootmgr) for each drive, and you’re fine!” I did not like this one bit, since it meant the copies could drift out of sync, among other problems.
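(For reference, that manual approach boils down to something like the following. The /boot/efi2 mount point, the entry label, and the shim path are just placeholders I’m assuming for a Debian-style install:)
# mount /dev/sdb1 /boot/efi2
# rsync -a --delete /boot/efi/ /boot/efi2/
# efibootmgr -c -d /dev/sdb -p 1 -L 'debian (ESP on sdb)' -l '\EFI\debian\shimx64.efi'
And then you have to remember to re-run the copy after every change to /boot/efi, which is exactly the part I didn’t want to depend on.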
By default, the current implementation of Linux’s md RAID puts its metadata at the front of a partition. This solves more problems than it creates, but it means the RAID isn’t “invisible” to anything that doesn’t know about the metadata. In fact, mdadm warns about this pretty loudly:
# mdadm --create /dev/md0 --level 1 --raid-disks 2 /dev/sda1 /dev/sdb1
mdadm: Note: this array has metadata at the start and
may not be suitable as a boot device. If you plan to
store '/boot' on this device please ensure that
your boot-loader understands md/v1.x metadata, or use
--metadata=0.90
Reading from the mdadm man page:
-e, --metadata=
...
1, 1.0, 1.1, 1.2 default
Use the new version-1 format superblock. This has fewer
restrictions. It can easily be moved between hosts with
different endian-ness, and a recovery operation can be
checkpointed and restarted. The different sub-versions
store the superblock at different locations on the
device, either at the end (for 1.0), at the start (for
1.1) or 4K from the start (for 1.2). "1" is equivalent
to "1.2" (the commonly preferred 1.x format). "default"
is equivalent to "1.2".
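(To double-check which format a member partition is actually carrying, and where its superblock sits, mdadm --examine will say. For 1.2 the “Super Offset” is 8 sectors, i.e. 4K from the start, while for 1.0 it lands near the end of the device:)
# mdadm --examine /dev/sda1 | grep -E 'Version|Offset'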
First we toss a FAT32 on the RAID (mkfs.fat -F32 /dev/md0). Looking at the result on a member partition, the first 4K is entirely zeros, and file doesn’t see a filesystem:
# dd if=/dev/sda1 bs=1K count=5 status=none | hexdump -C
00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00001000 fc 4e 2b a9 01 00 00 00 00 00 00 00 00 00 00 00 |.N+.............|
...
# file -s /dev/sda1
/dev/sda1: Linux Software RAID version 1.2 ...
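(To start over with different metadata, the existing array first has to be stopped and the old superblocks cleared from the members:)
# mdadm --stop /dev/md0
# mdadm --zero-superblock /dev/sda1 /dev/sdb1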
So, instead, we’ll use --metadata 1.0 to put the RAID metadata at the end:
# mdadm --create /dev/md0 --level 1 --raid-disks 2 --metadata 1.0 /dev/sda1 /dev/sdb1
...
# mkfs.fat -F32 /dev/md0
# dd if=/dev/sda1 bs=1 skip=80 count=16 status=none | xxd
00000000: 2020 4641 5433 3220 2020 0e1f be77 7cac FAT32 ...w|.
# file -s /dev/sda1
/dev/sda1: ... FAT (32 bit)
Now we have a visible FAT32 filesystem on the ESP. UEFI should be able to boot whatever disk hasn’t failed, and grub-install will write to the RAID mounted at /boot/efi.
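(For completeness, the install step on a Debian-ish system looks roughly like this; the target, ESP path, and bootloader id shown are just the usual defaults, nothing specific to this setup:)
# grub-install --target=x86_64-efi --efi-directory=/boot/efi --bootloader-id=debian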
However, we’re left with a new problem: on (at least) Debian and Ubuntu, grub-install attempts to run efibootmgr to record which disk UEFI should boot from. This fails, though, since it expects a single disk, not a RAID set. In fact, the disk probe returns nothing, and grub-install ends up running efibootmgr with an empty -d argument:
Installing for x86_64-efi platform.
efibootmgr: option requires an argument -- 'd'
...
grub-install: error: efibootmgr failed to register the boot entry: Operation not permitted.
Failed: grub-install --target=x86_64-efi
WARNING: Bootloader is not properly installed, system may not be bootable
Luckily my UEFI boots without NVRAM entries, and I can disable the NVRAM writing via the “Update NVRAM variables to automatically boot into Debian?” debconf prompt when running: dpkg-reconfigure -p low grub-efi-amd64
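(If you’d rather preseed that answer than click through the prompt, the underlying debconf question appears to be grub2/update_nvram; that’s an assumption worth verifying with debconf-show grub-efi-amd64 on your own system:)
# echo 'grub-efi-amd64 grub2/update_nvram boolean false' | debconf-set-selections
# dpkg-reconfigure -f noninteractive grub-efi-amd64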
So, now my system will boot with both or either drive present, and updates from Linux to /boot/efi are visible on all RAID members at boot-time. HOWEVER, there is one nasty risk with this setup: if UEFI writes anything to one of the drives (which this firmware did when it wrote out a “boot variable cache” file), it may lead to corrupted results once Linux mounts the RAID (since the member drives will no longer have identical block-level copies of the FAT32).
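(One way to notice this kind of divergence after the fact is md’s built-in consistency check: with the array assembled, trigger a check and read the mismatch count, where anything non-zero means the members no longer agree:)
# echo check > /sys/block/md0/md/sync_action
# cat /sys/block/md0/md/mismatch_cnt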
To deal with this “external write” situation, I see some solutions:
- Make the partition read-only when not under Linux. (I don’t think this is a thing.)
- Create higher-level knowledge of the root-filesystem RAID configuration to keep a collection of filesystems manually synchronized, instead of doing block-level RAID. (Seems like a lot of work, and would need a redesign of /boot/efi into something like /boot/efi/booted, /boot/efi/spare1, /boot/efi/spare2, etc.)
- Prefer one RAID member’s copy of /boot/efi and rebuild the RAID at every boot. If there were no external writes, there’s no issue. (Though what’s really the right way to pick the copy to prefer?)
Since mdadm has the “--update=resync” assembly option, I can actually do that last option. This required updating /etc/mdadm/mdadm.conf to add <ignore> to the RAID’s ARRAY line, to keep it from auto-starting:
ARRAY <ignore> metadata=1.0 UUID=123...
(Since it’s ignored, I’ve chosen /dev/md100 for the manual assembly below.) Then I added the noauto option to the /boot/efi entry in /etc/fstab:
/dev/md100 /boot/efi vfat noauto,defaults 0 0
And finally I added a systemd oneshot service that assembles the RAID with resync and mounts it:
[Unit]
Description=Resync /boot/efi RAID
DefaultDependencies=no
After=local-fs.target

[Service]
Type=oneshot
ExecStart=/sbin/mdadm -A /dev/md100 --uuid=123... --update=resync
ExecStart=/bin/mount /boot/efi
RemainAfterExit=yes

[Install]
WantedBy=sysinit.target
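The unit just needs to live somewhere like /etc/systemd/system/boot-efi-resync.service (the filename here is an arbitrary choice of mine) and then be enabled:
# systemctl daemon-reload
# systemctl enable boot-efi-resync.service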
(And don’t forget to run “update-initramfs -u” so the initramfs has an updated copy of /etc/mdadm/mdadm.conf.)
If mdadm.conf supported an “update=” option for ARRAY lines, this would have been trivial. Looking at the source, though, that kind of change doesn’t look easy. I can dream!
And if I wanted to keep a “pristine” version of /boot/efi that UEFI couldn’t update, I could rearrange things more dramatically: keep the primary RAID member as a loopback device backed by a file in the root filesystem (e.g. /boot/efi.img). This would make all external changes to the real ESPs disappear after a resync. Something like:
# truncate --size 512M /boot/efi.img
# losetup -f --show /boot/efi.img
/dev/loop0
# mdadm --create /dev/md100 --level 1 --raid-disks 3 --metadata 1.0 /dev/loop0 /dev/sda1 /dev/sdb1
And at boot, just rebuild it from /dev/loop0, though I’m not sure how to “prefer” that partition…
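(One possible, untested way to express that preference might be to assemble the array degraded from just the loop device, then add the real ESPs back so they get rebuilt from it; a rough sketch only:)
# losetup /dev/loop0 /boot/efi.img
# mdadm -A /dev/md100 --run /dev/loop0
# mdadm /dev/md100 --add /dev/sda1 /dev/sdb1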
© 2018, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License.