OKD 4.17 Bare Metal UPI Masters failing boot
Hello, I am looking for help with an issue. This is my first time attempting a production install, so please bear with me if I sound like a novice.
I am tasked with repurposing our old server rack for an on-premises HA OKD install. I am using 4.17.0-okd-scos.1 with a UPI install (we can't use the Assisted Installer because we are in a restricted environment).
I can get the bootstrap node up and running, but I hit an error when booting the master nodes. I was able to live boot FCOS and complete the Ignition install, and the machine starts booting into CentOS, but then it fails and drops into emergency mode.
This is the section of the rdsosreport log where the error keeps repeating:
[ 5.240465] localhost systemd[1]: Started Device-Mapper Multipath Device Controller.
[ 5.240927] localhost systemd[1]: Reached target Preparation for Local File Systems.
[ 5.241297] localhost systemd[1]: Reached target Local File Systems.
[ 5.241645] localhost systemd[1]: Reached target System Initialization.
[ 5.242028] localhost systemd[1]: Reached target Basic System.
[ 5.242588] localhost systemd[1]: Persist Osmet Files (ISO) was skipped because of an unmet condition check (ConditionKernelCommandLine=coreos.liveiso).
[ 5.328086] localhost systemd-journald[419]: Missed 18 kernel messages
[ 5.362183] localhost kernel: scsi 0:0:0:0: CD-ROM Cisco Virtual CD/DVD 1.22 PQ: 0 ANSI: 0
[ 5.362972] localhost kernel: scsi 0:0:0:1: Direct-Access Cisco Virtual FDD/HDD 1.22 PQ: 0 ANSI: 0 CCS
[ 5.363720] localhost kernel: scsi 0:0:0:2: Direct-Access Cisco Virtual Floppy 1.22 PQ: 0 ANSI: 0 CCS
[ 5.364787] localhost kernel: sr 0:0:0:0: Power-on or device reset occurred
[ 5.466906] localhost kernel: sr 0:0:0:0: [sr1] scsi3-mmc drive: 0x/0x cd/rw caddy
[ 5.468390] localhost kernel: sr 0:0:0:0: Attached scsi CD-ROM sr1
[ 5.468480] localhost kernel: sr 0:0:0:0: Attached scsi generic sg1 type 5
[ 5.468776] localhost kernel: scsi 0:0:0:1: Attached scsi generic sg2 type 0
[ 5.469092] localhost kernel: scsi 0:0:0:2: Attached scsi generic sg3 type 0
[ 5.571522] localhost kernel: sd 0:0:0:1: Power-on or device reset occurred
[ 5.572471] localhost kernel: sd 0:0:0:1: [sda] Media removed, stopped polling
[ 5.572967] localhost kernel: sd 0:0:0:1: [sda] Attached SCSI removable disk
[ 5.573098] localhost kernel: sd 0:0:0:2: Power-on or device reset occurred
[ 5.573983] localhost kernel: sd 0:0:0:2: [sdb] Media removed, stopped polling
[ 5.574459] localhost kernel: sd 0:0:0:2: [sdb] Attached SCSI removable disk
[ 139.440654] localhost dracut-initqueue[685]: Warning: dracut-initqueue: timeout, still waiting for following initqueue hooks:
[ 139.441933] localhost dracut-initqueue[685]: Warning: /lib/dracut/hooks/initqueue/finished/devexists-\x2fdev\x2fdisk\x2fby-uuid\x2f9fb627a7-ba2f-40fa-b875-6e6bfecf85be.sh: "if ! grep -q After=remote-fs-pre.target /run/systemd/generator/systemd-cryptsetup@*.service 2>/dev/null; then
[ 139.441933] localhost dracut-initqueue[685]: [ -e "/dev/disk/by-uuid/9fb627a7-ba2f-40fa-b875-6e6bfecf85be" ]
[ 139.441933] localhost dracut-initqueue[685]: fi"
[ 139.443642] localhost dracut-initqueue[685]: Warning: dracut-initqueue: starting timeout scripts
It looks like the system loses the drive when CentOS boots (it is booting from that drive, though, because the USB stick was unplugged after the FCOS install finished).
When I run lsblk, it returns:
NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
sda      8:0    1    0B  0 disk
sdb      8:16   1    0B  0 disk
sdc      8:32   1  7.3G  0 disk
`-sdc1   8:33   1  7.3G  0 part
sr0     11:0    1 1024M  0 rom
sr1     11:1    1 1024M  0 rom
This output is missing the drive the system was installed to, sdd, which showed up in the FCOS live boot and was the install target.
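For anyone who wants to reproduce the check by hand: as far as I can tell from the log, the dracut hook that times out only tests whether the by-uuid entry for the root filesystem exists. Below is a minimal sketch of that same test, assuming a bash-like shell is available in the emergency environment. The function name root_uuid_present and its optional directory argument are my own hypothetical helpers, not part of dracut.

```shell
# Sketch: mirror the "devexists" test from the dracut-initqueue warning.
# root_uuid_present is a hypothetical helper name, not a dracut function.
# $1 = filesystem UUID
# $2 = directory holding the by-uuid links (defaults to the real one)
root_uuid_present() {
    local uuid="$1"
    local dir="${2:-/dev/disk/by-uuid}"
    # Same condition the initqueue hook polls: does the entry exist?
    if [ -e "$dir/$uuid" ]; then
        echo "present: $dir/$uuid"
        return 0
    fi
    echo "missing: $dir/$uuid"
    return 1
}

# Usage from the emergency shell (UUID copied from the log):
# root_uuid_present 9fb627a7-ba2f-40fa-b875-6e6bfecf85be
```

On my masters I would expect this to print "missing", which matches lsblk not listing the install disk at all. That would suggest the disk or its controller is never detected in the initramfs, rather than the UUID being wrong.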