Okay Linux systems/platform folks, I need your ideas.

We run Linux VMs using #QEMU. Periodically we hit a significant delay between our orchestrator invoking qemu-system-x86_64 and the guest OS starting.

(Guest OS starting in this case is evidenced by cloud-init logging written to the serial console, which we persist.)

The delay is in the range of 10-30 minutes. When it happens, we see a significant increase in the following hypervisor metrics from the qemu invocation until the guest OS starts cloud-init:

- disk read requests
- disk read volume (KB)
- % of cpu time spent running the kernel
- % time during which i/o requests were issued

How can we figure out what's going on?

Some additional things that might be helpful to know:

- hypervisor hosts are running debian 10 on intel hardware
- we haven't identified any other pattern, though it seems to always happen on hosts that are already running other VMs
- the data volume is LVM on a LUKS-encrypted partition
- hypervisor hosts have 2 physical drives in a RAID configuration (the exact level I still need to figure out)
- happens with stock debian 11 cloud image as guest OS
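
For the RAID question, a quick check I can run on the hosts (a sketch, assuming Linux md software RAID — a hardware RAID controller wouldn't show up here and would need the vendor tool instead):

```python
#!/usr/bin/env python3
# Report the md software-RAID level, if any. Sketch only: hardware RAID
# controllers are invisible to /proc/mdstat and need vendor tools.
from pathlib import Path

mdstat = Path("/proc/mdstat")
if mdstat.exists():
    # Lines like "md0 : active raid1 sda1[0] sdb1[1]" name the level.
    print(mdstat.read_text())
else:
    print("no /proc/mdstat -- not md RAID, or the md module isn't loaded")
```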

Any ideas?

I want to better understand what qemu is doing between that initial invocation and when the guest OS starts cloud-init. Is there any way to get more logging during this window? What are other ways to surveil what might be the bottleneck?
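
One thing I can do in the meantime is sample the qemu process's own /proc accounting during the slow phase, to see whether the disk reads and kernel CPU time are attributable to the qemu process itself or to something else on the host (dm-crypt kworkers, say). Rough sketch — the PID in the example is hypothetical, substitute the real one; field offsets follow proc(5):

```python
#!/usr/bin/env python3
"""Sample a process's read volume and kernel CPU time from /proc.

Sketch only: point it at the stalled qemu-system-x86_64 PID.
"""
import time


def sample(pid):
    # /proc/<pid>/io: read_bytes counts what actually hit the block layer
    io = {}
    with open(f"/proc/{pid}/io") as f:
        for line in f:
            key, value = line.split(":")
            io[key] = int(value)
    # /proc/<pid>/stat: utime/stime are fields 14/15, in clock ticks;
    # split after the ")" so spaces in the comm name can't shift fields
    with open(f"/proc/{pid}/stat") as f:
        fields = f.read().rsplit(")", 1)[1].split()
    utime, stime = int(fields[11]), int(fields[12])
    return io["read_bytes"], utime, stime


def watch(pid, interval=5, rounds=6):
    prev = sample(pid)
    for _ in range(rounds):
        time.sleep(interval)
        cur = sample(pid)
        print(f"read {(cur[0] - prev[0]) / 1e6:8.1f} MB   "
              f"user {cur[1] - prev[1]:5d} ticks   "
              f"sys {cur[2] - prev[2]:5d} ticks")
        prev = cur


if __name__ == "__main__":
    import os
    import sys
    # Hypothetical usage: pass the qemu PID as argv[1]; defaults to
    # sampling this script itself as a smoke test.
    pid = int(sys.argv[1]) if len(sys.argv) > 1 else os.getpid()
    watch(pid, interval=1, rounds=2)
```

If the reads turn out to be qemu's own, `strace -c -p <pid>` and `perf top` on the host would be my next step, plus `iostat -x 1` for per-device latency/utilization; qemu itself can also be told to log with `-D <file> -d guest_errors`, though I don't expect that covers this phase.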

#qemu