Beyond the Kernel (Part 2) – Linux Breaking News

Continuing where we left off in the last installment, I turned my eye toward the microcode warnings in the VM. They seemed really strange because the machine now clearly had an updated microcode, and the host was correctly reporting the necessary flag, yet the VM wasn’t able to recognize it. What was more puzzling was that the exact same behavior was exhibited on the upstream Linux kernel as well.

The observation that the upstream kernel was also affected was crucial: I could rule out a SUSE-specific bug and focus on debugging what was going on in the upstream kernel. That way, if an issue was discovered, I'd be able to fix it both for the wider community and for SUSE's customers. This is thanks to SUSE's "Upstream first" policy: if we are working on an issue or feature that could benefit more people, we tend to develop it against the upstream kernel first, incorporate all the feedback we can get from other kernel developers, and finally merge it into upstream as well as our own kernels.

My first hypothesis was that the hypervisor was somehow not exposing the flag to the guest correctly, so the flag was missing from the guest's virtualized CPUID representation. To verify this theory, I first checked the KVM source code. It has a function called kvm_set_cpu_caps, which contains the logic that determines which CPUID flags on the host are going to be exposed to the guest. Naturally, not everything can be exposed, but the IBPB_BRTYPE flag (the one that drives the warning) is set via a call to kvm_cpu_cap_check_and_set(). So KVM seemed to be doing the right thing. Next, I checked whether any bits might get cleared after they are set. As it turns out, no bits are cleared, at least not by the kernel.
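To make the "check and set" idea concrete, here is a toy Python model of it. This is only a sketch of the concept, not KVM's actual code: the function name check_and_set, its signature, and the integer bitmask representation are assumptions made for illustration. The bit position (28 in EAX of CPUID leaf 0x80000021) and the host value are the ones discussed later in this article.

```python
# Toy model of "check and set": a capability is advertised to guests
# only when the host CPU itself reports it. Names are hypothetical.

IBPB_BRTYPE_BIT = 28  # bit in CPUID 0x80000021 EAX, as noted in the article


def check_and_set(kvm_caps, host_caps, bit):
    """Return kvm_caps with `bit` added, but only if host_caps has it."""
    mask = 1 << bit
    if host_caps & mask:
        kvm_caps |= mask
    return kvm_caps


# Host with the updated microcode (EAX value quoted in the article).
caps_with_flag = check_and_set(0, 0x18000245, IBPB_BRTYPE_BIT)

# Hypothetical host that lacks the flag: nothing gets set.
caps_without_flag = check_and_set(0, 0x00000245, IBPB_BRTYPE_BIT)

print(hex(caps_with_flag), hex(caps_without_flag))
```

In this model, the flag survives on the host with the updated microcode, which matches the observation that KVM itself was not the component dropping it.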

QEMU enters the stage

At this stage, I was at a loss as to why information that is visible on the host and passed through to the guest (at least judging by the kernel code alone) simply disappears when the guest is booted. Having run out of ideas, I consulted a couple of colleagues who specialize in virtualization. They suggested it could be related to QEMU and how it manages the CPUID of guests. Then my colleague Fabian Vogt provided the much-needed information: when QEMU starts up a VM, flags can be masked off depending on the "-cpu" parameter, even though KVM supports them. QEMU first issues the KVM_GET_SUPPORTED_CPUID ioctl to KVM, which in turn replies with the available CPUID flags:

In the above line, I've extracted just the content of CPUID leaf 0x80000021, which contains the various mitigation bits. In it, we can see that bit 28 is set (eax=0x18000245). Afterwards, QEMU executed the KVM_SET_CPUID ioctl:

This sets only bits 0, 2, and 6, so the guest kernel never sees the IBPB_BRTYPE flag. As it turns out, the hypervisor needs to cooperate in order to properly expose the flag. At present, the code needed to expose it is missing from upstream QEMU, and my colleague Fabiano Rosas has sent a tentative patch so that the issue can be resolved once and for all.
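The bit arithmetic can be double-checked with a few lines of Python. This is a minimal sketch using only the two pieces of data quoted in this article: the EAX value 0x18000245 that KVM reported as supported, and the guest bits 0, 2, and 6 that QEMU ended up setting. The helper name set_bits is made up for this example.

```python
# Compare what KVM offered with what QEMU handed to the guest for
# CPUID leaf 0x80000021 EAX. Values come from the article; the helper
# name is hypothetical.

IBPB_BRTYPE_BIT = 28


def set_bits(value):
    """List the positions of all bits set in a 32-bit register value."""
    return [bit for bit in range(32) if value & (1 << bit)]


kvm_supported_eax = 0x18000245          # reply to KVM_GET_SUPPORTED_CPUID
guest_eax = sum(1 << bit for bit in (0, 2, 6))  # bits QEMU actually set

supported = set_bits(kvm_supported_eax)
dropped = set_bits(kvm_supported_eax & ~guest_eax)

print("supported by KVM:", supported)
print("guest EAX:", hex(guest_eax))
print("masked off by QEMU:", dropped)
print("IBPB_BRTYPE dropped:", IBPB_BRTYPE_BIT in dropped)
```

Bit 28 shows up among the bits KVM supports and also among the bits that never reach the guest, which is exactly the gap described above.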

Conclusion

What started as a seemingly trivial kernel bug turned out to be a lot more involved and ultimately didn't even concern the kernel directly. It's also important to note that, although the guest produces the warning, the guest still benefits from the microcode once it is applied on the host machine, so people shouldn't be concerned.
