SMT off fixing it is a good clue. It heavily points to the problem being within ...

foota · on Oct 8, 2023

Some state living in the CPU or in the kernel? I guess if you have per-core state in the kernel that's only accessed by that core, you're safe to modify it without locking, but that isn't the case for data shared between the hyperthreads of a core?
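
To see which logical CPUs are hyperthread siblings on a Linux box, here is a rough Python sketch. It assumes the usual sysfs topology files (cpuN/topology/thread_siblings_list) are present; the function names parse_cpu_list and sibling_groups are made up for illustration.

```python
from pathlib import Path


def parse_cpu_list(text):
    """Expand a kernel CPU list such as '0-1,8' into [0, 1, 8]."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus


def sibling_groups(cpu_root="/sys/devices/system/cpu"):
    """Return one tuple of logical CPU numbers per group of SMT siblings.

    Assumes the standard Linux sysfs layout; returns [] if it is absent.
    """
    groups = set()
    pattern = "cpu[0-9]*/topology/thread_siblings_list"
    for path in Path(cpu_root).glob(pattern):
        groups.add(tuple(parse_cpu_list(path.read_text())))
    return sorted(groups)


if __name__ == "__main__":
    for group in sibling_groups():
        print(group)
```

Any group with more than one entry is a pair (or more) of hyperthreads sharing a core, so state touched by both would need synchronization even if it is "per core".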

AnotherGoodName · on Oct 8, 2023

Within the CPU core (the terminology here gets awkward). Spectre, in its simplest-to-exploit form, used the fact that branch prediction state wasn't completely cleared between different threads running on the CPU, which led to a really easy timing attack: time your own code to see which way the previous code's branch likely went. There are a lot of CVEs around this, but that's the simplest case.

The mitigations work by clearing everything they can when the CPU switches contexts: the contents of the local CPU cache, the branch prediction table, and all speculative execution entries. The mitigations likely avoid the crash by clearing something that should be cleared on every context switch anyway. In other words, they probably fix this by accident, which is a good clue. It'd be nice if the mitigations were more fine-grained than all or nothing so we could turn them on one by one, but the microcode is not open so we can't do this :(.

vlovich123 · on Oct 8, 2023

Is it likely the issue is within the Linux kernel itself or within the microcode for the CPU?

AnotherGoodName · on Oct 9, 2023

I'm going to say microcode until someone reproduces on a completely different architecture :).

jdiff · on Oct 9, 2023

A lot of these mitigations are joint efforts between the kernel and the CPU, and the mitigations kernel parameter isn't a binary on/off; you can configure individual mitigations. It's not a single flag that signals the microcode to do spooky proprietary things.
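
As a rough illustration of how granular this is, the Python sketch below reads the per-vulnerability status files that Linux exposes under /sys/devices/system/cpu/vulnerabilities and picks mitigation-related options out of a kernel command line. The function names are hypothetical, and the option list is a small, non-exhaustive sample of real kernel parameters.

```python
from pathlib import Path

# A non-exhaustive sample of mitigation-related kernel parameters.
MITIGATION_KEYS = {
    "mitigations", "spectre_v2", "spec_store_bypass_disable",
    "l1tf", "mds", "tsx_async_abort", "nospectre_v1", "nosmt",
}


def mitigation_status(vuln_dir="/sys/devices/system/cpu/vulnerabilities"):
    """Map each vulnerability name to the kernel's reported status line.

    Returns an empty dict if the directory does not exist.
    """
    root = Path(vuln_dir)
    if not root.is_dir():
        return {}
    return {
        entry.name: entry.read_text().strip()
        for entry in sorted(root.iterdir())
        if entry.is_file()
    }


def mitigation_options(cmdline):
    """Extract mitigation-related options from a kernel command line string.

    Bare flags such as 'nosmt' map to an empty string.
    """
    options = {}
    for token in cmdline.split():
        key, _, value = token.partition("=")
        if key in MITIGATION_KEYS:
            options[key] = value
    return options


if __name__ == "__main__":
    for name, status in mitigation_status().items():
        print(f"{name}: {status}")
    cmdline_file = Path("/proc/cmdline")
    if cmdline_file.exists():
        print(mitigation_options(cmdline_file.read_text()))
```

Booting with one option changed at a time (for example a single spectre_v2= setting rather than mitigations=off) and re-running this is one way to narrow down which mitigation happens to mask the crash.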