Pretty shocking to see (what should be extremely unlikely) non-malicious code work / not work depending on a security mitigation setting.
Curious to see where this goes: whether it's a kernel bug and nobody is paying attention to `mitigations=off` anymore, or the unlikely outcome that it's an actual hardware bug that the mitigations happen to work around and nobody has noticed.
Seems he had overclocked, and maybe disabling the mitigation has revealed an instability in a setup that's otherwise stable (or at best marginal).
mitigations=off does a lot of things; it's a bundle of options.
Perhaps the next step is trying to figure out which mitigation exactly causes it to fail. Then kernel peeps should have a fighting chance of tracking this down.
In the video comments it looks like he's done some of that and it's one of the Spectre v2 mitigations. Makes me wonder if there's now less QA happening on the OEM side for configurations that boot without the mitigations on.
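If anyone wants to see what the kernel actually applied on a given boot, the per-vulnerability status is exposed under /sys/devices/system/cpu/vulnerabilities/. Here's a minimal C sketch that just dumps those files, so a mitigations=off boot can be compared against individual overrides like spectre_v2=off or nospectre_v2:

    /* Print the kernel's per-vulnerability mitigation status by reading
     * the files under /sys/devices/system/cpu/vulnerabilities/. */
    #include <dirent.h>
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        const char *dirpath = "/sys/devices/system/cpu/vulnerabilities";
        DIR *dir = opendir(dirpath);
        if (!dir) {
            perror(dirpath);
            return 1;
        }
        struct dirent *ent;
        while ((ent = readdir(dir)) != NULL) {
            if (ent->d_name[0] == '.')
                continue;                       /* skip "." and ".." */
            char path[512], line[256];
            snprintf(path, sizeof path, "%s/%s", dirpath, ent->d_name);
            FILE *f = fopen(path, "r");
            if (!f)
                continue;
            if (fgets(line, sizeof line, f)) {
                line[strcspn(line, "\n")] = '\0';
                /* e.g. "spectre_v2   Mitigation: Enhanced / Automatic IBRS; ..." */
                printf("%-24s %s\n", ent->d_name, line);
            }
            fclose(f);
        }
        closedir(dir);
        return 0;
    }

Diffing that output between the working and failing boots would at least confirm which knob is the one that matters.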
SMT off fixing it is a good clue. It heavily points to the problem being within a core rather than between execution cores. That's still a broad area, though: some state within a core is inadvertently shared between the sibling threads running on it, but there are a lot of possibilities for what that state could be.
Is that state in the CPU or in the kernel? I guess if you have per-core state in the kernel that's only accessed by that core, then you're safe to modify it without locking, but this isn't the case for data shared between the hyperthreads of a core?
Within the CPU core (terminology here gets awkward). Spectre, in its simplest-to-exploit form, used the fact that branch-prediction state wasn't completely cleared between running different threads on the CPU, which led to a really easy timing attack: time your own code to see which way the branch in the previous code likely went. There are a lot of CVEs around this, but that's the simplest case.
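To make the timing angle concrete, here's a toy C sketch (not an exploit, just an illustration of the signal): the same branch runs measurably faster when its outcome is predictable than when it's random, because the predictor's learned state shows up directly in execution time. Compile with -O0 (or check the asm) so the compiler keeps a real branch rather than turning it into a conditional move.

    /* Toy illustration of branch-predictor state showing up as timing:
     * the identical loop is faster when the branch outcome is predictable.
     * This is NOT a Spectre exploit, just a demo of the timing signal. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1 << 22)

    static double run_ms(const unsigned char *cond) {
        struct timespec t0, t1;
        volatile long sum = 0;             /* volatile so the loop isn't optimized away */
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < N; i++) {
            if (cond[i])                   /* this branch is what the predictor learns */
                sum += i;
            else
                sum -= i;
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        return (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
    }

    int main(void) {
        unsigned char *predictable = malloc(N), *random_pat = malloc(N);
        for (int i = 0; i < N; i++) {
            predictable[i] = 1;            /* always taken: trivially predicted */
            random_pat[i] = rand() & 1;    /* ~50% mispredictions */
        }
        printf("predictable pattern: %.2f ms\n", run_ms(predictable));
        printf("random pattern:      %.2f ms\n", run_ms(random_pat));
        free(predictable);
        free(random_pat);
        return 0;
    }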
The mitigations work by clearing everything they can when the CPU switches contexts: flushing loaded local CPU cache, clearing the branch prediction table, and discarding speculative execution state. They likely avoid a crash by clearing something that should be cleared between context switches in all cases; in other words, the mitigations probably fix this by accident, which is a good clue. It'd be nice if the mitigations were more fine-grained than all or nothing, so we could turn them on one by one, but the microcode is not open so we can't do this :(.
A lot of these mitigations are joint efforts between the kernel and the CPU, and the mitigations kernel parameter isn't a binary on/off; you can configure individual mitigations. It's not a single flag that signals the microcode to do spooky proprietary things.
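As a concrete example of that kernel/CPU cooperation: Linux exposes per-task speculation controls through prctl(), so a process can query (or tighten) the indirect-branch and store-bypass mitigations for itself rather than relying on one global switch. A minimal sketch, assuming a kernel and headers new enough to define PR_GET_SPECULATION_CTRL:

    /* Query the per-task speculation-control state the kernel manages
     * jointly with the CPU (IBPB/STIBP for indirect branches, SSBD for
     * store bypass). Needs a kernel with PR_GET_SPECULATION_CTRL. */
    #include <stdio.h>
    #include <sys/prctl.h>
    #include <linux/prctl.h>

    static void show(const char *name, int which) {
        int ret = prctl(PR_GET_SPECULATION_CTRL, which, 0, 0, 0);
        if (ret < 0) {
            printf("%-16s (not supported on this kernel/CPU)\n", name);
            return;
        }
        /* ret is a bitmask: PR_SPEC_PRCTL means it is controllable per task,
         * PR_SPEC_ENABLE/PR_SPEC_DISABLE report the current state. */
        printf("%-16s raw=0x%x%s%s%s\n", name, ret,
               (ret & PR_SPEC_PRCTL)   ? " [per-task control]" : "",
               (ret & PR_SPEC_ENABLE)  ? " [speculation enabled]" : "",
               (ret & PR_SPEC_DISABLE) ? " [speculation disabled]" : "");
    }

    int main(void) {
        show("indirect branch", PR_SPEC_INDIRECT_BRANCH);
        show("store bypass",    PR_SPEC_STORE_BYPASS);
        return 0;
    }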
Mitigations can have very large effects on system stability; I ran into this with my old Haswell system. Mitigations on allowed, I believe, 100MHz higher at lower voltage - but it may have been 200MHz. Those settings were tested over many months, completely stable, and mitigations off with the same settings wouldn't even allow booting. A huge effect relative to anything else, basically like adding an extra 0.1V of vcore.
Perhaps AMD has also gone down the same path. It's unfortunate that the hardware industry has also formalised planned obsolescence ("end of life" and similar phrasings) which drastically decreases the motivation to achieve perfection, as they can then defer "fixing" anything to "buy the new model" (complete with its own, unknown, new set of bugs...)
Unfortunately, hardware is a hard thing to fix after it's out the door, especially when the brainbox itself is the problem. Say I make a CPU and release it to the world. A bug is discovered 6 months later, after thousands have bought my CPU. I can't simply go "oopsie doodle, let's replace your CPU for free to fix the problem", as that'd kill the business, and the returned CPU is effectively trash or a paperweight because it can't be reused (without melting down, separation, etc). So a new CPU is created, and a new generation of the chip is formed with the fix in place.
CPU manufacturers have semi-solved some of this through microcode, which is a way to control the CPU's workings and do really 'stupid' things like burning efuses, which are what manufacturers have designed to allow poking the chip by permanently setting values, causing a trace in the CPU to literally burn and locking it in place. These are handy for early things like disabling cores, locking frequencies, etc, but they come with the caveat that you can only do it once. So again, back to the oopsie doodle. We might be able to pop an efuse through the microcode, if we thought ahead and knew there was potential for issues, but that might bring other issues, like we see now (decreased performance, etc). And so the only answer is, again, a new CPU and a new generation.
There's really no way around the rinse-and-repeat cycle of hardware. What does chap my balls about the CPU industry, though, is all the crap we keep doing in the name of compatibility, or because what we have now has been layered on a known "working" process for a long time. There are feature sets and design details that have been stuck in the CPU since the dawn of x86, and that is really where the problem lies. I saw a talk, or maybe an article, ages ago that I'm unable to find; there are parts of the CPU that we just keep using over and over in the name of compatibility, which is a real problem. I won't go into all the compat things, but look how long it's taken for RISC or ARM to really gain traction. It's because I can't simply take a copy of whatever program and slap it onto a different architecture without lots of work, which is what removing some of these existing x86 features would effectively require.
I do think there's a good solve for some part of this, but I don't think it's really the best, because of speed setbacks. If we take the "chiplet" approach but modularize it, we can start replacing pieces of the CPU without replacing the whole thing. For example, say we had a CPU socket that took a bunch of what I'll call "nanochips", which are the individual CPU components (individually replaceable cores, cache, etc) broken out into their own little chips. If a piece has a problem, say a hardware-level exploit not correctable by efuse, fixing it becomes much easier and cheaper. This isn't a whole solution though, because of evolution and the demand for more IO, speed, etc. Things will always be changing.
I think there was a post a few months ago that in recent AMD CPUs with up-to-date microcode, mitigations=off actually runs slower than leaving them on. This seems to be the final nail in the coffin for turning them off, at least on AMD. In the very beginning we decided to turn them off on our fleet of Linux devices, but stopped after that post, even though we don't even run any AMD currently.
That is more of a myth; it makes no sense and is mostly the result of people benchmarking random nonsense. mitigations=off is 1-2% faster when compiling open source software, and on anything that isn't last-gen silicon the mitigations actually have a major impact on performance.
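For anyone who'd rather measure than trade anecdotes: most of the overhead sits on kernel entry/exit, so a syscall-heavy microbenchmark makes the difference between a normal boot and mitigations=off easy to see. A toy C sketch (not a substitute for real workloads like a kernel compile), run once on each boot and compare:

    /* Time a large number of cheap syscalls; compare the per-syscall cost
     * between a default boot and a mitigations=off boot. */
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    int main(void) {
        const long iters = 5 * 1000 * 1000;
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < iters; i++)
            syscall(SYS_getpid);           /* raw syscall, so we measure kernel entry/exit */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("%.1f ns per getpid() syscall\n", ns / iters);
        return 0;
    }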
https://www.youtube.com/live/1UnoBfw6soI