Pretty shocking to see (what should be extremely unlikely) non-malicious code work / not work depending on a security mitigation setting.
Curious to see where this goes: whether it's a kernel bug and nobody is paying attention to `mitigations=off` anymore, or the unlikely outcome that it's an actual hardware bug that the mitigations happen to work around and nobody has noticed.
Seems he had overclocked, and maybe disabling the mitigation has revealed an instability in a setup that's otherwise stable (or at best marginal).
mitigations=off does a lot of things; it's a bundle of options.
Perhaps the next step is trying to figure out which mitigation exactly causes it to fail. Then kernel peeps should have a fighting chance of tracking this down.
In the video comments it looks like he's done some of that and it's one of the Spectre v2 mitigations. Makes me wonder if there's now less QA happening on the OEM side for configurations that boot without the mitigations on.
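If anyone wants to see what the kernel actually applied on a given boot, the per-vulnerability status is exposed under /sys/devices/system/cpu/vulnerabilities/. Here's a minimal C sketch that just dumps those files, so a mitigations=off boot can be compared against individual overrides like spectre_v2=off or nospectre_v2:

    /* Print the kernel's per-vulnerability mitigation status by reading
     * the files under /sys/devices/system/cpu/vulnerabilities/. */
    #include <dirent.h>
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        const char *dirpath = "/sys/devices/system/cpu/vulnerabilities";
        DIR *dir = opendir(dirpath);
        if (!dir) {
            perror(dirpath);
            return 1;
        }
        struct dirent *ent;
        while ((ent = readdir(dir)) != NULL) {
            if (ent->d_name[0] == '.')
                continue;                       /* skip "." and ".." */
            char path[512], line[256];
            snprintf(path, sizeof path, "%s/%s", dirpath, ent->d_name);
            FILE *f = fopen(path, "r");
            if (!f)
                continue;
            if (fgets(line, sizeof line, f)) {
                line[strcspn(line, "\n")] = '\0';
                /* e.g. "spectre_v2   Mitigation: Enhanced / Automatic IBRS; ..." */
                printf("%-24s %s\n", ent->d_name, line);
            }
            fclose(f);
        }
        closedir(dir);
        return 0;
    }

Diffing that output between the working and failing boots would at least confirm which knob is the one that matters.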
SMT off fixing it is a good clue. It heavily points to the problem being within a core rather than between execution cores. That's still a broad area, though: some state within a core is inadvertently shared between the sibling threads running on it, but there are a lot of possibilities for what that state could be.
Is that state in the CPU or in the kernel? I guess if you have per-core state in the kernel that's only accessed by that core, then you're safe to modify it without locking, but this isn't the case for data shared between the hyperthreads of a core?
Within the CPU core (terminology here gets awkward). Spectre, in its simplest-to-exploit form, used the fact that branch-prediction state wasn't completely cleared between running different threads on the CPU, which led to a really easy timing attack: time your own code to see which way the branch in the previous code likely went. There are a lot of CVEs around this, but that's the simplest case.
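To make the timing angle concrete, here's a toy C sketch (not an exploit, just an illustration of the signal): the same branch runs measurably faster when its outcome is predictable than when it's random, because the predictor's learned state shows up directly in execution time. Compile with -O0 (or check the asm) so the compiler keeps a real branch rather than turning it into a conditional move.

    /* Toy illustration of branch-predictor state showing up as timing:
     * the identical loop is faster when the branch outcome is predictable.
     * This is NOT a Spectre exploit, just a demo of the timing signal. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1 << 22)

    static double run_ms(const unsigned char *cond) {
        struct timespec t0, t1;
        volatile long sum = 0;             /* volatile so the loop isn't optimized away */
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < N; i++) {
            if (cond[i])                   /* this branch is what the predictor learns */
                sum += i;
            else
                sum -= i;
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        return (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
    }

    int main(void) {
        unsigned char *predictable = malloc(N), *random_pat = malloc(N);
        for (int i = 0; i < N; i++) {
            predictable[i] = 1;            /* always taken: trivially predicted */
            random_pat[i] = rand() & 1;    /* ~50% mispredictions */
        }
        printf("predictable pattern: %.2f ms\n", run_ms(predictable));
        printf("random pattern:      %.2f ms\n", run_ms(random_pat));
        free(predictable);
        free(random_pat);
        return 0;
    }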
The mitigations work by clearing everything they can when the CPU switches contexts: flushing loaded local CPU cache, clearing the branch prediction table, and discarding speculative execution state. They likely avoid a crash by clearing something that should be cleared between context switches in all cases; in other words, the mitigations probably fix this by accident, which is a good clue. It'd be nice if the mitigations were more fine-grained than all or nothing, so we could turn them on one by one, but the microcode is not open so we can't do this :(.
A lot of these mitigations are joint efforts between the kernel and the CPU, and the mitigations kernel parameter isn't a binary on/off; you can configure individual mitigations. It's not a single flag that signals the microcode to do spooky proprietary things.
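As a concrete example of that kernel/CPU cooperation: Linux exposes per-task speculation controls through prctl(), so a process can query (or tighten) the indirect-branch and store-bypass mitigations for itself rather than relying on one global switch. A minimal sketch, assuming a kernel and headers new enough to define PR_GET_SPECULATION_CTRL:

    /* Query the per-task speculation-control state the kernel manages
     * jointly with the CPU (IBPB/STIBP for indirect branches, SSBD for
     * store bypass). Needs a kernel with PR_GET_SPECULATION_CTRL. */
    #include <stdio.h>
    #include <sys/prctl.h>
    #include <linux/prctl.h>

    static void show(const char *name, int which) {
        int ret = prctl(PR_GET_SPECULATION_CTRL, which, 0, 0, 0);
        if (ret < 0) {
            printf("%-16s (not supported on this kernel/CPU)\n", name);
            return;
        }
        /* ret is a bitmask: PR_SPEC_PRCTL means it is controllable per task,
         * PR_SPEC_ENABLE/PR_SPEC_DISABLE report the current state. */
        printf("%-16s raw=0x%x%s%s%s\n", name, ret,
               (ret & PR_SPEC_PRCTL)   ? " [per-task control]" : "",
               (ret & PR_SPEC_ENABLE)  ? " [speculation enabled]" : "",
               (ret & PR_SPEC_DISABLE) ? " [speculation disabled]" : "");
    }

    int main(void) {
        show("indirect branch", PR_SPEC_INDIRECT_BRANCH);
        show("store bypass",    PR_SPEC_STORE_BYPASS);
        return 0;
    }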
Mitigations can have very large effects on system stability; I ran into this with my old Haswell system. Mitigations on allowed, I believe, 100MHz higher at lower voltage - but it may have been 200MHz. Those settings were tested over many months, completely stable, and mitigations off with the same settings wouldn't even allow booting. A huge effect relative to anything else, basically like adding an extra 0.1V of vcore.
Perhaps AMD has also gone down the same path. It's unfortunate that the hardware industry has also formalised planned obsolescence ("end of life" and similar phrasings) which drastically decreases the motivation to achieve perfection, as they can then defer "fixing" anything to "buy the new model" (complete with its own, unknown, new set of bugs...)
Unfortunately, hardware is a hard thing to fix after it's out the door, especially when the brainbox itself is the problem. Say I make a CPU and release it to the world. A bug is discovered 6 months later, after thousands have bought my CPU. I can't simply go "oopsie doodle, let's replace your CPU for free to fix the problem", as that'd kill the business, and the returned CPU is effectively trash or a paperweight because it can't be reused (without melting down, separation, etc). So a new CPU is created, and a new generation of the chip is formed with the fix in place.
CPU manufacturers have semi-solved some of this through microcode, which is a way to control the CPU's workings and do really 'stupid' things like burning efuses, which are what manufacturers have designed to allow poking the chip by permanently setting values, causing a trace in the CPU to literally burn and locking it in place. These are handy for early things like disabling cores, locking frequencies, etc, but they come with the caveat that you can only do it once. So again, back to the oopsie doodle. We might be able to pop an efuse through the microcode, if we thought ahead and knew there was potential for issues, but that might bring other issues, like we see now (decreased performance, etc). And so the only answer is, again, a new CPU and a new generation.
There's really no way around the rinse-and-repeat cycle of hardware. What does chap my balls about the CPU industry, though, is all the crap we keep doing in the name of compatibility, or because what we have now has been layered on a known "working" process for a long time. There are feature sets and design details that have been stuck in the CPU since the dawn of x86, and that is really where the problem lies. I saw a talk, or maybe an article, ages ago that I'm unable to find; there are parts of the CPU that we just keep using over and over in the name of compatibility, which is a real problem. I won't go into all the compat things, but look how long it's taken for RISC or ARM to really gain traction. It's because I can't simply take a copy of whatever program and slap it onto a different architecture without lots of work, which is what removing some of these existing x86 features would effectively require.
I do think there's a good solve for some part of this, but I don't think it's really the best, because of speed setbacks. If we take the "chiplet" approach but modularize it, we can start replacing pieces of the CPU without replacing the whole thing. For example, say we had a CPU socket that took a bunch of what I'll call "nanochips", which are the individual CPU components (individually replaceable cores, cache, etc) broken out into their own little chips. If a piece has a problem, say a hardware-level exploit not correctable by efuse, fixing it becomes much easier and cheaper. This isn't a whole solution though, because of evolution and the demand for more IO, speed, etc. Things will always be changing.
I think there was a post a few months ago that in recent AMD CPUs with up-to-date microcode, mitigations=off actually runs slower than leaving them on. This seems to be the final nail in the coffin for turning them off, at least on AMD. In the very beginning we decided to turn them off on our fleet of Linux devices, but stopped after that post, even though we don't even run any AMD currently.
That is more of a myth; it makes no sense and is mostly the result of people benchmarking random nonsense. mitigations=off is 1-2% faster when compiling open source software, and on anything that isn't last-gen silicon the mitigations actually have a major impact on performance.
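For anyone who'd rather measure than trade anecdotes: most of the overhead sits on kernel entry/exit, so a syscall-heavy microbenchmark makes the difference between a normal boot and mitigations=off easy to see. A toy C sketch (not a substitute for real workloads like a kernel compile), run once on each boot and compare:

    /* Time a large number of cheap syscalls; compare the per-syscall cost
     * between a default boot and a mitigations=off boot. */
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    int main(void) {
        const long iters = 5 * 1000 * 1000;
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < iters; i++)
            syscall(SYS_getpid);           /* raw syscall, so we measure kernel entry/exit */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("%.1f ns per getpid() syscall\n", ns / iters);
        return 0;
    }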
https://www.youtube.com/live/1UnoBfw6soI