> No one has come close to solving the problem of optimizing software for multiple heterogeneous CPUs with differing micro-architectures when the scheduler is 'randomly' placing threads.
I think this isn't wholly correct. The comp-sci part of things is pretty well figured out. You can do work-stealing parallelism to keep queues filled with decent latency, and you can even dynamically adjust work distribution to thread performance (i.e. manual scheduling). It's not trivial to use the best parallelism techniques on a heterogeneous architecture, especially when it comes to adapting existing code bases that aren't fundamentally compatible with those techniques. Things get even more interesting once you take cache locality, IO, and library/driver interactions into consideration. However, I think it's more accurately described as an adoption problem than something that's unsolved.
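To make the "dynamically adjust work distribution" bit concrete, here's a minimal C++ sketch of the simplest variant of the idea (chunked self-scheduling rather than full work-stealing deques; the constants and names are invented for illustration). Workers pull chunks from a shared counter as they finish, so a faster core simply ends up claiming more chunks, with no knowledge of which core is which:

```c++
#include <algorithm>
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

// Self-scheduling loop: each worker grabs the next fixed-size chunk of work
// when it finishes its current one, so a fast core naturally processes more
// chunks than a slow one. The balancing needs no core-type awareness.
int main() {
    constexpr int kItems = 1'000'000;
    constexpr int kChunk = 4096;
    std::vector<int> data(kItems, 1);
    std::atomic<int> next{0};
    std::atomic<long long> total{0};

    auto worker = [&] {
        long long local = 0;
        for (;;) {
            int begin = next.fetch_add(kChunk);
            if (begin >= kItems) break;
            int end = std::min(begin + kChunk, kItems);
            for (int i = begin; i < end; ++i) local += data[i];
        }
        total += local;  // one contended add per thread, not per item
    };

    unsigned n = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> pool;
    for (unsigned i = 0; i < n; ++i) pool.emplace_back(worker);
    for (auto &t : pool) t.join();
    std::printf("total = %lld\n", total.load());  // prints 1000000
}
```

Real work-stealing runtimes (TBB, rayon, and the like) generalize this with per-thread deques, but the adaptive property is the same.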
It took many years after their debut for homogeneous multicore processors to be well supported across software, for similar reasons. There are still actively played games that don't appreciably leverage multiple cores (e.g. Starcraft 2 does some minimal offloading to a second thread).
I'm not sure I understand what you're trying to say here WRT CPU microarch optimization when there are multiple CPU microarches in the machine. Maybe something about SMT/hyperthreading? But that doesn't appear to be what you're saying either.
AKA: I'm talking about the uplift one gets from, say, -march=native (or your arch of choice), FDO/PGO, and various other optimization choices. Ex: instruction selection for OoO cores. The compiler can know that core X has only two functional units capable of some operation and that your code's critical path is bottlenecked by those operations, and it can adjust the instruction mix to (mis)use some other functional unit in parallel: two units doing X, and one doing Y. Or just load-to-use latency, avoidance of certain sequences, etc.
Those optimizations are tightly bound to a given core type. Sure, modern OoO cores do a better job of keeping units busy, but it's not uncommon to be working around some core deficiency by tweaking the compiler heuristics even now; trawling through the gcc machine definitions turns up plenty of examples.
So, when the CPUs are heterogeneous with differing optimization targets, the code author ends up picking a 'generic' optimization target, and that decision by itself can frequently mean leaving a generation or two of performance behind versus the usual method of just building a handful of shared libraries or the like and picking one at runtime based on the CPU type.
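As a sketch of that runtime-selection method: on x86 Linux with a reasonably recent GCC, the target_clones attribute does the variant building and the pick-at-startup for you (the ISA levels listed below are just an example). The catch, per the point above, is that the resolver runs once per process, so the choice can't follow a thread migrating between heterogeneous cores:

```c++
#include <cstddef>
#include <cstdio>

// GCC emits one clone of this function per listed target plus an ifunc
// resolver; the dynamic linker runs the resolver once at startup and binds
// the best clone for the CPU it detects. The choice is per-process.
__attribute__((target_clones("avx2", "sse4.2", "default")))
double dot(const double *a, const double *b, std::size_t n) {
    double acc = 0.0;
    for (std::size_t i = 0; i < n; ++i) acc += a[i] * b[i];
    return acc;
}

int main() {
    double x[] = {1, 2, 3, 4};
    double y[] = {4, 3, 2, 1};
    std::printf("dot = %f\n", dot(x, y, 4));  // prints dot = 20.000000
}
```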
Although, sure, an application author can on some platforms hook a rescheduling notification and then run a custom thread-local jump-table update to reconfigure which code paths are being run, or some other non-standard operation. Or, for that matter, they can just set their threads' affinity to a matching set of cores. But none of this is a core operation in any of the normal runtime environments; it takes considerable effort on the part of the application vendor.
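For the affinity route, here's a minimal Linux sketch (the assumption that CPUs 4-7 are the "big" cores is invented for illustration; a real program would read the topology from sysfs rather than hard-coding it):

```c++
#define _GNU_SOURCE 1  // for cpu_set_t/CPU_SET; g++ on Linux defines this already
#include <pthread.h>
#include <sched.h>
#include <cstdio>

int main() {
    // Hypothetical topology: pretend CPUs 4..7 are the big cores. In
    // practice this comes from /sys/devices/system/cpu/, not a constant.
    cpu_set_t big;
    CPU_ZERO(&big);
    for (int cpu = 4; cpu <= 7; ++cpu) CPU_SET(cpu, &big);

    // Pin the calling thread so whatever arch-specific code path it picked
    // stays valid; the scheduler will no longer move it to a small core.
    int rc = pthread_setaffinity_np(pthread_self(), sizeof(big), &big);
    if (rc != 0) {
        std::fprintf(stderr, "pthread_setaffinity_np failed: %d\n", rc);
        return 1;
    }
    std::puts("pinned to the big cores");
    return 0;
}
```

Compile with -pthread. Of course, this trades away the scheduler's freedom to balance load, which is exactly the kind of non-standard effort I mean.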
Yeah, sorry, everything you're saying is right. Compilers won't do the work for you. I just took issue with the wording about it being unsolved. If we can produce optimal binaries for a given process for multiple architectures, we can also swap them as needed. I don't think any big new ideas need to come around, just work to implement the ideas we have.
By the way, compilers can conceivably do "lowest common denominator" architecture optimization to get decent perf on heterogeneous cores as a compromise, without leaning into every optimization for both core types.