> No one has come close to solving the problem of optimizing software for multiple heterogeneous CPUs with differing micro-architectures when the scheduler is 'randomly' placing threads.
I think this isn't wholly correct. The comp-sci part of things is pretty well figured out. You can do work-stealing parallelism to keep queues filled with decent latency, and you can even dynamically adjust work distribution to thread performance (i.e. manual scheduling). It's not trivial to use the best parallelism techniques on a heterogeneous architecture, especially when it comes to adapting existing code bases that aren't fundamentally compatible with those techniques. Things get even more interesting once you take cache locality, IO, and library/driver interactions into consideration. However, I think it's more accurately described as an adoption problem than something that's unsolved.
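To make the "dynamically adjust work distribution" bit concrete, here's a minimal C++ sketch of the simplest variant of the idea (chunked self-scheduling rather than full work-stealing deques; the constants and names are invented for illustration). Workers pull chunks from a shared counter as they finish, so a faster core simply ends up claiming more chunks, with no knowledge of which core is which:

```c++
#include <algorithm>
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

// Self-scheduling loop: each worker grabs the next fixed-size chunk of work
// when it finishes its current one, so a fast core naturally processes more
// chunks than a slow one. The balancing needs no core-type awareness.
int main() {
    constexpr int kItems = 1'000'000;
    constexpr int kChunk = 4096;
    std::vector<int> data(kItems, 1);
    std::atomic<int> next{0};
    std::atomic<long long> total{0};

    auto worker = [&] {
        long long local = 0;
        for (;;) {
            int begin = next.fetch_add(kChunk);
            if (begin >= kItems) break;
            int end = std::min(begin + kChunk, kItems);
            for (int i = begin; i < end; ++i) local += data[i];
        }
        total += local;  // one contended add per thread, not per item
    };

    unsigned n = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> pool;
    for (unsigned i = 0; i < n; ++i) pool.emplace_back(worker);
    for (auto &t : pool) t.join();
    std::printf("total = %lld\n", total.load());  // prints 1000000
}
```

Real work-stealing runtimes (TBB, rayon, and the like) generalize this with per-thread deques, but the adaptive property is the same.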
It took many years after their debut for homogeneous multicore processors to be well supported across software, for similar reasons. There are still actively played games that don't appreciably leverage multiple cores (e.g. Starcraft 2 does some minimal offloading to a second thread).
I'm not sure I understand what you're trying to say here WRT CPU microarch optimization when there are multiple CPU microarches in the machine. Maybe something about SMT/hyperthreading? But that doesn't appear to be what you're saying either.
AKA: I'm talking about the uplift one gets from, say, -march=native (or your arch of choice), FDO/PGO, and various other optimization choices. Ex: instruction selection for OoO cores. The compiler can know that core X has only two functional units capable of some operation and that your code's critical path is bottlenecked by those operations, and it can adjust the instruction mix to (mis)use some other functional unit in parallel: two units doing X, and one doing Y. Or just load-to-use latency, avoidance of certain sequences, etc.
Those optimizations are tightly bound to a given core type. Sure, modern OoO cores do a better job of keeping units busy, but it's not uncommon to be working around some core deficiency by tweaking the compiler heuristics even now; trawling through the gcc machine definitions turns up plenty of examples.
So, when the CPUs are heterogeneous with differing optimization targets, the code author ends up picking a 'generic' optimization target, and that decision by itself can frequently mean leaving a generation or two of performance behind versus the usual method of just building a handful of shared libraries or the like and picking one at runtime based on the CPU type.
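As a sketch of that runtime-selection method: on x86 Linux with a reasonably recent GCC, the target_clones attribute does the variant building and the pick-at-startup for you (the ISA levels listed below are just an example). The catch, per the point above, is that the resolver runs once per process, so the choice can't follow a thread migrating between heterogeneous cores:

```c++
#include <cstddef>
#include <cstdio>

// GCC emits one clone of this function per listed target plus an ifunc
// resolver; the dynamic linker runs the resolver once at startup and binds
// the best clone for the CPU it detects. The choice is per-process.
__attribute__((target_clones("avx2", "sse4.2", "default")))
double dot(const double *a, const double *b, std::size_t n) {
    double acc = 0.0;
    for (std::size_t i = 0; i < n; ++i) acc += a[i] * b[i];
    return acc;
}

int main() {
    double x[] = {1, 2, 3, 4};
    double y[] = {4, 3, 2, 1};
    std::printf("dot = %f\n", dot(x, y, 4));  // prints dot = 20.000000
}
```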
Although, sure, an application author can on some platforms hook a rescheduling notification and then run a custom thread-local jump-table update to reconfigure which code paths are being run, or some other non-standard operation. Or, for that matter, they can just set their threads' affinity to a matching set of cores. But none of this is a core operation in any of the normal runtime environments; it takes considerable effort on the part of the application vendor.
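For the affinity route, here's a minimal Linux sketch (the assumption that CPUs 4-7 are the "big" cores is invented for illustration; a real program would read the topology from sysfs rather than hard-coding it):

```c++
#define _GNU_SOURCE 1  // for cpu_set_t/CPU_SET; g++ on Linux defines this already
#include <pthread.h>
#include <sched.h>
#include <cstdio>

int main() {
    // Hypothetical topology: pretend CPUs 4..7 are the big cores. In
    // practice this comes from /sys/devices/system/cpu/, not a constant.
    cpu_set_t big;
    CPU_ZERO(&big);
    for (int cpu = 4; cpu <= 7; ++cpu) CPU_SET(cpu, &big);

    // Pin the calling thread so whatever arch-specific code path it picked
    // stays valid; the scheduler will no longer move it to a small core.
    int rc = pthread_setaffinity_np(pthread_self(), sizeof(big), &big);
    if (rc != 0) {
        std::fprintf(stderr, "pthread_setaffinity_np failed: %d\n", rc);
        return 1;
    }
    std::puts("pinned to the big cores");
    return 0;
}
```

Compile with -pthread. Of course, this trades away the scheduler's freedom to balance load, which is exactly the kind of non-standard effort I mean.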
Yeah, sorry, everything you're saying is right. Compilers won't do the work for you. I just took issue with the wording about it being unsolved. If we can produce optimal binaries for a given process for multiple architectures, we can also swap them as needed. I don't think any big new ideas need to come around, just work to implement the ideas we have.
By the way, compilers can conceivably do "lowest common denominator" architecture optimization to get decent perf on heterogeneous cores as a compromise, without leaning into every optimization for both core types.