Maybe older models?
I tried it again yesterday with GPT. GPT-5 also manages quite well in thinking mode, but starts cracking in instant mode. 4o failed completely.
It's not that LLMs can't solve things like that at all, but it's really easy to find variations that make them struggle badly.