Yes, but they are reasoning within their dataset, which will contain multiple examples of HTML+CSS clocks.
They are just struggling to produce good results because, as language models, they don't have great spatial reasoning skills.
Their output normally has all the elements, just not in the right place/shape/orientation.
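To be concrete about where the spatial reasoning comes in, here's a minimal sketch of the kind of thing they have to get right (class names and values are illustrative, not any particular model's output): the hands only land correctly if the anchoring, the transform-origin, and the rotation angles all line up, which is exactly the geometric detail that tends to come out wrong.

```html
<!-- Minimal sketch of an analog clock face; names and numbers are illustrative -->
<div class="clock">
  <div class="hand hour"></div>
  <div class="hand minute"></div>
</div>
<style>
  .clock {
    position: relative;
    width: 200px; height: 200px;
    border: 4px solid #333;
    border-radius: 50%;
  }
  .hand {
    position: absolute;
    left: 50%;
    bottom: 50%;              /* pin the hand's base at the dial centre */
    transform-origin: bottom; /* rotate about the base, not the middle of the hand */
    background: #333;
  }
  /* showing 10:10 — angles measured clockwise from 12 o'clock */
  .hour   { width: 6px; height: 50px; transform: translateX(-50%) rotate(300deg); }
  .minute { width: 4px; height: 80px; transform: translateX(-50%) rotate(60deg); }
</style>
```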
So I suspect it's more that lessons from diffusion image models don't carry over to text LLMs.
And the image models that are built on multimodal LLMs (like Nano Banana) seem to do a lot better at novel concepts.