
These types of tests are fundamentally flawed. I was able to create a perfect clock using Gemini 2.5 Pro - https://gemini.google.com/share/136f07a0fa78


The website is regenerating the clocks every minute. When I opened it, Gemini 2.5 was the only working one. Now, they are all broken.

Also, your example is not showing the current time.


It wouldn't be hard to tell it to pick up the browser time as the default starting point. That's just one more line in the prompt.
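For anyone wondering what "start from the current time" amounts to, here is a rough sketch of the arithmetic in Python. The function name `hand_angles` is made up for illustration, and a browser widget would read the time from its own clock rather than from `datetime`; only the angle math is the point.

```python
from datetime import datetime


def hand_angles(now: datetime) -> tuple[float, float, float]:
    """Return (hour, minute, second) hand angles in degrees, clockwise from 12."""
    # Fractional units let each hand move smoothly instead of jumping.
    seconds = now.second + now.microsecond / 1_000_000
    minutes = now.minute + seconds / 60
    hours = (now.hour % 12) + minutes / 60
    # 360 degrees / 12 hours = 30 per hour; 360 / 60 = 6 per minute or second.
    return hours * 30.0, minutes * 6.0, seconds * 6.0
```

Calling `hand_angles(datetime.now())` once per tick and rotating each hand about the center of the face by the returned angle gives a clock that starts at the actual time instead of midnight/noon.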


Even Gemini Flash did really well for me[0] using two prompts - the initial query and one to fix the only error I could identify.

> Please generate an analog clock widget, synchronized to actual system time, with hands that update in real time and a second hand that ticks at least once per second. Make sure all the hour markings are visible and put some effort into making a modern, stylish clock face.

Followed by:

> Currently the hands are working perfectly but they're translated incorrectly making then uncentered. Can you ensure that each one is translated to the correct position on the clock face?

[0] https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...


I don't think this is a serious test. It's just an art piece to contrast different LLMs taking on the same task, and against themselves since it updates every minute. One minute one of the results was really good for me and the next minute it was very, very bad.


Aren't they attempting to also display current time though? Your share is a clock starting at midnight/noon. Kimi K2 seems to be the best on each refresh.


How are they flawed?


The results are not reproducible, as evidenced by the parent poster.


isn't that kind of the point of non-determinism?


No. Good nondeterministic models reproducibly generate equally desirable output - not identical output, but interchangeable.
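One way to make "reproducibly desirable" concrete is to sample the same prompt many times and look at the fraction of acceptable outputs rather than at any single run. A minimal sketch, assuming you have some way to call the model and some way to judge a result; `generate`, `is_acceptable`, and the stub generator below are all hypothetical stand-ins, not a real API.

```python
from itertools import cycle
from typing import Callable


def pass_rate(
    generate: Callable[[], str],
    is_acceptable: Callable[[str], bool],
    runs: int = 20,
) -> float:
    """Fraction of independent generations that the checker accepts."""
    passed = sum(1 for _ in range(runs) if is_acceptable(generate()))
    return passed / runs


# Hypothetical stub standing in for a model call: alternates good and bad output.
_outputs = cycle(["good clock", "broken clock"])


def flaky_generator() -> str:
    return next(_outputs)
```

With the stub, `pass_rate(flaky_generator, lambda s: s.startswith("good"), runs=10)` comes out at 0.5. A model that is good in the sense described above would score close to 1.0 even though no two outputs are identical.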


oh I see, thank you for clarifying



