
> If there's a clear jump in evals from one model to the next (ie Gemini 2 to 2.5, or Claude 3.7 to 4)

How do you know that their evals match behavior in your application? What if the older, "worse" model actually does some things better, but your own domain evals aren't comprehensive enough for you to know to check the things it's good at?

FWIW I agree that in general, you should start with the most powerful model you can afford, and use that to bootstrap your evals. But I do not think you can rely on generic benchmarks and evals as a proxy for your own domain. I've run into this several times where an ostensibly better model does no better than the previous generation.
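To make "bootstrap your evals" concrete, here's a minimal sketch of a domain-specific eval harness. Everything in it is a hypothetical placeholder (the call_model function, the EXAMPLES list, the grade check), not any particular vendor's API — the point is just that you score the old and new models on examples from your own application rather than trusting generic benchmark deltas.

    # Minimal domain-eval sketch. call_model(model_name, prompt) -> str is a
    # placeholder you'd wire up to whatever client you actually use.

    EXAMPLES = [
        # (prompt, expected) pairs drawn from real traffic in your domain
        ("Summarize this support ticket: ...", "refund request"),
        ("Classify the intent: ...", "cancel subscription"),
    ]

    def grade(output: str, expected: str) -> bool:
        # Swap in whatever correctness check your domain needs
        # (exact match, regex, rubric-based grading, etc.).
        return expected.lower() in output.lower()

    def run_eval(model_name: str, call_model) -> float:
        # Fraction of your own examples the model gets right.
        correct = 0
        for prompt, expected in EXAMPLES:
            if grade(call_model(model_name, prompt), expected):
                correct += 1
        return correct / len(EXAMPLES)

    # Usage: compare the ostensibly better new model against the old one
    # on *your* data before switching.
    # old_score = run_eval("previous-generation-model", call_model)
    # new_score = run_eval("latest-model", call_model)
    # print(f"old={old_score:.2%} new={new_score:.2%}")

Even a few dozen examples like this will surface regressions that generic evals never test for.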


