Look, I'm not defending the big labs, I think they're terrible in a lot of ways. And I'm actually suspending judgement on whether there is ~some kind of nerf happening.
But the anecdote you're describing is the definition of non-empirical. It is entirely subjective, based entirely on your experience and personal assessment.
It's not non-empirical. He was careful to give it the same experiment twice. The dependent variable is his judgment, sure, but why shouldn't we trust that if he's an experienced SWE?
Unless he was able to sample with temperature 0 (and get fully deterministic results both times), this could just be random chance. And experience as an SWE doesn't imply experience with statistics and experiment design.
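To put a rough number on the "random chance" point, here is a minimal sketch. It assumes a toy model in which each run of the same prompt independently "succeeds" with a fixed probability; `p_success`, the function names, and the example values are hypothetical illustrations, not measurements of any real model.

```python
import random


def disagreement_probability(p_success: float) -> float:
    """Chance that two independent runs of the same prompt give different outcomes.

    Assumes each run is an independent pass/fail draw with the same
    success probability (a simplifying assumption).
    """
    return 2 * p_success * (1 - p_success)


def simulate_disagreement(p_success: float, trials: int = 100_000, seed: int = 0) -> float:
    """Estimate the same quantity by simulating pairs of runs."""
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    differing = 0
    for _ in range(trials):
        first = rng.random() < p_success
        second = rng.random() < p_success
        if first != second:
            differing += 1
    return differing / trials


if __name__ == "__main__":
    for p in (0.5, 0.7, 0.9):
        print(p, disagreement_probability(p), simulate_disagreement(p))
```

Under this toy model, an unchanged model that succeeds 70% of the time would give one good run and one bad run in roughly 42% of before/after pairs, so a single pair of runs can't distinguish a nerf from sampling noise.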
> But the anecdote you're describing is the definition of non-empirical. It is entirely subjective, based entirely on your experience and personal assessment.
Well, if we see it this way, this is true for Anthropic’s benchmarks as well.
Btw the definition of empirical is: “based on observation or experience rather than theory or pure logic”
So what I described is the exact definition of empirical.