It's so easy to ship completely broken AI features because you can't really unit test them, and unit tests have been the main standard for judging whether code works for a long time now.
The most successful AI companies (OpenAI, Anthropic, Cursor) are all dogfooding their products as far as I can tell, and I don't really see any other reliable way to make sure the AI feature you ship actually works.
Tests are called "evals" (evaluations) in the AI product development world. Basically you let humans review the LLM output, or you feed it to another LLM with instructions on how to evaluate it.
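Very roughly, the LLM-as-judge flavor looks something like the sketch below. This is just an illustration: `call_llm` is a stand-in for whatever model client you actually use, and the rubric, eval cases, and scoring scale are made up.

```python
# Minimal LLM-as-judge eval sketch (illustrative only).
# call_llm() is a placeholder for your real model client
# (OpenAI, Anthropic, a local model, etc.).

JUDGE_PROMPT = """You are grading the output of an AI feature.
Task: {task}
Answer to grade: {answer}
Score the answer from 1 (useless) to 5 (correct and helpful).
Reply with only the number."""


def call_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to your model of choice and return its text."""
    raise NotImplementedError("plug in your actual model client here")


def judge(task: str, answer: str) -> int:
    """Ask a second model to grade the first model's answer."""
    reply = call_llm(JUDGE_PROMPT.format(task=task, answer=answer))
    return int(reply.strip())


def run_evals(tasks: list[str]) -> float:
    """Generate an answer per task, have the judge score it, return the average."""
    scores = []
    for task in tasks:
        answer = call_llm(task)          # output of the feature under test
        scores.append(judge(task, answer))
    return sum(scores) / len(scores)
```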
Interesting, I'd never really thought about it outside of this comment chain, but I'm guessing approaches like this undercut the typical automated testing devs would do. And seeing how this is MSFT (who stopped having dedicated testing roles a good while ago, RIP SDET roles), I can only imagine the quality culture is even worse for "AI" teams.
Traditional Microsoft devs are used to deterministic tests (assert result == expected), whereas AI requires probabilistic evals and quality monitoring in prod. I think Microsoft simply lacks the LLMOps culture right now to build a quality evaluation pipeline before release; they are testing everything on users.
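To make the contrast concrete, here's a rough sketch. The names, thresholds, and scoring scale are invented for illustration, not anything Microsoft actually runs: a deterministic test demands exact equality on one case, while a probabilistic eval scores a sample of outputs and gates on an aggregate metric you keep monitoring in prod.

```python
from datetime import date

# Deterministic unit test: one input, one exact expected output.
def parse_iso(s: str) -> date:
    y, m, d = map(int, s.split("-"))
    return date(y, m, d)

def test_parse_iso():
    assert parse_iso("2024-01-05") == date(2024, 1, 5)

# Probabilistic eval: run the model over a sample of prompts, score each
# output (human rubric or LLM judge), and gate on an aggregate threshold
# instead of exact equality. `generate` and `score` are whatever your
# feature and grading setup happen to be.
def run_eval(generate, score, cases, min_pass_rate=0.85, passing_score=4):
    scores = [score(case, generate(case["prompt"])) for case in cases]
    pass_rate = sum(s >= passing_score for s in scores) / len(scores)
    return pass_rate >= min_pass_rate, pass_rate

# Usage (sketch): ship only if enough sampled outputs clear the bar,
# then keep tracking the same metric on production traffic.
# ok_to_ship, rate = run_eval(generate=my_model, score=my_judge, cases=eval_cases)
```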
50% of our code is being written by AI! Or at least, autocompleted by AI. And then our developers have to fix 50% of THAT code so that it does what they actually wanted to do in the first place. But boy, it sure produces a lot of words!