It's so easy to ship completely broken AI features because you can't really unit test them, and unit tests have been the main standard for judging whether code works for a long time now.
The most successful AI companies (OpenAI, Anthropic, Cursor) are all dogfooding their products as far as I can tell, and I don't really see any other reliable way to make sure the AI feature you ship actually works.
Tests are called "evals" (evaluations) in the AI product development world. Basically you let humans review the LLM output, or you feed it to another LLM with instructions on how to evaluate it.
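Very roughly, the LLM-as-judge flavor looks something like the sketch below. This is just an illustration: `call_llm` is a stand-in for whatever model client you actually use, and the rubric, eval cases, and scoring scale are made up.

```python
# Minimal LLM-as-judge eval sketch (illustrative only).
# call_llm() is a placeholder for your real model client
# (OpenAI, Anthropic, a local model, etc.).

JUDGE_PROMPT = """You are grading the output of an AI feature.
Task: {task}
Answer to grade: {answer}
Score the answer from 1 (useless) to 5 (correct and helpful).
Reply with only the number."""


def call_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to your model of choice and return its text."""
    raise NotImplementedError("plug in your actual model client here")


def judge(task: str, answer: str) -> int:
    """Ask a second model to grade the first model's answer."""
    reply = call_llm(JUDGE_PROMPT.format(task=task, answer=answer))
    return int(reply.strip())


def run_evals(tasks: list[str]) -> float:
    """Generate an answer per task, have the judge score it, return the average."""
    scores = []
    for task in tasks:
        answer = call_llm(task)          # output of the feature under test
        scores.append(judge(task, answer))
    return sum(scores) / len(scores)
```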
Interesting, I'd never really thought about it outside of this comment chain, but I'm guessing approaches like this undercut the typical automated testing devs would do. And seeing how this is MSFT (who stopped having dedicated testing roles a good while ago, RIP SDET roles), I can only imagine the quality culture is even worse for "AI" teams.
Traditional Microsoft devs are used to deterministic tests (assert result == expected), whereas AI requires probabilistic evals and quality monitoring in prod. I think Microsoft simply lacks the LLMOps culture right now to build a quality evaluation pipeline before release; they are testing everything on users.
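To make the contrast concrete, here's a rough sketch. The names, thresholds, and scoring scale are invented for illustration, not anything Microsoft actually runs: a deterministic test demands exact equality on one case, while a probabilistic eval scores a sample of outputs and gates on an aggregate metric you keep monitoring in prod.

```python
from datetime import date

# Deterministic unit test: one input, one exact expected output.
def parse_iso(s: str) -> date:
    y, m, d = map(int, s.split("-"))
    return date(y, m, d)

def test_parse_iso():
    assert parse_iso("2024-01-05") == date(2024, 1, 5)

# Probabilistic eval: run the model over a sample of prompts, score each
# output (human rubric or LLM judge), and gate on an aggregate threshold
# instead of exact equality. `generate` and `score` are whatever your
# feature and grading setup happen to be.
def run_eval(generate, score, cases, min_pass_rate=0.85, passing_score=4):
    scores = [score(case, generate(case["prompt"])) for case in cases]
    pass_rate = sum(s >= passing_score for s in scores) / len(scores)
    return pass_rate >= min_pass_rate, pass_rate

# Usage (sketch): ship only if enough sampled outputs clear the bar,
# then keep tracking the same metric on production traffic.
# ok_to_ship, rate = run_eval(generate=my_model, score=my_judge, cases=eval_cases)
```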
50% of our code is being written by AI! Or at least, autocompleted by AI. And then our developers have to fix 50% of THAT code so that it does what they actually wanted to do in the first place. But boy, it sure produces a lot of words!