- I think given publicly available metrics, it's clear that this isn't translating into more products/apps getting shipped. That could be because devs are now running into other bottlenecks, but it could also indicate that there's something wrong with these studies.
- Most devs who say AI speeds them up assert numbers much higher than what those studies have shown. Much of the hype around these tools is built on those higher estimates.
- I won't claim to have read every study, but of the ones I have checked in the past, the more the methodology impressed me the less effect it showed.
- Prior to LLMs, it was near universally accepted wisdom that you couldn't really measure developer productivity directly.
- Review is imperfect, and LLMs produce worse code on average than human developers. That should result in somewhat lowered code quality with LLM usage (although that might be an acceptable trade-off for some). The fact that some of these studies didn't find that is another thing that suggests there are shortcomings in said studies.
> - Prior to LLMs, it was near universally accepted wisdom that you couldn't really measure developer productivity directly.
Absolutely, and all the largest studies I've looked at mention this clearly and explain how they try to address it.
> Review is imperfect, and LLMs produce worse code on average than human developers.
Wait, I'm not sure that can be asserted at all. Anecdotally, that's not my experience, and the largest study in the link above explicitly discusses it and finds that proxies for quality (like approval rates) indicate more improvement than decline. The Stanford video accounts for code churn (possibly due to fixing AI-created mistakes) and still finds a clear productivity boost.
My current hypothesis, based on the DORA and DX 2025 reports, is that quality is largely a function of your quality control processes (tests, CI/CD etc.)
That said, I would be very interested in studies you found interesting. I'm always looking for more empirical evidence!
> I see people claiming 20 - 50%, which lines up with the studies above
Most of those studies either measure productivity using useless metrics like lines of code or number of PRs, or are ones whose participants are working for organizations that are heavily invested in future success of AI.
As mentioned in the thread I linked, they acknowledge the productivity puzzle and try to control for it in their studies. It's worth reading them in detail; I feel like many of them did a decent job controlling for many factors.
For instance, when measuring the number of PRs they ensure that each one goes through the same review process whether AI-assisted or not, ensuring the AI-assisted PRs meet the same quality standards as human-written ones.
Furthermore, they did this as a randomized controlled trial comparing engineers without AI to those with AI (in most cases, the same ones over time!) which does control for a lot of the issues with using PRs in isolation as a holistic view of productivity.
>... whose participants are working for organizations that are heavily invested in future success of AI.
That seems pretty ad hom, unless you want to claim they are faking the data, along with their co-authors from premier institutes like NBER, MIT, UPenn, Princeton, etc.
And here's the kicker: they all converge on a similar range of productivity boost, such as the Stanford study:
> https://www.youtube.com/watch?v=tbDDYKRFjhk (from Stanford, not an RCT, but the largest scale with actual commits from 100K developers across 600+ companies, and tries to account for reworking AI output. Same guys behind the "ghost engineers" story.)
The preponderance of evidence paints a very clear picture. The alternative hypothesis is that ALL these institutes and companies are colluding. Occam's razor and all that.
> if at all realistic numbers are mentioned, I see people claiming 20 - 50%
IME most people claim small integer multiples, 2-5x.
> all the largest studies I've looked at mention this clearly and explain how they try to address it.
Yes, but I think pre-AI virtually everyone reading this would have been very skeptical about their ability to do so.
> My current hypothesis, based on the DORA and DX 2025 reports, is that quality is largely a function of your quality control processes (tests, CI/CD etc.)
This is pretty obviously incorrect, IMO. To see why, let's pretend it's 2021 and LLMs haven't come out yet. Someone is suggesting no longer using experienced (and expensive) first-world developers to write code. Instead, they suggest hiring several barely trained boot camp devs (from low cost of living parts of the world so they're dirt cheap) for every current dev and having the latter just do review. They claim that this won't impact quality because of the aforementioned review and their QA process. Do you think that's a realistic assessment? And if, on the off chance, you think it is, why didn't this happen on a larger scale pre-LLM?
The resolution here is that while quality control is clearly important, it's imperfect, ergo the quality of the code before passing through that process still matters. Pass worse code in, and you'll get worse code out. As such, any team using the method described above might produce more code, but it would be worse code.
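To make the "worse in, worse out" step concrete, here is a minimal sketch. It assumes review catches a fixed fraction of defects regardless of how many come in, which is my simplification and not something any of the studies state. The function name and all numbers are made up for illustration.

```python
def escaped_defects(defects_in: float, catch_rate: float) -> float:
    """Defects that survive review, assuming review catches a fixed fraction.

    Hypothetical model: catch_rate is the share of incoming defects that
    review and QA remove, and it does not depend on defects_in.
    """
    if not 0.0 <= catch_rate <= 1.0:
        raise ValueError("catch_rate must be between 0 and 1")
    return defects_in * (1.0 - catch_rate)


# Illustrative numbers only: same review process, different input quality.
careful = escaped_defects(defects_in=10, catch_rate=0.8)
sloppy = escaped_defects(defects_in=30, catch_rate=0.8)

# With any catch rate below 1.0, more defects going in means more coming out.
print(f"careful input: {careful:.1f} escaped, sloppy input: {sloppy:.1f} escaped")
```

Under this model the only way input quality stops mattering is a catch rate of exactly 1.0, which is the "review is imperfect" point.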
> the largest study in the link above explicitly discusses it and finds that proxies for quality (like approval rates) indicate more improvement than decline
Right, but my point is that that's a sanity check failure. The fact that shoving worse code at your quality control system will lower the quality of the code coming out the other side is IMO very well established, as is the fact that LLM generated code is still worse than human generated (where the human knows how to write the code in question, which they should if they're going to be responsible for it). It follows that more LLM code generation will result in worse code, and if a study finds the opposite it's very likely that it made some mistake.
As an analogy, when a physics experiment appeared to find that neutrinos travel faster than the speed of light in a vacuum, the correct conclusion was that there had almost certainly been a problem with the experiment, not that neutrinos actually travel faster than the speed of light. That was indeed the explanation. (Note that I'm not claiming that "quality control processes cannot completely eliminate the effect of input code quality" and "LLM generated code is worse than human generated code" are as well established as relativity.)
> Yes, but I think pre-AI virtually everyone reading this would have been very skeptical about their ability to do so.
That's not quite true: while everybody acknowledged it was folly to measure absolute individual productivity, there were aggregate metrics many in the industry were aligning on like DORA or the SPACE framework, not to mention studies like https://dl.acm.org/doi/abs/10.1145/3540250.3558940
Similarly, many of these AI coding studies do not look at productivity on an individual level at a point in time, but in aggregate and over an extended period of time using a randomized controlled trial. It's not saying Alice is more productive than Bob, it's saying Alice and Bob with AI are on average more productive than themselves without AI.
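A rough sketch of what that within-person comparison means, in case it helps. This is not the method of any particular study; the metric ("tasks completed per week"), the function name, and the numbers are all hypothetical.

```python
from statistics import mean


def mean_paired_change(without_ai: dict[str, float], with_ai: dict[str, float]) -> float:
    """Average relative change per developer, each compared only to themselves."""
    if without_ai.keys() != with_ai.keys():
        raise ValueError("both conditions must cover the same developers")
    return mean(
        (with_ai[dev] - without_ai[dev]) / without_ai[dev]
        for dev in without_ai
    )


# Made-up tasks completed per week; Alice and Bob are never compared to each other.
without_ai = {"alice": 10, "bob": 4}
with_ai = {"alice": 12, "bob": 5}

change = mean_paired_change(without_ai, with_ai)
print(f"average change vs. own baseline: {change:.1%}")
```

Bob's absolute output being lower than Alice's doesn't enter into it; only each person's change against their own baseline does.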
> They claim that this won't impact quality because of the aforementioned review and their QA process. Do you think that's a realistic assessment? And if, on the off chance, you think it is, why didn't this happen on a larger scale pre-LLM?
Interestingly, I think something similar did happen pre-LLM at industry scale! My hypothesis (based on observations when personally involved) is that this is exactly what allowed offshoring to boom. The earliest attempts at offshoring were marked by high-profile disasters that led many to scoff at the whole idea. However, companies quickly learned and instituted better processes that basically made failures the exception rather than the norm.
> ... as is the fact that LLM generated code is still worse than human generated...
I still don't think that can be assumed as a fact. The few studies I've seen find comparable outcomes, with LLMs actually having a slight edge in some cases, e.g.
> My hypothesis (based on observations when personally involved) is that this is exactly what allowed offshoring to boom.
Offshoring did happen, but if you were correct that only the quality control process impacted final quality, the software industry would have looked something like e.g. the garment industry, with basically zero people being paid to actually write software in the first world, and hires from the developing world not requiring much skill. What we actually see is that some offshoring occurred, but it was limited, and when it did occur companies tried to hire highly trained professionals in the country they outsourced to, not the cheapest bootcamp dev they could find. That's because the quality of the code at generation does matter, so it becomes a tradeoff between cost and quality.
> I still don't think that can be assumed as a fact. The few studies I've seen find comparable outcomes, with LLMs actually having a slight edge in some cases, e.g.
Anthropic doesn't actually believe in their LLMs as strongly as you do. You know how I can tell? Because they just spent millions acquihiring the Bun team instead of asking Claude to write them a JS runtime (not to mention the many software engineering roles they're advertising on their website). They know that their SOTA LLMs still generate worse code than humans, that they can't completely make up for it in the quality control phase, and that they at the very least can't be confident of that changing in the immediate future.
Offshoring wasn't really limited... looking at India as the largest offshoring destination, it is in the double-digit billions annually, about 5 - 10% of the entire Indian GDP, and it was large enough that it raised generations of Indians from lower middle-class to the middle and upper-middle class.
A large part of the success was, to your point, achieved by recruiting highly skilled workers at the client and offshoring ends, but they were a small minority. The vast majority of the workforce was much lower skilled. E.g. at one point the bulk of "software engineers" hired didn't even study computer science! The IT outsourcing giants would come in and recruit entire batches of graduates regardless of their education background. A surprisingly high portion of, say, TCS employees have a degree in something like Mechanical Engineering.
The key strategy was that these high-skilled workers acted as high-leverage points of quality control that were scaled to a much larger force of lower-skilled workers via processes. As the lower strata of workers upskilled over time, they were in turn promoted to lead other projects with lower-skilled workers.
In fact, you see this same dynamic in high-performing software teams, where there is a senior tech lead and a number of more junior engineers. The quality of output depends heavily on the skill-level of the lead rather than the more numerous juniors.
Re: Anthropic, I think we're conflating coding and software engineering. Writing an entire JS runtime is not just coding, it's a software engineering project, and I totally agree that AI cannot do software engineering: https://news.ycombinator.com/item?id=46210907