
Why AI Gains Are Real and Your Balance Sheet Doesn't Show It

Updated: 6 days ago

Why measured AI gains at the task level keep failing to show up in productivity data — and what we measure instead at Axon Active. 


According to Stack Overflow's 2025 Developer Survey, 84% of developers now use or plan to use AI tools. Most organizations see no measurable improvement on the metrics that matter. 


That contrast is the question every CTO and CFO is privately wrestling with. Pilot reports show 20–50% time savings on coding tasks. Vendors quote impressive headline numbers. A study in Management Science finds a 26% throughput gain across nearly five thousand developers at major enterprises. And yet, when finance closes the books on the year, the productivity bump is hard to find. Cycle time has not improved. The defect rate has not dropped. The team is shipping roughly the same amount of value, with more AI tooling expense on the line. 



This is not because the studies are wrong, or because the developers are exaggerating. The task-level gains are real. The gap is somewhere else — in the long, badly-instrumented chain between an individual task speeding up and a business outcome improving. This post is about that gap. What it is, where the value leaks out, and what we measure at Axon Active to make sure the gains we promise our clients are the ones their finance team can see. 

You can speed up the engine. But if the road has traffic, you're still stuck. 

The benchmark trap 

The AI-and-productivity debate is full of strong numbers. They are not lies. They are answers to the wrong question. 


Worth pausing first on the baseline AI is being measured against. Before AI tools entered the picture, McKinsey's research found that technical debt typically accounts for around 40% of IT balance sheets, with 30% of CIOs reporting that more than 20% of their technology budget is diverted to resolving tech debt issues rather than building new value. Stripe's Developer Coefficient report found 42% of professional developer time goes to managing technical debt. In other words: a meaningful fraction of every developer's week was already disappearing into maintenance and rework before AI came along. The relevant question is not whether AI makes coding faster. It is whether AI compresses or expands that pre-existing tax — and on the evidence, the answer depends almost entirely on how the team is governed. 


At the level of a single task, the evidence is now substantial. Brynjolfsson, Li, and Raymond's Generative AI at Work, a customer-support field experiment with 5,179 agents, found a 14% productivity gain on average — and 34% for novices. A Google enterprise randomized trial by Paradis and colleagues found developers worked roughly 21% faster on coding tasks. The largest study to date — Cui et al., Management Science 2026 — measured 4,867 developers across Microsoft, Accenture, and a Fortune 100 firm and found a 26.08% increase in completed tasks. 


Our own pilot data lines up: 20–25% with GitHub Copilot, 25–30% with Cursor, 30–50% with Claude Code, measured across our pilot teams over six months. The gains depend on project size, codebase complexity, the model's context window, the kind of work, and the developer's familiarity with the tool. Vendors quote single numbers because they sell tools. We measure ranges because we ship outcomes. But across the variation, the gains are there. The debate at the task level is over. 


Then comes the awkward question. If individual tasks are 20–50% faster, why is corporate productivity not 20–50% higher? Why do most enterprises struggle to point to material EBIT impact from generative AI? Why does the St. Louis Fed's analysis suggest the economy-wide productivity contribution is closer to one percentage point per year than to anything resembling a transformation? 


Because the benchmark is measuring the wrong thing. A benchmark measures how fast a developer completes a task. The balance sheet measures whether the customer got more value. Between those two there is an entire chain of work — code review, QA, deployment, integration, support, customer feedback — and AI does not accelerate most of it. In some cases AI actively slows it down. 


A benchmark measures how fast a developer completes a task. The balance sheet measures whether the customer got more value. The gap between those two is where almost every AI productivity claim breaks. 

Where the gains disappear 

If you watch a sprint closely, the AI gain is visible. The developer using Cursor or Claude Code does in two hours what previously took three. That is real. The disappearance happens later, in places most measurement frameworks do not look. 

Five places, specifically: 

Where it leaks and what happens

  • Verification overhead. AI generates faster than humans can verify. The minutes saved on the prompt are spent, and often more than spent, in the review window. In a published worked example of an agentic compliance workload, monthly token cost was $25 and monthly human review cost was $29,000. The number that matters is the second one. 

  • Workslop. AI-generated content that looks formally plausible but is substantively thin and requires downstream rework. A 2025 BetterUp / Stanford Social Media Lab study of 1,150 US workers found 40% reported receiving such output in the past month. Each incident cost roughly two hours to deal with. What looks like one team's efficiency reappears as another team's rework. 

  • Approval fatigue. BCG's 2026 Brain Fry study of 1,488 US workers found about 14% of AI users reporting cognitive exhaustion from constantly supervising and evaluating AI output. After enough Accept clicks, attention degrades. The AI is now unsupervised whether the reviewer meant it or not. 

  • Skill atrophy. An Anthropic study of 52 developers learning a new library found that heavy AI use made them marginally faster but led to 17% worse results on a knowledge test, with the difference depending entirely on whether AI was used for explanations or for delegating the work. Speed today, weaker foundation tomorrow, unless the use is actively governed. 

  • The incentive problem. Workers have good reasons to keep quiet about time savings, because admitting that a task that once took five hours now takes three opens the door to a heavier workload. Executives and tool vendors have the opposite problem: they need to justify budgets. Neither side is lying. They just see different things, and the productivity number that emerges from the gap is unreliable in both directions. 

 

 

Where developer time actually goes when AI enters the workflow. Writing code drops; review, testing, and documentation rise. The total may be smaller — but the proportion shifts toward judgment-heavy work. 


Each of these by itself is small. Together, they routinely consume the entire task-level gain — and sometimes more. Consider a hypothetical team with a 30% velocity gain at the coding step. The same team spends an extra two hours per week reviewing thinly-disguised AI output, an extra hour or two on cognitive recovery from constant supervision, additional time per quarter helping juniors who delegated their learning to AI relearn fundamentals, and quietly buffers a few hours of saved time into communication overhead. The arithmetic is illustrative — but the pattern is real, and across our pilot squads we have seen sprint-level gains that vanish almost entirely by the time work reaches the customer. The work moved. The output did not. 
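
The hypothetical above can be made concrete with a back-of-envelope sketch. Every input below is an assumption chosen for illustration: a 40-hour week, half of it spent writing code, the 30% gain treated as 30% fewer hours at the coding step, and guessed weekly values for each leak. None of it is measured pilot data, and the leaks cancel the saving by construction, to mirror the pattern rather than prove it.

```python
# Back-of-envelope model of the hypothetical team described above.
# All inputs are illustrative assumptions, not measured data.

WEEK_HOURS = 40.0       # assumed working week
CODING_SHARE = 0.5      # assumed share of the week spent writing code
CODING_SPEEDUP = 0.30   # 30% gain, treated as 30% fewer coding hours

# Assumed weekly hours lost to each leak, per developer.
LEAKS = {
    "extra_review": 2.0,          # reviewing thin AI output
    "cognitive_recovery": 1.5,    # "an hour or two", assumed midpoint
    "mentoring_juniors": 0.5,     # assumed weekly share of a quarterly cost
    "communication_buffer": 2.0,  # "a few hours", assumed
}


def net_weekly_gain(week_hours, coding_share, speedup, leaks):
    """Return (hours saved at the coding step, hours leaked, net hours)."""
    saved = week_hours * coding_share * speedup
    leaked = sum(leaks.values())
    return saved, leaked, saved - leaked


saved, leaked, net = net_weekly_gain(WEEK_HOURS, CODING_SHARE, CODING_SPEEDUP, LEAKS)
print(f"saved {saved:.1f} h, leaked {leaked:.1f} h, net {net:.1f} h per week")
```

Changing any single assumption moves the net figure, which is the point: the coding-step gain alone says little until the leaks are measured too.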


This is not a hypothetical pattern. It is what the published productivity research keeps finding. The Danish registry study by Humlum and Vestergaard linked AI usage surveys with administrative labor-market data and found zero effects on income or recorded work hours two years after chatbots arrived. The St. Louis Fed found average time savings of 5.4% of working hours among active users — which dilutes to 1.4% across the workforce, and translates to a potential productivity gain of roughly 1.1% only under assumptions that the saved time gets converted into value-creating work. The European firm-level study by Aldasoro and colleagues found a 4% labor productivity gain on average — but only in firms that had already made complementary investments in software, data, and training. The gain is real. The conditions for capturing it are demanding. 
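
The dilution from 5.4% to 1.4% is simple arithmetic, sketched below. The two percentages are the ones quoted above; the active-user share is not quoted and is only inferred here from their ratio, assuming non-users save nothing. The original analysis may weight hours or usage differently, so treat this as a simplification.

```python
# Dilution of user-level time savings across the whole workforce.
# The adoption share is inferred, not taken from the cited analysis.

SAVINGS_AMONG_USERS = 0.054  # share of work hours saved by active users
SAVINGS_WORKFORCE = 0.014    # same savings averaged over all workers


def implied_user_share(user_savings, workforce_savings_rate):
    """Share of workers who must be active users for both figures to agree."""
    return workforce_savings_rate / user_savings


def workforce_savings(user_savings, user_share):
    """Average savings across everyone, assuming non-users save nothing."""
    return user_savings * user_share


share = implied_user_share(SAVINGS_AMONG_USERS, SAVINGS_WORKFORCE)
print(f"implied active-user share: {share:.0%}")
print(f"check: {workforce_savings(SAVINGS_AMONG_USERS, share):.1%} of all hours")
```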


The veteran Silicon Valley engineer Bill Cox, who has personally written more than 240,000 lines of production code under AI supervision, gives perhaps the most direct illustration of what is at stake. In one of his projects, he produced 35,000 lines of clean, shipping code. In another, using the same AI model and the same tools, he produced 58,000 lines that mostly had to be thrown away. The variable was governance, or what Cox calls the harness. What determines whether AI gains land on your balance sheet is rarely the model. It is the structure your team builds around it. 


Why most AI measurement fails 

The standard AI productivity dashboard tracks: lines of code generated, tickets closed, drafts produced, prompts run, hours self-reported as saved. All of these are activity metrics. None of them are value metrics. As soon as a metric measures activity, the system optimizes for the metric — not for the value it was supposed to represent. 


Jan Sauermann's labor economics research describes this distortion well: once a specific observable metric feeds into evaluations or incentives, people optimize for that metric. AI makes the problem worse, because AI produces an extraordinary volume of countable outputs. More drafts. More PRs. More tickets touched. Dashboards fill up with green numbers. Senior leadership sees motion. The customer sees nothing different. 


The fix is structural. We measure five different things at Axon Active, and explicitly distrust five common substitutes. 

✓ What we measure 

  • Cycle time. Task start → PR merged. Trending down once the team settles in = real gain. 

  • PR review time. Goes up during transition. Means review is happening. 

  • Defect density. Bugs per feature. Stable or decreasing after transition. Key quality signal. 

  • Velocity accuracy. Planned vs delivered vs team confidence. 

  • Code duplication. Monitored per codebase. Increases in AI codebases without governance. 

✗ What we don't trust 

  • 'Feels faster'. Significant gap between perception and objective measurement (METR, 2025). 

  • AI tool usage count. Activity is not productivity. Usage without governance is noise. 

  • Lines of code written. AI inflates volume. More code equals more review burden, not more value. 

  • Velocity self-report. Teams overestimate consistently. We stopped asking and started measuring. 

  • PR count per sprint. AI generates larger PRs. Count is meaningless without quality context. 

 


How we think about AI measurement at Axon Active — what we track and what we deliberately do not. 

 

Two of these deserve a short note. 


Cycle time, not task completion time. We measure the time from a task being started to the corresponding pull request being merged into the main branch and accepted as Done. That captures everything between the start and the customer-visible result. A 30% AI speed-up on the coding sub-step that vanishes during a longer review will show up as flat cycle time. A real gain shows up as cycle time trending down once the team has settled into the new way of working. 
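
A minimal sketch of the cycle-time calculation described above. The task records, field order, and timestamps are hypothetical; a real setup would pull start and merge times from the tracker and the repository, and would likely count working hours rather than the wall-clock hours used here.

```python
from datetime import datetime
from statistics import median

# Hypothetical records: (task_id, started_at, merged_at), ISO 8601 timestamps.
TASKS = [
    ("T-1", "2026-01-05T09:00", "2026-01-08T15:00"),
    ("T-2", "2026-01-06T10:00", "2026-01-07T12:00"),
    ("T-3", "2026-01-12T09:00", "2026-01-13T09:00"),
]


def cycle_time_hours(started_at, merged_at):
    """Wall-clock hours from task start to the PR being merged."""
    start = datetime.fromisoformat(started_at)
    end = datetime.fromisoformat(merged_at)
    return (end - start).total_seconds() / 3600


def median_cycle_time(tasks):
    """Median cycle time in hours; the median is less skewed by one stuck task."""
    return median(cycle_time_hours(start, merged) for _, start, merged in tasks)


print(f"median cycle time: {median_cycle_time(TASKS):.1f} h")
```

Computing this per sprint and comparing the medians over time gives the trend the paragraph refers to; a coding-step gain that is eaten by review shows up as a flat line.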


PR review time as a positive signal. In every transition we have run, PR review time goes up significantly in the first sprints. This is expected and necessary. It means the review is happening, not being skipped. A team that adopts AI without review time increasing is not adopting AI safely — it is offloading the review burden to whoever finds the bug in production. 



We are also introducing survivorship as our most honest measure of AI's contribution: the share of AI-assisted code that endures unchanged in production over time. Lines generated is a vanity metric. Lines that ship and stay shipped is the real one. The measurement chapter of our internal AI Adoption Leadership Handbook currently tracks the five metrics above; survivorship is the metric we are building toward, with pilot measurement starting in selected squads. 
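
One way survivorship could be computed is sketched below, under strong simplifying assumptions: AI-assisted lines are already identified at ship time, and a line counts as surviving if an identical line still exists in the current code. The function name and the line-matching rule are illustrative only; a production version would more likely rely on version-control history than on text matching.

```python
def survivorship(shipped_lines, current_lines):
    """Share of shipped AI-assisted lines still present, unchanged, today.

    Simplification: a line "survives" if an identical line exists anywhere
    in the current code, so moved lines count as surviving.
    """
    if not shipped_lines:
        return 0.0
    current = set(current_lines)
    surviving = sum(1 for line in shipped_lines if line in current)
    return surviving / len(shipped_lines)


# Hypothetical snapshot: four AI-assisted lines shipped, two still unchanged.
shipped = ["a = 1", "b = 2", "return a + b", "log(a)"]
today = ["a = 1", "b = 3", "return a + b"]
print(f"survivorship: {survivorship(shipped, today):.0%}")
```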


Stepping back, a comprehensive measurement framework for AI in knowledge work needs at least five levels: cycle time of full processes, error and rework rates, quality, customer value, and economic impact. Our current measurement covers the first three rigorously and the fourth indirectly through client-facing satisfaction tracking. Direct measurement of economic impact — revenue, margin, conversion attributable to AI-assisted delivery — is the harder horizon, and the one most consultancies skip entirely because it requires honest cooperation with finance teams across the client boundary. We are building toward it with the most transparent clients first. The principle is the same one that runs through everything we measure: activity is not value, and the question is always whether the customer ended up better off. 


 

The DORA metrics — Deployment Frequency, Lead Time for Changes, Mean Time to Restore, Change Failure Rate — measure delivery health, not output volume. They are the right scoreboard when AI is in the loop. 


Flow governance — what it actually means 

Software delivery is a chain. Requirements feed planning. Planning feeds the sprint. The sprint feeds code review. Code review feeds QA. QA feeds deployment. Deployment feeds customer feedback. Customer feedback feeds the next requirements cycle. AI accelerates one of those links — the sprint, specifically the part where developers are writing code. It does not accelerate the other six. Code review and QA, in particular, get heavier — more code to review, more output to test, more decisions to make under the same time budget. 


 

AI accelerates one link in the delivery chain. Lead time is governed by Little's Law: lead time = work-in-progress ÷ throughput. If only one link gets faster, the bottleneck shifts. 

This is the governance question Little's Law makes precise. Little's Law states that the average time an item spends in a system equals the average number of items in the system divided by the average completion rate. In delivery terms: lead time equals work-in-progress divided by throughput. If a team uses AI to lift sprint throughput but does not also raise review and QA throughput proportionally, work-in-progress accumulates. Items pile up at the bottleneck. Lead time stays flat or gets worse. The faster sprint produced more output, but more of it is now sitting in someone's review queue, waiting. 
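
The effect can be sketched numerically. The numbers below are assumptions for illustration (20 items in progress, review completing 10 items a week, coding feeding 13 a week after AI adoption). Little's Law strictly describes long-run averages in a stable system, so applying it to a growing queue is only a rough approximation.

```python
def lead_time(wip_items, throughput_per_week):
    """Little's Law: average lead time = work-in-progress / throughput."""
    if throughput_per_week <= 0:
        raise ValueError("throughput must be positive")
    return wip_items / throughput_per_week


def wip_after(initial_wip, arrivals_per_week, completions_per_week, weeks):
    """WIP after some weeks when arrivals and completions differ."""
    growth = (arrivals_per_week - completions_per_week) * weeks
    return max(0, initial_wip + growth)


# Before AI (assumed): coding and review both run at 10 items per week.
print(lead_time(20, 10))  # lead time in weeks

# After AI (assumed): coding feeds 13 per week, review still completes 10.
queue = wip_after(20, 13, 10, weeks=4)
print(queue, lead_time(queue, 10))
```

With these inputs the queue grows by three items a week, so lead time rises even though the coding step got faster.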


This is why the standard pilot scorecard misleads. The pilot squad reports a 35% velocity gain at the sprint level. The CTO multiplies through and assumes the same gain at the delivery level. Then the actual delivery numbers come in flat, and everyone is confused. The pilot was not wrong. The flow was not governed. AI accelerated one link. The chain was not optimized end-to-end. 
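
A small sketch of why multiplying through fails: a chain delivers only as fast as its slowest stage. The stage names and weekly capacities below are assumed for illustration, not taken from any pilot.

```python
# Assumed weekly capacity (items per week) of each stage in the chain.
BASELINE = {"coding": 10, "review": 11, "qa": 12, "deployment": 15}


def bottleneck(stages):
    """Name of the stage with the lowest capacity."""
    return min(stages, key=stages.get)


def delivery_throughput(stages):
    """End-to-end throughput is capped by the slowest stage."""
    return min(stages.values())


def with_speedup(stages, stage, gain):
    """Copy of the chain with one stage made faster by the given fraction."""
    faster = dict(stages)
    faster[stage] = stages[stage] * (1 + gain)
    return faster


after = with_speedup(BASELINE, "coding", 0.35)  # 35% gain at the coding step
print(bottleneck(BASELINE), delivery_throughput(BASELINE))
print(bottleneck(after), delivery_throughput(after))
```

Under these assumptions a 35% gain at the coding step lifts delivery from 10 to 11 items a week, a 10% gain, and the bottleneck moves from coding to review.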


Flow governance, in our delivery model, means three things. First, measuring at the chain level — cycle time, defect density, code duplication — not just at the link AI is fastest at. Second, expanding capacity at the bottleneck rather than feeding more work into the unblocked stage. If review is the bottleneck after AI adoption, the answer is investing senior review capacity, automating the routine parts of review, and tightening the Definition of Done — not generating more PRs. Third, naming the bottleneck shift to the client up front, before it happens, so that when PR review time triples in week three the response is recognition rather than alarm. 


Faster sprints without flow governance produce bigger queues, not faster delivery. 

Expect a dip before the gain 

There is one more piece of organizational reality the productivity data points at. AI gains are not instantaneous. In every transition we have run, there is a velocity dip in the first few sprints, before sustained gains appear in the months that follow. 

 

The transition pattern: assessment, enablement, an expected velocity dip in the first sprints, then sustained gains. We tell our clients this in the assessment phase, not after the dip starts. 


This is consistent with the macro pattern in the productivity literature. THE DECODER's Frontier Radar synthesis of the historical comparison to the PC, the internet, and cloud computing suggests roughly a decade tends to pass between a technology becoming available and measurable productivity gains showing up in the aggregate data. The reason is not that the technology is slow; it is that organizations are. Workflows have to be redesigned. Measurement systems have to be built. Incentive structures have to be adjusted. Accountability frameworks have to be created. None of this happens in a single sprint. 


At the squad level, the same dynamic plays out compressed into weeks. In the first sprint after AI adoption, the team is figuring out which prompts work and which produce workslop. Working agreements get tightened. Review capacity gets expanded. Prompt libraries get built. Within a few sprints, the team is operating in a steady state with the new tooling — and from there, sustained gains in cycle time and quality become visible. 


We tell our clients this in the assessment phase, not after the dip starts. The honesty is itself part of the differentiation. Most AI consultancies pitch the upside without the transition cost. We pitch both, because both are real, and a CFO who sees the dip without being warned will quite reasonably conclude the project has failed. 


How we do this at Axon Active 


Our AI Transition Framework — backed by the AI Adoption Leadership Handbook

Our AI Transition Framework — backed by the AI Adoption Leadership Handbook (March 2026) we use internally — puts the measurement discipline into practice. Four stages, run as a continuous cycle: 

  • Assess. Two to three weeks. We map team maturity, codebase risk, and compliance context, and produce a recommended AI Responsibility Level per workstream. The output is a 90-day transition roadmap, not a one-size-fits-all rollout. 

  • Enable. Dojo sessions in your actual codebase. AI Coaches embedded with squad leaders. Working agreements that define where AI is allowed and forbidden. Prompt libraries pre-built for your environment. Tool selection matched to the AI Level and the compliance context. 

  • Govern. Four AI Responsibility Levels — Assisted (Level 1), Augmented (Level 2), Supervised (Level 3), Orchestrated (Level 4) — that classify how responsibility is shared between human and AI on a given task. Definition of Done with AI-specific clauses. Security pipeline. The 3-Prompt Rule (escalate rather than persist when AI is not moving a problem forward). 

  • Measure. Cycle time, PR review time, defect density, velocity accuracy, code duplication. Reported transparently to the client. If AI is not delivering, we say so and adjust. 


The cycle is continuous. Measurement triggers re-assessment. Re-assessment refines the working agreements and the AI Levels. Refined governance shows up in the next sprint's measurement. This is the operational form of the position described in the companion piece, "AI Changes the Practice. The Scrum Team Stays.": practice evolves, the team stays, governance is the work. 


Why this matters 

The standard playbook is to sell tools and hope the productivity follows. We measure the productivity and govern the conditions that produce it. That is a different business model. It is the one that fits the Axon Model — Swiss-quality dedicated-team engineering that has worked for our regulated clients for fifteen-plus years — and it is the one we have committed to for the next decade of AI delivery. 


The question your finance team is going to ask you is not how many developers used AI last quarter, or how many lines of code were generated. It is whether the cycle time improved, whether the defect rate held, whether the customer noticed anything different. Those are the questions we built our measurement framework to answer — honestly, transparently, and on a timeline that includes the dip before the gain. 


If your AI rollout is not yet showing up in the metrics that matter, that is not a sign the technology has failed. It is a sign the chain has not been governed. We can help with that. 

 


 
 