Shape And Ship - EN

Cheap to build, costly to keep

Sat, 04 Jul 2026 00:00:00 GMT

Over 40% of committed code is now AI-assisted (1). AI has broken the cost of writing code. It hasn't touched the cost of owning it.

Velocity metrics look great. We ship more, faster. Everything is green. That's worth questioning.

There are two prices

AI has collapsed the price of writing code to near zero. Features that used to take a week now ship in two days, a major lever for any scale-up.

But writing code was never the most expensive part. What costs is maintenance and evolution.

The first large-scale empirical studies converge on the same finding: AI-generated code introduces 1.7x more issues than human code. Without guardrails, that compounds: maintenance costs reach 4x traditional levels by the second year (1). These numbers come from an Ox Security report relayed by InfoQ. Ox sells application security tooling, so they have skin in the game. But other studies point in the same direction (4)(6), and the pattern matches what I observe in the field.

Why? Because AI is additive by default. Take a common case: an endpoint returns a badly formatted date. An experienced developer would trace back to the parser and fix the format at the source. AI adds a .toISOString() in the controller, a sanitizeDate() wrapper in the service, and a test that validates the workaround. The bug is "fixed." Three layers of code added, zero lines removed. The root cause is still there.

A junior developer would make the same mistake. The difference is scale. AI produces this kind of palliative on every PR, in every module, without anyone systematically pushing back. What used to be a coaching moment becomes a systemic codebase problem.

GitClear's longitudinal study across millions of lines of code (6) quantifies the shift: the share of changed lines associated with refactoring dropped from 25% in 2021 to under 10% in 2024. In the same period, code duplication rose from 8.3% to 12.3%. AI amplifies the pattern: it generates new code rather than restructuring what exists. (On greenfield projects, high addition rates are expected. The signal matters on code in maintenance, over time.)

Another signal worth tracking: code churn, the percentage of code rewritten within two weeks of its creation. The same study (6) measured churn rising 84% between 2020 and 2024, from 3.1% to 5.7%. The period also covers post-COVID shifts and the Great Resignation, so AI adoption isn't the only factor. But the correlation is strong enough to warrant monitoring. A feature that ships in two days but gets rewritten the following sprint is not a velocity gain. It's a deferred cost wearing a speed label.

These are the indicators that separate teams where AI builds from teams where AI just adds.

The code works. Nobody understands why.

Addy Osmani named this phenomenon: comprehension debt (3). The code runs. Tests pass. Syntax is flawless. But no one on the team can explain how it works. The team merged code it didn't write, didn't truly read, and couldn't reproduce without AI.

This has nothing to do with classic technical debt. There's nothing to refactor. The code is clean. It's just that nobody carries it in their head.

When a production incident hits, resolution time increases. The team discovers the implementation in real time, under pressure. An empirical study of 304,000 commits confirms it (4): developers place excessive trust in AI-generated code and merge it without thorough validation. Issues frequently go unfixed.

This is where bus factor becomes a leading indicator. If AI writes code that only AI can explain, the bus factor of affected modules tends toward zero. Not because a single person holds the knowledge, but because no one does. In a post-mortem, this surfaces as "nobody knew this code existed," which is worse than "only one person knew."

One way to detect it: track the MTTR Drift (2), the deviation of Mean Time To Recovery from its pre-AI baseline:

MTTR Drift = (MTTR post-AI - MTTR pre-AI) / MTTR pre-AI

The proposed thresholds (2) are as follows. Between -10% and +10%, the team has internalized the generated code. Above +30%, it's a signal worth investigating. MTTR is a noisy indicator. A single major incident can skew a quarter, so measure it as a rolling median over at least three months. The drift can have other causes (turnover, infrastructure changes), but if it correlates with AI adoption and nothing else explains it, comprehension debt is a serious hypothesis.
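
As a rough illustration of the formula and thresholds above, here is a minimal Python sketch. It compares the median recovery time of a recent window against a pre-AI baseline window. The function names, the sample durations, and the "unclassified" label are assumptions made for this example; they do not come from (2). Choosing the windows (three months or more, rolled forward) is left to the caller.

```python
from statistics import median


def mttr_drift(baseline_minutes, recent_minutes):
    """Relative deviation of the recent median MTTR from the pre-AI baseline median."""
    if not baseline_minutes or not recent_minutes:
        raise ValueError("need at least one incident in each window")
    baseline = median(baseline_minutes)
    if baseline <= 0:
        raise ValueError("baseline MTTR must be positive")
    return (median(recent_minutes) - baseline) / baseline


def classify(drift):
    """Map a drift value onto the two bands named in the text."""
    if -0.10 <= drift <= 0.10:
        return "internalized"
    if drift > 0.30:
        return "investigate"
    return "unclassified"  # assumption: the text names no band for other values


# Hypothetical recovery times in minutes; 600 is a single major incident.
pre_ai = [40, 55, 60, 45, 50]
post_ai = [50, 70, 600, 65, 75]

drift = mttr_drift(pre_ai, post_ai)
print(f"{drift:+.0%} -> {classify(drift)}")  # +40% -> investigate
```

With these sample values the median moves from 50 to 70 minutes, a drift of +40%. A mean over the same data would be dominated by the single 600-minute incident, which is why the median is the safer aggregate here.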

If the drift is real, the intervention point is clear. Code review is the last moment where the team can take ownership of code it didn't write. Automating convention and pattern checks (via hooks or review bots) frees human attention for business logic and architecture decisions. A hook that blocks PRs over 400 lines unless the author provides a section-by-section breakdown helps keep reviews at a human scale.
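
The size gate described above could look something like the sketch below. It only shows the decision logic: how the changed-line count and the PR description reach the function depends on your CI or review bot. The "## Breakdown" marker is a hypothetical team convention, not a standard.

```python
MAX_CHANGED_LINES = 400            # threshold mentioned in the text
BREAKDOWN_MARKER = "## Breakdown"  # hypothetical convention for this sketch


def review_gate(changed_lines, description):
    """Return (allowed, reason) for a pull request."""
    if changed_lines <= MAX_CHANGED_LINES:
        return True, "within size limit"
    if BREAKDOWN_MARKER.lower() in description.lower():
        return True, "large PR with a section-by-section breakdown"
    return False, f"{changed_lines} changed lines and no breakdown section"


allowed, reason = review_gate(900, "Refactor billing")
print(allowed, reason)
```

The check is deliberately crude: it confirms that a breakdown section exists, not that it is any good. Judging the breakdown remains the reviewer's job.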

Measuring real impact, not velocity

"The faster you go, the further ahead you need to look."
— Todd Gagne, The Barrels Paradox

Lines of code generated, number of PRs, completion speed: these are production metrics. They measure volume. They say nothing about the cost of what we produce.

The obvious starting point is DORA metrics before and after AI adoption. If deployment frequency goes up but change failure rate does too, we're not going faster. We're breaking more often. One study measured a +30% increase in change failure rate within 90 days of AI adoption (1). Part of that may reflect the learning curve of new tooling rather than a structural problem.
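
To make the comparison concrete, here is a small Python sketch that puts deployment frequency and change failure rate side by side for two periods. The counts are invented for the example, and the function names are assumptions.

```python
def change_failure_rate(deployments, failed_deployments):
    """Share of deployments that caused a failure in production."""
    if deployments <= 0:
        raise ValueError("deployments must be positive")
    return failed_deployments / deployments


def relative_change(before, after):
    """Change from `before` to `after`, as a fraction of `before`."""
    return (after - before) / before


# Hypothetical quarterly counts: (deployments, failed deployments).
before = (40, 4)
after = (60, 9)

cfr_before = change_failure_rate(*before)
cfr_after = change_failure_rate(*after)

print(f"deployments: {relative_change(before[0], after[0]):+.0%}")
print(f"change failure rate: {cfr_before:.0%} -> {cfr_after:.0%}")
```

In this made-up case the team deploys 50% more often, but the failure rate moves from 10% to 15%: the number of failing deployments more than doubles, from 4 to 9.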

Three questions to ask: does the AI produce code we keep? Can the review pipeline absorb the volume? Does the architecture hold?

On the review pipeline specifically: review cycle time on AI-assisted PRs versus manual ones is a revealing metric. A study of over 8,000 AI-agent PRs (7) shows that 35% are never merged, either closed or left to rot. Merge rates vary from 42% to 82% depending on the tool. If a third of AI-generated PRs never land, the velocity gain measured at commit time is absorbed downstream. The bottleneck moves from writing to reviewing.

In his paper (2), Nadarajah applies this reasoning to two scenarios with the same tool and the same team, over a one-month cycle. Without guardrails: net impact of -4,200. The team spends its time cleaning up. With the right safety nets: net impact of +5,780. Same tool, opposite outcome.

The guardrails make the difference

The numbers tell part of the story. There's also a human cost that metrics don't capture. Seniors spend their days reviewing code they didn't write and understand less and less. Experienced developers lose touch with their own codebase. In the teams I work with, that's often the first warning sign: not a metric going off, but a tech lead saying "I don't recognize the code anymore."

The answer is to give AI the right context. With one of my clients, we formalized architecture conventions in tool-readable specs that the AI loads before generating code, set up pre-commit hooks to block known anti-patterns, and invested in building shared skills across the team so that everyone can evaluate what the AI produces. The hooks catch issues before they reach review; the shared understanding catches everything else.

It's too early to measure the impact. The setup is recent. But the logic holds: give AI the context to produce aligned code from the start, rather than fixing it after the fact. I'll detail the full pipeline (specs, hooks, task templates, review automation) in a follow-up article.

AI also reduces certain types of bugs, improves test coverage on boilerplate, and enables small teams to deliver what used to take months. The risks described here are real, but they exist alongside genuine gains.

What to watch

If you do one thing Monday morning: pull the insertion/deletion ratio on your three most active repos for the last quarter. If it exceeds 10:1, start asking which PRs contribute the most.
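
One possible way to pull that ratio, sketched in Python: it parses the text produced by `git log --since="3 months ago" --numstat --pretty=format:` run inside the repository. The function name and the sample input are assumptions made for this example.

```python
def insertion_deletion_ratio(numstat_text):
    """Ratio of added to deleted lines in `git log --numstat` output."""
    added = deleted = 0
    for line in numstat_text.splitlines():
        parts = line.split("\t", 2)
        if len(parts) != 3:
            continue  # blank lines between commits
        a, d, _path = parts
        if not (a.isdigit() and d.isdigit()):
            continue  # binary files are reported as "-"
        added += int(a)
        deleted += int(d)
    if deleted == 0:
        return float("inf") if added else 0.0
    return added / deleted


# Hypothetical numstat output: added<TAB>deleted<TAB>path
sample = (
    "120\t4\tsrc/api/orders.py\n"
    "80\t6\tsrc/api/dates.py\n"
    "-\t-\tdocs/diagram.png\n"
)
print(insertion_deletion_ratio(sample))  # 20.0
```

Generated files, lockfiles, and vendored code can inflate the ratio, so it is probably worth excluding those paths before drawing conclusions.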

Signals to track over time:

Code churn within 14 days. Extractible from git, no vendor dependency. If AI code is rewritten more than human code, the speed is illusory. (Tagging AI vs. human commits requires convention: commit message tags or Copilot metadata.)
Refactoring ratio on your repos. If it drops below 10% of changed lines, the codebase is bloating. The 10% threshold comes from GitClear's methodology (6). Your mileage may vary with different tooling.
MTTR Drift (rolling median, 3+ months). Above +30%, the team understands less of what it ships. Threshold from (2), not an industry standard.
Review cycle time, AI PRs vs. manual. If AI PRs take longer to merge, the bottleneck moved. Normalize by PR size to account for size differences.
Change failure rate before and after AI. If it's going up after the adoption curve has stabilized, we're breaking faster than we're building.
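
For the first signal in the list, the sketch below shows how the 14-day churn comparison might be computed once per-line history is available. It assumes you have already extracted, for each added line, its creation date, the date it was next modified or deleted (if any), and an AI/human tag from your commit convention. The `LineRecord` structure and the sample data are hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class LineRecord:
    created: date
    rewritten: Optional[date]  # None if the line has not been touched since
    ai_assisted: bool          # derived from a commit tag, by team convention


def churn_rate(records, window_days=14):
    """Share of lines rewritten within `window_days` of their creation."""
    if not records:
        return 0.0
    window = timedelta(days=window_days)
    churned = sum(
        1 for r in records
        if r.rewritten is not None and r.rewritten - r.created <= window
    )
    return churned / len(records)


# Hypothetical sample: four AI-assisted lines, four manual ones.
d = date(2026, 3, 2)
records = [
    LineRecord(d, d + timedelta(days=3), True),
    LineRecord(d, d + timedelta(days=10), True),
    LineRecord(d, None, True),
    LineRecord(d, d + timedelta(days=40), True),
    LineRecord(d, d + timedelta(days=5), False),
    LineRecord(d, None, False),
    LineRecord(d, None, False),
    LineRecord(d, d + timedelta(days=30), False),
]

ai = [r for r in records if r.ai_assisted]
human = [r for r in records if not r.ai_assisted]
print(churn_rate(ai), churn_rate(human))  # 0.5 0.25
```

The hard part in practice is the extraction step, not this arithmetic. The comparison is only as reliable as the tagging convention behind `ai_assisted`.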

Today's acceleration becomes tomorrow's bottleneck when nobody looks beyond the velocity dashboards. That's a leadership choice.

Sources

(1) AI-Generated Code Creates New Wave of Technical Debt — InfoQ / Ox Security (Nov 2025)
(2) The Velocity Mirage: The Agentic Impact Framework — Mag-Stellon Nadarajah (March 2026)
(3) Comprehension Debt: The Hidden Cost of AI-Generated Code — Addy Osmani (March 2026)
(4) Debt Behind the AI Boom: A Large-Scale Empirical Study of AI-Generated Code in the Wild — arXiv (March 2026)
(5) The Barrels Paradox: Why AI Makes Leadership More Human, Not Less — Todd Gagne (Feb 2025)
(6) AI Copilot Code Quality: 2025 Data Suggests 4x Growth in Code Clones — GitClear (2025)
(7) Why AI Agent-Involved Pull Requests Remain Unmerged — arXiv (Feb 2026)

Craft is dead, long live the craft

Mon, 22 Jun 2026 00:00:00 GMT

"Craft is outdated."

A candidate told me this during an interview after we flagged gaps in his engineering craft culture. His answer was blunt: with AI, what matters is speed.

Easy to dismiss. Except I hear the same thing in the teams I work with, just phrased differently.

What the data already shows

Faros measured the impact of AI adoption across 22,000 developers and 4,000 teams over two years.

Individual throughput is up 34%. Epics completed per developer, up 66%. AI accelerates code production. Nobody disputes that.

On the other side: bugs per developer are up 54%. The incident-to-PR ratio has tripled. Code churn is up 9x. Lead time, 5x. And 31% more PRs are being merged without any review at all.

Some of this can be explained by increased volume and more ambitious projects. You could also argue that higher churn reflects refactorings teams would never have tackled without AI. But that doesn't explain why teams are discovering their own implementations for the first time during production incidents.

The hardest finding in the report: organizations with strong pre-AI engineering practices are not protected. Teams with solid foundations, mature review processes and high DORA scores see the same degradation. Code generation has become so cheap that it overwhelms existing quality mechanisms.

Some CTOs I talk to are convinced their mature teams will absorb AI naturally. The data suggests otherwise.

That doesn't mean old practices should survive as-is.

Craft was never about the code

For years, we believed craft meant writing elegant code, applying patterns, maintaining clean architecture. AI is very good at producing all of that. Idiomatic code, well-named, stylistically consistent. If craft were just that, the candidate would be right: it's outdated.

Code was the observable surface of something else. A team's ability to build a shared mental model of its system and domain. Knowing which trade-offs were made and why. Knowing what deserves to be simplified and what needs to stay complex.

In a team I was in the process of structuring, I tried to introduce pairing. The response was immediate: "we have Claude Code." Same thing with cross-reviewing technical specs: "one review is enough."

No resistance. A rational shortcut. And not an absurd one. Pairing was expensive. Mob programming was slow. Nobody misses the days when a two-hour refactoring took a full day in a pair. You could even argue that Claude Code is a form of pairing, with a machine instead of a colleague.

Except a machine's challenge doesn't replace a human's. It improves an individual decision. It doesn't automatically build collective understanding. Pairing is about one developer understanding and challenging another's choices, not about writing code together. Spec review is about the team aligning on the problem before coding. TDD is about specifying expected behavior before writing a single line (Kent Beck goes as far as enforcing TDD in his agents' system prompts, because without it, the agent deletes the test rather than fixing the code). Mob programming is about anchoring conventions through collective imitation.

Each of these practices produces a deliverable. And each builds, in parallel, a shared understanding of the system. AI produces the deliverable. Not the rest.

The gain is real. We deliver faster, we have more tests. But I observe that teams don't use the time saved to think. They use it to pick up the next ticket. Sometimes to improve overall quality, strengthening tests or documentation for instance. Often, practices like pair programming, mob programming, spec reviewing don't even get the chance to take hold. AI provides the shortcut before the practice has proved its value.

This isn't a critique of AI. It's a critique of what AI amplifies. Dave Farley puts it bluntly: teams with good practices benefit massively from AI. The rest produce worse software, faster.

Craft is shifting

Some teams delegate to AI. The spec, the tests, the code. AI produces, the team validates and moves on. Others use it to think. They specify expected behavior in TDD, then let AI code. They write the first draft of the spec themselves, then use AI to tear it apart: what could break, what case was missed. They read the generated tests asking what they don't cover.

Martin Fowler talks about "letting go of the obsession with perfect code line by line to strengthen understanding of the domain and the overall architecture." Craft isn't disappearing. It's shifting. Execution matters less. Judgment and domain modeling matter more. This already separated good teams from the rest, but now the gap is visible.

The question for every team: are we using AI as a sparring partner that forces us to think, or as a competent colleague we delegate to?

Craft isn't dead. Not yet.

For twenty years, engineering practices served two purposes: producing software and building collective understanding. AI now produces much of the software.

We measure throughput, closed tickets, merged PRs. We almost never measure distributed understanding. Who knows what in the team. How long it takes to modify a system six months after building it. What it costs to explain an architectural decision to someone who just joined.

For a long time, production speed was a reasonable signal of a team's health. When writing code becomes nearly free, that signal gradually stops being reliable. If AI now produces the software, there's a question nobody is asking enough: how do we measure the quality of collective understanding that allows us to evolve it?

That candidate thought craft was dead. Execution craft is receding, yes. But craft was never really about the code. Craft is now about producing the understanding that allows the code to evolve.

AI makes code abundant. Understanding remains scarce.

Sources and related articles

Data

Faros AI Engineering Report 2026, "The Acceleration Whiplash" — 22,000 developers, 4,000 teams, two years of telemetry

Craft voices on AI

Kent Beck — TDD, AI agents and coding (The Pragmatic Engineer)
Kent Beck — Augmented Coding: Beyond the Vibes
Dave Farley — The AI Shift is Bigger Than Internet and Agile (Aviator podcast)
Martin Fowler & Kent Beck — Tech Truth: Agile Evolution & the Future of SW Engineering (GOTO 2025)

Related articles (AI and Quality series)

Cheap to build, costly to keep (comprehension debt)
"It's historical" (organizational knowledge)
Tokenmaxxing (AI metrics and ROI)
AI doesn't replace juniors (training and knowledge transfer)

We doubled the team and nothing got faster - A story of performance

Mon, 15 Jun 2026 00:00:00 GMT

A few years ago, at a scale-up where I was CTO, the engineering team was cut in half. Tough economic context, like many companies at the time. On paper, it was a disaster.

Except velocity didn't drop by half. It dropped by maybe 20%, possibly less. Standups got shorter. PRs moved faster. The people who stayed knew exactly what to do, and they finally had room to do it.

To be fair, what made the difference is partly that the right people had stayed. The ones actually driving projects forward hadn't left. Critical domains were still covered. Scope had shrunk. The headcount reduction didn't improve anything on its own. What mattered was who remained.

And that raises a broader question: why do some people weigh so much more than others in an organization's ability to ship?

Barrels

In any team, there are two or three people without whom nothing ships. The ones who take a vague topic, break it down, get others on board, and push it to production.

Keith Rabois put a name on this while observing PayPal: out of 254 people, 12 to 17 carried projects end to end. He calls them barrels. Everyone else is ammunition: the people who execute.

The ratio is striking. It's also a heuristic drawn from one specific company at one specific time. A 2000s fintech, a B2B SaaS platform, and an e-commerce scale-up don't have the same dynamics. But the intuition holds: in most organizations, a small number of people carry the bulk of the ability to ship.

And "carrying" doesn't look the same everywhere. The Staff Engineer who secures platform architecture carries. The EM who unblocks their team every week carries. The PM who makes priority calls and protects product focus carries, without writing a line of code. The SRE whose reliability lets everyone else build with confidence carries too. Very different profiles, same function: turning an initiative into a result.

An organization's real capacity is the number of these people. Not total headcount.

More people, less speed

When things slow down, the reflex is almost always the same: hire. Look at the roadmap, count the features, divide by estimated capacity per developer, conclude you need fifteen more people. Brooks showed the problem back in 1975: coordination cost grows at n(n-1)/2. Teams of 5-7 consistently outperform teams of 15-20 in per-capita productivity (QSM, across thousands of projects).
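
A quick worked example of Brooks's formula, as a small Python sketch. It counts the pairwise communication paths for the team sizes mentioned above. The function name is an assumption for this example.

```python
def communication_paths(n):
    """Number of distinct pairs in a team of n people: n(n-1)/2."""
    return n * (n - 1) // 2


for size in (5, 7, 15, 20):
    print(size, communication_paths(size))
# 5 10
# 7 21
# 15 105
# 20 190
```

Going from 7 to 15 people roughly doubles headcount but multiplies the number of possible pairwise channels by five (21 to 105). The formula gives an upper bound on channels, not a measured cost: real teams do not use every channel equally.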

Sometimes headcount really is the bottleneck. An undersized platform team, 24/7 coverage to maintain, a multiplication of business domains requiring new expertise. In those cases, hiring is the right call.

But when the problem is "we're not moving fast enough on existing work," it's rarely a staffing problem. Ten more developers with nobody to frame the project, break it down, unblock it when things get stuck: ten people waiting or heading in different directions. The bottleneck is the number of people able to carry a topic end to end.

You can't hire into a broken system

On another engagement, I faced strong pressure to hire into a struggling team. My stakeholder couldn't understand why I was pushing back. The team was struggling, so it needed more people. To him, it was obvious.

I saw the outcome of that logic in a different context: external hires added to a dysfunctional team. They didn't push back, didn't bring the fresh perspective everyone hoped for. Within weeks, they'd adopted the same workarounds as the rest of the team, the same shortcuts in review, the same topics everyone avoids in standup. The resources were there, the goodwill too. But when the system has structural flaws, new hires don't fix them. They absorb them.

Will Larson calls this organizational debt. You don't hire into a broken system expecting the newcomers to fix it despite themselves.

Find the bottleneck, not the headcount gap

Goldratt laid it out in The Goal: a system's throughput is limited by a single bottleneck at a time. Optimizing anything else is waste. In The Phoenix Project, Brent is the one everything flows through, the one without whom nothing moves. Redistributing his load is the only lever. Hiring next to him does nothing.

In the team I watched shrink by half, that's what happened naturally. Fewer people, so fewer projects in parallel. Those who remained finally had the bandwidth to run their topics end to end, without being interrupted every hour to coordinate.

Skelton and Pais frame the same idea differently in Team Topologies: the real constraint is cognitive load, not headcount. When a team is overloaded, you need to reduce the surface area, the scope per person, the number of domains to keep in mind. Not add more heads.

And often, the bottleneck isn't even in engineering. It's in product decisions. Fuzzy priorities, calls that don't get made, business dependencies that nobody resolves. You can have all the technical barrels you want: if nobody on the product side holds the vision and protects focus, the team builds fast in directions that don't converge.

Growing barrels

Barrels don't stand out by their technical level. They stand out by how they react to ambiguity.

On a recent engagement, I needed barrels in a team that didn't have enough. Rather than hiring directly, I gave a leadership role to about ten people, in rotation. Same conditions, same scope, same expectations.

Within weeks, the difference was clear. Some grabbed the role: they broke down problems on their own, asked questions about the why not just the how. They felt accountable for outcomes beyond their own scope. Others still needed guidance, and that's fine. Not a judgment call, just a snapshot of readiness at a given point.

Where some waited for instructions or escalated, those few started moving.

It takes time. It's less dramatic than a senior hire showing up with an impressive resume. But people you grow internally know the context, the product, the debt. They don't need three months of onboarding. And above all, they have the team's trust. You can't mandate that.

AI doesn't make barrels

All of this was observed and theorized before every team had access to AI. AI doesn't remove the constraint. It makes it worse.

Todd Gagne offers an analogy that clarifies things. When electricity arrived in factories in the 19th century, the first industrialists simply replaced the steam engine with an electric motor while keeping the same production layout. The gains were marginal. It took a complete rethinking of factory organization for electricity to deliver on its promise.

AI is the same story. Deploying Copilot or Cursor on a team that lacks barrels is putting an electric motor in a steam-age factory. AI generates code, tests, documentation. But it doesn't know what to build, it doesn't know how to prioritize, and it doesn't push anything to production.

Using Rabois's heuristic: if 12 barrels drove 254 people at PayPal, AI lets those 254 produce ten times more code. But if the barrels are still 12, the bottleneck hasn't moved. It's gotten worse: there's now ten times more output to sort, validate, and carry to production.

A team without barrels armed with AI produces more code that nobody ships. The result is noise, not throughput.

Barrels don't stand out by their technical level. They stand out by their ability to turn uncertainty into execution.

When an organization slows down, the first question isn't "How many people are we missing?" The first question is "Who actually turns vague topics into concrete results?"

As long as those people remain the bottleneck, hiring more doesn't change the trajectory.

Sources

Keith Rabois, How to Operate — Stanford/YC, 2014. The barrels vs ammunition concept.
Fred Brooks, The Mythical Man-Month, 1975. Coordination cost at n(n-1)/2.
Will Larson, Sizing Engineering Teams — and An Elegant Puzzle, 2019. Organizational debt and team sizing.
Eliyahu Goldratt, The Goal, 1984. Theory of Constraints.
Gene Kim et al., The Phoenix Project, 2013. Brent as the human bottleneck.
Matthew Skelton & Manuel Pais, Team Topologies, 2019. Cognitive load as the real constraint.
Todd Gagne, The Barrels Paradox, 2025. The electricity/AI analogy and why human judgment becomes the real bottleneck.
