The Verification Debt Nobody Put on the Balance Sheet

Tomosu · May 2026

AI agents now initiate at least 96% of the work in collaborator-driven repositories. At most 0.07% of pull requests merge without a human, yet that human oversight usually amounts to one person glancing at a diff. The gap between those two numbers is where risk is quietly accumulating, and most engineering organizations have no instrument that measures it.

The two numbers that don't reconcile

A peer-reviewed study published this month analyzed 29,585 pull request lifecycles across the five leading AI coding tools: OpenAI Codex, GitHub Copilot, Devin, Cursor, and Claude Code [1]. It found something most engineering leaders already feel but have not yet quantified: in collaborator-driven repositories, agents now initiate at least 96% of the work, opening branches, writing the code, and carrying the pull request forward. And yet terminal merge authority remains almost exclusively human. The study placed the upper bound on genuinely autonomous merges at 0.07% of all PRs [1].

Read those two numbers together. Ninety-six percent of the work is agent-initiated. Effectively all of the governance is a human decision at the merge gate. That is not a balanced system. That is a system where the volume of what needs reviewing has scaled with the agents, and the capacity to review it has stayed exactly where it was — one human, one diff, one judgment call, at the end of a pipeline moving faster than any human can absorb.

[Figure: two pairs of bars. Agent activity (arXiv:2605.08017): work initiated by AI agents, 96%; merges that are fully autonomous, 0.07%. Developer behaviour (Sonar Summit 2026): developers who distrust AI-generated code, 96%; developers who verify it before committing, 48%.]
Two independent data sets. One story. Top: arXiv:2605.08017, 29,585 PR lifecycles. Bottom: Sonar Summit 2026 survey data. The gap between the two bars in each section is where verification debt accrues.

Amazon CTO Werner Vogels has a name for the cost that accumulates in that gap. He calls it verification debt [2]. It is the same shape as technical debt, with one difference that matters: technical debt is visible in the codebase, and verification debt is invisible until an incident makes it visible. Sonar's own survey data sharpens the picture: 96% of developers express distrust of AI-generated code, yet only 48% verify it before committing [2]. The distrust is real. The verification is not happening. The debt compounds.

96% of PRs in collaborator-driven repos are agent-initiated, yet the merge gate hasn't scaled with the volume.
0.07% is the upper bound on genuinely autonomous merges. Human review is still effectively universal, and overwhelmed.
1.7× more issues per AI-generated PR than human-authored ones (10.83 vs 6.45), per CodeRabbit analysis.
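The headline figures above can be sanity-checked with a few lines of arithmetic. The sketch below uses only numbers quoted in this article; the variable names are illustrative, and converting 0.07% into a PR count assumes the percentage is taken over the full 29,585-PR sample, which is an assumption rather than something stated here.

```python
# Figures quoted in the article; nothing here is new data.
issues_per_ai_pr = 10.83
issues_per_human_pr = 6.45
issue_ratio = issues_per_ai_pr / issues_per_human_pr

# ASSUMPTION: the 0.07% upper bound applies to the full study sample.
sample_prs = 29_585
autonomous_upper_bound = 0.0007  # 0.07% expressed as a fraction
max_autonomous_prs = sample_prs * autonomous_upper_bound

# Survey figures: share who distrust AI code vs. share who verify it.
distrust_share = 0.96
verify_share = 0.48
gap_points = (distrust_share - verify_share) * 100

print(f"{issue_ratio:.2f}x issues per AI-generated PR")
print(f"at most about {max_autonomous_prs:.0f} autonomous merges in the sample")
print(f"{gap_points:.0f}-point gap between distrust and verification")
```

The ratio works out to roughly 1.68, which the article rounds to 1.7×.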

Why the merge gate is where the debt comes due

The study named a second finding that is, if anything, more consequential for anyone with audit or compliance responsibility. The researchers call it the observation boundary.

When repository automation executes a merge, the system event logs record the automated executor: the bot, the action, the pipeline step that performed the merge. They do not record the human decision-maker, or the substance of what that person reviewed before authorizing it [1]. The log captures who pushed the button. It does not capture who made the decision, or whether they understood what they were approving.

For an unregulated side project, that gap is harmless. For a regulated enterprise repository — one subject to SOC 2 change-management criteria, or to the documentation requirements arriving with the EU AI Act — that gap is an audit failure waiting to be discovered.

WHAT CI/CD LOGS RECORD | WHAT CI/CD LOGS MISS
Which bot or action executed the merge | Which human authorized the merge
Exact timestamp of the merge event | What risk context they had when deciding
Which automated checks passed | The rationale behind their approval
Target branch and commit SHA | What residual risk they accepted
Diff size and files touched | Whether the code was AI-authored
The observation boundary: CI/CD event logs were built to record code events, not governance decisions. An auditor asking "who decided this, and on what basis?" will find the right column empty — every time.

The control framework presumes a comprehending human signatory behind every production change. The CI/CD log presumes the executing process is sufficient evidence. Those are not the same artifact, and the difference between them is exactly what an auditor is trained to find.
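One way to picture the observation boundary is as two separate records, only one of which a pipeline fills in. The following is a minimal sketch in Python; the class names, field names, and the `audit_gaps` helper are all hypothetical, chosen to mirror the two columns above, and do not describe any particular CI/CD system's log schema.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class MergeEvent:
    """Hypothetical shape of what a CI/CD log records about a merge."""
    executor: str             # bot or action that executed the merge
    merged_at: str            # timestamp of the merge event
    checks_passed: list[str]  # automated checks that passed
    target_branch: str
    commit_sha: str
    diff_lines: int
    files_touched: int


@dataclass
class GovernanceRecord:
    """Hypothetical shape of what the log misses. None means 'not captured'."""
    authorized_by: Optional[str] = None
    risk_context: Optional[str] = None
    rationale: Optional[str] = None
    residual_risk_accepted: Optional[str] = None
    ai_authored: Optional[bool] = None


def audit_gaps(record: GovernanceRecord) -> list[str]:
    """Return the names of governance fields that were never captured."""
    return [f.name for f in fields(record) if getattr(record, f.name) is None]


# A pipeline that only logs the merge event leaves every governance field empty.
empty_record = GovernanceRecord()
print(audit_gaps(empty_record))
```

Under these assumptions, an auditor's question "who decided this, and on what basis?" maps onto the fields that `audit_gaps` reports as missing.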

The reflex that makes it worse

The instinct, when verification can't keep pace, is to add another agent to do the verifying. The industry is moving in this direction, and some of the tooling is genuinely good. Sonar's Agent Centric Development Cycle (AC/DC), announced at Sonar Summit in March and structured as a four-stage Guide → Generate → Verify → Solve loop, is a serious attempt to embed verification into the agent workflow rather than bolting it on after the fact [3]. Deterministic analysis inside the agent loop is a real improvement over hoping a human catches everything in a five-thousand-line diff.

But automated verification, however good, does not close the governance gap. It narrows the quality gap. Those are different gaps.

[Figure: two panels. Quality gap: the Guide → Generate → Verify → Solve loop of Sonar's Agent Centric Development Cycle (AC/DC), March 2026; code quality risk narrowed. Governance gap: who decided, what they saw when deciding, why they approved; no dedicated infrastructure, gap remains open. Tomosu, the merge-gate governance layer, captures the human decision as a structured, auditable first-class event.]
Two distinct gaps require two distinct solutions. Sonar's AC/DC loop and its equivalents narrow the quality gap. The governance gap (who decided, what they saw, why) has no dedicated infrastructure and remains open.

An AI reviewer can tell you the code is structurally sound, well-tested, and free of known vulnerability patterns. It cannot tell you whether this change belongs in this codebase right now — whether it fits the architecture, matches the direction the team committed to, and is something the organization is prepared to be accountable for in production. That last judgment is the merge decision, and it is the one that has to remain a deliberate, attributable human act.

The verification agents handle the quality. The merge gate has to handle the governance. Conflating the two is how organizations end up with excellent automated checks and no defensible record of who decided to ship.

What a governed merge gate actually produces

The merge decision is where AI-generated code becomes the company's code. Before the merge, it is a candidate. After the merge, it is a liability — operational, architectural, regulatory. Every governance question that matters about AI-generated code resolves at that boundary, and the boundary is the one point in the pipeline that has full context — the codebase, the team, the history, the deployment posture — while still being upstream of consequence.

A merge gate worth building scores the decision, not just the diff. It produces a single, trendable signal at the moment of the merge: how large is the real blast radius, how critical is the service being touched, was this code AI-generated and to what degree, is this code path historically associated with incidents, does it touch regulated data flows.

[Figure: Production Risk Index (Tomosu) for PR #4821, "Add payment webhook handler", Claude Code, 847 lines, 12 files.
Change Scope: 847 lines, 12 files (HIGH)
Service Criticality: Payments & billing service (CRITICAL)
AI-Origin Signal: 94% AI-authored, Claude Code (HIGH)
Incident History: 3 incidents in 90 days on this path (MEDIUM)
Compliance Exposure: Touches PCI data flows (HIGH)
PRI: 87/100, ESCALATE. Decided by: Sarah Chen. Reviewed: 14:32 UTC. Residual risk accepted: on record.]
The Production Risk Index scores every PR across five dimensions and captures the human decision-maker alongside the score. The CI/CD log records who executed. The governance record captures who decided — and the basis on which they decided it.
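To make the idea of scoring the decision concrete, here is a rough sketch of how five dimension levels could be folded into one number and stored alongside the human decision. It is not Tomosu's scoring method, which this article does not describe: the level-to-points mapping, the equal weighting, the escalation threshold, the `PROCEED` label, and every name in the code are assumptions made for illustration.

```python
from dataclasses import dataclass

# ASSUMPTION: hypothetical mapping from a qualitative level to points.
LEVEL_POINTS = {"LOW": 25, "MEDIUM": 50, "HIGH": 75, "CRITICAL": 100}

# ASSUMPTION: hypothetical score at or above which a PR is escalated.
ESCALATE_AT = 70


@dataclass(frozen=True)
class RiskSignals:
    """The five dimensions named in the article, each as a level string."""
    change_scope: str
    service_criticality: str
    ai_origin: str
    incident_history: str
    compliance_exposure: str


def risk_index(signals: RiskSignals) -> int:
    """Equal-weight average of the five dimension levels, on a 0-100 scale."""
    levels = [
        signals.change_scope,
        signals.service_criticality,
        signals.ai_origin,
        signals.incident_history,
        signals.compliance_exposure,
    ]
    return round(sum(LEVEL_POINTS[level] for level in levels) / len(levels))


def decision_record(signals: RiskSignals, decided_by: str, rationale: str) -> dict:
    """Bundle the score with who decided and why, as one attributable record."""
    score = risk_index(signals)
    return {
        "score": score,
        "action": "ESCALATE" if score >= ESCALATE_AT else "PROCEED",
        "decided_by": decided_by,
        "rationale": rationale,
    }


example = RiskSignals("HIGH", "CRITICAL", "HIGH", "MEDIUM", "HIGH")
record = decision_record(
    example,
    decided_by="reviewer@example.com",
    rationale="Residual risk accepted after reviewing the rollback plan",
)
print(record["score"], record["action"])
```

With the levels from the figure, this equal-weight version yields 75 rather than the 87 shown, which is expected given that the real weighting is unknown here. The point of the sketch is the shape of the output: a score and the decision-maker travel together in one record.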

The score does not replace the human reviewer. It gives the reviewer the context they were already reaching for, and it produces the attributable record the CI/CD log never will — closing the observation boundary by capturing not just who merged, but the basis on which they decided.

That record is what turns verification debt from an invisible liability into a measured, managed one. You cannot pay down a debt you cannot see. The first step is instrumenting the place where it accrues.

This is what we built

Tomosu is a merge-gate governance layer. It scores every pull request, especially those carrying AI-generated code, at the moment of the merge decision, producing a Production Risk Index across change scope, service criticality, AI-origin signal, incident history, and compliance exposure, with the auditable trail behind it. The human still decides. The decision finally goes on the record.

We are live: a private beta, and a public trial you can download and run with no credit card required.

If 96% of your repository's work is now agent-initiated and your governance is still one human and one diff at the end of the pipeline, the verification debt is already accruing. The question is whether you can measure it.

Try it free, no credit card: tomosu.ai


Tomosu builds merge-gate governance infrastructure for teams where AI coding agents have outpaced the audit trail.

References

  1. "Collaborator or Assistant? How AI Coding Agents Partition Work Across Pull Request Lifecycles," arXiv:2605.08017, May 2026. 29,585 PR lifecycles across OpenAI Codex, GitHub Copilot, Devin, Cursor, and Claude Code. ≥96% agent-initiated in collaborator workflows; autonomous-merge upper bound 0.07%; observation-boundary finding on merge logging.
  2. Sonar survey data and Werner Vogels "verification debt" framing, as reported in coverage of Sonar Summit 2026 (byteiota, April 1, 2026): AI authored ~41% of code in 2026; 96% developer distrust vs. 48% pre-commit verification; AI PRs contain 1.7× more issues (10.83 vs 6.45) per CodeRabbit analysis.
  3. "Sonar Introduces the Agent Centric Development Cycle," Sonar press release, March 3, 2026 (Sonar Summit, Austin). AC/DC four-stage loop: Guide → Generate → Verify → Solve. CEO Tariq Shaukat.