Production Resilience Is the Thing You Stopped Buying When You Bought Velocity

The engineering velocity numbers look great. The on-call rotation is on fire. These two things are now happening at the same time, on purpose, by accident, in roughly every enterprise that has adopted autonomous coding agents at scale.

Production resilience is the thing that broke when the change rate decoupled from the review rate. The dashboard does not show the break. The on-call engineer pays for it. The CFO signs the quarterly without knowing it is happening. The General Counsel inherits the regulator's question when the loss finally surfaces.

This is what we built Tomosu for. This is what the next 90 days of pilot work will look like for the three to five engineering organizations we partner with this quarter.

The one-line version of what the product does: Tomosu scores every pull request for reliability, operational risk, security exposure, and policy compliance, against the production system the change is about to enter.

What broke

Engineering organizations adopting autonomous coding agents follow the same trajectory. The shape is consistent enough now to call it a curve.

Month one, the velocity dashboard goes green. Pull requests get larger. Cycle times drop. Senior engineers who shipped one feature a sprint are now shipping four. Leadership presents the throughput numbers to the board.

Month four, the pager starts ringing. Race conditions that did not trigger on the local test suite. Memory leaks that took a week of production traffic to manifest. Cache invalidations that worked correctly in isolation and failed against the legacy auth service the agent had never been told about. The change shipped two months ago. The agent that wrote it is no longer in context. The engineer who reviewed it does not remember the diff.

Month five, the senior engineers start leaving. Exit interviews say burnout, or family, or a new opportunity. The pattern inside the exit interviews is the same: I am on call for code I cannot defend and did not write, and the pager is ringing more than it used to.

The velocity dashboard stays green. The on-call rotation is in crisis. The reliability cost accumulates in the rotation, the exit interviews, and eventually the quarterly. The two metrics have come unmoored from each other, and the gap between them is the wound the organization is now bleeding from.

Faros AI's Acceleration Whiplash study, looking at 22,000 developers, measured the wound. Incidents per pull request rose 242% over a two-year horizon. Bugs per developer rose 54%. AI-co-authored PRs averaged 10.83 distinct structural issues each, against 6.45 for purely human-written ones. Roughly 31% of changes were shipping with no human review at all. Veracode's 2025 GenAI Code Security Report put the AI-code security-fail rate at around 45%.

These are not abstract numbers. They are descriptions of a person, asleep, until the phone vibrates.

Why local tests do not catch it

The local tests do not catch it because the agent that wrote the code also wrote the local tests. A test written by the author of the code is a second draft of the same mind agreeing with itself. The mind got it wrong in the same way twice. The mind did not consider the legacy auth service because the mind did not know the legacy auth service existed.

The agent tests what it can imagine. The thing it cannot imagine is the production system it has never seen. The mind passed its own tests. The pager rang anyway.

This is the self-grading loop, and it is the simplest part of the problem. The harder part is that even fixing the self-grading loop does not fix the broader problem. Blast radius is a property of the surrounding system, not of the code in isolation. The cache invalidation looks correct in isolation. It is correct in isolation. What is not correct is the assumption it makes about how the surrounding system behaves under concurrent load. And the surrounding system was never in the diff.
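
To make the blast-radius point concrete, here is a minimal sketch, assuming a hypothetical service dependency map. The service names and the `blast_radius` helper are illustrative assumptions, not a description of how Tomosu computes anything. What it shows is that the set of systems a change can reach comes from the surrounding graph, which the diff never contains.

```python
from collections import deque

# Hypothetical map: service -> services that consume it downstream.
# These names are illustrative assumptions, not a real topology.
DEPENDENTS = {
    "cache": ["session-api", "legacy-auth"],
    "session-api": ["checkout"],
    "legacy-auth": ["checkout", "admin-portal"],
    "checkout": [],
    "admin-portal": [],
}


def blast_radius(changed, dependents):
    """Return every service reachable downstream of the changed services."""
    seen = set()
    queue = deque(changed)
    while queue:
        service = queue.popleft()
        for downstream in dependents.get(service, []):
            if downstream not in seen and downstream not in changed:
                seen.add(downstream)
                queue.append(downstream)
    return seen


# The diff only touches "cache"; the graph says what else is exposed.
affected = blast_radius({"cache"}, DEPENDENTS)
print(sorted(affected))
```

In this toy graph a cache-only diff still reaches the legacy auth service, which is the kind of exposure a test suite scoped to the diff has no way to see.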

A verifier that can score that assumption has to live outside the loop the agent operates in and above the diff the agent produces. The on-call engineer at 2am is providing exactly that verification, retroactively, expensively, under stress, with the cost socialized to their sleep.

This is what Tomosu was built to replace.

What Tomosu does

Tomosu sits above your existing GitHub or GitLab as the verifier the architecture has been missing. Every pull request, especially AI-generated ones, gets scored before it lands on main. The score is the Production Reliability Index, a single trendable number composed of signals you can name to a board.

Four headline signals anchor the PRI. AI-origin is a cross-cutting modifier, not a fifth dimension: it sharpens each signal when the author is an agent rather than a human who can defend the diff.

Change scope. The blast radius of the diff. What services, what data paths, what downstream systems this change can affect. The cache invalidation that looked harmless in isolation gets surfaced because its blast radius touches the legacy auth service the agent did not know about.

Service criticality. What it costs if the service this change touches goes down. The tier of the system the diff is reaching into. A change to the payments path scores differently from a change to the marketing site, even if the diffs look identical.

Compliance exposure. Whether the change touches regulated paths. PCI on the payments side. HIPAA on the patient-data side. SOC 2 on the audit-bearing side. The change that the CFO will be asked about by a regulator gets flagged before it ships, not after.

Policy adherence. Whether the change conforms to the organization's declared engineering and security policies. CODEOWNERS satisfied. Required reviewers actually reviewed. Branch protection respected. The dependency the agent pulled in is on the approved allowlist. Each of these is binary or near-binary, each is auditable, and together they answer the SOC 2 CC8.1 question: was this change made under the controls the organization said it operates under.
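
As an illustration of what "binary and auditable" can look like, here is a minimal sketch, assuming a hypothetical `PullRequest` record and an illustrative dependency allowlist. None of these field names come from a real Git host API or from Tomosu itself. Each policy becomes a named pass/fail result that can be stored alongside the merge decision.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PullRequest:
    # Hypothetical fields; a real integration would read these from the Git host.
    required_reviewers: frozenset
    approving_reviewers: frozenset
    codeowners_approved: bool
    branch_protection_bypassed: bool
    new_dependencies: frozenset


# Illustrative allowlist, assumed for this example only.
APPROVED_DEPENDENCIES = frozenset({"requests", "cryptography"})


def policy_checks(pr):
    """Evaluate each policy as a named pass/fail so the result can be audited."""
    return {
        "codeowners_satisfied": pr.codeowners_approved,
        "required_reviewers_reviewed": pr.required_reviewers <= pr.approving_reviewers,
        "branch_protection_respected": not pr.branch_protection_bypassed,
        "dependencies_on_allowlist": pr.new_dependencies <= APPROVED_DEPENDENCIES,
    }


pr = PullRequest(
    required_reviewers=frozenset({"alice"}),
    approving_reviewers=frozenset({"alice", "bob"}),
    codeowners_approved=True,
    branch_protection_bypassed=False,
    new_dependencies=frozenset({"leftpad-ng"}),  # not on the allowlist
)
results = policy_checks(pr)
failed = [name for name, passed in results.items() if not passed]
print(failed)
```

Keeping the checks as a dictionary of named booleans, rather than a single pass/fail, is a design choice for this sketch: the record shows which control failed, not only that one did.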

The AI-origin modifier is applied not because AI is bad, but because the verification discipline a human reviewer applied informally has to be made explicit when the author is not human.

These are four of the seven signals that compose the PRI. The full set rolls up into one number the board can track across quarters. The point is not the signal count. The point is that the score is computed at the moment the merge decision is made, against the system the change is about to enter, with the receipts attached.
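
One way to picture a rollup of this kind is a weighted combination of per-signal risks with the AI-origin modifier applied to each signal. The sketch below is an assumption-heavy illustration: the weights, the multiplier, the 0 to 100 scale, and the `reliability_index` function are all hypothetical, and only the four signals named above are included because the others are not described here.

```python
# Hypothetical weights for the four named signals (assumed values).
WEIGHTS = {
    "change_scope": 0.3,
    "service_criticality": 0.3,
    "compliance_exposure": 0.2,
    "policy_adherence": 0.2,
}
AI_ORIGIN_MULTIPLIER = 1.25  # assumed value, for illustration only


def reliability_index(risks, ai_authored):
    """Combine per-signal risks (0.0 = safe, 1.0 = worst) into a 0-100 score."""
    total = 0.0
    for signal, weight in WEIGHTS.items():
        risk = risks[signal]
        if ai_authored:
            # The modifier sharpens every signal rather than adding a new one.
            risk = min(1.0, risk * AI_ORIGIN_MULTIPLIER)
        total += weight * risk
    return round(100 * (1 - total), 1)


risks = {
    "change_scope": 0.6,
    "service_criticality": 0.8,
    "compliance_exposure": 0.4,
    "policy_adherence": 0.0,
}
human_score = reliability_index(risks, ai_authored=False)
agent_score = reliability_index(risks, ai_authored=True)
print(human_score, agent_score)
```

Under these assumed numbers the same diff scores lower when an agent wrote it, which is the behavior a cross-cutting modifier is meant to produce.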

Over time, the moat compounds through a cross-fleet governance graph that learns from millions of code changes, incidents, and remediation actions across organizations. Each customer's deployment makes every customer's verification sharper, without crossing any tenant boundary on the underlying code or data.

What changes for the people in the chain

The same shift plays out in four roles: the verification burden moves from retroactive and human to proactive and systematic. The on-call engineer, the senior reviewer, the CFO, and the General Counsel each receive the same underlying upgrade.

The on-call engineer no longer responds to incidents on changes they did not review and cannot defend. The senior engineer's attention goes to the surfaced changes, with the score and context visible. The CFO sees, in a board-trendable score, what the organization's production risk posture actually is. The General Counsel has a structured record that was produced when the decision was made, not reconstructed afterward.

The on-call rotation stops being the verification layer of last resort. That role belongs at the merge gate, where it can be done deliberately, in business hours, with full context, before the change ships.

A 90-day pilot

We are partnering with three to five engineering organizations in regulated industries this quarter. The pilot is measurable, not vibes.

Four milestones. Read-only integration at Week 1. Visible risk ledger at Day 30. Closed-loop policy enforcement at Day 60. Board-ready PRI trendline at Day 90. The PR queue keeps moving throughout.

Week one. Read-only integration with your GitHub or GitLab, observability stack, and ticketing. No agents added to the merge path. No rip-and-replace. The PRI starts scoring your existing PR flow against your existing services and policies.

Day 30. Visible risk ledger. Every pull request in the last 30 days has a score and a reason. The board number now has a trendline, for the first time in your team's history. The fragility that has been accumulating is now visible before it turns into an incident.
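
A ledger of this shape can be pictured as one record per scored pull request, rolled up into a trend. The sketch below assumes a hypothetical `LedgerEntry` record and made-up sample rows; it is not Tomosu's schema, and the weekly averaging is just one plausible way to trend the number.

```python
from dataclasses import dataclass
from datetime import date
from statistics import mean


@dataclass(frozen=True)
class LedgerEntry:
    # Hypothetical record shape: one row per scored pull request.
    pr_number: int
    merged_on: date
    score: float
    reason: str


def weekly_trend(entries):
    """Average score per ISO (year, week), in chronological order."""
    buckets = {}
    for entry in entries:
        year, week, _ = entry.merged_on.isocalendar()
        buckets.setdefault((year, week), []).append(entry.score)
    return {key: round(mean(scores), 1) for key, scores in sorted(buckets.items())}


# Made-up sample rows, for illustration only.
ledger = [
    LedgerEntry(101, date(2025, 3, 3), 82.0, "small diff, low-tier service"),
    LedgerEntry(102, date(2025, 3, 5), 46.0, "touches payments path, no CODEOWNERS approval"),
    LedgerEntry(103, date(2025, 3, 11), 71.0, "agent-authored, reviewed by owner"),
]
trend = weekly_trend(ledger)
print(trend)
```

Storing the reason next to the score is what lets the same rows serve both the trendline and the later question of why a particular change was allowed to merge.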

Day 60. Closed loop. Policy is set. Surfacing rules are tuned to your services. AI-origin modifiers are calibrated to your codebase. Escalation paths route to the right reviewers.

Day 90. Board-ready PRI trendline. The production reliability number is a metric your team owns. The merge decision is on the record. The on-call rotation is no longer being asked to do verification work the merge gate should have done.

The line

Production resilience is what you stopped buying when you bought velocity. The dashboard did not tell you because the dashboard cannot see the cost. The cost lives in the on-call rotation, in the exit interviews, in the regulator's question that has not yet arrived but will. The merge gate is where the cost can be paid in advance, deliberately, with the receipts attached, instead of retroactively, at 2am, by the engineer who happened to be holding the pager.

Tomosu is the gate.

We are opening three to five pilot partnerships with engineering organizations in regulated industries this quarter. Thirty minutes to see your PRI score on your actual codebase. Book a conversation →

Questions? contact@tomosu.ai · tomosu.ai