
Blog · May 31, 2026

How an AI-Native Engineering Team Does Code Review

When the writer and the reviewer are the same intelligence, the pull request gate stops doing what it was designed to do. We rebuilt code review at AgentField from the ground up — what changed, why, and what we open-sourced.

Santosh Kumar Radha, Co-founder & CTO



Figure: An engineer reading a pull request through a telescope of risk dimensions.

An engineer's day, for most of the last fifty years, was structured around the parts of software development that required human intelligence: reading code, reasoning about it, writing more of it, debating tradeoffs in design reviews, holding the model of a complex system in your head while you changed one small piece of it. The structure of the day worked because the intelligence doing the work was the same kind on every part of the loop. Over the last two years, that has stopped being true. A meaningful share of the work that used to require an engineer can now be done by a different kind of intelligence, and the practices we built around the engineer's day were designed when the intelligence in the seat was a person.

Almost every software engineering practice we run today was shaped by that older assumption. Standups, sprints, code review, release engineering, on-call rotations, design reviews, the scrum and waterfall structures underneath them: each was designed for a workflow in which humans were the only intelligence moving the work forward. When the kind of intelligence in the seat changes, the practice does not just need to be faster or cheaper. The job the practice was doing changes too, because the job was always a function of who was sitting in the seat.

We think there is a real advantage available to teams willing to rebuild these practices from scratch instead of retrofitting AI into the structure they already have. Teams that do the rebuild earlier will get years of compounding benefit; teams that keep the older shapes around will spend the same years working harder to get less. There is no portable "AI process" we know of that does the rebuild for you. Each team has to do the work itself, because the answer depends on the specific intelligence stack the team runs, where it routes which decisions, where it keeps a human in the loop, and where the local context makes a non-obvious choice the right one.

This post is about one of those rebuilds: pull request review at AgentField. We are starting with this one because it is the practice we have the clearest view on. Sprints, releases, on-call, design review, and the rest are conversations of their own, and we will write about them as the picture sharpens.

Why we stopped gating pull requests on correctness

The part of our engineering practice that most surprises engineers we describe it to is this: at AgentField, the pull request is not the gate where we decide whether the code is correct.

The reaction this gets, the first time, is reasonable. For most of the time any of us have been writing software, the pull request gate has been the place where teams catch bugs and enforce quality, and removing it from that role looks, on the surface, like removing quality control altogether. Careful review is still part of how we ship code. What we stopped doing is asking the pull request, specifically, to be the place where correctness gets certified. The intelligence writing the code and the intelligence available to check it became the same kind of intelligence, and that fact alone changed what the pull request gate could be expected to do.

What pull request review was actually doing for us

Figure: The four jobs of code review and how each one changes when AI writes the code.

The reason removing correctness from the gate did not break anything important is that correctness was never the only thing the gate was doing. The clean story about code review is that it is a correctness check: a second pair of eyes catches what the first pair missed, bugs get filtered out before merge, and the codebase stays healthy. The clean story has always been incomplete. Looking at the pull request gate as it has worked for the last twenty years, it was doing four jobs at once, and only one of them was the job everybody talked about:

Correctness. Whether the code does what it claims to do.
Knowledge transfer. Whether at least one other person on the team now understands how this piece of the system works.
Risk visibility. Whether the team can see, at the moment of merge, what is being introduced into the system they collectively own.
Distributed accountability. Whether a second human's judgment, and a second human's name, sit next to the change.

These four jobs traveled together for so long that nobody thought to separate them, and the reason they could travel together was speed. A medium-sized change used to take an engineer the better part of a day to write. The reviewer would spend an hour or two reading it, asking questions, suggesting changes, and that span of reading and conversation was where all four jobs happened in parallel. Correctness got checked because the reviewer was reading carefully. Knowledge transferred because the reviewer had to understand the change well enough to comment on it. Risk became visible because the change passed in front of multiple humans before it merged. Accountability got distributed because the reviewer's approval sat in the audit log next to the author's commit. The bundle was efficient because the work could not be done quickly, and the slowness was paying for the breadth.

When AI started writing significant portions of the code, the writing speed went up by roughly an order of magnitude, and the bundle came apart unevenly. Each of the four jobs changed in a different way, and it is worth taking them one at a time.

Correctness became cheap and parallel. The same intelligence that wrote the code can now check the code, in seconds, in arbitrary numbers of parallel passes. The job has not disappeared. It has stopped being the kind of work that requires a human reviewer sitting in front of the diff to do it.

Knowledge transfer became asymmetric. When a human wrote a change, the reviewer's questions taught the author something, the author would push back, the reviewer would learn from the pushback, and the team's shared model of the system grew denser on both sides. When an AI writes the change, the reviewer still learns about the code, but the author does not learn anything that propagates back into the team's shared model. The flow that used to go in both directions now only goes one way.

Risk visibility became more, not less, important. The volume of code passing through the gate has grown faster than the team's capacity to read it carefully, and that means the gate's role as the place where the team sees what is being shipped has gone from useful to load-bearing.

Distributed accountability stayed exactly where it was. No intelligence on the artificial side can yet bear accountability. The audit log still needs a second human's name. The legal, ethical, and organizational systems that surround software still assume that a person made the call. The deeper version of this — the accountability gap that opens up as AI takes more autonomous actions — is its own essay.

The pull request, as a structure, did not change. The mix of work it was carrying did.

Three eras of code review

Figure: Three eras of code review drawn as a rising staircase, with the human role stepping up the abstraction ladder while writer and reviewer merge into one intelligence.

The work the pull request gate has been doing has gone through three eras over the last few years, defined by who was writing the code and who was reviewing it.

In the first era, humans wrote and humans reviewed. The bundle worked the way it had for decades. Writing was slow, reading was slow, and the correctness check was honest because a fresh reader could plausibly notice things the original author had missed.

In the second era, which is where most engineering teams operate today, humans and AI collaborate on writing the code. The human is still in the inner loop of every change, accepting suggestions from Copilot or Cursor, modifying drafts from Claude Code, reviewing AI-generated diffs before submission. AI began to participate on the review side too; CodeRabbit, Qodo, Greptile, Sourcery, Ellipsis, and similar products became reasonable additions to the gate. Writing got significantly faster than in the first era, and the reviewer side got cheaper and more parallel, but the bundle continued to hold because the writer was still a person who could be educated by the reviewer's feedback. Knowledge kept transferring back, even when the reviewer was no longer a colleague.

The third era is not yet prevalent outside a handful of frontier labs, but it is the direction the work is heading. In the third era, autonomous fleets of AI engineers write code continuously, and there is no human in the inner loop of any individual change. This is the era we already operate in at AgentField. Our SWE agents run against open issues in parallel branches and ship code at a velocity that does not belong to the same process the second era was running. The orchestration system we run them under is open-sourced as SWE-AF, and we wrote up the lessons from running it on real production builds in Beyond Vibe Coding.

The velocity that opens up in the third era is structural. The human's job moves up a level of abstraction with each era: reading each line in the first, accepting or modifying each change in the second, setting goals and reviewing system-level outcomes in the third. The compounding velocity is in that abstraction shift, and most of it stays invisible to teams still operating under second-era assumptions.

The bundle of four jobs the pull request was carrying (correctness, knowledge transfer, risk visibility, distributed accountability) comes apart most visibly in this era, because the assumption underneath the bundle was that a human would be in front of the diff at some point. When that stops being true, the synchronous sequence of submit, block, wait for review, and merge buys very little new information; the same intelligence that wrote the code could already check it while writing. The latency in the middle is structure left over from a time when a human in the inner loop could be educated by review feedback. Replacing the gate with something that does the work the gate was actually doing for the rest of the team is the rebuild this post is about.

There is also a longer conversation about what this shift does to the standalone code-review business as a category. That conversation is its own piece, and we are writing it up separately.

What we use the pull request for now

Figure: The pull request as a risk telescope, a multi-dimensional readout where every dimension has its own threshold.

Architecture decisions at AgentField happen before a pull request opens. Module boundaries, interface contracts, dependency choices, the way new code will fit into existing systems: all of this is settled in design conversations among the engineers responsible for that area, and by the time a branch lands in the PR queue, the architectural questions have already been answered. If a pull request ends up challenging an architectural decision, what we have on our hands is a design conversation that should have happened earlier, and we have it then, in the open, instead of burying it inside a review thread.

Given that architectural work is settled upstream, what the pull request gate does is make the risk distribution of a specific change visible. We treat the gate as a telescope rather than a test. Its output is not a verdict on whether the code is correct. It is a map of where in the change the risk lives: where the diff touches security-sensitive code, where compound interactions across files might exist, where the change is making an assumption that may not hold under production load. The map has dimensions, and the dimensions have different thresholds. A finding on the security dimension above its threshold blocks the merge. A finding on the naming-consistency dimension above its threshold gets logged and the code ships. A finding on architectural fit is not a finding for review at all; it is a signal that the upstream design conversation missed something, and it goes back to the design queue rather than to the reviewer. The thresholds are explicit, the team sees them, and they get tuned over time because the cost of being wrong is different for every dimension.

The practical difference between a verdict and a map shows up in who consumes the output. A pass/fail verdict gets consumed by a merge button. A risk map gets consumed by a person who is deciding what to attend to this week, given everything else on the engineering plate. That difference, more than any other, is what we mean when we say the pull request gate has become a telescope. The judgment is still ours. The instrument has become honest about what kind of judgment it is helping us make.
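
As a rough illustration of the per-dimension thresholds described above, the routing can be sketched in a few lines of Python. This is not PR-AF's actual code. The `Finding` shape, the 0.0 to 1.0 severity scale, the dimension names, and the numeric thresholds are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Action(Enum):
    BLOCK = "block the merge"
    LOG = "log it and ship"
    REDESIGN = "send back to the design queue"


@dataclass(frozen=True)
class Finding:
    dimension: str   # hypothetical names, e.g. "security"
    severity: float  # hypothetical 0.0-1.0 score assigned by a reviewer
    summary: str


# Hypothetical policy: each dimension has its own threshold and its own
# consequence for crossing it. The numbers are placeholders, not real values.
POLICY = {
    "security": (0.3, Action.BLOCK),
    "naming_consistency": (0.6, Action.LOG),
}


def route(finding: Finding) -> Optional[Action]:
    """Return what happens to a finding, or None if it is below its threshold."""
    if finding.dimension == "architectural_fit":
        # Not a review finding: the upstream design conversation missed something.
        return Action.REDESIGN
    threshold, action = POLICY[finding.dimension]
    return action if finding.severity > threshold else None


def merge_blocked(findings: List[Finding]) -> bool:
    """A merge is blocked only when some finding routes to BLOCK."""
    return any(route(f) is Action.BLOCK for f in findings)
```

The point of the sketch is that the output is a set of routed findings per dimension, and the merge decision is derived from them rather than being the only thing the gate produces.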

Risk as a threshold, not a verdict

Figure: Finding bugs is not fixing them: the tangled maze of code review discoveries versus the single clean line of the patch.

Two principles fall out of running this practice, and they are the part of the rebuild we expect to translate to teams whose intelligence stack and codebase look different from ours.

The first principle is asymmetry. Finding all the risks in a piece of code is, in practice, an intractable problem. The space of things that could go wrong, in interaction with the rest of the codebase and the systems it touches, is too large to enumerate, and the cost of trying to enumerate it grows faster than the cost of writing the code in the first place. Fixing a risk that has been named, by contrast, is usually fast. Most known risks take an engineer minutes to fix once they have been described in enough detail. The expensive half of code quality work is discovery, and a process designed around that asymmetry invests heavily in discovery: more reviewer agents, more dimensions of analysis, more passes that look for cross-file interactions. It stops trying to certify the absence of risk altogether, because certifying the absence of risk was never the cheap part.

The second principle is that risk is a threshold rather than a binary. Every dimension has its own bar. Security has a strict bar, and a finding that crosses it blocks the merge; naming consistency has a lenient one, and a finding that crosses it is logged while the code ships; architectural-fit risks belong to a different process altogether. We treat the bars as adjustable parameters of the gate rather than fixed values baked in somewhere. They are visible to the team, they get debated, and they get tuned as the team's intuition about the cost of being wrong on each dimension improves. The bar for any given dimension on day one is rarely the right bar a quarter later, and that is fine; the goal of the practice is to make the bar a knob the team can turn rather than a value buried in someone's head.
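
One way to picture a bar that is a knob rather than a value buried in someone's head is a small structure that keeps every bar in one visible place and records why each change was made. The sketch below is an assumption-laden illustration, not how PR-AF stores its configuration: `ThresholdBoard`, the 0.0 to 1.0 scale (where a lower number means a stricter bar), and the example values are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ThresholdBoard:
    """Per-dimension bars in one visible place, with a history of every change."""
    bars: Dict[str, float]
    # Each entry: (dimension, old bar, new bar, reason given for the change)
    history: List[Tuple[str, float, float, str]] = field(default_factory=list)

    def tune(self, dimension: str, new_bar: float, reason: str) -> None:
        """Move one bar and record why, so the change can be debated later."""
        if dimension not in self.bars:
            raise KeyError(f"unknown dimension: {dimension}")
        if not 0.0 <= new_bar <= 1.0:
            raise ValueError("bar must be between 0.0 and 1.0")
        old_bar = self.bars[dimension]
        self.bars[dimension] = new_bar
        self.history.append((dimension, old_bar, new_bar, reason))

    def exceeds(self, dimension: str, severity: float) -> bool:
        """True when a finding's severity crosses the current bar."""
        return severity > self.bars[dimension]


# Hypothetical starting values, then one tuning step a quarter later.
board = ThresholdBoard({"security": 0.2, "naming_consistency": 0.7})
board.tune("naming_consistency", 0.85, "naming findings were mostly noise")
```

Keeping the reason next to the old and new values is the design choice that matters here: it turns a tuning change into something the team can read and argue with.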

Risk is a dial, not a verdict. A code review process that accepts this is honest about what it can do and what it cannot. The older shape, which tries to deliver a verdict, asks the gate to do a job the system underneath it no longer supports.

Once we took the asymmetry seriously for pull request review, it started reshaping other practices around it. Release engineering, evaluation, incident response, on-call ownership, even how we think about engineering ladders: each of them has a version of "we are gating on the wrong thing for the wrong reason" buried inside it, once you start asking which part of the work is actually discovery and which part is fix. Each of those is its own rebuild, and we expect to write about them separately. The lens itself transfers cleanly enough: the moment a team stops treating quality as a pass/fail certification and starts treating it as a threshold on a measured dimension, most of the engineering structure around quality starts to look like it could be redesigned.

PR-AF: the reviewer we open-sourced

We did not design this practice up front. It came out of running our own engineering on AgentField, the platform we build for this kind of process redesign. AgentField lets the team that runs a workflow decide how work routes through it, where the gates sit, and where the thresholds land, rather than taking a tool's defaults. The piece that makes that possible is the harness, which we covered separately in What Is Harness Orchestration. The pull request rebuild is one workflow we ran through it. Others are still in progress.

The piece of the rebuild we are open-sourcing now is the reviewer itself, and the reason it is open source ties back to the point this post opened with. No two engineering teams have the same DNA: the intelligence stack they run, the dimensions of risk they care about, the thresholds they tune, the way the rest of their practices fit together. A closed-source reviewer with hardcoded defaults could only ever fit one team's version of the answer, and we do not believe one version of the answer exists. The right form for a tool that has to be molded to each team's shape is open source.

What we built is called PR-AF, and it is the philosophy of this post implemented as a system you can run. For each pull request, PR-AF reasons about which dimensions of risk the specific change introduces instead of running a fixed checklist. It spawns reviewer agents per dimension, runs them in parallel, has them challenge their own findings adversarially to suppress false positives, and surfaces the resulting risk map as inline GitHub comments. A deep review of a five-hundred-line PR costs about eighty cents in LLM calls. The output is a map, not a verdict.
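
The shape of that pipeline (one reviewer per dimension, run in parallel, with an adversarial pass that drops findings which do not survive a challenge) can be sketched with `asyncio`. This is a simplified illustration under stated assumptions, not PR-AF's implementation: the reviewer and challenger are injected as plain callables, the fake versions below stand in for LLM calls, and the steps that choose dimensions and post GitHub comments are left out.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Dict, List


@dataclass(frozen=True)
class Finding:
    dimension: str
    location: str
    claim: str


# Hypothetical interfaces: a reviewer proposes findings for one dimension,
# a challenger tries to refute a finding and returns True if it survives.
Reviewer = Callable[[str, str], Awaitable[List[Finding]]]
Challenger = Callable[[Finding, str], Awaitable[bool]]


async def review_dimension(
    dimension: str, diff: str, reviewer: Reviewer, challenger: Challenger
) -> List[Finding]:
    findings = await reviewer(dimension, diff)
    # Adversarial pass: challenge every finding concurrently, keep survivors.
    survived = await asyncio.gather(*(challenger(f, diff) for f in findings))
    return [f for f, keep in zip(findings, survived) if keep]


async def build_risk_map(
    diff: str, dimensions: List[str], reviewer: Reviewer, challenger: Challenger
) -> Dict[str, List[Finding]]:
    # One reviewer task per dimension, all running in parallel.
    results = await asyncio.gather(
        *(review_dimension(d, diff, reviewer, challenger) for d in dimensions)
    )
    return dict(zip(dimensions, results))


# Deterministic stand-ins so the sketch runs without any model calls.
async def fake_reviewer(dimension: str, diff: str) -> List[Finding]:
    return [
        Finding(dimension, "app.py:10", "concrete issue"),
        Finding(dimension, "app.py:20", "speculative issue"),
    ]


async def fake_challenger(finding: Finding, diff: str) -> bool:
    return "speculative" not in finding.claim


risk_map = asyncio.run(
    build_risk_map("diff text", ["security", "concurrency"], fake_reviewer, fake_challenger)
)
```

The result is keyed by dimension rather than collapsed into a single pass/fail value, which is what lets a later step apply a different threshold to each dimension.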

Figure: Inside PR-AF: the autonomous multi-agent pipeline behind risk-as-map, with the per-team dimension configuration panel showing how two different teams pick different risk dimensions on top of the same architecture.

You can deploy PR-AF directly as a starting point, or you can fork it and reshape it around your team's stack: your risk dimensions, your thresholds, your judgment of which signals should block a merge and which should ship to a log instead. We expect most teams who take the practice seriously to do the second, eventually. The philosophy is what travels between teams; the specific code is one version of how that philosophy can be expressed in a particular team's hands.

There is a wider reason for open-sourcing it as well. The operating shift this post describes is not, in the end, the kind of advantage worth holding onto. Teams that figure out the shift earlier will build software that fits the new foundation; teams that figure it out later will build it anyway, just at the cost of more pain in the meantime, and we would rather speed up the broader conversation than draw lines around the answer.

Clone it, point it at one of your pull requests, and judge for yourself whether risk-as-map is more useful to your team than verdict-as-gate. The ground has shifted. The practices on top of it are going to change with it, one workflow at a time, and the pull request is the workflow we got to first.

The code

PR-AF is the open-source reviewer described in this post. AgentField is the platform underneath it. SWE-AF is the autonomous-fleet orchestration system that put us in the third era to begin with. All three are Apache 2.0.

curl -X POST http://localhost:8080/api/v1/execute/async/pr-af.review \
  -H "Content-Type: application/json" \
  -d '{"input": {"pr_url": "https://github.com/owner/repo/pull/123"}}'
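
For teams that would rather trigger a review from a script, the same request can be built with Python's standard library. The endpoint and payload are copied from the curl command above; the function names are made up for this example, and the response is returned as raw text because its format is not described in this post.

```python
import json
import urllib.request

# Endpoint and payload shape taken from the curl command above.
REVIEW_ENDPOINT = "http://localhost:8080/api/v1/execute/async/pr-af.review"


def build_review_request(
    pr_url: str, endpoint: str = REVIEW_ENDPOINT
) -> urllib.request.Request:
    """Build the POST request without sending it."""
    body = json.dumps({"input": {"pr_url": pr_url}}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit_review(pr_url: str) -> str:
    """Send the request and return the raw response body as text."""
    request = build_review_request(pr_url)
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8")
```

Calling `submit_review("https://github.com/owner/repo/pull/123")` sends the same request as the curl command, assuming a server is listening on `localhost:8080`.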

The next post in this series goes deeper into what this same shift does to the standalone code-review business as a category.