08 / 25

Code Review in an Agentic World

Core Questions

  • Which reviews are automated?
  • What do humans still review?
  • How is drift prevented?

Human code review for agent PRs does not scale. If agents can produce 50 PRs a day, humans cannot review 50 PRs a day carefully. The answer is not to skip review. The answer is to redesign what gets reviewed, by whom, and how.

The review scaling problem

Traditional code review assumes humans write code at human speed. One developer, a few PRs per day. Reviewers can read every line, understand the context, catch bugs.

Agents break this assumption. They can produce more code, faster. If you try to maintain the same review process, you get:

  • Review backlog: PRs pile up faster than humans can review them.
  • Rubber stamping: Reviewers skim and approve to clear the queue.
  • Reviewer burnout: Humans spend all day reading agent code instead of writing their own.

The new model: tiered review

Not all code needs the same level of review. Tier your review process:

  • Automated review: Agents and tools check for common issues. No human needed unless something fails.
  • Lightweight human review: Human glances at the diff, checks the summary, approves if nothing looks wrong.
  • Deep human review: Human reads every line, understands the design, thinks about edge cases.

The tier depends on what changed, not who wrote it.

Questions to answer as a platform team:

  • What is your human review budget per day, in minutes?
  • What is the policy when agent output exceeds that budget?
  • Which PR types are allowed to auto-merge, and what is your rollback contract when you are wrong?
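One way to make the budget question concrete is to treat it as a hard gate rather than an aspiration. The sketch below is illustrative: the type names, the minutes-based estimate, and the first-fit policy are all assumptions, not a real tool.

```typescript
// Sketch: a daily review-budget gate. Names and the first-fit policy are
// illustrative assumptions; the point is that overflow needs a policy,
// not a longer day.
type PendingPR = { id: string; estimatedReviewMinutes: number };

type BudgetDecision = {
  reviewToday: PendingPR[]; // fits in today's human budget
  overflow: PendingPR[];    // exceeds budget; route by policy (defer, auto-tier, reject)
};

function applyReviewBudget(queue: PendingPR[], budgetMinutes: number): BudgetDecision {
  const reviewToday: PendingPR[] = [];
  const overflow: PendingPR[] = [];
  let used = 0;
  for (const pr of queue) {
    if (used + pr.estimatedReviewMinutes <= budgetMinutes) {
      reviewToday.push(pr);
      used += pr.estimatedReviewMinutes;
    } else {
      overflow.push(pr);
    }
  }
  return { reviewToday, overflow };
}
```

Whatever you do with `overflow` is the real policy decision: defer to tomorrow, drop to a lighter tier, or push back on agent throughput.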

What can be automated

Some review tasks are mechanical. They follow rules. They can be automated completely. Humans should not spend review time on these categories.

A useful rule of thumb is harsh: if static analysis can catch it, static analysis must catch it. Teach humans to trust the process, and teach your process to be trustworthy.

Automatable Review Tasks

Linting & formatting

ESLint, Prettier, gofmt. Either it passes or it doesn't. No judgment needed.

Type checking

TypeScript, mypy, type annotations. The compiler is a reviewer.

Test execution

Unit, integration, and end-to-end tests. The main question is whether the change is validated by a repeatable check.

Security scanning

Semgrep, CodeQL, dependency vulnerabilities. Known patterns, automated detection.

Architecture rules

Import restrictions, layer violations, dependency cycles. Enforced by tools like dependency-cruiser or ArchUnit.

API compatibility

Did the public API change? Breaking changes detected by schema diffing.

If these checks pass, a human reviewer should not re-check them by hand. Humans should focus on things automation cannot catch.
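That gate can be stated as a tiny routing rule: a failed mechanical check goes back to the author, and only a fully green PR reaches a human queue. A minimal sketch, with invented check names:

```typescript
// Sketch: aggregate mechanical check results before any human sees the PR.
// Check names are examples; wire in whatever your CI actually runs.
type CheckResult = { name: string; passed: boolean };

type GateOutcome = {
  status: "blocked" | "ready-for-judgment-review";
  failures: string[]; // failed checks go back to the author, not a reviewer
};

function mechanicalGate(results: CheckResult[]): GateOutcome {
  const failures = results.filter((r) => !r.passed).map((r) => r.name);
  return {
    status: failures.length > 0 ? "blocked" : "ready-for-judgment-review",
    failures,
  };
}
```

The useful property is that "blocked" never lands in a human review queue: humans only ever see PRs where the mechanical layer has nothing left to say.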

What humans still review

Some things require human judgment. These are the high-value review tasks:

Human Review Required

Design & architecture

Is this the right approach? Does it fit the system's patterns? Will it scale? These require understanding context that agents don't have.

Business logic correctness

Does this actually implement what was requested? Are the edge cases handled according to business rules? Only someone who understands the domain can judge.

Security-sensitive code

Authentication, authorization, cryptography, data handling. Even with security scanners, human review is essential for sensitive areas.

Novel patterns

First use of a new library, new architectural pattern, new integration. Humans decide if this is a good precedent.

User-facing copy

Error messages, help text, UI labels. Tone and clarity require human judgment.

Agent reviewers

Agents can review code too. Not just linting, actual code review. An agent reviewer can:

  • Summarize what the PR does
  • Flag potential issues
  • Check if tests cover the changes
  • Verify the PR matches the spec
  • Suggest improvements

You can run multiple reviewer bots with different jobs. Avoid using the same model, prompts, and tools for both author and reviewer roles.

A practical reviewer bot lineup

Examples of reviewer bots that map well to real needs:

  • Cursor Bugbot: Good at spotting likely bugs, missing edge cases, and suspicious logic.
  • slopcannon.dev: Good at detecting patterns, repetition, and drift that agents introduce over time.
  • Your architecture drift bot: Enforce your system boundaries. This is not about implementation details; it is about architecture. Write your own rules, and fail or flag PRs that violate them.

This is agents reviewing agents. It sounds recursive but it works. The reviewer agent has a different job than the author agent. It is looking for problems, not building features.
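One way to keep reviewer bots separate from author agents is a shared interface with independent implementations. The sketch below is a stand-in: the example bot only flags TODOs, where a real one would call a model or a static tool, but the shape of the lineup is the point.

```typescript
// Sketch: reviewer bots with separate jobs behind one interface.
// todoBot is a deliberately trivial stand-in for a real bug-finding bot.
type Finding = { bot: string; message: string };

interface ReviewerBot {
  name: string;
  review(diff: string): Finding[];
}

const todoBot: ReviewerBot = {
  name: "todo-bot",
  review: (diff) =>
    diff.includes("TODO")
      ? [{ bot: "todo-bot", message: "unresolved TODO in diff" }]
      : [],
};

function runReviewers(diff: string, bots: ReviewerBot[]): Finding[] {
  // Each bot runs independently; findings are merged into one PR comment.
  return bots.flatMap((b) => b.review(diff));
}
```

Because bots share nothing but the interface, the author agent and the reviewer bots can use different models, prompts, and tools, which is exactly the separation the section above recommends.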

Review tiers by change type

Define which tier of review applies based on what the PR changes:

Review Tier Matrix

  • Documentation only (README, comments, JSDoc): low risk. Auto-merge if CI passes.
  • Test changes only (new tests, test refactoring): improves safety. Agent review + auto-merge.
  • Standard code changes (bug fixes, features, refactoring in non-sensitive areas): agent review + lightweight human.
  • Sensitive areas (auth, payments, data handling, security configs): agent review + deep human.
  • Architecture changes (new patterns, API changes, infrastructure modifications): design review + multiple humans.

Implement this with CODEOWNERS and branch protection rules. Files in /docs have different rules than files in /src/auth.
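The matrix can also be encoded as a routing function, so tooling and humans agree on which tier a PR gets. The path patterns below are examples, not a standard; CODEOWNERS and branch protection should still enforce the same boundaries.

```typescript
// Sketch: map changed paths to a review tier, mirroring the matrix above.
// Patterns are illustrative; encode your own boundaries.
type Tier =
  | "auto-merge"
  | "agent"
  | "agent+lightweight-human"
  | "agent+deep-human"
  | "design-review";

const strictnessOrder: Tier[] = [
  "auto-merge",
  "agent",
  "agent+lightweight-human",
  "agent+deep-human",
  "design-review",
];

function tierForPath(path: string): Tier {
  if (/^docs\/|\.md$/.test(path)) return "auto-merge";
  if (/\.(test|spec)\.[jt]sx?$/.test(path)) return "agent";
  if (/^src\/(auth|payments)\//.test(path)) return "agent+deep-human";
  if (/^infra\//.test(path)) return "design-review";
  return "agent+lightweight-human";
}

function tierForPR(paths: string[]): Tier {
  // A PR gets the strictest tier any of its files requires.
  return paths.map(tierForPath).reduce<Tier>(
    (a, b) => (strictnessOrder.indexOf(b) > strictnessOrder.indexOf(a) ? b : a),
    "auto-merge"
  );
}
```

Note the "strictest wins" rule: one auth file in an otherwise docs-only PR is enough to pull the whole PR into deep human review.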

Humans still triage what they review

Many teams think they used to read every line. In practice, they never did. They already triaged their attention based on risk and evidence.

This matters more in an agent world because diffs get larger. The job of your system is to help the reviewer focus, not to force them to hunt.

Example: a login page redesign

Imagine a PR that includes 500 lines of CSS, three new JSX components, and one change to a Next.js server action that touches authentication.

A good review experience does not ask the human to read 500 lines of CSS. In most teams, CSS is reviewed by looking at the output, checking screenshots, and validating interaction flows. The backend change, the database query, and the auth boundary should get line level review.

Questions to push into your agent PR template:

  • Which files are high risk and require line level review?
  • Which files are better validated by runtime evidence (screenshots, traces, repro steps)?
  • What should the reviewer ignore because it is low risk and mechanically verified?
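The answers to those questions can ship with the PR as a per-file validation strategy. This sketch assumes three strategies and invents the path patterns; the real split belongs in your PR template or tooling.

```typescript
// Sketch: classify each changed file by how it should be validated,
// so line-level reading goes where it pays off. Patterns are examples.
type Strategy = "line-review" | "runtime-evidence" | "skip-mechanically-verified";

function strategyFor(path: string): Strategy {
  if (/^src\/auth\//.test(path)) return "line-review"; // auth boundary: read every line
  if (/\.(css|scss)$/.test(path)) return "runtime-evidence"; // screenshots, interaction flows
  if (/lock\.json$|\.snap$/.test(path)) return "skip-mechanically-verified"; // CI validates
  return "line-review"; // when unsure, default to reading
}
```

Applied to the login page example above: the 500 lines of CSS route to runtime evidence, the server action touching auth routes to line-level review, and the reviewer's time is spent accordingly.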

Make review a pipeline, not a wall of diffs

If work starts by assigning an issue to an agent, you can standardize the lifecycle and turn review into a set of checkpoints. The reviewer should arrive late, after evidence exists, not at the beginning.

A simple default is a status ladder that the agent must post back to the issue or PR:

  • Env ready: the agent has a working environment and can run the project.
  • Reproduced: if this is a bug, the agent reproduced it and attached evidence.
  • Fix implemented: changes are made, with a clear description of what and why.
  • Tests added: a repeatable check validates the change.
  • Ready for human: the PR is ready for the right tier of human review.

Consequence: review becomes a pipeline with gates. It stops being a wall of diffs that humans must interpret from scratch.
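The ladder is easy to enforce as an ordered state machine, assuming the agent posts each status back to the issue or PR. The status names below mirror the list above; everything else is a sketch.

```typescript
// Sketch: the status ladder as an ordered state machine. A PR cannot
// reach "ready-for-human" without passing the earlier evidence gates.
// ("reproduced" is shown unconditionally here; bug-only in practice.)
const ladder = [
  "env-ready",
  "reproduced",
  "fix-implemented",
  "tests-added",
  "ready-for-human",
] as const;

type Status = (typeof ladder)[number];

function canAdvance(current: Status | null, next: Status): boolean {
  // Each status is only valid immediately after the previous one.
  const nextIndex = current === null ? 0 : ladder.indexOf(current) + 1;
  return ladder.indexOf(next) === nextIndex;
}
```

A bot on the issue can reject any status comment that skips a rung, which is what turns the ladder from a convention into a gate.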

Preventing drift

When agents produce lots of code quickly, architectural drift happens fast. Small deviations compound. Suddenly your codebase has three different ways to do the same thing.

Anti-drift strategies

  • Architecture decision records (ADRs): Document decisions about patterns. Agents can read them. Reviewers can check compliance.
  • Pattern libraries: Instead of "don't do X," show "here's how we do X." Example code that agents can reference.
  • Automated pattern detection: Lint rules that catch known anti-patterns. Fail the build when patterns drift.
  • Periodic architecture review: Weekly or monthly human review of overall patterns. Catch drift before it compounds.
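Automated pattern detection can start very small. The checker below is a toy with invented rule names; in practice you would reach for ESLint custom rules or dependency-cruiser, but the shape (named rules, build fails on any match) is the same.

```typescript
// Sketch: a tiny drift check that fails when source uses a retired pattern.
// Both rules are invented examples, not real packages.
type DriftRule = { name: string; pattern: RegExp };

const rules: DriftRule[] = [
  { name: "no-legacy-http-client", pattern: /from ["']legacy-http["']/ },
  { name: "no-direct-db-in-ui", pattern: /from ["'].*\/db["']/ },
];

function driftViolations(source: string): string[] {
  // Return the names of every rule the source violates; an empty array
  // means the file is clean and the build may proceed.
  return rules.filter((r) => r.pattern.test(source)).map((r) => r.name);
}
```

Wiring `driftViolations` into CI as a failing check is what keeps "three different ways to do the same thing" from ever merging in the first place.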

What goes wrong

Rubber stamp reviews

Humans can't keep up. They approve without reading. Bad code ships. Bugs pile up. Trust in the agent erodes.

Review theater

Lots of comments, no substance. Nitpicking style while missing logic bugs. Reviews feel thorough but catch nothing important.

Agents reviewing themselves

Same agent writes and reviews. Blind spots are shared. Use different agents or configurations for author vs reviewer.

Over-automation

Everything auto-merges. Nobody looks at anything. Subtle bugs accumulate. Eventually something big breaks and nobody understands the code.

Summary

  • Tier your reviews: not all code needs the same level of scrutiny.
  • Automate mechanical checks with static analysis. Humans should not review lint, formatting, type errors, or scanner output.
  • Use specific reviewer bots with specific jobs: bug finding, pattern detection, and architecture drift.
  • Prevent drift with ADRs, pattern libraries, and automated enforcement.
