08 / 25
Code Review in an Agentic World
Core Questions
- Which reviews are automated?
- What do humans still review?
- How is drift prevented?
Human code review for agent PRs does not scale. If agents can produce 50 PRs a day, humans cannot review 50 PRs a day carefully. The answer is not to skip review. The answer is to redesign what gets reviewed, by whom, and how.
The review scaling problem
Traditional code review assumes humans write code at human speed. One developer, a few PRs per day. Reviewers can read every line, understand the context, catch bugs.
Agents break this assumption. They can produce more code, faster. If you try to maintain the same review process, you get:
- Review backlog: PRs pile up faster than humans can review them.
- Rubber stamping: Reviewers skim and approve to clear the queue.
- Reviewer burnout: Humans spend all day reading agent code instead of writing their own.
The new model: tiered review
Not all code needs the same level of review. Tier your review process:
- Automated review: Agents and tools check for common issues. No human needed unless something fails.
- Lightweight human review: Human glances at the diff, checks the summary, approves if nothing looks wrong.
- Deep human review: Human reads every line, understands the design, thinks about edge cases.
The tier depends on what changed, not who wrote it.
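As a sketch, the tier can be a pure function of the changed paths. The path patterns and tier names below are assumptions to adapt to your repository, not a standard:

```typescript
// Sketch: pick a review tier from the set of changed file paths.
// The path patterns and tier names are illustrative, not a standard.
type Tier = "automated" | "lightweight" | "deep";

function reviewTier(changedPaths: string[]): Tier {
  const sensitive = [/^src\/auth\//, /^src\/payments\//, /^\.github\/workflows\//];
  const docsOrTests = [/^docs\//, /\.md$/, /\.test\.[jt]sx?$/];

  if (changedPaths.some((p) => sensitive.some((re) => re.test(p)))) {
    return "deep"; // auth, payments, CI config: read every line
  }
  if (changedPaths.every((p) => docsOrTests.some((re) => re.test(p)))) {
    return "automated"; // docs and tests only: let the bots handle it
  }
  return "lightweight"; // standard code change: skim diff and summary
}
```

A function like this can run in CI and post the selected tier as a label, so reviewers know what is expected of them before opening the diff.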
Questions to answer as a platform team:
- What is your human review budget per day, in minutes?
- What happens, by policy, when agent output exceeds that budget?
- Which PR types are allowed to auto-merge, and what is your rollback contract when you are wrong?
What can be automated
Some review tasks are mechanical. They follow rules. They can be automated completely. Humans should not spend review time on these categories.
A useful rule of thumb is harsh: if static analysis can catch it, static analysis must catch it. Teach humans to trust the process, and teach your process to be trustworthy.
Automatable Review Tasks
Linting & formatting
ESLint, Prettier, gofmt. Either it passes or it doesn't. No judgment needed.
Type checking
TypeScript, mypy, type annotations. The compiler is a reviewer.
Test execution
Unit, integration, and end-to-end tests. The main question is whether the change is validated by a repeatable check.
Security scanning
Semgrep, CodeQL, dependency vulnerabilities. Known patterns, automated detection.
Architecture rules
Import restrictions, layer violations, dependency cycles. Enforced by tools like dependency-cruiser or ArchUnit.
API compatibility
Did the public API change? Breaking changes detected by schema diffing.
If these checks pass, a human reviewer should not re-check them by hand. Humans should focus on things automation cannot catch.
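One way to encode that policy is a small gate that routes failures back to the author (human or agent) and never turns a passing check into human work. A minimal sketch, with illustrative check names:

```typescript
// Sketch: a mechanical gate. Failing checks bounce back to the author;
// they never become human review work. Passing checks are trusted, and
// reviewers do not re-verify them by hand.
interface CheckResult {
  name: string; // e.g. "eslint", "tsc", "semgrep"
  passed: boolean;
}

function mechanicalGate(results: CheckResult[]): { ok: boolean; bounceBack: string[] } {
  const failed = results.filter((r) => !r.passed).map((r) => r.name);
  return { ok: failed.length === 0, bounceBack: failed };
}
```

The gate's only outputs are "proceed to human tiers" or "return to author with this list" — there is no path where a failed lint run lands in a reviewer's queue.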
What humans still review
Some things require human judgment. These are the high-value review tasks:
Human Review Required
Design & architecture
Is this the right approach? Does it fit the system's patterns? Will it scale? These require understanding context that agents don't have.
Business logic correctness
Does this actually implement what was requested? Are the edge cases handled according to business rules? Only someone who understands the domain can judge.
Security-sensitive code
Authentication, authorization, cryptography, data handling. Even with security scanners, human review is essential for sensitive areas.
Novel patterns
First use of a new library, new architectural pattern, new integration. Humans decide if this is a good precedent.
User-facing copy
Error messages, help text, UI labels. Tone and clarity require human judgment.
Agent reviewers
Agents can review code too. Not just linting, actual code review. An agent reviewer can:
- Summarize what the PR does
- Flag potential issues
- Check if tests cover the changes
- Verify the PR matches the spec
- Suggest improvements
You can run multiple reviewer bots with different jobs. Avoid using the same model, prompts, and tools for both author and reviewer roles.
A practical reviewer bot lineup
Examples of reviewer bots that map well to real needs:
- Cursor Bugbot: Good at spotting likely bugs, missing edge cases, and suspicious logic.
- slopcannon.dev: Good at detecting patterns, repetition, and drift that agents introduce over time.
- Your architecture drift bot: Enforce your system boundaries. This is not about implementation details; it is about architecture. Write your own rules, and fail or flag PRs that violate them.
This is agents reviewing agents. It sounds recursive but it works. The reviewer agent has a different job than the author agent. It is looking for problems, not building features.
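A minimal sketch of this separation of roles, assuming a generic `callModel` placeholder for whatever LLM client you use (the role names and instructions are illustrative):

```typescript
// Sketch: distinct reviewer roles with distinct instructions.
// `callModel` is a placeholder for your LLM client, not a real API.
type ReviewerRole = "bug-hunter" | "pattern-detector" | "architecture-drift";

const roleInstructions: Record<ReviewerRole, string> = {
  "bug-hunter": "Find likely bugs, missing edge cases, and suspicious logic.",
  "pattern-detector": "Flag repetition, copy-paste drift, and inconsistent patterns.",
  "architecture-drift": "Flag imports or dependencies that cross system boundaries.",
};

async function reviewPR(
  diff: string,
  callModel: (prompt: string) => Promise<string>
): Promise<Record<ReviewerRole, string>> {
  const entries = await Promise.all(
    (Object.keys(roleInstructions) as ReviewerRole[]).map(async (role) => {
      // Each role gets its own instructions; ideally each also uses a
      // different model or config than the author agent, so blind spots
      // are not shared between writer and reviewer.
      const comment = await callModel(`${roleInstructions[role]}\n\nDiff:\n${diff}`);
      return [role, comment] as const;
    })
  );
  return Object.fromEntries(entries) as Record<ReviewerRole, string>;
}
```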
Review tiers by change type
Define which tier of review applies based on what the PR changes:
Review Tier Matrix
Documentation only
README, comments, JSDoc. Low risk.
Test changes only
New tests, test refactoring. Improves safety.
Standard code changes
Bug fixes, features, refactoring in non-sensitive areas.
Sensitive areas
Auth, payments, data handling, security configs.
Architecture changes
New patterns, API changes, infrastructure modifications.
Implement this with CODEOWNERS and branch protection rules. Files in /docs have different rules than files in /src/auth.
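A CODEOWNERS fragment sketching that split (the team names are placeholders):

```
# Sketch: ownership by risk area. Team names are illustrative.
/docs/      @docs-team
/src/       @platform-team
/src/auth/  @security-team @platform-team
```

Combined with branch protection requiring owner approval, this means a docs-only PR never pings the security team, while anything touching /src/auth cannot merge without their review.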
Humans still triage what they review
Many teams think they used to read every line. In practice, they never did. They already triaged their attention based on risk and evidence.
This matters more in an agent world because diffs get larger. The job of your system is to help the reviewer focus, not to force them to hunt.
Example: a login page redesign
Imagine a PR that includes 500 lines of CSS, three new JSX components, and one change to a Next.js server action that touches authentication.
A good review experience does not ask the human to read 500 lines of CSS. In most teams, CSS is reviewed by looking at the output, checking screenshots, and validating interaction flows. The backend change, the database query, and the auth boundary should get line level review.
Questions to push into your agent PR template:
- Which files are high risk and require line level review?
- Which files are better validated by runtime evidence (screenshots, traces, repro steps)?
- What should the reviewer ignore because it is low risk and mechanically verified?
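The triage above can be sketched as a per-file routing function. The path patterns are assumptions based on the login-page example, not a general taxonomy:

```typescript
// Sketch: route each changed file to a review mode. Patterns follow
// the login-page example above and are illustrative only.
type ReviewMode = "line-review" | "runtime-evidence" | "mechanical-only";

function reviewMode(path: string): ReviewMode {
  if (/^src\/auth\//.test(path)) {
    return "line-review"; // auth boundary: read every line
  }
  if (/\.(css|scss)$/.test(path) || /\.[jt]sx$/.test(path)) {
    return "runtime-evidence"; // validate via screenshots and flows
  }
  return "mechanical-only"; // trust lint, types, and tests
}
```

A PR bot can use this to annotate the diff: which files need eyes on every line, which need a screenshot attached, and which the reviewer can skip.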
Make review a pipeline, not a wall of diffs
If work starts by assigning an issue to an agent, you can standardize the lifecycle and turn review into a set of checkpoints. The reviewer should arrive late, after evidence exists, not at the beginning.
A simple default is a status ladder that the agent must post back to the issue or PR:
- Env ready: the agent has a working environment and can run the project.
- Reproduced: if this is a bug, the agent reproduced it and attached evidence.
- Fix implemented: changes are made, with a clear description of what and why.
- Tests added: a repeatable check validates the change.
- Ready for human: the PR is ready for the right tier of human review.
Consequence: review becomes a pipeline with gates. It stops being a wall of diffs that humans must interpret from scratch.
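The ladder can be encoded as an ordered status list that an agent climbs one rung at a time, posting evidence at each step. A minimal sketch, using the status names above:

```typescript
// Sketch: the status ladder as an ordered list. An agent may only
// advance one step at a time, and each step requires posted evidence.
const ladder = [
  "env-ready",
  "reproduced",
  "fix-implemented",
  "tests-added",
  "ready-for-human",
] as const;

type Status = (typeof ladder)[number];

function canAdvance(current: Status, next: Status): boolean {
  return ladder.indexOf(next) === ladder.indexOf(current) + 1;
}
```

Enforcing this in the bot that posts status updates means a PR can never jump straight to "ready for human" without the intermediate evidence existing.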
Preventing drift
When agents produce lots of code quickly, architectural drift happens fast. Small deviations compound. Suddenly your codebase has three different ways to do the same thing.
Anti-drift strategies
What goes wrong
Rubber stamp reviews
Humans can't keep up. They approve without reading. Bad code ships. Bugs pile up. Trust in the agent erodes.
Review theater
Lots of comments, no substance. Nitpicking style while missing logic bugs. Reviews feel thorough but catch nothing important.
Agents reviewing themselves
Same agent writes and reviews. Blind spots are shared. Use different agents or configurations for author vs reviewer.
Over-automation
Everything auto-merges. Nobody looks at anything. Subtle bugs accumulate. Eventually something big breaks and nobody understands the code.
Summary
- Tier your reviews: not all code needs the same level of scrutiny.
- Automate mechanical checks with static analysis. Humans should not review lint, formatting, type errors, or scanner output.
- Use specific reviewer bots with specific jobs: bug finding, pattern detection, and architecture drift.
- Prevent drift with ADRs, pattern libraries, and automated enforcement.