The Agentic Build Loop
Core Questions
- How do you use agents to build agents?
- What does the meta-loop look like?
- How do you avoid infinite recursion in agentic development?
At some point, you'll use agents to build agents. Your code generation agent writes the code review agent. Your eval pipeline is itself maintained by agents. This is the meta-loop — turtles all the way down. It's powerful and dangerous. The same properties that make agents useful (autonomy, scale) make them risky when pointed at themselves.
The meta-loop
Traditional software development has a clear separation: humans write code, machines run code. Agentic development blurs this. Agents write code too. And when the code they write is other agents, you have recursion.
Layers of the Meta-Loop
Layer 0: Your application code
The actual product you're building. Written by humans, agents, or both.
Layer 1: Agents that write application code
Code generation agents, refactoring agents, test-writing agents. They operate on Layer 0.
Layer 2: Agents that build/maintain Layer 1 agents
Agents that improve prompts, tune parameters, add capabilities to your code generation agents.
Layer 3: Infrastructure agents
Agents managing the orchestration, monitoring, and evaluation systems that all other agents depend on.
Each layer needs its own guardrails
An agent at Layer 2 can modify agents at Layer 1. If Layer 2 has a bug or is compromised, it can break everything below it. Higher layers need stronger controls, not weaker ones.
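A lightweight way to encode this is a per-layer policy that says what each layer of agents may touch and what approval its changes require. The sketch below is illustrative Python; the layer numbers follow the list above, but the fields and approval labels are hypothetical, not a prescribed schema.
# orchestration/layer_guardrails.py -- minimal sketch of layer-specific controls
# Layer numbers follow the list above; the permission fields are illustrative.
LAYER_POLICY = {
    1: {"may_modify": ["layer-0 application code"],
        "approval": "standard code review"},
    2: {"may_modify": ["layer-1 prompts", "layer-1 configs"],
        "approval": "human approval + eval gate"},
    3: {"may_modify": ["orchestration", "monitoring", "eval infra"],
        "approval": "two-person approval + staged rollout"},
}

def allowed(layer: int, target: str) -> bool:
    """Return True only if this layer's policy explicitly permits the target."""
    return target in LAYER_POLICY.get(layer, {}).get("may_modify", [])
Note that the controls get stricter as the layer number rises, mirroring the point above: the higher the layer, the more damage a bad change can do.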
Bootstrapping agent tooling
You need agents before you can use agents to build agents. The bootstrap problem: what builds the first generation?
Bootstrap Strategies
Human-written v0
Write the first generation of agents by hand. Simple prompts, basic tooling. These bootstrap agents don't need to be good — they need to be good enough to iterate on.
Copy from open source
Fork existing agent implementations and modify them for your use case. Standing on the shoulders of giants accelerates bootstrapping.
Vendor tooling as foundation
Use Claude, GPT, or other foundation models directly. Layer your customizations on top. You don't need to build the model.
Progressive autonomy
Start with humans doing most work, agents assisting. Gradually shift responsibility as agents prove reliable. Never flip the switch all at once.
# Bootstrap sequence example
# Week 1: Human writes v0 code generation agent
agents/
  codegen/
    v0/
      prompt.md    # Hand-written prompt
      tools.json   # Basic tool definitions
      config.yaml  # Manual configuration

# Week 2: v0 agent helps write v1 agent
# Human reviews all changes, agent does heavy lifting
agents/
  codegen/
    v1/
      prompt.md     # Improved by v0 + human review
      tools.json
      config.yaml
      CHANGELOG.md  # Track what changed and why

# Week 4: v1 agent maintains itself with human oversight
# Agent proposes changes, human approves
agents/
  codegen/
    v2/
      ...
    meta/
      improve-prompt.md  # Agent that improves codegen prompts
      eval-results/      # Data driving improvements

Eval-driven agent development
You can't improve what you can't measure. Agent development requires evaluation suites — test cases that measure whether agents are getting better or worse.
Eval Components
Golden datasets
Curated input/output pairs. Given this prompt and context, the agent should produce something like this. Human-verified ground truth.
Behavioral tests
Does the agent use the right tools? Does it ask for clarification when appropriate? Does it respect boundaries? Observable behaviors, not just outputs.
Regression detection
Does a change to the agent break something that was working? Run the full eval suite on every change. No silent regressions.
A/B comparisons
Run the old agent and new agent on the same inputs. Which produces better results? Human judges or automated metrics.
# evals/codegen/suite.yaml
name: Code Generation Eval Suite
version: 2.3
cases:
  - id: basic-function
    input: "Write a function that reverses a string"
    expected_behavior:
      - uses_no_external_deps: true
      - handles_empty_input: true
      - has_type_annotations: true
    golden_output_similarity: 0.8
  - id: refactor-extract
    input: "Extract the validation logic into a separate function"
    context_file: fixtures/messy-handler.ts
    expected_behavior:
      - creates_new_function: true
      - original_still_works: true
      - no_behavior_change: true
  - id: ambiguous-request
    input: "Make it better"
    expected_behavior:
      - asks_for_clarification: true
      - does_not_guess: true
metrics:
  - name: pass_rate
    threshold: 0.95
  - name: avg_latency_ms
    threshold: 5000
  - name: cost_per_task_usd
    threshold: 0.50

Principle
No agent change ships without eval results
Every PR that modifies an agent must include eval results. Did the change improve the metrics? Did it regress anything? If you can't show improvement, don't ship.
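One way to enforce this is a small CI gate that compares the PR's eval results against the last shipped baseline and fails on any regression. The sketch below assumes hypothetical result files (baseline.json, candidate.json) keyed by the metric names from the suite above; adapt the paths and conventions to your own pipeline.
# ci/check_eval_gate.py -- minimal sketch of an eval gate for agent PRs
import json
import sys

# Direction of improvement per metric (assumed convention matching the suite above).
HIGHER_IS_BETTER = {"pass_rate"}
LOWER_IS_BETTER = {"avg_latency_ms", "cost_per_task_usd"}

def load(path: str) -> dict:
    with open(path) as f:
        return json.load(f)

def main() -> int:
    baseline = load("evals/codegen/baseline.json")    # last shipped version's results
    candidate = load("evals/codegen/candidate.json")  # results attached to this PR

    failures = []
    for name, base_value in baseline.items():
        new_value = candidate.get(name)
        if new_value is None:
            failures.append(f"{name}: missing from candidate results")
        elif name in HIGHER_IS_BETTER and new_value < base_value:
            failures.append(f"{name}: regressed {base_value} -> {new_value}")
        elif name in LOWER_IS_BETTER and new_value > base_value:
            failures.append(f"{name}: regressed {base_value} -> {new_value}")

    if failures:
        print("Eval gate failed:\n  " + "\n  ".join(failures))
        return 1
    print("Eval gate passed: no metric regressed against baseline.")
    return 0

if __name__ == "__main__":
    sys.exit(main())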
Self-improving pipelines
The dream: agents that improve themselves. The reality: this is where things get dangerous. Self-improvement loops need careful constraints.
Safe Self-Improvement Patterns
Propose-review-apply
Agent proposes improvements. Human reviews. Human applies. The agent never directly modifies itself — it submits PRs like any other contributor.
Eval-gated deployment
Self-improvements only deploy if they pass the eval suite. The eval suite is human-maintained and not modifiable by the agent being evaluated.
Bounded improvement scope
The agent can improve its prompts but not its tools. Or it can adjust parameters but not architecture. Constrain what's self-modifiable.
Rollback on regression
Automated rollback if metrics drop below threshold. The agent can't make itself worse and stay that way — regression triggers revert.
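The sketch below combines eval-gated deployment with rollback on regression. The run_evals and deploy callables stand in for your own eval runner and deployment mechanism; the names, paths, and threshold are illustrative.
# meta/apply_improvement.py -- minimal sketch of eval-gated deployment with rollback
# run_evals and deploy are supplied by your own tooling; the names here are illustrative.
import shutil
from pathlib import Path
from typing import Callable

PASS_RATE_FLOOR = 0.95  # matches the pass_rate threshold in the eval suite above

def apply_candidate(
    current: Path,                       # directory of the live, human-approved config
    candidate: Path,                     # agent-proposed config that already passed human review
    run_evals: Callable[[Path], float],  # runs the human-maintained eval suite, returns pass rate
    deploy: Callable[[Path], None],      # points the runtime at a config directory
) -> bool:
    baseline_score = run_evals(current)
    candidate_score = run_evals(candidate)

    # Eval-gated deployment: the candidate must beat the baseline and clear the floor.
    if candidate_score < max(baseline_score, PASS_RATE_FLOOR):
        print(f"Rejected: candidate {candidate_score:.2%} vs baseline {baseline_score:.2%}")
        return False

    backup = current.parent / (current.name + ".backup")
    shutil.copytree(current, backup, dirs_exist_ok=True)  # keep a rollback point
    deploy(candidate)

    # Rollback on regression: re-check after deployment and revert if metrics drop.
    if run_evals(candidate) < PASS_RATE_FLOOR:
        deploy(backup)
        print("Post-deploy regression detected; rolled back to the previous version.")
        return False
    return True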
What you should never do
- Let an agent modify its own eval suite
- Let an agent approve its own changes
- Let an agent modify the guardrails that constrain it
- Deploy self-improvements without human review
- Run self-improvement loops without rate limits
Dogfooding your agent infra
The best way to find problems with your agent infrastructure is to use it yourself. Dogfooding — using your own tools for real work — surfaces issues that synthetic tests miss.
Dogfooding Levels
Level 1: Use for internal tools
Build internal tooling with your agents. Low stakes, fast feedback. If the agent breaks a script, you fix it and learn.
Level 2: Use for agent development itself
Use your code generation agent to improve your code generation agent. Recursive dogfooding. Finds edge cases you didn't anticipate.
Level 3: Use for production features
Agent-written code ships to customers. Highest stakes, maximum learning. Only do this with strong review processes in place.
The goal is to feel the pain. If your agents are slow, you'll feel it. If they produce bad code, you'll review it. Dogfooding creates natural feedback pressure to improve.
Versioning agent behavior
Agent behavior changes when you update prompts, tools, or models. You need to track these changes like you track code changes — with versions, changelogs, and the ability to roll back.
# agents/codegen/CHANGELOG.md
## [2.1.0] - 2024-01-15
### Changed
- Updated system prompt to emphasize test coverage
- Switched from GPT-4 to Claude 3.5 Sonnet for better reasoning
### Eval Results
- Pass rate: 94% -> 96%
- Avg latency: 4200ms -> 3100ms
- Cost per task: $0.42 -> $0.38
## [2.0.0] - 2024-01-08
### Breaking
- New tool schema (v2) - requires updated tool definitions
### Added
- Support for multi-file edits
- Context window management for large codebases
### Eval Results
- Pass rate: 89% -> 94%
- Now handles files >500 lines
## [1.5.2] - 2024-01-02
### Fixed
- Regression in TypeScript type inference
- Timeout handling for slow model responses
### Eval Results
- Pass rate: 91% -> 89% (known regression, fixing in 2.0)

What to Version
Prompts
System prompts, user prompt templates, few-shot examples. Small changes can have large behavioral effects.
Tool definitions
What tools are available, their schemas, their implementations. Tool changes affect what the agent can do.
Model configuration
Which model, what temperature, what context window. Model changes affect capabilities and costs.
Eval results
Tie eval results to versions. You should be able to see exactly how each version performed on the test suite.
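One way to keep these tied together is a version manifest that pins the model, records a content hash of every behavior-affecting file, and points at the eval results for that version. The script below is a minimal sketch; the file names and manifest layout are illustrative, not a required format.
# agents/codegen/build_manifest.py -- minimal sketch of pinning an agent version
# The manifest layout and file names are illustrative, not a required format.
import hashlib
import json
from pathlib import Path

VERSION_DIR = Path("agents/codegen/v2")
TRACKED = ["prompt.md", "tools.json", "config.yaml"]  # everything that changes behavior

def sha256(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def build_manifest(version: str) -> dict:
    return {
        "version": version,
        "model": "claude-3-5-sonnet",  # pin the model explicitly
        "files": {name: sha256(VERSION_DIR / name) for name in TRACKED},
        "eval_results": "evals/codegen/results/v2.json",  # tie evals to the version
    }

if __name__ == "__main__":
    manifest = build_manifest("2.1.0")
    (VERSION_DIR / "manifest.json").write_text(json.dumps(manifest, indent=2))
Because the manifest stores hashes rather than the files themselves, any in-place edit to a pinned version is detectable, which is exactly what "immutable versions" requires.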
Avoiding infinite recursion
When agents build agents, you can end up in loops. Agent A improves Agent B, which triggers Agent C, which needs Agent A... Cycle detection and termination conditions are essential.
Recursion Safeguards
Depth limits
Maximum nesting depth for agent calls. Agent A can call Agent B, which can call Agent C, but no deeper. Configurable per use case.
Call graphs
Track which agents called which. Detect cycles. If Agent A is already in the call stack, don't let it be called again.
Budget limits
Total token budget across all agents in a workflow. Once exhausted, everything stops. Prevents runaway costs from loops.
Convergence detection
If an agent produces the same output twice, stop. The loop isn't making progress. Useful for iterative improvement workflows.
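Here is a minimal sketch of these safeguards as a guard object an orchestrator could consult before and after each agent call. The class and method names are illustrative, not a specific framework's API.
# orchestrator/safeguards.py -- minimal sketch of depth, cycle, budget, and convergence checks
class RecursionGuard:
    def __init__(self, max_depth: int = 3, token_budget: int = 200_000):
        self.max_depth = max_depth
        self.tokens_remaining = token_budget
        self.call_stack: list[str] = []
        self.last_output: dict[str, str] = {}

    def enter(self, agent: str) -> None:
        # Call-graph check: a cycle exists if this agent is already on the stack.
        if agent in self.call_stack:
            raise RuntimeError(f"Cycle detected: {' -> '.join(self.call_stack)} -> {agent}")
        # Depth limit: refuse nesting beyond the configured maximum.
        if len(self.call_stack) >= self.max_depth:
            raise RuntimeError(f"Depth limit {self.max_depth} exceeded when calling {agent}")
        self.call_stack.append(agent)

    def exit(self, agent: str) -> None:
        assert self.call_stack and self.call_stack[-1] == agent
        self.call_stack.pop()

    def charge(self, tokens_used: int) -> None:
        # Budget limit: once the shared token budget is exhausted, everything stops.
        self.tokens_remaining -= tokens_used
        if self.tokens_remaining <= 0:
            raise RuntimeError("Token budget exhausted; halting all agents in this workflow")

    def check_progress(self, agent: str, output: str) -> None:
        # Convergence detection: identical output twice in a row means no progress.
        if self.last_output.get(agent) == output:
            raise RuntimeError(f"{agent} produced the same output twice; stopping the loop")
        self.last_output[agent] = output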
What goes wrong
Agent optimizes the wrong metric
Self-improving agent finds a way to game the eval. Metrics look great, actual quality is terrible. Goodhart's law in action.
Fix: Multiple diverse metrics, not one score. Include human evaluation samples. Watch for metric/reality divergence.
Cascading failures
Bug in Layer 2 agent corrupts Layer 1 agents, which produce broken application code. By the time you notice, there's damage everywhere.
Fix: Staged deployments. Layer changes propagate slowly. Monitoring at each layer catches problems before they cascade.
Context dilution
As agents iterate on themselves, prompts grow. Context windows fill with accumulated cruft. Performance degrades gradually.
Fix: Periodic human review of agent configurations. Prune accumulated complexity. Treat prompt size as technical debt.
Lost reproducibility
Agents modify their own behavior. You can't recreate last week's version because it was overwritten. No rollback possible.
Fix: All agent configs in version control. Immutable versions. Explicit deployment process. Never allow in-place modification.
Summary
- The meta-loop (agents building agents) is powerful but needs layer-specific guardrails
- Bootstrap with human-written v0, then progressively increase agent autonomy
- No agent change ships without eval results — measure improvement or don't deploy
- Self-improvement needs strict constraints: agents can't modify their own evals or guardrails
- Version everything — prompts, tools, configs, eval results. Rollback must be possible.