The Agentic Build Loop

Core Questions

  • How do you use agents to build agents?
  • What does the meta-loop look like?
  • How do you avoid infinite recursion in agentic development?

At some point, you'll use agents to build agents. Your code generation agent writes the code review agent. Your eval pipeline is itself maintained by agents. This is the meta-loop — turtles all the way down. It's powerful and dangerous. The same properties that make agents useful (autonomy, scale) make them risky when pointed at themselves.

The meta-loop

Traditional software development has a clear separation: humans write code, machines run code. Agentic development blurs this. Agents write code too. And when the code they write is other agents, you have recursion.

Layers of the Meta-Loop

Layer 0: Your application code

The actual product you're building. Written by humans, agents, or both.

Layer 1: Agents that write application code

Code generation agents, refactoring agents, test-writing agents. They operate on Layer 0.

Layer 2: Agents that build/maintain Layer 1 agents

Agents that improve prompts, tune parameters, add capabilities to your code generation agents.

Layer 3: Infrastructure agents

Agents managing the orchestration, monitoring, and evaluation systems that all other agents depend on.

Each layer needs its own guardrails

An agent at Layer 2 can modify agents at Layer 1. If Layer 2 has a bug or is compromised, it can break everything below it. Higher layers need stronger controls, not weaker ones.
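For illustration, here is a minimal sketch of what layer-specific controls could look like in code. The layer numbers match the list above; the policy fields and limits are assumptions for the example, not a prescribed schema.

# guardrails/policies.py - sketch of per-layer controls (fields and values are
# illustrative assumptions, not a standard)
from dataclasses import dataclass

@dataclass(frozen=True)
class LayerPolicy:
    layer: int                           # 1 = writes app code, 2 = maintains Layer 1, 3 = infrastructure
    may_modify_layers: tuple[int, ...]   # which lower layers this layer's agents may change
    max_changes_per_day: int             # rate limit on automated changes

POLICIES = {
    1: LayerPolicy(layer=1, may_modify_layers=(0,), max_changes_per_day=50),
    2: LayerPolicy(layer=2, may_modify_layers=(1,), max_changes_per_day=5),
    3: LayerPolicy(layer=3, may_modify_layers=(1, 2), max_changes_per_day=2),
}

def change_allowed(actor_layer: int, target_layer: int) -> bool:
    """Stronger controls at higher layers: a change is allowed only if the
    actor's policy explicitly lists the target layer."""
    return target_layer in POLICIES[actor_layer].may_modify_layers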

Bootstrapping agent tooling

You need agents before you can use agents to build agents. The bootstrap problem: what builds the first generation?

Bootstrap Strategies

Human-written v0

Write the first generation of agents by hand. Simple prompts, basic tooling. These bootstrap agents don't need to be good — they need to be good enough to iterate on.

Copy from open source

Fork existing agent implementations and modify them for your use case. Standing on the shoulders of giants accelerates bootstrapping.

Vendor tooling as foundation

Use Claude, GPT, or other foundation models directly. Layer your customizations on top. You don't need to build the model.

Progressive autonomy

Start with humans doing most work, agents assisting. Gradually shift responsibility as agents prove reliable. Never flip the switch all at once.

# Bootstrap sequence example

# Week 1: Human writes v0 code generation agent
agents/
  codegen/
    v0/
      prompt.md        # Hand-written prompt
      tools.json       # Basic tool definitions
      config.yaml      # Manual configuration

# Week 2: v0 agent helps write v1 agent
# Human reviews all changes, agent does heavy lifting
agents/
  codegen/
    v1/
      prompt.md        # Improved by v0 + human review
      tools.json
      config.yaml
      CHANGELOG.md     # Track what changed and why

# Week 4: v1 agent maintains itself with human oversight
# Agent proposes changes, human approves
agents/
  codegen/
    v2/
      ...
    meta/
      improve-prompt.md  # Agent that improves codegen prompts
      eval-results/      # Data driving improvements

Eval-driven agent development

You can't improve what you can't measure. Agent development requires evaluation suites — test cases that measure whether agents are getting better or worse.

Eval Components

Golden datasets

Curated input/output pairs. Given this prompt and context, the agent should produce something like this. Human-verified ground truth.

Behavioral tests

Does the agent use the right tools? Does it ask for clarification when appropriate? Does it respect boundaries? Observable behaviors, not just outputs.

Regression detection

Does a change to the agent break something that was working? Run the full eval suite on every change. No silent regressions.

A/B comparisons

Run the old agent and new agent on the same inputs. Which produces better results? Human judges or automated metrics.

# evals/codegen/suite.yaml
name: Code Generation Eval Suite
version: 2.3

cases:
  - id: basic-function
    input: "Write a function that reverses a string"
    expected_behavior:
      - uses_no_external_deps: true
      - handles_empty_input: true
      - has_type_annotations: true
    golden_output_similarity: 0.8
    
  - id: refactor-extract
    input: "Extract the validation logic into a separate function"
    context_file: fixtures/messy-handler.ts
    expected_behavior:
      - creates_new_function: true
      - original_still_works: true
      - no_behavior_change: true
    
  - id: ambiguous-request
    input: "Make it better"
    expected_behavior:
      - asks_for_clarification: true
      - does_not_guess: true

metrics:
  - name: pass_rate
    threshold: 0.95
  - name: avg_latency_ms  
    threshold: 5000
  - name: cost_per_task_usd
    threshold: 0.50
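A minimal runner for a suite in this shape might look like the sketch below, assuming PyYAML and a run_agent hook that returns the agent's output plus a trace of observable behaviors. Similarity scoring against golden outputs is omitted.

# evals/run_suite.py - sketch of a runner for suite.yaml; run_agent and the
# behavior trace format are placeholders for your own harness
import sys
import yaml  # PyYAML

def run_agent(prompt: str, context_file: str | None = None) -> dict:
    """Placeholder: invoke the codegen agent and return its output plus a trace
    of observed behaviors, e.g. {"behaviors": {"asks_for_clarification": True}}."""
    raise NotImplementedError

def main(path: str = "evals/codegen/suite.yaml") -> int:
    with open(path) as f:
        suite = yaml.safe_load(f)

    passed = 0
    for case in suite["cases"]:
        trace = run_agent(case["input"], case.get("context_file"))
        expectations = {}
        for item in case.get("expected_behavior", []):
            expectations.update(item)  # YAML list of single-key mappings
        ok = all(trace.get("behaviors", {}).get(name, False) == want
                 for name, want in expectations.items())
        passed += ok
        print(f"{case['id']}: {'PASS' if ok else 'FAIL'}")

    pass_rate = passed / len(suite["cases"])
    threshold = next(m["threshold"] for m in suite["metrics"] if m["name"] == "pass_rate")
    print(f"pass_rate={pass_rate:.2f} (threshold {threshold})")
    return 0 if pass_rate >= threshold else 1

if __name__ == "__main__":
    sys.exit(main())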

Principle

No agent change ships without eval results

Every PR that modifies an agent must include eval results. Did the change improve the metrics? Did it regress anything? If you can't show improvement, don't ship.
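In practice this can be enforced in CI: store the previous version's metrics alongside the CHANGELOG and fail the build on any regression. A sketch, assuming simple JSON files mapping metric name to value (the file names are assumptions, not a convention from this guide):

# evals/gate.py - sketch: fail a PR if any tracked metric regressed vs. baseline
# (baseline.json / current.json are hypothetical artifacts of your eval run)
import json
import pathlib
import sys

# Metrics where lower is better; everything else is treated as higher-is-better.
LOWER_IS_BETTER = {"avg_latency_ms", "cost_per_task_usd"}

def regressed(name: str, old: float, new: float) -> bool:
    return new > old if name in LOWER_IS_BETTER else new < old

def main(baseline_path: str, current_path: str) -> int:
    baseline = json.loads(pathlib.Path(baseline_path).read_text())
    current = json.loads(pathlib.Path(current_path).read_text())
    failures = [n for n, old in baseline.items() if regressed(n, old, current[n])]
    for n in failures:
        print(f"REGRESSION {n}: {baseline[n]} -> {current[n]}")
    return 1 if failures else 0

if __name__ == "__main__":
    sys.exit(main(sys.argv[1], sys.argv[2]))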

Self-improving pipelines

The dream: agents that improve themselves. The reality: this is where things get dangerous. Self-improvement loops need careful constraints.

Safe Self-Improvement Patterns

Propose-review-apply

Agent proposes improvements. Human reviews. Human applies. The agent never directly modifies itself — it submits PRs like any other contributor.

Eval-gated deployment

Self-improvements only deploy if they pass the eval suite. The eval suite is human-maintained and not modifiable by the agent being evaluated.

Bounded improvement scope

The agent can improve its prompts but not its tools. Or it can adjust parameters but not architecture. Constrain what's self-modifiable.

Rollback on regression

Automated rollback if metrics drop below threshold. The agent can't make itself worse and stay that way — regression triggers revert.
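Put together, a propose-review-apply loop with eval gating and automatic rollback might be sketched as below. Every function is a placeholder for your own agent, eval, and deployment hooks; this is a sketch of the control flow, not a definitive implementation.

# meta/improve_codegen.py - sketch of a constrained self-improvement step:
# the meta-agent only proposes; evals gate deployment; regression triggers revert.
# All functions below are placeholders for your own tooling.

def propose_prompt_change(current_prompt: str) -> str: ...
def open_pull_request(candidate_prompt: str) -> str: ...  # returns a PR URL for human review
def run_eval_suite(prompt: str) -> dict: ...               # returns {"pass_rate": ..., ...}
def rollback(to_version: str) -> None: ...

PASS_RATE_THRESHOLD = 0.95

def improvement_step(current_prompt: str) -> None:
    candidate = propose_prompt_change(current_prompt)

    # Gate 1: the candidate must clear the eval threshold before it is even reviewable.
    results = run_eval_suite(candidate)
    if results["pass_rate"] < PASS_RATE_THRESHOLD:
        print("Candidate rejected by eval gate; nothing proposed.")
        return

    # Gate 2: human review. The agent never applies its own change.
    pr_url = open_pull_request(candidate)
    print(f"Proposed for human review: {pr_url}")

def post_deploy_check(new_version: str, previous_version: str, live_pass_rate: float) -> None:
    # Gate 3: automated rollback if live metrics drop below threshold.
    if live_pass_rate < PASS_RATE_THRESHOLD:
        rollback(to_version=previous_version)
        print(f"Regression detected; reverted {new_version} -> {previous_version}")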

What you should never do

  • Let an agent modify its own eval suite
  • Let an agent approve its own changes
  • Let an agent modify the guardrails that constrain it
  • Deploy self-improvements without human review
  • Run self-improvement loops without rate limits

Dogfooding your agent infra

The best way to find problems with your agent infrastructure is to use it yourself. Dogfooding — using your own tools for real work — surfaces issues that synthetic tests miss.

Dogfooding Levels

Level 1: Use for internal tools

Build internal tooling with your agents. Low stakes, fast feedback. If the agent breaks a script, you fix it and learn.

Level 2: Use for agent development itself

Use your code generation agent to improve your code generation agent. Recursive dogfooding. Finds edge cases you didn't anticipate.

Level 3: Use for production features

Agent-written code ships to customers. Highest stakes, maximum learning. Only do this with strong review processes in place.

The goal is to feel the pain. If your agents are slow, you'll feel it. If they produce bad code, you'll review it. Dogfooding creates natural feedback pressure to improve.

Versioning agent behavior

Agent behavior changes when you update prompts, tools, or models. You need to track these changes like you track code changes — with versions, changelogs, and the ability to roll back.

# agents/codegen/CHANGELOG.md

## [2.1.0] - 2024-01-15
### Changed
- Updated system prompt to emphasize test coverage
- Switched from GPT-4 to Claude 3.5 Sonnet for better reasoning
### Eval Results
- Pass rate: 94% -> 96%
- Avg latency: 4200ms -> 3100ms
- Cost per task: $0.42 -> $0.38

## [2.0.0] - 2024-01-08
### Breaking
- New tool schema (v2) - requires updated tool definitions
### Added
- Support for multi-file edits
- Context window management for large codebases
### Eval Results
- Pass rate: 89% -> 94%
- Now handles files >500 lines

## [1.5.2] - 2024-01-02
### Fixed
- Regression in TypeScript type inference
- Timeout handling for slow model responses
### Eval Results  
- Pass rate: 91% -> 89% (known regression, fixing in 2.0)

What to Version

Prompts

System prompts, user prompt templates, few-shot examples. Small changes can have large behavioral effects.

Tool definitions

What tools are available, their schemas, their implementations. Tool changes affect what the agent can do.

Model configuration

Which model, what temperature, what context window. Model changes affect capabilities and costs.

Eval results

Tie eval results to versions. You should be able to see exactly how each version performed on the test suite.
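One lightweight way to make a version reproducible is a pinned manifest per release that records a content hash for everything that shapes behavior. A sketch, assuming the directory layout from the bootstrap example; the manifest format itself is an assumption.

# agents/codegen/freeze_version.py - sketch: write an immutable manifest that
# pins prompt, tools, and model config by content hash (layout is hypothetical)
import hashlib
import json
import pathlib

AGENT_DIR = pathlib.Path("agents/codegen/v2")
PINNED = ["prompt.md", "tools.json", "config.yaml"]

def sha256(path: pathlib.Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def freeze(version: str) -> None:
    manifest = {
        "version": version,
        "files": {name: sha256(AGENT_DIR / name) for name in PINNED},
    }
    out = AGENT_DIR / f"manifest-{version}.json"
    out.write_text(json.dumps(manifest, indent=2))
    print(f"Wrote {out}; commit it alongside the eval results for {version}.")

if __name__ == "__main__":
    freeze("2.1.0")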

Avoiding infinite recursion

When agents build agents, you can end up in loops. Agent A improves Agent B, which triggers Agent C, which needs Agent A... Cycle detection and termination conditions are essential.

Recursion Safeguards

Depth limits

Maximum nesting depth for agent calls. Agent A can call Agent B can call Agent C, but no deeper. Configurable per use case.

Call graphs

Track which agents called which. Detect cycles. If Agent A is already in the call stack, don't let it be called again.

Budget limits

Total token budget across all agents in a workflow. Once exhausted, everything stops. Prevents runaway costs from loops.

Convergence detection

If an agent produces the same output twice, stop. The loop isn't making progress. Useful for iterative improvement workflows.
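The safeguards above compose naturally into a small wrapper around agent-to-agent calls. A sketch, assuming each agent is callable by name and reports the tokens it consumed; the limits are illustrative.

# orchestrator/safeguards.py - sketch of depth, cycle, budget, and convergence
# guards around agent-to-agent calls (call_agent is a placeholder)
MAX_DEPTH = 3
TOKEN_BUDGET = 200_000

class RecursionGuardError(RuntimeError):
    pass

def call_agent(name: str, task: str) -> tuple[str, int]:
    """Placeholder: run the named agent and return (output, tokens_used)."""
    raise NotImplementedError

def guarded_call(name: str, task: str, call_stack: list[str],
                 spent_tokens: int, last_outputs: dict[str, str]) -> tuple[str, int]:
    # Depth limit: no nesting past MAX_DEPTH.
    if len(call_stack) >= MAX_DEPTH:
        raise RecursionGuardError(f"depth limit hit: {' -> '.join(call_stack)}")
    # Cycle detection: an agent already on the stack may not be called again.
    if name in call_stack:
        raise RecursionGuardError(f"cycle: {' -> '.join(call_stack + [name])}")
    # Budget limit: stop everything once the shared token budget is exhausted.
    if spent_tokens >= TOKEN_BUDGET:
        raise RecursionGuardError("token budget exhausted")

    output, tokens = call_agent(name, task)

    # Convergence detection: identical output to last time means no progress.
    if last_outputs.get(name) == output:
        raise RecursionGuardError(f"{name} produced the same output twice; stopping")
    last_outputs[name] = output
    return output, spent_tokens + tokens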

What goes wrong

Agent optimizes the wrong metric

Self-improving agent finds a way to game the eval. Metrics look great, actual quality is terrible. Goodhart's law in action.

Fix: Multiple diverse metrics, not one score. Include human evaluation samples. Watch for metric/reality divergence.

Cascading failures

Bug in Layer 2 agent corrupts Layer 1 agents, which produce broken application code. By the time you notice, there's damage everywhere.

Fix: Staged deployments. Layer changes propagate slowly. Monitoring at each layer catches problems before they cascade.

Context dilution

As agents iterate on themselves, prompts grow. Context windows fill with accumulated cruft. Performance degrades gradually.

Fix: Periodic human review of agent configurations. Prune accumulated complexity. Treat prompt size as technical debt.
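A crude but effective complement is to treat prompt size like a lint rule: fail the build when a prompt file grows past a budget you have actually reviewed. A sketch; the byte budget below is an arbitrary illustration, not a recommendation.

# tools/check_prompt_size.py - sketch: flag prompts that have accumulated cruft
import pathlib
import sys

BUDGET_BYTES = 16_000  # arbitrary illustrative budget

def main() -> int:
    oversized = [p for p in pathlib.Path("agents").rglob("prompt.md")
                 if p.stat().st_size > BUDGET_BYTES]
    for p in oversized:
        print(f"{p}: {p.stat().st_size} bytes exceeds {BUDGET_BYTES}; schedule a prune")
    return 1 if oversized else 0

if __name__ == "__main__":
    sys.exit(main())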

Lost reproducibility

Agents modify their own behavior. You can't recreate last week's version because it was overwritten. No rollback possible.

Fix: All agent configs in version control. Immutable versions. Explicit deployment process. Never allow in-place modification.

Summary

  • The meta-loop (agents building agents) is powerful but needs layer-specific guardrails
  • Bootstrap with human-written v0, then progressively increase agent autonomy
  • No agent change ships without eval results — measure improvement or don't deploy
  • Self-improvement needs strict constraints: agents can't modify their own evals or guardrails
  • Version everything — prompts, tools, configs, eval results. Rollback must be possible.
