Artifacts of Success & Failure
Core Questions
- What evidence gets produced by each agent run?
- How do you capture and store proof — screenshots, logs, test results, diffs?
- How do you use artifacts to build trust over time?
An agent says it fixed the bug. Did it? Without evidence, you're taking its word for it. Artifacts are the proof — screenshots, test results, logs, diffs, recordings. They're how you verify work without re-doing it. They're how you build trust over time. And they're how you debug when something goes wrong.
Why artifacts matter
Human developers leave natural trails: commit messages, PR descriptions, Slack threads, meeting notes. Agents don't have this ambient documentation. You have to build the artifact pipeline explicitly.
Artifact Use Cases
Verification
Did the change actually work? Screenshot shows the button renders. Test results show the behavior is correct. Diff shows what changed.
Review
Reviewers can see before/after without running the code. Video recording shows the feature in action. Logs show the execution path.
Debugging
When something breaks later, artifacts show what the agent saw. Console logs, network traces, error messages — all captured at runtime.
Trust building
Over time, consistent high-quality artifacts build confidence. You can see the agent's track record. Success rate becomes measurable.
Compliance
Some industries require audit trails. Artifacts prove what was done, when, and by whom. Chain of custody for agent-authored changes.
Types of artifacts
Different tasks produce different evidence. A comprehensive artifact pipeline captures all relevant proof for the type of work being done.
Artifact Types
Screenshots
Visual proof of UI state. Before/after comparisons. Error states captured. Most valuable for frontend work and visual verification.
Video recordings
Interaction flows, multi-step processes, timing-dependent behaviors. Shows what screenshots can't: transitions, animations, sequences.
Test results
Pass/fail outcomes with details. Coverage reports. Performance benchmarks. Machine-readable proof that code works as intended.
Diffs
Exactly what changed, line by line. Git diffs, file-level changes, configuration updates. The source of truth for modifications.
Logs
Console output, build logs, runtime traces. What happened during execution. Essential for debugging failures.
API responses
Captured request/response pairs. Useful for API development, integration testing, and verifying backend behavior.
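Capturing API traffic doesn't require a proxy: if the agent already drives a browser, the traffic can be recorded in the same session. Here is a minimal sketch using Playwright's response event; the output path and the /api/ URL filter are illustrative assumptions.
// scripts/capture-api-traffic.ts (illustrative path)
import { chromium } from 'playwright';
import { mkdir, writeFile } from 'fs/promises';
import { dirname } from 'path';

async function captureApiTraffic(url: string, outputFile: string) {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  const captured: Array<Record<string, unknown>> = [];
  const pending: Promise<void>[] = [];

  // Record every response whose URL looks like an API call (adjust the filter to your app)
  page.on('response', (response) => {
    if (!response.url().includes('/api/')) return;
    const entry: Record<string, unknown> = {
      url: response.url(),
      method: response.request().method(),
      status: response.status(),
    };
    captured.push(entry);
    // Body capture is best-effort: it can fail for redirects or binary payloads
    pending.push(
      response.text()
        .then((body) => { entry.body = body; })
        .catch(() => { entry.body = null; })
    );
  });

  await page.goto(url, { waitUntil: 'networkidle' });
  await Promise.all(pending);
  await browser.close();

  await mkdir(dirname(outputFile), { recursive: true });
  await writeFile(outputFile, JSON.stringify(captured, null, 2));
}

captureApiTraffic('http://localhost:3000/dashboard', 'artifacts/api/dashboard-traffic.json')
  .catch((err) => { console.error(err); process.exit(1); });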
Screenshot pipelines
Automated screenshot capture is one of the highest-value artifact investments. It turns "I think it works" into "here's proof it works."
// scripts/capture-screenshots.ts
import { chromium } from 'playwright';
import { mkdir, writeFile } from 'fs/promises';

interface ScreenshotConfig {
  name: string;
  url: string;
  waitFor?: string;
  actions?: Array<{ type: string; selector?: string; value?: string }>;
}

async function captureScreenshots(
  configs: ScreenshotConfig[],
  outputDir: string
) {
  const browser = await chromium.launch();
  const context = await browser.newContext({
    viewport: { width: 1280, height: 720 }
  });
  await mkdir(outputDir, { recursive: true });

  for (const config of configs) {
    const page = await context.newPage();
    await page.goto(config.url);
    if (config.waitFor) {
      await page.waitForSelector(config.waitFor);
    }
    // Execute any setup actions
    for (const action of config.actions || []) {
      if (action.type === 'click') {
        await page.click(action.selector!);
      } else if (action.type === 'fill') {
        await page.fill(action.selector!, action.value!);
      }
    }
    const screenshot = await page.screenshot({ fullPage: true });
    await writeFile(`${outputDir}/${config.name}.png`, screenshot);
    await page.close();
  }

  await browser.close();
}

// Example usage in CI
captureScreenshots([
  { name: 'homepage', url: 'http://localhost:3000' },
  { name: 'login', url: 'http://localhost:3000/login' },
  {
    name: 'dashboard-loaded',
    url: 'http://localhost:3000/dashboard',
    waitFor: '[data-testid="dashboard-content"]'
  }
], 'artifacts/screenshots').catch((err) => {
  // Fail the CI job if capture fails, so missing screenshots don't go unnoticed
  console.error(err);
  process.exit(1);
});
Screenshot Best Practices
Consistent viewport
Use the same resolution for all screenshots. Makes before/after comparisons meaningful. Standard: 1280x720 or 1920x1080.
Wait for stability
Don't screenshot during loading. Wait for network idle, animations complete, data loaded. Flaky screenshots waste everyone's time (see the helper sketch after this list).
Meaningful names
homepage-logged-out.png beats screenshot-1.png. Include state in the filename. Make it greppable.
Before and after
For changes, capture both states. Before the fix, after the fix. The comparison tells the story.
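To make "wait for stability" concrete, here is a minimal helper sketch that pairs with the pipeline above: it waits for the network to go idle and freezes CSS animations before capturing.
// A stability-first screenshot helper (sketch; assumes the Playwright setup above)
import type { Page } from 'playwright';

export async function stableScreenshot(page: Page, path: string) {
  // Let in-flight requests finish so we don't capture spinners or skeleton screens
  await page.waitForLoadState('networkidle');
  // Freeze CSS animations and transitions so moving elements don't blur the capture
  await page.addStyleTag({
    content: '*, *::before, *::after { animation: none !important; transition: none !important; }'
  });
  await page.screenshot({ path, fullPage: true });
}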
Test result archiving
Test results are the primary quality signal for most codebases. Archive them with enough context to understand what passed, what failed, and why.
# artifacts/test-results/run-2024-01-15-143022.json
{
  "run_id": "run-2024-01-15-143022",
  "timestamp": "2024-01-15T14:30:22Z",
  "trigger": {
    "type": "agent-task",
    "task_id": "fix-login-validation",
    "agent": "codegen-v2"
  },
  "summary": {
    "total": 142,
    "passed": 140,
    "failed": 2,
    "skipped": 0,
    "duration_ms": 34521
  },
  "failed_tests": [
    {
      "name": "auth.login.validation.emptyEmail",
      "file": "tests/auth/login.test.ts",
      "error": "Expected 'Email required' but got 'Invalid email format'",
      "stack": "...",
      "screenshot": "screenshots/login-empty-email-error.png"
    },
    {
      "name": "auth.login.rateLimit.exceeded",
      "file": "tests/auth/login.test.ts",
      "error": "Timeout waiting for rate limit message",
      "stack": "...",
      "known_flaky": true
    }
  ],
  "coverage": {
    "lines": 84.2,
    "branches": 72.1,
    "functions": 91.3
  },
  "environment": {
    "node": "20.10.0",
    "os": "ubuntu-22.04",
    "ci": true
  }
}
Structure test results as data, not just logs. Machine-readable results enable trend analysis, flaky test detection, and automated triage.
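With a consistent schema like the one above, a short script can mine archived runs for trends and flaky candidates. Here is a sketch, assuming one JSON file per run under artifacts/test-results.
// scripts/analyze-test-runs.ts (illustrative path)
import { readdir, readFile } from 'fs/promises';

interface TestRun {
  run_id: string;
  summary: { total: number; passed: number; failed: number };
  failed_tests: Array<{ name: string; known_flaky?: boolean }>;
}

async function analyzeRuns(dir: string) {
  const files = (await readdir(dir)).filter((f) => f.endsWith('.json'));
  const runs: TestRun[] = [];
  for (const file of files) {
    runs.push(JSON.parse(await readFile(`${dir}/${file}`, 'utf8')));
  }

  // Pass-rate trend: fraction of tests passing in each run
  const passRates = runs.map((r) => ({
    run: r.run_id,
    passRate: r.summary.passed / r.summary.total,
  }));

  // Flaky candidates: tests that fail in some runs but not all of them
  const failureCounts = new Map<string, number>();
  for (const run of runs) {
    for (const test of run.failed_tests) {
      failureCounts.set(test.name, (failureCounts.get(test.name) ?? 0) + 1);
    }
  }
  const flakyCandidates = [...failureCounts.entries()]
    .filter(([, count]) => count > 1 && count < runs.length)
    .map(([name]) => name);

  return { passRates, flakyCandidates };
}

analyzeRuns('artifacts/test-results').then((report) => console.log(JSON.stringify(report, null, 2)));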
Diff summaries
Raw diffs are noisy. Summaries highlight what matters: new files, deleted files, functions changed, lines added/removed. Give reviewers the signal.
# artifacts/diff-summary.md
## Change Summary
**Task:** Fix login validation error messages
**Branch:** fix/login-validation
**Files changed:** 3
### Modified Files
| File | +Lines | -Lines | Changes |
|------|--------|--------|---------|
| lib/auth/validation.ts | +12 | -4 | Updated error messages |
| components/LoginForm.tsx | +3 | -1 | Added error display |
| tests/auth/login.test.ts | +24 | -0 | Added validation tests |
### Key Changes
- **validation.ts**: Changed error messages to be more user-friendly
- "Invalid input" → "Email address is required"
- "Bad format" → "Please enter a valid email address"
- **LoginForm.tsx**: Now displays validation errors below input field
- Added ErrorMessage component
- Errors clear on input change
### Test Impact
- 4 new tests added for validation edge cases
- All existing tests pass
- Coverage: 84.2% → 85.1%
### Screenshots
- [before-fix.png](./screenshots/before-fix.png)
- [after-fix.png](./screenshots/after-fix.png)
Principle
The summary should answer "what changed and why" without reading code
A reviewer glancing at the summary should understand the scope and intent of the change. If they need more detail, they can dig into the full diff.
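Much of the summary can be generated rather than written. Here is a sketch that builds the file table from git's --numstat output; it assumes a base branch named main, and the per-file "Changes" descriptions still come from the agent.
// scripts/diff-summary.ts (illustrative path)
import { execFileSync } from 'child_process';

function diffSummaryTable(base = 'main'): string {
  // --numstat prints one "added<TAB>deleted<TAB>path" line per changed file
  const numstat = execFileSync('git', ['diff', '--numstat', base], { encoding: 'utf8' });
  const rows = numstat
    .trim()
    .split('\n')
    .filter(Boolean)
    .map((line) => {
      const [added, deleted, file] = line.split('\t');
      return `| ${file} | +${added} | -${deleted} |`;
    });
  return ['| File | +Lines | -Lines |', '|------|--------|--------|', ...rows].join('\n');
}

console.log(diffSummaryTable());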
Structured output capture
Agents produce various outputs: logs, decisions, tool calls, results. Capture all of it in a structured format for later analysis.
# artifacts/agent-run/run-abc123.jsonl
{"ts":"2024-01-15T14:30:00Z","type":"task_start","task_id":"fix-login","agent":"codegen-v2"}
{"ts":"2024-01-15T14:30:01Z","type":"context_load","files":["lib/auth/validation.ts","components/LoginForm.tsx"]}
{"ts":"2024-01-15T14:30:02Z","type":"analysis","finding":"Error messages are generic, not user-friendly"}
{"ts":"2024-01-15T14:30:05Z","type":"decision","choice":"Update validation.ts first, then update component"}
{"ts":"2024-01-15T14:30:10Z","type":"tool_call","tool":"edit_file","args":{"file":"lib/auth/validation.ts"}}
{"ts":"2024-01-15T14:30:12Z","type":"tool_result","status":"success","lines_changed":16}
{"ts":"2024-01-15T14:30:15Z","type":"tool_call","tool":"run_tests","args":{"pattern":"auth"}}
{"ts":"2024-01-15T14:30:45Z","type":"tool_result","status":"success","tests":{"passed":24,"failed":0}}
{"ts":"2024-01-15T14:30:46Z","type":"task_complete","status":"success","duration_ms":46000}What to Capture
Every tool call
What tool, what arguments, what result. The full interaction history with external systems.
Decisions and reasoning
Why did the agent choose approach A over B? Log the decision points. Essential for debugging wrong turns.
Timing information
How long did each step take? Where is time being spent? Enables performance optimization.
Resource usage
Tokens consumed, API calls made, cost incurred. Track the economics of each run.
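A logger for this doesn't need much machinery: one JSON object per line, appended as events happen. Here is a minimal sketch; the file path and function names are illustrative.
// lib/run-logger.ts (illustrative path)
import { appendFile } from 'fs/promises';

type RunEvent = { type: string; [key: string]: unknown };

export function createRunLogger(logFile: string) {
  return async function log(event: RunEvent) {
    // Append-only JSON Lines: cheap to write mid-run, easy to grep and to parse later
    const record = { ts: new Date().toISOString(), ...event };
    await appendFile(logFile, JSON.stringify(record) + '\n');
  };
}

// Usage inside an agent run:
//   const log = createRunLogger('artifacts/agent-run/run-abc123.jsonl');
//   await log({ type: 'tool_call', tool: 'run_tests', args: { pattern: 'auth' } });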
Artifact retention policies
Artifacts accumulate. Screenshots, logs, test results — they add up. You need retention policies that balance storage costs against debugging needs.
Retention Tiers
Hot (7 days)
Recent artifacts for active debugging. Full resolution, fast access. Screenshots, detailed logs, all test outputs.
Warm (30 days)
Recent history for trend analysis. Compressed, indexed for search. Test summaries, diff metadata, key screenshots.
Cold (1 year)
Compliance and long-term analysis. Heavily compressed, slow retrieval. Aggregate metrics, audit-relevant events only.
Permanent
Artifacts tied to shipped releases. Never delete. Reference point for production debugging and historical comparison.
# .github/workflows/artifact-cleanup.yml
name: Artifact Retention
on:
  schedule:
    - cron: '0 2 * * *' # Daily at 2 AM
jobs:
  cleanup:
    runs-on: ubuntu-latest
    steps:
      - name: Compress aging screenshots into warm tier (>7 days)
        run: |
          mkdir -p artifacts/warm
          find artifacts/hot -mtime +7 -name "*.png" -print0 | while IFS= read -r -d '' file; do
            convert "$file" -quality 60 "artifacts/warm/$(basename "$file")"
          done
      - name: Clean hot tier (>7 days)
        run: |
          find artifacts/hot -mtime +7 -type f -delete
      - name: Archive warm tier to cold tier (>30 days)
        run: |
          mkdir -p artifacts/cold
          find artifacts/warm -mtime +30 -type f -print0 \
            | tar --null -czf "artifacts/cold/archive-$(date +%Y%m%d).tar.gz" --files-from=-
          find artifacts/warm -mtime +30 -type f -delete
      - name: Tag release artifacts as permanent
        run: |
          # Artifacts from tagged releases never expire
          for tag in $(git tag --list 'v*'); do
            cp -r "artifacts/release/$tag" artifacts/permanent/
          done
Building trust through artifacts
Consistent, high-quality artifacts build confidence over time. When every agent run produces verifiable evidence, patterns emerge. You can see success rates, common failure modes, and improvement trends.
Trust Metrics from Artifacts
Success rate
What percentage of agent tasks complete successfully? Track over time. Rising rate = improving agent (see the sketch after this list).
Test pass rate
Do agent-authored changes pass tests on first try? High rate = reliable code generation. Low rate = needs tuning.
Review approval rate
How often do humans approve agent PRs without changes? Tracks whether the agent meets human quality standards.
Post-deploy incidents
Do agent changes cause production issues? The ultimate trust metric. Tie artifacts to post-deploy outcomes.
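Several of these metrics fall out of the structured run logs directly. Here is a sketch that computes task success rate from the JSONL events shown earlier; it assumes one .jsonl file per run under artifacts/agent-run.
// scripts/success-rate.ts (illustrative path)
import { readdir, readFile } from 'fs/promises';

async function successRate(runDir: string) {
  const files = (await readdir(runDir)).filter((f) => f.endsWith('.jsonl'));
  let completed = 0;
  let succeeded = 0;

  for (const file of files) {
    const lines = (await readFile(`${runDir}/${file}`, 'utf8')).split('\n').filter(Boolean);
    for (const line of lines) {
      const event = JSON.parse(line);
      // Count terminal events only; a run with no task_complete event is still in flight or crashed
      if (event.type === 'task_complete') {
        completed += 1;
        if (event.status === 'success') succeeded += 1;
      }
    }
  }

  return { completed, succeeded, rate: completed === 0 ? 0 : succeeded / completed };
}

successRate('artifacts/agent-run').then(console.log);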
What goes wrong
Artifacts not captured
Agent completes task but no artifacts saved. Later, you need to verify what happened. No evidence exists.
Fix: Make artifact capture mandatory, not optional. Task isn't complete until artifacts are stored. Fail the run if capture fails.
Screenshots miss the problem
Screenshot captured but doesn't show the relevant state. Wrong viewport, wrong timing, element not visible.
Fix: Capture multiple screenshots at different states. Include full-page and focused views. Wait for stability before capture.
Storage costs explode
Every run saves gigabytes of artifacts. The storage bill climbs with every run, and nobody can find what they need in the pile anyway.
Fix: Implement retention tiers. Compress aggressively. Index and search instead of browse. Delete what's not needed.
Artifacts not linked to context
You have a screenshot but don't know which task, branch, or commit it came from. The artifact exists but lacks meaning.
Fix: Include metadata with every artifact. Task ID, commit SHA, timestamp, agent version. Make context queryable.
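One lightweight way to keep artifacts linked to context is a metadata sidecar written next to every file. Here is a sketch; field names are illustrative, and the task ID and agent version come from wherever your orchestrator exposes them.
// lib/artifact-metadata.ts (illustrative path)
import { writeFile } from 'fs/promises';
import { execFileSync } from 'child_process';

interface ArtifactMetadata {
  taskId: string;
  agentVersion: string;
  commitSha: string;
  branch: string;
  capturedAt: string;
}

export async function writeSidecar(artifactPath: string, taskId: string, agentVersion: string) {
  const git = (...args: string[]) => execFileSync('git', args, { encoding: 'utf8' }).trim();
  const metadata: ArtifactMetadata = {
    taskId,
    agentVersion,
    commitSha: git('rev-parse', 'HEAD'),
    branch: git('rev-parse', '--abbrev-ref', 'HEAD'),
    capturedAt: new Date().toISOString(),
  };
  // e.g. screenshots/after-fix.png -> screenshots/after-fix.png.meta.json
  await writeFile(`${artifactPath}.meta.json`, JSON.stringify(metadata, null, 2));
}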
Summary
- Artifacts are proof — screenshots, tests, logs, diffs — that verify agent work
- Capture automatically and make it mandatory — tasks aren't complete without artifacts
- Structure outputs as data (JSON, JSONL) for analysis, not just logs
- Implement retention tiers — hot, warm, cold, permanent — to manage costs
- Use artifact patterns over time to build trust and measure improvement