Artifacts of Success & Failure
Core Questions
- What evidence gets produced by each agent run?
- How do you capture and store proof — screenshots, logs, test results, diffs?
- How do you use artifacts to build trust over time?
An agent says it fixed the bug. Did it? Without evidence, you're taking its word for it. Artifacts are the proof — screenshots, test results, logs, diffs, recordings. They're how you verify work without re-doing it. They're how you build trust over time. And they're how you debug when something goes wrong.
Why artifacts matter
Human developers leave natural trails: commit messages, PR descriptions, Slack threads, meeting notes. Agents don't have this ambient documentation. You have to build the artifact pipeline explicitly.
Artifact Use Cases
Verification
Did the change actually work? Screenshot shows the button renders. Test results show the behavior is correct. Diff shows what changed.
Review
Reviewers can see before/after without running the code. Video recording shows the feature in action. Logs show the execution path.
Debugging
When something breaks later, artifacts show what the agent saw. Console logs, network traces, error messages — all captured at runtime.
Trust building
Over time, consistent high-quality artifacts build confidence. You can see the agent's track record. Success rate becomes measurable.
Compliance
Some industries require audit trails. Artifacts prove what was done, when, and by whom. Chain of custody for agent-authored changes.
Types of artifacts
Different tasks produce different evidence. A comprehensive artifact pipeline captures all relevant proof for the type of work being done.
Artifact Types
Screenshots
Visual proof of UI state. Before/after comparisons. Error states captured. Most valuable for frontend work and visual verification.
Video recordings
Interaction flows, multi-step processes, timing-dependent behaviors. Shows what screenshots can't: transitions, animations, sequences.
Test results
Pass/fail outcomes with details. Coverage reports. Performance benchmarks. Machine-readable proof that code works as intended.
Diffs
Exactly what changed, line by line. Git diffs, file-level changes, configuration updates. The source of truth for modifications.
Logs
Console output, build logs, runtime traces. What happened during execution. Essential for debugging failures.
API responses
Captured request/response pairs. Useful for API development, integration testing, and verifying backend behavior.
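Capturing API traffic doesn't require a proxy: if the agent already drives a browser, the traffic can be recorded in the same session. Here is a minimal sketch using Playwright's response event; the output path and the /api/ URL filter are illustrative assumptions.
// scripts/capture-api-traffic.ts (illustrative path)
import { chromium } from 'playwright';
import { mkdir, writeFile } from 'fs/promises';
import { dirname } from 'path';

async function captureApiTraffic(url: string, outputFile: string) {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  const captured: Array<Record<string, unknown>> = [];
  const pending: Promise<void>[] = [];

  // Record every response whose URL looks like an API call (adjust the filter to your app)
  page.on('response', (response) => {
    if (!response.url().includes('/api/')) return;
    const entry: Record<string, unknown> = {
      url: response.url(),
      method: response.request().method(),
      status: response.status(),
    };
    captured.push(entry);
    // Body capture is best-effort: it can fail for redirects or binary payloads
    pending.push(
      response.text()
        .then((body) => { entry.body = body; })
        .catch(() => { entry.body = null; })
    );
  });

  await page.goto(url, { waitUntil: 'networkidle' });
  await Promise.all(pending);
  await browser.close();

  await mkdir(dirname(outputFile), { recursive: true });
  await writeFile(outputFile, JSON.stringify(captured, null, 2));
}

captureApiTraffic('http://localhost:3000/dashboard', 'artifacts/api/dashboard-traffic.json')
  .catch((err) => { console.error(err); process.exit(1); });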
Screenshot pipelines
Automated screenshot capture is one of the highest-value artifact investments. It turns "I think it works" into "here's proof it works."
// scripts/capture-screenshots.ts
import { chromium } from 'playwright';
import { mkdir, writeFile } from 'fs/promises';

interface ScreenshotConfig {
  name: string;
  url: string;
  waitFor?: string;
  actions?: Array<{ type: string; selector?: string; value?: string }>;
}

async function captureScreenshots(
  configs: ScreenshotConfig[],
  outputDir: string
) {
  const browser = await chromium.launch();
  const context = await browser.newContext({
    viewport: { width: 1280, height: 720 }
  });
  await mkdir(outputDir, { recursive: true });

  for (const config of configs) {
    const page = await context.newPage();
    await page.goto(config.url);
    if (config.waitFor) {
      await page.waitForSelector(config.waitFor);
    }
    // Execute any setup actions
    for (const action of config.actions || []) {
      if (action.type === 'click') {
        await page.click(action.selector!);
      } else if (action.type === 'fill') {
        await page.fill(action.selector!, action.value!);
      }
    }
    const screenshot = await page.screenshot({ fullPage: true });
    await writeFile(`${outputDir}/${config.name}.png`, screenshot);
    await page.close();
  }

  await browser.close();
}

// Example usage in CI
captureScreenshots([
  { name: 'homepage', url: 'http://localhost:3000' },
  { name: 'login', url: 'http://localhost:3000/login' },
  {
    name: 'dashboard-loaded',
    url: 'http://localhost:3000/dashboard',
    waitFor: '[data-testid="dashboard-content"]'
  }
], 'artifacts/screenshots').catch((err) => {
  // Fail the CI job if capture fails, so missing screenshots don't go unnoticed
  console.error(err);
  process.exit(1);
});
Screenshot Best Practices
Consistent viewport
Use the same resolution for all screenshots. Makes before/after comparisons meaningful. Standard: 1280x720 or 1920x1080.
Wait for stability
Don't screenshot during loading. Wait for network idle, animations complete, data loaded. Flaky screenshots waste everyone's time (see the helper sketch after this list).
Meaningful names
homepage-logged-out.png beats screenshot-1.png. Include state in the filename. Make it greppable.
Before and after
For changes, capture both states. Before the fix, after the fix. The comparison tells the story.
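To make "wait for stability" concrete, here is a minimal helper sketch that pairs with the pipeline above: it waits for the network to go idle and freezes CSS animations before capturing.
// A stability-first screenshot helper (sketch; assumes the Playwright setup above)
import type { Page } from 'playwright';

export async function stableScreenshot(page: Page, path: string) {
  // Let in-flight requests finish so we don't capture spinners or skeleton screens
  await page.waitForLoadState('networkidle');
  // Freeze CSS animations and transitions so moving elements don't blur the capture
  await page.addStyleTag({
    content: '*, *::before, *::after { animation: none !important; transition: none !important; }'
  });
  await page.screenshot({ path, fullPage: true });
}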
Test result archiving
Test results are the primary quality signal for most codebases. Archive them with enough context to understand what passed, what failed, and why.
# artifacts/test-results/run-2024-01-15-143022.json
{
  "run_id": "run-2024-01-15-143022",
  "timestamp": "2024-01-15T14:30:22Z",
  "trigger": {
    "type": "agent-task",
    "task_id": "fix-login-validation",
    "agent": "codegen-v2"
  },
  "summary": {
    "total": 142,
    "passed": 140,
    "failed": 2,
    "skipped": 0,
    "duration_ms": 34521
  },
  "failed_tests": [
    {
      "name": "auth.login.validation.emptyEmail",
      "file": "tests/auth/login.test.ts",
      "error": "Expected 'Email required' but got 'Invalid email format'",
      "stack": "...",
      "screenshot": "screenshots/login-empty-email-error.png"
    },
    {
      "name": "auth.login.rateLimit.exceeded",
      "file": "tests/auth/login.test.ts",
      "error": "Timeout waiting for rate limit message",
      "stack": "...",
      "known_flaky": true
    }
  ],
  "coverage": {
    "lines": 84.2,
    "branches": 72.1,
    "functions": 91.3
  },
  "environment": {
    "node": "20.10.0",
    "os": "ubuntu-22.04",
    "ci": true
  }
}
Structure test results as data, not just logs. Machine-readable results enable trend analysis, flaky test detection, and automated triage.
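With a consistent schema like the one above, a short script can mine archived runs for trends and flaky candidates. Here is a sketch, assuming one JSON file per run under artifacts/test-results.
// scripts/analyze-test-runs.ts (illustrative path)
import { readdir, readFile } from 'fs/promises';

interface TestRun {
  run_id: string;
  summary: { total: number; passed: number; failed: number };
  failed_tests: Array<{ name: string; known_flaky?: boolean }>;
}

async function analyzeRuns(dir: string) {
  const files = (await readdir(dir)).filter((f) => f.endsWith('.json'));
  const runs: TestRun[] = [];
  for (const file of files) {
    runs.push(JSON.parse(await readFile(`${dir}/${file}`, 'utf8')));
  }

  // Pass-rate trend: fraction of tests passing in each run
  const passRates = runs.map((r) => ({
    run: r.run_id,
    passRate: r.summary.passed / r.summary.total,
  }));

  // Flaky candidates: tests that fail in some runs but not all of them
  const failureCounts = new Map<string, number>();
  for (const run of runs) {
    for (const test of run.failed_tests) {
      failureCounts.set(test.name, (failureCounts.get(test.name) ?? 0) + 1);
    }
  }
  const flakyCandidates = [...failureCounts.entries()]
    .filter(([, count]) => count > 1 && count < runs.length)
    .map(([name]) => name);

  return { passRates, flakyCandidates };
}

analyzeRuns('artifacts/test-results').then((report) => console.log(JSON.stringify(report, null, 2)));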
Diff summaries
Raw diffs are noisy. Summaries highlight what matters: new files, deleted files, functions changed, lines added/removed. Give reviewers the signal.
# artifacts/diff-summary.md
## Change Summary
**Task:** Fix login validation error messages
**Branch:** fix/login-validation
**Files changed:** 3
### Modified Files
| File | +Lines | -Lines | Changes |
|------|--------|--------|---------|
| lib/auth/validation.ts | +12 | -4 | Updated error messages |
| components/LoginForm.tsx | +3 | -1 | Added error display |
| tests/auth/login.test.ts | +24 | -0 | Added validation tests |
### Key Changes
- **validation.ts**: Changed error messages to be more user-friendly
- "Invalid input" → "Email address is required"
- "Bad format" → "Please enter a valid email address"
- **LoginForm.tsx**: Now displays validation errors below input field
- Added ErrorMessage component
- Errors clear on input change
### Test Impact
- 4 new tests added for validation edge cases
- All existing tests pass
- Coverage: 84.2% → 85.1%
### Screenshots
- [before-fix.png](./screenshots/before-fix.png)
- [after-fix.png](./screenshots/after-fix.png)
Principle
The summary should answer "what changed and why" without reading code
A reviewer glancing at the summary should understand the scope and intent of the change. If they need more detail, they can dig into the full diff.
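Much of the summary can be generated rather than written. Here is a sketch that builds the file table from git's --numstat output; it assumes a base branch named main, and the per-file "Changes" descriptions still come from the agent.
// scripts/diff-summary.ts (illustrative path)
import { execFileSync } from 'child_process';

function diffSummaryTable(base = 'main'): string {
  // --numstat prints one "added<TAB>deleted<TAB>path" line per changed file
  const numstat = execFileSync('git', ['diff', '--numstat', base], { encoding: 'utf8' });
  const rows = numstat
    .trim()
    .split('\n')
    .filter(Boolean)
    .map((line) => {
      const [added, deleted, file] = line.split('\t');
      return `| ${file} | +${added} | -${deleted} |`;
    });
  return ['| File | +Lines | -Lines |', '|------|--------|--------|', ...rows].join('\n');
}

console.log(diffSummaryTable());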
Structured output capture
Agents produce various outputs: logs, decisions, tool calls, results. Capture all of it in a structured format for later analysis.
# artifacts/agent-run/run-abc123.jsonl
{"ts":"2024-01-15T14:30:00Z","type":"task_start","task_id":"fix-login","agent":"codegen-v2"}
{"ts":"2024-01-15T14:30:01Z","type":"context_load","files":["lib/auth/validation.ts","components/LoginForm.tsx"]}
{"ts":"2024-01-15T14:30:02Z","type":"analysis","finding":"Error messages are generic, not user-friendly"}
{"ts":"2024-01-15T14:30:05Z","type":"decision","choice":"Update validation.ts first, then update component"}
{"ts":"2024-01-15T14:30:10Z","type":"tool_call","tool":"edit_file","args":{"file":"lib/auth/validation.ts"}}
{"ts":"2024-01-15T14:30:12Z","type":"tool_result","status":"success","lines_changed":16}
{"ts":"2024-01-15T14:30:15Z","type":"tool_call","tool":"run_tests","args":{"pattern":"auth"}}
{"ts":"2024-01-15T14:30:45Z","type":"tool_result","status":"success","tests":{"passed":24,"failed":0}}
{"ts":"2024-01-15T14:30:46Z","type":"task_complete","status":"success","duration_ms":46000}What to Capture
Every tool call
What tool, what arguments, what result. The full interaction history with external systems.
Decisions and reasoning
Why did the agent choose approach A over B? Log the decision points. Essential for debugging wrong turns.
Timing information
How long did each step take? Where is time being spent? Enables performance optimization.
Resource usage
Tokens consumed, API calls made, cost incurred. Track the economics of each run.
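A logger for this doesn't need much machinery: one JSON object per line, appended as events happen. Here is a minimal sketch; the file path and function names are illustrative.
// lib/run-logger.ts (illustrative path)
import { appendFile } from 'fs/promises';

type RunEvent = { type: string; [key: string]: unknown };

export function createRunLogger(logFile: string) {
  return async function log(event: RunEvent) {
    // Append-only JSON Lines: cheap to write mid-run, easy to grep and to parse later
    const record = { ts: new Date().toISOString(), ...event };
    await appendFile(logFile, JSON.stringify(record) + '\n');
  };
}

// Usage inside an agent run:
//   const log = createRunLogger('artifacts/agent-run/run-abc123.jsonl');
//   await log({ type: 'tool_call', tool: 'run_tests', args: { pattern: 'auth' } });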
Artifact retention policies
Artifacts accumulate. Screenshots, logs, test results — they add up. You need retention policies that balance storage costs against debugging needs.
Retention Tiers
Hot (7 days)
Recent artifacts for active debugging. Full resolution, fast access. Screenshots, detailed logs, all test outputs.
Warm (30 days)
Recent history for trend analysis. Compressed, indexed for search. Test summaries, diff metadata, key screenshots.
Cold (1 year)
Compliance and long-term analysis. Heavily compressed, slow retrieval. Aggregate metrics, audit-relevant events only.
Permanent
Artifacts tied to shipped releases. Never delete. Reference point for production debugging and historical comparison.
# .github/workflows/artifact-cleanup.yml
name: Artifact Retention
on:
  schedule:
    - cron: '0 2 * * *' # Daily at 2 AM
jobs:
  cleanup:
    runs-on: ubuntu-latest
    steps:
      - name: Compress aging screenshots into warm tier (>7 days)
        run: |
          mkdir -p artifacts/warm
          find artifacts/hot -mtime +7 -name "*.png" -print0 | while IFS= read -r -d '' file; do
            convert "$file" -quality 60 "artifacts/warm/$(basename "$file")"
          done
      - name: Clean hot tier (>7 days)
        run: |
          find artifacts/hot -mtime +7 -type f -delete
      - name: Archive warm tier to cold tier (>30 days)
        run: |
          mkdir -p artifacts/cold
          find artifacts/warm -mtime +30 -type f -print0 \
            | tar --null -czf "artifacts/cold/archive-$(date +%Y%m%d).tar.gz" --files-from=-
          find artifacts/warm -mtime +30 -type f -delete
      - name: Tag release artifacts as permanent
        run: |
          # Artifacts from tagged releases never expire
          for tag in $(git tag --list 'v*'); do
            cp -r "artifacts/release/$tag" artifacts/permanent/
          done
Building trust through artifacts
Consistent, high-quality artifacts build confidence over time. When every agent run produces verifiable evidence, patterns emerge. You can see success rates, common failure modes, and improvement trends.
Trust Metrics from Artifacts
Success rate
What percentage of agent tasks complete successfully? Track over time. Rising rate = improving agent (see the sketch after this list).
Test pass rate
Do agent-authored changes pass tests on first try? High rate = reliable code generation. Low rate = needs tuning.
Review approval rate
How often do humans approve agent PRs without changes? Tracks whether the agent meets human quality standards.
Post-deploy incidents
Do agent changes cause production issues? The ultimate trust metric. Tie artifacts to post-deploy outcomes.
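Several of these metrics fall out of the structured run logs directly. Here is a sketch that computes task success rate from the JSONL events shown earlier; it assumes one .jsonl file per run under artifacts/agent-run.
// scripts/success-rate.ts (illustrative path)
import { readdir, readFile } from 'fs/promises';

async function successRate(runDir: string) {
  const files = (await readdir(runDir)).filter((f) => f.endsWith('.jsonl'));
  let completed = 0;
  let succeeded = 0;

  for (const file of files) {
    const lines = (await readFile(`${runDir}/${file}`, 'utf8')).split('\n').filter(Boolean);
    for (const line of lines) {
      const event = JSON.parse(line);
      // Count terminal events only; a run with no task_complete event is still in flight or crashed
      if (event.type === 'task_complete') {
        completed += 1;
        if (event.status === 'success') succeeded += 1;
      }
    }
  }

  return { completed, succeeded, rate: completed === 0 ? 0 : succeeded / completed };
}

successRate('artifacts/agent-run').then(console.log);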
What goes wrong
Artifacts not captured
Agent completes task but no artifacts saved. Later, you need to verify what happened. No evidence exists.
Fix: Make artifact capture mandatory, not optional. Task isn't complete until artifacts are stored. Fail the run if capture fails.
Screenshots miss the problem
Screenshot captured but doesn't show the relevant state. Wrong viewport, wrong timing, element not visible.
Fix: Capture multiple screenshots at different states. Include full-page and focused views. Wait for stability before capture.
Storage costs explode
Every run saves gigabytes of artifacts. The storage bill climbs with every run, and nobody can find what they need in the pile anyway.
Fix: Implement retention tiers. Compress aggressively. Index and search instead of browse. Delete what's not needed.
Artifacts not linked to context
You have a screenshot but don't know which task, branch, or commit it came from. The artifact exists but lacks meaning.
Fix: Include metadata with every artifact. Task ID, commit SHA, timestamp, agent version. Make context queryable.
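One lightweight way to keep artifacts linked to context is a metadata sidecar written next to every file. Here is a sketch; field names are illustrative, and the task ID and agent version come from wherever your orchestrator exposes them.
// lib/artifact-metadata.ts (illustrative path)
import { writeFile } from 'fs/promises';
import { execFileSync } from 'child_process';

interface ArtifactMetadata {
  taskId: string;
  agentVersion: string;
  commitSha: string;
  branch: string;
  capturedAt: string;
}

export async function writeSidecar(artifactPath: string, taskId: string, agentVersion: string) {
  const git = (...args: string[]) => execFileSync('git', args, { encoding: 'utf8' }).trim();
  const metadata: ArtifactMetadata = {
    taskId,
    agentVersion,
    commitSha: git('rev-parse', 'HEAD'),
    branch: git('rev-parse', '--abbrev-ref', 'HEAD'),
    capturedAt: new Date().toISOString(),
  };
  // e.g. screenshots/after-fix.png -> screenshots/after-fix.png.meta.json
  await writeFile(`${artifactPath}.meta.json`, JSON.stringify(metadata, null, 2));
}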
Summary
- Artifacts are proof — screenshots, tests, logs, diffs — that verify agent work
- Capture automatically and make it mandatory — tasks aren't complete without artifacts
- Structure outputs as data (JSON, JSONL) for analysis, not just logs
- Implement retention tiers — hot, warm, cold, permanent — to manage costs
- Use artifact patterns over time to build trust and measure improvement