Observability for the Dev Loop
Core Questions
- What do we log?
- How do we trace actions across systems?
- How do we measure quality and cost?
If you can't see what agents are doing, you can't trust what they've done. Observability for agentic systems isn't optional — it's how you build confidence, debug failures, and optimize costs. Log everything. Trace everything. Measure everything.
What to log
Agent logs need to answer: what was the agent trying to do, what did it actually do, and what was the outcome?
Log Categories
Task lifecycle
Task received, started, completed, failed. Include task ID, spec reference, timestamps, duration.
Agent reasoning
What the agent decided to do and why. Planning steps, tool selections, decision points. This is gold for debugging.
Tool invocations
Every command run, API called, file modified. Input, output, exit code, duration. The detailed audit trail.
Model interactions
Prompts sent, responses received, token counts. Useful for cost analysis and prompt optimization.
Errors and retries
What went wrong, how the agent responded, whether recovery succeeded. Critical for understanding failure modes.
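Taken together, these categories fit a single record shape, which makes them easy to emit consistently. Below is a minimal logging-helper sketch in TypeScript; the event names and field layout are illustrative assumptions, not a fixed schema.

// Minimal structured logger for agent events (illustrative schema).
type AgentEvent =
  | "task_lifecycle"
  | "agent_reasoning"
  | "tool_invocation"
  | "model_interaction"
  | "error";

interface LogRecord {
  timestamp: string;
  level: "info" | "warn" | "error";
  task_id: string;
  agent: string;
  event: AgentEvent;
  [key: string]: unknown; // event-specific payload (tool, input, tokens, ...)
}

function logEvent(
  taskId: string,
  agent: string,
  event: AgentEvent,
  fields: Record<string, unknown>,
  level: LogRecord["level"] = "info",
): void {
  const record: LogRecord = {
    timestamp: new Date().toISOString(),
    level,
    task_id: taskId,
    agent,
    event,
    ...fields,
  };
  // One JSON object per line (JSONL): easy to grep, easy to ingest.
  console.log(JSON.stringify(record));
}

// Usage: record a tool invocation with its outcome and duration.
logEvent("task-abc123", "claude-coder", "tool_invocation", {
  tool: "shell",
  input: { command: "npm test" },
  output: { exit_code: 0 },
  duration_ms: 3420,
});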
Structured log example
{
"timestamp": "2025-01-15T10:30:45Z",
"level": "info",
"task_id": "task-abc123",
"agent": "claude-coder",
"event": "tool_invocation",
"tool": "shell",
"input": {
"command": "npm test -- --grep 'email validation'"
},
"output": {
"exit_code": 0,
"stdout_lines": 15,
"stderr_lines": 0
},
"duration_ms": 3420,
"tokens": {
"prompt": 0,
"completion": 0
}
}
Tracing across systems
A single task touches many systems: the agent runtime, git, CI, artifact storage, maybe external APIs. Trace IDs let you follow the thread.
Trace propagation
# Commit with trace ID (trailers go in the message body, after a blank line)
git commit -m "Fix email validation

Task-ID: task-abc123
Agent: claude-coder
Charter: github.com/org/repo/issues/42"

# PR labels
gh pr create --label "agent:claude-coder" --label "task:task-abc123"

# CI environment
TASK_ID=task-abc123 npm test
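Propagation also works in reverse: because the task ID rides in a commit trailer, you can recover it from git later, say when a bisect lands on an agent-authored change. A sketch, assuming Node and a git recent enough to support the trailers pretty-format selector:

import { execFileSync } from "node:child_process";

// Recover the Task-ID trailer from a commit, so a merged change
// can be traced back to the originating agent task.
function taskIdForCommit(sha: string): string | null {
  const out = execFileSync(
    "git",
    ["log", "-1", "--format=%(trailers:key=Task-ID,valueonly)", sha],
    { encoding: "utf8" },
  ).trim();
  return out.length > 0 ? out : null;
}

console.log(taskIdForCommit("HEAD")); // "task-abc123" when the trailer is present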
Metrics to track
Metrics tell you how well your agent system is performing. Track both operational metrics (is it working?) and quality metrics (is it working well?).
Key Metrics
Task success rate
% of tasks completed successfully. Track by agent, task type, repo.
Time to completion
How long tasks take. Track p50, p90, p99. Identify slow outliers.
PR merge rate
% of agent PRs that get merged (vs closed without merge). Quality signal.
Revision rate
How often PRs need changes after review. Lower is better.
Bug introduction rate
Bugs found in agent-authored code post-merge. Track over time.
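Most of these reduce to simple aggregations over completed-task records. A sketch, assuming each record carries a success flag and a duration; the percentile here uses the nearest-rank method:

interface TaskRecord {
  taskId: string;
  success: boolean;
  durationMs: number;
}

// Nearest-rank percentile over sorted durations.
function percentile(sortedMs: number[], p: number): number {
  const rank = Math.ceil((p / 100) * sortedMs.length);
  return sortedMs[Math.max(0, rank - 1)];
}

function summarize(records: TaskRecord[]) {
  const durations = records
    .map((r) => r.durationMs)
    .sort((a, b) => a - b);
  return {
    successRate: records.filter((r) => r.success).length / records.length,
    p50: percentile(durations, 50),
    p90: percentile(durations, 90),
    p99: percentile(durations, 99),
  };
}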
Cost accounting
Agents aren't free. Track costs to optimize spending and catch runaway tasks.
Cost Components
LLM tokens
Input and output tokens per model. This is usually the biggest cost. Track per task, per agent, per repo.
Compute
VM/container runtime. Per-second billing if available. Track by task type to right-size instances.
Storage
Artifact storage, cache storage, log retention. Often small but grows over time.
External APIs
GitHub API calls, cloud service calls. Usually negligible but can spike.
Cost per task example
Task: task-abc123
Type: bug-fix
Duration: 12 minutes
Costs:
LLM (Claude Sonnet):
Input: 45,000 tokens × $0.003/1K = $0.135
Output: 12,000 tokens × $0.015/1K = $0.180
Compute (2 vCPU, 4GB):
12 min × $0.05/hour = $0.01
Total: $0.325
Cost per successful task (90% success rate): $0.36
Set cost budgets per task. If a task exceeds its budget, it's either stuck or doing too much. Kill and escalate.
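That arithmetic is worth automating so budget checks can run while the task is still alive. A sketch with the example's rates hard-coded; the per-token and compute rates are assumptions, so replace them with your provider's current pricing:

// Illustrative rates matching the example above; real rates vary by
// model, provider, and date, so treat these constants as placeholders.
const INPUT_PER_1K = 0.003;    // $ per 1K input tokens
const OUTPUT_PER_1K = 0.015;   // $ per 1K output tokens
const COMPUTE_PER_HOUR = 0.05; // $ per hour, assumed 2 vCPU / 4 GB

function taskCost(inputTokens: number, outputTokens: number, minutes: number): number {
  const llm = (inputTokens / 1000) * INPUT_PER_1K + (outputTokens / 1000) * OUTPUT_PER_1K;
  const compute = (minutes / 60) * COMPUTE_PER_HOUR;
  return llm + compute;
}

// A task over budget is either stuck or doing too much: kill and escalate.
function checkBudget(cost: number, budgetUsd: number): "continue" | "kill_and_escalate" {
  return cost <= budgetUsd ? "continue" : "kill_and_escalate";
}

const cost = taskCost(45_000, 12_000, 12); // ≈ $0.325, as in the example
console.log(cost.toFixed(3), checkBudget(cost, 0.50));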
Alerting
Not everything needs human attention. Set up alerts for things that actually matter:
Alert Triggers
Success rate drop
Success rate falls below threshold (e.g., < 80% over 1 hour).
Cost spike
Hourly/daily cost exceeds 2× normal. Could be runaway task.
Long-running task
Task exceeds expected duration by 3×. Probably stuck.
Repeated failures
Same task type failing repeatedly. Systemic issue.
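Each trigger reduces to a threshold check over a sliding window. A sketch of the first two rules, using the thresholds above as defaults:

interface WindowStats {
  tasksStarted: number;
  tasksSucceeded: number;
  costUsd: number;         // spend in this window
  baselineCostUsd: number; // normal spend for a comparable window
}

// Returns only actionable findings; an empty array pages nobody.
function evaluateAlerts(w: WindowStats): string[] {
  const findings: string[] = [];
  const successRate = w.tasksSucceeded / Math.max(1, w.tasksStarted);
  if (successRate < 0.8) {
    findings.push(`success rate ${(successRate * 100).toFixed(0)}% below 80%`);
  }
  if (w.costUsd > 2 * w.baselineCostUsd) {
    findings.push(`cost $${w.costUsd.toFixed(2)} over 2x baseline`);
  }
  return findings;
}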
What goes wrong
No logs when needed
Something breaks. You go to debug. Logs are empty or unhelpful. You can't figure out what happened. Log more than you think you need.
Untraceable PRs
A bug is found in merged code. Which task produced it? Which agent? No trace IDs in commits. Can't reconstruct the history.
Cost blindness
End of month: surprise $10K bill. No idea which tasks or agents drove the cost. Can't optimize what you can't measure.
Alert fatigue
Too many alerts. Team ignores them. Real issue happens; nobody notices. Alert on actionable things only.
Summary
- Log everything: task lifecycle, reasoning, tool calls, model interactions.
- Propagate trace IDs across systems — logs, commits, CI, artifacts.
- Track quality metrics (success rate, merge rate) and cost per task.
- Alert on actionable conditions. Avoid alert fatigue.