Observability for the Dev Loop

Core Questions

  • What do we log?
  • How do we trace actions across systems?
  • How do we measure quality and cost?

If you can't see what agents are doing, you can't trust what they've done. Observability for agentic systems isn't optional — it's how you build confidence, debug failures, and optimize costs. Log everything. Trace everything. Measure everything.

What to log

Agent logs need to answer: what was the agent trying to do, what did it actually do, and what was the outcome?

Log Categories

Task lifecycle

Task received, started, completed, failed. Include task ID, spec reference, timestamps, duration.

Agent reasoning

What the agent decided to do and why. Planning steps, tool selections, decision points. This is gold for debugging.

Tool invocations

Every command run, API called, file modified. Input, output, exit code, duration. The detailed audit trail.

Model interactions

Prompts sent, responses received, token counts. Useful for cost analysis and prompt optimization.

Errors and retries

What went wrong, how the agent responded, whether recovery succeeded. Critical for understanding failure modes.

Structured log example

{
  "timestamp": "2025-01-15T10:30:45Z",
  "level": "info",
  "task_id": "task-abc123",
  "agent": "claude-coder",
  "event": "tool_invocation",
  "tool": "shell",
  "input": {
    "command": "npm test -- --grep 'email validation'"
  },
  "output": {
    "exit_code": 0,
    "stdout_lines": 15,
    "stderr_lines": 0
  },
  "duration_ms": 3420,
  "tokens": {
    "prompt": 0,
    "completion": 0
  }
}
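
For illustration, here's a minimal TypeScript sketch of a logger that emits entries in this shape. The emitToolLog helper and the ToolLogEntry type are hypothetical names, not a real library API:

// structured-logger.ts — minimal sketch; emitToolLog and ToolLogEntry are
// illustrative names, not part of any particular library.

interface ToolLogEntry {
  timestamp: string;
  level: "info" | "warn" | "error";
  task_id: string;
  agent: string;
  event: string;
  tool: string;
  input: Record<string, unknown>;
  output: { exit_code: number; stdout_lines: number; stderr_lines: number };
  duration_ms: number;
}

function emitToolLog(
  entry: Omit<ToolLogEntry, "timestamp" | "level" | "event">
): void {
  const full: ToolLogEntry = {
    timestamp: new Date().toISOString(),
    level: "info",
    event: "tool_invocation",
    ...entry,
  };
  // One JSON object per line: grep-able locally, parseable by any log shipper.
  console.log(JSON.stringify(full));
}

emitToolLog({
  task_id: "task-abc123",
  agent: "claude-coder",
  tool: "shell",
  input: { command: "npm test" },
  output: { exit_code: 0, stdout_lines: 15, stderr_lines: 0 },
  duration_ms: 3420,
});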

Tracing across systems

A single task touches many systems: the agent runtime, git, CI, artifact storage, maybe external APIs. Trace IDs let you follow the thread.

Trace propagation

Task ID: Generated when the task starts. Included in all logs, commits, PR metadata, and artifact tags.
Span IDs: Sub-units within a task: a "reproduce bug" span, an "implement fix" span, a "run tests" span.
Git trailers: Commit messages include the task ID, linking code changes to the task that produced them.
CI labels: Workflows triggered by agent PRs carry the task ID, linking CI results to tasks.

# Commit with trace ID
git commit -m "Fix email validation

Task-ID: task-abc123
Agent: claude-coder
Charter: github.com/org/repo/issues/42"

# PR labels
gh pr create --label "agent:claude-coder" --label "task:task-abc123"

# CI environment
TASK_ID=task-abc123 npm test
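
To make span handling concrete, here's a sketch of how a runtime might mint and propagate these IDs. It assumes the task-<id> format from the examples above; the span() helper is hypothetical:

// trace.ts — sketch of task and span ID propagation; the span() helper is
// a made-up name, the ID format follows the examples above.
import { randomUUID } from "node:crypto";

// Reuse the task ID from the CI environment if present, else mint one.
const taskId = process.env.TASK_ID ?? `task-${randomUUID().slice(0, 8)}`;

function span<T>(name: string, fn: () => T): T {
  const spanId = randomUUID().slice(0, 8);
  const start = Date.now();
  console.log(
    JSON.stringify({ task_id: taskId, span_id: spanId, span: name, event: "span_start" })
  );
  try {
    return fn();
  } finally {
    console.log(
      JSON.stringify({
        task_id: taskId,
        span_id: spanId,
        span: name,
        event: "span_end",
        duration_ms: Date.now() - start,
      })
    );
  }
}

// Each sub-unit of the task gets its own span.
span("reproduce-bug", () => {
  /* ... run repro steps ... */
});
span("run-tests", () => {
  /* ... invoke the test suite ... */
});

Because every log line carries both IDs, you can reconstruct the timeline of a task from logs alone.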

Metrics to track

Metrics tell you how well your agent system is performing. Track both operational metrics (is it working?) and quality metrics (is it working well?).

Key Metrics

Task success rate

% of tasks completed successfully. Track by agent, task type, and repo. Target: > 90%.

Time to completion

How long tasks take. Track p50, p90, and p99 to identify slow outliers. Target: varies by task type.

PR merge rate

% of agent PRs that get merged (vs. closed without merge). A quality signal. Target: > 80%.

Revision rate

How often PRs need changes after review. Lower is better. Target: < 2 review rounds.

Bug introduction rate

Bugs found in agent-authored code post-merge. Track over time. Target: a decreasing trend.
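
As a sketch, here's how you might compute the operational side of these from task records pulled out of your logs. The TaskRecord shape is an assumption; adapt it to whatever your log store returns:

// metrics.ts — sketch of success rate and latency percentiles over task records.

interface TaskRecord {
  task_id: string;
  status: "completed" | "failed";
  duration_ms: number;
}

function successRate(tasks: TaskRecord[]): number {
  if (tasks.length === 0) return 0;
  return tasks.filter((t) => t.status === "completed").length / tasks.length;
}

// Nearest-rank percentile: good enough for dashboards.
function percentile(values: number[], p: number): number {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, Math.min(rank, sorted.length) - 1)];
}

const tasks: TaskRecord[] = [
  { task_id: "task-abc123", status: "completed", duration_ms: 720_000 },
  { task_id: "task-def456", status: "failed", duration_ms: 2_400_000 },
];

const durations = tasks.map((t) => t.duration_ms);
console.log({
  success_rate: successRate(tasks),
  p50_ms: percentile(durations, 50),
  p90_ms: percentile(durations, 90),
  p99_ms: percentile(durations, 99),
});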

Cost accounting

Agents aren't free. Track costs to optimize spending and catch runaway tasks.

Cost Components

LLM tokens

Input and output tokens per model. This is usually the biggest cost. Track per task, per agent, per repo.

Compute

VM/container runtime. Per-second billing if available. Track by task type to right-size instances.

Storage

Artifact storage, cache storage, log retention. Often small but grows over time.

External APIs

GitHub API calls, cloud service calls. Usually negligible but can spike.

Cost per task example

Task: task-abc123
Type: bug-fix
Duration: 12 minutes

Costs:
  LLM (Claude Sonnet):
    Input:  45,000 tokens × $0.003/1K = $0.135
    Output: 12,000 tokens × $0.015/1K = $0.180
  Compute (2 vCPU, 4GB):
    12 min × $0.05/hour = $0.01
  
  Total: $0.325

Cost per successful task (90% success rate): $0.36

Set cost budgets per task. If a task exceeds its budget, it's either stuck or doing too much. Kill and escalate.
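
A sketch of that budget check, using the illustrative rates from the worked example above. The $1.00 ceiling is an arbitrary placeholder, not a recommendation:

// budget.ts — sketch of a per-task cost budget check. Rates match the worked
// example above; BUDGET_USD is a placeholder to tune per task type.

const RATES = {
  inputPer1K: 0.003,    // Claude Sonnet input, per the example
  outputPer1K: 0.015,   // Claude Sonnet output
  computePerHour: 0.05, // 2 vCPU / 4GB instance
};

const BUDGET_USD = 1.0;

function taskCost(inputTokens: number, outputTokens: number, minutes: number): number {
  return (
    (inputTokens / 1000) * RATES.inputPer1K +
    (outputTokens / 1000) * RATES.outputPer1K +
    (minutes / 60) * RATES.computePerHour
  );
}

function checkBudget(inputTokens: number, outputTokens: number, minutes: number): void {
  const cost = taskCost(inputTokens, outputTokens, minutes);
  if (cost > BUDGET_USD) {
    // A task past its budget is either stuck or doing too much: kill and escalate.
    throw new Error(`task over budget: $${cost.toFixed(3)} > $${BUDGET_USD.toFixed(2)}`);
  }
}

// The worked example above: $0.325, comfortably under budget.
checkBudget(45_000, 12_000, 12);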

Alerting

Not everything needs human attention. Set up alerts for things that actually matter:

Alert Triggers

Success rate drop

Success rate falls below a threshold (e.g., < 80% over 1 hour). Severity: critical.

Cost spike

Hourly or daily cost exceeds 2× normal. Could be a runaway task. Severity: warning.

Long-running task

Task exceeds its expected duration by 3×. Probably stuck. Severity: investigate.

Repeated failures

The same task type failing repeatedly. Systemic issue. Severity: critical.
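
A sketch of evaluating these triggers over a rolling window. The WindowStats shape is an assumption; the thresholds mirror the triggers above:

// alerts.ts — sketch of threshold checks over a rolling window of stats;
// how you gather WindowStats and deliver alerts is up to your stack.

interface WindowStats {
  successRate: number;          // fraction of tasks succeeding in the last hour
  hourlyCostUsd: number;
  baselineHourlyCostUsd: number;
  longestDurationRatio: number; // actual / expected duration of the slowest task
  repeatedFailures: boolean;    // same task type failing back-to-back
}

function evaluateAlerts(s: WindowStats): string[] {
  const alerts: string[] = [];
  if (s.successRate < 0.8) {
    alerts.push("critical: success rate < 80% over the last hour");
  }
  if (s.hourlyCostUsd > 2 * s.baselineHourlyCostUsd) {
    alerts.push("warning: hourly cost > 2x normal (possible runaway task)");
  }
  if (s.longestDurationRatio > 3) {
    alerts.push("investigate: task running 3x past expected duration");
  }
  if (s.repeatedFailures) {
    alerts.push("critical: repeated failures for the same task type");
  }
  return alerts;
}

Anything below these thresholds stays out of the pager; that's what keeps the alerts that do fire actionable.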

What goes wrong

No logs when needed

Something breaks. You go to debug. Logs are empty or unhelpful. You can't figure out what happened. Log more than you think you need.

Untraceable PRs

A bug is found in merged code. Which task produced it? Which agent? No trace IDs in commits. Can't reconstruct the history.

Cost blindness

End of month: surprise $10K bill. No idea which tasks or agents drove the cost. Can't optimize what you can't measure.

Alert fatigue

Too many alerts. Team ignores them. Real issue happens; nobody notices. Alert on actionable things only.

Summary

  • Log everything: task lifecycle, reasoning, tool calls, model interactions.
  • Propagate trace IDs across systems — logs, commits, CI, artifacts.
  • Track quality metrics (success rate, merge rate) and cost per task.
  • Alert on actionable conditions. Avoid alert fatigue.
