Task Routing & Orchestration

Core Questions

  • Which agent handles which task?
  • How is concurrency controlled?
  • How are failures retried?

This guide isn't about building a fancy router. It's about the real problem platform teams face: how does agent work start, how do you keep it from growing unbounded, and how do humans stay in the loop via the bug tracker.

How an agent starts (recommended: assignment)

The easiest way to make agent work legible is to use the system you already use to track work: GitHub issues, Jira, Linear, etc. The trigger should be explicit and visible to humans.

A popular and growing pattern is to create an agent identity in your tracker and treat it like a teammate: when an issue is assigned to that agent, the agent starts.

Trigger options (in practice)

  • Assignment (recommended): assign the issue to an agent user. Agent acknowledges and starts.
  • Slash commands: a comment like /agent fix starts work on demand.
  • Chat triggers (Slack, etc.): convenient for small “prompt tasks” and triage, but context is easily lost and runs are hard to reproduce.
  • Labels/workflow state: applying agent:run or moving to “Ready for agent”.
  • Scheduled/batch: nightly triage, flaky test sweeps, backlog cleanup.

Default recommendation: start with assignment. It makes the “why did an agent start?” question answerable without new tooling.
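The assignment trigger above can be sketched as a small webhook handler. This is a minimal sketch, not any tracker's real API: the payload shape, the `AGENT_USER` name, and `start_run` are all assumptions.

```python
# Sketch: starting an agent run from a tracker assignment event.
# Payload shape and AGENT_USER are hypothetical, not a specific tracker's API.

AGENT_USER = "agent-bot"  # hypothetical agent identity in the tracker

def start_run(issue_id: str, title: str) -> str:
    # Placeholder: enqueue the task and post an "Acknowledged" comment.
    return f"run-{issue_id}"

def handle_issue_event(event: dict):
    """Return a run id if this event should start agent work, else None."""
    if event.get("action") != "assigned":
        return None
    if event.get("assignee") != AGENT_USER:
        return None
    issue = event["issue"]
    # The trigger is explicit and visible: the run is tied to the issue,
    # so "why did an agent start?" is answerable from the tracker itself.
    return start_run(issue_id=issue["id"], title=issue["title"])

print(handle_issue_event({
    "action": "assigned",
    "assignee": "agent-bot",
    "issue": {"id": "PLAT-42", "title": "Fix flaky login test"},
}))
```

The point of the guard clauses is that nothing except an explicit assignment to the agent identity can start work.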

Task types: prompt vs spec

Not every task should start from a chat prompt. Agents are great at deeper work, but deeper work needs durable input: acceptance criteria, constraints, and links. That usually means a markdown spec attached to an issue/ticket.

Prompt tasks (chat is fine)

  • “Investigate this error log”
  • “Summarize this PR”
  • “Triage which test is flaky”
  • “Generate a quick patch idea”

Spec tasks (tracker is better)

  • Multi-file refactors
  • New features
  • Large bug fixes with edge cases
  • Migrations and “touch prod-adjacent code”

Rule of thumb: if a human would write a 1-2 page markdown design note, the agent needs one too. Assignment + a spec beats a Slack thread.

Status updates in the bug tracker

If assignment is your trigger, the tracker becomes your UI. The agent should post status when it starts, when it hits key phases, and when it needs help. Otherwise humans can't tell progress from silence.

Treat status updates as part of the platform contract. A task that doesn't report status is a task you can't trust.

Recommended status events

Acknowledged

Agent posts within minutes of assignment: task accepted, and what it will do next.

Environment ready

Runtime/toolchain is up; agent can run tests and commands. If it can't, it should say so explicitly.

Reproduced (when applicable)

For bugs: confirm the bug exists and attach evidence (logs, failing test, screenshot). If it can’t reproduce, it should stop and ask.

Plan posted

Short plan, assumptions, and what signals will be used to verify (tests, screenshots, logs).

Progress checkpoints

When it finishes a major step: reproduced, fixed, tests passing, PR opened.

Needs help / blocked

If stuck: what failed, what it tried, and the smallest question a human can answer to unblock it.

Example: start comment template

[agent] Starting work

- Task: investigate + fix
- Repo/branch: <repo>@<branch>
- Commit: <sha> (if pinned)
- Environment: <nix flake.lock hash | image digest>
- Next: reproduce issue + add a failing test (if applicable)
- Updates: I will post here at each checkpoint and when blocked

Lifecycle checkpoints (what “progress” looks like)

Agents should report progress when they cross meaningful boundaries. This keeps humans oriented and prevents “silent failure.”

  • Reproduced: confirmed the bug and captured evidence
  • Fix in progress: approach selected and constraints noted
  • Verified: tests pass, repro no longer occurs, artifacts saved
  • PR opened: link + summary + what to review
  • Blocked: smallest human decision needed
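The checkpoints above can be modeled as explicit states so that every transition produces a tracker comment. A minimal sketch; the state names mirror this guide, and `post_status` stands in for whatever comment API your tracker exposes.

```python
# Sketch: lifecycle checkpoints as explicit states, each posted back
# to the issue. post_status is a stand-in for a real tracker API call.
from enum import Enum

class Checkpoint(Enum):
    ACKNOWLEDGED = "acknowledged"
    ENV_READY = "environment ready"
    REPRODUCED = "reproduced"
    FIX_IN_PROGRESS = "fix in progress"
    VERIFIED = "verified"
    PR_OPENED = "pr opened"
    BLOCKED = "blocked"

def post_status(issue_id: str, checkpoint: Checkpoint, detail: str) -> str:
    # In a real system this would call the tracker's comment endpoint.
    comment = f"[agent] {checkpoint.value}: {detail}"
    return comment

print(post_status("PLAT-42", Checkpoint.REPRODUCED, "failing test attached"))
```

Making the states an enum (rather than free-form comments) is what lets you later compute completion and escalation rates from the same events.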

When the pool is full

With fixed concurrency, an assignment doesn’t always mean “starting now.” That’s fine, but it must be visible. On assignment, either start immediately or acknowledge that the issue is queued.

[agent] Queued

- Reason: agent pool is full
- Queue: <global|team|repo>
- Position: <n> (optional)
- ETA: <estimate> (optional)
- I will post again when I actually start

Concurrency: start fixed, not unbounded

The temptation is to let agents scale without limit: every new assignment starts immediately. That feels fast, until it becomes chaos: cost spikes, conflicts multiply, and human review turns into a bottleneck.

Default recommendation: start with a fixed-size pool of concurrent agent tasks. Treat it like a queue of work items, not an unlimited swarm.

Concurrency caps (practical defaults)

Global pool size

Max tasks running across the org. Everything else waits, even if it's assigned.

pool: 3-10

Per-repo cap

Avoid multiple tasks contending for the same codebase and reviewers.

per_repo: 1-2

Per-area / per-file cap

Prevent two agents from editing the same surface area simultaneously.

per_area: 1

Per-team quota (optional)

One team can’t starve everyone else. Useful once adoption spreads.

quota: N/team

When the pool is full, you need admission control: which assigned issue starts next? FIFO is fine at first. As you scale, add priority (P0s jump the line) and aging (nothing starves forever).
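Priority plus aging can be sketched in a few lines. This is illustrative, not a production scheduler: the aging rate and the linear scan are assumptions that work at small queue sizes.

```python
# Sketch: admission control with priority + aging. Each waiting hour
# adds to a task's effective priority, so nothing starves forever.
# AGING_BOOST_PER_HOUR is an illustrative default.
import time

AGING_BOOST_PER_HOUR = 1  # each hour of waiting counts like +1 priority

class AdmissionQueue:
    def __init__(self):
        self._items = []  # (enqueued_at, priority, issue_id)

    def enqueue(self, issue_id, priority, enqueued_at=None):
        ts = enqueued_at if enqueued_at is not None else time.time()
        self._items.append((ts, priority, issue_id))

    def pop_next(self, now=None):
        """Pick the best task when a pool slot frees up."""
        now = now if now is not None else time.time()
        if not self._items:
            return None
        def score(item):
            enqueued_at, priority, _ = item
            hours_waiting = (now - enqueued_at) / 3600
            return priority + hours_waiting * AGING_BOOST_PER_HOUR
        # Linear scan is fine for a small queue; a heap scales better.
        best = max(self._items, key=score)
        self._items.remove(best)
        return best[2]
```

With aging at +1/hour, a P2 that has waited half a day outranks a fresh P1, which is exactly the "nothing starves forever" property.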

Failure modes and escalation

Agents fail in different ways than CI. Some failures are transient. Some are environmental. Some are “I am stuck in a loop and need a human decision.” Your system should push those states back to the tracker instead of silently retrying forever.

Failure classification (and what to do)

  • Transient: network timeout, API rate limit, flaky infra. Retry with backoff + jitter; post a brief update if it persists.
  • Environment / infra: can't start the dev env, missing toolchain, broken secret injection, downed services. Don't loop. Post a “blocked” update with the exact failure and the smallest ask.
  • Stuck / ambiguous: the agent is iterating but not converging. Escalate with context: what it tried, what it believes is wrong, and 1-2 options for a human to choose.
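"Retry with backoff + jitter" for the transient class can be sketched as follows. A minimal version under stated assumptions: `TransientError` is a hypothetical wrapper for timeouts and rate limits, and the attempt/delay defaults are illustrative.

```python
# Sketch: retry transient failures with exponential backoff + jitter.
# Environment and "stuck" failures should escalate, not land here.
import random
import time

class TransientError(Exception):
    """Hypothetical wrapper: timeouts, rate limits, flaky infra."""

def retry_transient(op, max_attempts=4, base_delay=1.0):
    """Run op(); retry only transient errors, with backoff + jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return op()
        except TransientError:
            if attempt == max_attempts:
                raise  # persisted: caller posts a status update
            delay = base_delay * (2 ** (attempt - 1))
            time.sleep(delay * random.uniform(0.5, 1.5))  # jitter
```

Jitter matters when many agents hit the same rate limit at once: without it, they all retry in lockstep and trigger the limit again.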

Loop detection (the common failure mode)

When an agent gets stuck, it often looks like “make a change, run tests, fail, try again” without new information. Don’t let it burn the whole pool. Set a threshold and escalate.

  • Max failed attempts on the same command/test
  • Max wall-clock time without producing new artifacts (tests, logs, PR)
  • Max cost / tokens / tool calls for a single task
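The three thresholds above can be combined into one small detector. A sketch with illustrative limits; tune them per task type.

```python
# Sketch: loop detection. Escalate when the same command keeps failing,
# no new artifacts appear, or the tool-call budget runs out.
# All limits are illustrative defaults.
import time

class LoopDetector:
    def __init__(self, max_same_failures=3, max_idle_seconds=1800,
                 max_tool_calls=500):
        self.max_same_failures = max_same_failures
        self.max_idle_seconds = max_idle_seconds
        self.max_tool_calls = max_tool_calls
        self.failures = {}  # command -> consecutive failure count
        self.tool_calls = 0
        self.last_artifact_at = time.time()

    def record_failure(self, command):
        self.failures[command] = self.failures.get(command, 0) + 1

    def record_artifact(self):
        # A new test result, log, or PR counts as real progress.
        self.last_artifact_at = time.time()

    def record_tool_call(self):
        self.tool_calls += 1

    def should_escalate(self, now=None):
        """Return a reason string if stuck, else None."""
        now = now if now is not None else time.time()
        if any(n >= self.max_same_failures for n in self.failures.values()):
            return "same command failing repeatedly"
        if now - self.last_artifact_at > self.max_idle_seconds:
            return "no new artifacts produced"
        if self.tool_calls >= self.max_tool_calls:
            return "tool-call budget exhausted"
        return None
```

The key design choice: "progress" is defined by artifacts (tests, logs, PRs), not by activity, so an agent that is busy but not converging still trips the detector.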

Budgets and stop conditions

Fixed concurrency controls surprise spend. Budgets prevent runaway tasks inside that fixed pool. The key is to enforce budgets and report outcomes back to the tracker.

Budget levels (enforce, don’t suggest)

Per-task budget

Max cost/time per task. On exceed: stop, post status, and request human input.

$5/task

Hourly budget

Max spend per hour. When exceeded: pause starts; leave assigned issues queued.

$50/hour

Daily budget

Hard daily cap. When reached: stop starting tasks and post a global status/alert.

$500/day

Per-repo/team budget

Allocate budgets to teams. Prevent one repo or team from consuming the whole pool.

$2K/team/month
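The per-task, hourly, and daily levels above compose into a simple gate checked before each start. A sketch only; the dollar values mirror the defaults above and everything here is in-memory where a real system would use shared state.

```python
# Sketch: layered budget enforcement. Exceeding a window pauses new
# starts (issues stay queued); a per-task overrun stops that task.
class BudgetGate:
    def __init__(self, per_task=5.0, hourly=50.0, daily=500.0):
        self.limits = {"task": per_task, "hour": hourly, "day": daily}
        self.spent = {"hour": 0.0, "day": 0.0}  # reset by a scheduler

    def can_start(self):
        """Admission check: run before pulling a task off the queue."""
        if self.spent["hour"] >= self.limits["hour"]:
            return (False, "hourly budget exhausted: leave issues queued")
        if self.spent["day"] >= self.limits["day"]:
            return (False, "daily cap reached: stop starts, post alert")
        return (True, "ok")

    def charge(self, task_cost):
        """Record a task's spend; decide whether it may continue."""
        self.spent["hour"] += task_cost
        self.spent["day"] += task_cost
        if task_cost >= self.limits["task"]:
            # Per-task overrun: stop, post status, request human input.
            return "stop and escalate"
        return "continue"
```

Note that the window budgets gate *starts* while the per-task budget gates *continuation*; conflating the two is how runaway tasks slip through.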

One good default: if the agent can’t get to “tests running” quickly, it should stop and ask for help. Environment problems aren’t fixed by more retries.

Pair budgets with permission guardrails: agents can open PRs and propose changes, but protected branches and merges remain human-owned. See Identity, Secrets & Trust Boundaries for why agent identity and scoped credentials matter here.

Minimum observability

You don’t need a complex system to start. You do need to be able to answer: “what is running, what is stuck, and what is it costing us.”

Signals to track

Pool utilization

How often you’re saturated; whether the fixed pool is too small or too large.

Queue wait time

Time from assignment to start (especially for high priority issues).

Completion and escalation rates

What fraction completes cleanly vs gets blocked vs loops.

Cost per task

By repo/type. Use this to tune budgets and pool size.
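All four signals can be derived from the same task event log. A minimal sketch: the event shape (`assigned_at`, `started_at`, `outcome`, `cost`) is an assumption, and a real system would read from its metrics store instead.

```python
# Sketch: computing the minimum observability signals from task records.
# Field names are illustrative, not a standard schema.
def summarize(tasks):
    """tasks: dicts with assigned_at, started_at, outcome, cost."""
    waits = [t["started_at"] - t["assigned_at"]
             for t in tasks if t.get("started_at") is not None]
    done = [t for t in tasks if t["outcome"] == "completed"]
    blocked = [t for t in tasks if t["outcome"] == "blocked"]
    n = len(tasks)
    return {
        # Queue wait: time from assignment to actual start.
        "avg_queue_wait_s": sum(waits) / len(waits) if waits else 0.0,
        # Completion vs escalation: clean finishes vs blocked tasks.
        "completion_rate": len(done) / n if n else 0.0,
        "escalation_rate": len(blocked) / n if n else 0.0,
        # Cost per task: the input for tuning budgets and pool size.
        "avg_cost_per_task": sum(t["cost"] for t in tasks) / n if n else 0.0,
    }
```

Group the same computation by repo or task type to decide where the pool is too small and where budgets are too loose.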

What goes wrong

Unbounded starts

Every assignment starts immediately. Costs spike, tasks conflict, and humans can’t review the output fast enough. Start with a fixed pool.

Silent agent

The issue is assigned, but no status arrives. Nobody knows if it started, failed, or is stuck. Make status updates mandatory.

Looping on environment failures

The dev environment is broken and the agent keeps retrying installs/tests. It burns budget and blocks the pool. Escalate early with a minimal ask.

Review bottleneck

Agents produce PRs faster than humans can review. Throughput becomes limited by approvals, not compute. That’s normal: adjust pool size and priorities to match human capacity.

Summary

  • Default trigger: assign issues to an agent identity. Make agent work visible in the tracker.
  • Require status updates at key phases: acknowledged, environment ready, plan, checkpoints, blocked.
  • Start with a fixed-size concurrency pool. Add priorities and quotas as adoption grows.
  • Design for failure modes: transient retries, environment/infra blocks, and loop detection with escalation.
