
Human-in-the-Loop Design

Core Questions

  • Where must humans approve?
  • How are confidence thresholds set?
  • How are overrides handled?

Full autonomy is a goal, not a starting point. The path to trusted agent systems runs through human oversight — not because agents can't be trusted eventually, but because trust is earned through demonstrated reliability. Design for humans in the loop from day one; remove them gradually as confidence builds.

Where humans must approve

Not every action needs human approval. But some do, always. Identify these early and enforce them structurally.

Mandatory Human Approval

Production deployments

Code can be merged autonomously; deploying it to production requires a human trigger. The human is accountable for what ships.

Security-sensitive changes

Auth, permissions, encryption, data access. Even if agents can write this code, humans approve it before merge.

Irreversible operations

Database migrations that drop data, infrastructure teardown, certificate rotation. Anything that can't be easily undone.

External communications

Emails to customers, public announcements, support responses. The company speaks through humans, not agents (for now).

Financial transactions

Billing changes, refunds, pricing updates. Money movements require human authorization.

These are the hard gates — places where no amount of agent confidence should bypass human approval. Everything else is negotiable based on risk tolerance and demonstrated reliability.
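Structural enforcement means the gate check runs before any agent logic can reach the action. A minimal sketch of this idea, with hypothetical action names and an illustrative `HardGateError` type (none of these are a real API):

```python
# Hypothetical hard-gate enforcement. Action names are illustrative.
HARD_GATES = {
    "deploy_production",
    "modify_auth",
    "drop_database",
    "send_customer_email",
    "issue_refund",
}

class HardGateError(Exception):
    """Raised when an agent attempts a hard-gated action without human approval."""

def execute(action: str, human_approved: bool = False) -> str:
    # The gate check comes first: no confidence score or agent
    # reasoning downstream can bypass it.
    if action in HARD_GATES and not human_approved:
        raise HardGateError(f"'{action}' requires explicit human approval")
    return f"executed {action}"
```

The point of the sketch is ordering: the gate is a structural precondition, not a policy the agent weighs against other signals.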

The trust ladder

Trust is built incrementally. Start with tight human oversight and loosen it as the agent proves reliable:

Trust Levels

1

Watch

Agent proposes actions but doesn't execute. Human reviews every proposal and decides whether to proceed. Training wheels.

2

Review

Agent executes, but results require human approval before they're finalized. Human reviews every outcome. PRs need approval; deploys need signoff.

3

Approve exceptions

Agent executes and finalizes routine actions. Human only reviews flagged exceptions or high-risk items. Most work flows without human touch.

4

Audit

Agent operates autonomously. Human reviews logs and metrics periodically. Intervention only on anomalies or scheduled audits.

Moving up the ladder requires evidence: low error rates, consistent quality, predictable behavior. Moving down (tightening oversight) should happen immediately when something goes wrong.
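The asymmetry above (promote on evidence, demote immediately on trouble) can be sketched as a simple state machine. The level names and the 2% promotion threshold are assumptions for illustration, not a recommendation:

```python
from enum import IntEnum

class TrustLevel(IntEnum):
    WATCH = 1               # propose only; human decides everything
    REVIEW = 2              # execute; human approves every outcome
    APPROVE_EXCEPTIONS = 3  # finalize routine work; flag exceptions
    AUDIT = 4               # autonomous; periodic log review

def adjust_trust(level: TrustLevel, error_rate: float,
                 promotion_threshold: float = 0.02) -> TrustLevel:
    """Tighten oversight the moment errors exceed the bar;
    promote only when the evidence supports it."""
    if error_rate > promotion_threshold:
        return TrustLevel(max(level - 1, TrustLevel.WATCH))
    return TrustLevel(min(level + 1, TrustLevel.AUDIT))
```

In practice the promotion decision would look at a window of outcomes, not a single rate, but the shape is the same: demotion is cheap and immediate, promotion is earned.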

Confidence thresholds

Agents can express confidence in their work. Use this signal to route to appropriate oversight:

Confidence-based routing

High (90%+) → Auto-approve if all checks pass
Medium (70-90%) → Agent reviewer, then human if concerns
Low (50-70%) → Human review required
Uncertain (&lt;50%) → Escalate; may need human intervention to complete

Important: calibrate confidence to reality. If the agent reports 90% confidence but is wrong 30% of the time, the threshold is meaningless. Track actual outcomes against stated confidence and adjust thresholds accordingly.
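The routing table and the calibration check can both be expressed in a few lines. This is a sketch under the assumptions that confidence arrives as a float in [0, 1] and that each finished task records its stated confidence alongside whether it succeeded:

```python
def route_by_confidence(confidence: float) -> str:
    """Map stated confidence to an oversight path, using the
    thresholds from the routing table above."""
    if confidence >= 0.90:
        return "auto_approve_if_checks_pass"
    if confidence >= 0.70:
        return "agent_review_then_human_if_concerns"
    if confidence >= 0.50:
        return "human_review"
    return "escalate"

def calibration_gap(outcomes: list[tuple[float, bool]]) -> float:
    """Average stated confidence minus observed success rate.
    A large positive gap means the agent is overconfident and
    the routing thresholds should be raised."""
    stated = sum(conf for conf, _ in outcomes) / len(outcomes)
    observed = sum(1 for _, ok in outcomes if ok) / len(outcomes)
    return stated - observed
```

An agent that claims 90% on tasks it completes correctly only 60% of the time would show a gap of 0.30, a clear signal that its "High" bucket should be treated as "Medium".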

Agent confidence expression

## PR Summary

**Confidence:** 85% (Medium-High)

### What I'm confident about:
- The fix addresses the reported null pointer exception
- All existing tests pass
- Added test covers the specific edge case

### What I'm less certain about:
- There may be other code paths that hit the same function
  with unexpected input (I found this one but there could be more)
- The error message wording — not sure if it matches your style guide

### Recommendation:
Approve if error message is acceptable. Consider auditing
other callers of validateEmail() for similar issues.

Escalation flows

When an agent hits a situation it can't handle, it needs to escalate. Define clear escalation paths:

Escalation Triggers

Ambiguous requirements

Agent can't determine what's expected. Escalate to the requester for clarification rather than guessing.

Repeated failures

Agent has tried multiple approaches and keeps failing. Human intervention needed to unblock.

Out-of-scope situations

Task requires actions outside the agent's permissions or expertise. Route to a human or different agent.

Conflicting constraints

Requirements contradict each other. Human needs to decide which takes priority.

Ethical/policy concerns

Agent is asked to do something that seems problematic. Escalate rather than proceed or refuse.

Escalation should be easy and encouraged. An agent that escalates appropriately is better than one that plows through uncertainty and makes mistakes.
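One way to make escalation "easy and encouraged" is to give every trigger a predefined destination, with a safe default for anything unexpected. The trigger names and routes below are hypothetical placeholders:

```python
from dataclasses import dataclass

# Illustrative trigger-to-destination mapping; real routes would be
# team- and system-specific.
ESCALATION_ROUTES = {
    "ambiguous_requirements": "requester",
    "repeated_failures": "on_call_human",
    "out_of_scope": "human_or_other_agent",
    "conflicting_constraints": "task_owner",
    "policy_concern": "policy_team",
}

@dataclass
class Escalation:
    trigger: str
    route_to: str
    context: str  # what was tried and why the agent is stuck

def escalate(trigger: str, context: str) -> Escalation:
    # An unknown trigger still reaches a human rather than
    # failing silently or retrying forever.
    route = ESCALATION_ROUTES.get(trigger, "on_call_human")
    return Escalation(trigger, route, context)
```

The default route is the important design choice: an escalation path with gaps recreates the "no escalation path" failure mode described below.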

Emergency stops

When things go wrong, you need to stop quickly. Build emergency stops into your system:

Emergency stop mechanisms

Kill switch: Immediately halt all agent activity. No new tasks start; running tasks are terminated. Nuclear option.
Pause: Stop taking new tasks. Let running tasks complete. Softer than kill switch; prevents pile-up.
Quarantine: Isolate a specific agent or task type. Other work continues. For targeted problems.
Rollback: Revert recent agent changes. May be automatic (on error rate spike) or manual (on discovery of issue).

These should be accessible, tested, and well-documented. When you need them, you need them fast — not the time to figure out how they work.
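The four mechanisms can share one small control surface that task schedulers consult before starting work. This is a minimal sketch; the class and method names are illustrative, and a real system would also need to terminate in-flight tasks on kill:

```python
import threading

class EmergencyControls:
    """Shared stop state, checked before any new task starts."""
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.killed = False
        self.paused = False
        self.quarantined: set[str] = set()

    def kill(self) -> None:        # nuclear option: halt everything
        with self._lock:
            self.killed = True

    def pause(self) -> None:       # no new tasks; running tasks finish
        with self._lock:
            self.paused = True

    def quarantine(self, agent_id: str) -> None:  # isolate one agent
        with self._lock:
            self.quarantined.add(agent_id)

    def may_start_task(self, agent_id: str) -> bool:
        with self._lock:
            return not (self.killed or self.paused
                        or agent_id in self.quarantined)
```

Keeping all stop state behind one lock means the check is cheap enough to run before every task, which is what makes the mechanisms fast when you need them.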

Handling overrides

Sometimes humans need to override agent decisions or bypass normal approval flows. Design for this, but make it visible:

Override principles

Overrides are logged: Every override creates an audit trail. Who, when, why, what was bypassed.
Overrides require justification: Not just a button press. Require a reason: "Emergency fix", "False positive", or "Approved by security team."
Overrides are reviewed: Periodic audit of overrides. Too many? Maybe the policy is wrong. Patterns? Maybe automation needs fixing.
Some things can't be overridden: The hardest gates (production access, security changes) may have no bypass at all.
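All four principles fit in one small function: refuse the non-overridable gates, demand a reason, and append an auditable record. Gate names and the record shape are assumptions for illustration:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Illustrative hard gates with no bypass at all.
NON_OVERRIDABLE = {"production_access", "security_change"}

@dataclass
class OverrideRecord:
    who: str
    gate: str
    reason: str  # justification is mandatory, not optional
    when: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))

AUDIT_LOG: list[OverrideRecord] = []

def override(who: str, gate: str, reason: str) -> OverrideRecord:
    if gate in NON_OVERRIDABLE:
        raise PermissionError(f"'{gate}' has no bypass")
    if not reason.strip():
        raise ValueError("overrides require a justification")
    record = OverrideRecord(who, gate, reason)
    AUDIT_LOG.append(record)  # every override leaves a trail for review
    return record
```

The periodic review then becomes a query over `AUDIT_LOG`: override counts per gate reveal policies that may be wrong, and counts per person reveal habits that may need attention.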

What goes wrong

Approval fatigue

Too many approval requests. Humans start approving without looking. The oversight becomes theater. Right-size what needs approval.

No escalation path

Agent gets stuck with no way to ask for help. It either fails silently or keeps retrying forever. Always provide an escalation route.

Override abuse

Overrides become the norm. People bypass policies routinely. Eventually a bad override causes an incident. Monitor override rates.

Untested emergency stops

Kill switch exists but hasn't been tested. During a real emergency, it doesn't work as expected. Test your emergency procedures.

Summary

  • Identify hard gates that always need human approval. Make them structural.
  • Build trust incrementally: watch → review → approve exceptions → audit.
  • Use agent confidence to route to appropriate oversight levels.
  • Make escalation easy. Build and test emergency stops.
