Human-in-the-Loop Design
Core Questions
- Where must humans approve?
- How are confidence thresholds set?
- How are overrides handled?
Full autonomy is a goal, not a starting point. The path to trusted agent systems runs through human oversight — not because agents can't be trusted eventually, but because trust is earned through demonstrated reliability. Design for humans in the loop from day one; remove them gradually as confidence builds.
Where humans must approve
Not every action needs human approval. But some do, always. Identify these early and enforce them structurally.
Mandatory Human Approval
Production deployments
Code can be merged autonomously; deploying it to production requires a human trigger. The human is accountable for what ships.
Security-sensitive changes
Auth, permissions, encryption, data access. Even if agents can write this code, humans approve it before merge.
Irreversible operations
Database migrations that drop data, infrastructure teardown, certificate rotation. Anything that can't be easily undone.
External communications
Emails to customers, public announcements, support responses. The company speaks through humans, not agents (for now).
Financial transactions
Billing changes, refunds, pricing updates. Money movements require human authorization.
These are the hard gates — places where no amount of agent confidence should bypass human approval. Everything else is negotiable based on risk tolerance and demonstrated reliability.
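These gates hold best when they live in code, not convention. Here is a minimal Python sketch of structural enforcement; the category names and the `Action` type are illustrative, not drawn from any particular framework:

```python
from dataclasses import dataclass

# Categories that ALWAYS require human approval, regardless of agent
# confidence or demonstrated reliability. These mirror the gates above.
HARD_GATES = {
    "production_deploy",
    "security_sensitive_change",
    "irreversible_operation",
    "external_communication",
    "financial_transaction",
}

@dataclass
class Action:
    category: str
    description: str

def requires_human_approval(action: Action) -> bool:
    """Hard gates are structural: no confidence score can bypass them."""
    return action.category in HARD_GATES

def execute(action: Action, human_approved: bool = False) -> None:
    if requires_human_approval(action) and not human_approved:
        raise PermissionError(
            f"'{action.category}' is a hard gate: human approval required"
        )
    ...  # carry out the action
```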
The trust ladder
Trust is built incrementally. Start with tight human oversight and loosen it as the agent proves reliable:
Trust Levels
Watch
Agent proposes actions but doesn't execute. Human reviews every proposal and decides whether to proceed. Training wheels.
Review
Agent executes, but results require human approval before they're finalized. Human reviews every outcome. PRs need approval; deploys need signoff.
Approve exceptions
Agent executes and finalizes routine actions. Human only reviews flagged exceptions or high-risk items. Most work flows without human touch.
Audit
Agent operates autonomously. Human reviews logs and metrics periodically. Intervention only on anomalies or scheduled audits.
Moving up the ladder requires evidence: low error rates, consistent quality, predictable behavior. Moving down (tightening oversight) should happen immediately when something goes wrong.
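One way to make the ladder concrete is to encode the levels explicitly and route every result through them. A minimal sketch in Python, with illustrative names; which agent holds which level for which task type is configuration you maintain elsewhere:

```python
from enum import IntEnum

class TrustLevel(IntEnum):
    WATCH = 1               # agent proposes; human executes
    REVIEW = 2              # agent executes; human approves before finalizing
    APPROVE_EXCEPTIONS = 3  # agent finalizes routine work; human reviews flags
    AUDIT = 4               # autonomous; periodic human review of logs

def route_result(level: TrustLevel, flagged: bool) -> str:
    """Decide what happens to an agent's result at each trust level."""
    if level == TrustLevel.WATCH:
        return "queue_proposal_for_human"
    if level == TrustLevel.REVIEW:
        return "hold_for_human_approval"
    if level == TrustLevel.APPROVE_EXCEPTIONS:
        return "hold_for_human_approval" if flagged else "finalize"
    return "finalize_and_log"  # AUDIT

def demote(level: TrustLevel) -> TrustLevel:
    """Tighten oversight by one level, immediately, after an incident."""
    return TrustLevel(max(int(level) - 1, TrustLevel.WATCH))
```

Demotion here is one immediate step down. Promotion is deliberately absent from the code path: moving up should be a human decision backed by the evidence described above.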
Confidence thresholds
Agents can express confidence in their work. Use this signal to route to appropriate oversight:
Confidence-based routing
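As a rough illustration, routing might look like the sketch below. The thresholds and route names are placeholders to tune against your own data, and hard gates sit outside the confidence path entirely:

```python
def route_by_confidence(confidence: float, is_hard_gate: bool) -> str:
    """Map stated confidence to an oversight path (thresholds illustrative)."""
    if is_hard_gate:
        return "human_approval"        # never bypassed, whatever the confidence
    if confidence >= 0.90:
        return "auto_finalize"         # proceed; keep a log for audit
    if confidence >= 0.70:
        return "async_human_review"    # proceed; a human reviews after the fact
    return "block_for_human_approval"  # too uncertain to act alone
```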
Important: calibrate confidence to reality. If the agent says 90% confidence but is wrong 30% of the time, the threshold is meaningless. Track actual outcomes vs stated confidence and adjust thresholds accordingly.
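A simple way to do that tracking, assuming you log each task's stated confidence alongside whether the outcome turned out to be correct:

```python
from collections import defaultdict

def calibration_report(outcomes: list[tuple[float, bool]]) -> dict[int, float]:
    """Bucket (stated_confidence, was_correct) pairs; report observed accuracy."""
    buckets: dict[int, list[bool]] = defaultdict(list)
    for stated, correct in outcomes:
        buckets[int(stated * 10) * 10].append(correct)  # 0.87 -> the 80s bucket
    return {b: sum(v) / len(v) for b, v in sorted(buckets.items())}

# If the 90s bucket is right only 70% of the time, either rework how the
# agent states confidence or raise your routing thresholds.
```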
Agent confidence expression
## PR Summary

**Confidence:** 85% (Medium-High)

### What I'm confident about:
- The fix addresses the reported null pointer exception
- All existing tests pass
- Added test covers the specific edge case

### What I'm less certain about:
- There may be other code paths that hit the same function with unexpected input (I found this one but there could be more)
- The error message wording (not sure if it matches your style guide)

### Recommendation:
Approve if error message is acceptable. Consider auditing other callers of validateEmail() for similar issues.
Escalation flows
When an agent hits a situation it can't handle, it needs to escalate. Define clear escalation paths:
Escalation Triggers
Ambiguous requirements
Agent can't determine what's expected. Escalate to the requester for clarification rather than guessing.
Repeated failures
Agent has tried multiple approaches and keeps failing. Human intervention needed to unblock.
Out-of-scope situations
Task requires actions outside the agent's permissions or expertise. Route to a human or different agent.
Conflicting constraints
Requirements contradict each other. Human needs to decide which takes priority.
Ethical/policy concerns
Agent is asked to do something that seems problematic. Escalate rather than unilaterally proceeding or refusing.
Escalation should be easy and encouraged. An agent that escalates appropriately is better than one that plows through uncertainty and makes mistakes.
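In code, that means giving the agent a structured hand-off and a bounded retry budget rather than an infinite loop. A sketch with hypothetical names (`Escalation`, `notify_human`, `attempt_fn`):

```python
from dataclasses import dataclass

@dataclass
class Escalation:
    trigger: str    # e.g. "ambiguous_requirements", "repeated_failures"
    task_id: str
    summary: str    # what was tried, and why the agent is stuck

MAX_ATTEMPTS = 3

def run_with_escalation(task, attempt_fn, notify_human) -> None:
    """Try the task a bounded number of times, then hand off to a human."""
    last_error: Exception | None = None
    for _ in range(MAX_ATTEMPTS):
        try:
            attempt_fn(task)
            return
        except Exception as exc:
            last_error = exc
    # Out of retries: escalate instead of retrying forever or failing silently.
    notify_human(Escalation(
        trigger="repeated_failures",
        task_id=task.id,
        summary=f"{MAX_ATTEMPTS} attempts failed; last error: {last_error}",
    ))
```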
Emergency stops
When things go wrong, you need to stop quickly. Build emergency stops into your system:
Emergency stop mechanisms
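What these mechanisms look like varies by system, but a kill switch that every agent checks before acting is a common baseline. A minimal sketch; the file-based flag and its path are illustrative, and a database flag or feature-flag service would serve just as well:

```python
from pathlib import Path

KILL_SWITCH = Path("/var/run/agents/EMERGENCY_STOP")  # illustrative location

def halt_all_agents(reason: str) -> None:
    """Flip the switch. Any on-call human can do this; no approval flow."""
    KILL_SWITCH.parent.mkdir(parents=True, exist_ok=True)
    KILL_SWITCH.write_text(reason)

def check_kill_switch() -> None:
    """Called before every agent action. Fails closed when the switch is set."""
    if KILL_SWITCH.exists():
        raise SystemExit(f"Emergency stop active: {KILL_SWITCH.read_text()}")
```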
These should be accessible, tested, and well-documented. When you need them, you need them fast; a live incident is not the time to figure out how they work.
Handling overrides
Sometimes humans need to override agent decisions or bypass normal approval flows. Design for this, but make it visible:
Override principles
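However you implement it, an override should require a named human and a stated reason, and it should leave a record. A sketch with a hypothetical `execute_without_gates` standing in for your direct-execute path:

```python
import logging

audit_log = logging.getLogger("agent.overrides")

def override(action, *, actor: str, reason: str):
    """Bypass the normal approval flow, but never silently."""
    if not actor or not reason:
        raise ValueError("an override requires a named actor and a reason")
    audit_log.warning("OVERRIDE by %s on %s: %s", actor, action.category, reason)
    return execute_without_gates(action)  # hypothetical direct-execute path
```

Monitoring the rate of these warning lines is how you catch override abuse before bypassing policy becomes the norm.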
What goes wrong
Approval fatigue
Too many approval requests. Humans start approving without looking. The oversight becomes theater. Right-size what needs approval.
No escalation path
Agent gets stuck with no way to ask for help. It either fails silently or keeps retrying forever. Always provide an escalation route.
Override abuse
Overrides become the norm. People bypass policies routinely. Eventually a bad override causes an incident. Monitor override rates.
Untested emergency stops
Kill switch exists but hasn't been tested. During a real emergency, it doesn't work as expected. Test your emergency procedures.
Summary
- Identify hard gates that always need human approval. Make them structural.
- Build trust incrementally: watch → review → approve exceptions → audit.
- Use agent confidence to route to appropriate oversight levels.
- Make escalation easy. Build and test emergency stops.