
Policy & Guardrails

Core Questions

  • What actions are forbidden?
  • What violations are blocking?
  • How is drift detected early?

Agents are fast. They can produce more code in an hour than a human can in a day. That's the upside. The downside: they can also introduce more bugs, security holes, and architectural drift in an hour than a human can in a day. Guardrails aren't about slowing agents down — they're about catching problems before they ship.

What actions are forbidden?

Start with a deny list: actions that agents should never take, regardless of what they're asked to do. The list depends on your context, but common entries are collected in the table below.

You can't enforce this in the LLM

Here's the uncomfortable truth: telling an agent "don't delete production data" is not enforcement. It's a suggestion. LLMs can misunderstand, hallucinate, or simply ignore instructions. If the agent can connect to production, eventually it will do something you didn't want.

Real enforcement happens at the infrastructure level:

  • Network isolation: The agent's environment simply cannot reach production. Firewall rules, VPC segmentation, network policies. No route = no risk.
  • No production credentials: The agent never has production database passwords, API keys, or service account tokens. You can't delete what you can't authenticate to.
  • Separate environments: Agents work in dev/staging environments that are architecturally isolated from production. They can break staging all day — that's what it's for.
  • Read-only access where needed: If an agent needs production data for debugging, give it read-only replicas. No write path exists.
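The "no route" idea can be made literal with network policy. Here's a sketch as a Kubernetes NetworkPolicy, assuming the agent runs in a namespace called `agent-sandbox` and that `10.20.0.0/16` is an internal package mirror; both names are placeholders, not recommendations:

```yaml
# deny-egress.yaml (hypothetical agent-sandbox namespace)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: agent-egress-lockdown
  namespace: agent-sandbox
spec:
  podSelector: {}        # applies to every pod in the namespace
  policyTypes:
    - Egress
  egress:
    - to:                # DNS only
        - namespaceSelector: {}
      ports:
        - protocol: UDP
          port: 53
    - to:                # internal package mirror; production CIDRs get no rule
        - ipBlock:
            cidr: 10.20.0.0/16
      ports:
        - protocol: TCP
          port: 443
```

Because egress rules are additive and everything not listed is denied, production simply has no route from this namespace.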

Make these impossible, not just forbidden

  • Access production — No network route, no credentials, period
  • Modify auth systems — Auth code requires human PR approval
  • Commit secrets — Pre-commit hooks + CI scanning blocks merge
  • Disable security controls — Config changes gated by CODEOWNERS
  • Push to main directly — Branch protection, no exceptions
  • Modify CI/CD pipelines — Workflow files require security team approval
  • Add arbitrary dependencies — Allowlisted packages only, or human review
  • Deploy to production — Deployment requires human trigger
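Several of these gates reduce to a CODEOWNERS file plus branch protection with required reviews. A minimal sketch, with placeholder team names:

```
# CODEOWNERS -- paths listed here require approval from the named team
/.github/workflows/   @org/security-team
/src/auth/            @org/security-team
/infrastructure/      @org/platform-team
```

With branch protection set to require code-owner review, an agent can open a PR touching these paths but cannot merge it.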

The principle: if an agent can do something by ignoring instructions, your guardrail has failed. Enforcement must be structural — network policies, IAM permissions, branch protection rules, CODEOWNERS requirements. These are things the agent cannot bypass no matter what it decides to do.

Defense in depth still matters, but the layers are different:

  • Layer 1 — Network/IAM: The agent physically cannot reach production systems. This is the hard boundary.
  • Layer 2 — Git/CI enforcement: Branch protection, required reviews, CODEOWNERS. The agent can write code but can't merge without approval.
  • Layer 3 — Static analysis: Semgrep, secret scanners, dependency checks. Catches dangerous patterns before merge.
  • Layer 4 — Agent instructions: Yes, still tell the agent what not to do. This reduces wasted work. But don't rely on it for safety.

If Layer 1 or 2 fails, you have a serious infrastructure problem. Layers 3 and 4 are about efficiency and catching mistakes — not about preventing a determined (or confused) agent from doing harm.

Policy as code

Policies written in English are ambiguous. Policies written in code are precise. For agent guardrails, policy as code means expressing your rules in a format that can be automatically evaluated.

Policy Patterns

File path restrictions

Agents can only modify files matching certain patterns. Protects sensitive areas of the codebase.

# .agent-policy.yaml
allowed_paths:
  - "src/**"
  - "tests/**"
  - "docs/**"
forbidden_paths:
  - ".github/workflows/**"
  - "infrastructure/**"
  - "**/secrets/**"
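A policy file only matters if something evaluates it. Here's a sketch of the CI-side check; the globs are inlined rather than parsed from YAML to keep it dependency-free, and `fnmatch` treats `*` as matching path separators, which is close enough for these patterns:

```python
# policy_check.py -- evaluate changed paths against .agent-policy.yaml-style rules.
from fnmatch import fnmatch

ALLOWED = ["src/**", "tests/**", "docs/**"]
FORBIDDEN = [".github/workflows/**", "infrastructure/**", "**/secrets/**"]

def violations(changed_files):
    """Return (path, reason) pairs for files that break the path policy."""
    out = []
    for path in changed_files:
        if any(fnmatch(path, pat) for pat in FORBIDDEN):
            out.append((path, "forbidden path"))
        elif not any(fnmatch(path, pat) for pat in ALLOWED):
            out.append((path, "outside allowed paths"))
    return out

# In CI you would feed this the output of:
#   git diff --name-only origin/main...HEAD
# and fail the job if the returned list is non-empty.
```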

Dependency policies

Control what dependencies agents can add or update. Prevent supply chain surprises.

dependencies:
  allow_new: false       # Must be explicitly approved
  allow_updates: patch   # Only patch versions
  blocked:
    - "eval-*"           # No eval libraries
    - "*crypto*"         # Flag crypto changes

Code pattern rules

Detect and block specific code patterns. Useful for security and architectural rules.

patterns:
  block:
    - pattern: "eval\\("
      message: "eval() is forbidden"
    - pattern: "dangerouslySetInnerHTML"
      message: "XSS risk - requires security review"
    - pattern: "TODO.*HACK"
      message: "No shipping hacks"
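The scanner side of pattern rules is a line-by-line regex pass. A minimal sketch mirroring the three rules above (real tools like Semgrep are AST-aware and far harder to bypass than plain regex):

```python
# pattern_scan.py -- block on regex patterns, mirroring the rules above.
import re

RULES = [
    (re.compile(r"eval\("), "eval() is forbidden"),
    (re.compile(r"dangerouslySetInnerHTML"), "XSS risk - requires security review"),
    (re.compile(r"TODO.*HACK"), "No shipping hacks"),
]

def scan(source, filename="<input>"):
    """Return '<file>:<line>: <message>' strings for every rule hit."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for pattern, message in RULES:
            if pattern.search(line):
                hits.append(f"{filename}:{lineno}: {message}")
    return hits
```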

The format doesn't matter as much as the principle: if you can't express a rule as code, you can't enforce it automatically. YAML, JSON, Rego (Open Policy Agent), TypeScript — pick what works for your team.

Open Policy Agent example

OPA/Rego is designed for policy decisions. Here's a policy that blocks PRs modifying certain files:

package agent.pr

import rego.v1

default allow := true

# Block changes to CI workflows
deny contains msg if {
    some file in input.changed_files
    startswith(file, ".github/workflows/")
    msg := sprintf("Cannot modify CI workflow: %s", [file])
}

# Block changes to auth module without security label
deny contains msg if {
    some file in input.changed_files
    contains(file, "/auth/")
    not "security-reviewed" in input.labels
    msg := "Auth changes require security-reviewed label"
}

# Final decision
allow := false if count(deny) > 0
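The policy evaluates an input document carrying `changed_files` and `labels`. An illustrative input for a PR that touches a workflow file and the auth module (values are made up):

```json
{
  "changed_files": [".github/workflows/deploy.yml", "src/auth/session.ts"],
  "labels": ["bug"]
}
```

Running `opa eval --data policy.rego --input input.json "data.agent.pr.deny"` against this input should trip both deny rules.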

Hard vs. soft enforcement

Not all violations are equal. Some should block the PR entirely. Others should warn but allow override. The distinction matters for both safety and velocity.

Enforcement Levels

  • 🛑 Block (hard): PR cannot merge until the violation is fixed. No override possible without changing the policy itself. Use for: security issues, data safety, compliance.
  • ⚠️ Require approval (gated): PR blocked until a specific person or team approves. Use for: architectural changes, dependency additions, sensitive areas.
  • 💬 Warn (soft): PR can merge, but the violation is highlighted. The reviewer decides. Use for: style issues, potential concerns, things that might be intentional.
  • 📊 Log (monitor): Record the violation but don't surface it. Use for: gathering data before deciding on enforcement, tracking trends.

A common pattern: start soft, then harden. When you introduce a new policy:

  1. Week 1-2: Log only. See how often the rule would trigger. Identify false positives.
  2. Week 3-4: Warn. Let teams know the rule exists and will become blocking.
  3. Week 5+: Block. The rule is now enforced. Teams have had time to adapt.

This avoids the "surprise enforcement" problem where a new rule breaks everyone's PRs on day one.
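One way to encode the ramp-up is a per-rule level that your CI script reads; the `level` field and rule ids here are hypothetical, not part of any standard format:

```yaml
# .agent-policy.yaml (hypothetical "level" field per rule)
rules:
  - id: no-eval
    level: block   # week 5+: enforced
  - id: no-new-deps
    level: warn    # week 3-4: visible, not yet blocking
  - id: naming-conventions
    level: log     # week 1-2: gathering data
```

Promoting a rule is then a one-line diff with an audit trail in git history.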

Architectural guardrails

Beyond security, guardrails can enforce architectural decisions. Agents don't inherently understand your architecture — they'll add a database call to a presentation layer component if the instructions don't say otherwise.

Architectural rules to consider

  • Layer dependencies: UI components cannot import from the data layer directly. Services cannot import from UI.
  • Module boundaries: Feature A cannot import internals from Feature B. Only public APIs.
  • Database access patterns: Only repository classes can make database queries. No raw SQL in controllers.
  • API versioning: Breaking changes require a new API version. Old versions must remain compatible.
  • Naming conventions: Files in /hooks must be named use*.ts. Components must be PascalCase.

Tools like dependency-cruiser (JavaScript), ArchUnit (Java), or custom ESLint rules can enforce these patterns in CI. The key is making the architecture machine-readable, not just documented.
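For example, the "UI cannot import data layer" rule above can be expressed as a dependency-cruiser rule. The path regexes assume a `src/ui` / `src/data` layout; adjust them to yours:

```javascript
// .dependency-cruiser.js -- a sketch, not a complete configuration
module.exports = {
  forbidden: [
    {
      name: "no-ui-to-data",
      comment: "UI components must go through services, not the data layer",
      severity: "error",
      from: { path: "^src/ui" },
      to: { path: "^src/data" },
    },
    {
      name: "no-circular",
      severity: "error",
      from: {},
      to: { circular: true },
    },
  ],
};
```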

Drift detection

Guardrails catch violations at PR time. Drift detection catches violations that slipped through — or were introduced before guardrails existed. Run periodic scans of your codebase to find:

  • Policy violations: Code that violates current policies but was merged before the policy existed.
  • Architectural decay: Module boundaries that have eroded. Dependencies that shouldn't exist.
  • Security regressions: Vulnerabilities introduced by new code, or newly disclosed CVEs affecting existing code.
  • Configuration drift: Settings that have diverged from intended state.

Drift detection workflow

Run drift detection on a schedule (nightly or weekly):

# .github/workflows/drift-detection.yml
name: Drift Detection
on:
  schedule:
    - cron: '0 2 * * *'  # 2 AM daily
  workflow_dispatch:

jobs:
  scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      - name: Run policy scan
        run: ./scripts/policy-scan.sh
        
      - name: Run architecture scan
        run: npx dependency-cruiser src --config .dependency-cruiser.js
        
      - name: Run security scan
        run: npm audit --audit-level=high
        
      - name: Report violations
        if: failure()
        run: ./scripts/report-drift.sh

When drift is detected, create issues automatically. Track drift over time. If drift is increasing, your guardrails aren't working.
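"Track drift over time" can be as simple as comparing each run's violation count to a committed baseline. A sketch; the baseline filename and its shape are assumptions:

```python
# drift_trend.py -- flag the scheduled job when violations exceed a baseline.
import json
from pathlib import Path

def check(current_count, baseline_path="drift-baseline.json"):
    """Compare a violation count against the committed baseline file.

    The baseline file is assumed to look like {"violations": 42}.
    Returns (ok, message); ok is False when drift has increased.
    """
    path = Path(baseline_path)
    baseline = json.loads(path.read_text())["violations"] if path.exists() else 0
    if current_count > baseline:
        return False, f"drift increasing: {current_count} violations (baseline {baseline})"
    return True, f"drift stable: {current_count} violations (baseline {baseline})"

# A nightly job would count scanner output lines, call check(), and open an
# issue (or fail the workflow) when ok is False. Lowering the baseline as
# violations get fixed is a ratchet: drift can only go down.
```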

Auto-rollback patterns

The fastest way to recover from a bad agent change is to never deploy it. The second fastest is to roll it back automatically. Build rollback into your deployment pipeline:

Rollback Triggers

Error rate spike

If error rate increases by > 2x within 5 minutes of deploy, auto-rollback. The new code is probably broken.

Latency degradation

If p99 latency increases by > 50%, auto-rollback. The new code might have introduced a performance regression.

Health check failure

If the service fails health checks after deploy, auto-rollback. Don't wait for traffic to reveal the problem.

Canary failure

If canary instances show problems before full rollout, abort and rollback. Canaries exist to catch exactly this.

Auto-rollback requires two things: good observability (you can detect problems quickly) and reliable rollback (going back actually works). Test both regularly.

For agent-authored changes specifically, consider more conservative thresholds. If a human wrote the code, you might tolerate a small error uptick while investigating. For agent code — especially early in your trust-building — roll back first, investigate second.
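The triggers above all reduce to comparisons against a pre-deploy baseline. A sketch of the decision logic, using the 2x error-rate and p99 latency thresholds from above; the tighter factors suggested for agent-authored changes are illustrative:

```python
# rollback_decision.py -- decide whether a fresh deploy should be rolled back.
from dataclasses import dataclass

@dataclass
class Metrics:
    error_rate: float      # errors per request, e.g. 0.002
    p99_latency_ms: float
    healthy: bool          # did the post-deploy health check pass?

def should_rollback(baseline: Metrics, current: Metrics,
                    error_factor: float = 2.0,
                    latency_factor: float = 1.5) -> list[str]:
    """Return the list of tripped triggers; an empty list means keep the deploy."""
    reasons = []
    if not current.healthy:
        reasons.append("health check failed")
    if current.error_rate > baseline.error_rate * error_factor:
        reasons.append("error rate spike")
    if current.p99_latency_ms > baseline.p99_latency_ms * latency_factor:
        reasons.append("p99 latency degradation")
    return reasons

# For agent-authored changes, pass tighter factors, e.g.
# should_rollback(base, cur, error_factor=1.5, latency_factor=1.2)
```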

What goes wrong

Policy theater

Policies exist but aren't enforced. Everyone knows the rules but ignores them because there's no enforcement. This is worse than no policy — it creates false confidence.

Alert fatigue

Too many soft warnings. Teams start ignoring them. The one real issue drowns in noise. Warnings should be rare and meaningful; if they're constant, either fix the violations or remove the rule.

Guardrail bypass

Developers find workarounds. Policies check for "eval(" so they use "ev" + "al(" instead. If you're playing whack-a-mole with bypasses, the policy is too narrow — address the underlying risk.

Rollback that doesn't

Auto-rollback triggers, but the rollback fails. Database migrations can't be reversed. The old container image was garbage-collected. Test your rollback path as often as your deploy path.

Tools that help

Open Policy Agent (OPA) — Policy

General-purpose policy engine. Write policies in Rego. Integrates with CI, Kubernetes, and pretty much everything else.

dependency-cruiser — Architecture

Validate and visualize JavaScript/TypeScript dependencies. Enforce architectural rules like "no cycles" or "UI cannot import data layer."

Semgrep — Code patterns

Static analysis that's easy to customize. Write rules in YAML. Great for security patterns and custom code standards.

Danger JS — PR checks

Automate PR feedback. Write rules in JavaScript/TypeScript. Good for "warn if..." style soft enforcement.

Summary

  • Define forbidden actions explicitly. Enforce at multiple layers — instructions, tools, CI, runtime.
  • Express policies as code. If you can't automate enforcement, the policy is just a suggestion.
  • Use hard enforcement for security, soft for style. Start soft and harden over time.
  • Run drift detection regularly. Guardrails catch new violations; drift detection catches old ones.
