Ephemeral Runtimes
Core Questions
- Should every task run in a fresh environment?
- How long do workspaces live?
- What state is allowed to persist?
Ephemeral means resettable. An ephemeral runtime is leased for a task, runs, then gets destroyed (or reverted to a known snapshot).
Should every task run fresh?
Agents: yes. CI: almost always. Humans: not always. But you want a one-click reset when the environment gets weird.
Fresh runtimes buy isolation and reproducibility. They cost cold-start time (boot, pull, install) and sometimes money (spin-up overhead). The right default depends on who’s running the work and which failure modes you can tolerate.
Rule of thumb
- Agents: fresh runtime per task (destroy or revert-to-snapshot).
- CI: fresh runtime per job; share only deterministic caches.
- Humans: keep workspaces persistent; reset runtimes on demand.
Why agents need reset semantics
Humans detect state drift. Agents don’t. Agents just run commands. If a prior run left a stale node_modules, a zombie process, a temp file, or cached auth, an agent will fail cryptically, succeed incorrectly, or waste cycles debugging ghosts.
Resettable runtimes delete that entire class of failures. When something breaks, you can trust it broke from a baseline.
For humans, the inner loop includes context: half-made changes, logs, tabs, and a mental model you’ve partially externalized. The fix is to separate workspaces (persist) from runtimes (reset).
Your code and git checkout live in a workspace that survives. The execution sandbox (the container/VM where commands run) is disposable.
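A minimal sketch of the split, assuming a Docker-based runtime (the image name, paths, and example command are illustrative):

```python
import subprocess
import uuid

def run_task_in_fresh_runtime(workspace_dir: str, image: str, command: list[str]) -> int:
    """Run one command in a disposable container; the workspace survives, the runtime doesn't."""
    name = f"task-{uuid.uuid4().hex[:8]}"
    try:
        # --rm destroys the container (all runtime state) when the command exits.
        # The workspace is a bind mount, so the checkout and diffs persist across runs.
        result = subprocess.run([
            "docker", "run", "--rm", "--name", name,
            "-v", f"{workspace_dir}:/workspace",
            "-w", "/workspace",
            image, *command,
        ])
        return result.returncode
    finally:
        # Belt and braces: if the run was interrupted, make sure nothing leaks.
        subprocess.run(["docker", "rm", "-f", name], capture_output=True)

# run_task_in_fresh_runtime("/srv/workspaces/repo", "ghcr.io/org/toolchain@sha256:...", ["pnpm", "test"])
```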
Two platform decisions (and their consequences)
If you’re building a first-party agentic dev environment, the hard part isn’t “can we run code in the cloud.” The hard part is making environment switching cheap for humans without letting hidden state leak between tasks.
1) What are you optimizing for?
Pick a primary constraint. Everything else is a trade you’re consciously making.
- Handoff speed: developers can enter any agent context in seconds.
- Reproducibility: every run is attributable (image, lockfiles, inputs) and repeatable.
- Throughput / cost: lots of tasks cheaply, even if humans step in less often.
For most internal platforms, a good default is handoff speed + reproducibility. You can always buy more compute later; you can’t buy back developer trust.
2) How do humans step into agent context?
You need a deliberate handoff model. Otherwise you end up with the worst hybrid: slow switching and flaky parity.
Pull-local (recommended)
Agents run remotely by default. When a human steps in, they rehydrate the agent’s context locally and debug with their normal inner loop.
Thin client (remote-first)
Humans connect into a remote environment. Handoff is a connection change, but latency becomes a permanent tax.
Pull-local only works if you pull an environment identity, not an environment blob. The identity pins what matters; local caches provide the speed. Four fields cover it (a record sketch follows the list):
- Repo ref: branch/commit SHA the agent worked on
- Image/toolchain: a pinned image digest (or equivalent) used to run commands
- Task snapshot ID: small, semantic task state (workspace diff + key artifacts), not a VM disk
- Cache keys: lockfile/input hashes that select safe caches
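A minimal sketch of that identity as a record (the field names are illustrative, not a standard):

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class EnvironmentIdentity:
    """Everything needed to rehydrate an agent's context locally.

    Pins identity (refs, digests, hashes) instead of shipping a disk image;
    local caches supply the actual bytes.
    """
    repo: str                       # e.g. "github.com/org/repo"
    commit: str                     # exact SHA the agent worked on
    image_digest: str               # pinned toolchain image, e.g. "sha256:..."
    snapshot_id: str | None = None  # small, semantic task state (diff + artifacts)
    cache_keys: dict[str, str] = field(default_factory=dict)  # lockfile -> content hash

identity = EnvironmentIdentity(
    repo="github.com/org/repo",
    commit="abc123def456",
    image_digest="sha256:...",
    cache_keys={"pnpm-lock.yaml": "sha256:..."},
)
```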
Enter context checklist
This is the acceptance test for your platform. If you can’t do these in seconds, humans will stop stepping in.
- Pull the agent’s branch/commit and open the repo locally
- Pull (or already have) the pinned image/toolchain used by the agent
- Restore the task snapshot (workspace diff + artifacts), not a whole disk
- Run the dev server/tests without additional setup steps
- Reset back to baseline on exit (no leaked processes, ports, or auth state)
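As a sketch, the same checklist as automation, reusing the identity record from the earlier sketch (every helper here is hypothetical; the point is that each step is a scripted call, not a manual ritual):

```python
def enter_context(identity: EnvironmentIdentity) -> None:
    """Rehydrate an agent's context locally, fast enough to feel like a branch switch."""
    checkout(identity.repo, identity.commit)    # the exact branch/commit the agent used
    ensure_image(identity.image_digest)         # no-op when the image is already local
    if identity.snapshot_id:
        restore_snapshot(identity.snapshot_id)  # workspace diff + artifacts, not a disk
    hydrate_caches(identity.cache_keys)         # content-addressed, so safe to share

def exit_context() -> None:
    """Reset to baseline on exit: no leaked processes, ports, or auth state."""
    kill_task_processes()
    wipe_runtime_state()  # temp files, sessions, injected secrets
```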
How long should workspaces live?
Workspace lifetime is really two decisions: what triggers deletion and who guarantees cleanup.
For pull-local platforms, this choice is tightly coupled to handoff speed. Longer-lived workspaces reduce cold starts, but they increase the cost of enforcing reset semantics and keeping snapshots small and meaningful.
Workspace lifetime patterns
Per-task (fully ephemeral)
Clone → run → destroy. Maximum isolation, maximum cold-start cost.
Deletion: end of task · Owner: orchestrator
Per-branch
Reuse across commits. Deleted on merge or branch delete.
Deletion: merge/delete · Owner: git automation
Per-project (long-lived)
One workspace per repo. Manual cleanup. Most like local dev.
Deletion: manual · Owner: humans (expect cruft)
Per-developer (persistent)
A “laptop in the cloud.” Fast, personal, and inevitably stateful.
Deletion: rare · Owner: humans (treat as a laptop)
For agents, per-task or per-branch makes the most sense. Per-branch is a strong compromise: warmed dependencies across multiple runs, but isolation between branches.
For humans, per-project or per-developer avoids flow-breaking cold starts.
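A sketch of the per-branch pattern, where git automation rather than a human owns deletion (delete_workspace is a hypothetical orchestrator call):

```python
import hashlib

def workspace_key(repo: str, branch: str) -> str:
    """One workspace per (repo, branch): warm deps across commits, isolation between branches."""
    return "ws-" + hashlib.sha256(f"{repo}#{branch}".encode()).hexdigest()[:12]

def on_branch_merged_or_deleted(repo: str, branch: str) -> None:
    """Wired to a git webhook, so cleanup is guaranteed rather than remembered."""
    delete_workspace(workspace_key(repo, branch))  # hypothetical orchestrator call
```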
What’s allowed to persist?
Persist deterministic, content-addressed artifacts. Discard per-run variance.
Safe to persist
- ✓ Package caches — keyed by lockfile hash
- ✓ Build caches — keyed by inputs
- ✓ Base images — OS + language runtimes
- ✓ Read-only fixtures — golden data
Dangerous to persist
- ✗ Runtime state — processes, ports, temp files
- ✗ Credentials — tokens, cookies, cached sessions
- ✗ CLI/browser auth state — ~/.config, keychains
- ✗ Git state — stashes, conflicts, uncommitted changes
- ✗ Database state — leftover test data
If a task needs a secret, inject it at runtime and guarantee it’s wiped on reset. Don’t store it in the workspace.
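A sketch of runtime injection, assuming Docker; an env var is the simplest carrier, and a mounted secret file works the same way (fetch_secret is a hypothetical secret-manager client):

```python
import subprocess

def run_with_secret(image: str, command: list[str], secret_name: str) -> None:
    """Inject a secret at runtime; it lives only inside the disposable container."""
    secret = fetch_secret(secret_name)  # hypothetical: your secret-manager client
    # --rm plus env injection: the secret never lands in the workspace, an image
    # layer, or a dotfile, so resetting the runtime is enough to wipe it.
    subprocess.run(
        ["docker", "run", "--rm", "-e", f"API_TOKEN={secret}", image, *command],
        check=True,
    )
```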
The common implementation pattern is layered storage (a mount sketch follows the list):
- Layer 0: base image
- Layer 1: shared caches
- Layer 2: workspace
- Layer 3: ephemeral runtime state
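A sketch of those layers as container mounts, assuming Docker (the cache path is illustrative):

```python
def runtime_mounts(workspace_dir: str, cache_dir: str) -> list[str]:
    """Assemble the layers; only layers 1 and 2 outlive the runtime."""
    return [
        # Layer 0 is the base image itself, pinned by digest at `docker run` time.
        # Layer 1: shared caches, content-addressed and safe to reuse across tasks.
        "-v", f"{cache_dir}/pnpm-store:/root/.local/share/pnpm/store",
        # Layer 2: the workspace (checkout, diffs, task state). Persists.
        "-v", f"{workspace_dir}:/workspace",
        # Layer 3: runtime state on tmpfs, gone when the container exits.
        "--tmpfs", "/tmp",
    ]
```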
Artifacts, caches, and provenance
Ephemeral runtimes destroy runtime state, but they should still produce durable evidence: logs, test output, screenshots, and build artifacts. Store those outputs with enough provenance to reproduce and audit the run.
The common trap is treating caches like artifacts (or vice versa). Caches are derived and disposable. Artifacts are durable and should be attributable to a specific run.
Minimum artifact metadata
{
  "artifact_type": "test-results|logs|docker-image|binary|screenshot",
  "created_at": "2026-01-15T10:30:00Z",
  "task_id": "task-xyz789",
  "agent_identity": "agent-app|machine-user",
  "repo": "github.com/org/repo",
  "commit": "abc123def456",
  "environment_identity": {
    "toolchain": "flake.lock-hash|image-digest",
    "lockfiles": ["pnpm-lock.yaml:sha256:..."],
    "snapshot_id": "optional"
  }
}
Cache safety rules (so caches don’t poison runs)
- Key caches by content (lockfile and input hashes), never by branch or task name
- Treat entries as immutable: write once, content-addressed, never updated in place
- Never cache anything from the “dangerous to persist” list (credentials, runtime state, git state)
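Content addressing is what makes shared caches safe in practice. A minimal sketch of key derivation (the deps- prefix is illustrative):

```python
import hashlib
from pathlib import Path

def cache_key(lockfile: str) -> str:
    """Content-addressed cache key: identical lockfile bytes yield the same key on any branch."""
    digest = hashlib.sha256(Path(lockfile).read_bytes()).hexdigest()
    return f"deps-{digest}"
```

A changed lockfile produces a new key, so a stale or poisoned entry is never overwritten or reselected; it simply ages out.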
Cold start vs. warm pools
Cold start is the enemy of fast feedback. “Fresh” doesn’t have to mean “slow,” but you need the right semantics.
In a pull-local model, the goal is to make cold starts rare: keep base images and caches warm locally, and keep remote snapshots small so restore is effectively “pull identity, hydrate from cache.”
Warm pool strategies
Keep a pool of ready environments that already did the slow setup work. Warm pools only count as ephemeral if every checkout includes a reset step: destroy the VM, or revert to a known snapshot before the task starts. A lease sketch follows the list.
- Pre-provisioned VMs: lease → run → reset/replace
- Snapshots: restore after slow setup
- Persistent caches: mount caches into fresh runtimes
- Branch prediction: pre-warm on PR open
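A sketch of reset-on-lease, assuming a VM handle with a revert_to_snapshot method (the method, the snapshot name, and provision_vm are all hypothetical):

```python
import queue

class WarmPool:
    """Pre-provisioned runtimes; reset-on-lease is what keeps them semantically fresh."""

    def __init__(self) -> None:
        self._ready: queue.Queue = queue.Queue()

    def lease(self):
        try:
            vm = self._ready.get_nowait()
        except queue.Empty:
            vm = provision_vm()              # hypothetical cold path: boot, pull, install
        vm.revert_to_snapshot("post-setup")  # the reset step that makes reuse safe
        return vm

    def release(self, vm) -> None:
        self._ready.put(vm)  # returning it dirty is fine; the next lease always resets first
```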
Warm capacity costs money when idle. Right-size pools based on usage: scale up during working hours, scale down nights and weekends.
Cost control
Ephemeral runtimes get expensive when you pay cold-start overhead repeatedly. Control cost by shrinking runtimes, sharing safe caches, and killing runaway tasks.
Cost control strategies
Right-size instances
Match CPU/RAM to the task. Most checks don’t need big boxes.
Prefer fine-grained billing
Per-second billing (often with a 1-minute minimum) helps short tasks.
Batch when isolation doesn’t matter
Run multiple checks per lease, then reset before unrelated work.
Set hard time limits
Timeouts prevent zombie spend. Cleanup on timeout, not just success.
Track per-task costs
Tag resources by task ID and review weekly.
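A sketch of hard limits with cleanup on every exit path (destroy_runtime and record_cost_event are hypothetical orchestrator hooks):

```python
import subprocess

def run_with_hard_limit(cmd: list[str], timeout_s: int, task_id: str) -> None:
    """Enforce a hard time limit; cleanup must run on timeout, not just success."""
    try:
        subprocess.run(cmd, timeout=timeout_s, check=True)
    except subprocess.TimeoutExpired:
        record_cost_event(task_id, "timeout")  # hypothetical: feeds the weekly cost review
        raise
    finally:
        destroy_runtime(task_id)  # hypothetical: runs on success, failure, and timeout
```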
A practical target: keep average compute under $0.10 per agent task. If you’re at $1+, tasks are too large or environments are too heavyweight.
Tools that help
- Sandboxes with checkpoint/restore — use checkpoints to get “fresh start” semantics without paying cold starts every run.
- Programmatic cloud VMs — pair with destroy/reset per task to keep runs attributable.
- Fast microVMs with an API — good if you’re building your own ephemeral runtime layer.
- Cloud workspaces with prebuilds — excellent for humans; usable for agents.
- MicroVM foundations for fast, isolated sandboxes — a common substrate for “ephemeral but fast.”
What goes wrong
Orphaned resources
Agent crashes mid-task. The VM keeps running. Always enforce cleanup on timeout.
Cache poisoning
A corrupted artifact enters a shared cache. Use content-addressed keys and treat entries as immutable.
State leakage
The runtime isn’t actually reset — a token, DB, or profile persists. Reset semantics must be enforced.
Cold start death spiral
Cold starts take minutes. People reuse stale boxes. Either fix cold start or embrace persistence.
Summary
- Agents need resettable runtimes. Hidden state is the root of subtle failures.
- Separate workspaces (persist) from runtimes (reset). Persist deterministic caches only.
- For human handoff, pull-local is powerful: pull environment identity (ref + image + snapshot), hydrate fast from caches.
- Warm pools + snapshots make “fresh” fast. Reset-on-lease is the key semantic.
- Control cost with right-sizing, timeouts, tagging, and batching.