Ephemeral Runtimes
Core Questions
- Should every task run in a fresh environment?
- How long do workspaces live?
- What state is allowed to persist?
Ephemeral means resettable. An ephemeral runtime is leased for a task, runs, then gets destroyed (or reverted to a known snapshot).
Should every task run fresh?
Agents: yes. CI: almost always. Humans: not always. But you want a one-click reset when the environment gets weird.
Fresh runtimes buy isolation and reproducibility. They cost cold-start time (boot, pull, install) and sometimes money (spin-up overhead). The right default depends on who’s running the work and which failure modes you can tolerate.
Rule of thumb
- Agents: fresh runtime per task (destroy or revert-to-snapshot).
- CI: fresh runtime per job; share only deterministic caches.
- Humans: keep workspaces persistent; reset runtimes on demand.
Why agents need reset semantics
Humans detect state drift. Agents don’t. Agents just run commands. If a prior run left a stale node_modules, a zombie process, a temp file, or cached auth, an agent will fail cryptically, succeed incorrectly, or waste cycles debugging ghosts.
Resettable runtimes delete that entire class of failures. When something breaks, you can trust it broke from a baseline.
For humans, the inner loop includes context: half-made changes, logs, tabs, and a mental model you’ve partially externalized. The fix is to separate workspaces (persist) from runtimes (reset).
Your code and git checkout live in a workspace that survives. The execution sandbox (the container/VM where commands run) is disposable.
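A minimal sketch of the split, assuming a Docker-based runtime (the image name, paths, and example command are illustrative):

```python
import subprocess
import uuid

def run_task_in_fresh_runtime(workspace_dir: str, image: str, command: list[str]) -> int:
    """Run one command in a disposable container; the workspace survives, the runtime doesn't."""
    name = f"task-{uuid.uuid4().hex[:8]}"
    try:
        # --rm destroys the container (all runtime state) when the command exits.
        # The workspace is a bind mount, so the checkout and diffs persist across runs.
        result = subprocess.run([
            "docker", "run", "--rm", "--name", name,
            "-v", f"{workspace_dir}:/workspace",
            "-w", "/workspace",
            image, *command,
        ])
        return result.returncode
    finally:
        # Belt and braces: if the run was interrupted, make sure nothing leaks.
        subprocess.run(["docker", "rm", "-f", name], capture_output=True)

# run_task_in_fresh_runtime("/srv/workspaces/repo", "ghcr.io/org/toolchain@sha256:...", ["pnpm", "test"])
```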
Two platform decisions (and their consequences)
If you’re building a first-party agentic dev environment, the hard part isn’t “can we run code in the cloud.” The hard part is making environment switching cheap for humans without letting hidden state leak between tasks.
1) What are you optimizing for?
Pick a primary constraint. Everything else is a trade you’re consciously making.
- Handoff speed: developers can enter any agent context in seconds.
- Reproducibility: every run is attributable (image, lockfiles, inputs) and repeatable.
- Throughput / cost: lots of tasks cheaply, even if humans step in less often.
For most internal platforms, a good default is handoff speed + reproducibility. You can always buy more compute later; you can’t buy back developer trust.
2) How do humans step into agent context?
You need a deliberate handoff model. Otherwise you end up with the worst hybrid: slow switching and flaky parity.
Pull-local (recommended)
Agents run remotely by default. When a human steps in, they rehydrate the agent’s context locally and debug with their normal inner loop.
Thin client (remote-first)
Humans connect into a remote environment. Handoff is a connection change, but latency becomes a permanent tax.
Pull-local only works if you pull an environment identity, not an environment blob. The identity pins what matters; local caches provide the speed. Four fields cover it (a record sketch follows the list):
- Repo ref: branch/commit SHA the agent worked on
- Image/toolchain: a pinned image digest (or equivalent) used to run commands
- Task snapshot ID: small, semantic task state (workspace diff + key artifacts), not a VM disk
- Cache keys: lockfile/input hashes that select safe caches
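A minimal sketch of that identity as a record (the field names are illustrative, not a standard):

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class EnvironmentIdentity:
    """Everything needed to rehydrate an agent's context locally.

    Pins identity (refs, digests, hashes) instead of shipping a disk image;
    local caches supply the actual bytes.
    """
    repo: str                       # e.g. "github.com/org/repo"
    commit: str                     # exact SHA the agent worked on
    image_digest: str               # pinned toolchain image, e.g. "sha256:..."
    snapshot_id: str | None = None  # small, semantic task state (diff + artifacts)
    cache_keys: dict[str, str] = field(default_factory=dict)  # lockfile -> content hash

identity = EnvironmentIdentity(
    repo="github.com/org/repo",
    commit="abc123def456",
    image_digest="sha256:...",
    cache_keys={"pnpm-lock.yaml": "sha256:..."},
)
```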
Enter context checklist
This is the acceptance test for your platform. If you can’t do these in seconds, humans will stop stepping in.
- Pull the agent’s branch/commit and open the repo locally
- Pull (or already have) the pinned image/toolchain used by the agent
- Restore the task snapshot (workspace diff + artifacts), not a whole disk
- Run the dev server/tests without additional setup steps
- Reset back to baseline on exit (no leaked processes, ports, or auth state)
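As a sketch, the same checklist as automation, reusing the identity record from the earlier sketch (every helper here is hypothetical; the point is that each step is a scripted call, not a manual ritual):

```python
def enter_context(identity: EnvironmentIdentity) -> None:
    """Rehydrate an agent's context locally, fast enough to feel like a branch switch."""
    checkout(identity.repo, identity.commit)    # the exact branch/commit the agent used
    ensure_image(identity.image_digest)         # no-op when the image is already local
    if identity.snapshot_id:
        restore_snapshot(identity.snapshot_id)  # workspace diff + artifacts, not a disk
    hydrate_caches(identity.cache_keys)         # content-addressed, so safe to share

def exit_context() -> None:
    """Reset to baseline on exit: no leaked processes, ports, or auth state."""
    kill_task_processes()
    wipe_runtime_state()  # temp files, sessions, injected secrets
```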
How long should workspaces live?
Workspace lifetime is really two decisions: what triggers deletion and who guarantees cleanup.
For pull-local platforms, this choice is tightly coupled to handoff speed. Longer-lived workspaces reduce cold starts, but they increase the cost of enforcing reset semantics and keeping snapshots small and meaningful.
Workspace lifetime patterns
Per-task (fully ephemeral)
Clone → run → destroy. Maximum isolation, maximum cold-start cost.
Deletion: end of task · Owner: orchestrator
Per-branch
Reuse across commits. Deleted on merge or branch delete.
Deletion: merge/delete · Owner: git automation
Per-project (long-lived)
One workspace per repo. Manual cleanup. Most like local dev.
Deletion: manual · Owner: humans (expect cruft)
Per-developer (persistent)
A “laptop in the cloud.” Fast, personal, and inevitably stateful.
Deletion: rare · Owner: humans (treat as a laptop)
For agents, per-task or per-branch makes the most sense. Per-branch is a strong compromise: warmed dependencies across multiple runs, but isolation between branches.
For humans, per-project or per-developer avoids flow-breaking cold starts.
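A sketch of the per-branch pattern, where git automation rather than a human owns deletion (delete_workspace is a hypothetical orchestrator call):

```python
import hashlib

def workspace_key(repo: str, branch: str) -> str:
    """One workspace per (repo, branch): warm deps across commits, isolation between branches."""
    return "ws-" + hashlib.sha256(f"{repo}#{branch}".encode()).hexdigest()[:12]

def on_branch_merged_or_deleted(repo: str, branch: str) -> None:
    """Wired to a git webhook, so cleanup is guaranteed rather than remembered."""
    delete_workspace(workspace_key(repo, branch))  # hypothetical orchestrator call
```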
What’s allowed to persist?
Persist deterministic, content-addressed artifacts. Discard per-run variance.
Safe to persist
- ✓ Package caches — keyed by lockfile hash
- ✓ Build caches — keyed by inputs
- ✓ Base images — OS + language runtimes
- ✓ Read-only fixtures — golden data
Dangerous to persist
- ✗ Runtime state — processes, ports, temp files
- ✗ Credentials — tokens, cookies, cached sessions
- ✗ CLI/browser auth state — ~/.config, keychains
- ✗ Git state — stashes, conflicts, uncommitted changes
- ✗ Database state — leftover test data
If a task needs a secret, inject it at runtime and guarantee it’s wiped on reset. Don’t store it in the workspace.
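A sketch of runtime injection, assuming Docker; an env var is the simplest carrier, and a mounted secret file works the same way (fetch_secret is a hypothetical secret-manager client):

```python
import subprocess

def run_with_secret(image: str, command: list[str], secret_name: str) -> None:
    """Inject a secret at runtime; it lives only inside the disposable container."""
    secret = fetch_secret(secret_name)  # hypothetical: your secret-manager client
    # --rm plus env injection: the secret never lands in the workspace, an image
    # layer, or a dotfile, so resetting the runtime is enough to wipe it.
    subprocess.run(
        ["docker", "run", "--rm", "-e", f"API_TOKEN={secret}", image, *command],
        check=True,
    )
```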
The common implementation pattern is layered storage (a mount sketch follows the list):
- Layer 0: base image
- Layer 1: shared caches
- Layer 2: workspace
- Layer 3: ephemeral runtime state
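A sketch of those layers as container mounts, assuming Docker (the cache path is illustrative):

```python
def runtime_mounts(workspace_dir: str, cache_dir: str) -> list[str]:
    """Assemble the layers; only layers 1 and 2 outlive the runtime."""
    return [
        # Layer 0 is the base image itself, pinned by digest at `docker run` time.
        # Layer 1: shared caches, content-addressed and safe to reuse across tasks.
        "-v", f"{cache_dir}/pnpm-store:/root/.local/share/pnpm/store",
        # Layer 2: the workspace (checkout, diffs, task state). Persists.
        "-v", f"{workspace_dir}:/workspace",
        # Layer 3: runtime state on tmpfs, gone when the container exits.
        "--tmpfs", "/tmp",
    ]
```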
Artifacts, caches, and provenance
Ephemeral runtimes destroy runtime state, but they should still produce durable evidence: logs, test output, screenshots, and build artifacts. Store those outputs with enough provenance to reproduce and audit the run.
The common trap is treating caches like artifacts (or vice versa). Caches are derived and disposable. Artifacts are durable and should be attributable to a specific run.
Minimum artifact metadata
{
  "artifact_type": "test-results|logs|docker-image|binary|screenshot",
  "created_at": "2026-01-15T10:30:00Z",
  "task_id": "task-xyz789",
  "agent_identity": "agent-app|machine-user",
  "repo": "github.com/org/repo",
  "commit": "abc123def456",
  "environment_identity": {
    "toolchain": "flake.lock-hash|image-digest",
    "lockfiles": ["pnpm-lock.yaml:sha256:..."],
    "snapshot_id": "optional"
  }
}
Cache safety rules (so caches don’t poison runs)
- Key caches by content (lockfile and input hashes), never by branch or task name
- Treat entries as immutable: write once, content-addressed, never updated in place
- Never cache anything from the “dangerous to persist” list (credentials, runtime state, git state)
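Content addressing is what makes shared caches safe in practice. A minimal sketch of key derivation (the deps- prefix is illustrative):

```python
import hashlib
from pathlib import Path

def cache_key(lockfile: str) -> str:
    """Content-addressed cache key: identical lockfile bytes yield the same key on any branch."""
    digest = hashlib.sha256(Path(lockfile).read_bytes()).hexdigest()
    return f"deps-{digest}"
```

A changed lockfile produces a new key, so a stale or poisoned entry is never overwritten or reselected; it simply ages out.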
Cold start vs. warm pools
Cold start is the enemy of fast feedback. “Fresh” doesn’t have to mean “slow,” but you need the right semantics.
In a pull-local model, the goal is to make cold starts rare: keep base images and caches warm locally, and keep remote snapshots small so restore is effectively “pull identity, hydrate from cache.”
Warm pool strategies
Keep a pool of ready environments that already did the slow setup work. Warm pools only count as ephemeral if every checkout includes a reset step: destroy the VM, or revert to a known snapshot before the task starts. A lease sketch follows the list.
- Pre-provisioned VMs: lease → run → reset/replace
- Snapshots: restore after slow setup
- Persistent caches: mount caches into fresh runtimes
- Branch prediction: pre-warm on PR open
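A sketch of reset-on-lease, assuming a VM handle with a revert_to_snapshot method (the method, the snapshot name, and provision_vm are all hypothetical):

```python
import queue

class WarmPool:
    """Pre-provisioned runtimes; reset-on-lease is what keeps them semantically fresh."""

    def __init__(self) -> None:
        self._ready: queue.Queue = queue.Queue()

    def lease(self):
        try:
            vm = self._ready.get_nowait()
        except queue.Empty:
            vm = provision_vm()              # hypothetical cold path: boot, pull, install
        vm.revert_to_snapshot("post-setup")  # the reset step that makes reuse safe
        return vm

    def release(self, vm) -> None:
        self._ready.put(vm)  # returning it dirty is fine; the next lease always resets first
```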
Warm capacity costs money when idle. Right-size pools based on usage: scale up during working hours, scale down nights and weekends.
Cost control
Ephemeral runtimes get expensive when you pay cold-start overhead repeatedly. Control cost by shrinking runtimes, sharing safe caches, and killing runaway tasks.
Cost control strategies
Right-size instances
Match CPU/RAM to the task. Most checks don’t need big boxes.
Prefer fine-grained billing
Per-second billing (often with a 1-minute minimum) helps short tasks.
Batch when isolation doesn’t matter
Run multiple checks per lease, then reset before unrelated work.
Set hard time limits
Timeouts prevent zombie spend. Cleanup on timeout, not just success.
Track per-task costs
Tag resources by task ID and review weekly.
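A sketch of hard limits with cleanup on every exit path (destroy_runtime and record_cost_event are hypothetical orchestrator hooks):

```python
import subprocess

def run_with_hard_limit(cmd: list[str], timeout_s: int, task_id: str) -> None:
    """Enforce a hard time limit; cleanup must run on timeout, not just success."""
    try:
        subprocess.run(cmd, timeout=timeout_s, check=True)
    except subprocess.TimeoutExpired:
        record_cost_event(task_id, "timeout")  # hypothetical: feeds the weekly cost review
        raise
    finally:
        destroy_runtime(task_id)  # hypothetical: runs on success, failure, and timeout
```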
A practical target: keep average compute under $0.10 per agent task. If you’re at $1+, tasks are too large or environments are too heavyweight.
Tools that help
- Sandboxes with checkpoint/restore — use checkpoints to get “fresh start” semantics without paying cold starts every run.
- Programmatic cloud VMs — pair with destroy/reset per task to keep runs attributable.
- Fast microVMs with an API — good if you’re building your own ephemeral runtime layer.
- Cloud workspaces with prebuilds — excellent for humans; usable for agents.
- MicroVM foundations for fast, isolated sandboxes — a common substrate for “ephemeral but fast.”
What goes wrong
Orphaned resources
Agent crashes mid-task. The VM keeps running. Always enforce cleanup on timeout.
Cache poisoning
A corrupted artifact enters a shared cache. Use content-addressed keys and treat entries as immutable.
State leakage
The runtime isn’t actually reset — a token, DB, or profile persists. Reset semantics must be enforced.
Cold start death spiral
Cold starts take minutes. People reuse stale boxes. Either fix cold start or embrace persistence.
Summary
- Agents need resettable runtimes. Hidden state is the root of subtle failures.
- Separate workspaces (persist) from runtimes (reset). Persist deterministic caches only.
- For human handoff, pull-local is powerful: pull environment identity (ref + image + snapshot), hydrate fast from caches.
- Warm pools + snapshots make “fresh” fast. Reset-on-lease is the key semantic.
- Control cost with right-sizing, timeouts, tagging, and batching.