Long-running agent stack
A long-running agent is a model inside an execution system.
Chat is enough for a 5-minute task. A 2-hour coding session needs more structure. Once an agent runs for a long time, the real problem becomes state: what it knows, what it changed, what it forgot, and how it recovers when something goes wrong.
The model matters, but the stack around the model matters more than people expect.
The model
The model is the part everyone talks about. Claude, Codex, Gemini, DeepSeek, whatever comes next.
It decides what to do next. It reads the current context, chooses tools, writes code, explains tradeoffs, and decides when it is done.
But the model does not actually do anything by itself. It does not read your repo. It does not edit files. It does not run tests. It produces text that asks the agent loop to do those things.
That distinction matters. A better model gives you better decisions, but it still needs the rest of the system to turn those decisions into work.
The loop
The loop is the runtime.
It sends context to the model, receives tool calls, runs those tools, appends the results, then calls the model again. That keeps going until the model returns a normal message with no tool calls.
For a short task, the loop might run 3 times.
For a long task, it might run 80 times. It reads files, edits code, runs tests, inspects failures, patches the code again, checks git diff, and writes a final response.
This is where chat stops being a useful mental model. A long-running agent is closer to a small process manager wrapped around an LLM.
Tools
Tools are where the work happens.
The model can ask to read a file, run rg, apply a patch, start a dev server, take a screenshot, or run tests. The loop decides whether that request is valid and executes it.
Bad tools make agents useless. If the agent can only write code but cannot run it, it is just autocomplete with extra steps.
Good tools give the model fast feedback. It can inspect the repo before editing. It can run the narrow test first, then the broader test. It can look at the actual screenshot instead of guessing from CSS.
For coding agents, the minimum useful tool stack is:
- read files
- search the repo
- edit files
- run shell commands
- run tests
- inspect git diff
Frontend work needs more: browser automation, screenshots, and some way to verify that the page is not blank or broken on mobile.
Workspace
The workspace is the agent's world.
For coding, that means a real checkout of the repo. The agent needs the same files, package scripts, tests, local conventions, and dirty git state that a human engineer would see.
This is why repository instructions matter. AGENTS.md, CLAUDE.md, local docs, test names, npm scripts, and existing code style all become part of the environment the agent reads from.
Long-running agents get worse when the workspace is fake. A pasted file, a copied stack trace, or a partial code sample can work for one bug. It breaks down once the task spans 12 files and three test suites.
The agent needs the real project.
Memory
Memory is the weak point.
The model does not remember the previous turn by itself. The agent reconstructs memory from the conversation, tool results, summaries, files, and sometimes git history.
That works until the session gets too large. Then the agent has to compact old context into a summary.
Compaction is useful, but it loses detail. A summary might preserve the goal and the current plan while dropping a tiny constraint from 40 minutes ago. That tiny constraint might be the reason the next edit is wrong.
This is one reason long-running agents drift. They do not suddenly become stupid. They lose some of the local texture that made earlier decisions make sense.
Good agents write important state back into durable places: files, TODOs, tests, comments, issue notes, or the final diff. Anything that only lives in the conversation is fragile.
Plan
A long-running agent needs a plan, but the plan cannot be sacred.
The first plan is usually based on incomplete information. After the agent reads the code, runs tests, or hits a failing typecheck, the plan should change.
The useful version is a short working plan:
- inspect the current behavior
- make the smallest code change
- run the nearest verification
- broaden verification if the change touches shared behavior
- report the diff and remaining risk
The plan is there to keep the agent oriented. Locking the agent into a guess it made before reading the code makes the plan worse.
Checkpoints
Long runs need checkpoints because long runs fail in boring ways.
The terminal hangs. The dev server picks the wrong port. A test suite takes 14 minutes. The user edits a file halfway through. The context window fills up. The agent realizes it misunderstood the task after making 5 edits.
A checkpoint gives the system a place to resume from.
For coding agents, git is the best checkpoint system we already have. The agent can inspect the diff, see what changed, and recover from partial work without pretending the conversation is the source of truth.
This is also why agents should leave clean diffs. The human reviewer needs the diff, and so does the next agent. After compaction, even the same agent may need the diff to understand what happened.
Verification
Verification is the boundary between code generation and software engineering.
An agent that writes code but never runs it is guessing. Sometimes the guess is good. It is still a guess.
For backend work, verification is usually tests, typecheck, lint, or a local command that reproduces the bug.
For frontend work, verification needs visual checks too. A passing build does not tell you that the button text overflows on mobile or that the canvas rendered blank.
Long-running agents need cheap verification loops. The agent should not wait until the end to run everything. It should run the closest check after each meaningful change, then run broader checks when the shape of the fix is stable.
Human review
The human is still part of the stack.
The agent can do the search, make the edit, run the test, and write the summary. The human decides whether the tradeoff is acceptable.
This matters more on long tasks because long tasks contain more judgment calls. The agent might choose a local fix over a deeper refactor. It might skip a slow test. It might preserve an ugly pattern because changing it would touch too much code.
Those choices are not always wrong. They just need to be visible.
A good final response says what changed, what passed, and what risk remains. It does not bury the tradeoff under a confident sentence.
The stack
So the stack looks like this:
- Model: decides what to do next.
- Loop: keeps calling the model and running tools.
- Tools: read, edit, execute, inspect.
- Workspace: real repo, real scripts, real state.
- Memory: conversation, summaries, files, git history.
- Plan: short, current, replaceable.
- Checkpoints: resumable state.
- Verification: tests, typecheck, screenshots, logs.
- Human review: judgment and acceptance.
A useful long-running agent needs a system that can survive context loss, bad edits, interrupted runs, slow tests, and boring verification.