Anthropic found that even frontier models fail in 4 specific ways when you let them work for longer than 5 minutes. I hit all of them before I figured out the fix.
Not in theory. In practice. I had an agent that was supposed to build a working authentication module over a weekend. By Monday morning it had committed 23 times, declared the feature done three separate times, and left the app in a state where npm start didn’t even work. The commits looked great in the log. The code did not work.
That weekend cost me two days of debugging. But it taught me more about agent orchestration than any paper or demo. This post is what I wish someone had written before I started.
The 4 Failure Modes (And Why They Happen)
Anthropic’s research on long-running agents identified four specific failure modes that show up consistently when you push past the 5-minute mark. Understanding them is the first step to fixing them.
1. Declaring victory too early. The agent finishes a visible part of the task, reports success, and stops. The auth routes exist but tokens never expire. The API returns 200 but the response body is wrong. The agent sees a green checkmark in its own output and calls it done.
2. Leaving buggy state. The agent makes a change, hits an error, works around it, and moves on. But the workaround introduced a silent bug — a missing migration, a hardcoded URL, a race condition that only shows up under load. The agent doesn’t circle back.
3. Marking features done prematurely. This is the most dangerous one. The agent maintains some kind of task list and marks items complete based on whether code was written, not whether the code works. A feature marked “done” in the task tracker can be completely broken in production.
4. Spending time figuring out how to run the app. The agent burns 30 minutes trying to figure out the project’s build system, another 20 on environment variables, and another 15 on why Docker won’t start. Zero minutes on actual engineering.
These aren’t model problems. They’re orchestration problems. The model is capable. The system around it is not.

The Initializer Agent Pattern
The first pattern that actually worked for me came from Anthropic’s playbook: the initializer agent. Instead of telling one agent to build an entire feature, you split the work into two phases.
The initializer agent does one thing: it reads the codebase, understands the structure, and sets up the project scaffolding. That means a feature list file holding a structured JSON list of end-to-end feature descriptions, a git repository, an init.sh script that can run the development server, and a progress notes file that the next agent will read. Not code. The scaffolding for code.
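A minimal sketch of what that scaffolding step could write to disk. The file names (feature_list.json, progress.md), the "passes" field, and the scaffold function are assumptions made for illustration, not Anthropic's exact layout, and the git init step is left out to keep the example dependency-free.

```python
import json
import stat
from pathlib import Path


def scaffold(root: Path, features: list[str], start_cmd: str = "npm start") -> None:
    """Write the files a worker agent reads before it touches any code."""
    root.mkdir(parents=True, exist_ok=True)

    # Every feature starts as not passing; only a verified run should flip it.
    feature_list = [
        {"id": i + 1, "description": text, "passes": False}
        for i, text in enumerate(features)
    ]
    (root / "feature_list.json").write_text(json.dumps(feature_list, indent=2))

    # One known-good way to start the dev server, so no session has to rediscover it.
    init_sh = root / "init.sh"
    init_sh.write_text(f"#!/usr/bin/env bash\nset -euo pipefail\n{start_cmd}\n")
    init_sh.chmod(init_sh.stat().st_mode | stat.S_IXUSR)

    # Free-form handoff notes for the next session.
    (root / "progress.md").write_text("# Progress notes\n\nNo features started yet.\n")
```

Calling scaffold(Path("my-app"), ["User can log in", "Tokens expire after 15 minutes"]) would leave three files in my-app for the worker to pick up.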
Then a second agent — the worker — executes that plan one step at a time. One feature. One commit. One test run. Then the next.
This separation matters because it prevents the agent from simultaneously planning and doing, which is where most of the failure modes originate. When an agent is trying to figure out what to build and how to build it at the same time, it cuts corners on both.
Git Commits as a Recovery Mechanism
Here’s a pattern that saved me more than once: every meaningful agent action produces a git commit. Not as an afterthought. As the primary unit of progress.
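One way to wire that up, as a sketch rather than a definitive implementation: a small helper the orchestrator calls after each agent step. The function names and the "step N: feature" message format are assumptions of mine. The git commands themselves (git add -A, git commit -m) are standard.

```python
import subprocess
from pathlib import Path


def commit_message(step: int, feature: str, why: str) -> str:
    """Subject line names the step; the body records the agent's stated reason."""
    return f"step {step}: {feature}\n\n{why}"


def commit_step(repo: Path, step: int, feature: str, why: str) -> None:
    """Stage everything in the repo and commit it as one unit of progress."""
    subprocess.run(["git", "add", "-A"], cwd=repo, check=True)
    # check=True raises CalledProcessError if git refuses, for example when
    # the step changed nothing and there is nothing to commit.
    subprocess.run(
        ["git", "commit", "-m", commit_message(step, feature, why)],
        cwd=repo,
        check=True,
    )
```

Keeping the message format in its own function means the git log stays uniform no matter which agent session produced the commit.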
Why this works:
- Recovery. When the agent goes off the rails (and it will), you git bisect to find where things broke. You don’t debug the agent’s reasoning. You debug the diff.
- Audit trail. Every commit tells you what the agent did and why. The commit message is the agent’s explanation of its own work.
- Rollback granularity. If step 7 of a 12-step plan introduces a bug, you revert step 7. You don’t throw away the entire weekend’s work.
I run this with a branch-per-agent pattern. The agent works on feature/auth-module. When it’s done and tests pass, I merge to main. If it fails, the main branch stays clean and I can inspect the mess in isolation.
Incremental Progress: One Feature at a Time
The Faros 2026 survey found that 85% of developers now use AI tools in their workflow. Engineering throughput is up. But quality is down. They call this “acceleration whiplash” — the gap between how fast you can generate code and how fast you can verify it works.
The fix is boring and effective: one feature at a time, verified before moving on.
Here’s what that looks like in practice with a LangGraph workflow:
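What follows is a library-free sketch of that control flow, not LangGraph code. The Feature class, run_features, and the max_attempts cap are hypothetical names chosen for illustration. The verify callback stands in for whatever gate you use, such as a LangGraph interrupt() that waits for a human or a test suite the agent didn't write.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Feature:
    name: str
    done: bool = False
    attempts: int = 0


def run_features(
    features: list[Feature],
    implement: Callable[[Feature], None],
    verify: Callable[[Feature], bool],
    max_attempts: int = 3,
) -> list[Feature]:
    """Work through features in order; stop at the first one that can't be verified."""
    for feature in features:
        while not feature.done and feature.attempts < max_attempts:
            feature.attempts += 1
            implement(feature)              # the agent writes code
            feature.done = verify(feature)  # something other than the agent decides
        if not feature.done:
            break  # never start the next feature on top of a broken one
    return features
```

The important design choice is the break: an unverified feature halts the run instead of letting later features pile up on top of it.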
The key insight: the verify step uses LangGraph’s interrupt() for a human checkpoint. The agent doesn’t get to decide if its own work is done. That decision belongs to a person, or at minimum to an automated end-to-end test suite that the agent didn’t write.
Context Engineering: The Discipline Nobody Talks About
Context engineering — the practice of carefully controlling what information an agent has access to at any given point in time — has emerged as a discipline in its own right in 2026. ByteByteGo’s trends report identified persistent always-on agents as a key trend, and the industry is waking up to the fact that context management is what separates agents that drift from agents that deliver.
This matters enormously for long-running agents. A 5-minute task can hold everything in context. A 3-day task cannot. And the quality of your context management directly determines whether the agent stays coherent or drifts into hallucination.
Microsoft Conductor approaches this with explicit context control modes:
Three modes, each solving a different problem:
- accumulate: the agent needs full history. Use this for planning and analysis.
- last_only: the agent only needs the current plan. Use this for implementation, where old context is noise.
- explicit: you hand-pick what the agent sees. Use this for review and testing, where irrelevant context causes the agent to make connections that don’t exist.
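To make the difference between the three modes concrete, here is a small Python illustration. It is not Conductor's configuration format or API. The build_context function and its parameters are hypothetical, and only the three mode names come from the description above.

```python
from typing import Sequence


def build_context(
    mode: str,
    history: Sequence[str],
    selected: Sequence[int] = (),
) -> list[str]:
    """Return the slice of prior step outputs the next step is allowed to see."""
    if mode == "accumulate":
        return list(history)                    # everything so far
    if mode == "last_only":
        return list(history[-1:])               # only the most recent output
    if mode == "explicit":
        return [history[i] for i in selected]   # only hand-picked entries
    raise ValueError(f"unknown context mode: {mode!r}")
```

With history = ["plan", "impl notes", "test log"], the accumulate mode returns all three entries, last_only returns just the test log, and explicit with selected=[0, 2] returns the plan and the test log.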
I use a similar pattern in my own setup. Each agent step gets a context manifest — a specific list of files, documentation sections, and previous outputs. The agent doesn’t get to “look around” the codebase. It gets exactly what it needs and nothing more.
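A context manifest along those lines could be as small as the sketch below. The ContextManifest class, its field names, and the rendered heading format are assumptions for illustration only. Documentation sections are treated as ordinary files here.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class ContextManifest:
    step: str
    files: list[str] = field(default_factory=list)          # paths relative to the repo root
    prior_outputs: list[str] = field(default_factory=list)  # text from earlier steps

    def render(self, root: Path) -> str:
        """Concatenate only the listed files and outputs into one prompt section."""
        parts = [f"## Step: {self.step}"]
        for rel in self.files:
            parts.append(f"### {rel}\n{(root / rel).read_text()}")
        for text in self.prior_outputs:
            parts.append(f"### Previous output\n{text}")
        return "\n\n".join(parts)
```

Anything not named in the manifest never reaches the prompt, which is the whole point: the agent can't wander into files it wasn't given.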

End-to-End Testing as the Source of Truth
Anthropic’s recommendation to use Puppeteer MCP, which drives a real browser to test your agent’s work, changed how I validate agent output. Unit tests can pass while the app is broken. A browser doesn’t lie.
This test runs after every agent commit. If it fails, the agent’s “done” status is revoked and it goes back to fix the issue. No human needed for verification. The browser is the judge.
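The revocation rule can be sketched in a few lines of Python. This assumes a feature list stored as JSON with "id" and "passes" fields, and an end-to-end suite that signals failure through a non-zero exit code. Both are assumptions, and run_e2e and gate_feature are hypothetical helper names.

```python
import json
import subprocess
from pathlib import Path
from typing import Sequence


def run_e2e(cmd: Sequence[str]) -> bool:
    """Run the end-to-end suite; exit code 0 is the only definition of passing."""
    return subprocess.run(list(cmd)).returncode == 0


def gate_feature(feature_file: Path, feature_id: int, e2e_passed: bool) -> bool:
    """Set the feature's status from the E2E result, overriding the agent's claim."""
    features = json.loads(feature_file.read_text())
    for feature in features:
        if feature["id"] == feature_id:
            feature["passes"] = e2e_passed
    feature_file.write_text(json.dumps(features, indent=2))
    return e2e_passed
```

A post-commit hook could then call gate_feature(Path("feature_list.json"), 3, run_e2e(["npm", "run", "test:e2e"])), where test:e2e is a placeholder for whatever script launches your browser tests.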
The Real Gap: Spend vs. Outcomes
The Faros 2026 Engineering Report tells a story most teams are living right now: volume is up, quality is down, and the gap between the two is widening. The “AI Acceleration Whiplash” — their term for this pattern — shows that throwing more AI at the problem without investing in the surrounding infrastructure just makes the gap bigger.
This isn’t because AI tools are bad. It’s because we’re using them the way we use hammers — swinging at everything and hoping something gets nailed down. The teams that are closing this gap are the ones investing in the unglamorous infrastructure around the agents:
- State management. LangGraph, Conductor, or a custom state machine. The agent needs a brain that persists across sessions.
- Recovery mechanisms. Git commits, checkpoints, rollback procedures. The agent will fail. The question is how fast you recover.
- Verification pipelines. E2E tests, human gates, automated quality checks. The agent doesn’t grade its own homework.
- Context discipline. Controlled inputs, explicit context modes, no free-roaming access to the codebase.

What I’d Do Differently
If I were starting over today, here’s the stack I’d build on:
- Initializer + worker split. Always. No exceptions. Planning and doing are different cognitive tasks and they need different agent configurations.
- Git as the source of truth. Every agent action is a commit. Every commit is verified. The git log is the project timeline.
- Human gates at every feature boundary. The agent proposes. The human disposes. At least until the verification pipeline is mature enough to replace that gate, and that takes months, not days.
- Context control from day one. Don’t wait until the agent starts hallucinating. Build the context manifest system before you need it.
- E2E tests written by humans, run by agents. The test suite is the contract. The agent’s job is to fulfill it. The human’s job is to define what “fulfill” means.
The agents that work across days aren’t smarter than the ones that work across minutes. They’re wrapped in better engineering. The model is the easy part. The system around it is where the real work lives.
