A frontier-research brief on context management and instruction adherence for long-running, multi-surface AI agents — diagnosing why "Steve" silently swapped a requested PandaDoc contract for a Vercel page after using the correct tool an hour earlier, and what the field's best practitioners do about it.
A memory miss looks like "never knew the tool." This looks like "knew it, used it, and still drifted to the freshest groove." The literature is unambiguous that long sessions structurally degrade an agent's grip on early instructions — and that the fix is engineering, not bigger context.
Three findings recur across every serious lab and practitioner: long context degrades silently, the freshest tokens win, and adherence to an early instruction decays as the window fills.
Bigger context windows do not buy reliability. Performance falls as token count rises, and the information in the middle of a long context is the first to be lost. This is measured, reproducible, and present in every frontier model — including Claude.
Accuracy is highest when relevant content sits at the start or end of context and drops >30 points when it falls in the middle — even in models built for long context. Root cause: RoPE long-term decay reduces attention on distant tokens. An instruction given at turn 1 lands in the dead zone by turn 30.
All 18 models tested degrade as input grows, far below their max window. Claude models measurably degrade past ~8K words; focused ~300-token prompts beat full-context prompts by the widest margin of any vendor. Counter-intuitively, shuffled context outperformed coherent context — models pattern-match on local recency, they don't reason over the whole window.
Of 12 models claiming ≥128K support, 10 dropped below 50% of their short-context accuracy at 32K tokens. Top performer GPT-4o fell from 99.3% → 69.7%. The test removes keyword overlap, exposing true retrieval — i.e. the real ability to honor a buried instruction.
Across 15 LLMs and 200,000+ conversations, single-turn → multi-turn caused a 39% average performance drop and +112% variance. GPT-4.1: 96.6% → 72.6%. Claude 3.7 Sonnet: 78% → 65.6%. The failure isn't lost capability — it's doubled unreliability: the model "gets lost" and rarely recovers. (Reinforced by Multi-IF, arXiv:2410.15553, and the IFEval baseline, arXiv:2311.07911.)
The mechanism behind "did the thing it had just done six times": the most recent tokens carry the most attention weight, independent of relevance. This is the direct engine of silent substitution.
Attention weight on early-context content is measurably lower than on late content — independent of relevance. Their "attention sorting" fix (reorder by attention, re-run) improves output, proving the bias is a learned architectural preference, not a content effect.
Injecting a recency signal flips model preference between equally-relevant items up to 25% of the time and shifts ranking up to 95 positions. No model is immune. Six Vercel deploys at the tail of the context outweigh one PandaDoc instruction at the head.
The field has converged on a small set of context-engineering moves. None of them is "use a bigger window."
The guiding principle: "the smallest set of high-signal tokens that maximize the likelihood of some desired outcome." Named techniques: compaction (summarize history, reinitiate window), structured note-taking (external memory), just-in-time retrieval (keep pointers, not payloads), sub-agent architectures (each returns a 1–2K-token distillate, not its full context).
Server-side clearing of stale tool results (clear_tool_uses_20250919) with an exclude_tools allowlist for results that must never be cleared. Measured: +29% from context editing alone, +39% combined with the file-based memory tool, and an 84% token reduction over a 100-turn run.
The single most relevant technique. Manus continuously rewrites a todo.md, "reciting its objectives into the end of the context… pushing the global plan into the model's recent attention span, avoiding 'lost-in-the-middle.'" Also: file system as unlimited memory; mask tools (don't remove them); leave wrong turns in context so the model conditions on failure.
Tiered memory: the context window is "RAM," an external store is "disk." The agent promotes, evicts, and summarizes — and maintains a small self-editing "core memory" scratchpad that persists in-context across turns. This is the architecture Letta (already in Steve's stack) implements.
The disciplined alternative to mega-sessions: each new session starts clean and reads state artifacts (a feature-list JSON, a progress file, git log) rather than inheriting the full prior conversation. Anthropic is explicit that compaction alone is not sufficient — durable external state is required.
The principle that names our failure: "Actions carry implicit decisions, and conflicting decisions carry bad results." Choosing Vercel over PandaDoc is a decision — one that must be surfaced and shared, never silently embedded in an action. (Cognition's 2026 update: writes stay single-threaded; extra agents add intelligence, not actions.)
expected_output_type / deliverable_tool so any verifier can check the final artifact against what was promised.The PandaDoc→Vercel swap is not one bug. It is four documented mechanisms stacking inside one over-long session — exactly the conditions the literature predicts will produce silent substitution.
Eight unrelated deliverables in one window. By the contract step, accumulated tool output had pushed the session into the measured degradation zone — recall of the early PandaDoc instruction was structurally impaired, not "forgotten."
Six recent Vercel deploys formed the highest-attention tokens at the decision point. "Deploy to Vercel" was the path of least resistance because it was the freshest groove — exactly the documented effect.
The named deliverable ("in PandaDoc") was stated once, at the top, and never re-cited into recent context. With no living deliverable-contract, the goal sat in the dead middle when it mattered most.
The choice to substitute Vercel for PandaDoc was an implicit decision embedded in an action and never surfaced. No guard forced "I can't do X on the named tool — here are on-target options" before delivering something else.
Verdict, restated: the in-flight Capability-Ledger work (the memory side) is necessary but orthogonal — it prevents false "I can't." It would not have caught this, because the agent could and knew it could. This failure lives entirely in context management + instruction adherence. HIGH
Five distinct, composable interventions. Each is rated for effort and impact and mapped to both surfaces. They are not mutually exclusive — the recommendation stacks them.
Maintain a short ACTIVE ASKS block — every user-named deliverable with its pinned tool/format/channel — and rewrite it at the top of each work block (Manus recitation + Laban "Recap"). Keeps the original ask in the high-attention recent window instead of the dead middle.
The recitation surface is a canonical state file (+ Letta core-memory block). Every dispatched task and voice turn re-reads and rewrites it before acting — never trusts the rolling thread.
Use the native todo (TaskCreate/Update) as the recitation surface; re-read it before each phase. Add an "ACTIVE ASKS" line that survives /compact.
Clear stale tool results and compact finished work blocks so the window stays high-signal. Anthropic's measured +29–39% and 84% token cut. Use exclude_tools to mark the deliverable-contract as never-cleared.
Enable platform context-editing on the API path; compact per task boundary. Letta already provides the tiered-memory backing store for evicted content.
Largely harness-native: /compact between phases, /clear on topic change. Don't rely on auto-compaction to preserve a named deliverable — pin it explicitly.
Stop running 8 unrelated deliverables down one thread. Spawn a fresh sub-agent / session per separable task (Anthropic's three spawn signals: context-pollution, parallelizable, tool-specialized), each booting from a 1–2K-token handoff brief. Structural prevention of saturation.
Already structural — dispatched/background tasks are fresh contexts. Enforce: each boots from a handoff brief + canonical state, never the parent thread. This is the highest-leverage move here.
Use the Agent/Task tool for research and large reads (return distillates, not dumps); session-handoff skill + a project state file for cross-session continuity.
A Stop-hook that parses the session for user-named deliverables ("in PandaDoc", "send via", "as a <format>"), checks the final artifacts against them, and blocks if a named deliverable has no matching artifact and no recorded approval-to-swap. The direct, deterministic catch for this exact failure. (NeMo-Guardrails-style pre-execution rail + SagaLLM-style output-contract validation.)
instruction-fidelity-guard.py as a Stop hook + a Standing Order; self-healing where the on-target path exists, escalates a flag to Victor where it doesn't.
Same hook in the CLI Stop-hook chain (sibling to the existing wrap-up / permission guards). Fires before the turn can close on a substituted deliverable.
Push durable state to files (MemGPT/Letta tiered memory; Manus file-system-as-context; Anthropic state artifacts). A 5-layer handoff brief (state · narrative · decisions · priority queue · gotchas) lets any new context or surface resume without replaying history — and carries the deliverable-contract across the boundary.
Native fit: Letta + canonical state files already exist. Standardize the handoff-brief schema so every surface (Phone, Desktop, Mission Control, dispatch) reads/writes the same contract.
A project state file + session-handoff skill; the brief is the first thing a resumed session reads, and the last thing a closing session writes.
Don't pick one. The failure is multi-causal, so the defense is layered — three layers, in priority order, plus the orthogonal memory work already in flight. Behavioral discipline is cheap but leaks; the runtime guard is what makes it real (CLAUDE.md core value #8: real fixes are scripted, not text).
Every session maintains a pinned ACTIVE-ASKS contract — user-named deliverables with their tool/format/channel — rewritten into recent context at each work block and carried in the handoff brief. Highest ROI, lowest effort. Ships as the context-discipline + instruction-fidelity skills (already scaffolded this session).
Counters: lost-in-the-middle, no-re-anchoring. Both surfaces, day one.
A Stop-hook that blocks turn-close when a named deliverable has no matching artifact and no approval-to-swap, emitting the deviation flag instead. This is the deterministic catch for the PandaDoc class — the one layer that would have stopped this specific failure rather than just made it less likely.
Counters: silent substitution, recency-driven action drift. Runtime teeth on both surfaces.
Fresh sub-agent/session per separable task, booting from a handoff brief; context-editing + compaction to keep each window high-signal. Structural — it removes the saturation that creates the drift in the first place. Ships as the session-segmentation skill.
Counters: context saturation / rot. Native to OpenClaw's dispatch model; via Agent-tool + /compact on Claude Code.
Segmentation is already structural (dispatched tasks = fresh contexts) — so the win is enforcing that each boots from a shared handoff-brief + Letta core-memory contract, and adding the Stop-hook guard + Standing Order across Phone/Desktop/Mission Control. Letta already gives you MemGPT-style tiered memory; standardize the contract schema so parity holds across every surface.
Most exposed to in-session saturation. Lean on harness-native moves (todo as recitation surface, /compact, /clear, Agent-tool offload) + the same instruction-fidelity-guard.py in the Stop-hook chain. Don't trust auto-compaction to preserve a named deliverable — pin it in the todo and let the guard verify it at turn-close.
Implementation (building instruction-fidelity-guard.py + Standing Order, wiring the recitation contract into both surfaces, and the handoff-brief schema) is the next phase, on your go. Three durable skills — context-discipline, instruction-fidelity, session-segmentation — were distilled into the toolbox during this research, per the standing instruction, so the substrate already compounds.
23 primary sources, fetched and verified. Academic findings carry their arXiv/venue IDs; practitioner writeups link to the original posts.