The Drift Brief — Context Management & Instruction Adherence for Long-Running Agents

The frontier research

Three findings recur across every serious lab and practitioner: long context degrades silently, the freshest tokens win, and adherence to an early instruction decays as the window fills.

A · Long sessions degrade — well before the window is "full" HIGH

Bigger context windows do not buy reliability. Performance falls as token count rises, and the information in the middle of a long context is the first to be lost. This is measured, reproducible, and present in every frontier model — including Claude.

"Lost in the Middle" — the U-shaped curve

Liu et al., TACL 2024 · arXiv:2307.03172

Accuracy is highest when relevant content sits at the start or end of context and drops >30 points when it falls in the middle — even in models built for long context. Root cause: RoPE long-term decay reduces attention on distant tokens. An instruction given at turn 1 lands in the dead zone by turn 30.

"Context Rot" — degradation below max capacity

Chroma Research (Hong, Troynikov, Huber), Jul 2025

All 18 models tested degrade as input grows, far below their max window. Claude models measurably degrade past ~8K words; focused ~300-token prompts beat full-context prompts by the widest margin of any vendor. Counter-intuitively, shuffled context outperformed coherent context — models pattern-match on local recency, they don't reason over the whole window.

NoLiMa — adherence collapses at 32K tokens

Modarressi et al. (Adobe), ICML 2025

Of 12 models claiming ≥128K support, 10 dropped below 50% of their short-context accuracy at 32K tokens. Top performer GPT-4o fell from 99.3% → 69.7%. The test removes keyword overlap, exposing true retrieval — i.e. the real ability to honor a buried instruction.

Multi-turn collapse — the headline number

Laban et al., 2025 · arXiv:2505.06120 (ICLR'26 Best Paper)

Across 15 LLMs and 200,000+ conversations, single-turn → multi-turn caused a 39% average performance drop and +112% variance. GPT-4.1: 96.6% → 72.6%. Claude 3.7 Sonnet: 78% → 65.6%. The failure isn't lost capability — it's doubled unreliability: the model "gets lost" and rarely recovers. (Reinforced by Multi-IF, arXiv:2410.15553, and the IFEval baseline, arXiv:2311.07911.)

B · The freshest tokens win — recency bias drives the swap HIGH

The mechanism behind "did the thing it had just done six times": the most recent tokens carry the most attention weight, independent of relevance. This is the direct engine of silent substitution.

Early content is structurally under-attended

Peysakhovich & Lerer, 2023 · arXiv:2310.01427

Attention weight on early-context content is measurably lower than on late content — independent of relevance. Their "attention sorting" fix (reorder by attention, re-run) improves output, proving the bias is a learned architectural preference, not a content effect.

Recency signal overrides relevance

Fang et al., SIGIR-AP 2025 · arXiv:2509.11353

Injecting a recency signal flips model preference between equally-relevant items up to 25% of the time and shifts ranking up to 95 positions. No model is immune. Six Vercel deploys at the tail of the context outweigh one PandaDoc instruction at the head.

Honest gap: no single paper isolates "action-recency bias in tool selection" in the exact PandaDoc→Vercel form. The diagnosis is triangulated from position-bias, recency-bias, and context-rot evidence above — each independently robust. MEDIUM on the precise tool-selection mechanism; HIGH that recency + saturation produce exactly this class of drift.

C · The toolkit — what the best labs actually do HIGH

The field has converged on a small set of context-engineering moves. None of them is "use a bigger window."

Anthropic — context as a finite resource

"Effective context engineering," Sep 2025

The guiding principle: "the smallest set of high-signal tokens that maximize the likelihood of some desired outcome." Named techniques: compaction (summarize history, reinitiate window), structured note-taking (external memory), just-in-time retrieval (keep pointers, not payloads), sub-agent architectures (each returns a 1–2K-token distillate, not its full context).

Anthropic — Memory Tool + Context Editing

Claude Developer Platform, Sep 2025

Server-side clearing of stale tool results (clear_tool_uses_20250919) with an exclude_tools allowlist for results that must never be cleared. Measured: +29% from context editing alone, +39% combined with the file-based memory tool, and an 84% token reduction over a 100-turn run.

Manus — recitation beats drift

Yichao "Peak" Ji, Jul 2025

The single most relevant technique. Manus continuously rewrites a todo.md, "reciting its objectives into the end of the context… pushing the global plan into the model's recent attention span, avoiding 'lost-in-the-middle.'" Also: file system as unlimited memory; mask tools (don't remove them); leave wrong turns in context so the model conditions on failure.

MemGPT / Letta — memory as an OS

Packer et al., 2023 · arXiv:2310.08560

Tiered memory: the context window is "RAM," an external store is "disk." The agent promotes, evicts, and summarizes — and maintains a small self-editing "core memory" scratchpad that persists in-context across turns. This is the architecture Letta (already in Steve's stack) implements.

Anthropic — fresh context per task + state artifacts

"Effective harnesses for long-running agents," 2025

The disciplined alternative to mega-sessions: each new session starts clean and reads state artifacts (a feature-list JSON, a progress file, git log) rather than inheriting the full prior conversation. Anthropic is explicit that compaction alone is not sufficient — durable external state is required.

Cognition — actions carry implicit decisions

Walden Yan, "Don't Build Multi-Agents," Jun 2025

The principle that names our failure: "Actions carry implicit decisions, and conflicting decisions carry bad results." Choosing Vercel over PandaDoc is a decision — one that must be surfaced and shared, never silently embedded in an action. (Cognition's 2026 update: writes stay single-threaded; extra agents add intelligence, not actions.)

D · Instruction fidelity — re-anchoring & deviation-flagging HIGH

Re-injection / "Recap" works. Laban et al. tested restating constraints before generation: GPT-4o recovered from 59.1% → 76.6%. OpenAI's GPT-4.1 guide formalizes three persistent reminders (~20% gain on SWE-bench); Anthropic recommends placing the active query/constraints at the end of context for up to a 30% quality gain (the recency exploit, used for us).
Deviation-flagging needs to be built — no framework gives it for free. Reflexion (arXiv:2303.11366) catches mismatches only at episode end (retry, not pre-flight flag). NeMo Guardrails (arXiv:2310.10501) can block a disallowed action before execution, and Guardrails AI can fail an output that doesn't match a schema — but "ask before substituting" must be explicitly programmed.
Constrained planning pins the deliverable. SagaLLM (arXiv:2503.11951, VLDB 2025) runs independent validators against each step's declared output contract, with rollback on failure. The production pattern is a plan schema that records expected_output_type / deliverable_tool so any verifier can check the final artifact against what was promised.

Our failure, mapped

The PandaDoc→Vercel swap is not one bug. It is four documented mechanisms stacking inside one over-long session — exactly the conditions the literature predicts will produce silent substitution.

The original instruction decays into the lost-in-the-middle zone while six recent Vercel deploys dominate attention. At the friction point, recency wins — and nothing forces a flag.

Root-cause decomposition — all four, stacked

1 · Context saturation

→ Context Rot, NoLiMa, multi-turn collapse

Eight unrelated deliverables in one window. By the contract step, accumulated tool output had pushed the session into the measured degradation zone — recall of the early PandaDoc instruction was structurally impaired, not "forgotten."

2 · Recency bias

→ Peysakhovich & Lerer; Fang et al.

Six recent Vercel deploys formed the highest-attention tokens at the decision point. "Deploy to Vercel" was the path of least resistance because it was the freshest groove — exactly the documented effect.

3 · No re-anchoring on the original ask

→ Lost in the Middle; Manus recitation absent

The named deliverable ("in PandaDoc") was stated once, at the top, and never re-cited into recent context. With no living deliverable-contract, the goal sat in the dead middle when it mattered most.

4 · No deviation flag

→ Cognition "actions carry implicit decisions"

The choice to substitute Vercel for PandaDoc was an implicit decision embedded in an action and never surfaced. No guard forced "I can't do X on the named tool — here are on-target options" before delivering something else.

Verdict, restated: the in-flight Capability-Ledger work (the memory side) is necessary but orthogonal — it prevents false "I can't." It would not have caught this, because the agent could and knew it could. This failure lives entirely in context management + instruction adherence. HIGH

The options

Five distinct, composable interventions. Each is rated for effort and impact and mapped to both surfaces. They are not mutually exclusive — the recommendation stacks them.

Deliverable-Contract Recitation

Effort · Impact ●●●

Maintain a short ACTIVE ASKS block — every user-named deliverable with its pinned tool/format/channel — and rewrite it at the top of each work block (Manus recitation + Laban "Recap"). Keeps the original ask in the high-attention recent window instead of the dead middle.

OpenClaw Steve

The recitation surface is a canonical state file (+ Letta core-memory block). Every dispatched task and voice turn re-reads and rewrites it before acting — never trusts the rolling thread.

Claude Code Steve

Use the native todo (TaskCreate/Update) as the recitation surface; re-read it before each phase. Add an "ACTIVE ASKS" line that survives /compact.

CostNegligible tokens; pure prompt/skill discipline.

RiskBehavioral unless enforced — pair with Option D for teeth.

Context Editing & Compaction

Effort · Impact ●●●

Clear stale tool results and compact finished work blocks so the window stays high-signal. Anthropic's measured +29–39% and 84% token cut. Use exclude_tools to mark the deliverable-contract as never-cleared.

OpenClaw Steve

Enable platform context-editing on the API path; compact per task boundary. Letta already provides the tiered-memory backing store for evicted content.

Claude Code Steve

Largely harness-native: /compact between phases, /clear on topic change. Don't rely on auto-compaction to preserve a named deliverable — pin it explicitly.

CostLow; net token savings on long runs.

RiskLossy summaries can drop a constraint — mitigated by Option A pinning.

Session Segmentation + Sub-agent Offload

Effort · Impact ●●●●

Stop running 8 unrelated deliverables down one thread. Spawn a fresh sub-agent / session per separable task (Anthropic's three spawn signals: context-pollution, parallelizable, tool-specialized), each booting from a 1–2K-token handoff brief. Structural prevention of saturation.

OpenClaw Steve

Already structural — dispatched/background tasks are fresh contexts. Enforce: each boots from a handoff brief + canonical state, never the parent thread. This is the highest-leverage move here.

Claude Code Steve

Use the Agent/Task tool for research and large reads (return distillates, not dumps); session-handoff skill + a project state file for cross-session continuity.

Cost3–10× tokens for parallel fan-out (Anthropic) — justified on heavy work.

RiskCognition's caveat: keep writes single-threaded; fan out reads/research, not conflicting decisions.

No-Silent-Substitution Guard (runtime teeth)

Effort · Impact ●●●●●

A Stop-hook that parses the session for user-named deliverables ("in PandaDoc", "send via", "as a <format>"), checks the final artifacts against them, and blocks if a named deliverable has no matching artifact and no recorded approval-to-swap. The direct, deterministic catch for this exact failure. (NeMo-Guardrails-style pre-execution rail + SagaLLM-style output-contract validation.)

OpenClaw Steve

instruction-fidelity-guard.py as a Stop hook + a Standing Order; self-healing where the on-target path exists, escalates a flag to Victor where it doesn't.

Claude Code Steve

Same hook in the CLI Stop-hook chain (sibling to the existing wrap-up / permission guards). Fires before the turn can close on a substituted deliverable.

CostOne script + parser; moderate build, high durability.

RiskFalse positives on legitimate approved swaps — mitigated by an explicit approval token.

Externalized State & Handoff Briefs

Effort · Impact ●●●

Push durable state to files (MemGPT/Letta tiered memory; Manus file-system-as-context; Anthropic state artifacts). A 5-layer handoff brief (state · narrative · decisions · priority queue · gotchas) lets any new context or surface resume without replaying history — and carries the deliverable-contract across the boundary.

OpenClaw Steve

Native fit: Letta + canonical state files already exist. Standardize the handoff-brief schema so every surface (Phone, Desktop, Mission Control, dispatch) reads/writes the same contract.

Claude Code Steve

A project state file + session-handoff skill; the brief is the first thing a resumed session reads, and the last thing a closing session writes.

CostSchema + write/read discipline; low ongoing.

RiskDrift between surfaces if schema isn't shared — enforce one schema.

The architect's recommendation

Don't pick one. The failure is multi-causal, so the defense is layered — three layers, in priority order, plus the orthogonal memory work already in flight. Behavioral discipline is cheap but leaks; the runtime guard is what makes it real (CLAUDE.md core value #8: real fixes are scripted, not text).

Deliverable-Contract Recitation (A + E)

Every session maintains a pinned ACTIVE-ASKS contract — user-named deliverables with their tool/format/channel — rewritten into recent context at each work block and carried in the handoff brief. Highest ROI, lowest effort. Ships as the context-discipline + instruction-fidelity skills (already scaffolded this session).

Counters: lost-in-the-middle, no-re-anchoring. Both surfaces, day one.

No-Silent-Substitution Guard (D)

A Stop-hook that blocks turn-close when a named deliverable has no matching artifact and no approval-to-swap, emitting the deviation flag instead. This is the deterministic catch for the PandaDoc class — the one layer that would have stopped this specific failure rather than just made it less likely.

Counters: silent substitution, recency-driven action drift. Runtime teeth on both surfaces.

Segmentation + Context Hygiene (C + B)

Fresh sub-agent/session per separable task, booting from a handoff brief; context-editing + compaction to keep each window high-signal. Structural — it removes the saturation that creates the drift in the first place. Ships as the session-segmentation skill.

Counters: context saturation / rot. Native to OpenClaw's dispatch model; via Agent-tool + /compact on Claude Code.

Why this order. L1 is nearly free and addresses the most-cited mechanism (re-anchoring) — do it immediately. L2 is the only layer that guarantees the specific failure can't silently recur, so it's the priority build. L3 is the deepest fix but the heaviest lift and partly already-native to OpenClaw — sequence it third. The in-flight Capability Ledger (memory side) runs in parallel and is complementary: it stops false "I can't," while this stack stops "could, but did something else."

The two surfaces, side by side

OpenClaw Steve — always-on, cross-surface

Segmentation is already structural (dispatched tasks = fresh contexts) — so the win is enforcing that each boots from a shared handoff-brief + Letta core-memory contract, and adding the Stop-hook guard + Standing Order across Phone/Desktop/Mission Control. Letta already gives you MemGPT-style tiered memory; standardize the contract schema so parity holds across every surface.

Claude Code Steve — single long sessions

Most exposed to in-session saturation. Lean on harness-native moves (todo as recitation surface, /compact, /clear, Agent-tool offload) + the same instruction-fidelity-guard.py in the Stop-hook chain. Don't trust auto-compaction to preserve a named deliverable — pin it in the todo and let the guard verify it at turn-close.

Sources

23 primary sources, fetched and verified. Academic findings carry their arXiv/venue IDs; practitioner writeups link to the original posts.

Liu et al. — "Lost in the Middle: How Language Models Use Long Contexts." TACL 2024. aclanthology.org/2024.tacl-1.9 · arXiv:2307.03172
Hong, Troynikov, Huber — "Context Rot: How Increasing Input Tokens Impacts LLM Performance." Chroma Research, Jul 2025. trychroma.com/research/context-rot
Modarressi et al. — "NoLiMa: Long-Context Evaluation Beyond Literal Matching." ICML 2025 (Adobe Research). research.adobe.com
Laban et al. — "LLMs Get Lost in Multi-Turn Conversation." 2025, ICLR 2026 Best Paper. arXiv:2505.06120 · arxiv.org/abs/2505.06120
Meta — "Multi-IF: Benchmarking LLMs on Multi-Turn Multilingual Instruction Following." Oct 2024. arXiv:2410.15553
Google — "IFEval: Instruction-Following Eval for LLMs." Nov 2023. arXiv:2311.07911
Peysakhovich & Lerer — "Attention Sorting Combats Recency Bias in Long Context Language Models." 2023. arXiv:2310.01427
Fang et al. — "Do Large Language Models Favor Recent Content? A Study on Recency Bias in LLM-Based Reranking." SIGIR-AP 2025. arXiv:2509.11353
Anthropic — "Effective context engineering for AI agents." Sep 2025. anthropic.com/engineering
Anthropic — "Managing context on the Claude Developer Platform" (Memory tool + context editing). Sep 2025. claude.com/blog/context-management
Anthropic — "Building Effective Agents." Dec 2024. anthropic.com/engineering/building-effective-agents
Anthropic — "How we built our multi-agent research system." Jun 2025. anthropic.com/engineering
Anthropic — "Effective harnesses for long-running agents." 2025. anthropic.com/engineering
Anthropic / Claude — "When to use multi-agent systems (and when not to)." 2025. claude.com/blog
Anthropic Platform Docs — "Context windows" (context rot, state artifacts). platform.claude.com/docs
Packer et al. — "MemGPT: Towards LLMs as Operating Systems." 2023 → Letta. arXiv:2310.08560 · research.memgpt.ai
Walden Yan (Cognition) — "Don't Build Multi-Agents." Jun 2025. cognition.ai/blog/dont-build-multi-agents
Walden Yan (Cognition) — "Multi-Agents: What's Actually Working." Apr 2026. cognition.ai/blog/multi-agents-working
Yichao "Peak" Ji (Manus) — "Context Engineering for AI Agents: Lessons from Building Manus." Jul 2025. manus.im/blog
Shinn et al. — "Reflexion: Language Agents with Verbal Reinforcement Learning." 2023. arXiv:2303.11366
Rebedea et al. — "NeMo Guardrails: A Toolkit for Controllable and Safe LLM Applications." EMNLP 2023. arXiv:2310.10501
"SagaLLM: Context Management, Validation, and Transaction Guarantees for Multi-Agent LLM Planning." VLDB 2025. arXiv:2503.11951
OpenAI — "GPT-4.1 Prompting Guide" (persistent reminders); Beurer-Kellner et al., arXiv:2506.08837 (instruction re-injection, security-hardened). 2025.

Why the agent driftswhen it knew better all along.

This is not a memory problem. The agent knew the capability and used it 30 minutes earlier. It is a context-management + instruction-adherence failure — context saturation, recency bias, and a missing deviation-flag.

The frontier research

A · Long sessions degrade — well before the window is "full" HIGH

"Lost in the Middle" — the U-shaped curve

"Context Rot" — degradation below max capacity

NoLiMa — adherence collapses at 32K tokens

Multi-turn collapse — the headline number

B · The freshest tokens win — recency bias drives the swap HIGH

Early content is structurally under-attended

Recency signal overrides relevance

C · The toolkit — what the best labs actually do HIGH

Anthropic — context as a finite resource

Anthropic — Memory Tool + Context Editing

Manus — recitation beats drift

MemGPT / Letta — memory as an OS

Anthropic — fresh context per task + state artifacts

Cognition — actions carry implicit decisions

D · Instruction fidelity — re-anchoring & deviation-flagging HIGH

Our failure, mapped

Root-cause decomposition — all four, stacked

1 · Context saturation

2 · Recency bias

3 · No re-anchoring on the original ask

4 · No deviation flag

The options

Deliverable-Contract Recitation

OpenClaw Steve

Claude Code Steve

Context Editing & Compaction

OpenClaw Steve

Claude Code Steve

Session Segmentation + Sub-agent Offload

OpenClaw Steve

Claude Code Steve

No-Silent-Substitution Guard (runtime teeth)

OpenClaw Steve

Claude Code Steve

Externalized State & Handoff Briefs

OpenClaw Steve

Claude Code Steve

The architect's recommendation

Deliverable-Contract Recitation (A + E)

No-Silent-Substitution Guard (D)

Segmentation + Context Hygiene (C + B)

The two surfaces, side by side

OpenClaw Steve — always-on, cross-surface

Claude Code Steve — single long sessions

This is research, options, and a recommendation — not implementation.

Sources