80% Token Compression on Long AI Conversations: Observational Memory
Long agent conversations blow past context windows. ZenSearch's observational memory layer extracts the key findings, tool usage, and pending work into a compact summary that replaces the full history — typically 80%+ token compression on 50+ message chats.
Observational memory is an LLM-based summarisation layer that runs after each long agent turn, extracting key findings, tool usage, user intent, and pending work into a structured Observation object. On the next run, the observation is injected as a cacheable system-prompt prefix instead of replaying the full chat history — typically compressing 50+ message conversations by 80% or more. When observations themselves accumulate past a second threshold, a Reflection pass consolidates them into a single higher-level summary.
This replaces the older heuristic compaction (string-based extraction of search queries and findings) with something that actually understands the conversation. The trade-off is one extra zen-mini call per long turn — which is cheap at the compression ratios we see, and fully optional via a single env var.
The Problem Compaction Solves
A three-message chat fits comfortably in any context window. A thirty-message chat doesn't — and crucially, most of those thirty messages are low-signal (tool-call acks, intermediate search results, the agent's own planning chatter). Stuffing them all back into the next turn's context is wasteful at best and straight-up impossible past a certain length.
The naive fix is to drop the oldest N messages. That loses the user's original intent, any constraints they stated early on, and the synthesis decisions the agent made along the way. The better fix is to compress the old messages into a structured summary that preserves those signals — and that's what an Observation is.
What an Observation Contains
Each Observation is a small structured record (stored as JSON in conversation_observations) with:
- User intent — what the user is ultimately trying to accomplish
- Key findings — the substantive information the agent discovered across its tool calls
- Tool usage summary — which tools were called, with which arguments, and what they returned (compressed)
- Pending work — anything the user asked for that the agent hasn't resolved yet
- Unresolved constraints — hard requirements the next turn must still honour
The Observer runs after a conversation turn crosses AGENT_OBSERVATION_THRESHOLD tokens (default 10,000), rate-limited to once per conversation per five minutes. Below that threshold, running the observer costs more than it saves.
Reflection: Summarising the Summaries
Observations accumulate. At some point a conversation has twenty observations, and stuffing those into every subsequent turn is back to the original problem on a different axis. When observations exceed AGENT_REFLECTION_THRESHOLD tokens (default 20,000), a second pass — the Reflector — consolidates many observations into a single higher-level summary. The Reflection is preferred over raw observations in context assembly, and older observations can be pruned without losing the aggregated signal.
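The Reflector's trigger logic can be sketched like this — `summarise` stands in for the actual zen-mini reflection call, and the names are illustrative:

```go
package main

import "fmt"

const reflectionThreshold = 20_000 // AGENT_REFLECTION_THRESHOLD default

// consolidate collapses accumulated observations into a single higher-level
// reflection once their combined token count exceeds the threshold; the
// originals can then be pruned without losing the aggregated signal.
func consolidate(observations []string, tokens func(string) int,
	summarise func([]string) string) (kept []string, reflection string) {

	total := 0
	for _, o := range observations {
		total += tokens(o)
	}
	if total <= reflectionThreshold {
		return observations, "" // not enough accumulated signal yet
	}
	return nil, summarise(observations) // observations pruned, reflection kept
}

func main() {
	obs := []string{"obs-1", "obs-2", "obs-3"}
	tokens := func(s string) int { return 9_000 } // pretend each observation is 9k tokens
	summarise := func(all []string) string {
		return fmt.Sprintf("reflection over %d observations", len(all))
	}
	kept, reflection := consolidate(obs, tokens, summarise)
	fmt.Println(len(kept), reflection)
}
```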
Cacheable System-Prompt Prefix
The key design decision is where the observation lives in the prompt. ZenSearch injects it as a system-prompt prefix rather than interleaving it with the messages. This matters because Anthropic's prompt caching and OpenAI's prefix caching both key on an exact, unchanged prompt prefix — a prefix that is byte-identical between turns is cache-eligible, which means roughly a 90% discount on the cached portion for Anthropic and 50% for OpenAI. The compression ratio of the observation is effectively multiplied by the cache-hit rate on the prefix.
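A minimal sketch of why prefix stability matters (the tag name and prompt layout here are assumptions, not ZenSearch's actual format):

```go
package main

import "fmt"

// buildSystemPrompt puts the observation into a stable prefix. Anything that
// changes per turn stays out of it, so the provider-side prompt cache keys
// match between turns.
func buildSystemPrompt(basePrompt, observation string) string {
	return basePrompt + "\n\n<conversation_memory>\n" + observation + "\n</conversation_memory>"
}

func main() {
	base := "You are a research agent."
	obs := "User wants X; findings A and B; pending: C."

	turn1 := buildSystemPrompt(base, obs)
	turn2 := buildSystemPrompt(base, obs)
	fmt.Println("cache-eligible:", turn1 == turn2) // identical prefix -> cache hit

	// Back-of-envelope effect using the discounts quoted above: a cached
	// 2k-token prefix bills like ~200 tokens at a 90% cached-read discount,
	// versus replaying a 10k-token history at full price.
	const historyTokens, prefixTokens = 10_000, 2_000
	effective := float64(prefixTokens) * 0.10
	fmt.Printf("effective cost: %.0f tokens vs %d\n", effective, historyTokens)
}
```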
Pluggable Context Pipeline
Observational memory is just one of three built-in context providers in a priority-ordered chain: observations (priority 10), heuristic compaction (priority 20, fallback), and truncation (priority 30). Each provider transforms the message list; failures are logged and skipped so the pipeline never fails a request.
Custom providers plug in at any priority slot via Orchestrator.ContextPipeline().Register(...) or declaratively through OrchestratorConfig.ExtraContextProviders. Teams can inject things like "fetch the user's current CRM account before each iteration" or "attach the last incident summary from PagerDuty" without forking the orchestrator. The pipeline runs in both LLMNode and SynthesizeNode, so custom providers also apply to the final synthesis pass.
When to Turn It Off
Observational memory adds one zen-mini call per long-conversation turn. For deployments with tight model budgets, or where conversations rarely exceed the observation threshold, set AGENT_OBSERVATIONAL_MEMORY_ENABLED=false and the agent falls back to deterministic heuristic compaction. For deployments where long conversations are common and model spend is acceptable, leave it on — the cache-hit discount on the prefix usually pays for the extraction call several times over.