Agent memory: short-term, long-term, and coordination
Useful agents remember more than the current chat window. Here is a practical map of short-term working memory, long-term procedural and episodic stores, semantic knowledge, and shared coordination—what each layer does and when to reach for it.
Every agent starts with the same constraint: the model only sees what you put in front of it right now. That limit is real, but it is not the whole story. Production agents combine several memory layers—some transient, some persistent, some shared across runs—so they can stay coherent across long tasks, recall past work, and follow team rules without stuffing everything into one prompt.
A useful mental model splits agent memory into three branches: short-term memory (STM) for the active task, long-term memory (LTM) for durable knowledge and habits, and coordination for multi-agent or multi-session state. The goal is not to mimic human cognition literally; it is to decide what to store, where to store it, and how to retrieve it cheaply when the next turn needs context.
Three branches of agent memory
Short-term memory covers the active window: context, session state, and semantic cache. Long-term memory splits into procedural (how), episodic (what happened), and semantic (what is known). Coordination adds shared memory when more than one agent or service must stay aligned.
Short-term memory: what the agent is working with now
STM is everything that exists only for the current job. It should be fast, bounded, and easy to discard when the session ends.
Working memory is the active scratch space. The LLM context window is the hard ceiling: tokens for system instructions, tool results, retrieved snippets, and recent messages. Session memory is the softer layer on top—conversation state, open todos, partial plans, and variables the agent set earlier in the same run. Good session design keeps this structured (JSON, state objects, explicit summaries) instead of hoping the model re-reads fifty messages correctly.
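As a minimal sketch of structured session memory, the state can live in a small object that is rendered into the prompt on each turn. The `SessionState` class, its field names, and the example values are assumptions made for illustration, not a prescribed schema.

```python
import json
from dataclasses import dataclass, field


@dataclass
class SessionState:
    """Structured session memory for one agent run (hypothetical schema)."""
    goal: str
    plan: list[str] = field(default_factory=list)        # remaining plan steps
    todos: list[str] = field(default_factory=list)       # open todos
    decisions: list[str] = field(default_factory=list)   # decisions already made
    variables: dict[str, str] = field(default_factory=dict)

    def record_decision(self, text: str) -> None:
        self.decisions.append(text)

    def complete_todo(self, item: str) -> None:
        # Ignore items that are not open, so repeated calls are harmless.
        if item in self.todos:
            self.todos.remove(item)

    def to_prompt_block(self) -> str:
        """Render a compact JSON block to inject into the context window."""
        return json.dumps(
            {
                "goal": self.goal,
                "plan": self.plan,
                "open_todos": self.todos,
                "decisions": self.decisions,
                "variables": self.variables,
            },
            indent=2,
        )


state = SessionState(goal="Fix failing login test", todos=["reproduce", "patch"])
state.complete_todo("reproduce")
state.record_decision("Root cause is an expired fixture token")
state.variables["branch"] = "fix/login-test"
print(state.to_prompt_block())
```

The point of the design is that the model reads one short, current block instead of reconstructing state from the whole message history.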
A semantic cache sits beside working memory. When the agent or user asks something semantically similar to a recent query, a cache hit can return the stored answer or the previously retrieved results without another full retrieval pass. That cuts latency and cost for repeated questions, common tool lookups, or RAG queries that barely change between turns. Caches are STM because they expire: TTLs, session scope, or invalidation when underlying data changes.
STM fails when teams treat the context window as their only database. Long transcripts, raw logs, and duplicate retrievals eat tokens; the model forgets early decisions even though the information still exists somewhere in the thread. The fix is aggressive summarization, structured state, and moving durable facts into LTM.
Long-term memory: what should survive the session
LTM answers: what should this agent know tomorrow, next week, or for the next user on the same project? It is slower to write, cheaper to read selectively, and usually lives outside the prompt until retrieval time.
Procedural memory is how work gets done. Workflows encode sequences—deploy steps, review checklists, incident playbooks. The toolbox is the set of functions, APIs, MCP servers, and skills the agent may call. Procedural memory is not facts about the world; it is operational knowledge. Store it in versioned repo files, agent definitions, hooks, and documented runbooks so behavior stays auditable.
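To make the workflow and toolbox split concrete, here is a small sketch in which tools are registered by name and a workflow is plain, versionable data that lists the steps. The names `TOOLBOX`, `tool`, `run_workflow`, and `DEPLOY_WORKFLOW` are hypothetical, and the tool bodies are stand-ins for real commands.

```python
from typing import Callable

# The toolbox: a registry of callables the agent is allowed to invoke.
TOOLBOX: dict[str, Callable[[dict], dict]] = {}


def tool(name: str):
    """Register a function in the toolbox under a stable name."""
    def register(fn):
        TOOLBOX[name] = fn
        return fn
    return register


@tool("run_tests")
def run_tests(ctx: dict) -> dict:
    # Stand-in: a real tool would invoke the project's test runner.
    return {**ctx, "tests_passed": True}


@tool("build")
def build(ctx: dict) -> dict:
    if not ctx.get("tests_passed"):
        raise RuntimeError("build requires passing tests")
    return {**ctx, "artifact": f"app-{ctx['version']}"}


# A workflow is data, so it can live in a versioned file in the repo.
DEPLOY_WORKFLOW = {"name": "deploy", "version": "1.0.0", "steps": ["run_tests", "build"]}


def run_workflow(workflow: dict, ctx: dict) -> dict:
    """Run each step in order, threading the context through the tools."""
    for step in workflow["steps"]:
        if step not in TOOLBOX:
            raise KeyError(f"workflow {workflow['name']} references unknown tool {step!r}")
        ctx = TOOLBOX[step](ctx)
    return ctx


print(run_workflow(DEPLOY_WORKFLOW, {"version": "2.3"}))
```

Keeping the step list separate from the tool code means a reviewer can audit a change to the procedure without reading any implementation.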
Episodic memory is what happened. Conversations are raw interaction logs—valuable for audit and debugging, expensive to replay verbatim. Summarizations compress episodes into durable takeaways: decisions made, errors hit, files touched, open questions. Most agents need summaries plus pointers (links to transcripts, tickets, commits) rather than full chat history in every prompt.
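One possible shape for a "summary plus pointers" record is sketched below. The `EpisodeSummary` fields follow the takeaways listed above, while the class name, the rendering format, and the example pointer values are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class EpisodeSummary:
    """Compressed record of one episode, with pointers to the raw material."""
    episode_id: str
    ended_at: str  # ISO 8601 timestamp
    decisions: list[str] = field(default_factory=list)
    errors: list[str] = field(default_factory=list)
    files_touched: list[str] = field(default_factory=list)
    open_questions: list[str] = field(default_factory=list)
    pointers: dict[str, str] = field(default_factory=dict)  # label -> location

    def to_context(self) -> str:
        """Render a short text block; empty sections are omitted."""
        sections = [
            ("Decisions", self.decisions),
            ("Errors", self.errors),
            ("Files", self.files_touched),
            ("Open questions", self.open_questions),
        ]
        lines = [f"Episode {self.episode_id} (ended {self.ended_at})"]
        for title, items in sections:
            if items:
                lines.append(f"{title}: " + "; ".join(items))
        if self.pointers:
            refs = ", ".join(f"{k}={v}" for k, v in sorted(self.pointers.items()))
            lines.append(f"Sources: {refs}")
        return "\n".join(lines)


episode = EpisodeSummary(
    episode_id="run-17",
    ended_at="2024-05-01T12:00:00Z",
    decisions=["Use retry with backoff for the upload step"],
    files_touched=["uploader.py"],
    pointers={"transcript": "transcripts/run-17.jsonl"},  # hypothetical path
)
print(episode.to_context())
```

Only the rendered block enters the prompt; the transcript itself stays in storage and is fetched through the pointer when someone needs the detail.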
Semantic memory is general knowledge. A knowledge base backs RAG over docs, tickets, and wikis. Entity memory tracks specific objects—customers, repos, services, people—with attributes and relationships. Persona memory defines how the agent should behave: tone, boundaries, role, and standing instructions. Persona and entity memory overlap with system prompts, but externalizing them lets you update behavior without redeploying the whole stack.
The hard LTM design question is usually retrieval, not storage. Vector search, keyword indexes, graph links, and metadata filters each fit different shapes of memory. Procedural docs reward exact paths; episodic summaries reward recency and tags; entity memory rewards stable IDs.
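The sketch below illustrates two of those retrieval shapes over a single in-memory store: exact lookup by path or stable ID, and tag-filtered recency ranking for episodes. `MemoryRecord`, `MemoryStore`, and the sample records are assumptions for illustration; a real system would put indexes behind the same interface.

```python
from dataclasses import dataclass


@dataclass
class MemoryRecord:
    kind: str                     # "procedural", "episodic", or "entity"
    key: str                      # exact path or stable ID
    text: str
    tags: frozenset = frozenset()
    timestamp: float = 0.0


class MemoryStore:
    def __init__(self, records):
        self.records = list(records)

    def get_by_key(self, kind: str, key: str):
        """Exact lookup: suits procedural paths and entity IDs."""
        for record in self.records:
            if record.kind == kind and record.key == key:
                return record
        return None

    def recent_episodes(self, tag: str, limit: int = 3):
        """Metadata filter plus recency ranking: suits episodic summaries."""
        matches = [r for r in self.records if r.kind == "episodic" and tag in r.tags]
        return sorted(matches, key=lambda r: r.timestamp, reverse=True)[:limit]


store = MemoryStore([
    MemoryRecord("entity", "customer:42", "Acme Corp, enterprise plan"),
    MemoryRecord("procedural", "runbooks/deploy.md", "Deploy checklist"),
    MemoryRecord("episodic", "ep-1", "Refund issued", frozenset({"billing"}), 100.0),
    MemoryRecord("episodic", "ep-2", "Invoice bug fixed", frozenset({"billing"}), 300.0),
    MemoryRecord("episodic", "ep-3", "Login outage", frozenset({"auth"}), 200.0),
])
print(store.get_by_key("entity", "customer:42").text)
print([r.key for r in store.recent_episodes("billing", limit=2)])
```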
Coordination: memory beyond one agent
When multiple agents, humans, or services collaborate, shared memory becomes the coordination layer. It is the bulletin board everyone reads before acting: task status, locks, intermediate results, and decisions that must not be re-litigated.
Shared memory can be a database row, a Redis key, a project file, or a team-visible session object. The contract matters more than the technology: who may write, what schema is canonical, when entries expire, and how conflicts resolve. Without that, parallel agents duplicate work, contradict each other, or overwrite conclusions.
Coordination also bridges STM and LTM. A subagent might write a structured result to shared memory; the parent reads a short summary into its context window; a nightly job compacts shared entries into episodic summaries in LTM. That pipeline keeps working memory small while preserving institutional recall.
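The pipeline described above can be sketched with plain dictionaries and lists standing in for the shared store and the episodic store. The function names, the record fields (`status`, `summary`, `artifact`), and the 120-character limit are illustrative assumptions.

```python
def subagent_publish(shared: dict, task_id: str, result: dict) -> None:
    """A subagent writes its structured result to shared memory."""
    shared[task_id] = result


def parent_context_line(shared: dict, task_id: str, max_chars: int = 120) -> str:
    """The parent reads only a short summary line into its context window."""
    result = shared[task_id]
    return f"[{task_id}] {result['status']}: {result['summary']}"[:max_chars]


def compact_to_ltm(shared: dict, episodic_ltm: list) -> int:
    """Periodic job: move finished entries into episodic long-term memory."""
    done_ids = [tid for tid, r in shared.items() if r["status"] == "done"]
    for tid in done_ids:
        result = shared.pop(tid)
        episodic_ltm.append({
            "episode_id": tid,
            "summary": result["summary"],
            "pointer": result.get("artifact"),  # where the full output lives
        })
    return len(done_ids)


shared_store: dict = {}
episodes: list = []
subagent_publish(shared_store, "research-1", {
    "status": "done",
    "summary": "Three candidate libraries compared; B is the fastest",
    "artifact": "reports/research-1.md",  # hypothetical path
})
subagent_publish(shared_store, "research-2", {"status": "running", "summary": "In progress"})
print(parent_context_line(shared_store, "research-1"))
print(compact_to_ltm(shared_store, episodes), episodes)
```

Unfinished entries stay in shared memory, so the compaction job can run repeatedly without losing work in progress.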
Putting the map to work
Start with the smallest stack that matches your task shape. A single-session coding agent may need only context window discipline, session state, and repo-backed procedural memory (rules, skills, hooks). A customer-support agent adds entity memory and a knowledge base. A research swarm adds shared memory and episodic summarization between specialist runs.
Four rules keep the architecture honest. Bound STM—summarize, cache, and retrieve instead of growing one endless thread. Type LTM—separate how (workflows/tools), what happened (episodes), and what is true (semantic/entity/persona). Retrieve on demand—inject only what the current step needs. Make coordination explicit—shared state needs writers, readers, and TTLs, not ad hoc message passing.
Agent memory is not one feature you bolt on; it is a set of trade-offs between freshness, cost, latency, and correctness. Short-term layers keep the model focused. Long-term layers preserve skill and history. Coordination keeps many hands from breaking the same vase. Name the layer before you store something, and most “why did the agent forget?” bugs become design questions instead of mysteries.