Memory: what separates a capable agent from a useful one
An agent without memory starts from scratch every time. Memory is what allows it to carry context across steps, sessions, and tasks — and it is where some of the most underestimated production risks live.
The failure this pattern exists to prevent
This is an illustrative example of a failure pattern that surfaces regularly in production. A financial services company deployed a customer-facing agent to handle account queries. The agent was capable: it could look up balances, explain fee structures, and walk customers through dispute processes. The team was satisfied with the demo and shipped it.
Six weeks later the support team flagged a pattern. Customers were calling back after interacting with the agent and repeating everything they had just told it. The agent had no memory of the previous session. Every conversation started from zero. A customer who had spent twenty minutes explaining a complex dispute had to explain it again the next day to the same system. The capability was there. The continuity was not.
That is the most visible memory failure. There is a less visible one that is more dangerous. An agent with no memory of its own past actions cannot recognise when it is repeating a mistake it already made. It will attempt the same failing approach across sessions because nothing in its architecture tells it that approach has failed before.
What the pattern is
Memory in AI agents is not a single mechanism. It is made up of three distinct capabilities that operate at different timescales and serve different purposes.
Short-term memory is the context window. Everything the agent knows within a single session — the conversation history, the task description, the tool call results, the intermediate reasoning — lives here. It is fast, immediately available, and temporary. When the session ends, short-term memory is gone.
Long-term memory persists across sessions. The agent stores summaries, outcomes, user preferences, or learned patterns in a way that is available the next time it runs. This is what allows an agent to recognise a returning user, recall the outcome of a previous task, or avoid repeating an approach that did not work.
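A minimal sketch of the long-term layer, assuming a JSON file on local disk as the store. The `LongTermMemory` class and its `remember`/`recall` methods are hypothetical names for illustration. A production system would more likely use a database, but the shape is the same: write before a session ends, read when the next one starts.

```python
import json
import tempfile
from pathlib import Path


class LongTermMemory:
    """Hypothetical file-backed store whose contents survive across sessions."""

    def __init__(self, path: str) -> None:
        self.path = Path(path)

    def _load(self) -> dict:
        # A missing file means nothing has been stored yet.
        if not self.path.exists():
            return {}
        return json.loads(self.path.read_text(encoding="utf-8"))

    def remember(self, user_id: str, key: str, value: str) -> None:
        # Read, update, and write back the whole store (fine for a sketch, not for concurrent writers).
        data = self._load()
        data.setdefault(user_id, {})[key] = value
        self.path.write_text(json.dumps(data, indent=2), encoding="utf-8")

    def recall(self, user_id: str) -> dict:
        # Everything stored for this user in earlier sessions.
        return self._load().get(user_id, {})


store_path = Path(tempfile.mkdtemp()) / "memory.json"

# Session 1: the agent stores a summary before the session ends.
LongTermMemory(str(store_path)).remember(
    "cust-42", "open_dispute", "Disputed card charge; evidence already supplied"
)

# Session 2: a fresh instance (think: new process, next day) reads it back.
print(LongTermMemory(str(store_path)).recall("cust-42"))
```

Because the second instance reads from the file rather than from the first instance's state, the stored summary survives a restart, which is the continuity the customer in the earlier example was missing.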
External memory is retrieval from a knowledge store. The agent does not hold the information itself. It queries a database, a vector store, or a document index and pulls in what is relevant at the moment it is needed. Retrieval-Augmented Generation (RAG) is a common implementation of this layer: the agent retrieves before it reasons rather than relying solely on what it was trained on.
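The retrieve-then-reason step can be sketched in a few lines. This is a toy under loud assumptions: word-count vectors and cosine similarity stand in for a real embedding model and vector store, and `retrieve` and `build_prompt` are hypothetical helpers, not any library's API.

```python
import math
import re
from collections import Counter


def _vectorise(text: str) -> Counter:
    # Toy stand-in for an embedding model: lowercase word counts.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Return up to k documents with non-zero similarity to the query, best first."""
    q = _vectorise(query)
    scored = [(_cosine(q, _vectorise(d)), d) for d in documents]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:k]]


def build_prompt(query: str, documents: list[str]) -> str:
    # Retrieve first, then give the model the retrieved text alongside the question.
    context = "\n".join(f"- {d}" for d in retrieve(query, documents))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


knowledge_base = [
    "The monthly account fee is waived when the balance stays above the minimum.",
    "Card disputes must be raised within the dispute window stated in the terms.",
    "Branch opening hours vary by location.",
]
print(build_prompt("How do I raise a card dispute?", knowledge_base))
```

Swapping `_vectorise` for a real embedding model changes ranking quality, not structure: score, keep the top k, and put them in front of the model.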
Many production agent systems combine these layers in some form, though the exact combination varies widely by product and maturity. Sometimes only one layer is deliberately designed and the other two are left implicit.
A concrete before and after
A legal research agent is asked to find relevant case law for a contract dispute, summarise the findings, and flag any cases that contradict the client's intended argument.
Without memory, the agent works like this across two sessions a week apart.
Session 1:
Task: find relevant case law for contract dispute X.
Agent searches, retrieves 12 cases, summarises them, and flags 3 as potentially contradictory.
Session ends.
Session 2 (one week later, same matter):
Task: check whether any new case law has emerged since last week.
Agent searches, retrieves the same 12 cases plus 2 new ones, summarises all 14 from scratch, and flags the same 3 contradictory cases it already flagged plus one new one.
The second session repeated all the work of the first. The lawyer has no way of knowing which findings are new and which were already reviewed. The agent produced more output, not more value.
With memory, the second session looks like this.
Session 2:
Long-term memory: 12 cases reviewed on [date], 3 flagged as contradictory, summaries stored.
Task: check for new case law since last session.
Agent retrieves only cases published after the previous session date.
Result: 2 new cases found. 1 is relevant, and it is flagged as potentially contradictory.
Output: 1 new case summary and 1 new contradiction flag, delta from previous session clearly marked, no repeated work.
In this scenario the second session takes a fraction of the time and produces output the lawyer can act on without re-reading everything from the previous session.
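The delta logic behind the second session is small. A sketch, assuming the long-term record for a matter holds the last session date and the IDs of reviewed and flagged cases. `MatterMemory`, `Case`, `new_cases_since`, and `record_session` are hypothetical names, and the case IDs and dates are made-up example data.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Case:
    case_id: str
    published: date


@dataclass
class MatterMemory:
    """Hypothetical long-term record kept for one legal matter."""
    last_session: date
    reviewed_case_ids: set[str] = field(default_factory=set)
    flagged_case_ids: set[str] = field(default_factory=set)


def new_cases_since(memory: MatterMemory, results: list[Case]) -> list[Case]:
    """Mirror the second session: only cases published since the last session."""
    # ">=" plus the reviewed-ID check catches cases published on the day of the
    # last session without re-reviewing any case that was already seen.
    return [
        c
        for c in results
        if c.published >= memory.last_session
        and c.case_id not in memory.reviewed_case_ids
    ]


def record_session(memory: MatterMemory, reviewed: list[Case],
                   newly_flagged: set[str], session_date: date) -> None:
    """Fold this session's work into long-term memory so the next session starts from it."""
    memory.reviewed_case_ids.update(c.case_id for c in reviewed)
    memory.flagged_case_ids.update(newly_flagged)
    memory.last_session = session_date


memory = MatterMemory(
    last_session=date(2024, 3, 1),
    reviewed_case_ids={f"C{i}" for i in range(1, 13)},
    flagged_case_ids={"C2", "C5", "C9"},
)
# The search still returns the 12 old cases plus 2 new ones.
results = [Case(f"C{i}", date(2023, 6, i)) for i in range(1, 13)]
results += [Case("C13", date(2024, 3, 4)), Case("C14", date(2024, 3, 6))]

delta = new_cases_since(memory, results)
print([c.case_id for c in delta])  # ['C13', 'C14']
record_session(memory, delta, {"C14"}, date(2024, 3, 8))
```

Only the two new cases reach the summarisation step, and the stored flags let the output mark which contradiction is new.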
Three layers, three sets of risks
Short-term memory fails when the context window fills. In a long session or a complex multi-step task, something has to give: depending on the framework, early context is truncated or summarised to make room for recent content. The agent forgets the original task framing, the user's stated constraints, or the outcome of a tool call it made twenty steps ago. The session continues, but the agent is operating on a partial picture of its own work. This is a common failure mode in production agent systems and one of the least monitored.
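One common mitigation is to mark the task framing and user constraints as pinned, so that when the history must shrink, the oldest unpinned material goes first. A sketch, assuming whitespace word counts as a stand-in for a real tokenizer; `Message` and `fit_to_budget` are hypothetical. Summarising evicted content instead of deleting it is another option not shown here.

```python
from dataclasses import dataclass


@dataclass
class Message:
    text: str
    pinned: bool = False  # task framing and user constraints are pinned


def approx_tokens(text: str) -> int:
    # Assumption: whitespace word count as a rough proxy for a real tokenizer.
    return len(text.split())


def fit_to_budget(history: list[Message], budget: int) -> list[Message]:
    """Evict the oldest unpinned messages until the history fits the budget.

    If pinned messages alone exceed the budget, the result is still over budget.
    """
    kept = list(history)
    total = sum(approx_tokens(m.text) for m in kept)
    i = 0
    while total > budget and i < len(kept):
        if kept[i].pinned:
            i += 1  # never evict pinned context; move on to the next oldest
        else:
            total -= approx_tokens(kept[i].text)
            del kept[i]  # the next message shifts into index i
    return kept


history = [
    Message("Task: summarise the dispute. Constraint: never quote internal fee codes.", pinned=True),
    Message("tool result " * 30),
    Message("tool result " * 30),
    Message("Latest user question about refund timing"),
]
trimmed = fit_to_budget(history, budget=50)
print([m.text[:30] for m in trimmed])
```

The pinned task framing survives while bulky tool output is dropped. If pinned content alone exceeds the budget, the function returns an over-budget history; that case should be surfaced as an error rather than silently truncated further.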
Long-term memory introduces a different class of risk. What gets stored is a decision made by someone — either the model or the developer who designed the storage logic. If the wrong things are stored, or stored in the wrong way, the memory becomes a source of persistent error rather than accumulated knowledge. An agent that stores an incorrect user preference and applies it across every subsequent session will produce subtly wrong outputs until someone notices and corrects the stored value, which can take longer than most teams expect. Unlike a session error that ends when the session ends, a long-term memory error compounds over time.
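One way to keep a stored error from compounding silently is to record who wrote each value and to make corrections append rather than overwrite. A sketch, assuming an in-process store; `CorrectableMemory` and its methods are hypothetical, and the source labels are illustrative.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class MemoryEntry:
    value: str
    source: str  # who wrote it, e.g. "model" or "correction:support-team"
    written_at: datetime


@dataclass
class CorrectableMemory:
    """Hypothetical store: corrections supersede values but never erase them."""
    entries: dict[str, list[MemoryEntry]] = field(default_factory=dict)

    def write(self, key: str, value: str, source: str) -> None:
        entry = MemoryEntry(value, source, datetime.now(timezone.utc))
        self.entries.setdefault(key, []).append(entry)

    def correct(self, key: str, value: str, corrected_by: str) -> None:
        # A correction is a newer entry with a named owner, so both the wrong
        # value and whoever replaced it remain auditable.
        self.write(key, value, source=f"correction:{corrected_by}")

    def read(self, key: str) -> Optional[str]:
        # The agent always sees the latest value for a key.
        versions = self.entries.get(key)
        return versions[-1].value if versions else None

    def history(self, key: str) -> list[MemoryEntry]:
        return list(self.entries.get(key, []))


mem = CorrectableMemory()
mem.write("cust-42/contact_channel", "post", source="model")  # inferred wrongly
mem.correct("cust-42/contact_channel", "email", corrected_by="support-team")
print(mem.read("cust-42/contact_channel"))  # email
```

Every later session reads the corrected value, while the history still shows that the model wrote the wrong one and which team fixed it.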
External memory retrieval has its own failure surface. The agent retrieves what the retrieval system ranks as relevant, not necessarily what is actually relevant for the task. A poorly configured vector store, a knowledge base that has not been updated, or a retrieval query that is subtly misframed can return plausible-looking content that is out of date, out of context, or simply wrong. The agent reasons confidently against bad retrieved content and the output looks authoritative because the reasoning chain is internally consistent.
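A partial guard is to reject retrieved content that scores below a relevance threshold or comes from a source that has not been reviewed recently. A sketch, assuming the retrieval system returns a score and each document carries a last-updated date; the `Retrieved` type, `filter_retrieved`, and both thresholds are hypothetical and would need tuning per system.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Retrieved:
    text: str
    score: float        # relevance score reported by the retrieval system
    last_updated: date  # when the source document was last reviewed


def filter_retrieved(
    results: list[Retrieved],
    today: date,
    min_score: float = 0.75,  # hypothetical threshold
    max_age_days: int = 180,  # hypothetical freshness window
) -> tuple[list[Retrieved], list[tuple[Retrieved, str]]]:
    """Split results into usable and rejected, recording why each was rejected."""
    usable, rejected = [], []
    for r in results:
        if r.score < min_score:
            rejected.append((r, "low relevance score"))
        elif today - r.last_updated > timedelta(days=max_age_days):
            rejected.append((r, "stale source"))
        else:
            usable.append(r)
    return usable, rejected


today = date(2024, 6, 1)
results = [
    Retrieved("Current fee schedule", 0.91, date(2024, 4, 2)),
    Retrieved("Fee schedule (superseded)", 0.88, date(2021, 1, 15)),
    Retrieved("Branch opening hours", 0.42, date(2024, 5, 1)),
]
usable, rejected = filter_retrieved(results, today)
for r, reason in rejected:
    print(f"rejected {r.text!r}: {reason}")
```

This catches stale and weakly matched content, but not a misframed query that retrieves fresh, high-scoring, wrong material; that failure still needs evaluation against known-good answers.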
There is a governance dimension that runs across all three layers and that most teams do not address explicitly. Memory is data, and data has a lifecycle. Who decides what gets stored in long-term memory? Who can correct it when it is wrong? How long does it persist? Who has access to it? For agents operating in regulated industries, these are not optional questions. An agent that stores customer interaction summaries in long-term memory is creating a data record with retention, access, and correction obligations that somebody in your organisation needs to own.
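The lifecycle questions can be made explicit in the data model itself: every stored record carries an owner, a retention period, and the roles allowed to read it. A sketch under stated assumptions; `GovernedRecord` and `GovernedMemory` are hypothetical, and the 365-day retention is an example value, not a regulatory figure.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class GovernedRecord:
    customer_id: str
    summary: str
    created_at: datetime
    retention: timedelta     # how long the record may persist
    readers: frozenset[str]  # roles allowed to read it
    owner: str               # team accountable for corrections


class GovernedMemory:
    """Hypothetical store that enforces retention and read access."""

    def __init__(self) -> None:
        self._records: list[GovernedRecord] = []

    def store(self, record: GovernedRecord) -> None:
        self._records.append(record)

    def read(self, customer_id: str, role: str) -> list[str]:
        matches = [r for r in self._records if r.customer_id == customer_id]
        if any(role not in r.readers for r in matches):
            raise PermissionError(f"role {role!r} may not read all records for {customer_id}")
        return [r.summary for r in matches]

    def purge_expired(self, now: datetime) -> int:
        # Delete records past their retention period; return how many were removed.
        before = len(self._records)
        self._records = [r for r in self._records if now < r.created_at + r.retention]
        return before - len(self._records)


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
year = timedelta(days=365)  # example retention, not a regulatory figure
mem = GovernedMemory()
mem.store(GovernedRecord("cust-42", "Dispute opened", now - timedelta(days=400),
                         year, frozenset({"support"}), "customer-ops"))
mem.store(GovernedRecord("cust-42", "Prefers email", now - timedelta(days=10),
                         year, frozenset({"support", "agent"}), "customer-ops"))

print(mem.purge_expired(now))             # 1
print(mem.read("cust-42", role="agent"))  # ['Prefers email']
```

The point is less the code than the fields: if a record cannot be created without an owner and a retention period, nobody can ship memory that lacks them.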
The question worth asking before you ship
Map the three memory layers against your current agent implementation and answer one question for each.
For short-term memory, what happens to the agent's behaviour when the context window is eighty percent full? Have you tested it? Most teams have not.
For long-term memory, who owns the correction process when something stored is wrong. If the answer is nobody, that is a gap worth closing before the agent touches production data.
For external memory, when was the knowledge store last updated, and who is responsible for keeping it current? A retrieval system is only as good as the knowledge it retrieves from. Stale knowledge produces stale answers regardless of how capable the agent is.
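The eighty-percent question for short-term memory can be turned into a repeatable test. A minimal harness, assuming the agent can be called as a function over a list of messages (a hypothetical interface) and that whitespace word counts approximate tokens. It pads the session with filler to the target fill level, keeps the constraint at the start, and checks whether the reply still honours it. The two stub agents are placeholders; in practice `agent` would call your real system.

```python
from typing import Callable


def approx_tokens(text: str) -> int:
    # Assumption: whitespace word count as a rough proxy for a real tokenizer.
    return len(text.split())


def pad_to_fill(task: str, filler: str, window_tokens: int, fill: float) -> list[str]:
    """Build a history that starts with the task and fills roughly `fill` of the window."""
    history = [task]
    used, target = approx_tokens(task), int(window_tokens * fill)
    while used + approx_tokens(filler) <= target:
        history.append(filler)
        used += approx_tokens(filler)
    return history


def passes_fill_test(
    agent: Callable[[list[str]], str],
    task: str,
    filler: str,
    check: Callable[[str], bool],
    window_tokens: int,
    fill: float = 0.8,
) -> bool:
    # Run the agent on a nearly full context and check the reply still honours the task.
    return check(agent(pad_to_fill(task, filler, window_tokens, fill)))


# Hypothetical stub agents standing in for a real model call.
def attentive_agent(history: list[str]) -> str:
    return "will reply by email only" if "email only" in " ".join(history) else "any channel"


def forgetful_agent(history: list[str]) -> str:
    # Only "sees" the five most recent messages, like an agent losing early context.
    return attentive_agent(history[-5:])


task = "Constraint: contact the customer by email only."
filler = "Tool call returned 20 rows of transaction data."


def honours_constraint(reply: str) -> bool:
    return "email only" in reply


print(passes_fill_test(attentive_agent, task, filler, honours_constraint, window_tokens=1000))  # True
print(passes_fill_test(forgetful_agent, task, filler, honours_constraint, window_tokens=1000))  # False
```

Running the same check at several fill levels, not just eighty percent, shows where behaviour starts to degrade rather than only whether it does.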
What comes next
Memory gives an agent continuity across steps and sessions. But memory alone does not solve the problem of what goes into the agent's context at any given moment. A large context window full of loosely relevant information is not the same as a well-constructed context that gives the agent exactly what it needs.
Day 7 covers Context Engineering, which is the practice of deliberately designing what goes into an agent's context and what stays out. It is less discussed than the other patterns in this series and more consequential in production than most teams expect.
For the AI agents you currently run in production, do you know what they are storing in long-term memory and who owns the process of correcting it when something stored turns out to be wrong?
Part of the TrilokCloud AI Design Patterns series · One pattern per day · trilokcloud.in/blog