Job openings added at the end of the article!
Claude can process 200K tokens in a single conversation. Gemini 2.5 Pro handles 1M tokens. Engineers look at these numbers and assume their AI agent has a great memory. It does not. It has zero memory. What it has is a whiteboard that someone erases the moment you close the tab.
Memory is the single biggest gap between a chatbot and an agent that actually gets better over time. And in 2026, it is the architectural decision most engineers get wrong.
Welcome to Grind Engineer, your guide to becoming a better software engineer! No fluff. Pure engineering insights.
TL;DR: AI agents have no persistent memory by default. The context window is temporary working space, not storage. Real agent memory requires four distinct systems borrowed from cognitive science: sensory, short term, long term semantic, and long term episodic. This article breaks down each type, shows how vector memory works, explains why conversation summarization loses critical details, and covers the five ways agents forget.

The Problem: Your Agent Forgets Everything
Here is what happens when you talk to most AI agents. You ask a question. It answers. You ask a follow up. It uses the previous messages sitting in the context window to stay coherent. You close the browser. Come back tomorrow. It has no idea who you are.
This is not a bug. This is how LLMs work.
The model has no internal state between sessions. Everything it "knows" during a conversation lives inside the context window: a rolling buffer of tokens that the model can read each time it generates a response.
| Limitation | What happens |
|---|---|
| Context window fills up | Oldest messages silently disappear |
| Session ends | All context is gone permanently |
| Summarization compresses history | Specific details get lost |
| No external memory store | Agent cannot learn across sessions |
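The first row of that table can be sketched in a few lines. This is a toy model, not how any specific provider trims context: `ContextWindow`, `count_tokens`, and the 8 token budget are made up for illustration, and the word count is a crude stand-in for a real tokenizer.

```python
from collections import deque


def count_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: one token per whitespace-separated word.
    return len(text.split())


class ContextWindow:
    """Toy rolling buffer that evicts the oldest messages once a token budget is exceeded."""

    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.messages = deque()

    def total_tokens(self) -> int:
        return sum(count_tokens(m) for m in self.messages)

    def add(self, message: str) -> None:
        self.messages.append(message)
        # Drop from the front (oldest first) until the buffer fits the budget again.
        while self.total_tokens() > self.max_tokens and len(self.messages) > 1:
            self.messages.popleft()


window = ContextWindow(max_tokens=8)
window.add("my name is Sam")        # 4 tokens
window.add("I deploy on Fridays")   # 4 tokens, total 8
window.add("what is my name")       # 4 tokens, total 12, so the oldest message is evicted
print(list(window.messages))
```

The name the user gave in the first message is gone by the time they ask for it. Nothing raised an error. That is what "silently disappear" means in practice.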
The CoALA framework (Cognitive Architectures for Language Agents, Sumers et al. 2023) formalized this problem. It maps human memory systems directly onto agent architecture components. Major agent frameworks in 2026, including LangGraph, the OpenAI Agents SDK, and CrewAI, use some version of this model.
💡 Key Insight: An LLM's context window is not memory. It is a whiteboard that gets erased every time you leave the room. Real memory requires architecture outside the model.
The Four Types of Agent Memory
Cognitive science gives us four memory types. Each one maps to a specific component in your agent's architecture.

| Memory Type | Human Analogy | Agent Implementation | Persistence |
|---|---|---|---|
| Sensory | Raw stimuli hitting your eyes and ears | Token input buffer, raw user message | Milliseconds |
| Short Term | Holding a phone number while you dial | Context window (conversation history) | Single session |
| Long Term (Semantic) | Facts: "Paris is in France" | Vector database, knowledge base | Permanent |
| Long Term (Episodic) | Events: "Tuesday's deploy failed" | Event logs with embeddings | Permanent |
There is also procedural memory: the "how to" knowledge. In agents, this lives in the system prompt, tool definitions, and few shot examples. It tells the agent how to behave, not what it knows.
Sensory memory is the raw input. Every token hitting the model before processing. You rarely think about it, but it matters for multimodal agents handling images, audio, and text simultaneously.
Short term memory is the context window. Most people call this "AI memory." It is not. Mem0 research shows that models lose retrieval accuracy on details buried in the middle of long contexts, even when nowhere near the token limit.
Long term semantic memory stores facts across sessions. This is where vector databases like Pinecone, Weaviate, and ChromaDB come in. The agent converts information to embeddings and retrieves relevant facts when needed.
Long term episodic memory stores specific experiences. "The last time this user asked about auth, they were building a Go microservice." This is the most underused memory type in production agents today.
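One way to picture the mapping is as a single container with one field per memory type. This is a minimal sketch, not the layout any framework uses: `AgentMemory`, `Episode`, the field names, and the four turn limit are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Episode:
    when: str   # ISO date string, e.g. "2026-01-13"
    what: str   # what happened


@dataclass
class AgentMemory:
    # Procedural: how to behave (system prompt, tool definitions, few shot examples).
    system_prompt: str
    # Short term: recent turns only; stands in for the context window.
    short_term: deque = field(default_factory=lambda: deque(maxlen=4))
    # Long term semantic: facts that hold across sessions.
    semantic: dict = field(default_factory=dict)
    # Long term episodic: specific events, in the order they happened.
    episodic: list = field(default_factory=list)

    def observe(self, user_message: str) -> None:
        # Sensory input is the raw message; it only survives if it is copied somewhere.
        self.short_term.append(user_message)

    def end_session(self) -> None:
        # Only short term memory is wiped when the session ends.
        self.short_term.clear()


memory = AgentMemory(system_prompt="You are a concise coding assistant.")
memory.observe("I'm building a Go microservice and auth keeps failing")
memory.semantic["user.language"] = "Go"
memory.episodic.append(Episode("2026-01-13", "User asked about auth for a Go microservice"))
memory.end_session()
```

After `end_session()`, the conversation turns are gone but the fact and the episode remain. In a real agent the `semantic` and `episodic` fields would be backed by a database rather than held in process memory.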
In Context vs External Memory
This is the most important architectural choice you will make.
| Approach | How it works | Pros | Cons |
|---|---|---|---|
| In context | Everything stays in the prompt | Simple, zero infrastructure | Limited by context window, expensive |
| External (vector DB) | Memories stored in a database, retrieved per query | Unlimited capacity, persistent | Can miss relevant memories, added latency |
| Hybrid | Recent context in window + retrieval from external store | Best of both | More complex to build and tune |
In context memory is the default. The conversation history sits in the prompt. Works for short interactions. Falls apart when the conversation gets long, the session ends, or the user references something from days ago.
External memory solves persistence.
Every memory becomes a vector. Every query becomes a vector. Retrieval is finding the closest vectors by cosine similarity. This is the foundation of RAG (Retrieval Augmented Generation) and how most production agents handle long term memory.
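Cosine similarity retrieval is small enough to write by hand. The sketch below assumes the embeddings already exist: the three dimensional vectors are hand-written stand-ins for what an embedding model would produce, and `retrieve` is a hypothetical helper, not a vector database API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hand-written vectors standing in for real embeddings.
memories = {
    "User prefers Python for scripting": [0.9, 0.1, 0.0],
    "Tuesday's deploy failed": [0.1, 0.9, 0.2],
    "User is based in Berlin": [0.0, 0.2, 0.9],
}


def retrieve(query_vector, store, top_k=1):
    # Rank every stored memory by similarity to the query and keep the best top_k.
    ranked = sorted(
        store.items(),
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:top_k]]


query = [0.8, 0.2, 0.1]  # pretend embedding of "which language does the user like?"
top = retrieve(query, memories)
print(top)
```

A production store replaces the full sort with an approximate nearest neighbor index, but the scoring idea is the same.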
Vector Memory: How Semantic Search Powers Agent Recall
The quality of vector memory depends on three decisions:
1. What you embed. Raw conversation turns make terrible memories. "Yes, let's do that" means nothing without context. Extract structured facts instead: "User prefers Python over Go for scripting tasks."
2. How you chunk. Long documents need splitting into meaningful segments. Too small and you lose context. Too large and retrieval gets noisy. A common starting point is 200 to 500 tokens per chunk with some overlap between neighbors, tuned against your own retrieval results.
3. How you score. Pure cosine similarity is not enough. The SmartVector framework combines four signals:
| Retrieval Signal | What it measures | Why it matters |
|---|---|---|
| Semantic similarity | How close is this memory to the query? | Core relevance |
| Temporal recency | How recent is this memory? | Prevents stale info |
| Confidence decay | How certain was this memory when stored? | Filters uncertain facts |
| Relational graph | Is this memory connected to other relevant memories? | Surfaces context clusters |
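Here is one way the four signals could be blended into a single score. This is not SmartVector's actual formula: the weights, the 30 day half life, the cap of three linked memories, and the names `ScoredMemory` and `combined_score` are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class ScoredMemory:
    text: str
    similarity: float   # cosine similarity to the query, 0..1
    age_days: float     # time since the memory was stored
    confidence: float   # how certain the memory was when stored, 0..1
    linked_hits: int    # linked memories that also matched the query


def combined_score(m, half_life_days=30.0, weights=(0.6, 0.2, 0.1, 0.1)):
    """Weighted blend of the four signals. All weights are illustrative assumptions."""
    w_sim, w_rec, w_conf, w_graph = weights
    recency = 0.5 ** (m.age_days / half_life_days)   # halves every half_life_days
    graph = min(m.linked_hits, 3) / 3                # cap so one big cluster cannot dominate
    return (
        w_sim * m.similarity
        + w_rec * recency
        + w_conf * m.confidence
        + w_graph * graph
    )


stale = ScoredMemory("User deploys on Heroku", similarity=0.80,
                     age_days=365, confidence=0.9, linked_hits=0)
fresh = ScoredMemory("User migrated to Kubernetes", similarity=0.75,
                     age_days=3, confidence=0.9, linked_hits=2)

best = max([stale, fresh], key=combined_score)
print(best.text)
```

Pure cosine similarity would return the year old Heroku memory because 0.80 beats 0.75. With recency and graph signals in the mix, the newer Kubernetes memory ranks first.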

Conversation Summarization: The Compression Trade Off
When the context window fills up, you have two options: drop old messages or summarize them. Most frameworks choose summarization.
The agent takes the oldest N messages, asks the LLM to compress them into a summary, and replaces the originals. The context window shrinks. The conversation continues.
The problem? Summaries of summaries lose detail fast. After three or four compression passes, the agent remembers the shape of what happened but none of the specifics. It "knows" you discussed authentication but cannot continue the work because the code snippets, error messages, and decisions got compressed away.
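The compaction step can be sketched like this. `fake_summarize` is a deliberately crude stand-in for an LLM call (it just keeps the first eight words), and `compact` is a hypothetical helper, not a function from any framework. A real summary would be smarter, but the failure mode is the same: whatever the summary leaves out is gone.

```python
def fake_summarize(messages, max_words=8):
    # Stand-in for an LLM call: keeps only the first few words of the joined text.
    words = " ".join(messages).split()
    return "SUMMARY: " + " ".join(words[:max_words])


def compact(history, n_oldest, summarize=fake_summarize):
    """Replace the oldest n_oldest messages with a single summary message."""
    if len(history) <= n_oldest:
        return history
    summary = summarize(history[:n_oldest])
    return [summary] + history[n_oldest:]


history = [
    "We discussed authentication for the API",
    "The error was 401 invalid_audience from the token endpoint",
    "Decision: switch to RS256 signed tokens",
    "Next, let's wire up the refresh flow",
]
history = compact(history, n_oldest=3)
print(history)
```

The summary still mentions authentication. The exact error code and the signing decision did not make it through.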
Sanity.io published a better approach in 2025: distillation instead of summarization. Their system extracts two things from each conversation window: a narrative (short sentences explaining what happened) and a fact list (decisions, preferences, data points). Facts persist forever. Narratives get compressed.
| Approach | What survives | What gets lost | Best for |
|---|---|---|---|
| Drop old messages | Nothing from dropped messages | Everything before cutoff | Simple chatbots |
| Summarize | General themes and decisions | Code, exact numbers, details | Medium conversations |
| Distill (narrative + facts) | Both the story and the specifics | Redundant back and forth | Production agents |
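The split between facts and narrative can be sketched as two fields with different rules. This is not Sanity.io's implementation: the `Decision:` / `Preference:` / `Fact:` prefixes are a made-up tagging convention standing in for LLM based extraction, and truncating to the last twelve words stands in for an LLM rewriting the narrative.

```python
from dataclasses import dataclass, field

# Hypothetical tagging convention; a real system would ask an LLM to extract facts.
FACT_PREFIXES = ("Decision:", "Preference:", "Fact:")


@dataclass
class DistilledMemory:
    facts: list = field(default_factory=list)   # append only, never compressed
    narrative: str = ""                         # may be rewritten and shortened


def distill(memory, window, max_narrative_words=12):
    """Fold one conversation window into memory: facts are kept, narrative is compressed."""
    story = []
    for message in window:
        if message.startswith(FACT_PREFIXES):
            if message not in memory.facts:
                memory.facts.append(message)
        else:
            story.append(message)
    # Stand-in for an LLM rewrite: merge old and new narrative, keep the most recent words.
    words = (memory.narrative + " " + " ".join(story)).split()
    memory.narrative = " ".join(words[-max_narrative_words:])
    return memory


memory = DistilledMemory()
distill(memory, [
    "We talked through auth options for the API",
    "Decision: use RS256 signed tokens",
    "Fact: token endpoint returned 401 invalid_audience",
])
distill(memory, [
    "Then we moved on to the refresh flow and rate limits for mobile clients",
    "Preference: user wants Go examples",
])
print(memory.facts)
print(memory.narrative)
```

After two windows the narrative has already been cut down, but all three facts are intact, including the exact error string.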
The Forgetting Problem (and How to Fix It)
Agents forget in five distinct ways. Each one needs a different fix.
| Forgetting Type | When it happens | Fix |
|---|---|---|
| Session boundary | Conversation ends | External memory store |
| Mid conversation | Context fills up | Summarize or distill |
| Retrieval failure | Memory exists but query does not match | Hybrid search + metadata tags |
| Interference | New info conflicts with old | Timestamps + "latest wins" policy |
| Gradual drift | Over many sessions, summaries drift from reality | Immutable fact anchors + periodic revalidation |
Session boundary forgetting is the most common. The fix is simple: persist memories to an external store before the session closes.
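For a prototype, the external store can be a JSON file. A minimal sketch, assuming a flat dictionary of facts: `save_memories`, `load_memories`, and the file name are hypothetical, and the temp directory is used only so the demo does not write into your project.

```python
import json
import tempfile
from pathlib import Path


def save_memories(path, facts):
    """Write the fact dictionary to disk as JSON."""
    Path(path).write_text(json.dumps(facts, indent=2), encoding="utf-8")


def load_memories(path):
    """Read facts back; an agent with no file yet simply starts with no memories."""
    p = Path(path)
    if not p.exists():
        return {}
    return json.loads(p.read_text(encoding="utf-8"))


store = Path(tempfile.gettempdir()) / "agent_memory_demo.json"

# Session 1: learn a fact, persist it before the session closes.
facts = load_memories(store)
facts["user.project"] = "Go microservice for payment processing"
save_memories(store, facts)

# Session 2: a fresh process would start here and read the same file.
restored = load_memories(store)
print(restored["user.project"])
```

This has no locking, no versioning, and no search. It still survives a closed tab, which the context window does not.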
Retrieval failure forgetting is the sneakiest. The memory exists in your vector store, but the user's query does not match it semantically. The fix: store memories with multiple phrasings, add keyword metadata, and use hybrid search (vector + keyword matching together).
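Hybrid search can be sketched as a weighted blend of a vector score and a keyword score. Everything here is illustrative: the two dimensional vectors are hand-written stand-ins for embeddings, the tags are made-up metadata, and `alpha=0.5` is an arbitrary weight rather than a recommended value.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def keyword_score(query, tags):
    # Fraction of a memory's tags that appear as words in the query.
    if not tags:
        return 0.0
    return len(set(query.lower().split()) & tags) / len(tags)


def hybrid_search(query, query_vector, store, alpha=0.5):
    """Rank memories by alpha * vector score + (1 - alpha) * keyword score."""
    scored = []
    for m in store:
        score = (alpha * cosine(query_vector, m["vector"])
                 + (1 - alpha) * keyword_score(query, m["tags"]))
        scored.append((score, m["text"]))
    scored.sort(reverse=True)
    return [text for _, text in scored]


store = [
    {"text": "User's login service uses JWT with RS256",
     "vector": [0.3, 0.9], "tags": {"auth", "jwt", "login"}},
    {"text": "User likes dark mode in the editor",
     "vector": [0.8, 0.5], "tags": {"editor", "theme"}},
]

query = "why does auth fail"
query_vector = [0.7, 0.6]  # pretend embedding that lands closer to the wrong memory

vector_only = hybrid_search(query, query_vector, store, alpha=1.0)
hybrid = hybrid_search(query, query_vector, store, alpha=0.5)
print(vector_only[0])
print(hybrid[0])
```

With vectors alone, the dark mode memory ranks first. The single matching tag `auth` is enough to pull the JWT memory to the top once keyword matching is blended in.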
Gradual drift forgetting is the hardest to detect. Over hundreds of interactions, accumulated summaries slowly diverge from what actually happened. The fix: anchor critical facts as immutable entries that never get summarized or compressed.
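The "latest wins" policy and immutable anchors fit in one small store. This is a sketch of one interpretation, not a standard API: `Fact`, `FactStore`, and the `pinned` flag are hypothetical names, and "immutable" is modeled here as "ordinary writes cannot overwrite it."

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    key: str
    value: str
    timestamp: float
    pinned: bool = False   # pinned facts act as immutable anchors


class FactStore:
    def __init__(self):
        self._facts = {}

    def write(self, fact):
        """Apply a write. Returns True if it was accepted."""
        current = self._facts.get(fact.key)
        if current is not None and current.pinned:
            return False   # anchors are never overwritten by ordinary writes
        if current is not None and current.timestamp >= fact.timestamp:
            return False   # latest wins: reject writes older than what is stored
        self._facts[fact.key] = fact
        return True

    def get(self, key):
        fact = self._facts.get(key)
        return fact.value if fact else None


store = FactStore()
store.write(Fact("deploy.target", "Heroku", timestamp=1.0))
store.write(Fact("deploy.target", "Kubernetes", timestamp=2.0))   # newer, accepted
store.write(Fact("deploy.target", "Heroku", timestamp=1.5))       # stale, rejected
store.write(Fact("billing.currency", "EUR", timestamp=1.0, pinned=True))
store.write(Fact("billing.currency", "USD", timestamp=9.0))       # anchor holds, rejected

print(store.get("deploy.target"))
print(store.get("billing.currency"))
```

Changing a pinned fact would need a separate, explicit path, which is where periodic revalidation belongs.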
Try This Today
1. Start with the simplest memory that works. A JSON file storing key facts between sessions beats a full vector database for most prototypes. Upgrade when you hit the limits, not before.
2. Never trust the context window as your only memory. Even with 1M tokens, retrieval accuracy drops for information buried in the middle. Treat the context window as a desk, not a filing cabinet.
3. When you add vector memory, invest time in what you embed. Extract structured facts like "User is building a Go microservice for payment processing" instead of raw messages like "Yeah let's use Go for this one."
Job Openings
Software Engineer, New Grad @Stripe: Apply Here
Software Engineer, Payments and Risk @Stripe: Apply Here
Software Engineer, Data & AI @Stripe: Apply Here
Software Engineer (1+ YOE) @Stripe: Apply Here
Software Engineer 2, iOS @Uber: Apply Here
See you in the next one!
Scortier, Signing Off!




