Build Your First AI Agent from Scratch - Part 3: Memory, Context, and Retrieval

Our agent now holds a coherent conversation, but it forgets everything the moment the session ends. When the Lumenly pilot met with the CFO, she asked, "Remember that pricing tweak from last week?" The agent stared blankly. This part fixes that embarrassment.
Thesis: agents need three layers of memory (short-term working context, episodic archives, and semantic recall). Implement each layer intentionally and it becomes easy to plug in summaries, vector search, and policy checks without rewriting the core agent.
Design the Memory Stack
We'll layer three constructs:
- Working Memory: the bounded history built in Part 2; we'll add token-aware trimming.
- Episodic Memory: persistent transcripts stored in SQLite.
- Semantic Memory: embeddings + similarity search for cross-session recall.
Visualize it:
User Turn --> Working Queue --> Token Compressor --> LLM
    |                                  |
    |                                  \_-> Semantic Summaries
    v
Episodic Store
Takeaway: Plan the flow first so every new module has a purpose.
Extend Working Memory with Token Awareness
Install tiktoken for token counts.
pip install tiktoken
Replace ConversationStore with a token-aware WorkingMemory class:
# src/agent_lab/memory/working.py
import tiktoken

from agent_lab.contracts import Message


class WorkingMemory:
    """Bounded conversation history that trims by token count."""

    def __init__(self, max_tokens: int, model: str):
        self.encoder = tiktoken.encoding_for_model(model)
        self.max_tokens = max_tokens
        self.messages: list[Message] = []

    def append(self, message: Message) -> None:
        self.messages.append(message)
        self._truncate()

    def _truncate(self) -> None:
        # Walk backwards from the newest message and keep whatever fits the budget.
        # Note: a single message larger than the budget leaves the history empty.
        tokens = 0
        trimmed = []
        for msg in reversed(self.messages):
            msg_tokens = len(self.encoder.encode(msg.content))
            if tokens + msg_tokens > self.max_tokens:
                break
            trimmed.append(msg)
            tokens += msg_tokens
        self.messages = list(reversed(trimmed))
Integrate with CoreAgent by replacing the old store. Now the agent gracefully trims context when conversations run long.
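To see the trimming behavior without installing tiktoken, here is a minimal sketch that swaps the encoder for a whitespace word count. The names trim_to_budget and count_tokens are hypothetical, the Message dataclass is a stand-in for the one in agent_lab.contracts, and word counts only approximate real token counts.

```python
from dataclasses import dataclass


@dataclass
class Message:
    role: str
    content: str


def count_tokens(text: str) -> int:
    # Stand-in tokenizer (assumption): one token per whitespace-separated word.
    return len(text.split())


def trim_to_budget(messages: list[Message], max_tokens: int) -> list[Message]:
    """Keep the newest messages whose combined token count fits the budget."""
    kept: list[Message] = []
    used = 0
    for msg in reversed(messages):
        cost = count_tokens(msg.content)
        if used + cost > max_tokens:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))


history = [
    Message("user", "what are your pricing tiers"),           # 5 words
    Message("assistant", "we offer starter and team plans"),  # 6 words
    Message("user", "does team include sso"),                 # 4 words
]
trimmed = trim_to_budget(history, max_tokens=10)
```

With a budget of 10 the oldest message is dropped and the two newest survive in their original order, which is the same newest-first policy WorkingMemory applies.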
Takeaway: Token budgets belong in code, not just documentation.
Add Episodic Memory with SQLite
Create memory/episodic.py:
from dataclasses import dataclass
from pathlib import Path
import sqlite3

from agent_lab.contracts import Message


@dataclass
class Episode:
    id: str
    role: str
    content: str
    turn: int


class EpisodeStore:
    def __init__(self, db_path: Path):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute("""
            CREATE TABLE IF NOT EXISTS turns (
                episode_id TEXT,
                turn INTEGER,
                role TEXT,
                content TEXT
            )
        """)

    def log(self, episode_id: str, turn: int, message: Message) -> None:
        self.conn.execute(
            "INSERT INTO turns VALUES (?, ?, ?, ?)",
            (episode_id, turn, message.role, message.content),
        )
        self.conn.commit()

    def transcript(self, episode_id: str) -> list[Episode]:
        # Column order matches the Episode fields: id, role, content, turn.
        rows = self.conn.execute(
            "SELECT episode_id, role, content, turn FROM turns WHERE episode_id=? ORDER BY turn",
            (episode_id,),
        )
        return [Episode(*row) for row in rows]
Modify CoreAgent to assign an episode_id (UUID) per session and log each turn. Persist transcripts so Part 5's deployment pipeline can replay them.
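As a rough illustration of the per-session id and turn logging described above, the sketch below uses a hypothetical Session class in place of CoreAgent and an in-memory SQLite database. The class name, its methods, and the sample messages are assumptions for demonstration only.

```python
import sqlite3
import uuid


class Session:
    """Hypothetical stand-in for the part of CoreAgent that owns episode logging."""

    def __init__(self, conn: sqlite3.Connection):
        self.conn = conn
        self.episode_id = str(uuid.uuid4())  # one id per session
        self.turn = 0
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS turns ("
            "episode_id TEXT, turn INTEGER, role TEXT, content TEXT)"
        )

    def log(self, role: str, content: str) -> None:
        self.conn.execute(
            "INSERT INTO turns VALUES (?, ?, ?, ?)",
            (self.episode_id, self.turn, role, content),
        )
        self.conn.commit()
        self.turn += 1

    def replay(self) -> list[tuple[int, str, str]]:
        # Only rows for this session's episode_id, in turn order.
        rows = self.conn.execute(
            "SELECT turn, role, content FROM turns WHERE episode_id=? ORDER BY turn",
            (self.episode_id,),
        )
        return list(rows)


conn = sqlite3.connect(":memory:")
first, second = Session(conn), Session(conn)
first.log("user", "Remember that pricing tweak?")
first.log("assistant", "Yes, I have it on record.")
second.log("user", "Unrelated question")
```

Because every row carries the episode id, two sessions can share one database file and still replay independently.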
Takeaway: If you can't replay a conversation, you can't debug it.
Build Semantic Memory with Embeddings
Use OpenAI embeddings or a local alternative. We'll store vectors alongside metadata.
# src/agent_lab/memory/semantic.py
import hashlib
import json
from pathlib import Path

import numpy as np
from openai import OpenAI


class SemanticMemory:
    def __init__(self, path: Path, model: str = "text-embedding-3-small"):
        self.path = path
        self.model = model
        self.client = OpenAI()
        self.path.mkdir(parents=True, exist_ok=True)

    def embed(self, text: str) -> np.ndarray:
        resp = self.client.embeddings.create(model=self.model, input=text)
        return np.array(resp.data[0].embedding, dtype=np.float32)

    def upsert(self, episode_id: str, snippet: str) -> None:
        vector = self.embed(snippet)
        payload = {
            "episode_id": episode_id,
            "snippet": snippet,
            "vector": vector.tolist(),
        }
        # hash() on str is salted per process, so use a stable digest for the filename.
        digest = hashlib.sha256(snippet.encode("utf-8")).hexdigest()[:16]
        file = self.path / f"{episode_id}_{digest}.json"
        file.write_text(json.dumps(payload), encoding="utf-8")

    def search(self, query: str, k: int = 3):
        q_vec = self.embed(query)
        results = []
        for file in self.path.glob("*.json"):
            item = json.loads(file.read_text(encoding="utf-8"))
            vec = np.array(item["vector"], dtype=np.float32)
            # Cosine similarity between the query and the stored vector.
            sim = float(np.dot(q_vec, vec) / (np.linalg.norm(q_vec) * np.linalg.norm(vec)))
            results.append((sim, item))
        # Sort on the score only; comparing the dict payloads on ties would raise TypeError.
        return sorted(results, key=lambda r: r[0], reverse=True)[:k]
Hook semantic memory into the agent: after each assistant reply, summarize the exchange and upsert. During the next user input, query for similar snippets and prepend them to working memory (Message("system", f"Relevant memory: ...")).
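The retrieval half of that hook can be sketched offline. The example below replaces the embedding model with a bag-of-words counter (toy_embed), so the scores are only a crude proxy for real embedding similarity. The names recall and build_context, and the rule of dropping zero-score snippets, are assumptions rather than part of the agent_lab code.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    # Stand-in for an embedding model (assumption): a bag-of-words count vector.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def recall(query: str, snippets: list[str], k: int = 2) -> list[str]:
    """Return up to k stored snippets with a non-zero similarity to the query."""
    q = toy_embed(query)
    scored = [(cosine(q, toy_embed(s)), s) for s in snippets]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [s for _, s in scored[:k]]


def build_context(user_text: str, snippets: list[str]) -> list[dict]:
    # Recalled snippets go in as system messages ahead of the new user turn.
    memory = [
        {"role": "system", "content": f"Relevant memory: {s}"}
        for s in recall(user_text, snippets)
    ]
    return memory + [{"role": "user", "content": user_text}]


stored = [
    "customer asked about the pricing tweak for annual plans",
    "customer prefers dark mode on every dashboard",
]
context = build_context("what was the pricing tweak", stored)
```

Only the pricing snippet shares words with the query, so it is the single memory line placed before the user message.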
Takeaway: Semantic recall turns one-off chats into institutional knowledge.
Implement Summaries and Retrieval Middleware
Long transcripts will exceed embedding budgets. Use a simple summarizer to condense each turn.
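def summarize(turn: list[Message], client: OpenAI) -> str:
    prompt = [
        {"role": "system", "content": "Summarize the exchange in <=60 words."},
        # Send only the fields the chat API expects.
        *[{"role": m.role, "content": m.content} for m in turn],
    ]
    resp = client.chat.completions.create(model="gpt-3.5-turbo", messages=prompt)
    # content can be None, so fall back to an empty string before stripping.
    return (resp.choices[0].message.content or "").strip()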
Wrap everything in middleware inside CoreAgent.send:
- Before LLM call: fetch semantic memories relevant to the user prompt, append as context.
- After LLM response: summarize the user+assistant pair, persist to episodic store, upsert semantic vector.
Pseudo-code:
memories = semantic.search(message.content)
memory_prompts = [Message("system", f"Memory: {m['snippet']}") for _, m in memories]
for prompt in memory_prompts + [message]:
    working.append(prompt)
reply = llm.complete(working.messages)
summary = summarize([message, reply], llm.client)
episode.log(episode_id, turn, message)
episode.log(episode_id, turn + 1, reply)
semantic.upsert(episode_id, summary)
Takeaway: Memory flows should be deterministic so you can test them easily.
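One way to make that flow testable is to inject the model and summarizer as plain functions. The sketch below is a simplified stand-in, not the real CoreAgent.send: FakeMemory, fake_complete, and fake_summarize are hypothetical, and keyword overlap replaces embedding search.

```python
from dataclasses import dataclass, field


@dataclass
class FakeMemory:
    """In-memory stand-in for the semantic and episodic stores."""

    snippets: list[str] = field(default_factory=list)
    log: list[tuple[int, str, str]] = field(default_factory=list)

    def search(self, query: str) -> list[str]:
        # Keyword overlap instead of vector similarity (assumption for the demo).
        words = set(query.lower().split())
        return [s for s in self.snippets if words & set(s.lower().split())]


def send(user_text: str, turn: int, memory: FakeMemory, complete, summarize) -> str:
    # Before the LLM call: recalled snippets become system context.
    context = [("system", f"Memory: {s}") for s in memory.search(user_text)]
    context.append(("user", user_text))
    reply = complete(context)
    # After the LLM call: log both turns, then store a summary for later recall.
    memory.log.append((turn, "user", user_text))
    memory.log.append((turn + 1, "assistant", reply))
    memory.snippets.append(summarize(user_text, reply))
    return reply


def fake_complete(context):
    # Deterministic "model": reports how many memory lines it was given.
    return f"seen {sum(1 for role, _ in context if role == 'system')} memories"


def fake_summarize(user_text, reply):
    return f"user asked: {user_text}"


memory = FakeMemory()
first = send("pricing for annual plans", 0, memory, fake_complete, fake_summarize)
second = send("remind me about pricing", 2, memory, fake_complete, fake_summarize)
```

With no network calls involved, a unit test can assert that the second call sees exactly one recalled memory and that turns are logged in order.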
Provide Developer Tooling for Inspection
Add CLI commands:
- python -m agent_lab.cli memories --query "billing terms" - lists top semantic hits.
- python -m agent_lab.cli transcript --episode <id> - replays a session.
- python -m agent_lab.cli summarize --episode <id> - regenerates a long-form summary.
This lets human reviewers (support, legal) audit conversations, which is the kind of audit trail regulators expect.
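The article does not show the internals of agent_lab.cli, so the following is only a sketch of how the three subcommands might be wired with argparse. The build_parser and describe functions are hypothetical, and describe returns a description of the action instead of calling the real memory stores.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="agent_lab.cli")
    sub = parser.add_subparsers(dest="command", required=True)

    memories = sub.add_parser("memories", help="list top semantic hits")
    memories.add_argument("--query", required=True)

    transcript = sub.add_parser("transcript", help="replay a session")
    transcript.add_argument("--episode", required=True)

    summarize = sub.add_parser("summarize", help="regenerate a long-form summary")
    summarize.add_argument("--episode", required=True)
    return parser


def describe(argv: list[str]) -> str:
    # Placeholder dispatch: real handlers would call the memory stores instead.
    args = build_parser().parse_args(argv)
    if args.command == "memories":
        return f"search semantic memory for {args.query!r}"
    if args.command == "transcript":
        return f"replay episode {args.episode}"
    return f"summarize episode {args.episode}"


plan = describe(["memories", "--query", "billing terms"])
```

Keeping parsing separate from the handlers makes the command surface easy to test without touching SQLite or the embedding API.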
Test and Validate
Unit tests:
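from agent_lab.contracts import Message
from agent_lab.memory.working import WorkingMemory


def test_working_memory_truncates():
    wm = WorkingMemory(max_tokens=20, model="gpt-3.5-turbo")
    for i in range(10):
        wm.append(Message("user", "hello " * i))
    # The newest message survives and the total stays within budget.
    assert wm.messages[-1].role == "user"
    assert sum(len(wm.encoder.encode(m.content)) for m in wm.messages) <= 20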
Integration tests (mark them, run nightly):
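import pytest

from agent_lab.memory.semantic import SemanticMemory


@pytest.mark.integration
def test_semantic_search_roundtrip(tmp_path):
    # Calls the real embeddings API, so it needs network access and an API key.
    store = SemanticMemory(tmp_path)
    store.upsert("ep1", "Customer likes dark mode")
    results = store.search("dark theme")
    assert results
    assert "dark mode" in results[0][1]["snippet"]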
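For the integration marker to be selectable without an unknown-mark warning, pytest needs to know about it. A minimal sketch, assuming the project keeps a conftest.py at the test root (the article does not say where markers are registered), could look like this:

```python
# conftest.py (assumed location)
def pytest_configure(config):
    # Register the custom marker so pytest does not warn about an unknown mark.
    config.addinivalue_line(
        "markers",
        "integration: calls external services; deselect with -m 'not integration'",
    )
```

The same registration can live in the markers section of the project's pytest configuration file if you prefer to keep it out of Python code.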
Add a script scripts/reindex_semantic.py to rebuild vectors if you change models.
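The script itself is not shown in the article, so here is one possible shape for it. It assumes the JSON layout written by SemanticMemory.upsert (episode_id, snippet, vector), takes the embedding function as a parameter, and adds a hypothetical "model" field to record which model produced each vector. fake_embed is a stand-in so the sketch runs offline.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def reindex(store_dir: Path, embed: Callable[[str], list[float]], model: str) -> int:
    """Re-embed every stored snippet in place and return how many files changed."""
    count = 0
    for file in sorted(store_dir.glob("*.json")):
        item = json.loads(file.read_text(encoding="utf-8"))
        item["vector"] = embed(item["snippet"])
        item["model"] = model  # assumption: record which model produced the vector
        file.write_text(json.dumps(item), encoding="utf-8")
        count += 1
    return count


def fake_embed(text: str) -> list[float]:
    # Stand-in for a real embedding call so the sketch runs offline.
    return [float(len(text)), float(text.count(" "))]


with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp)
    old = {"episode_id": "ep1", "snippet": "Customer likes dark mode", "vector": [0.0]}
    (store / "ep1_demo.json").write_text(json.dumps(old), encoding="utf-8")
    updated = reindex(store, fake_embed, model="new-embedding-model")
    rebuilt = json.loads((store / "ep1_demo.json").read_text(encoding="utf-8"))
```

In a real run you would pass SemanticMemory.embed (converted to a list) as the embed argument. Vectors from different models are not comparable, so every file should be rebuilt in the same pass.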
Recap and Handoff to Part 4
You now have:
- Token-aware working memory.
- SQLite episodic transcripts with replay.
- Semantic memory via embeddings and summaries.
- CLI tooling for search, transcripts, and summaries.
- Tests and scripts to keep the system trustworthy.
In Part 4 we will teach the agent to use tools and APIs. Memory makes tool decisions better: the agent will know whether it already pulled a report or which API token to reuse. Before continuing, run:
python -m agent_lab.cli chat
python -m agent_lab.cli memories --query "pricing"
pytest -m "not integration"
Then dive into Part 4: Tool Usage and API Integrations to connect your agent to the outside world.



