
Build Your First AI Agent from Scratch - Part 3: Memory, Context, and Retrieval

By AgentForge Hub · 2/10/2025 · 5 min read
Intermediate

Our agent now holds a coherent conversation, but it forgets everything the moment the session ends. When the Lumenly pilot met with the CFO, she asked, "Remember that pricing tweak from last week?" The agent stared blankly. This part fixes that embarrassment.

Thesis: agents need layered memory--short-term working context, episodic archives, and semantic recall. Implement each layer intentionally and it becomes easy to plug in summaries, vector search, and policy checks without rewriting the core agent.


Design the Memory Stack

We'll layer three constructs:

  1. Working Memory: the bounded history from Part 2, now extended with token-aware trimming.
  2. Episodic Memory: persistent transcripts stored in SQLite.
  3. Semantic Memory: embeddings + similarity search for cross-session recall.

Visualize it:

User Turn --> Working Queue --> Token Compressor --> LLM
                  |                      |
                  |                      \--> Semantic Summaries
                  v
           Episodic Store

Takeaway: Plan the flow first so every new module has a purpose.


Extend Working Memory with Token Awareness

Install tiktoken for token counts.

pip install tiktoken

Replace Part 2's ConversationStore with a token-aware WorkingMemory:

# src/agent_lab/memory/working.py
import tiktoken

from agent_lab.contracts import Message

class WorkingMemory:
    def __init__(self, max_tokens: int, model: str):
        self.encoder = tiktoken.encoding_for_model(model)
        self.max_tokens = max_tokens
        self.messages: list[Message] = []

    def append(self, message: Message) -> None:
        self.messages.append(message)
        self._truncate()

    def extend(self, messages: list[Message]) -> None:
        # Bulk insert; the retrieval middleware later in this part uses this.
        self.messages.extend(messages)
        self._truncate()

    def history(self) -> list[Message]:
        # Snapshot handed to the LLM client.
        return list(self.messages)

    def _truncate(self) -> None:
        # Walk backwards from the newest message, keeping turns until the
        # token budget is spent, then restore chronological order.
        tokens = 0
        trimmed = []
        for msg in reversed(self.messages):
            msg_tokens = len(self.encoder.encode(msg.content))
            if tokens + msg_tokens > self.max_tokens:
                break
            trimmed.append(msg)
            tokens += msg_tokens
        self.messages = list(reversed(trimmed))

Integrate it with CoreAgent by replacing the old store; the extend and history helpers are there so the retrieval middleware later in this part can splice memories in. Now the agent gracefully trims context when conversations run long.
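
To sanity-check the trimming before wiring it into the agent, here's a quick standalone run (the tiny 50-token budget is just for the demo):

from agent_lab.contracts import Message
from agent_lab.memory.working import WorkingMemory

wm = WorkingMemory(max_tokens=50, model="gpt-3.5-turbo")
for i in range(20):
    wm.append(Message("user", f"message number {i}"))

# Only the newest messages that fit the budget survive; the oldest drop first.
print(len(wm.messages))
print(sum(len(wm.encoder.encode(m.content)) for m in wm.messages))  # <= 50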

Takeaway: Token budgets belong in code, not just documentation.


Add Episodic Memory with SQLite

Create memory/episodic.py:

from dataclasses import dataclass
from pathlib import Path
import sqlite3

from agent_lab.contracts import Message

@dataclass
class Episode:
    id: str
    role: str
    content: str
    turn: int

class EpisodeStore:
    def __init__(self, db_path: Path):
        self.conn = sqlite3.connect(db_path)
        # One row per conversational turn, grouped by episode_id.
        self.conn.execute("""
            CREATE TABLE IF NOT EXISTS turns (
              episode_id TEXT,
              turn INTEGER,
              role TEXT,
              content TEXT
            )
        """)

    def log(self, episode_id: str, turn: int, message: Message) -> None:
        self.conn.execute(
            "INSERT INTO turns VALUES (?, ?, ?, ?)",
            (episode_id, turn, message.role, message.content),
        )
        self.conn.commit()

    def transcript(self, episode_id: str) -> list[Episode]:
        rows = self.conn.execute(
            "SELECT episode_id, role, content, turn FROM turns WHERE episode_id=? ORDER BY turn",
            (episode_id,),
        )
        return [Episode(*row) for row in rows]

Modify CoreAgent to assign an episode_id (UUID) per session and log each turn. Persist transcripts so Part 5's deployment pipeline can replay them.
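
Here's one way that wiring might look. CoreAgent's real interface was built in Part 2, so treat the constructor shape and the llm.complete call as assumptions to adapt:

import uuid
from pathlib import Path

from agent_lab.contracts import Message
from agent_lab.memory.episodic import EpisodeStore
from agent_lab.memory.working import WorkingMemory

class CoreAgent:
    def __init__(self, llm, db_path: Path = Path("agent_lab.db")):
        self.llm = llm
        self.working = WorkingMemory(max_tokens=3000, model="gpt-3.5-turbo")
        self.episodes = EpisodeStore(db_path)
        self.episode_id = str(uuid.uuid4())  # fresh episode per session
        self.turn = 0

    def send(self, message: Message) -> Message:
        self.working.append(message)
        self.episodes.log(self.episode_id, self.turn, message)
        reply = self.llm.complete(self.working.history())  # assumed Part 2 client API
        self.working.append(reply)
        self.episodes.log(self.episode_id, self.turn + 1, reply)
        self.turn += 2
        return reply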

Takeaway: If you can't replay a conversation, you can't debug it.


Build Semantic Memory with Embeddings

Use OpenAI embeddings or a local alternative. We'll store vectors alongside metadata.

# src/agent_lab/memory/semantic.py
import hashlib
import json
from pathlib import Path

import numpy as np
from openai import OpenAI

class SemanticMemory:
    def __init__(self, path: Path, model: str = "text-embedding-3-small"):
        self.path = path
        self.model = model
        self.client = OpenAI()
        self.path.mkdir(parents=True, exist_ok=True)

    def embed(self, text: str) -> np.ndarray:
        resp = self.client.embeddings.create(model=self.model, input=text)
        return np.array(resp.data[0].embedding, dtype=np.float32)

    def upsert(self, episode_id: str, snippet: str) -> None:
        vector = self.embed(snippet)
        payload = {
            "episode_id": episode_id,
            "snippet": snippet,
            "vector": vector.tolist(),
        }
        # Use a stable content hash so re-upserting the same snippet overwrites
        # its file (the builtin hash() is salted per process, so it won't).
        digest = hashlib.sha256(snippet.encode("utf-8")).hexdigest()[:16]
        file = self.path / f"{episode_id}_{digest}.json"
        file.write_text(json.dumps(payload), encoding="utf-8")

    def search(self, query: str, k: int = 3) -> list[tuple[float, dict]]:
        q_vec = self.embed(query)
        results = []
        for file in self.path.glob("*.json"):
            item = json.loads(file.read_text(encoding="utf-8"))
            vec = np.array(item["vector"], dtype=np.float32)
            # Cosine similarity between the query and the stored snippet.
            sim = float(np.dot(q_vec, vec) / (np.linalg.norm(q_vec) * np.linalg.norm(vec)))
            results.append((sim, item))
        # Sort on similarity alone; letting sorted() compare the dict payloads
        # on ties would raise a TypeError.
        return sorted(results, key=lambda r: r[0], reverse=True)[:k]

Hook semantic memory into the agent: after each assistant reply, summarize the exchange and upsert. During the next user input, query for similar snippets and prepend them to working memory (Message("system", f"Relevant memory: ...")).

Takeaway: Semantic recall turns one-off chats into institutional knowledge.


Implement Summaries and Retrieval Middleware

Long transcripts will exceed embedding budgets. Use a simple summarizer to condense each turn.

from openai import OpenAI

from agent_lab.contracts import Message

def summarize(turn: list[Message], client: OpenAI) -> str:
    prompt = [
        {"role": "system", "content": "Summarize the exchange in <=60 words."},
        # Explicit dicts are safer than m.__dict__ if Message grows extra fields.
        *[{"role": m.role, "content": m.content} for m in turn],
    ]
    resp = client.chat.completions.create(model="gpt-3.5-turbo", messages=prompt)
    return resp.choices[0].message.content.strip()

Wrap everything in middleware inside CoreAgent.send:

  1. Before LLM call: fetch semantic memories relevant to the user prompt, append as context.
  2. After LLM response: summarize the user+assistant pair, persist to episodic store, upsert semantic vector.

Pseudo-code:

memories = semantic.search(message.content)
memory_prompts = [Message("system", f"Memory: {m['snippet']}") for _, m in memories]
working.extend(memory_prompts + [message])
reply = llm.complete(working.history())
summary = summarize([message, reply], llm.client)
episode.log(episode_id, turn, message)
episode.log(episode_id, turn + 1, reply)
semantic.upsert(episode_id, summary)

Takeaway: Memory flows should be deterministic so you can test them easily.


Provide Developer Tooling for Inspection

Add CLI commands:

  • python -m agent_lab.cli memories --query "billing terms" - lists top semantic hits.
  • python -m agent_lab.cli transcript --episode <id> - replays a session.
  • python -m agent_lab.cli summarize --episode <id> - regenerates a long-form summary.

This empowers human reviewers (support, legal) to audit conversations--the same traceability regulators expect.
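
A minimal argparse sketch of the first two commands (the storage paths here are placeholders--point them at wherever your stores actually live):

# src/agent_lab/cli.py
import argparse
from pathlib import Path

from agent_lab.memory.episodic import EpisodeStore
from agent_lab.memory.semantic import SemanticMemory

def main() -> None:
    parser = argparse.ArgumentParser(prog="agent_lab")
    sub = parser.add_subparsers(dest="command", required=True)

    memories = sub.add_parser("memories", help="list top semantic hits")
    memories.add_argument("--query", required=True)

    transcript = sub.add_parser("transcript", help="replay a session")
    transcript.add_argument("--episode", required=True)

    args = parser.parse_args()
    if args.command == "memories":
        store = SemanticMemory(Path("semantic_store"))  # placeholder path
        for sim, item in store.search(args.query):
            print(f"{sim:.3f}  {item['snippet']}")
    elif args.command == "transcript":
        for ep in EpisodeStore(Path("agent_lab.db")).transcript(args.episode):
            print(f"[{ep.turn}] {ep.role}: {ep.content}")

if __name__ == "__main__":
    main()

The summarize command follows the same pattern, calling the summarizer from the previous section on a replayed transcript.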


Test and Validate

Unit tests:

from agent_lab.contracts import Message
from agent_lab.memory.working import WorkingMemory

def test_working_memory_truncates():
    wm = WorkingMemory(max_tokens=20, model="gpt-3.5-turbo")
    for i in range(10):
        wm.append(Message("user", "hello " * i))
    assert wm.messages[-1].role == "user"
    assert sum(len(wm.encoder.encode(m.content)) for m in wm.messages) <= 20

Integration tests (mark them, run nightly):

import pytest

from agent_lab.memory.semantic import SemanticMemory

@pytest.mark.integration
def test_semantic_search_roundtrip(tmp_path):
    store = SemanticMemory(tmp_path)
    store.upsert("ep1", "Customer likes dark mode")
    results = store.search("dark theme")
    assert results
    assert "dark mode" in results[0][1]["snippet"]

Add a script scripts/reindex_semantic.py to rebuild vectors if you change models.
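
A sketch of that script, assuming the JSON-file layout from SemanticMemory above (the store path and target model are placeholders):

# scripts/reindex_semantic.py
import json
from pathlib import Path

from agent_lab.memory.semantic import SemanticMemory

def reindex(store_path: Path, new_model: str) -> None:
    store = SemanticMemory(store_path, model=new_model)
    for file in store_path.glob("*.json"):
        payload = json.loads(file.read_text(encoding="utf-8"))
        # Re-embed the stored snippet with the new model and overwrite in place.
        payload["vector"] = store.embed(payload["snippet"]).tolist()
        file.write_text(json.dumps(payload), encoding="utf-8")

if __name__ == "__main__":
    reindex(Path("semantic_store"), "text-embedding-3-large")  # placeholder args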


Recap and Handoff to Part 4

You now have:

  • Token-aware working memory.
  • SQLite episodic transcripts with replay.
  • Semantic memory via embeddings and summaries.
  • CLI tooling for search, transcripts, and summaries.
  • Tests and scripts to keep the system trustworthy.

In Part 4 we will teach the agent to use tools and APIs. Memory makes tool decisions better: the agent will know whether it already pulled a report or which API token to reuse. Before continuing, run:

python -m agent_lab.cli chat
python -m agent_lab.cli memories --query "pricing"
pytest -m "not integration"

Then dive into Part 4: Tool Usage and API Integrations to connect your agent to the outside world.

