
Build a Personal AI Assistant – Part 3: Memory, Context Windows, and Semantic Recall

By AgentForge Hub · 8/16/2025 · 5 min read · Intermediate

Real assistants remember what mattered last week. In Lumenly’s pilot, customers quickly lost trust when the assistant forgot custom pricing rules. This lesson gives your assistant durable memory: token-aware working context, episodic storage in SQLite, and semantic search using embeddings.


Architecture for Memory Layers

We’ll introduce three modules:

  1. WorkingMemory: trims conversation history based on tokens.
  2. EpisodicStore: persists full transcripts with metadata.
  3. SemanticMemory: stores summaries + embeddings for retrieval.

Data flow per turn:

User turn ─► WorkingMemory ─► LLM
    │               │
    │               ├─► Summary ────► SemanticMemory
    │               └─► Transcript ─► EpisodicStore
    └─► Retrieval (searches SemanticMemory) ─► LLM context

Takeaway: Think of memory as a pipeline, not a single database.
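
All three modules import a shared Message type. If earlier parts didn't already give you one, a minimal sketch of src/core/types.ts (the exact shape is an assumption):

// src/core/types.ts — minimal sketch; extend if your existing type differs.
export interface Message {
  role: "system" | "user" | "assistant";
  content: string;
}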


Working Memory with Token Budgets

Add the tiktoken bindings (this package exports the encoding_for_model helper used below):

npm install tiktoken

src/core/workingMemory.ts:

import { encoding_for_model } from "tiktoken";
import { Message } from "./types";

export class WorkingMemory {
  private history: Message[] = [];
  // Use the tokenizer that matches the model you call in Assistant.
  private encoder = encoding_for_model("gpt-4o-mini");

  constructor(private readonly maxTokens = 1200) {}

  append(message: Message) {
    this.history.push(message);
    this.trim();
  }

  get() {
    return [...this.history];
  }

  // Walk backwards from the newest message, keeping turns until the token
  // budget is spent, so the most recent context always survives.
  private trim() {
    let tokens = 0;
    const kept: Message[] = [];
    for (let i = this.history.length - 1; i >= 0; i -= 1) {
      const msg = this.history[i];
      const cost = this.encoder.encode(msg.content).length;
      if (tokens + cost > this.maxTokens) break;
      kept.push(msg);
      tokens += cost;
    }
    this.history = kept.reverse();
  }
}

Swap this into Assistant to replace the old conversation store.
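
A minimal sketch of the swap, assuming your Assistant from earlier parts held its history in a plain array (names here are assumptions):

// Sketch: the token-aware store is a drop-in for the old array.
const workingMemory = new WorkingMemory(1200);

workingMemory.append({ role: "user", content: "Remind me of the pricing rules." });
const history = workingMemory.get(); // already trimmed to the token budget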


Episodic Storage with SQLite

Install better-sqlite3:

npm install better-sqlite3

src/memory/episodic.ts:

import Database from "better-sqlite3";
import fs from "node:fs";
import { Message } from "../core/types";
import { nanoid } from "nanoid";

// better-sqlite3 will not create missing directories, so ensure data/ exists.
fs.mkdirSync("data", { recursive: true });

export class EpisodicStore {
  private db = new Database("data/assistant.db");
  private episodeId = nanoid(); // one episode per process lifetime
  private turn = 0;

  constructor() {
    this.db
      .prepare(
        `CREATE TABLE IF NOT EXISTS turns(
          episode_id TEXT,
          turn INTEGER,
          role TEXT,
          content TEXT,
          created_at DATETIME DEFAULT CURRENT_TIMESTAMP
        )`,
      )
      .run();
  }

  log(message: Message) {
    this.db
      .prepare(
        `INSERT INTO turns(episode_id, turn, role, content) VALUES (?, ?, ?, ?)`,
      )
      .run(this.episodeId, this.turn++, message.role, message.content);
  }

  transcript(episodeId = this.episodeId) {
    return this.db
      .prepare(`SELECT role, content FROM turns WHERE episode_id=? ORDER BY turn`)
      .all(episodeId);
  }
}

Call episodic.log() inside Assistant.send for both user and assistant messages.
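
A hedged sketch of what Assistant.send might look like once both stores are wired in (callLLM and the field names are assumptions, since your Assistant from earlier parts may differ):

// Sketch: persist both sides of the exchange as they happen.
async send(message: Message): Promise<Message> {
  this.workingMemory.append(message);
  this.episodic.log(message); // user turn

  const assistantMessage = await this.callLLM(this.workingMemory.get());

  this.workingMemory.append(assistantMessage);
  this.episodic.log(assistantMessage); // assistant turn
  return assistantMessage;
}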


Semantic Memory and Summaries

You could pull in LangChain's embedding utilities, but we'll keep using the OpenAI SDK directly:

// src/memory/semantic.ts
import OpenAI from "openai";
import fs from "node:fs";
import path from "node:path";
import { Message } from "../core/types";

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment
// Flat-file vector store: one JSON file per summary. Fine for a prototype;
// searchMemories scans every file, so move to a vector DB as this grows.
const storeDir = path.join("data", "semantic");
fs.mkdirSync(storeDir, { recursive: true });

export async function summarizeTurn(messages: Message[]): Promise<string> {
  const resp = await client.chat.completions.create({
    model: "gpt-3.5-turbo",
    messages: [
      { role: "system", content: "Summarize in <=80 words focusing on commitments." },
      ...messages,
    ],
  });
  return resp.choices[0].message.content ?? "";
}

export async function storeSummary(episodeId: string, summary: string) {
  const embedding = await client.embeddings.create({
    model: "text-embedding-3-small",
    input: summary,
  });
  const payload = {
    episodeId,
    summary,
    vector: embedding.data[0].embedding,
  };
  const file = path.join(storeDir, `${episodeId}-${Date.now()}.json`);
  fs.writeFileSync(file, JSON.stringify(payload));
}

export async function searchMemories(query: string, topK = 3) {
  const qVec = await client.embeddings.create({
    model: "text-embedding-3-small",
    input: query,
  });
  const queryVec = qVec.data[0].embedding;
  return fs
    .readdirSync(storeDir)
    .map((file) => JSON.parse(fs.readFileSync(path.join(storeDir, file), "utf-8")))
    .map((entry) => ({
      summary: entry.summary,
      score: cosine(entry.vector, queryVec),
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}

// Cosine similarity: dot product normalized by the vectors' magnitudes.
function cosine(a: number[], b: number[]) {
  const dot = a.reduce((acc, v, i) => acc + v * b[i], 0);
  const magA = Math.sqrt(a.reduce((acc, v) => acc + v * v, 0));
  const magB = Math.sqrt(b.reduce((acc, v) => acc + v * v, 0));
  return dot / (magA * magB);
}

Integrate:

const summary = await summarizeTurn([message, assistantMessage]);
await storeSummary(this.episodeId, summary);

Takeaway: Summaries create bite-sized vectors and keep storage costs sane.


Retrieval Middleware

Before sending a user message to the LLM, fetch semantically similar summaries:

const memories = await searchMemories(message.content);
const memoryMessages = memories.map((mem) => ({
  role: "system" as const,
  content: `Relevant past insight: ${mem.summary}`,
}));

const combinedHistory = [...memoryMessages, ...this.workingMemory.get(), message];

If you want to avoid duplication, dedupe summaries by hashing content.
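
A sketch of that dedupe using Node's crypto module (keeping the seen set per session is an assumption; persist it if you need cross-session dedupe):

import { createHash } from "node:crypto";

// Skip memories whose summary text has already been injected.
const seen = new Set<string>();
const dedupedMessages = memories
  .filter((mem) => {
    const key = createHash("sha256").update(mem.summary).digest("hex");
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  })
  .map((mem) => ({
    role: "system" as const,
    content: `Relevant past insight: ${mem.summary}`,
  }));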

Takeaway: Inject relevant memories as system messages so the model treats them as authoritative context.


CLI Commands for Memory Ops

Extend cli.ts:

// Assumed paths; adjust if your cli.ts lives somewhere other than src/.
import { searchMemories } from "./memory/semantic";
import { EpisodicStore } from "./memory/episodic";

program
  .command("memories")
  .argument("<query>", "search term")
  .action(async (query) => {
    const hits = await searchMemories(query);
    hits.forEach((hit) =>
      console.log(`${hit.score.toFixed(2)} ▸ ${hit.summary}`),
    );
  });

program
  .command("transcript")
  .option("--episode <id>")
  .action((opts) => {
    const store = new EpisodicStore();
    store.transcript(opts.episode).forEach((row) =>
      console.log(`[${row.role}] ${row.content}`),
    );
  });

Takeaway: Memory inspection commands give humans confidence before production.


Testing Memory Behavior

Add tests:

import { describe, it, expect } from "vitest";
import { WorkingMemory } from "../src/core/workingMemory";

describe("WorkingMemory", () => {
  it("respects token budgets", () => {
    const wm = new WorkingMemory(50);
    for (let i = 0; i < 20; i += 1) {
      wm.append({ role: "user", content: `message ${"!".repeat(i)}` });
    }
    expect(wm.get().length).toBeLessThan(20);
  });
});

For semantic memory, mock embeddings by stubbing client.embeddings.create. Integration tests can run nightly with the real API.
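
A sketch of that stub with Vitest's module mocking (the fixed vector is a placeholder so scores stay deterministic):

import { describe, it, expect, vi } from "vitest";

// vi.mock is hoisted, so semantic.ts receives the fake client on import.
vi.mock("openai", () => ({
  default: class {
    embeddings = {
      create: vi.fn().mockResolvedValue({
        data: [{ embedding: [0.1, 0.2, 0.3] }],
      }),
    };
  },
}));

import { searchMemories } from "../src/memory/semantic";

describe("searchMemories", () => {
  it("returns at most topK scored summaries", async () => {
    const hits = await searchMemories("pricing", 3);
    expect(hits.length).toBeLessThanOrEqual(3);
  });
});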


Verification Checklist

  • npm run cli:chat retains context beyond 30 turns without exceeding token limits.
  • data/assistant.db contains transcripts.
  • data/semantic/*.json stores summary vectors.
  • npm run cli -- memories "pricing" returns relevant summaries.
  • npm run test covers the working-memory logic.
  • Observability (metrics.ndjson) now includes new metrics (e.g., memory.summary_tokens); a logging sketch follows.
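
A sketch of emitting that metric, assuming the metrics.ndjson file from earlier parts is a simple append-only log (the helper name is hypothetical):

import fs from "node:fs";
import { encoding_for_model } from "tiktoken";

// Hypothetical helper: record how many tokens each stored summary costs.
export function recordSummaryTokens(summary: string) {
  const encoder = encoding_for_model("gpt-4o-mini");
  const event = {
    name: "memory.summary_tokens",
    value: encoder.encode(summary).length,
    ts: new Date().toISOString(),
  };
  encoder.free(); // tiktoken encoders hold WASM memory; release when done
  fs.appendFileSync("metrics.ndjson", JSON.stringify(event) + "\n");
}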

Once these pass, your assistant can remember what matters—ready for API/tool integrations in Part 4.

