The problem: teaching what you haven’t taught yet

I have a programming course with 47 classes. Each class has notes (where I explain stuff) and labs (where students practice). And I have a problem: sometimes I use concepts in labs that I haven’t explained in the notes yet.

“Alright, in this exercise use map to transform the list.”

The problem? I don’t explain what the hell map is until three classes later.

This happens more than you think. You have the material in your head, you jump from one place to another, and without realizing it you assume the student knows stuff you haven’t told them yet. The result: frustration, confusion, and students who think they’re dumb when you’re the dumb one.

The manual solution would be to review each lab, note what concepts it uses, and verify they’ve been explained before. But I have 47 classes with several notebooks each. Yeah, that’s not happening.

The solution: semantic search with ChromaDB

The idea is simple:

  1. Extract concepts from each notebook (what’s taught, what’s used)
  2. Store them in a database that understands meaning, not just text
  3. For each concept used in a lab, verify it exists in previous notes

That “understanding meaning” part is key. If in the notes I say “higher-order function” and in the lab I use “función de orden superior”, a grep won’t find anything. But semantically they’re the same thing.

This is where ChromaDB comes in: a vector database that converts text into embeddings and allows searching by similarity. In plain English: you store text, and then you can ask “is there anything similar to this?” and it returns the most similar ones.

ChromaDB in 5 minutes

ChromaDB is like SQLite but for embeddings. A single file (or folder), no server, no configuration. You install, use, and go.

pip install chromadb
# Or if you use uv:
uv add chromadb

The basic concept

In a normal database you store rows with columns. In ChromaDB you store documents with embeddings:

import chromadb

# Create client (persistent on disk)
client = chromadb.PersistentClient(path="./my_db")

# Create a "collection" (like a table)
collection = client.get_or_create_collection(
    name="concepts",
    metadata={"hnsw:space": "cosine"}  # Cosine distance
)

# Store documents
# Store documents
collection.add(
    ids=["c1", "c2", "c3", "c4"],
    documents=["pure function", "for loop", "recursion", "higher-order function"],
    metadatas=[
        {"class": "class_010", "type": "notes"},
        {"class": "class_015", "type": "notes"},
        {"class": "class_020", "type": "notes"},
        {"class": "class_026", "type": "notes"},
    ]
)

That’s it. ChromaDB automatically:

  1. Generates embeddings for the documents (uses all-MiniLM-L6-v2 by default)
  2. Indexes them for fast search
  3. Persists them to disk

Search by similarity

results = collection.query(
    query_texts=["función de orden superior"],
    n_results=3
)

print(results["documents"])
# [['higher-order function', 'pure function', 'recursion']]

print(results["distances"])
# [[0.23, 0.45, 0.67]]  # Lower = more similar

See? I searched for “función de orden superior” and it found “higher-order function” even though the text is completely different. That’s the magic of embeddings.

The complete system: curriculum validation

Now let’s build the system that validates I don’t screw up. The actual code is in my project, but here’s the simplified version so you understand the concept.

Step 1: Extract concepts from notebooks

First we need to extract concepts from each notebook. I do this with an LLM (Gemini Flash via OpenRouter), but you could do it with regexes if you’re brave:

import json
from pathlib import Path

def extract_concepts_from_notebook(notebook_path: Path) -> list[dict]:
    """
    Extracts concepts from a Jupyter notebook.

    Returns:
        List of {"name": "concept", "category": "introduces|uses"}
    """
    # get_notebook_content, llm and EXTRACTION_PROMPT are helpers
    # defined elsewhere in the project
    content = get_notebook_content(notebook_path)

    # Call the LLM to extract concepts
    response = llm.chat(
        messages=[
            {"role": "system", "content": EXTRACTION_PROMPT},
            {"role": "user", "content": content}
        ]
    )

    return json.loads(response)

The LLM classifies each concept as:

  • introduces: Taught with explanation
  • uses: Used assuming prior knowledge
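The real extraction prompt lives in the project; here's a minimal sketch of what an EXTRACTION_PROMPT like that could look like (the wording is my illustration, not the actual prompt), matching the JSON contract the function above expects back:

```python
# Sketch of an extraction prompt; the real one is more detailed.
# The JSON shape matches what extract_concepts_from_notebook parses.
EXTRACTION_PROMPT = """\
You are analyzing a Jupyter notebook from a programming course.

List every programming concept that appears and classify each one:
- "introduces": the notebook explains the concept with examples
- "uses": the notebook assumes the student already knows it

Respond ONLY with a JSON array, for example:
[{"name": "recursion", "category": "introduces"},
 {"name": "for loop", "category": "uses"}]
"""
```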

Step 2: Store in ChromaDB

Now we store the concepts with their metadata:

from chromadb.utils import embedding_functions

# Use multilingual model (Spanish + English)
embedding_fn = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="paraphrase-multilingual-MiniLM-L12-v2"
)

collection = client.get_or_create_collection(
    name="course_concepts",
    embedding_function=embedding_fn,
    metadata={"hnsw:space": "cosine"}
)

# Store concepts
for class_id, concepts in course_concepts.items():
    for concept in concepts:
        collection.add(
            ids=[f"{class_id}:{concept['name']}"],
            documents=[concept["name"]],
            metadatas=[{
                "class_id": class_id,
                "category": concept["category"],
                "source_type": concept["source_type"],  # notes or labs
            }]
        )

Step 3: Validate progression

Here comes the interesting part. For each concept “used” in a lab, we verify something similar exists in previous notes:

def validate_curriculum(course_concepts: dict) -> list[str]:
    """
    Validates that labs don't use untaught concepts.

    Returns:
        List of errors found
    """
    errors = []
    known_concepts = set()

    # Process classes in order
    for class_id in sorted(course_concepts.keys()):
        class_data = course_concepts[class_id]

        # Add concepts introduced in notes to known concepts
        for c in class_data:
            if c["source_type"] == "notes" and c["category"] == "introduces":
                known_concepts.add(c["name"].lower())

        # Validate concepts used in labs
        for c in class_data:
            if c["source_type"] == "labs" and c["category"] == "uses":
                if not is_concept_known(c["name"], known_concepts):
                    errors.append(
                        f"{class_id}: '{c['name']}' used without teaching"
                    )

    return errors
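Before wiring in the semantic part, you can sanity-check the ordering logic with a toy course and plain exact matching (a temporary stand-in for is_concept_known; the toy data is made up for this example):

```python
# Exact-match stand-in for is_concept_known, just for this demo
def is_concept_known(name: str, known: set) -> bool:
    return name.lower() in known

def validate_curriculum(course_concepts: dict) -> list[str]:
    errors = []
    known_concepts = set()
    for class_id in sorted(course_concepts.keys()):
        class_data = course_concepts[class_id]
        # First register what the notes introduce...
        for c in class_data:
            if c["source_type"] == "notes" and c["category"] == "introduces":
                known_concepts.add(c["name"].lower())
        # ...then check what the labs use
        for c in class_data:
            if c["source_type"] == "labs" and c["category"] == "uses":
                if not is_concept_known(c["name"], known_concepts):
                    errors.append(f"{class_id}: '{c['name']}' used without teaching")
    return errors

toy_course = {
    "class_001": [
        {"name": "for loop", "category": "introduces", "source_type": "notes"},
        {"name": "map", "category": "uses", "source_type": "labs"},  # not taught yet!
    ],
    "class_002": [
        {"name": "map", "category": "introduces", "source_type": "notes"},
        {"name": "for loop", "category": "uses", "source_type": "labs"},  # fine
    ],
}

print(validate_curriculum(toy_course))
# ["class_001: 'map' used without teaching"]
```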

The is_concept_known function is where ChromaDB comes in. We don’t do exact matching, we do semantic search:

def is_concept_known(concept: str, known_concepts: set) -> bool:
    """Verifies if a concept is known (exact or semantic)."""

    # 1. Exact match
    if concept.lower() in known_concepts:
        return True

    # 2. Semantic search
    # (simplified: to be strict, you'd also restrict the search to
    # classes before the current one with a `where` filter)
    results = collection.query(
        query_texts=[concept],
        n_results=3,
        where={"category": "introduces"}  # Only search in "introduces"
    )

    # If there's something very similar (distance < 0.3), consider it known
    if results["distances"][0] and results["distances"][0][0] < 0.3:
        return True

    return False

Step 4: The report

Running validation on my course, I get a nice report:

# Curriculum Validation Report

Found 17 problems:

## class_006_abstraciones_abstraccion_funcion

- **refactoring**: Concept 'refactoring' used in lab but not taught
  - File: `labs/0.funciones_basicas.ipynb`

## class_020_secuencias_while

- **memoization**: Concept 'memoization' used in lab but not taught
  - File: `labs/2.generar_secuencias.ipynb`

## class_026_funciones_orden_superior

- **map**: Concept 'map' used in lab but not taught
  - File: `labs/0.ejercicios_aplicadores_listas.ipynb`

Now I know exactly what I need to fix.

Details that matter

Multilingual embedding model

If your content is in Spanish, use a multilingual model:

embedding_fn = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="paraphrase-multilingual-MiniLM-L12-v2"
)

The default model (all-MiniLM-L6-v2) is trained mainly on English and can give weird results with Spanish.

Cosine vs Euclidean distance

For text, use cosine distance:

collection = client.get_or_create_collection(
    name="concepts",
    metadata={"hnsw:space": "cosine"}  # ← This
)

Cosine distance measures the angle between vectors, ignoring magnitude. This is what you want for semantic similarity.
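You can see that magnitude invariance with a toy sketch in plain Python (real embeddings have hundreds of dimensions, not 3, but the math is the same):

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity; 0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (norm_a * norm_b)

v = [1.0, 2.0, 3.0]
v_scaled = [10.0, 20.0, 30.0]   # same direction, 10x the magnitude
other = [3.0, -1.0, 0.5]        # different direction

print(cosine_distance(v, v_scaled))  # ≈ 0: magnitude is ignored
print(cosine_distance(v, other))     # > 0: different direction
```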

Converting distance to similarity

ChromaDB returns distance (lower = more similar). If you want similarity (higher = more similar):

similarity = 1 - distance

With cosine distance, the range is [0, 2], so similarity ends up in [-1, 1]. In practice, for similar texts it’s usually in [0.5, 1].

Persistence

ChromaDB has two modes:

# In memory (lost when closing)
client = chromadb.Client()

# Persistent (saved to disk)
client = chromadb.PersistentClient(path="./data/chroma")

For a validation system you’ll run repeatedly, use persistent. That way you don’t recalculate embeddings every time.

Filters with where

You can filter results by metadata:

# Only search in notes
results = collection.query(
    query_texts=["function"],
    where={"source_type": "notes"}
)

# Only search in classes before 020
# (this assumes you also stored a numeric class_num in the metadata)
results = collection.query(
    query_texts=["function"],
    where={"class_num": {"$lt": 20}}
)

This is crucial for validation: we only want to search in concepts that have already been taught.
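For the validation case you can combine both conditions with ChromaDB's `$and` operator. This sketch just builds the filter dict; the numeric class_num field is my assumption, since the earlier examples only stored class_id strings:

```python
def build_taught_filter(current_class_num: int) -> dict:
    """where-filter: concepts introduced in classes BEFORE this one.

    Assumes each concept's metadata carries a numeric class_num
    (an addition; the examples above only stored class_id strings).
    """
    return {
        "$and": [
            {"category": "introduces"},
            {"class_num": {"$lt": current_class_num}},
        ]
    }

# Used like:
# results = collection.query(
#     query_texts=["map"],
#     n_results=3,
#     where=build_taught_filter(26),
# )
print(build_taught_filter(26))
```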

Alternatives to ChromaDB

ChromaDB isn’t the only option. Here are others:

| Tool | Pros | Cons |
| --- | --- | --- |
| ChromaDB | Simple, serverless, good docs | Limited to millions of vectors |
| Pinecone | Scalable, managed | Paid, vendor lock-in |
| Weaviate | Powerful, GraphQL API | More complex to set up |
| Qdrant | Fast, written in Rust | Less well known |
| pgvector | If you already use PostgreSQL | Requires PostgreSQL |

For a project like this (thousands of concepts, not millions), ChromaDB is perfect. If you need to scale to billions of vectors or high availability, look at the alternatives.

The pre-commit hook

To make this truly useful, I integrated it into the git workflow:

#!/bin/bash
# .git/hooks/pre-commit

echo "🔍 Validating curriculum progression..."

if python bin/concept_index.py validate; then
    echo "✓ Validation successful"
    exit 0
else
    echo "❌ There are concepts used without teaching"
    echo "   Run: make concept-validate-report"
    exit 1
fi

Now, every time I try to commit, the system verifies I’m not screwing up. If there are violations, the commit is blocked and it tells me what to fix.

Conclusion

ChromaDB is one of those tools that when you discover it you think “how have I lived without this?”. It’s SQLite for embeddings: simple, local, and it works.

The use case I’ve shown you (curriculum validation) is just one example. Vector databases are useful for:

  • Semantic search in documents
  • RAG (Retrieval-Augmented Generation) for LLMs
  • Semantic duplicate detection
  • Recommendations based on similarity
  • Content clustering

And the best part: the barrier to entry is minimal. You install, store documents, search. No configuring servers, schemas, or complicated indexes.

If you have a problem where you need to find “things similar to this,” try ChromaDB. The worst that can happen is it works too well and you wonder why you didn’t use it before.


TL;DR: ChromaDB is a local and simple vector database. We use it to verify that a programming course doesn’t use concepts before teaching them, using semantic search to detect similar concepts even when the text is different.