The problem: teaching what you haven’t taught yet
I have a programming course with 47 classes. Each class has notes (where I explain stuff) and labs (where students practice). And I have a problem: sometimes I use concepts in labs that I haven’t explained in the notes yet.
“Alright, in this exercise use map to transform the list.”
The problem? I don’t explain what the hell map is until three classes later.
This happens more often than you'd think. You have the material in your head, you jump around from one place to another, and without realizing it you assume the student knows things you haven't told them yet. The result: frustration, confusion, and students who think they're dumb when really you're the dumb one.
The manual solution would be to review each lab, note what concepts it uses, and verify they’ve been explained before. But I have 47 classes with several notebooks each. Yeah, that’s not happening.
The solution: semantic search with ChromaDB
The idea is simple:
- Extract concepts from each notebook (what’s taught, what’s used)
- Store them in a database that understands meaning, not just text
- For each concept used in a lab, verify it exists in previous notes
That “understanding meaning” part is key. If in the notes I say “higher-order function” and in the lab I use “función de orden superior”, a grep won’t find anything. But semantically they’re the same thing.
This is where ChromaDB comes in: a vector database that converts text into embeddings and allows searching by similarity. In plain English: you store text, and then you can ask “is there anything similar to this?” and it returns the most similar ones.
ChromaDB in 5 minutes
ChromaDB is like SQLite but for embeddings. A single file (or folder), no server, no configuration. You install, use, and go.
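The install really is a one-liner from PyPI:

```shell
pip install chromadb
```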
The basic concept
In a normal database you store rows with columns. In ChromaDB you store documents with embeddings:
That’s it. ChromaDB automatically:
- Generates embeddings for the documents (uses all-MiniLM-L6-v2 by default)
- Indexes them for fast search
- Persists them to disk
Search by similarity
See? I searched for “función de orden superior” and it found “higher-order function” even though the text is completely different. That’s the magic of embeddings.
The complete system: curriculum validation
Now let’s build the system that validates I don’t screw up. The actual code is in my project, but here’s the simplified version so you understand the concept.
Step 1: Extract concepts from notebooks
First we need to extract concepts from each notebook. I do this with an LLM (Gemini Flash via OpenRouter), but you could do it with regexes if you’re brave:
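A sketch of that extraction step, assuming OpenRouter's OpenAI-compatible endpoint and a JSON-returning prompt. The prompt, the exact model name, and the helper names are mine, not the actual project code:

```python
import json
import os

PROMPT = (
    "Extract the programming concepts from this notebook. "
    'Reply with JSON: {"introduces": [...], "uses": [...]}\n\n'
)

def parse_concepts(raw: str) -> dict:
    """Parse the LLM's JSON reply into the two concept lists."""
    data = json.loads(raw)
    return {"introduces": data.get("introduces", []),
            "uses": data.get("uses", [])}

def extract_concepts(notebook_text: str) -> dict:
    # Lazy import so parse_concepts works even without the openai package
    from openai import OpenAI
    client = OpenAI(
        base_url="https://openrouter.ai/api/v1",
        api_key=os.environ["OPENROUTER_API_KEY"],
    )
    resp = client.chat.completions.create(
        model="google/gemini-2.0-flash-001",  # any Gemini Flash variant works
        messages=[{"role": "user", "content": PROMPT + notebook_text}],
    )
    return parse_concepts(resp.choices[0].message.content)
```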
The LLM classifies each concept as:
- introduces: Taught with explanation
- uses: Used assuming prior knowledge
Step 2: Store in ChromaDB
Now we store the concepts with their metadata:
Step 3: Validate progression
Here comes the interesting part. For each concept “used” in a lab, we verify something similar exists in previous notes:
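The loop itself is plain Python; only the lookup is semantic. A sketch, with the data shape and names assumed:

```python
def validate_progression(labs, is_concept_known):
    """Collect every concept a lab uses that wasn't taught in an earlier class."""
    violations = []
    for lab in labs:
        for concept in lab["uses"]:
            if not is_concept_known(concept, before_class=lab["class_num"]):
                violations.append({
                    "lab": lab["name"],
                    "class_num": lab["class_num"],
                    "concept": concept,
                })
    return violations
```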
The is_concept_known function is where ChromaDB comes in. We don’t do exact matching, we do semantic search:
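A sketch of that lookup, assuming cosine distance and a threshold I picked arbitrarily: ask for the nearest concept that was introduced in an earlier class, and accept it if it's similar enough.

```python
SIMILARITY_THRESHOLD = 0.75  # illustrative; tune on your own data

def is_concept_known(collection, concept, before_class):
    """Semantic check: was something like `concept` introduced before this class?"""
    results = collection.query(
        query_texts=[concept],
        n_results=1,
        where={"$and": [
            {"kind": {"$eq": "introduces"}},
            {"class_num": {"$lt": before_class}},
        ]},
    )
    distances = results["distances"][0]
    if not distances:
        return False  # nothing taught before this class matches at all
    similarity = 1 - distances[0]  # cosine distance -> similarity
    return similarity >= SIMILARITY_THRESHOLD
```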
Step 4: The report
Running validation on my course, I get a report listing every lab that uses a concept before any note introduces it, with the offending class number.
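My actual report isn't reproduced here, but the formatting step is trivial; a sketch over the violations list from the validation step (the sample data below is made up, not my course's real output):

```python
def format_report(violations):
    """Render the violations list as a plain-text report."""
    lines = [f"{len(violations)} violation(s) found"]
    for v in violations:
        lines.append(
            f"  - {v['lab']} (class {v['class_num']}) uses "
            f"'{v['concept']}' before any note introduces it"
        )
    return "\n".join(lines)

# Made-up example data
report = format_report([
    {"lab": "lab03", "class_num": 3, "concept": "map"},
])
print(report)
```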
Now I know exactly what I need to fix.
Details that matter
Multilingual embedding model
If your content is in Spanish, use a multilingual model:
The default model (all-MiniLM-L6-v2) is trained mainly on English and can give weird results with Spanish.
Cosine vs Euclidean distance
For text, use cosine distance:
Cosine distance measures the angle between vectors, ignoring magnitude. This is what you want for semantic similarity.
Converting distance to similarity
ChromaDB returns distance (lower = more similar). If you want similarity (higher = more similar):
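The conversion is one line:

```python
def to_similarity(cosine_distance: float) -> float:
    """ChromaDB returns cosine distance in [0, 2]; map it to similarity in [-1, 1]."""
    return 1.0 - cosine_distance
```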
With cosine distance, the range is [0, 2], so similarity ends up in [-1, 1]. In practice, for similar texts it’s usually in [0.5, 1].
Persistence
ChromaDB has two modes:
For a validation system you’ll run repeatedly, use persistent. That way you don’t recalculate embeddings every time.
Filters with where
You can filter results by metadata:
This is crucial for validation: we only want to search in concepts that have already been taught.
Alternatives to ChromaDB
ChromaDB isn’t the only option. Here are others:
| Tool | Pros | Cons |
|---|---|---|
| ChromaDB | Simple, serverless, good docs | Limited to millions of vectors |
| Pinecone | Scalable, managed | Paid, vendor lock-in |
| Weaviate | Powerful, GraphQL API | More complex to set up |
| Qdrant | Fast, written in Rust | Less well known |
| pgvector | If you already use PostgreSQL | Requires PostgreSQL |
For a project like this (thousands of concepts, not millions), ChromaDB is perfect. If you need to scale to billions of vectors or high availability, look at the alternatives.
The pre-commit hook
To make this truly useful, I integrated it into the git workflow:
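Mine lives in the project, but a minimal hook has this shape (the validator script name is illustrative):

```shell
#!/bin/sh
# .git/hooks/pre-commit -- make it executable with: chmod +x .git/hooks/pre-commit
# "validate_curriculum.py" is a hypothetical name for the validation script.
if ! python validate_curriculum.py; then
    echo "Curriculum violations found; commit blocked."
    exit 1
fi
```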
Now, every time I try to commit, the system verifies I’m not screwing up. If there are violations, the commit is blocked and it tells me what to fix.
Conclusion
ChromaDB is one of those tools that when you discover it you think “how have I lived without this?”. It’s SQLite for embeddings: simple, local, and it works.
The use case I’ve shown you (curriculum validation) is just one example. Vector databases are useful for:
- Semantic search in documents
- RAG (Retrieval-Augmented Generation) for LLMs
- Semantic duplicate detection
- Recommendations based on similarity
- Content clustering
And the best part: the barrier to entry is minimal. You install, store documents, search. No configuring servers, schemas, or complicated indexes.
If you have a problem where you need to find “things similar to this,” try ChromaDB. The worst that can happen is it works too well and you wonder why you didn’t use it before.
TL;DR: ChromaDB is a local and simple vector database. We use it to verify that a programming course doesn’t use concepts before teaching them, using semantic search to detect similar concepts even when the text is different.