# DDD Enforcer RAG

## Project Overview
A RAG-enabled chatbot built on Gemini 2.5 Flash-Lite and ChromaDB. This is the standalone RAG pipeline that the DDD Enforcer VSCode extension was later built on top of: a separate project where I first worked out the chunking, embedding, and retrieval mechanics before integrating them into the linter.
## Stack
- LLM: Gemini 2.5 Flash-Lite (fast response, high free-tier quota)
- Vector DB: ChromaDB (built-in embedding helpers, easy local development)
- Embeddings: all-MiniLM-L6-v2 (lightweight sentence-transformer for local embedding)
- UI: Gradio (quick interactive web interface for demos)
- Language: Python
## Architecture

### 1. The Ingestion Pipeline
Before answering anything, the system processes raw documents:
- Chunking. Documents (.pdf, .md) are split into ~512-token segments. Chunk size matters: it has to balance preserving context against the embedding model's input window (256 tokens here; all-MiniLM-L6-v2 silently truncates longer inputs).
- Embedding generation. Each chunk is passed through all-MiniLM-L6-v2, producing a dense vector representing the chunk's semantic meaning.
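The ingestion steps can be sketched roughly as follows. This is an illustrative word-based approximation, not the project's exact code: `chunk_text` is a hypothetical helper, the window is counted in words rather than tokens, and the sentence-transformers embedding call is shown commented out for context.

```python
# Illustrative sliding-window chunker (word-based; the real pipeline
# counts tokens, but the overlap mechanics are the same).
def chunk_text(text: str, chunk_size: int = 128, overlap: int = 32) -> list[str]:
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each iteration
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        if window:
            chunks.append(" ".join(window))
        if start + chunk_size >= len(words):
            break  # last window reached the end of the document
    return chunks

# Embedding each chunk would then look like (requires sentence-transformers):
# from sentence_transformers import SentenceTransformer
# model = SentenceTransformer("all-MiniLM-L6-v2")
# vectors = model.encode(chunks)  # one 384-dim vector per chunk
```

The overlap is what prevents a sentence straddling a chunk boundary from being lost to both chunks.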
### 2. Vector Storage & Retrieval
ChromaDB stores the chunk + vector pairs as the knowledge base. At query time:
- The user's question is embedded with the same model that embedded the source documents.
- ChromaDB computes the distance between the question vector and the stored vectors.
- The top K chunks (default 3) with the smallest distances are retrieved.
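The query-time steps above can be hand-rolled on toy vectors to show the mechanics. In the real project this is a single ChromaDB `collection.query()` call; the `retrieve` helper and the choice of cosine distance here are illustrative (ChromaDB's distance metric is configurable per collection).

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    # 0.0 = identical direction, 2.0 = opposite; what "shortest distance" means here
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

def retrieve(query_vec: list[float], store: list[tuple[str, list[float]]], k: int = 3) -> list[str]:
    # store holds (chunk_text, embedding) pairs; return the k nearest chunks
    ranked = sorted(store, key=lambda pair: cosine_distance(query_vec, pair[1]))
    return [chunk for chunk, _ in ranked[:k]]
```

A vector DB exists precisely so this sort doesn't happen over every stored vector on each query, but the input/output contract is the same.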
### 3. Prompt Augmentation & Generation
The retrieved chunks are injected into a strict prompt template that essentially says "Using ONLY the following context, answer the user's question." The augmented prompt goes to Gemini, which generates a grounded answer.
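A minimal sketch of the augmentation step. `PROMPT_TEMPLATE` and `build_prompt` are illustrative, not the project's exact wording; the Gemini call is shown commented out because it needs an API key, and the SDK shape (`google-genai`) is an assumption about the client used.

```python
# Illustrative grounding template; the real template's wording may differ.
PROMPT_TEMPLATE = """Using ONLY the following context, answer the user's question.
If the answer is not in the context, say you don't know.

Context:
{context}

Question: {question}
Answer:"""

def build_prompt(question: str, chunks: list[str]) -> str:
    # Join the top-K retrieved chunks with a visible separator so the
    # model can tell where one source passage ends and the next begins.
    context = "\n\n---\n\n".join(chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question)

# Sending it to Gemini (assumed google-genai SDK usage; needs GEMINI_API_KEY):
# from google import genai
# client = genai.Client()
# resp = client.models.generate_content(
#     model="gemini-2.5-flash-lite",
#     contents=build_prompt(question, top_chunks),
# )
# print(resp.text)
```

The "ONLY the following context" phrasing plus an explicit "say you don't know" escape hatch is what keeps the answer grounded instead of hallucinated.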
## What I Learned
- The hardest part of RAG isn't the retrieval; it's the chunking. Bad chunks (split mid-sentence, oversized, no overlap) destroy retrieval quality regardless of how good your embedding model is.
- Using the same model for both ingestion-time embedding and query-time embedding is non-negotiable. Different models produce vectors in different latent spaces, so distances between them are meaningless.
- Gradio is the right tool for "I need a UI to debug this in an hour." Not for production, not for portfolio, but for proving the pipeline works end-to-end before integrating into a real frontend.
- This project was the basis of the website chatbot's RAG capability (the red message circle at the bottom right). See the blog post on building a RAG chatbot for the full write-up.