
DDD Enforcer RAG

Status: completed · Tags: Gemini API, ChromaDB, Gradio, LLM, RAG, Python · GitHub ↗

Project Overview

A RAG-enabled chatbot built on Gemini 2.5 Flash-Lite and ChromaDB. This is the standalone RAG pipeline that DDD Enforcer (the VSCode extension) was eventually built on top of — a separate project where I first worked out the chunking, embedding, and retrieval mechanics before integrating them into the linter.

Stack

  • LLM: Gemini 2.5 Flash-Lite (fast response, high free-tier quota)
  • Vector DB: ChromaDB (built-in embedding helpers, easy local development)
  • Embeddings: all-MiniLM-L6-v2 (lightweight sentence-transformer for local embedding)
  • UI: Gradio (quick interactive web interface for demos)
  • Language: Python

Architecture

1. The Ingestion Pipeline

Before answering anything, the system processes raw documents:

  • Chunking. Documents (.pdf, .md) are split into segments of at most ~256 tokens. The chunk size matters: each chunk has to fit inside the embedding model's 256-token input window (all-MiniLM-L6-v2 silently truncates anything longer) while still carrying enough surrounding context to be meaningful on its own.
  • Embedding generation. Each chunk is passed through all-MiniLM-L6-v2, producing a dense vector representing the chunk's semantic meaning. Both steps are sketched below.
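
A minimal sketch of these two steps, assuming the sentence-transformers package. The word-based splitter, overlap value, and input path are illustrative stand-ins, not the project's exact code:

```python
from sentence_transformers import SentenceTransformer

# all-MiniLM-L6-v2 truncates input beyond 256 tokens, so chunks must stay under that.
model = SentenceTransformer("all-MiniLM-L6-v2")

def chunk_text(text: str, max_words: int = 200, overlap: int = 30) -> list[str]:
    # Naive word-based splitter with overlap; word count only approximates the
    # model's wordpiece token count, so max_words sits safely below 256.
    words = text.split()
    step = max_words - overlap
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), step)]

chunks = chunk_text(open("docs/guide.md").read())  # hypothetical input file
embeddings = model.encode(chunks)                  # one dense vector per chunk
```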

2. Vector Storage & Retrieval

ChromaDB stores the chunk + vector pairs as the knowledge base. At query time:

  • The user's question is embedded with the same model that embedded the source documents.
  • ChromaDB computes the distance between the question vector and stored vectors.
  • The top K chunks (default 3) with the smallest distance are retrieved (see the sketch after this list).
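
Roughly what both sides look like with ChromaDB's Python client. The collection name, persistence path, and sample data are assumptions; `add` and `query` are the client's actual calls:

```python
import chromadb
from chromadb.utils import embedding_functions

client = chromadb.PersistentClient(path="./chroma_db")

# The same sentence-transformer handles ingestion-time and query-time embedding.
embed_fn = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)
collection = client.get_or_create_collection("ddd_docs", embedding_function=embed_fn)

# Ingestion: store each chunk; Chroma embeds it via embed_fn and keeps the pair.
chunks = ["Entities have identity...", "Value objects are immutable..."]  # from the step above
collection.add(documents=chunks, ids=[f"chunk-{i}" for i in range(len(chunks))])

# Query time: embed the question with the same model, take the top K by distance.
results = collection.query(query_texts=["What is a value object?"], n_results=3)
context_chunks = results["documents"][0]
```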

3. Prompt Augmentation & Generation

The retrieved chunks are injected into a strict prompt template that essentially says "Using ONLY the following context, answer the user's question." The augmented prompt goes to Gemini, which generates a grounded answer.
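
A sketch of this last step, assuming the google-genai SDK. The template wording paraphrases the description above, and the sample context is invented:

```python
from google import genai

client = genai.Client()  # reads GEMINI_API_KEY from the environment

context_chunks = ["Entities have identity...", "Value objects are immutable..."]  # top-K from ChromaDB
question = "What is a value object?"

# Strict template: the model may only use the retrieved context.
prompt = (
    "Using ONLY the following context, answer the user's question.\n\n"
    "Context:\n" + "\n---\n".join(context_chunks) + "\n\n"
    "Question: " + question
)

response = client.models.generate_content(
    model="gemini-2.5-flash-lite",
    contents=prompt,
)
print(response.text)
```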

What I Learned

  • The hardest part of RAG isn't the retrieval — it's the chunking. Bad chunks (mid-sentence, oversized, no overlap) destroy retrieval quality regardless of how good your embedding model is.
  • Using the same model for both ingestion-time embedding and query-time embedding is non-negotiable. Different models produce vectors in different latent spaces — you can't compare them.
  • Gradio is the right tool for "I need a UI to debug this in an hour." Not for production, not for portfolio, but for proving the pipeline works end-to-end before integrating into a real frontend (a minimal sketch follows this list).
  • This project was the basis of the website chatbot's RAG capability (the red message circle at the bottom right). See the blog post on building a RAG chatbot for the full write-up.
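
To make the Gradio point concrete, the entire debug UI can be this small. The answer function below is a stand-in for the real pipeline, not the project's code:

```python
import gradio as gr

def answer(message: str, history: list) -> str:
    # Stand-in for the full pipeline: embed the question, retrieve top-K chunks,
    # build the augmented prompt, and call Gemini. Here it just echoes.
    return f"(pipeline output for: {message})"

# ChatInterface wraps any (message, history) -> str function in a chat UI.
gr.ChatInterface(answer, title="DDD Enforcer RAG").launch()
```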