AI Chat Assistant
Project Overview
My website chatbot — the red message circle at the bottom right of this site. Ask it anything about my projects, blog posts, or background, and it will answer using a RAG pipeline grounded in this site's content.
Stack
- LLM: Google Gemini 2.5 Flash-Lite (streaming, low latency, generous free-tier quota)
- Vector store: Vectra — a small file-based index bundled at build time, ~5 MB total
- Embeddings: `all-MiniLM-L6-v2` via Hugging Face Transformers, run at build time only
- Frontend: a streaming chat component in this site's React tree, talking to a `/api/chat` route handler
How It Works
At deploy time, a `pnpm build-rag` script reads every file under `content/` (`personal.json`, `projects.json` plus per-project MDX, blog MDX) and chunks them by paragraph. Each chunk gets embedded with the local MiniLM model; the chunk and its vector go into `vectra-index/`. The whole index ships as a static artifact in the function bundle.
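The chunking pass can be sketched as below: paragraph-aligned splits with a size cap and a small overlap carried between chunks. The 4-chars-per-token heuristic and the function name are illustrative, not the actual build script.

```typescript
// Illustrative sketch of the build-time chunker: split text into
// paragraph-aligned chunks, capping size and carrying a small overlap.
// A rough 4-chars-per-token heuristic stands in for a real tokenizer.
const MAX_TOKENS = 512;
const OVERLAP_TOKENS = 50;
const CHARS_PER_TOKEN = 4;

function chunkByParagraph(text: string): string[] {
  const paragraphs = text.split(/\n\s*\n/).map(p => p.trim()).filter(Boolean);
  const chunks: string[] = [];
  let current = "";
  for (const para of paragraphs) {
    const candidate = current ? `${current}\n\n${para}` : para;
    if (candidate.length / CHARS_PER_TOKEN > MAX_TOKENS && current) {
      chunks.push(current);
      // Carry the tail of the previous chunk forward as overlap,
      // so context straddling a chunk boundary stays retrievable.
      const overlapChars = OVERLAP_TOKENS * CHARS_PER_TOKEN;
      current = current.slice(-overlapChars) + "\n\n" + para;
    } else {
      current = candidate;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

A paragraph longer than the cap is kept whole rather than cut mid-sentence, which is the point of paragraph alignment.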
At runtime, the /api/chat handler:
- Embeds the user's question with the same MiniLM model.
- Queries the Vectra index for the top-K (default 5) most similar chunks.
- Builds a Gemini prompt: "You are Ali's portfolio assistant. Using only the context below, answer the user's question. Context: <chunks>. Question: <q>".
- Streams the Gemini response back over Server-Sent Events.
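Under the hood, the top-K query is nearest-neighbour search by cosine similarity. A self-contained sketch of that idea follows; Vectra's actual API differs, and the types here are illustrative:

```typescript
// Sketch of the retrieval step: score every stored chunk vector against
// the query embedding by cosine similarity and keep the K best matches.
interface IndexedChunk {
  text: string;
  vector: number[];
}

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function topK(query: number[], index: IndexedChunk[], k = 5): IndexedChunk[] {
  // Sort a copy by descending similarity, then truncate to K.
  return [...index]
    .sort((x, y) => cosine(query, y.vector) - cosine(query, x.vector))
    .slice(0, k);
}
```

A linear scan like this is plenty for a few hundred chunks; approximate-nearest-neighbour structures only pay off at much larger scale.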
The prompt is strict: it tells the model to refuse if the answer isn't in the context, which keeps hallucination low. The Vectra index is bundled (not hosted) so cold-starts are fast and there's no external dependency to break.
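The prompt assembly can be sketched as a small pure function. The refusal sentence and everything beyond the quoted template are assumptions for illustration:

```typescript
// Sketch of the prompt assembly following the template quoted above,
// with the strict refusal instruction baked in.
function buildPrompt(chunks: string[], question: string): string {
  const context = chunks.join("\n---\n");
  return [
    "You are Ali's portfolio assistant.",
    "Using only the context below, answer the user's question.",
    "If the answer is not in the context, say you don't know.", // assumed wording
    `Context: ${context}`,
    `Question: ${question}`,
  ].join("\n");
}
```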
What I Learned
- RAG quality is a chunking problem, not a model problem. Bad chunks (mid-sentence cuts, oversized blocks, no overlap) tank retrieval. The fix is patient chunking — paragraph-aligned, ~512 token cap, ~50 token overlap.
- Streaming responses are a huge UX win for chat. The first token arrives in ~300ms vs ~3s for the full response — feels instant.
- Bundling the vector index avoids an entire category of operational pain (hosted vector DB downtime, network latency, monthly costs). The tradeoff: redeploys are required whenever content changes. For a portfolio, redeploys are cheap.
- Gemini's free-tier quota is generous enough that a personal portfolio chatbot is essentially free to run.
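On the streaming point above: each model token reaches the browser framed as a `data:` line terminated by a blank line, per the SSE wire format. A minimal sketch of that framing; the `[DONE]` sentinel is a common convention, not something this project necessarily uses:

```typescript
// Frame one streamed model chunk as a Server-Sent Event.
// Multi-line payloads need one `data:` line per line of text.
function toSSE(chunk: string): string {
  return chunk.split("\n").map(line => `data: ${line}`).join("\n") + "\n\n";
}

// Sentinel event signalling the end of the stream to the client.
function endOfStream(): string {
  return "data: [DONE]\n\n";
}
```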