AI Chat Assistant

[completed] Gemini API, LLM, RAG

Project Overview

My website chatbot — the red message circle at the bottom right of this site. Ask it anything about my projects, blog posts, or background, and it will answer using a RAG pipeline grounded in this site's content.

Stack

  • LLM: Google Gemini 2.5 Flash-Lite (streaming, low latency, generous free-tier quota)
  • Vector store: Vectra — a small file-based index bundled at build time, ~5 MB total
  • Embeddings: all-MiniLM-L6-v2 via Hugging Face transformers, run at build time only
  • Frontend: A streaming chat component in this site's React tree, talking to a /api/chat route handler

How It Works

At deploy time, a pnpm build-rag script reads every file under content/ (personal.json, projects.json + per-project MDX, blog MDX) and chunks them by paragraph. Each chunk gets embedded with the local MiniLM model; the chunk + its vector go into vectra-index/. The whole index ships as a static artifact in the function bundle.
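The paragraph-chunking step can be sketched roughly like this. Names such as chunkByParagraph are illustrative, and the word-based token estimate stands in for the real tokenizer counts the build script would use:

```typescript
// Sketch of build-time chunking: paragraph-aligned chunks with a token cap
// and a small overlap carried over between consecutive chunks.

interface Chunk {
  text: string;
  source: string; // e.g. "content/blog/post.mdx"
}

const MAX_TOKENS = 512;    // cap per chunk
const OVERLAP_TOKENS = 50; // tail of the previous chunk repeated in the next

// Rough approximation: one whitespace-separated word ~ one token.
function approxTokens(text: string): number {
  return text.split(/\s+/).filter(Boolean).length;
}

function chunkByParagraph(doc: string, source: string): Chunk[] {
  const paragraphs = doc.split(/\n{2,}/).map(p => p.trim()).filter(Boolean);
  const chunks: Chunk[] = [];
  let current: string[] = [];
  let tokens = 0;

  for (const para of paragraphs) {
    const pTokens = approxTokens(para);
    if (tokens + pTokens > MAX_TOKENS && current.length > 0) {
      chunks.push({ text: current.join("\n\n"), source });
      // Carry the last OVERLAP_TOKENS words forward so no chunk boundary
      // cuts a thought off cold.
      const tail = current.join(" ").split(/\s+/).slice(-OVERLAP_TOKENS).join(" ");
      current = [tail];
      tokens = approxTokens(tail);
    }
    current.push(para);
    tokens += pTokens;
  }
  if (current.length > 0) chunks.push({ text: current.join("\n\n"), source });
  return chunks;
}
```

Each resulting chunk would then be embedded with MiniLM and written into the Vectra index alongside its source path.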

At runtime, the /api/chat handler:

  1. Embeds the user's question with the same MiniLM model.
  2. Queries the Vectra index for the top-K (default 5) most similar chunks.
  3. Builds a Gemini prompt: "You are Ali's portfolio assistant. Using only the context below, answer the user's question. Context: <chunks>. Question: <q>".
  4. Streams the Gemini response back over Server-Sent Events.

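The retrieval step (2) is, under the hood, a nearest-neighbour search by cosine similarity. Vectra handles this internally; a minimal in-memory sketch of what it does, with illustrative names rather than Vectra's actual API, looks like:

```typescript
// Minimal sketch of top-K retrieval by cosine similarity.
// IndexedChunk mirrors what the build step stored: text plus its embedding.

interface IndexedChunk {
  text: string;
  vector: number[];
}

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Return the k chunks whose embeddings are most similar to the query vector.
function topK(query: number[], index: IndexedChunk[], k = 5): IndexedChunk[] {
  return [...index]
    .sort((x, y) => cosine(query, y.vector) - cosine(query, x.vector))
    .slice(0, k);
}
```

The texts of the returned chunks are what gets pasted into the Gemini prompt as context.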
The prompt is strict: it tells the model to refuse to answer if the answer isn't in the context, which keeps hallucination low. The Vectra index is bundled (not hosted), so cold starts are fast and there's no external dependency to break.
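On the wire, the streaming in step 4 comes down to formatting each Gemini token as a text/event-stream frame. A small sketch, with hypothetical helper names (sseFrame and parseFrames are not from any SDK):

```typescript
// Server side: wrap one model token in a Server-Sent Events frame.
// Each SSE message is a "data:" line terminated by a blank line.
function sseFrame(token: string): string {
  return `data: ${JSON.stringify({ token })}\n\n`;
}

// Client side: split a received buffer back into tokens, ignoring the
// conventional "[DONE]" sentinel that marks the end of the stream.
function parseFrames(buffer: string): string[] {
  return buffer
    .split("\n\n")
    .filter(f => f.startsWith("data: ") && !f.includes("[DONE]"))
    .map(f => JSON.parse(f.slice("data: ".length)).token);
}
```

In the actual route handler, each frame would be encoded and enqueued onto the response stream with Content-Type: text/event-stream as Gemini tokens arrive.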

What I Learned

  • RAG quality is a chunking problem, not a model problem. Bad chunks (mid-sentence cuts, oversized blocks, no overlap) tank retrieval. The fix is patient chunking — paragraph-aligned, ~512 token cap, ~50 token overlap.
  • Streaming responses are a huge UX win for chat. The first token arrives in ~300ms vs ~3s for the full response — feels instant.
  • Bundling the vector index avoids an entire category of operational pain (hosted vector DB downtime, network latency, monthly costs). The tradeoff: redeploys are required whenever content changes. For a portfolio, redeploys are cheap.
  • Gemini's free-tier quota is generous enough that a personal portfolio chatbot is essentially free to run.