A fully dynamic learning experience where RAG comes alive. Adjust chunk sizes, visualize embeddings, simulate retrieval, and watch the entire pipeline react in real time. From beginner clarity to production-grade mastery, all explained in the simplest possible way. From raw documents to a production-grade AI system that answers with precision: every step explained, every concept interactive, every algorithm visualized live.
LLMs hallucinate and know nothing about your private data. RAG solves both problems permanently.
LLMs train on public internet data. They have zero knowledge of your company documents, internal policies, Vikash Innovative Tech product specs, customer data, or anything created after their training cutoff. Ask them about your business and they'll confidently fabricate answers.
LLMs generate statistically plausible text, not factually verified text. They will invent citations, make up statistics, and state wrong dates with total confidence. In production systems this is catastrophic, and it can expose you to real legal liability.
Think of it as an open-book exam. Before answering any question, the system retrieves the most relevant pages from your document library. The LLM sees those pages and answers only from that retrieved context — grounded, sourced, verifiable.
Click any stage to jump straight to its deep-dive section with code and interactive demo.
Every step explained with production-grade Python code, gotchas, and live demos.
Before anything can happen, we need raw text. PDFs, Word docs, plain text, HTML — all become a standardized Python dict: { "content": text, "metadata": {...} }. That's the atomic unit of RAG.
What is a document in RAG? A Python dict with two keys: content (the raw text) and metadata (source filename, page number, section, etc.). Everything downstream depends on this structure being consistent.
Metadata is gold. Always attach source filename and page number. When the LLM answers a Vikash Innovative Tech policy question, the user can see exactly which document page it came from. That's what makes RAG trustworthy.
Supported loaders: pdfplumber / pypdf (PDF), python-docx (Word), beautifulsoup4 (HTML), csv reader, direct open() for .txt and .md. Use the right loader for each file type.
# ── Step 01: Document Loading ──────────────────────
# Vikash Innovative Tech — RAG Pipeline
import pdfplumber
import os

def load_pdf(file_path: str) -> list[dict]:
    """Load PDF and extract text page-by-page."""
    documents = []
    with pdfplumber.open(file_path) as pdf:
        for i, page in enumerate(pdf.pages):
            text = page.extract_text()
            if text and len(text.strip()) > 20:
                documents.append({
                    "content": text,
                    "metadata": {
                        "source": file_path,
                        "page": i + 1,
                        "type": "pdf",
                        "company": "Vikash Innovative Tech"
                    }
                })
    return documents

def load_text(path: str) -> list[dict]:
    """Load plain text file as a single document."""
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()
    return [{
        "content": text,
        "metadata": {"source": path, "page": 1, "type": "text"}
    }]

def load_directory(dir_path: str) -> list[dict]:
    """Load all supported files in a directory."""
    loaders = {".pdf": load_pdf, ".txt": load_text}
    all_docs = []
    for fn in os.listdir(dir_path):
        ext = os.path.splitext(fn)[1]
        if ext in loaders:
            docs = loaders[ext](os.path.join(dir_path, fn))
            all_docs.extend(docs)
    return all_docs

# ── Usage ──────────────────────────────────────────
docs = load_pdf("vik_policy.pdf")
print(f"Loaded {len(docs)} pages")  # → Loaded 4 pages
# docs[0] = {
#   "content": "Vikash Innovative Tech\nWarranty Policy...",
#   "metadata": {"source": "vik_policy.pdf", "page": 1, ...}
# }
Raw PDF text is messy — broken sentences, double spaces, weird Unicode, stray headers. Embeddings work better on clean text. This step is unglamorous but critical.
What to fix: Multiple whitespace → single space. Mid-sentence line breaks → spaces. Non-printable chars → strip. Page headers/footers → remove. The goal: clean, readable prose.
Don't over-clean! Removing too aggressively destroys semantic content. Keep punctuation, numbers, proper nouns, and acronyms intact. Vikash Innovative Tech product names must survive cleaning.
# ── Step 02: Text Preprocessing ──────────────────
import re
import unicodedata

def clean_text(text: str) -> str:
    """Normalize text for embedding quality."""
    # Normalize unicode compatibility forms (e.g. the ligature ﬁ → fi)
    text = unicodedata.normalize("NFKC", text)
    # Join mid-sentence line breaks back into spaces
    text = re.sub(r'(?<![.!?])\n(?=[a-z])', ' ', text)
    # Collapse runs of spaces / tabs
    text = re.sub(r'[ \t]+', ' ', text)
    # Normalize 3+ newlines → double newline
    text = re.sub(r'\n{3,}', '\n\n', text)
    # Strip non-printable characters
    text = re.sub(r'[^\x20-\x7E\n]', '', text)
    # Remove standalone page numbers
    text = re.sub(r'^\d+\s*$', '', text, flags=re.MULTILINE)
    return text.strip()

def preprocess_documents(docs: list) -> list:
    """Clean all documents in the pipeline."""
    cleaned = []
    for doc in docs:
        clean = clean_text(doc["content"])
        if len(clean) > 30:  # skip near-empty pages
            doc["content"] = clean
            cleaned.append(doc)
    return cleaned

# ── Usage ──────────────────────────────────────────
cleaned_docs = preprocess_documents(docs)
print(f"Cleaned: {len(cleaned_docs)} docs remain")
print(cleaned_docs[0]["content"][:200])
# → Vikash Innovative Tech Warranty Policy
#   All laptops come with a 1-year limited warranty...
Documents are too long to embed directly. We split into "chunks" — this is the single most important decision in RAG. Wrong chunk size = bad retrieval = wrong answers.
The Core Trade-off: Chunks too large → retrieval brings back irrelevant noise. Chunks too small → you lose semantic context and answers become shallow. Sweet spot: 200–400 chars for most docs.
Overlap matters. A 50–80 char overlap between adjacent chunks ensures sentences at boundaries are never cut off. Always use overlap in production.
3 Strategies: Fixed Size (equal chunks), Sliding Window (overlapping windows), Recursive Splitter (paragraph→sentence→word priority). Full interactive lab below ↓
# ── Step 03: Chunking (all 3 strategies) ─────────

# ── Strategy 1: Fixed Size ────────────────────────
def chunk_fixed(text, size=300):
    return [text[i:i+size].strip()
            for i in range(0, len(text), size)
            if text[i:i+size].strip()]

# ── Strategy 2: Sliding Window ────────────────────
def chunk_sliding(text, size=300, overlap=60):
    chunks, i = [], 0
    while i < len(text):
        end = min(i + size, len(text))
        t = text[i:end].strip()
        if t:
            chunks.append(t)
        if end == len(text):
            break
        i += size - overlap
    return chunks

# ── Strategy 3: Recursive Splitter ───────────────
def chunk_recursive(text, size=300,
                    seps=['\n\n', '\n', '. ', '! ', ' ']):
    if len(text) <= size:
        return [text.strip()]
    # Split on the highest-priority separator present
    sep = next((s for s in seps if s in text), ' ')
    parts = text.split(sep)
    chunks, cur = [], ""
    for p in parts:
        candidate = cur + (sep if cur else "") + p
        if len(candidate) <= size:
            cur = candidate
        else:
            if cur.strip():
                chunks.append(cur.strip())
            cur = p
    if cur.strip():
        chunks.append(cur.strip())
    return [c for c in chunks if len(c) >= 15]

# ── Wrap chunks with metadata ─────────────────────
def chunk_documents(docs, strategy="recursive", **kw):
    fn_map = {
        "fixed": chunk_fixed,
        "sliding": chunk_sliding,
        "recursive": chunk_recursive
    }
    fn = fn_map[strategy]
    all_chunks = []
    for doc in docs:
        for j, txt in enumerate(fn(doc["content"], **kw)):
            all_chunks.append({
                "content": txt,
                "metadata": {**doc["metadata"], "chunk": j}
            })
    return all_chunks

chunks = chunk_documents(cleaned_docs, strategy="recursive", size=300)
print(f"Created {len(chunks)} chunks")  # → Created 22 chunks
Embeddings convert text to a list of numbers — a vector in high-dimensional space. The magic: semantically similar texts get numerically similar vectors. This enables semantic search, not keyword matching.
What is an embedding? A 768- or 1536-dimensional vector. "Warranty" and "guarantee" sit close together. "Warranty" and "pizza" sit far apart. We retrieve by spatial proximity — finding nearest neighbors.
Critical rule: Always use the SAME embedding model for both chunks and queries. Mixing models makes similarity scores meaningless — like measuring distance in miles vs kilometers.
Model choices: OpenAI text-embedding-3-small (fast, cheap, great), all-MiniLM-L6-v2 (free, local, good), BAAI/bge-large (best open-source), Cohere embed-v3 (multilingual).
# ── Step 04: Embeddings ──────────────────────────
from openai import OpenAI
from sentence_transformers import SentenceTransformer
import numpy as np

# ── Option A: OpenAI (paid, production) ───────────
oai = OpenAI()

def embed_openai(texts: list[str]) -> list:
    resp = oai.embeddings.create(
        input=texts,
        model="text-embedding-3-small"  # 1536 dims
    )
    return [d.embedding for d in resp.data]

# ── Option B: Sentence Transformers (free) ─────────
model = SentenceTransformer("all-MiniLM-L6-v2")

def embed_local(texts: list[str]) -> np.ndarray:
    return model.encode(texts, batch_size=32, show_progress_bar=True)

# ── Embed all chunks in batches ───────────────────
def embed_chunks(chunks, batch_size=100):
    texts = [c["content"] for c in chunks]
    all_emb = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i+batch_size]
        all_emb.extend(embed_openai(batch))
    return all_emb

embeddings = embed_chunks(chunks)
print(f"Shape: {np.array(embeddings).shape}")
# → Shape: (22, 1536)

# ── Manual cosine similarity ──────────────────────
def cosine_sim(a, b):
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
A vector database stores embeddings and lets you query them by similarity — not SQL WHERE clauses, but cosine distance search. Give it a query vector, get back the N closest chunk vectors.
ChromaDB — perfect for prototyping. In-memory or disk-persistent, zero config, Python-native. For production: Pinecone (managed cloud), Weaviate, Qdrant (open-source), pgvector (PostgreSQL).
Vector DB vs SQL: SQL searches by exact value match. Vector DB searches by angle between vectors — "find the 3 chunks that mean the most similar thing to this query." Fundamentally different operation.
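To make that difference concrete, here is a minimal sketch of what a vector store computes under the hood: a brute-force nearest-neighbor scan over toy 3-dimensional vectors (real embeddings have hundreds of dimensions, and real stores replace the scan with an ANN index).

```python
import numpy as np

def brute_force_search(query_vec, chunk_vecs, k=3):
    """Return (index, score) of the k chunks closest to the query (cosine)."""
    q = np.asarray(query_vec, dtype=float)
    M = np.asarray(chunk_vecs, dtype=float)
    # Normalize so a dot product equals cosine similarity
    q = q / np.linalg.norm(q)
    M = M / np.linalg.norm(M, axis=1, keepdims=True)
    sims = M @ q                  # one similarity score per stored chunk
    top = np.argsort(-sims)[:k]   # highest similarity first
    return [(int(i), float(sims[i])) for i in top]

# Toy 3-dimensional "embeddings"
chunk_vecs = [[1, 0, 0], [0.9, 0.1, 0], [0, 1, 0], [0, 0, 1]]
print(brute_force_search([1, 0, 0], chunk_vecs, k=2))
# the identical vector ranks first, the nearly parallel one second
```

This scan touches every stored vector, which is exactly the O(n) cost an HNSW index exists to avoid.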
Index your collection. Without an approximate-nearest-neighbor index, vector search is O(n): the query is compared against every stored vector. With an HNSW index, search drops to roughly O(log n). Critical for production deployments with millions of chunks.
# ── Step 05: Vector Store (ChromaDB) ─────────────
import chromadb

# ── Persistent client (survives restarts) ─────────
client = chromadb.PersistentClient(
    path="./vikash_rag_db"
)

# ── Create or load collection ─────────────────────
collection = client.get_or_create_collection(
    name="vikash_innovative_tech",
    metadata={"hnsw:space": "cosine"}  # cosine distance
)

def store_chunks(chunks, embeddings):
    """Add chunks + embeddings to ChromaDB."""
    collection.add(
        documents=[c["content"] for c in chunks],
        embeddings=embeddings,
        metadatas=[c["metadata"] for c in chunks],
        ids=[str(i) for i in range(len(chunks))]
    )
    print(f"✓ Stored {collection.count()} chunks")

store_chunks(chunks, embeddings)
# → ✓ Stored 22 chunks

# ── Metadata filtering (powerful feature) ─────────
# Find only chunks from warranty policy pages.
# query_emb must be produced by the same embedding
# model that embedded the chunks.
results = collection.query(
    query_embeddings=[query_emb],
    n_results=3,
    where={"source": "warranty_policy.pdf"}
)

# ── Check if collection already built ─────────────
if collection.count() > 0:
    print("Loaded existing index — skip rebuild")
else:
    store_chunks(chunks, embeddings)
A user asks a question. We embed their query using the same model, then find the K nearest chunks in the vector database. This is where RAG's power becomes visible — semantic matching, not keywords.
Top-K guideline: K=3 is a good default. Too low (K=1) and you miss context. Too high (K=20) and you fill the prompt with noise. For Vikash Innovative Tech docs, K=3–5 works well.
Cosine similarity ranges from -1 to 1, but for typical text embeddings scores land between 0 and 1. Score >0.85 = very relevant. Score 0.6–0.85 = somewhat relevant. Score <0.6 = probably irrelevant; consider filtering these out.
# ── Step 06: Retrieval ───────────────────────────
def retrieve(query: str, k=3) -> list[dict]:
    """Find the k most relevant chunks for a query."""
    # 1. Embed the query (same model as the chunks!)
    q_emb = embed_openai([query])[0]

    # 2. Query the vector store
    results = collection.query(
        query_embeddings=[q_emb],
        n_results=k
    )

    # 3. Format results with similarity scores
    retrieved = []
    for i in range(len(results["documents"][0])):
        sim = 1 - results["distances"][0][i]  # distance → similarity
        if sim > 0.5:  # filter low-relevance hits
            retrieved.append({
                "text": results["documents"][0][i],
                "score": round(sim, 4),
                "source": results["metadatas"][0][i],
            })
    return sorted(retrieved, key=lambda x: x["score"], reverse=True)

# ── Example ───────────────────────────────────────
results = retrieve("What does the warranty cover?")
for r in results:
    print(f"Score: {r['score']:.3f}")
    print(r["text"][:100] + "...")
# → Score: 0.940
#   All laptops come with a 1-year limited...
# → Score: 0.872
#   Warranty covers: manufacturing defects...
The final step: combine retrieved chunks into a prompt and send to the LLM. The key insight — the system prompt must explicitly constrain the LLM to answer only from context. This eliminates hallucinations.
Prompt engineering for RAG: (1) Put context before the question. (2) Tell the LLM to say "I don't know" if the answer isn't in the context. (3) Require source citation. These 3 rules prevent the vast majority of hallucinations.
Set temperature=0.1. Lower temperature = more factual, deterministic answers. Higher temperature = more creative but more likely to drift from the context. For Q&A, stay at 0.0–0.2.
# ── Step 07: LLM + Prompt Engineering ────────────
from openai import OpenAI

client = OpenAI()

def build_context(retrieved_chunks):
    """Format retrieved chunks as numbered context."""
    return "\n\n".join(
        f"[{i+1}] (Source: {c['source'].get('source', '?')}, "
        f"page {c['source'].get('page', '?')})\n{c['text']}"
        for i, c in enumerate(retrieved_chunks)
    )

def rag_answer(question: str, k=3) -> str:
    """Full RAG: retrieve → prompt → answer."""
    # Retrieve relevant chunks
    chunks = retrieve(question, k=k)
    context = build_context(chunks)

    # Build system prompt with strict grounding
    system = """You are the AI assistant for Vikash Innovative Tech.
Answer ONLY based on the retrieved context provided.
If the answer is not in the context, say exactly:
'I don't have enough information to answer this.'
Always cite sources as [1], [2], etc.
Be concise and accurate."""

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system},
            {"role": "user",
             "content": f"Context:\n{context}\n\nQuestion: {question}"}
        ],
        temperature=0.1
    )
    return response.choices[0].message.content

# ── Complete pipeline ─────────────────────────────
answer = rag_answer("What does the warranty cover?")
print(answer)
# → Based on [1], Vikash Innovative Tech's 1-year
#   warranty covers manufacturing defects, hardware
#   malfunctions, and faulty components...
Drag the sliders, switch strategies, and watch Vikash Innovative Tech policy text re-chunk in real time. Click any colored span to inspect that chunk.
Each chunk becomes a point in high-dimensional space. We project to 2D here. Change the chunk size and watch the semantic clusters re-form with live physics animation.
Production RAG systems layer these techniques on top of the baseline pipeline to substantially improve accuracy on real-world queries.
The math powering semantic search. Measures angle between two embedding vectors. Score=1.0 means identical meaning. Score=0.0 means completely unrelated. "Warranty" and "guarantee" cluster at ~0.9.
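The formula is cos(a, b) = a·b / (‖a‖‖b‖). A minimal worked example, using toy 2-D vectors as stand-ins for real high-dimensional embeddings (the numbers are illustrative, not real model outputs):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity: dot product over the product of norms."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 2-D vectors standing in for 1536-dim embeddings
warranty  = [0.9, 0.1]
guarantee = [0.85, 0.2]   # points in nearly the same direction
pizza     = [0.1, 0.95]   # points in a very different direction

print(cosine(warranty, guarantee))  # high: similar meaning
print(cosine(warranty, pizza))      # much lower: unrelated
```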
Fixed-size, sliding window, recursive, semantic, and sentence-level chunking — each with distinct trade-offs. Recursive splitting (paragraph → sentence → word) wins on most structured documents.
Combine dense (semantic) retrieval with sparse (BM25/TF-IDF keyword) retrieval. Dense finds meaning. Sparse finds exact names/codes. Hybrid search + reciprocal rank fusion = state-of-the-art precision.
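Reciprocal rank fusion itself is a few lines. A sketch, assuming two ranked lists of hypothetical chunk ids (one from dense retrieval, one from BM25); each chunk scores the sum of 1/(k + rank) over every list it appears in:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk ids into one ranking.
    k=60 is the commonly used smoothing constant."""
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

dense  = ["c7", "c2", "c9"]   # semantic ranking (hypothetical ids)
sparse = ["c2", "c4", "c7"]   # BM25 keyword ranking
print(reciprocal_rank_fusion([dense, sparse]))
# → ['c2', 'c7', 'c4', 'c9']  — ids in both lists rise to the top
```

Chunks endorsed by both retrievers outrank chunks that only one retriever liked, without ever comparing their incompatible raw scores.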
Retrieve the top-20 chunks, then use a cross-encoder model to re-rank and select the top-3. Cross-encoders score query-chunk pairs jointly — 2–3× better precision than bi-encoders at 5× compute cost.
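The retrieve-wide, rerank-narrow pattern can be sketched independently of any model. Here the scorer is a deliberately crude token-overlap stand-in; a real system would plug in a cross-encoder (for example via the sentence-transformers CrossEncoder class) as `score_fn`:

```python
def rerank(query, candidates, score_fn, top_n=3):
    """Score each (query, chunk) pair jointly, keep the best top_n."""
    scored = [(score_fn(query, c), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_n]]

def overlap_score(query, chunk):
    """Stand-in scorer: fraction of query tokens found in the chunk.
    Replace with a cross-encoder's predict() in production."""
    q, c = set(query.lower().split()), set(chunk.lower().split())
    return len(q & c) / len(q)

candidates = [
    "Warranty covers manufacturing defects for one year",
    "Our office is closed on public holidays",
    "To file a warranty claim, email support",
]
top = rerank("what does the warranty cover", candidates, overlap_score, top_n=2)
# the off-topic office-hours chunk is dropped
```

The point of the pattern: the first-stage retriever only needs good recall, because the (more expensive) pairwise scorer fixes precision on a small candidate set.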
Pre-filter by date, department, document type, or author before semantic search. For Vikash Innovative Tech: "Find answers from warranty docs created after 2024" combines structured + unstructured search.
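ChromaDB exposes this through the `where=` argument shown in Step 05; the underlying idea is simply filter-then-rank. A self-contained sketch over an in-memory list of hypothetical records:

```python
import numpy as np

def filtered_search(query_vec, records, where, k=3):
    """records: list of {"embedding": [...], "metadata": {...}}.
    Apply the structured filter first, then rank survivors by cosine."""
    survivors = [r for r in records
                 if all(r["metadata"].get(key) == val
                        for key, val in where.items())]
    q = np.asarray(query_vec, float)
    q = q / np.linalg.norm(q)
    def sim(r):
        v = np.asarray(r["embedding"], float)
        return float(v @ q / np.linalg.norm(v))
    return sorted(survivors, key=sim, reverse=True)[:k]

records = [
    {"embedding": [1, 0],     "metadata": {"type": "warranty", "year": 2024}},
    {"embedding": [0.9, 0.2], "metadata": {"type": "hr",       "year": 2024}},
    {"embedding": [0, 1],     "metadata": {"type": "warranty", "year": 2023}},
]
hits = filtered_search([1, 0], records, where={"type": "warranty"})
# the HR record never enters the semantic ranking at all
```

Filtering first shrinks the search space and guarantees that results satisfy the structured constraint, no matter how semantically tempting an out-of-scope chunk is.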
Measure: faithfulness (zero hallucination), answer relevancy, context precision, context recall. Never deploy without evals. RAGAS automates this against ground-truth Q&A pairs you define.
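The retrieval-side metrics are easy to compute yourself once you have ground-truth labels. A simplified sketch (RAGAS computes LLM-judged variants of these; here relevance is given as hypothetical hand-labeled chunk ids):

```python
def context_precision(retrieved_ids, relevant_ids):
    """Fraction of retrieved chunks that are actually relevant."""
    if not retrieved_ids:
        return 0.0
    hits = sum(1 for cid in retrieved_ids if cid in relevant_ids)
    return hits / len(retrieved_ids)

def context_recall(retrieved_ids, relevant_ids):
    """Fraction of relevant chunks that were retrieved."""
    if not relevant_ids:
        return 1.0
    hits = sum(1 for cid in relevant_ids if cid in retrieved_ids)
    return hits / len(relevant_ids)

# Hypothetical ground-truth pair for one test question
retrieved = ["c1", "c4", "c9"]
relevant  = {"c1", "c9"}
print(context_precision(retrieved, relevant))  # 2 of 3 retrieved are relevant
print(context_recall(retrieved, relevant))     # both relevant chunks were found
```

Averaged over a test set of Q&A pairs, these two numbers tell you whether retrieval, not the LLM, is the weak link.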
Generate multiple paraphrases of the user question before retrieval. "Warranty claim process" → also search "how to file warranty", "warranty repair steps". Dramatically improves recall.
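The merge step can be sketched without an LLM: run retrieval once per paraphrase and keep each chunk's best score. Here the paraphrases are hardcoded and the retriever is a toy lookup table; in production the paraphrases come from an LLM call and `retrieve_fn` hits the real vector store:

```python
def expanded_retrieve(paraphrases, retrieve_fn, k=3):
    """Run retrieval per paraphrase, dedupe by keeping the best score."""
    best = {}
    for q in paraphrases:
        for chunk_id, score in retrieve_fn(q):
            if score > best.get(chunk_id, -1.0):
                best[chunk_id] = score
    merged = sorted(best.items(), key=lambda kv: kv[1], reverse=True)
    return merged[:k]

# Toy retriever standing in for real vector-store queries
fake_index = {
    "warranty claim process": [("c1", 0.91), ("c5", 0.62)],
    "how to file warranty":   [("c2", 0.88), ("c1", 0.85)],
    "warranty repair steps":  [("c3", 0.80)],
}
result = expanded_retrieve(fake_index.keys(), fake_index.get, k=3)
# "c1" keeps its best score (0.91) even though one paraphrase scored it lower
```

Chunks that only one phrasing of the question can reach now make it into the candidate pool, which is exactly the recall boost the card describes.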
Store small child chunks for precise retrieval, but send parent (larger) chunks to the LLM for full context. Best of both worlds — surgical retrieval precision with broad generative context.
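A minimal sketch of the small-to-big lookup, with a stand-in child retriever and hypothetical parent ids: search hits land on small child chunks, then each hit is swapped for its (deduplicated) parent section before prompting the LLM.

```python
def small_to_big(query, child_search, parents):
    """Search small child chunks, return deduped parent chunks."""
    child_hits = child_search(query)          # precise retrieval
    seen, context = set(), []
    for child in child_hits:
        pid = child["parent_id"]
        if pid not in seen:                   # dedupe siblings
            seen.add(pid)
            context.append(parents[pid])      # broad context for the LLM
    return context

parents = {
    "p1": "FULL WARRANTY SECTION: coverage, exclusions, claim steps ...",
    "p2": "FULL RETURNS SECTION: 30-day window, refund process ...",
}
def child_search(query):
    # Stand-in for a real vector query over child chunks
    return [{"text": "claim steps", "parent_id": "p1"},
            {"text": "coverage",    "parent_id": "p1"}]

print(small_to_big("how do I claim?", child_search, parents))
# two child hits collapse into the single warranty parent section
```

The child chunks decide *what* is relevant; the parent chunks decide *how much* context the LLM actually sees.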
Powered by Vikash Innovative Tech company policy data. Ask any question and watch the full pipeline run — retrieve, contextualize, answer.