A fully dynamic learning experience where RAG comes alive. Adjust chunk sizes, visualize embeddings, simulate retrieval, and watch the entire pipeline react in real time. From beginner clarity to production-grade mastery, all explained in the simplest possible way. From raw documents to a production-grade AI system that answers with precision: every step explained, every concept interactive, every algorithm visualized live.
LLMs hallucinate and know nothing about your private data. RAG solves both problems permanently.
LLMs train on public internet data. They have zero knowledge of your company documents, internal policies, Vikash Innovative Tech product specs, customer data, or anything created after their training cutoff. Ask them about your business and they'll confidently fabricate answers.
LLMs generate statistically plausible text, not factually verified text. They will invent citations, make up statistics, and state wrong dates with total confidence. In production systems this is catastrophic, and it can expose you to real legal liability.
Think of it as an open-book exam. Before answering any question, the system retrieves the most relevant pages from your document library. The LLM sees those pages and answers only from that retrieved context — grounded, sourced, verifiable.
Click any stage to jump straight to its deep-dive section with code and interactive demo.
Every step explained with production-grade Python code, gotchas, and live demos.
Before anything can happen, we need raw text. PDFs, Word docs, plain text, HTML — all become a standardized Python dict: { "content": text, "metadata": {...} }. That's the atomic unit of RAG.
What is a document in RAG? A Python dict with two keys: content (the raw text) and metadata (source filename, page number, section, etc.). Everything downstream depends on this structure being consistent.
Metadata is gold. Always attach source filename and page number. When the LLM answers a Vikash Innovative Tech policy question, the user can see exactly which document page it came from. That's what makes RAG trustworthy.
Supported loaders: pdfplumber / pypdf (PDF), python-docx (Word), beautifulsoup4 (HTML), csv reader, direct open() for .txt and .md. Use the right loader for each file type.
# ── Step 01: Document Loading ──────────────────────
# Vikash Innovative Tech — RAG Pipeline
import pdfplumber
import os

def load_pdf(file_path: str) -> list[dict]:
    """Load PDF and extract text page-by-page."""
    documents = []
    with pdfplumber.open(file_path) as pdf:
        for i, page in enumerate(pdf.pages):
            text = page.extract_text()
            if text and len(text.strip()) > 20:
                documents.append({
                    "content": text,
                    "metadata": {
                        "source": file_path,
                        "page": i + 1,
                        "type": "pdf",
                        "company": "Vikash Innovative Tech"
                    }
                })
    return documents

def load_text(path: str) -> list[dict]:
    """Load plain text file as a single document."""
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()
    return [{
        "content": text,
        "metadata": {"source": path, "page": 1, "type": "text"}
    }]

def load_directory(dir_path: str) -> list[dict]:
    """Load all supported files in a directory."""
    loaders = {".pdf": load_pdf, ".txt": load_text}
    all_docs = []
    for fn in os.listdir(dir_path):
        ext = os.path.splitext(fn)[1]
        if ext in loaders:
            docs = loaders[ext](os.path.join(dir_path, fn))
            all_docs.extend(docs)
    return all_docs

# ── Usage ──────────────────────────────────────────
docs = load_pdf("vik_policy.pdf")
print(f"Loaded {len(docs)} pages")  # → Loaded 4 pages
# docs[0] = {
#   "content": "Vikash Innovative Tech\nWarranty Policy...",
#   "metadata": {"source": "vik_policy.pdf", "page": 1, ...}
# }
Raw PDF text is messy — broken sentences, double spaces, weird Unicode, stray headers. Embeddings work better on clean text. This step is unglamorous but critical.
What to fix: Multiple whitespace → single space. Mid-sentence line breaks → spaces. Non-printable chars → strip. Page headers/footers → remove. The goal: clean, readable prose.
Don't over-clean! Removing too aggressively destroys semantic content. Keep punctuation, numbers, proper nouns, and acronyms intact. Vikash Innovative Tech product names must survive cleaning.
# ── Step 02: Text Preprocessing ──────────────────
import re
import unicodedata

def clean_text(text: str) -> str:
    """Normalize text for embedding quality."""
    # Normalize unicode compatibility forms (e.g. the ligature ﬁ → fi)
    text = unicodedata.normalize("NFKC", text)
    # Join mid-sentence line breaks back into spaces
    text = re.sub(r'(?<![.!?])\n(?=[a-z])', ' ', text)
    # Collapse runs of spaces / tabs
    text = re.sub(r'[ \t]+', ' ', text)
    # Normalize 3+ newlines → double newline
    text = re.sub(r'\n{3,}', '\n\n', text)
    # Strip non-printable characters
    text = re.sub(r'[^\x20-\x7E\n]', '', text)
    # Remove standalone page numbers
    text = re.sub(r'^\d+\s*$', '', text, flags=re.MULTILINE)
    return text.strip()

def preprocess_documents(docs: list) -> list:
    """Clean all documents in the pipeline."""
    cleaned = []
    for doc in docs:
        clean = clean_text(doc["content"])
        if len(clean) > 30:  # skip near-empty pages
            doc["content"] = clean
            cleaned.append(doc)
    return cleaned

# ── Usage ──────────────────────────────────────────
cleaned_docs = preprocess_documents(docs)
print(f"Cleaned: {len(cleaned_docs)} docs remain")
print(cleaned_docs[0]["content"][:200])
# → Vikash Innovative Tech Warranty Policy
#   All laptops come with a 1-year limited warranty...
Documents are too long to embed directly. We split into "chunks" — this is the single most important decision in RAG. Wrong chunk size = bad retrieval = wrong answers.
The Core Trade-off: Chunks too large → retrieval brings back irrelevant noise. Chunks too small → you lose semantic context and answers become shallow. Sweet spot: 200–400 chars for most docs.
Overlap matters. A 50–80 char overlap between adjacent chunks ensures sentences at boundaries are never cut off. Always use overlap in production.
3 Strategies: Fixed Size (equal chunks), Sliding Window (overlapping windows), Recursive Splitter (paragraph→sentence→word priority). Full interactive lab below ↓
# ── Step 03: Chunking (all 3 strategies) ─────────

# ── Strategy 1: Fixed Size ────────────────────────
def chunk_fixed(text, size=300):
    return [text[i:i+size].strip()
            for i in range(0, len(text), size)
            if text[i:i+size].strip()]

# ── Strategy 2: Sliding Window ────────────────────
def chunk_sliding(text, size=300, overlap=60):
    chunks, i = [], 0
    while i < len(text):
        end = min(i + size, len(text))
        t = text[i:end].strip()
        if t:
            chunks.append(t)
        if end == len(text):
            break
        i += size - overlap
    return chunks

# ── Strategy 3: Recursive Splitter ───────────────
def chunk_recursive(text, size=300,
                    seps=['\n\n', '\n', '. ', '! ', ' ']):
    if len(text) <= size:
        return [text.strip()]
    # Split on the highest-priority separator present
    sep = next((s for s in seps if s in text), ' ')
    parts = text.split(sep)
    chunks, cur = [], ""
    for p in parts:
        candidate = cur + (sep if cur else "") + p
        if len(candidate) <= size:
            cur = candidate
        else:
            if cur.strip():
                chunks.append(cur.strip())
            cur = p
    if cur.strip():
        chunks.append(cur.strip())
    return [c for c in chunks if len(c) >= 15]

# ── Wrap chunks with metadata ─────────────────────
def chunk_documents(docs, strategy="recursive", **kw):
    fn_map = {
        "fixed": chunk_fixed,
        "sliding": chunk_sliding,
        "recursive": chunk_recursive
    }
    fn = fn_map[strategy]
    all_chunks = []
    for doc in docs:
        for j, txt in enumerate(fn(doc["content"], **kw)):
            all_chunks.append({
                "content": txt,
                "metadata": {**doc["metadata"], "chunk": j}
            })
    return all_chunks

chunks = chunk_documents(cleaned_docs, strategy="recursive", size=300)
print(f"Created {len(chunks)} chunks")  # → Created 22 chunks
Embeddings convert text to a list of numbers — a vector in high-dimensional space. The magic: semantically similar texts get numerically similar vectors. This enables semantic search, not keyword matching.
What is an embedding? A 768- or 1536-dimensional vector. "Warranty" and "guarantee" sit close together. "Warranty" and "pizza" sit far apart. We retrieve by spatial proximity — finding nearest neighbors.
Critical rule: Always use the SAME embedding model for both chunks and queries. Mixing models makes similarity scores meaningless — like measuring distance in miles vs kilometers.
Model choices: OpenAI text-embedding-3-small (fast, cheap, great), all-MiniLM-L6-v2 (free, local, good), BAAI/bge-large (best open-source), Cohere embed-v3 (multilingual).
# ── Step 04: Embeddings ──────────────────────────
from openai import OpenAI
from sentence_transformers import SentenceTransformer
import numpy as np

# ── Option A: OpenAI (paid, production) ───────────
oai = OpenAI()

def embed_openai(texts: list[str]) -> list:
    resp = oai.embeddings.create(
        input=texts,
        model="text-embedding-3-small"  # 1536 dims
    )
    return [d.embedding for d in resp.data]

# ── Option B: Sentence Transformers (free) ─────────
model = SentenceTransformer("all-MiniLM-L6-v2")

def embed_local(texts: list[str]) -> np.ndarray:
    return model.encode(texts, batch_size=32, show_progress_bar=True)

# ── Embed all chunks in batches ───────────────────
def embed_chunks(chunks, batch_size=100):
    texts = [c["content"] for c in chunks]
    all_emb = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i+batch_size]
        all_emb.extend(embed_openai(batch))
    return all_emb

embeddings = embed_chunks(chunks)
print(f"Shape: {np.array(embeddings).shape}")
# → Shape: (22, 1536)

# ── Manual cosine similarity ──────────────────────
def cosine_sim(a, b):
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
A vector database stores embeddings and lets you query them by similarity — not SQL WHERE clauses, but cosine distance search. Give it a query vector, get back the N closest chunk vectors.
ChromaDB — perfect for prototyping. In-memory or disk-persistent, zero config, Python-native. For production: Pinecone (managed cloud), Weaviate, Qdrant (open-source), pgvector (PostgreSQL).
Vector DB vs SQL: SQL searches by exact value match. Vector DB searches by angle between vectors — "find the 3 chunks that mean the most similar thing to this query." Fundamentally different operation.
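To make that difference concrete, here is a minimal sketch of what a vector store computes under the hood: a brute-force nearest-neighbor scan over toy 3-dimensional vectors (real embeddings have hundreds of dimensions, and real stores replace the scan with an ANN index).

```python
import numpy as np

def brute_force_search(query_vec, chunk_vecs, k=3):
    """Return (index, score) of the k chunks closest to the query (cosine)."""
    q = np.asarray(query_vec, dtype=float)
    M = np.asarray(chunk_vecs, dtype=float)
    # Normalize so a dot product equals cosine similarity
    q = q / np.linalg.norm(q)
    M = M / np.linalg.norm(M, axis=1, keepdims=True)
    sims = M @ q                  # one similarity score per stored chunk
    top = np.argsort(-sims)[:k]   # highest similarity first
    return [(int(i), float(sims[i])) for i in top]

# Toy 3-dimensional "embeddings"
chunk_vecs = [[1, 0, 0], [0.9, 0.1, 0], [0, 1, 0], [0, 0, 1]]
print(brute_force_search([1, 0, 0], chunk_vecs, k=2))
# the identical vector ranks first, the nearly parallel one second
```

This scan touches every stored vector, which is exactly the O(n) cost an HNSW index exists to avoid.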
Index your collection. Without an approximate-nearest-neighbor index, vector search is O(n): the query is compared against every stored vector. With an HNSW index, search drops to roughly O(log n). Critical for production deployments with millions of chunks.
# ── Step 05: Vector Store (ChromaDB) ─────────────
import chromadb

# ── Persistent client (survives restarts) ─────────
client = chromadb.PersistentClient(
    path="./vikash_rag_db"
)

# ── Create or load collection ─────────────────────
collection = client.get_or_create_collection(
    name="vikash_innovative_tech",
    metadata={"hnsw:space": "cosine"}  # cosine distance
)

def store_chunks(chunks, embeddings):
    """Add chunks + embeddings to ChromaDB."""
    collection.add(
        documents=[c["content"] for c in chunks],
        embeddings=embeddings,
        metadatas=[c["metadata"] for c in chunks],
        ids=[str(i) for i in range(len(chunks))]
    )
    print(f"✓ Stored {collection.count()} chunks")

store_chunks(chunks, embeddings)
# → ✓ Stored 22 chunks

# ── Metadata filtering (powerful feature) ─────────
# Find only chunks from warranty policy pages.
# query_emb must be produced by the same embedding
# model that embedded the chunks.
results = collection.query(
    query_embeddings=[query_emb],
    n_results=3,
    where={"source": "warranty_policy.pdf"}
)

# ── Check if collection already built ─────────────
if collection.count() > 0:
    print("Loaded existing index — skip rebuild")
else:
    store_chunks(chunks, embeddings)
A user asks a question. We embed their query using the same model, then find the K nearest chunks in the vector database. This is where RAG's power becomes visible — semantic matching, not keywords.
Top-K guideline: K=3 is a good default. Too low (K=1) and you miss context. Too high (K=20) and you fill the prompt with noise. For Vikash Innovative Tech docs, K=3–5 works well.
Cosine similarity ranges from -1 to 1, but for typical text embeddings scores land between 0 and 1. Score >0.85 = very relevant. Score 0.6–0.85 = somewhat relevant. Score <0.6 = probably irrelevant; consider filtering these out.
# ── Step 06: Retrieval ───────────────────────────
def retrieve(query: str, k=3) -> list[dict]:
    """Find the k most relevant chunks for a query."""
    # 1. Embed the query (same model as the chunks!)
    q_emb = embed_openai([query])[0]

    # 2. Query the vector store
    results = collection.query(
        query_embeddings=[q_emb],
        n_results=k
    )

    # 3. Format results with similarity scores
    retrieved = []
    for i in range(len(results["documents"][0])):
        sim = 1 - results["distances"][0][i]  # distance → similarity
        if sim > 0.5:  # filter low-relevance hits
            retrieved.append({
                "text": results["documents"][0][i],
                "score": round(sim, 4),
                "source": results["metadatas"][0][i],
            })
    return sorted(retrieved, key=lambda x: x["score"], reverse=True)

# ── Example ───────────────────────────────────────
results = retrieve("What does the warranty cover?")
for r in results:
    print(f"Score: {r['score']:.3f}")
    print(r["text"][:100] + "...")
# → Score: 0.940
#   All laptops come with a 1-year limited...
# → Score: 0.872
#   Warranty covers: manufacturing defects...
The final step: combine retrieved chunks into a prompt and send to the LLM. The key insight — the system prompt must explicitly constrain the LLM to answer only from context. This eliminates hallucinations.
Prompt engineering for RAG: (1) Put context before the question. (2) Tell the LLM to say "I don't know" if the answer isn't in the context. (3) Require source citation. These 3 rules prevent the vast majority of hallucinations.
Set temperature=0.1. Lower temperature = more factual, deterministic answers. Higher temperature = more creative but more likely to drift from the context. For Q&A, stay at 0.0–0.2.
# ── Step 07: LLM + Prompt Engineering ────────────
from openai import OpenAI

client = OpenAI()

def build_context(retrieved_chunks):
    """Format retrieved chunks as numbered context."""
    return "\n\n".join(
        f"[{i+1}] (Source: {c['source'].get('source', '?')}, "
        f"page {c['source'].get('page', '?')})\n{c['text']}"
        for i, c in enumerate(retrieved_chunks)
    )

def rag_answer(question: str, k=3) -> str:
    """Full RAG: retrieve → prompt → answer."""
    # Retrieve relevant chunks
    chunks = retrieve(question, k=k)
    context = build_context(chunks)

    # Build system prompt with strict grounding
    system = """You are the AI assistant for Vikash Innovative Tech.
Answer ONLY based on the retrieved context provided.
If the answer is not in the context, say exactly:
'I don't have enough information to answer this.'
Always cite sources as [1], [2], etc.
Be concise and accurate."""

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system},
            {"role": "user",
             "content": f"Context:\n{context}\n\nQuestion: {question}"}
        ],
        temperature=0.1
    )
    return response.choices[0].message.content

# ── Complete pipeline ─────────────────────────────
answer = rag_answer("What does the warranty cover?")
print(answer)
# → Based on [1], Vikash Innovative Tech's 1-year
#   warranty covers manufacturing defects, hardware
#   malfunctions, and faulty components...
Drag the sliders, switch strategies, and watch Vikash Innovative Tech policy text re-chunk in real time. Click any colored span to inspect that chunk.
Each chunk becomes a point in high-dimensional space. We project to 2D here. Change the chunk size and watch the semantic clusters re-form with live physics animation.
Production RAG systems layer these techniques on top of the baseline pipeline to substantially improve accuracy on real-world queries.
The math powering semantic search. Measures angle between two embedding vectors. Score=1.0 means identical meaning. Score=0.0 means completely unrelated. "Warranty" and "guarantee" cluster at ~0.9.
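The formula is cos(a, b) = a·b / (‖a‖‖b‖). A minimal worked example, using toy 2-D vectors as stand-ins for real high-dimensional embeddings (the numbers are illustrative, not real model outputs):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity: dot product over the product of norms."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 2-D vectors standing in for 1536-dim embeddings
warranty  = [0.9, 0.1]
guarantee = [0.85, 0.2]   # points in nearly the same direction
pizza     = [0.1, 0.95]   # points in a very different direction

print(cosine(warranty, guarantee))  # high: similar meaning
print(cosine(warranty, pizza))      # much lower: unrelated
```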
Fixed-size, sliding window, recursive, semantic, and sentence-level chunking — each with distinct trade-offs. Recursive splitting (paragraph → sentence → word) wins on most structured documents.
Combine dense (semantic) retrieval with sparse (BM25/TF-IDF keyword) retrieval. Dense finds meaning. Sparse finds exact names/codes. Hybrid search + reciprocal rank fusion = state-of-the-art precision.
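Reciprocal rank fusion itself is a few lines. A sketch, assuming two ranked lists of hypothetical chunk ids (one from dense retrieval, one from BM25); each chunk scores the sum of 1/(k + rank) over every list it appears in:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk ids into one ranking.
    k=60 is the commonly used smoothing constant."""
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

dense  = ["c7", "c2", "c9"]   # semantic ranking (hypothetical ids)
sparse = ["c2", "c4", "c7"]   # BM25 keyword ranking
print(reciprocal_rank_fusion([dense, sparse]))
# → ['c2', 'c7', 'c4', 'c9']  — ids in both lists rise to the top
```

Chunks endorsed by both retrievers outrank chunks that only one retriever liked, without ever comparing their incompatible raw scores.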
Retrieve the top-20 chunks, then use a cross-encoder model to re-rank and select the top-3. Cross-encoders score query-chunk pairs jointly — 2–3× better precision than bi-encoders at 5× compute cost.
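The retrieve-wide, rerank-narrow pattern can be sketched independently of any model. Here the scorer is a deliberately crude token-overlap stand-in; a real system would plug in a cross-encoder (for example via the sentence-transformers CrossEncoder class) as `score_fn`:

```python
def rerank(query, candidates, score_fn, top_n=3):
    """Score each (query, chunk) pair jointly, keep the best top_n."""
    scored = [(score_fn(query, c), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_n]]

def overlap_score(query, chunk):
    """Stand-in scorer: fraction of query tokens found in the chunk.
    Replace with a cross-encoder's predict() in production."""
    q, c = set(query.lower().split()), set(chunk.lower().split())
    return len(q & c) / len(q)

candidates = [
    "Warranty covers manufacturing defects for one year",
    "Our office is closed on public holidays",
    "To file a warranty claim, email support",
]
top = rerank("what does the warranty cover", candidates, overlap_score, top_n=2)
# the off-topic office-hours chunk is dropped
```

The point of the pattern: the first-stage retriever only needs good recall, because the (more expensive) pairwise scorer fixes precision on a small candidate set.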
Pre-filter by date, department, document type, or author before semantic search. For Vikash Innovative Tech: "Find answers from warranty docs created after 2024" combines structured + unstructured search.
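ChromaDB exposes this through the `where=` argument shown in Step 05; the underlying idea is simply filter-then-rank. A self-contained sketch over an in-memory list of hypothetical records:

```python
import numpy as np

def filtered_search(query_vec, records, where, k=3):
    """records: list of {"embedding": [...], "metadata": {...}}.
    Apply the structured filter first, then rank survivors by cosine."""
    survivors = [r for r in records
                 if all(r["metadata"].get(key) == val
                        for key, val in where.items())]
    q = np.asarray(query_vec, float)
    q = q / np.linalg.norm(q)
    def sim(r):
        v = np.asarray(r["embedding"], float)
        return float(v @ q / np.linalg.norm(v))
    return sorted(survivors, key=sim, reverse=True)[:k]

records = [
    {"embedding": [1, 0],     "metadata": {"type": "warranty", "year": 2024}},
    {"embedding": [0.9, 0.2], "metadata": {"type": "hr",       "year": 2024}},
    {"embedding": [0, 1],     "metadata": {"type": "warranty", "year": 2023}},
]
hits = filtered_search([1, 0], records, where={"type": "warranty"})
# the HR record never enters the semantic ranking at all
```

Filtering first shrinks the search space and guarantees that results satisfy the structured constraint, no matter how semantically tempting an out-of-scope chunk is.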
Measure: faithfulness (zero hallucination), answer relevancy, context precision, context recall. Never deploy without evals. RAGAS automates this against ground-truth Q&A pairs you define.
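The retrieval-side metrics are easy to compute yourself once you have ground-truth labels. A simplified sketch (RAGAS computes LLM-judged variants of these; here relevance is given as hypothetical hand-labeled chunk ids):

```python
def context_precision(retrieved_ids, relevant_ids):
    """Fraction of retrieved chunks that are actually relevant."""
    if not retrieved_ids:
        return 0.0
    hits = sum(1 for cid in retrieved_ids if cid in relevant_ids)
    return hits / len(retrieved_ids)

def context_recall(retrieved_ids, relevant_ids):
    """Fraction of relevant chunks that were retrieved."""
    if not relevant_ids:
        return 1.0
    hits = sum(1 for cid in relevant_ids if cid in retrieved_ids)
    return hits / len(relevant_ids)

# Hypothetical ground-truth pair for one test question
retrieved = ["c1", "c4", "c9"]
relevant  = {"c1", "c9"}
print(context_precision(retrieved, relevant))  # 2 of 3 retrieved are relevant
print(context_recall(retrieved, relevant))     # both relevant chunks were found
```

Averaged over a test set of Q&A pairs, these two numbers tell you whether retrieval, not the LLM, is the weak link.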
Generate multiple paraphrases of the user question before retrieval. "Warranty claim process" → also search "how to file warranty", "warranty repair steps". Dramatically improves recall.
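The merge step can be sketched without an LLM: run retrieval once per paraphrase and keep each chunk's best score. Here the paraphrases are hardcoded and the retriever is a toy lookup table; in production the paraphrases come from an LLM call and `retrieve_fn` hits the real vector store:

```python
def expanded_retrieve(paraphrases, retrieve_fn, k=3):
    """Run retrieval per paraphrase, dedupe by keeping the best score."""
    best = {}
    for q in paraphrases:
        for chunk_id, score in retrieve_fn(q):
            if score > best.get(chunk_id, -1.0):
                best[chunk_id] = score
    merged = sorted(best.items(), key=lambda kv: kv[1], reverse=True)
    return merged[:k]

# Toy retriever standing in for real vector-store queries
fake_index = {
    "warranty claim process": [("c1", 0.91), ("c5", 0.62)],
    "how to file warranty":   [("c2", 0.88), ("c1", 0.85)],
    "warranty repair steps":  [("c3", 0.80)],
}
result = expanded_retrieve(fake_index.keys(), fake_index.get, k=3)
# "c1" keeps its best score (0.91) even though one paraphrase scored it lower
```

Chunks that only one phrasing of the question can reach now make it into the candidate pool, which is exactly the recall boost the card describes.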
Store small child chunks for precise retrieval, but send parent (larger) chunks to the LLM for full context. Best of both worlds — surgical retrieval precision with broad generative context.
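A minimal sketch of the small-to-big lookup, with a stand-in child retriever and hypothetical parent ids: search hits land on small child chunks, then each hit is swapped for its (deduplicated) parent section before prompting the LLM.

```python
def small_to_big(query, child_search, parents):
    """Search small child chunks, return deduped parent chunks."""
    child_hits = child_search(query)          # precise retrieval
    seen, context = set(), []
    for child in child_hits:
        pid = child["parent_id"]
        if pid not in seen:                   # dedupe siblings
            seen.add(pid)
            context.append(parents[pid])      # broad context for the LLM
    return context

parents = {
    "p1": "FULL WARRANTY SECTION: coverage, exclusions, claim steps ...",
    "p2": "FULL RETURNS SECTION: 30-day window, refund process ...",
}
def child_search(query):
    # Stand-in for a real vector query over child chunks
    return [{"text": "claim steps", "parent_id": "p1"},
            {"text": "coverage",    "parent_id": "p1"}]

print(small_to_big("how do I claim?", child_search, parents))
# two child hits collapse into the single warranty parent section
```

The child chunks decide *what* is relevant; the parent chunks decide *how much* context the LLM actually sees.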
Powered by Vikash Innovative Tech company policy data. Ask any question and watch the full pipeline run — retrieve, contextualize, answer.