What is RAG in simple terms?

RAG (Retrieval-Augmented Generation) is a technique where an AI model first searches your documents for relevant information, then uses that information to generate an accurate answer. Think of it as giving the AI an open-book exam instead of relying on memory.

Why do LLMs need RAG?

LLMs have a knowledge cutoff date and cannot access your private data. They also hallucinate, producing plausible-sounding but incorrect information. RAG addresses the first two problems directly and reduces hallucination by grounding the AI's responses in your actual documents.

What is the difference between RAG and fine-tuning?

RAG retrieves relevant documents at query time and includes them in the prompt. Fine-tuning modifies the model's weights with your data. RAG is easier, cheaper, keeps data updatable, and requires no ML expertise. Fine-tuning is better for teaching the model a new style or behavior.

Can I build a RAG app for free?

Yes. Use free-tier APIs from Google (Gemini) or Hugging Face for both the embeddings and the LLM. ChromaDB is free and open source. You can host on Google Colab or a free Railway/Render tier. The total cost can be zero for development and small-scale use, subject to each provider's current free-tier limits.

What documents can RAG work with?

RAG works with any text-based documents: PDFs, Word files, web pages, CSVs, JSON, code files, emails, Notion pages, and more. LangChain has document loaders for over 100 file formats and data sources.

How much Python do I need to know for this tutorial?

Basic Python is sufficient — variables, functions, loops, and package installation with pip. You do not need machine learning or data science experience. The tutorial explains each step.

Is RAG useful for Indian businesses?

Very much so. Indian companies use RAG for querying internal policy documents in multiple languages, searching legal databases, building customer support bots that reference product manuals, and analyzing government compliance documents.

What is RAG? Build Your First RAG Application (India Tutorial) — India Guide 2026

Retrieval Augmented Generation from scratch — use your own documents with LLMs

Every large language model — ChatGPT, Claude, Gemini — has two fundamental limitations: they do not know about your private data, and their knowledge has a cutoff date. If you ask Claude about your company's HR policy or ChatGPT about a government circular published last week, they cannot help.

Retrieval-Augmented Generation (RAG) solves both problems. It is the most practical technique for making AI work with your own data, and it does not require any machine learning expertise to implement.

What You Will Learn

What RAG is and why it matters
How RAG works — the retrieval and generation pipeline
How to build a complete RAG application step by step
Text splitting, embedding, and vector storage
Querying your documents with natural language
Free hosting options for Indian developers
Common RAG pitfalls and how to avoid them

For a conceptual introduction to RAG, see our RAG for Beginners guide. This tutorial focuses on building a working application.

Why LLMs Need RAG

Imagine you have 500 pages of your company's compliance documentation. You want employees to ask questions and get accurate answers. Here are your options:

| Approach | Pros | Cons |
|----------|------|------|
| Paste into prompt | Simple | Limited by context window, expensive at scale |
| Fine-tune the model | AI "learns" your data | Expensive, slow, hard to update, requires ML skills |
| RAG | Accurate, updatable, affordable | Requires setup (this tutorial helps) |

RAG works by:

1. Indexing your documents into a searchable database
2. Retrieving only the relevant portions when a question is asked
3. Generating an answer using the retrieved context

This means the AI only sees the parts of your documents that are relevant to each question, which improves accuracy and reduces cost.
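
To make the index and retrieve steps concrete, here is a minimal sketch in plain Python. It is not what the tutorial's app does: it scores chunks by simple word overlap instead of embeddings, and the sample sentences and function names (`build_index`, `retrieve`) are made up for illustration. The generation step is left out because it needs an LLM.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return the set of words it contains."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_index(chunks: list[str]) -> list[tuple[str, set[str]]]:
    """Indexing: compute each chunk's word set once, up front."""
    return [(chunk, tokenize(chunk)) for chunk in chunks]


def retrieve(index: list[tuple[str, set[str]]], question: str, k: int = 2) -> list[str]:
    """Retrieval: return the k chunks that share the most words with the question."""
    q_words = tokenize(question)
    ranked = sorted(index, key=lambda item: len(q_words & item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


# Made-up sample chunks, standing in for pieces of a policy document
chunks = [
    "Contract employees are entitled to 12 days of paid leave per year.",
    "Travel expenses must be submitted within 30 days of the trip.",
    "Permanent employees receive 24 days of paid leave per year.",
]

index = build_index(chunks)
top = retrieve(index, "How many days of leave do contract employees get?", k=2)

# Only these two chunks would be placed in the prompt for the LLM
for chunk in top:
    print(chunk)
```

Only the two leave-related chunks reach the prompt; the travel-expense chunk is never sent to the model, which is where the accuracy and cost benefits come from.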

RAG Architecture

User Question: "What is the leave policy for contract employees?"
         │
         ▼
┌─────────────────┐
│  1. EMBED QUERY  │  Convert question to a vector (numbers)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  2. SEARCH       │  Find similar document chunks in vector DB
│  (Vector DB)     │  Returns: top 3-5 most relevant chunks
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  3. BUILD PROMPT │  Combine: system instructions + retrieved
│                  │  chunks + user question
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  4. GENERATE     │  LLM generates answer grounded in the
│  (LLM)           │  retrieved context
└────────┬────────┘
         │
         ▼
Answer: "According to the HR Policy v3.2, Section 4.1,
contract employees are entitled to..."

Building a RAG Application: Step by Step

We will build a RAG app that lets you query Indian government PDF documents. The same approach works for any text-based documents.

Step 1: Set Up the Environment

# Create project directory
mkdir rag-india-tutorial
cd rag-india-tutorial

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

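# Install dependencies
# The imports used in this tutorial (langchain.chains, langchain.retrievers)
# follow the pre-1.0 LangChain package layout, so langchain is pinned below 1.0.
pip install "langchain<1.0" langchain-community langchain-chroma
pip install langchain-google-genai  # Free Gemini API
pip install pypdf python-docx docx2txt  # Document loaders (docx2txt is used by Docx2txtLoader)
pip install chromadb                 # Vector database
pip install python-dotenv tqdm       # .env loading and the loader progress bar
pip install rank_bm25                # Needed later for BM25 hybrid search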

Step 2: Get a Free API Key

For this tutorial, we use Google's Gemini API, which offers a free tier.

1. Go to Google AI Studio
2. Sign in with your Google account
3. Create an API key (free, no credit card required)
4. Save it in a .env file:

# .env
GOOGLE_API_KEY=your_api_key_here
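
A missing or placeholder key usually surfaces later as a confusing API error, so it can help to fail early. The helper below is a small sketch, not part of the tutorial's app; `require_api_key` is a hypothetical name, and it assumes the key has already been loaded into the environment (for example by `load_dotenv()`).

```python
import os


def require_api_key(name: str = "GOOGLE_API_KEY", env=None) -> str:
    """Return the key stored under `name`, or raise a clear error if it is unusable."""
    env = os.environ if env is None else env
    value = (env.get(name) or "").strip()
    # Treat the placeholder from the .env template as "not set"
    if not value or value == "your_api_key_here":
        raise RuntimeError(
            f"{name} is not set. Add it to your .env file and call load_dotenv() first."
        )
    return value
```

Calling `require_api_key()` at the top of `main()` would stop the program with a readable message before any documents are processed.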

Alternative free options:

Hugging Face Inference API (free tier with rate limits)
Ollama for fully local, offline RAG (see our Ollama guide)
Google Colab with free GPU for running open-source models

Step 3: Load Your Documents

Create rag_app.py:

import os
from dotenv import load_dotenv
from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

load_dotenv()

# --- Step 3: Load Documents ---

def load_documents(directory: str):
    """Load all PDF files from a directory."""
    loader = DirectoryLoader(
        directory,
        glob="**/*.pdf",
        loader_cls=PyPDFLoader,
        show_progress=True,
    )
    documents = loader.load()
    print(f"Loaded {len(documents)} pages from PDFs")
    return documents


# You can also load other file types:
# from langchain_community.document_loaders import (
#     TextLoader,         # .txt files
#     Docx2txtLoader,     # .docx files
#     CSVLoader,          # .csv files
#     UnstructuredHTMLLoader,  # .html files
# )

Create a documents/ folder and add your PDFs there. For testing, you can download any public Indian government PDF — a budget summary, a policy document, or an annual report.

Step 4: Split Documents into Chunks

Large documents need to be split into smaller chunks for effective retrieval. The chunk size matters — too small and you lose context, too large and you waste the context window.

# --- Step 4: Split into Chunks ---

def split_documents(documents):
    """Split documents into smaller chunks for embedding."""
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,       # Characters per chunk
        chunk_overlap=200,     # Overlap between chunks for context continuity
        length_function=len,
        separators=["\n\n", "\n", ". ", " ", ""],  # Split priorities
    )
    chunks = text_splitter.split_documents(documents)
    print(f"Split into {len(chunks)} chunks")
    return chunks

Why these settings?

chunk_size=1000 — Large enough to contain a complete thought, small enough for precise retrieval
chunk_overlap=200 — Ensures sentences at chunk boundaries are not lost
separators — Prefers splitting at paragraph breaks, then sentences, then words
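
To see how chunk_size and chunk_overlap interact, the sketch below uses a plain sliding window over characters. This is a simplification and an assumption for illustration: RecursiveCharacterTextSplitter prefers to cut at the separators listed above, so its real chunks will differ, but the size and overlap arithmetic is the same idea. `sliding_chunks` is a hypothetical helper name.

```python
def sliding_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Cut text into fixed-size windows, each starting chunk_size - chunk_overlap after the last."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


# 2,600 characters of dummy text (repeating digits) so the overlap is easy to check
text = "".join(str(i % 10) for i in range(2600))
chunks = sliding_chunks(text)

print(len(chunks))                          # number of chunks produced
print(chunks[0][-200:] == chunks[1][:200])  # the tail of one chunk repeats at the head of the next
```

With these settings each new chunk starts 800 characters after the previous one, so a sentence that straddles a boundary appears whole in at least one chunk as long as it is shorter than the overlap.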

Step 5: Create Embeddings and Store in Vector Database

Embeddings convert text into numerical vectors. Similar texts produce similar vectors, which enables semantic search.

# --- Step 5: Embed and Store ---

from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_chroma import Chroma

def create_vector_store(chunks, persist_directory: str = "./chroma_db"):
    """Create embeddings and store in ChromaDB."""

    # Initialize the embedding model (free with Gemini API)
    embeddings = GoogleGenerativeAIEmbeddings(
        model="models/embedding-001",
        google_api_key=os.getenv("GOOGLE_API_KEY"),
    )

    # Create the vector store
    vector_store = Chroma.from_documents(
        documents=chunks,
        embedding=embeddings,
        persist_directory=persist_directory,
    )
    print(f"Vector store created with {vector_store._collection.count()} vectors")
    return vector_store


def load_vector_store(persist_directory: str = "./chroma_db"):
    """Load an existing vector store."""
    embeddings = GoogleGenerativeAIEmbeddings(
        model="models/embedding-001",
        google_api_key=os.getenv("GOOGLE_API_KEY"),
    )
    return Chroma(
        persist_directory=persist_directory,
        embedding_function=embeddings,
    )

ChromaDB stores the vectors locally — no cloud database needed. The data persists in ./chroma_db/ and loads instantly on restart.
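
The idea that similar texts produce similar vectors can be illustrated with cosine similarity, one common way to compare embeddings. The three-dimensional vectors below are hand-made for illustration; real embeddings have hundreds of dimensions, and the vector store may use a different distance metric internally.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made toy "embeddings"; the numbers are assumptions, not model output
vectors = {
    "leave policy": [0.9, 0.1, 0.0],
    "vacation rules": [0.8, 0.2, 0.1],
    "tax filing": [0.0, 0.2, 0.9],
}
query = [0.85, 0.15, 0.05]  # pretend embedding of a question about leave

# Rank the stored texts from most to least similar to the query
ranked = sorted(
    vectors,
    key=lambda name: cosine_similarity(query, vectors[name]),
    reverse=True,
)
for name in ranked:
    print(f"{name}: {cosine_similarity(query, vectors[name]):.3f}")
```

The two leave-related entries score close to 1.0 while the unrelated one scores near 0, which is the property the retriever relies on when it picks the top chunks.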

Step 6: Build the RAG Chain

This is where retrieval meets generation. When a user asks a question, we find the most relevant chunks and pass them to the LLM.

# --- Step 6: RAG Chain ---

from langchain_google_genai import ChatGoogleGenerativeAI
from langchain.prompts import ChatPromptTemplate
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains import create_retrieval_chain

def create_rag_chain(vector_store):
    """Create the RAG chain combining retrieval and generation."""

    # Initialize the LLM
    llm = ChatGoogleGenerativeAI(
        model="gemini-2.0-flash",
        google_api_key=os.getenv("GOOGLE_API_KEY"),
        temperature=0.3,  # Lower temperature for factual accuracy
    )

    # Create the retriever (searches top 4 most relevant chunks)
    retriever = vector_store.as_retriever(
        search_type="similarity",
        search_kwargs={"k": 4},
    )

    # Define the prompt template
    prompt = ChatPromptTemplate.from_template("""
    You are a helpful assistant that answers questions based on the
    provided context. Follow these rules:

    1. Answer ONLY based on the provided context
    2. If the context does not contain enough information, say
       "I could not find this information in the provided documents"
    3. Cite which document/section the information comes from
    4. Use clear, concise language
    5. For numerical data, present in Indian notation (lakhs, crores)

    Context:
    {context}

    Question: {input}

    Answer:
    """)

    # Create the chain
    document_chain = create_stuff_documents_chain(llm, prompt)
    rag_chain = create_retrieval_chain(retriever, document_chain)

    return rag_chain
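
Conceptually, the "stuff" step pastes the retrieved chunks into the {context} slot of the prompt. The plain-Python sketch below shows that assembly; it is an illustration rather than LangChain's implementation, and the dictionaries, file name, and helper names are hypothetical. It also labels each chunk with its source and page. As far as we are aware, the default stuff chain inserts only the chunk text, so a label like this is one way to give the model something to cite for rule 3.

```python
def format_context(chunks: list[dict]) -> str:
    """Join retrieved chunks into one string, labelling each with its source and page."""
    parts = []
    for i, chunk in enumerate(chunks, 1):
        source = chunk.get("source", "Unknown")
        page = chunk.get("page", "N/A")
        parts.append(f"[{i}] {source} (page {page})\n{chunk['text']}")
    return "\n\n".join(parts)


def build_prompt(question: str, chunks: list[dict]) -> str:
    """Fill a simplified version of the tutorial's template with context and question."""
    return (
        "Answer ONLY based on the provided context.\n\n"
        f"Context:\n{format_context(chunks)}\n\n"
        f"Question: {question}\n\n"
        "Answer:"
    )


# Made-up retrieved chunks for illustration
retrieved = [
    {"text": "Contract employees may take 12 days of leave.", "source": "hr_policy.pdf", "page": 3},
    {"text": "Leave requests need manager approval.", "source": "hr_policy.pdf", "page": 4},
]
prompt = build_prompt("What is the leave policy for contract employees?", retrieved)
print(prompt)
```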

Step 7: Put It All Together

# --- Step 7: Main Application ---

def main():
    """Run the RAG application."""
    docs_directory = "./documents"

    # Check if vector store already exists
    if os.path.exists("./chroma_db"):
        print("Loading existing vector store...")
        vector_store = load_vector_store()
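    else:
        print("Building vector store from documents...")
        documents = load_documents(docs_directory)
        if not documents:
            # Nothing to index: stop early instead of creating an empty store
            print(f"No PDFs found in {docs_directory}. Add files and run again.")
            return
        chunks = split_documents(documents)
        vector_store = create_vector_store(chunks)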

    # Create the RAG chain
    rag_chain = create_rag_chain(vector_store)

    # Interactive query loop
    print("\n--- RAG Application Ready ---")
    print("Ask questions about your documents. Type 'quit' to exit.\n")

    while True:
        question = input("Your question: ").strip()
        if question.lower() in ("quit", "exit", "q"):
            break
        if not question:
            continue

        result = rag_chain.invoke({"input": question})
        print(f"\nAnswer: {result['answer']}\n")

        # Show which documents were retrieved
        print("Sources:")
        for i, doc in enumerate(result["context"], 1):
            source = doc.metadata.get("source", "Unknown")
            page = doc.metadata.get("page", "N/A")
            print(f"  {i}. {source} (page {page})")
        print()


if __name__ == "__main__":
    main()

Running the Application

# Make sure you have PDFs in the documents/ folder
mkdir -p documents
# Add your PDF files to documents/

# Run the application
python rag_app.py

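The first run builds the vector store, which can take a few minutes depending on how many pages you index. Subsequent runs load the existing store from ./chroma_db/ instead of re-embedding. Because the app only checks whether that folder exists, delete ./chroma_db/ whenever you add or change documents so the index is rebuilt.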

Improving RAG Quality

The basic RAG app works, but here are techniques to significantly improve answer quality.

Technique 1: Hybrid Search

Combine semantic (vector) search with keyword search for better retrieval:

from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever

def create_hybrid_retriever(chunks, vector_store):
    """Combine BM25 keyword search with vector similarity search."""
    bm25_retriever = BM25Retriever.from_documents(chunks)
    bm25_retriever.k = 3

    vector_retriever = vector_store.as_retriever(search_kwargs={"k": 3})

    # 50/50 weight between keyword and semantic search
    ensemble = EnsembleRetriever(
        retrievers=[bm25_retriever, vector_retriever],
        weights=[0.5, 0.5],
    )
    return ensemble
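
To the best of our knowledge, EnsembleRetriever merges the two ranked lists with weighted Reciprocal Rank Fusion (RRF). The sketch below shows that fusion rule in plain Python so the effect of the weights is easier to see. It is an illustration, not the library's code; the document IDs are made up, and the constant c=60 is a commonly used value assumed here.

```python
def weighted_rrf(rankings: list[list[str]], weights: list[float], c: int = 60) -> list[str]:
    """Fuse several ranked lists: each list adds weight / (rank + c) to a document's score."""
    scores: dict[str, float] = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (rank + c)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)


# Made-up results from the two retrievers, best match first
keyword_hits = ["doc_a", "doc_b", "doc_c"]
vector_hits = ["doc_b", "doc_d", "doc_a"]

fused = weighted_rrf([keyword_hits, vector_hits], weights=[0.5, 0.5])
print(fused)  # ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

Documents found by both retrievers (doc_b and doc_a) rise above documents found by only one, which is the main benefit of hybrid search. Raising the first weight would favor exact keyword matches such as section numbers or act names.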

Technique 2: Metadata Filtering

Add metadata to chunks for more precise retrieval:

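# After splitting and before creating the vector store, add metadata to each chunk.
# Metadata set after the chunks are stored will not be available for filtering.
for chunk in chunks:
    chunk.metadata["department"] = "HR"  # or extract from content
    chunk.metadata["year"] = "2026"
    chunk.metadata["language"] = "English"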

# When retrieving, filter by metadata
retriever = vector_store.as_retriever(
    search_kwargs={
        "k": 4,
        "filter": {"department": "HR"},
    }
)
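
Hard-coding "HR" for every chunk only works for a single-department corpus. One option is to derive metadata from where each file lives. The sketch below assumes a hypothetical folder layout of documents/<department>/<year>/<file>.pdf; both the layout and the helper name `metadata_from_path` are assumptions for illustration.

```python
from pathlib import PurePosixPath


def metadata_from_path(source: str, root: str = "documents") -> dict:
    """Derive department and year from a path like documents/hr/2026/leave_policy.pdf."""
    # Normalize Windows separators so the same logic works on any OS
    parts = PurePosixPath(source.replace("\\", "/")).parts
    if root in parts:
        parts = parts[parts.index(root) + 1:]
    metadata = {"department": "unknown", "year": "unknown"}
    # Expect at least <department>/<year>/<file>
    if len(parts) >= 3:
        metadata["department"] = parts[0].upper()
        if parts[1].isdigit():
            metadata["year"] = parts[1]
    return metadata


print(metadata_from_path("documents/hr/2026/leave_policy.pdf"))
# {'department': 'HR', 'year': '2026'}
```

In the tutorial's app, each chunk's `metadata["source"]` holds the file path, so a loop could call `chunk.metadata.update(metadata_from_path(chunk.metadata["source"]))` before the vector store is created.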

Technique 3: Reranking

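After initial retrieval, fetch more candidates than you need and let the LLM discard the irrelevant ones. Note that LLMChainFilter drops chunks rather than reordering them, so this is a filtering step rather than a true reranker:

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainFilter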

def create_reranking_retriever(vector_store, llm):
    """Add LLM-based reranking to filter out irrelevant results."""
    base_retriever = vector_store.as_retriever(search_kwargs={"k": 8})
    compressor = LLMChainFilter.from_llm(llm)

    return ContextualCompressionRetriever(
        base_compressor=compressor,
        base_retriever=base_retriever,
    )
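
The over-fetch-then-filter pattern can be sketched without an LLM. As we understand it, LLMChainFilter asks the model a yes/no relevance question for each candidate chunk; in the sketch below a simple keyword check stands in for that judgement. The function names, stopword list, and sample chunks are all hypothetical.

```python
import re
from typing import Callable

STOPWORDS = {"what", "is", "the", "a", "an", "of", "for", "to", "in", "how", "do"}


def keyword_judge(question: str, chunk: str) -> bool:
    """Stand-in for the LLM's yes/no answer: relevant if any content word is shared."""
    q_words = set(re.findall(r"[a-z]+", question.lower())) - STOPWORDS
    chunk_words = set(re.findall(r"[a-z]+", chunk.lower()))
    return bool(q_words & chunk_words)


def filter_candidates(
    question: str,
    candidates: list[str],
    is_relevant: Callable[[str, str], bool],
    keep: int = 4,
) -> list[str]:
    """Keep at most `keep` candidates that the judge marks as relevant, in original order."""
    kept = [c for c in candidates if is_relevant(question, c)]
    return kept[:keep]


# Made-up candidates, as if returned by a retriever with a generous k
candidates = [
    "Leave policy for contract staff.",
    "Office parking rules.",
    "Annual leave carry-forward limits.",
]
result = filter_candidates("What is the leave policy?", candidates, keyword_judge)
print(result)  # ['Leave policy for contract staff.', 'Annual leave carry-forward limits.']
```

The trade-off is cost: with an LLM as the judge, every candidate adds one extra model call per question, so a large k makes each query slower and uses more of a free-tier quota.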

Free Hosting Options for Indian Developers

| Platform | Free Tier | Best For |
|----------|-----------|----------|
| Google Colab | Free GPU, unlimited notebooks | Development and demos |
| Railway | 500 hours/month, 512MB RAM | Small web apps |
| Render | 750 hours/month | API-based RAG services |
| Hugging Face Spaces | Free CPU instances | Gradio/Streamlit demos |
| Replit | Basic tier free | Quick prototypes |
| PythonAnywhere | Free tier with limitations | Simple Flask apps |

For production deployments, Indian cloud options include:

AWS Mumbai region (ap-south-1) — Free tier for 12 months
Google Cloud India — $300 free credits
Azure India — $200 free credits

Common RAG Pitfalls

| Problem | Cause | Solution |
|---------|-------|----------|
| Irrelevant answers | Chunks too large | Reduce chunk_size to 500-800 |
| Missing context | Chunks too small | Increase chunk_size or chunk_overlap |
| Slow retrieval | Too many vectors | Add metadata filtering, reduce k |
| Hallucinations | Weak prompt | Add explicit "only use provided context" instruction |
| Wrong language output | Mixed language docs | Add language instruction to prompt template |
| Outdated answers | Stale vector store | Rebuild index when documents change |
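
For the stale vector store problem, one possible approach is to fingerprint the documents folder and rebuild only when the fingerprint changes. The sketch below uses only the standard library. The helper names and the idea of a marker file are assumptions for illustration, not part of the tutorial's app.

```python
import hashlib
from pathlib import Path


def documents_fingerprint(directory: str) -> str:
    """Hash the relative paths and contents of all PDFs under `directory`."""
    digest = hashlib.sha256()
    for path in sorted(Path(directory).rglob("*.pdf")):
        digest.update(str(path.relative_to(directory)).encode("utf-8"))
        digest.update(path.read_bytes())
    return digest.hexdigest()


def needs_rebuild(directory: str, marker_file: str) -> bool:
    """Return True when the documents differ from the fingerprint saved at the last build."""
    marker = Path(marker_file)
    current = documents_fingerprint(directory)
    return not (marker.exists() and marker.read_text().strip() == current)


def record_build(directory: str, marker_file: str) -> None:
    """Save the current fingerprint after a successful index build."""
    Path(marker_file).write_text(documents_fingerprint(directory))
```

In `main()`, the existence check on ./chroma_db could then be combined with `needs_rebuild("./documents", "./chroma_db.fingerprint")`, deleting the old store and calling `record_build` after re-indexing. Reading every file on startup is fine for a few hundred PDFs; for larger collections, comparing modification times would be cheaper.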

RAG vs Other Approaches

| Feature | RAG | Fine-Tuning | Long Context Window |
|---------|-----|-------------|---------------------|
| Setup difficulty | Medium | High | Low |
| Cost per query | Low | Low (after training) | High (token cost) |
| Data freshness | Updatable by re-indexing | Requires retraining | Real-time |
| Accuracy | High with good retrieval | High for style/behavior | Degrades with volume |
| Privacy | Index stays local; retrieved chunks go to the LLM API unless the model is also local | Data sent for training | Data sent per query |
| Best for | Knowledge bases, Q&A | Style, behavior changes | Small document sets |

Where to Go From Here

You have built a working RAG application. Here are the next steps:

Add a web interface — Wrap your RAG chain in a Streamlit or Gradio app for a user-friendly interface. See our guide on building apps with AI.
Explore agentic RAG — Combine RAG with AI agents that can decide when to search documents vs when to use tools
Try local models — Run RAG entirely offline using Ollama for privacy-sensitive Indian enterprise data
Connect MCP — Build an MCP server that wraps your RAG system, making it accessible from Claude Desktop, Cursor, and other AI tools
Learn about vector databases — Explore production-grade options like Pinecone, Weaviate, and Qdrant for scaling beyond ChromaDB
Study advanced prompting — Better prompts improve RAG output. See our advanced prompt engineering guide

RAG underpins many enterprise AI applications in 2026. The skills you have built here (document processing, embedding, vector search, and prompt engineering) carry over to a wide range of AI products.


