What is RAG? Build Your First RAG Application (India Tutorial)
Retrieval Augmented Generation from scratch — use your own documents with LLMs
Every large language model, whether ChatGPT, Claude, or Gemini, has two fundamental limitations: it does not know about your private data, and its training knowledge stops at a cutoff date. If you ask Claude about your company's HR policy or ChatGPT about a government circular published last week, the model on its own cannot answer reliably.
Retrieval-Augmented Generation (RAG) solves both problems. It is the most practical technique for making AI work with your own data, and it does not require any machine learning expertise to implement.
What You Will Learn
- What RAG is and why it matters
- How RAG works — the retrieval and generation pipeline
- How to build a complete RAG application step by step
- Text splitting, embedding, and vector storage
- Querying your documents with natural language
- Free hosting options for Indian developers
- Common RAG pitfalls and how to avoid them
For a conceptual introduction to RAG, see our RAG for Beginners guide. This tutorial focuses on building a working application.
Why LLMs Need RAG
Imagine you have 500 pages of your company's compliance documentation. You want employees to ask questions and get accurate answers. Here are your options:
| Approach | Pros | Cons |
|----------|------|------|
| Paste into prompt | Simple | Limited by context window, expensive at scale |
| Fine-tune the model | AI "learns" your data | Expensive, slow, hard to update, requires ML skills |
| RAG | Accurate, updatable, affordable | Requires setup (this tutorial helps) |
RAG works by:
- Indexing your documents into a searchable database
- Retrieving only the relevant portions when a question is asked
- Generating an answer using the retrieved context
This means the AI only sees the parts of your documents that are relevant to each question, which improves accuracy and reduces cost.
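The three steps above can be sketched without any libraries. The snippet below is a toy illustration, not what LangChain does internally: it replaces embeddings with plain word overlap, and the sample policy sentences and function names are invented for the demo.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(chunks: list[str]) -> list[Counter]:
    """Index step: store a word-count vector for every chunk."""
    return [Counter(tokenize(chunk)) for chunk in chunks]


def retrieve(question: str, chunks: list[str], index: list[Counter], k: int = 2) -> list[str]:
    """Retrieve step: return up to k chunks sharing the most words with the question."""
    query = Counter(tokenize(question))
    scores = [sum((query & vec).values()) for vec in index]
    ranked = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)
    return [chunks[i] for i in ranked[:k] if scores[i] > 0]


def build_prompt(question: str, context: list[str]) -> str:
    """Generate step (input side): the text an LLM would receive."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Context:\n{joined}\n\nQuestion: {question}\nAnswer:"


# Invented sample data, for illustration only.
chunks = [
    "Contract employees are entitled to 12 days of paid leave per year.",
    "Travel expenses must be submitted within 30 days.",
    "Permanent employees receive 24 days of paid leave per year.",
]
question = "How many leave days do contract employees get?"
index = build_index(chunks)
top = retrieve(question, chunks, index, k=1)
print(build_prompt(question, top))
```

Only the best-matching chunk reaches the prompt, which is the cost and accuracy benefit described above. A real system swaps the word-overlap score for embedding similarity, as the rest of this tutorial does.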
RAG Architecture
User Question: "What is the leave policy for contract employees?"
         │
         ▼
┌─────────────────┐
│ 1. EMBED QUERY  │  Convert question to a vector (numbers)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ 2. SEARCH       │  Find similar document chunks in vector DB
│   (Vector DB)   │  Returns: top 3-5 most relevant chunks
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ 3. BUILD PROMPT │  Combine: system instructions + retrieved
│                 │  chunks + user question
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ 4. GENERATE     │  LLM generates answer grounded in the
│      (LLM)      │  retrieved context
└────────┬────────┘
         │
         ▼
Answer: "According to the HR Policy v3.2, Section 4.1,
         contract employees are entitled to..."
Building a RAG Application: Step by Step
We will build a RAG app that lets you query Indian government PDF documents. The same approach works for any text-based documents.
Step 1: Set Up the Environment
# Create project directory
mkdir rag-india-tutorial
cd rag-india-tutorial

# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate

# Install dependencies
pip install langchain langchain-community langchain-chroma
pip install langchain-google-genai # Free Gemini API
pip install pypdf python-docx # Document loaders
pip install chromadb # Vector database
pip install python-dotenv # Provides load_dotenv() for reading the .env file
Step 2: Get a Free API Key
For this tutorial, we use Google's Gemini API which offers a generous free tier.
- Go to Google AI Studio
- Sign in with your Google account
- Create an API key (free, no credit card required)
- Save it in a .env file:
# .env
GOOGLE_API_KEY=your_api_key_here
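A missing or placeholder key usually surfaces later as a confusing API error. The helper below is a small optional sketch (the name `require_env` is hypothetical, not part of any library) that fails early with a readable message.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name, "").strip()
    # Treat the tutorial's placeholder text as "not set".
    if not value or value == "your_api_key_here":
        raise RuntimeError(
            f"{name} is not set. Add it to your .env file or export it in your shell."
        )
    return value
```

In the application you could call `require_env("GOOGLE_API_KEY")` once, right after `load_dotenv()`, instead of calling `os.getenv` in several places.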
Alternative free options:
- Hugging Face Inference API (free tier with rate limits)
- Ollama for fully local, offline RAG (see our Ollama guide)
- Google Colab with free GPU for running open-source models
Step 3: Load Your Documents
Create rag_app.py:
import os
from dotenv import load_dotenv
from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

load_dotenv()

# --- Step 3: Load Documents ---
def load_documents(directory: str):
    """Load all PDF files from a directory."""
    loader = DirectoryLoader(
        directory,
        glob="**/*.pdf",
        loader_cls=PyPDFLoader,
        show_progress=True,
    )
    documents = loader.load()
    print(f"Loaded {len(documents)} pages from PDFs")
    return documents

# You can also load other file types:
# from langchain_community.document_loaders import (
#     TextLoader,              # .txt files
#     Docx2txtLoader,          # .docx files (needs: pip install docx2txt)
#     CSVLoader,               # .csv files
#     UnstructuredHTMLLoader,  # .html files
# )
Create a documents/ folder and add your PDFs there. For testing, you can download any public Indian government PDF — a budget summary, a policy document, or an annual report.
Step 4: Split Documents into Chunks
Large documents need to be split into smaller chunks for effective retrieval. The chunk size matters — too small and you lose context, too large and you waste the context window.
# --- Step 4: Split into Chunks ---
def split_documents(documents):
    """Split documents into smaller chunks for embedding."""
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,    # Characters per chunk
        chunk_overlap=200,  # Overlap between chunks for context continuity
        length_function=len,
        separators=["\n\n", "\n", ". ", " ", ""],  # Split priorities
    )
    chunks = text_splitter.split_documents(documents)
    print(f"Split into {len(chunks)} chunks")
    return chunks
Why these settings?
- chunk_size=1000: Large enough to contain a complete thought, small enough for precise retrieval
- chunk_overlap=200: Ensures sentences at chunk boundaries are not lost
- separators: Prefers splitting at paragraph breaks, then sentences, then words
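To see how chunk_size and chunk_overlap interact, here is a deliberately simplified sliding-window splitter. It is not how RecursiveCharacterTextSplitter works (that class prefers separator boundaries); it only shows that each new chunk starts `chunk_size - chunk_overlap` characters after the previous one. The function name is made up for this sketch.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # distance between chunk start positions
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


alphabet = "abcdefghijklmnopqrstuvwxyz"
windows = sliding_window_chunks(alphabet, chunk_size=10, chunk_overlap=4)
for window in windows:
    print(window)
# abcdefghij
# ghijklmnop
# mnopqrstuv
# stuvwxyz
```

Each chunk repeats the last four characters of the previous one, so text near a boundary appears whole in at least one chunk. With the tutorial's settings (1000 and 200), a new chunk would begin roughly every 800 characters.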
Step 5: Create Embeddings and Store in Vector Database
Embeddings convert text into numerical vectors. Similar texts produce similar vectors, which enables semantic search.
# --- Step 5: Embed and Store ---
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_chroma import Chroma

def create_vector_store(chunks, persist_directory: str = "./chroma_db"):
    """Create embeddings and store in ChromaDB."""
    # Initialize the embedding model (free with Gemini API)
    embeddings = GoogleGenerativeAIEmbeddings(
        model="models/embedding-001",
        google_api_key=os.getenv("GOOGLE_API_KEY"),
    )
    # Create the vector store
    vector_store = Chroma.from_documents(
        documents=chunks,
        embedding=embeddings,
        persist_directory=persist_directory,
    )
    print(f"Vector store created with {vector_store._collection.count()} vectors")
    return vector_store

def load_vector_store(persist_directory: str = "./chroma_db"):
    """Load an existing vector store."""
    embeddings = GoogleGenerativeAIEmbeddings(
        model="models/embedding-001",
        google_api_key=os.getenv("GOOGLE_API_KEY"),
    )
    return Chroma(
        persist_directory=persist_directory,
        embedding_function=embeddings,
    )
ChromaDB stores the vectors locally — no cloud database needed. The data persists in ./chroma_db/ and loads instantly on restart.
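The idea that similar texts produce similar vectors can be made concrete with cosine similarity, one common way to compare embeddings. The three-dimensional vectors below are hand-made stand-ins (real embedding models produce hundreds of dimensions), and the vector store may be configured with a different distance metric.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # similarity is undefined for a zero vector; treat as unrelated
    return dot / (norm_a * norm_b)


# Hand-made toy "embeddings", invented for illustration.
leave_policy = [0.9, 0.1, 0.0]
vacation_rules = [0.8, 0.2, 0.1]
tax_circular = [0.0, 0.2, 0.9]

related = cosine_similarity(leave_policy, vacation_rules)
unrelated = cosine_similarity(leave_policy, tax_circular)
print(f"leave vs vacation: {related:.2f}")
print(f"leave vs tax:      {unrelated:.2f}")
```

The two leave-related vectors score close to 1, while the tax vector scores close to 0. Semantic search ranks chunks by this kind of score against the query vector.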
Step 6: Build the RAG Chain
This is where retrieval meets generation. When a user asks a question, we find the most relevant chunks and pass them to the LLM.
# --- Step 6: RAG Chain ---
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain.prompts import ChatPromptTemplate
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains import create_retrieval_chain

def create_rag_chain(vector_store):
    """Create the RAG chain combining retrieval and generation."""
    # Initialize the LLM
    llm = ChatGoogleGenerativeAI(
        model="gemini-2.0-flash",
        google_api_key=os.getenv("GOOGLE_API_KEY"),
        temperature=0.3,  # Lower temperature for factual accuracy
    )
    # Create the retriever (searches top 4 most relevant chunks)
    retriever = vector_store.as_retriever(
        search_type="similarity",
        search_kwargs={"k": 4},
    )
    # Define the prompt template
    prompt = ChatPromptTemplate.from_template("""
You are a helpful assistant that answers questions based on the
provided context. Follow these rules:
1. Answer ONLY based on the provided context
2. If the context does not contain enough information, say
"I could not find this information in the provided documents"
3. Cite which document/section the information comes from
4. Use clear, concise language
5. For numerical data, present in Indian notation (lakhs, crores)

Context:
{context}

Question: {input}

Answer:
""")
    # Create the chain
    document_chain = create_stuff_documents_chain(llm, prompt)
    rag_chain = create_retrieval_chain(retriever, document_chain)
    return rag_chain
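It helps to know what ends up in {context}. As far as we understand, the "stuff" approach simply joins the text of the retrieved chunks into one string. The sketch below imitates that with a hypothetical `Doc` class and, as an assumption of ours rather than the chain's default behavior, prefixes each chunk with its source and page so that rule 3 in the prompt (citing the document) has something to work with.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document chunk."""
    page_content: str
    metadata: dict = field(default_factory=dict)


PROMPT = "Context:\n{context}\n\nQuestion: {input}\nAnswer:"


def format_doc(doc: Doc) -> str:
    """Label a chunk with its source and page so the model can cite it."""
    source = doc.metadata.get("source", "Unknown")
    page = doc.metadata.get("page", "N/A")
    return f"[{source}, page {page}]\n{doc.page_content}"


def stuff_prompt(docs: list[Doc], question: str, separator: str = "\n\n") -> str:
    """Join all retrieved chunks into one context string and fill the template."""
    context = separator.join(format_doc(d) for d in docs)
    return PROMPT.format(context=context, input=question)


# Invented sample chunks, for illustration only.
docs = [
    Doc("Contract staff accrue leave monthly.", {"source": "hr_policy.pdf", "page": 4}),
    Doc("Leave requests need manager approval."),
]
print(stuff_prompt(docs, "How do contract staff accrue leave?"))
```

If the model's citations look vague, check whether source metadata actually reaches the prompt. In LangChain, `create_stuff_documents_chain` accepts a `document_prompt` argument that can serve this purpose; consult the API reference for the exact usage in your version.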
Step 7: Put It All Together
# --- Step 7: Main Application ---
def main():
    """Run the RAG application."""
    docs_directory = "./documents"

    # Check if vector store already exists
    if os.path.exists("./chroma_db"):
        print("Loading existing vector store...")
        vector_store = load_vector_store()
    else:
        print("Building vector store from documents...")
        documents = load_documents(docs_directory)
        chunks = split_documents(documents)
        vector_store = create_vector_store(chunks)

    # Create the RAG chain
    rag_chain = create_rag_chain(vector_store)

    # Interactive query loop
    print("\n--- RAG Application Ready ---")
    print("Ask questions about your documents. Type 'quit' to exit.\n")
    while True:
        question = input("Your question: ").strip()
        if question.lower() in ("quit", "exit", "q"):
            break
        if not question:
            continue
        result = rag_chain.invoke({"input": question})
        print(f"\nAnswer: {result['answer']}\n")

        # Show which documents were retrieved
        print("Sources:")
        for i, doc in enumerate(result["context"], 1):
            source = doc.metadata.get("source", "Unknown")
            page = doc.metadata.get("page", "N/A")
            print(f"  {i}. {source} (page {page})")
        print()

if __name__ == "__main__":
    main()
Running the Application
# Make sure you have PDFs in the documents/ folder
mkdir -p documents
# Add your PDF files to documents/
# Run the application
python rag_app.py
The first run builds the vector store (takes a few minutes depending on document size). Subsequent runs load the existing store instantly.
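One caveat: the `os.path.exists("./chroma_db")` check only tells you that a store exists, not that it matches your current PDFs. A possible way to detect changes, sketched below with hypothetical helper names and a hypothetical marker file, is to save a fingerprint of the documents folder after each build and compare it on startup.

```python
import hashlib
import tempfile
from pathlib import Path


def documents_fingerprint(directory: str, pattern: str = "**/*.pdf") -> str:
    """Hash the relative path and bytes of every matching file into one digest."""
    root = Path(directory)
    digest = hashlib.sha256()
    for path in sorted(root.glob(pattern)):
        if path.is_file():
            digest.update(path.relative_to(root).as_posix().encode("utf-8"))
            digest.update(path.read_bytes())
    return digest.hexdigest()


def index_is_stale(directory: str, marker_file: str) -> bool:
    """True when no fingerprint was saved or the documents changed since then."""
    marker = Path(marker_file)
    if not marker.exists():
        return True
    return marker.read_text().strip() != documents_fingerprint(directory)


def save_fingerprint(directory: str, marker_file: str) -> None:
    """Record the current fingerprint after a successful index build."""
    Path(marker_file).write_text(documents_fingerprint(directory))


# Demo in a temporary folder; the files only need a .pdf name for this check.
with tempfile.TemporaryDirectory() as tmp:
    docs_dir = Path(tmp) / "documents"
    docs_dir.mkdir()
    marker = str(Path(tmp) / "index.sha256")
    (docs_dir / "a.pdf").write_bytes(b"version 1")
    stale_before = index_is_stale(str(docs_dir), marker)       # no marker yet
    save_fingerprint(str(docs_dir), marker)
    stale_after_build = index_is_stale(str(docs_dir), marker)  # unchanged
    (docs_dir / "a.pdf").write_bytes(b"version 2")
    stale_after_edit = index_is_stale(str(docs_dir), marker)   # file changed

print(stale_before, stale_after_build, stale_after_edit)  # True False True
```

In `main()`, you could rebuild the store when `index_is_stale(...)` returns True and call `save_fingerprint(...)` afterwards. Remember to delete or replace the old `./chroma_db` folder before rebuilding so that outdated chunks do not linger.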
Improving RAG Quality
The basic RAG app works, but here are techniques to significantly improve answer quality.
Technique 1: Hybrid Search
Combine semantic (vector) search with keyword search for better retrieval:
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever  # needs: pip install rank_bm25

def create_hybrid_retriever(chunks, vector_store):
    """Combine BM25 keyword search with vector similarity search."""
    # Keyword retriever built in memory from the same chunks
    bm25_retriever = BM25Retriever.from_documents(chunks)
    bm25_retriever.k = 3
    # Semantic retriever backed by the vector store
    vector_retriever = vector_store.as_retriever(search_kwargs={"k": 3})
    # 50/50 weight between keyword and semantic search
    ensemble = EnsembleRetriever(
        retrievers=[bm25_retriever, vector_retriever],
        weights=[0.5, 0.5],
    )
    return ensemble
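How do two ranked lists become one? LangChain documents EnsembleRetriever as using Reciprocal Rank Fusion. The sketch below shows a weighted version of that idea in plain Python. The constant `c = 60` and the exact formula are assumptions for illustration, and the document IDs are invented.

```python
from collections import defaultdict


def weighted_rrf(rankings: list[list[str]], weights: list[float], c: int = 60) -> list[str]:
    """Fuse ranked lists of document IDs with weighted reciprocal rank fusion."""
    if len(rankings) != len(weights):
        raise ValueError("need exactly one weight per ranking")
    scores: dict[str, float] = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            # A document earns more credit the higher it appears in each list.
            scores[doc_id] += weight / (c + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


keyword_hits = ["doc_a", "doc_b", "doc_c"]   # e.g. from BM25
semantic_hits = ["doc_b", "doc_d", "doc_a"]  # e.g. from vector search
fused = weighted_rrf([keyword_hits, semantic_hits], weights=[0.5, 0.5])
print(fused)  # ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

Documents found by both retrievers (doc_b, doc_a) rise above those found by only one. Raising the first weight favors exact keyword matches, which can help with circular numbers, section codes, and names.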
Technique 2: Metadata Filtering
Add metadata to chunks for more precise retrieval:
# When loading documents, add metadata
# (do this before create_vector_store so it is stored with each vector)
for chunk in chunks:
    chunk.metadata["department"] = "HR"  # or extract from content
    chunk.metadata["year"] = "2026"
    chunk.metadata["language"] = "English"

# When retrieving, filter by metadata
retriever = vector_store.as_retriever(
    search_kwargs={
        "k": 4,
        "filter": {"department": "HR"},
    }
)
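Conceptually, a metadata filter narrows the candidate set before similarity ranking happens. The plain-Python sketch below shows the equality-matching logic with invented chunks and hypothetical function names. It is not the vector store's implementation.

```python
def matches(metadata: dict, where: dict) -> bool:
    """True when every key in the filter is present with the same value."""
    return all(key in metadata and metadata[key] == value for key, value in where.items())


def filter_chunks(chunks: list[dict], where: dict) -> list[dict]:
    """Keep only chunks whose metadata satisfies the equality filter."""
    return [chunk for chunk in chunks if matches(chunk["metadata"], where)]


# Invented sample chunks, for illustration only.
sample_chunks = [
    {"text": "Leave policy for contract staff", "metadata": {"department": "HR", "year": "2026"}},
    {"text": "Laptop procurement rules", "metadata": {"department": "IT", "year": "2026"}},
    {"text": "Older leave policy", "metadata": {"department": "HR", "year": "2024"}},
]
hr_2026 = filter_chunks(sample_chunks, {"department": "HR", "year": "2026"})
print([chunk["text"] for chunk in hr_2026])  # ['Leave policy for contract staff']
```

Filtering on year keeps the superseded 2024 policy out of the answer. When you combine several conditions in a real vector store, check its filter syntax first: some stores, Chroma among them as far as we know, expect an explicit operator such as `$and` rather than a plain multi-key dictionary.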
Technique 3: Reranking
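After initial retrieval, run a second pass that keeps only the results that are actually relevant. LangChain's LLMChainFilter asks the LLM to judge each retrieved chunk and drops the ones it considers irrelevant (strictly speaking, it filters rather than reorders):
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainFilter

def create_reranking_retriever(vector_store, llm):
    """Add an LLM-based second pass that filters out irrelevant results."""
    # Retrieve a wider set (k=8) so the filter has candidates to discard
    base_retriever = vector_store.as_retriever(search_kwargs={"k": 8})
    compressor = LLMChainFilter.from_llm(llm)
    return ContextualCompressionRetriever(
        base_compressor=compressor,
        base_retriever=base_retriever,
    )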
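The general pattern here is "retrieve wide, then narrow": fetch more candidates than you need, score each one against the question with a second, more careful method, and keep the best few. The sketch below shows that pattern with true reordering. The word-overlap scorer is a toy stand-in for an LLM or cross-encoder, and all names and sample texts are invented.

```python
from typing import Callable


def rerank(
    query: str,
    candidates: list[str],
    score: Callable[[str, str], float],
    top_n: int = 3,
) -> list[str]:
    """Re-order candidates by a second-stage relevance score and keep the best top_n."""
    ranked = sorted(candidates, key=lambda text: score(query, text), reverse=True)
    return ranked[:top_n]


def term_overlap(query: str, text: str) -> float:
    """Toy scorer: fraction of query words that appear in the candidate."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words) if query_words else 0.0


candidates = [
    "office travel policy",
    "leave rules for every contract employee",
    "employee canteen timings",
]
best = rerank("contract employee leave", candidates, term_overlap, top_n=2)
print(best)  # ['leave rules for every contract employee', 'employee canteen timings']
```

Because the second stage only sees a handful of candidates, it can afford to be slower and more accurate than the first. Keep in mind that an LLM-based second pass adds one model call per candidate, which matters on a rate-limited free tier.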
Free Hosting Options for Indian Developers
| Platform | Free Tier | Best For |
|----------|-----------|----------|
| Google Colab | Free GPU, unlimited notebooks | Development and demos |
| Railway | 500 hours/month, 512MB RAM | Small web apps |
| Render | 750 hours/month | API-based RAG services |
| Hugging Face Spaces | Free CPU instances | Gradio/Streamlit demos |
| Replit | Basic tier free | Quick prototypes |
| PythonAnywhere | Free tier with limitations | Simple Flask apps |
For production deployments, Indian cloud options include:
- AWS Mumbai region (ap-south-1) — Free tier for 12 months
- Google Cloud India — $300 free credits
- Azure India — $200 free credits
Common RAG Pitfalls
| Problem | Cause | Solution |
|---------|-------|----------|
| Irrelevant answers | Chunks too large | Reduce chunk_size to 500-800 |
| Missing context | Chunks too small | Increase chunk_size or chunk_overlap |
| Slow retrieval | Too many vectors | Add metadata filtering, reduce k |
| Hallucinations | Weak prompt | Add explicit "only use provided context" instruction |
| Wrong language output | Mixed language docs | Add language instruction to prompt template |
| Outdated answers | Stale vector store | Rebuild index when documents change |
RAG vs Other Approaches
| Feature | RAG | Fine-Tuning | Long Context Window |
|---------|-----|-------------|-------------------|
| Setup difficulty | Medium | High | Low |
| Cost per query | Low | Low (after training) | High (token cost) |
| Data freshness | Real-time updatable | Requires retraining | Real-time |
| Accuracy | High with good retrieval | High for style/behavior | Degrades with volume |
| Privacy | Data stays local | Data sent for training | Data sent per query |
| Best for | Knowledge bases, Q&A | Style, behavior changes | Small document sets |
Where to Go From Here
You have built a working RAG application. Here are the next steps:
- Add a web interface — Wrap your RAG chain in a Streamlit or Gradio app for a user-friendly interface. See our guide on building apps with AI.
- Explore agentic RAG — Combine RAG with AI agents that can decide when to search documents vs when to use tools
- Try local models — Run RAG entirely offline using Ollama for privacy-sensitive Indian enterprise data
- Connect MCP — Build an MCP server that wraps your RAG system, making it accessible from Claude Desktop, Cursor, and other AI tools
- Learn about vector databases — Explore production-grade options like Pinecone, Weaviate, and Qdrant for scaling beyond ChromaDB
- Study advanced prompting — Better prompts improve RAG output. See our advanced prompt engineering guide
RAG is the foundation of most enterprise AI applications in 2026. The skills you have built here — document processing, embedding, vector search, and prompt engineering — apply to virtually every AI product being built today.