RAG for Beginners — Build Your Own
Retrieval-Augmented Generation explained, build it yourself
Have you ever wished you could ask ChatGPT a question and have it answer based on YOUR company's documents, YOUR notes, or YOUR codebase? RAG (Retrieval-Augmented Generation) makes this possible — without training a new model, without uploading all your data to OpenAI, and without a machine learning background.
This guide explains what RAG is, why it works, and walks you through building a simple RAG system that can answer questions from a set of PDF documents — in Python, in an afternoon.
What You'll Learn
- What RAG is and why it works better than fine-tuning for most use cases
- The three components of every RAG system
- Step-by-step: building a PDF Q&A system with LangChain + Chroma
- How to improve RAG accuracy
- Production considerations
- Free and paid tools comparison
What Is RAG?
RAG stands for Retrieval-Augmented Generation. It is a technique in which the system first retrieves the most relevant passages from a document collection and then passes them to the AI model as context at query time.
Without RAG:
User: "What is our company's refund policy?"
AI: "I don't have information about your specific company's policies."

With RAG:
User: "What is our company's refund policy?"
System: Search documents for "refund policy" → find relevant sections → pass them to AI
AI: "According to your company policy document, refunds are processed within 7-10 business days..." (with specific, accurate information)
The key insight is that you do not need to train the AI on your data — you just give it relevant context at the time of each question. This is much cheaper, faster, and more flexible than fine-tuning.
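The flow described above can be sketched in a few lines of plain Python. This is a toy illustration, not how a real RAG system retrieves text: the documents are invented, `retrieve` ranks by simple word overlap instead of embeddings, and `build_prompt` is a hypothetical helper that only shows where the retrieved text ends up.

```python
import re

# Toy document collection; the texts are invented for illustration.
DOCUMENTS = [
    "Refund policy: refunds are processed within 7-10 business days.",
    "Working hours are 9 am to 6 pm, Monday to Friday.",
    "To reset your password, open Settings and choose Reset Password.",
]


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, documents: list[str], k: int = 1) -> list[str]:
    """Return the k documents that share the most words with the question."""
    q_words = tokenize(question)
    ranked = sorted(
        documents,
        key=lambda doc: len(q_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question: str, context: list[str]) -> str:
    """Place the retrieved text in front of the question."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {question}"


question = "What is the refund policy?"
prompt = build_prompt(question, retrieve(question, DOCUMENTS))
print(prompt)
```

The model itself is never retrained here. The only thing that changes from question to question is the context placed in the prompt.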
🇮🇳 India Note: RAG is particularly useful for Indian businesses with large document repositories — legal firms with case precedents, CA firms with tax notifications, hospitals with medical records, and government agencies with circular and notification databases. The documents stay on your own infrastructure; only the question and retrieved snippets go to the AI API.
The Three Components of RAG
Every RAG system has three parts:
1. Embedding Model
Converts text into numbers (vectors) that capture semantic meaning. Similar texts get similar vectors. This is how the system finds relevant documents.
Free options:
- sentence-transformers/all-MiniLM-L6-v2 (local, free, fast)
- Google's text-embedding-004 (free tier via Gemini API)
Paid options:
- OpenAI text-embedding-3-small ($0.02/million tokens)
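To make "similar texts get similar vectors" concrete, here is a small sketch of cosine similarity, the measure most commonly used to compare embeddings. The three-dimensional vectors are made up by hand for illustration; a real model such as all-MiniLM-L6-v2 produces 384-dimensional vectors.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made stand-ins for embeddings (assumed values, not model output).
vectors = {
    "How do I get a refund?": [0.9, 0.1, 0.0],
    "What is the money-back policy?": [0.8, 0.2, 0.1],
    "What are the office hours?": [0.1, 0.1, 0.9],
}

query = vectors["How do I get a refund?"]
for text, vec in vectors.items():
    print(f"{cosine_similarity(query, vec):.3f}  {text}")
```

The two refund-related sentences score close to each other even though they share almost no words, which is the property retrieval relies on.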
2. Vector Database
Stores the embedded document chunks and allows fast similarity search. When you ask a question, the vector database finds the most similar document chunks.
Options:
- Chroma — Local, free, great for development
- FAISS — Local, free, used in production at Meta
- Pinecone — Cloud, free tier available, easy to use
- Supabase — PostgreSQL + pgvector, free tier
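Conceptually, a vector database does little more than the sketch below: store vectors next to their text, then return the closest ones for a query vector. `TinyVectorStore` is a hypothetical class written for this guide, and the two-dimensional vectors are invented. Real databases add indexing so they do not have to scan every item.

```python
import math
from dataclasses import dataclass, field


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


@dataclass
class TinyVectorStore:
    """Brute-force store: keeps (vector, text) pairs and scans them all per query."""

    items: list[tuple[list[float], str]] = field(default_factory=list)

    def add(self, vector: list[float], text: str) -> None:
        self.items.append((vector, text))

    def search(self, query: list[float], k: int = 2) -> list[tuple[float, str]]:
        """Return the k stored texts most similar to the query vector."""
        scored = [(cosine(query, vec), text) for vec, text in self.items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


store = TinyVectorStore()
store.add([0.9, 0.1], "Refunds are processed within 7-10 business days.")
store.add([0.1, 0.9], "Office hours are 9 am to 6 pm.")
store.add([0.7, 0.3], "Refund requests need an order number.")

for score, text in store.search([1.0, 0.0], k=2):
    print(f"{score:.3f}  {text}")
```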
3. Language Model (LLM)
The AI that generates the final answer using the retrieved context. Any LLM works — GPT-4, Claude, Gemini.
Step-by-Step: Build a PDF Q&A System
Prerequisites
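pip install "langchain>=0.2,<1.0" langchain-community langchain-text-splitters langchain-google-genai chromadb pypdf sentence-transformers
The code in this guide follows the LangChain 0.2/0.3 package layout, which is why the version is pinned. The langchain-google-genai package is needed for the Gemini model used in Step 3.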
You also need either:
- An OpenAI API key (for GPT), OR
- A Google AI API key (free via aistudio.google.com)
Step 1: Load and Split Documents
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter


def load_documents(pdf_paths: list[str]):
    """Load every PDF and split its pages into overlapping chunks."""
    all_docs = []
    for path in pdf_paths:
        loader = PyPDFLoader(path)
        docs = loader.load()  # one Document per page
        all_docs.extend(docs)

    # Split into chunks for better retrieval
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,   # measured in characters (roughly 150-200 words)
        chunk_overlap=200  # Overlap prevents losing context at boundaries
    )
    return splitter.split_documents(all_docs)


chunks = load_documents(["company_policy.pdf", "product_manual.pdf"])
print(f"Created {len(chunks)} chunks from your documents")
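To see what chunk size and overlap mean in practice, here is a simplified fixed-window splitter in plain Python. It is only a sketch of the idea: RecursiveCharacterTextSplitter is smarter and tries to break at paragraph and sentence separators first, while the hypothetical `chunk_text` below cuts purely by character position.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = chunk_text(sample, chunk_size=10, chunk_overlap=4)
for piece in pieces:
    print(piece)
```

Each piece starts with the last four characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.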
Step 2: Create Embeddings and Store in Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

# Use a free local embedding model (downloaded on first run)
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

# Create vector store and embed all chunks
# This can take 1-5 minutes depending on document size
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"  # Save to disk
)
print("Documents embedded and stored in Chroma!")
Step 3: Build the Q&A Chain
from langchain.chains import RetrievalQA
from langchain_google_genai import ChatGoogleGenerativeAI

# Use free Gemini model
llm = ChatGoogleGenerativeAI(
    model="gemini-2.5-flash",            # Free tier
    google_api_key="YOUR_GOOGLE_AI_KEY"  # Free from aistudio.google.com
)

# Create retriever: finds top 4 most relevant chunks
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 4}
)

# Create QA chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",           # Stuff all chunks into context
    retriever=retriever,
    return_source_documents=True  # Show which documents were used
)
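The "stuff" chain type simply places all retrieved chunks into one prompt. The sketch below shows that idea in plain Python. The prompt wording and the `max_chars` budget are assumptions added for illustration and are not what LangChain uses internally. The budget is there because stuffing only works while the combined chunks fit in the model's context window.

```python
def stuff_prompt(question: str, chunks: list[str], max_chars: int = 2000) -> str:
    """Concatenate chunks (most relevant first) into one prompt, up to max_chars of context."""
    selected: list[str] = []
    used = 0
    for chunk in chunks:
        if used + len(chunk) > max_chars:
            break  # stop before the context budget is exceeded
        selected.append(chunk)
        used += len(chunk)
    context = "\n\n".join(selected)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


retrieved = ["A" * 50, "B" * 50, "C" * 50]  # stand-ins for retrieved chunks
stuffed = stuff_prompt("What is the refund policy?", retrieved, max_chars=120)
print(stuffed)
```

With a budget of 120 characters only the first two chunks fit, so the third is left out rather than truncated mid-sentence.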
Step 4: Ask Questions
def ask(question: str):
    # .invoke() replaces the deprecated qa_chain({...}) call style
    result = qa_chain.invoke({"query": question})
    print(f"Answer: {result['result']}")
    print("\nSources:")
    for doc in result['source_documents']:
        print(f" - {doc.metadata.get('source', 'Unknown')} (page {doc.metadata.get('page', '?')})")


# Test it!
ask("What is the refund policy?")
ask("How do I reset my password?")
ask("What are the working hours?")
💰 Free Deal: Get your Google AI API key at aistudio.google.com. It is free and needs no credit card. The free tier for Gemini 2.5 Flash is rate-limited and the exact quotas change over time, so check the current limits in AI Studio. For a document Q&A system with moderate usage it is usually more than enough.
Complete Working Example
Here is the complete script in one place:
import os

from langchain.chains import RetrievalQA
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Configuration
PDF_FILES = ["document1.pdf", "document2.pdf"]
# Read the key from the environment if set, otherwise fall back to the placeholder
GOOGLE_API_KEY = os.environ.get("GOOGLE_API_KEY", "your-api-key-here")

# 1. Load documents
all_docs = []
for pdf in PDF_FILES:
    loader = PyPDFLoader(pdf)
    all_docs.extend(loader.load())

# 2. Split into chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(all_docs)

# 3. Create embeddings (free, runs locally)
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = Chroma.from_documents(chunks, embeddings, persist_directory="./db")

# 4. Create LLM and QA chain
llm = ChatGoogleGenerativeAI(model="gemini-2.5-flash", google_api_key=GOOGLE_API_KEY)
qa = RetrievalQA.from_chain_type(llm=llm, retriever=vectorstore.as_retriever(search_kwargs={"k": 4}))

# 5. Ask questions
while True:
    question = input("\nAsk a question (or 'quit'): ")
    if question.strip().lower() == "quit":
        break
    result = qa.invoke({"query": question})
    print(f"\nAnswer: {result['result']}")
Improving RAG Accuracy
When your RAG system gives wrong answers, these techniques help:
Better chunking: Splitting at natural paragraph or section boundaries (often called structure-aware or semantic chunking) usually works better than fixed-size chunking for structured documents, because each chunk then holds one complete idea.
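A minimal sketch of boundary-based chunking, assuming paragraphs are separated by blank lines. `split_by_paragraph` is a hypothetical helper, not a LangChain function: it keeps paragraphs whole and merges neighbours while they fit within `max_chars`.

```python
def split_by_paragraph(text: str, max_chars: int = 500) -> list[str]:
    """Split on blank lines, then merge neighbouring paragraphs while they fit in max_chars."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars:
            current = candidate  # still fits: keep growing the current chunk
        else:
            if current:
                chunks.append(current)
            current = para  # an oversized paragraph becomes its own chunk
    if current:
        chunks.append(current)
    return chunks


policy_text = "Refund policy.\n\nRefunds take 7-10 days.\n\n" + "x" * 40
sections = split_by_paragraph(policy_text, max_chars=45)
for section in sections:
    print(repr(section))
```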
Hybrid search: Combine vector similarity search with keyword search (BM25). This handles cases where exact terminology matters.
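One common way to combine the two result lists is reciprocal rank fusion (RRF). It is used here as an illustrative choice and is not the only option. The sketch assumes you already have two ranked lists of chunk IDs, one from vector search and one from keyword search; the IDs are invented.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists: each item scores 1 / (k + rank) per list it appears in."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


vector_hits = ["chunk_a", "chunk_b", "chunk_c"]   # ranked by embedding similarity
keyword_hits = ["chunk_c", "chunk_a", "chunk_d"]  # ranked by keyword match (e.g. BM25)

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)
```

Chunks that rank well in both lists rise to the top, while a chunk found by only one method still stays in the results.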
Reranking: After retrieving top-K chunks, use a reranker model to sort them by relevance before passing to the LLM.
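The reranking step itself is just "score each retrieved chunk against the question, then sort". In the sketch below, `word_overlap_score` is a deliberately crude stand-in for a trained reranker model, which would score each (question, chunk) pair instead. The candidate texts are invented.

```python
import re
from typing import Callable


def word_overlap_score(question: str, chunk: str) -> float:
    """Stand-in scorer: fraction of the question's words that appear in the chunk."""
    q_words = set(re.findall(r"[a-z0-9]+", question.lower()))
    c_words = set(re.findall(r"[a-z0-9]+", chunk.lower()))
    return len(q_words & c_words) / len(q_words) if q_words else 0.0


def rerank(
    question: str,
    chunks: list[str],
    score_fn: Callable[[str, str], float] = word_overlap_score,
    top_n: int = 2,
) -> list[str]:
    """Sort retrieved chunks by score_fn and keep the top_n for the LLM."""
    ranked = sorted(chunks, key=lambda chunk: score_fn(question, chunk), reverse=True)
    return ranked[:top_n]


candidates = [
    "Shipping takes 3 days.",
    "Our refund policy covers all purchases.",
    "Policy documents were updated last year.",
]
best = rerank("What is the refund policy?", candidates)
print(best)
```

Because `score_fn` is a parameter, the stand-in can be swapped for a real model's scoring function without changing the rest of the pipeline.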
Metadata filtering: Add metadata (document date, section, author) to chunks and filter by it. "Answer this from documents published after January 2025 only."
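A minimal sketch of metadata filtering in plain Python. The `Chunk` class, file names, and dates are hypothetical. In a real system the filter would usually be passed to the vector database so it is applied during the search rather than afterwards.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Chunk:
    text: str
    source: str
    published: date


def filter_published_after(chunks: list[Chunk], cutoff: date) -> list[Chunk]:
    """Keep only chunks whose document was published after the cutoff date."""
    return [chunk for chunk in chunks if chunk.published > cutoff]


all_chunks = [
    Chunk("Old leave policy.", "hr_2023.pdf", date(2023, 6, 1)),
    Chunk("Updated leave policy.", "hr_2025.pdf", date(2025, 3, 15)),
]

# "Published after January 2025" is read here as after 31 January 2025.
recent = filter_published_after(all_chunks, date(2025, 1, 31))
print([chunk.source for chunk in recent])
```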
RAG vs Fine-Tuning
| Criteria | RAG | Fine-Tuning |
|----------|-----|-------------|
| Cost | Low ($0-50/month) | High ($500-5,000+) |
| Setup time | Hours | Days to weeks |
| Data freshness | Real-time update | Re-train for updates |
| Accuracy | Good with good retrieval | Higher for specific domains |
| Transparency | Shows sources | Black box |
| Best for | Q&A, search, chat | Specific behavior/style |
Rule of thumb: Use RAG if you need to answer questions from documents. Consider fine-tuning only if RAG quality is insufficient after optimization.
Official Resources
- LangChain Documentation — Complete LangChain framework docs
- Chroma Documentation — Local vector database
- Google AI Studio — Free Gemini API key, no credit card
- Pinecone Quickstart — Cloud vector DB with free tier
- Sentence Transformers — Free local embedding models