Fine-Tuning LLMs — When & How
LoRA, datasets, cost comparison, when to fine-tune
Every few weeks, someone asks: "Should I fine-tune a model for my use case?" The honest answer, most of the time, is no. Fine-tuning is not bad, but it is expensive, slow, and unnecessary for the majority of real-world use cases. As a rough rule of thumb, RAG and good prompting will get you about 90% of the way there at about 1% of the cost.
But there are genuine cases where fine-tuning is the right choice — and when those cases arise, you need to know how to do it correctly. This guide helps you make the right decision, and if the answer is "yes, fine-tune," shows you how to do it efficiently with modern techniques like LoRA and QLoRA.
What You'll Learn
- The three approaches: prompting, RAG, and fine-tuning — when to use each
- What fine-tuning actually does to a model
- LoRA and QLoRA — efficient fine-tuning that runs on consumer GPUs
- Cost comparison for different approaches
- Tools: Axolotl, Unsloth, Hugging Face PEFT
- A practical fine-tuning workflow
The Three Approaches: Choose the Right One
Approach 1: Prompt Engineering
You keep the base model as-is and craft better prompts.
- Cost: Free to very cheap (only API calls)
- When it works: General tasks, one-off queries, diverse use cases
- Limitations: Requires good prompts, context window limits, no persistent "style"
Approach 2: RAG (Retrieval-Augmented Generation)
You give the model relevant context at query time from a document store.
- Cost: Very low (storage + cheap embedding model + cheap LLM queries)
- When it works: Knowledge-intensive tasks, Q&A from documents, search
- Limitations: Retrieval quality matters, does not change model behavior/style
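To make the retrieval step concrete, here is a minimal sketch of the idea. It is a toy, not a production setup: it ranks documents with bag-of-words cosine similarity instead of a real embedding model, and the names `retrieve` and `build_prompt` are illustrative rather than taken from any library.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, documents, k=2):
    """Return the k documents most similar to the query."""
    q = Counter(tokenize(query))
    ranked = sorted(documents, key=lambda d: cosine(q, Counter(tokenize(d))), reverse=True)
    return ranked[:k]


def build_prompt(query, documents, k=2):
    """Place the retrieved passages in front of the question."""
    context = "\n".join(f"- {d}" for d in retrieve(query, documents, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


docs = [
    "Refunds are processed within 7 working days.",
    "Our office is open Monday to Friday.",
    "Refund requests need the original invoice number.",
]
print(build_prompt("When is the office open?", docs))
```

The model itself is never changed here; only the prompt is. That is why RAG is cheap, and also why it cannot change the model's style.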
Approach 3: Fine-Tuning
You train the model on examples of inputs and desired outputs, updating the model's weights.
- Cost: High (GPU time, data preparation, iteration cycles)
- When it works: Changing model behavior/style, very specific domains, production consistency requirements
- Limitations: Expensive, complex, requires good training data, may "forget" general knowledge
🇮🇳 India Note: Indian developers should strongly consider RAG before fine-tuning. The cost difference is massive: RAG on a free Gemini API tier can cost ₹0/month for light use, while fine-tuning a Llama model can cost $50-500+ once GPU time, data preparation, and repeated training runs are counted, depending on dataset size. For most Indian startups and professionals, fine-tuning does not pay for itself.
When Fine-Tuning Actually Makes Sense
Fine-tuning is genuinely the right choice when:
1. You need a specific writing style or voice consistently. A legal AI that always writes in precise, formal legal language. A customer service bot that always uses your brand voice. Prompting can approximate this but fine-tuning achieves it reliably.
2. Your domain has specialized vocabulary. Medical coding, legal terms, financial jargon, regional Indian languages in specific contexts. Fine-tuning helps the model understand and use specialized terminology correctly.
3. You have large labeled datasets. If you have 10,000+ examples of "input → correct output" for your specific task, fine-tuning can produce a highly accurate specialist model.
4. You need low latency and lower inference cost. A fine-tuned 7B parameter model can often match a 70B model on a specific task — at much lower cost to run.
5. Privacy requires on-premise deployment. You cannot use commercial APIs for your data, so you run and fine-tune an open-source model on your own hardware.
What Fine-Tuning Does
When you fine-tune a model, you run additional training on examples of the behavior you want. The model's weights update to reflect the patterns in your training data.
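Full fine-tuning: Updates every weight in the model. Gradients and optimizer state must be kept for all parameters, so it requires significant GPU memory (40-80GB+, and often several GPUs for large models). Very expensive.
LoRA (Low-Rank Adaptation): Freezes the base model and trains only a small set of added low-rank adapter matrices, usually around 1% of the parameter count or less. Requires much less GPU memory. Models up to about 13B can be trained on a single high-end consumer GPU, depending on precision and sequence length.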
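QLoRA (Quantized LoRA): Like LoRA, but the base model is first quantized (compressed) to 4-bit precision. The QLoRA authors reported fine-tuning a 33B model on a single 24GB GPU and a 65B model on a single 48GB GPU. The 4-bit weights of a 70B model alone take roughly 35GB, so a 70B model does not fit on one 24GB card. This is the standard approach in 2026.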
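The arithmetic behind these memory savings is easy to check. The sketch below is a back-of-the-envelope estimate under stated assumptions: it counts adapter parameters for a single weight matrix and estimates memory for the weights only, ignoring activations, optimizer state, and quantization overhead. The function names and the 4096 x 4096 layer size are illustrative.

```python
def lora_trainable_params(d_in, d_out, r):
    """Parameters in one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r


def weight_memory_gb(n_params, bits_per_param):
    """Memory for the model weights alone, in gigabytes (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


# One 4096 x 4096 projection matrix, adapted with rank 16.
full = 4096 * 4096
adapter = lora_trainable_params(4096, 4096, r=16)
print(f"full matrix: {full:,} weights, LoRA adapter: {adapter:,} ({adapter / full:.2%})")

# Weights-only memory for an 8B-parameter model at 16-bit and 4-bit precision.
print(f"8B at 16-bit: {weight_memory_gb(8e9, 16):.0f} GB")
print(f"8B at 4-bit:  {weight_memory_gb(8e9, 4):.0f} GB")
```

Real training needs more than the weights-only figure, so treat these numbers as a lower bound when choosing a GPU.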
Dataset Preparation
The most important factor in fine-tuning quality is training data quality. More data is better, but quality matters more than quantity.
Training data format (Alpaca-style):
[
  {
    "instruction": "Write a formal legal notice for non-payment of rent",
    "input": "Tenant: Ramesh Kumar, Amount: ₹25,000, Due date: January 1, 2026",
    "output": "LEGAL NOTICE\n\nTo,\nRamesh Kumar...[complete formal notice]"
  },
  {
    "instruction": "Draft a cease and desist letter",
    "input": "...",
    "output": "..."
  }
]
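Because data quality matters so much, it helps to check the file for malformed records before spending GPU time on it. The following is a minimal sketch of such a check for the Alpaca-style format above. The function name `validate_records` and the sample records are illustrative, and the rules are assumptions you may want to tighten: all three fields must be strings, and only `input` may be empty.

```python
import json

REQUIRED_FIELDS = ("instruction", "input", "output")


def validate_records(records):
    """Return a list of (index, problem) tuples; an empty list means no problems were found."""
    problems = []
    for i, rec in enumerate(records):
        if not isinstance(rec, dict):
            problems.append((i, "record is not an object"))
            continue
        for field in REQUIRED_FIELDS:
            if not isinstance(rec.get(field), str):
                problems.append((i, f"missing or non-string field: {field}"))
        # "input" may be empty in Alpaca-style data; instruction and output may not.
        for field in ("instruction", "output"):
            if isinstance(rec.get(field), str) and not rec[field].strip():
                problems.append((i, f"empty field: {field}"))
    return problems


sample = json.loads("""
[
  {"instruction": "Draft a cease and desist letter", "input": "", "output": "Dear Sir..."},
  {"instruction": "Summarize the clause", "output": ""}
]
""")
for index, problem in validate_records(sample):
    print(f"record {index}: {problem}")
```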
Minimum dataset sizes:
- Style/tone: 500-1,000 examples
- Task-specific: 1,000-5,000 examples
- Domain knowledge: 5,000-20,000 examples
Creating training data:
- Manually write gold-standard examples (slow but highest quality)
- Use GPT-4 to generate examples and human-review them
- Convert existing documentation into instruction-response pairs
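As one example of the last option, existing FAQ content can be converted into instruction-response pairs with a few lines of code. This sketch assumes the documentation is already available as (question, answer) pairs. The helper name `faq_to_records` and the sample FAQ entries are hypothetical, and the output still needs human review.

```python
import json


def faq_to_records(faq_pairs):
    """Turn (question, answer) pairs into Alpaca-style records, skipping duplicates and blanks."""
    seen = set()
    records = []
    for question, answer in faq_pairs:
        key = " ".join(question.lower().split())  # normalize case and whitespace
        if not key or not answer.strip() or key in seen:
            continue
        seen.add(key)
        records.append({"instruction": question.strip(), "input": "", "output": answer.strip()})
    return records


faq = [
    ("What is the notice period?", "Thirty days from the date of the notice."),
    ("what is the  notice period?", "Thirty days."),  # duplicate after normalization
    ("Who signs the agreement?", ""),  # no answer yet, so it is skipped
]
records = faq_to_records(faq)
print(json.dumps(records, indent=2, ensure_ascii=False))
```

Removing near-duplicate questions keeps the model from over-weighting a few repeated examples.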
Fine-Tuning with Unsloth
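Unsloth is one of the fastest and most memory-efficient fine-tuning libraries available in 2026. Its maintainers report roughly 2x faster training and about 60% less memory use than a standard Hugging Face training setup, though the actual gains depend on the model and hardware.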
Installation:
pip install unsloth
Basic fine-tuning script:
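from unsloth import FastLanguageModel  # import unsloth before trl/transformers so its patches apply
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments

# Load a base model (Llama 3.1 8B in this example)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3.1-8b-bnb-4bit",  # 4-bit quantized
    max_seq_length=2048,
    dtype=None,  # auto-detect the best dtype for the GPU
    load_in_4bit=True,  # QLoRA
)

# Add LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank
    target_modules=["q_proj", "k_proj", "v_proj", "up_proj", "gate_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing=True,
)

# Load your dataset (Alpaca-style records: instruction, input, output)
dataset = load_dataset("json", data_files="training_data.json")


def to_text(example):
    # SFTTrainer reads one "text" column, so merge the three Alpaca fields into it.
    prompt = (
        f"### Instruction:\n{example['instruction']}\n\n"
        f"### Input:\n{example['input']}\n\n"
        f"### Response:\n{example['output']}"
    )
    # The EOS token teaches the model where a response ends.
    return {"text": prompt + tokenizer.eos_token}


dataset = dataset.map(to_text)

# Train
# Note: argument names vary across TRL versions. Newer releases expect
# dataset_text_field and the sequence length in SFTConfig instead.
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset["train"],
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        num_train_epochs=3,
        learning_rate=2e-4,
        output_dir="./output",
    ),
)
trainer.train()

# Save the fine-tuned LoRA adapter weights and the tokenizer
model.save_pretrained("my-fine-tuned-model")
tokenizer.save_pretrained("my-fine-tuned-model")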
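The batch settings in the script above determine how many optimizer steps the run takes, which is useful for estimating training time. The sketch below works through that arithmetic. The helper `training_steps` and the 1,000-example dataset size are assumptions for illustration, and the exact step count reported by the trainer can differ slightly depending on how the last partial batch is handled.

```python
import math


def training_steps(num_examples, per_device_batch_size, grad_accum_steps, epochs, num_gpus=1):
    """Approximate effective batch size and number of optimizer steps for a run."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_gpus
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch * epochs


# Settings from the script above, with a hypothetical 1,000-example dataset.
batch, steps = training_steps(1000, per_device_batch_size=2, grad_accum_steps=4, epochs=3)
print(f"effective batch size: {batch}, optimizer steps: {steps}")
```

Gradient accumulation gives the effect of a batch of 8 while only holding 2 examples in GPU memory at a time, which is why it is common on small GPUs.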
Cost Comparison
| Approach | Setup Cost | Per Query Cost | GPU Required |
|----------|-----------|----------------|--------------|
| RAG + Gemini | Free | ~₹0.01/query | No |
| RAG + GPT-4o | Free | ~₹0.15/query | No |
| Fine-tune + run locally | $20-200 (once) | Near zero | Yes (8GB+) |
| Fine-tune + deploy | $20-200 + $50-200/mo hosting | Low | Via cloud |
| OpenAI fine-tuning | $0.008/1K tokens training | $0.012/1K tokens | No |
Rule of thumb:
- Under 100,000 queries/month: RAG is almost always cheaper
- Over 1 million queries/month: Fine-tuning + local inference becomes cost-competitive
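The reasoning behind this rule of thumb is a break-even calculation: a pay-per-query API scales with volume, while a self-hosted fine-tuned model is mostly a fixed monthly cost. The sketch below shows the arithmetic. The $100/month hosting figure and the $0.0002 per-query price are illustrative assumptions in a single currency, not quotes from any provider, and the one-time training cost is left out.

```python
def monthly_cost_api(queries, cost_per_query):
    """Pay-per-query cost, for example RAG on a hosted LLM."""
    return queries * cost_per_query


def monthly_cost_self_hosted(queries, hosting_per_month, cost_per_query=0.0):
    """Mostly fixed cost: a hosted GPU plus a small marginal cost per query."""
    return hosting_per_month + queries * cost_per_query


def break_even_queries(hosting_per_month, api_cost_per_query, self_cost_per_query=0.0):
    """Monthly query volume above which self-hosting is cheaper than the API."""
    saving_per_query = api_cost_per_query - self_cost_per_query
    if saving_per_query <= 0:
        return None  # self-hosting never catches up
    return hosting_per_month / saving_per_query


# Illustrative numbers only: $100/month hosting vs. $0.0002 per API query.
volume = break_even_queries(hosting_per_month=100, api_cost_per_query=0.0002)
print(f"break-even at about {volume:,.0f} queries/month")

for queries in (100_000, 1_000_000):
    api = monthly_cost_api(queries, 0.0002)
    hosted = monthly_cost_self_hosted(queries, 100)
    print(f"{queries:>9,} queries: API ${api:,.2f} vs self-hosted ${hosted:,.2f}")
```

With these assumed prices the crossover falls between the two thresholds above. Plug in your own hosting and per-query costs before deciding.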
Tools Comparison
| Tool | Best For | Hardware Required | Ease of Use |
|------|----------|-------------------|-------------|
| Unsloth | Speed + efficiency | 8-24GB VRAM | Medium |
| Axolotl | Flexibility, production | 8-80GB VRAM | Medium-Hard |
| Hugging Face PEFT | Research, experimentation | Varies | Easy |
| OpenAI Fine-tuning | No GPU needed | None (cloud) | Very Easy |
| Google Vertex AI | Enterprise | None (cloud) | Easy |
Cloud Fine-Tuning (No GPU Needed)
If you do not have a GPU, cloud providers offer fine-tuning services:
Google Vertex AI: Fine-tune Gemini models with a dataset. Pay per training step.
OpenAI: Fine-tune GPT-4.1 mini with your data. ~$8 per 1 million training tokens.
Together AI: Fine-tune open models (Llama, Mistral) on their cloud. Pay per GPU hour.
RunPod / Vast.ai (budget option): Rent a GPU for a few hours to run your own fine-tuning. A single fine-tuning run on a 24GB RTX 3090 costs $1-3 on Vast.ai. Indian developers often use this for cost efficiency.
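These pricing models can be compared with simple arithmetic. The sketch below estimates a per-token-billed job and an hourly GPU rental. It assumes per-token services bill for every token in the dataset once per epoch. The dataset size, the epoch count, and the $0.30/hour rental rate are hypothetical. Only the ~$8 per million training tokens figure comes from the text above.

```python
def token_training_cost(dataset_tokens, epochs, usd_per_million_tokens):
    """Cost of a per-token-billed fine-tuning job: each epoch re-reads the dataset."""
    return dataset_tokens * epochs / 1_000_000 * usd_per_million_tokens


def gpu_rental_cost(hours, usd_per_hour):
    """Cost of renting a GPU by the hour."""
    return hours * usd_per_hour


# 2,000 examples of about 500 tokens each, 3 epochs, at ~$8 per million training tokens.
print(f"per-token billing: ${token_training_cost(2000 * 500, 3, 8.0):.2f}")

# Hypothetical rental: 4 hours at $0.30/hour.
print(f"GPU rental: ${gpu_rental_cost(4, 0.30):.2f}")
```

Rental is cheaper per run in this example, but it leaves setup, monitoring, and failed runs to you.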
Official Resources
- Unsloth GitHub — Fastest fine-tuning library
- Axolotl Documentation — Production-grade fine-tuning
- Hugging Face PEFT — Official LoRA implementation
- OpenAI Fine-tuning Guide — Cloud fine-tuning, no GPU needed
- Hugging Face Model Hub — Base models to fine-tune from