4.1 — LoRA & QLoRA — Fine-Tuning on Consumer GPUs
If you've ever wished an AI model "spoke your language" — understood Karachi street names, Pakistani business lingo, or your company's internal tone — fine-tuning is the answer. LoRA (Low-Rank Adaptation) and its memory-efficient cousin QLoRA make this possible on hardware you already have or can rent cheaply: a gaming PC, a GPU instance on PaperSpace, or even a MacBook Pro. This lesson covers the theory, the math, the code, and the Pakistani commercial applications.
What Is LoRA and Why Does It Exist
Training an LLM from scratch requires thousands of GPUs and millions of dollars. LoRA sidesteps this entirely with a clever mathematical trick.
FULL FINE-TUNING vs. LoRA:
Full Fine-Tuning:
├── Updates ALL model weights (billions of parameters)
├── Needs: 4-8 GPUs × 80 GB VRAM each
├── Cost: $10,000-100,000+ per training run
├── Time: Days to weeks
└── Verdict: Impossible for 99.9% of Pakistani developers
LoRA (Low-Rank Adaptation):
├── Freezes ALL original weights
├── Injects tiny trainable matrices ("adapters")
├── Updates ONLY the adapters (< 1% of parameters)
├── Needs: 1 GPU with 6-16 GB VRAM
├── Cost: PKR 0-500 per training run
├── Time: 30 minutes to 4 hours
└── Verdict: Your RTX 3060 can do this tonight
QLoRA (Quantized LoRA):
├── Same as LoRA BUT loads base model in 4-bit
├── Cuts VRAM usage by 50-75%
├── Needs: 1 GPU with 6-8 GB VRAM
├── Quality: 95-99% of full LoRA
└── Verdict: Your RTX 3060 (12 GB) or even RTX 3050 (8 GB)
How It Actually Works
Think of the original model as a master chef who knows 10,000 recipes. LoRA doesn't retrain the chef — it gives them a small recipe card for your specific cuisine. The chef's 10,000 recipes stay intact, and the card adds your 50 specialized dishes.
The Math Without the Headache
THE LoRA DECOMPOSITION:
Original weight matrix W: 4096 × 4096 = 16,777,216 parameters
(frozen — don't touch)
LoRA adds two tiny matrices:
B: 4096 × r (r = rank, typically 4-32)
A: r × 4096
For rank 8:
B: 4096 × 8 = 32,768 parameters
A: 8 × 4096 = 32,768 parameters
Total adapter: 65,536 parameters (0.39% of original!)
The weight update: ΔW = B × A
┌─────────────────┐
│ Original W │ (frozen, 16M params)
│ 4096 × 4096 │
└────────┬────────┘
│
┌────────┴────────┐
│ + (B × A) │ (trainable, 65K params)
│ │
│ B: 4096×8 │
│ A: 8×4096 │
└─────────────────┘
YOU TRAIN 0.39% OF THE MODEL
THE OTHER 99.61% STAYS FROZEN
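To make the mechanics concrete, here is a minimal PyTorch sketch of the same idea: the original weight stays frozen, and only the two small matrices are trained. The class name and dimensions are illustrative; this is not the peft library's implementation.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal sketch: a frozen weight W plus a trainable low-rank update B @ A."""
    def __init__(self, d_in=4096, d_out=4096, r=8, alpha=16):
        super().__init__()
        self.W = nn.Linear(d_in, d_out, bias=False)
        self.W.weight.requires_grad_(False)                   # original weights stay frozen
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)    # A: r x 4096, small random init
        self.B = nn.Parameter(torch.zeros(d_out, r))          # B: 4096 x r, zero init so delta-W starts at 0
        self.scale = alpha / r

    def forward(self, x):
        # Output = frozen path + scaled low-rank correction (delta-W = B @ A)
        return self.W(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear()
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"Trainable adapter params: {trainable:,}")   # 65,536 for rank 8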
The Rank Hyperparameter
The rank (r) controls the adapter's expressiveness:
| Rank | Trainable Params (per 4096 × 4096 matrix) | VRAM Usage | Best For |
|---|---|---|---|
| 4 | ~33K | Minimal | Tone/style changes, simple patterns |
| 8 | ~65K | Low | Standard fine-tuning, most use cases |
| 16 | ~131K | Medium | Domain-specific knowledge injection |
| 32 | ~262K | Higher | Complex multi-task adaptation |
| 64 | ~524K | High | Near full fine-tuning quality |
Rule of thumb: Start with rank 8. Only increase if results are poor after 3 epochs. Higher rank = more expressive but more likely to overfit on small datasets.
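A quick back-of-the-envelope check (a sketch for a single 4096 × 4096 weight matrix) reproduces the per-matrix numbers in the table above, using adapter params = r × (d_in + d_out):

# Adapter parameters per 4096 x 4096 weight matrix, for the ranks in the table above
d_in = d_out = 4096
for r in (4, 8, 16, 32, 64):
    print(f"rank {r:2d}: {r * (d_in + d_out):,} trainable params per matrix")
# rank  4: 32,768 ... rank 64: 524,288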
Setting Up LoRA Fine-Tuning with Hugging Face
The peft (Parameter-Efficient Fine-Tuning) library handles LoRA natively. Combined with transformers and bitsandbytes (for 4-bit quantization), the full stack is free.
Installation
pip install transformers peft bitsandbytes datasets accelerate
# For QLoRA on Windows, you may need:
pip install bitsandbytes-windows
Loading a Model with QLoRA
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# QLoRA: load the base model in 4-bit
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",               # Normalized Float 4
    bnb_4bit_compute_dtype=torch.bfloat16,   # Computation in bf16
    bnb_4bit_use_double_quant=True           # Double quantization saves more VRAM
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=bnb_config,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
tokenizer.pad_token = tokenizer.eos_token    # Llama 3 has no pad token by default

# Prepare the quantized model for training, then attach the LoRA adapter
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=8,                       # Rank: start low, increase if needed
    lora_alpha=16,             # Scaling factor (usually 2x rank)
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,         # Regularization
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# With these four target modules expect roughly: trainable params ~6.8M || all params ~8.03B || trainable% ~0.08
Training Loop
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling
from datasets import load_dataset

# Load your dataset (Alpaca format) and turn each record into a single training text
dataset = load_dataset("json", data_files="training_data.json")

def tokenize(example):
    text = f"{example['instruction']}\n{example['input']}\n{example['output']}"
    return tokenizer(text, truncation=True, max_length=512)

dataset = dataset.map(tokenize, remove_columns=dataset["train"].column_names)

training_args = TrainingArguments(
    output_dir="./lora-output",
    num_train_epochs=3,                # 3 epochs is usually enough
    per_device_train_batch_size=4,     # Adjust based on VRAM
    gradient_accumulation_steps=4,     # Effective batch = 16
    learning_rate=2e-4,                # LoRA uses a higher LR than full fine-tuning
    warmup_steps=100,
    logging_steps=25,
    save_strategy="epoch",
    fp16=True,                         # Mixed precision
    report_to="none",                  # Or "wandb" for W&B monitoring
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    tokenizer=tokenizer,
)
trainer.train()

# Save just the adapter (a tiny file, ~10-50 MB)
model.save_pretrained("./my-pakistan-adapter")
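For reference, training_data.json is expected here as Alpaca-style records with instruction, input, and output fields. The snippet below writes a hypothetical two-example file; the content is purely illustrative.

import json

# Two illustrative Alpaca-format records (hypothetical content)
examples = [
    {
        "instruction": "Answer the customer's delivery question politely.",
        "input": "Delivery kab ayegi?",
        "output": "Aap ka order 2 se 3 working days mein deliver ho jayega.",
    },
    {
        "instruction": "Share the size chart when asked.",
        "input": "Size chart dikhao",
        "output": "Size chart: S (36), M (38), L (40), XL (42).",
    },
]

with open("training_data.json", "w", encoding="utf-8") as f:
    json.dump(examples, f, ensure_ascii=False, indent=2)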
Using Your Fine-Tuned Model
from peft import PeftModel

# Load base model + your adapter
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=bnb_config,
    device_map="auto"
)
model = PeftModel.from_pretrained(base_model, "./my-pakistan-adapter")

# Generate text with your fine-tuned model
inputs = tokenizer("DHA Phase 5 mein plot ka rate kya hai?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
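If a client deployment should not depend on the peft library at inference time, the adapter can also be merged into the base weights. A minimal sketch, assuming the base model is reloaded in 16-bit rather than 4-bit (merging into a quantized base is less clean):

import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Reload the base model in bf16 so the adapter weights can be folded in cleanly
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
merged = PeftModel.from_pretrained(base, "./my-pakistan-adapter").merge_and_unload()
merged.save_pretrained("./my-pakistan-merged")   # standalone model, no adapter needed at load time

The merged folder loads like any normal Hugging Face model, at the cost of storing full-size weights instead of a 15-50 MB adapter.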
Pakistani Use Cases for Fine-Tuning
Base models are trained predominantly on English web text. They don't know that "DHA Phase 5 mein plot" refers to Defence Housing Authority in Lahore, or that "EOBI payment" relates to workers' social security. Fine-tuning bridges this gap.
Three Revenue Channels
CHANNEL 1: CUSTOMER SERVICE BOTS (PKR 15,000-40,000/month)
├── Fine-tune on: Company FAQ + past chat logs + product catalog
├── Data needed: 500-1,000 Q&A pairs
├── Training time: 2-3 hours on RTX 3060
├── Result: Bot understands "delivery kab ayegi?" and "size chart dikhao"
├── Clients: Karachi retail, F&B, ecommerce sellers
└── Revenue: PKR 15-40K/month per client, 5 clients = PKR 75-200K/month
CHANNEL 2: LEGAL/REGULATORY DOCUMENT PROCESSING (PKR 30,000-80,000/project)
├── Fine-tune on: SECP filings, FBR forms, NADRA documents, contract templates
├── Data needed: 200-500 document examples
├── Training time: 3-4 hours
├── Result: Model extracts entities, summarizes clauses, flags risks
├── Clients: Law firms, corporate registrars, tax consultants
└── Revenue: PKR 30-80K per project, recurring monthly retainer
CHANNEL 3: HR SCREENING FOR LOCAL JOB PORTALS (PKR 20,000-50,000/month)
├── Fine-tune on: Pakistani CV formats, Rozee.pk job postings, salary norms
├── Data needed: 1,000+ CV-job pairs with match scores
├── Training time: 2-3 hours
├── Result: Model screens CVs understanding "NUST grad" vs "LUMS grad" context
├── Clients: Recruitment agencies, HR departments
└── Revenue: PKR 20-50K/month per client
Training Data Sources for Pakistan
| Data Source | Where to Get It | Use Case |
|---|---|---|
| WhatsApp business logs | Export from WhatsApp Business | Customer service fine-tuning |
| Zameen.pk listings | Scrape property descriptions | Real estate chatbot |
| Daraz product reviews | Public reviews page | Sentiment analysis |
| Dawn/Geo articles | RSS feeds or news archives | News summarization |
| PakWheels listings | Scrape car listings | Automotive chatbot |
| Rozee.pk job posts | Job posting data | HR screening model |
| FBR tax guides | Public PDFs | Tax advisory bot |
VRAM Requirements Guide
MODEL + QUANTIZATION → VRAM NEEDED:
Llama 3 8B:
├── Full precision (FP16): 16 GB → RTX 4080 or better
├── 8-bit quantized: 10 GB → RTX 3060 12GB
├── 4-bit (QLoRA): 6 GB → RTX 3050 8GB ✓
└── Training with QLoRA rank 8: 8 GB → RTX 3060 12GB ✓
Llama 3 70B:
├── Full precision: 140 GB → 2× A100 80GB
├── 4-bit (QLoRA): 40 GB → A100 80GB or A6000 48GB
└── Training with QLoRA: 48 GB → Rented GPU only
Qwen 2.5 7B:
├── 4-bit (QLoRA): 5 GB → RTX 3050 8GB ✓
└── Training with QLoRA: 7 GB → RTX 3060 12GB ✓
PAKISTANI HARDWARE OPTIONS:
├── RTX 3060 12GB: PKR 65,000-85,000 (Hafeez Centre Lahore / Saddar Karachi)
├── RTX 4060 8GB: PKR 80,000-100,000 (good for inference, tight for training)
├── RTX 3090 24GB: PKR 150,000-180,000 (gold standard for local training)
├── Google Colab T4: Free (15 GB VRAM, limited hours)
├── Google Colab A100: PKR 2,800/month (Colab Pro+)
└── PaperSpace Gradient: PKR 2,200/month (A4000 16GB)
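Before committing to a run, a quick check (a minimal PyTorch sketch) tells you how much VRAM your card actually has, which you can map against the requirements above:

import torch

# Report the GPU name and total VRAM, to compare against the table above
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: {props.total_memory / 1024**3:.1f} GB VRAM")
else:
    print("No CUDA GPU detected; consider Google Colab or a rented cloud GPU")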
Practice Lab
Exercise 1: Environment Setup
Install transformers, peft, bitsandbytes, and datasets via pip. Verify your GPU VRAM with nvidia-smi. If you have less than 8 GB, use Google Colab's free T4 GPU. Load any small model (e.g., Qwen/Qwen2-1.5B) with a LoRA config rank=4. Run print_trainable_parameters() and confirm the trainable % is below 1%.
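A minimal starting point for Exercise 1 (a sketch assuming the small Qwen/Qwen2-1.5B checkpoint suggested above):

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load a small model and attach a rank-4 adapter to the attention projections
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-1.5B", device_map="auto")
config = LoraConfig(r=4, lora_alpha=8, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, config)
model.print_trainable_parameters()   # trainable % should come out well below 1%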
Exercise 2: Rank Comparison Experiment
Load the same model with rank 4, rank 8, rank 16, and rank 64. For each, note: trainable parameter count, estimated VRAM usage, and adapter file size. Create a comparison table. At what rank does your GPU run out of VRAM? This is your practical ceiling.
Exercise 3: Mini Fine-Tune
Create a 50-example training dataset in Alpaca format (instruction, input, output). Topic: Pakistani customer service responses (use ChatGPT to generate synthetic data). Run a QLoRA fine-tune for 1 epoch. Test the before/after: prompt the base model and the fine-tuned model with the same Pakistani question. Is the fine-tuned version better?
Pakistan Case Study
Meet Asad — ML engineer in Karachi, runs a small AI consulting practice.
His problem: Clients wanted AI chatbots that understood Pakistani context — local slang, city names, business terminology. GPT-4o and Claude worked great for English but stumbled on "DHA Phase 5 mein 10 marla ka plot kitne ka hai?" or "EOBI ka form kahan se milega?"
His fine-tuning business:
Client 1 — Karachi Electronics Store (WhatsApp bot):
- Training data: 800 Q&A pairs from WhatsApp Business export
- Model: Llama 3 8B + QLoRA rank 8
- Training: 2.5 hours on his RTX 3090
- Result: Bot handles "Samsung A54 ka rate?" perfectly (base model didn't know PKR prices)
- Revenue: PKR 25,000/month retainer
Client 2 — Lahore Law Firm (Document processor):
- Training data: 300 SECP filing examples + 200 contract templates
- Model: Qwen 2.5 7B + QLoRA rank 16
- Training: 3 hours on Google Colab A100
- Result: Extracts company registration details with 94% accuracy (base model: 61%)
- Revenue: PKR 60,000 one-time + PKR 15,000/month maintenance
Client 3 — Islamabad Recruitment Agency (CV screener):
- Training data: 1,200 CV-job pairs from Rozee.pk data
- Model: Llama 3 8B + QLoRA rank 8
- Training: 4 hours on RTX 3090
- Result: Screens 500 CVs/day, understands "NUST BSCS" vs "Punjab University MBA"
- Revenue: PKR 35,000/month retainer
Total fine-tuning revenue: PKR 135,000/month
Hardware investment: RTX 3090 (PKR 170,000), paid back in roughly 1.3 months
Adapter files: each is 15-40 MB; he stores all client adapters on a single 256 GB SSD.
His key insight: "Base models are general purpose; they know a little bit of everything. Fine-tuning turns the model into a specialist. My electronics store bot now understands Pakistani prices, product names, and requests like 'bhai discount do' perfectly. A generic ChatGPT wrapper can't do that."
Key Takeaways
- LoRA trains a tiny adapter (< 1% of parameters) instead of the full model — making fine-tuning feasible on consumer GPUs
- QLoRA combines 4-bit quantization + LoRA, reducing VRAM from 16 GB to 6-8 GB for 8B parameter models
- Rank is the key hyperparameter: rank 4-8 for tone/style, rank 16-32 for domain knowledge injection
- Your RTX 3060 (PKR 65-85K) is a legitimate fine-tuning machine with QLoRA
- Pakistani market use cases: customer service bots (PKR 15-40K/month), legal doc processing (PKR 30-80K), HR screening (PKR 20-50K/month)
- Training data for Pakistan: WhatsApp exports, Zameen.pk, Daraz reviews, Dawn/Geo news, Rozee.pk
- Adapter files are tiny (15-50 MB) — you can store dozens of client-specific adapters on one drive
- The adapter is portable: train once, deploy anywhere (VPS, API, local server)
- Fine-tuning is a premium service — base model wrappers are commodity, fine-tuned models are specialized
Next lesson: Dataset preparation — cleaning, formatting, and augmenting training data for Pakistani use cases.