Module 4: AI Infrastructure & Local LLMs

4.1 LoRA & QLoRA — Fine-Tuning on Consumer GPUs

30 min · 8 code blocks · Practice Lab · Quiz (4 questions)

If you've ever wished an AI model "spoke your language" — understood Karachi street names, Pakistani business lingo, or your company's internal tone — fine-tuning is the answer. LoRA (Low-Rank Adaptation) and its memory-efficient cousin QLoRA make this possible on the kind of hardware you already own: a gaming PC, a rented GPU on PaperSpace, or even a MacBook Pro. This lesson covers the theory, the math, the code, and the Pakistani commercial applications.

What Is LoRA and Why Does It Exist

Training an LLM from scratch requires thousands of GPUs and millions of dollars. LoRA sidesteps this entirely with a clever mathematical trick.

code
FULL FINE-TUNING vs. LoRA:

Full Fine-Tuning:
├── Updates ALL model weights (billions of parameters)
├── Needs: 4-8 GPUs × 80 GB VRAM each
├── Cost: $10,000-100,000+ per training run
├── Time: Days to weeks
└── Verdict: Impossible for 99.9% of Pakistani developers

LoRA (Low-Rank Adaptation):
├── Freezes ALL original weights
├── Injects tiny trainable matrices ("adapters")
├── Updates ONLY the adapters (< 1% of parameters)
├── Needs: 1 GPU with 6-16 GB VRAM
├── Cost: PKR 0-500 per training run
├── Time: 30 minutes to 4 hours
└── Verdict: Your RTX 3060 can do this tonight

QLoRA (Quantized LoRA):
├── Same as LoRA BUT loads base model in 4-bit
├── Cuts VRAM usage by 50-75%
├── Needs: 1 GPU with 6-8 GB VRAM
├── Quality: 95-99% of full LoRA
└── Verdict: Your RTX 3060 (12 GB) or even RTX 3050 (8 GB)

How It Actually Works

Think of the original model as a master chef who knows 10,000 recipes. LoRA doesn't retrain the chef — it gives them a small recipe card for your specific cuisine. The chef's 10,000 recipes stay intact, and the card adds your 50 specialized dishes.
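
In code, the mechanism boils down to adding a small, scaled correction to the frozen layer's output. Here is a framework-free sketch with toy dimensions (the values are illustrative, not real model weights; the alpha/r scaling factor is the one LoRA applies to the adapter path):

```python
def matvec(M, x):
    # Plain-Python matrix-vector product
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def lora_forward(W, B, A, x, alpha, r):
    # Frozen path W·x plus scaled adapter path (alpha/r) · B·(A·x)
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * dv for b, dv in zip(base, delta)]

# Toy 4×4 frozen weight (identity here) with a rank-2 adapter.
# LoRA initialises B to zero, so before any training the adapter
# contributes nothing and the model behaves exactly like the base.
d, r = 4, 2
W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
B = [[0.0] * r for _ in range(d)]
A = [[0.25] * d for _ in range(r)]
x = [1.0, 2.0, 3.0, 4.0]
print(lora_forward(W, B, A, x, alpha=16, r=r))  # [1.0, 2.0, 3.0, 4.0], identical to W·x
```

Training only ever changes B and A; once B picks up non-zero values, the output shifts away from the base model's while W stays untouched.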

The Math Without the Headache

code
THE LoRA DECOMPOSITION:

Original weight matrix W:    4096 × 4096 = 16,777,216 parameters
                              (frozen — don't touch)

LoRA adds two tiny matrices:
  B: 4096 × r    (r = rank, typically 4-32)
  A: r × 4096

For rank 8:
  B: 4096 × 8   = 32,768 parameters
  A: 8 × 4096   = 32,768 parameters
  Total adapter:   65,536 parameters (0.39% of original!)

The weight update: ΔW = B × A

                    ┌─────────────────┐
                    │  Original W      │ (frozen, 16M params)
                    │  4096 × 4096     │
                    └────────┬────────┘
                             │
                    ┌────────┴────────┐
                    │  + (B × A)      │ (trainable, 65K params)
                    │                  │
                    │  B: 4096×8       │
                    │  A: 8×4096       │
                    └─────────────────┘

YOU TRAIN 0.39% OF THE MODEL
THE OTHER 99.61% STAYS FROZEN
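
The parameter arithmetic above is easy to verify in a few lines of plain Python:

```python
# Parameter counts for one 4096×4096 weight with a rank-8 adapter
d = 4096   # hidden size of the adapted weight matrix
r = 8      # LoRA rank

full_params = d * d               # frozen W: 16,777,216
adapter_params = d * r + r * d    # B (4096×8) plus A (8×4096)

print(full_params)                          # 16777216
print(adapter_params)                       # 65536
print(100 * adapter_params / full_params)   # 0.390625, the "0.39%" above
```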

The Rank Hyperparameter

The rank (r) controls the adapter's expressiveness:

| Rank | Trainable Params | VRAM Usage | Best For |
|------|------------------|------------|----------|
| 4    | ~33K             | Minimal    | Tone/style changes, simple patterns |
| 8    | ~65K             | Low        | Standard fine-tuning, most use cases |
| 16   | ~131K            | Medium     | Domain-specific knowledge injection |
| 32   | ~262K            | Higher     | Complex multi-task adaptation |
| 64   | ~524K            | High       | Near full fine-tuning quality |

Rule of thumb: Start with rank 8. Only increase if results are poor after 3 epochs. Higher rank = more expressive but more likely to overfit on small datasets.

Setting Up LoRA Fine-Tuning with Hugging Face

The peft (Parameter-Efficient Fine-Tuning) library handles LoRA natively. Combined with transformers and bitsandbytes (for 4-bit quantization), the full stack is free.

Installation

bash
pip install transformers peft bitsandbytes datasets accelerate
# For QLoRA on Windows, you may need:
pip install bitsandbytes-windows

Loading a Model with QLoRA

python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# QLoRA: Load base model in 4-bit
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",          # Normalized Float 4
    bnb_4bit_compute_dtype="bfloat16",   # Computation in bf16
    bnb_4bit_use_double_quant=True       # Double quantization saves more VRAM
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=bnb_config,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

# Attach LoRA adapter
lora_config = LoraConfig(
    r=8,                              # Rank — start low, increase if needed
    lora_alpha=16,                    # Scaling factor (usually 2× rank)
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,                # Regularization
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Example output: trainable params: ~6.8M || all params: ~8.03B || trainable%: ~0.08

Training Loop

python
from transformers import TrainingArguments, Trainer
from datasets import load_dataset

# Load your dataset (Alpaca format)
dataset = load_dataset("json", data_files="training_data.json")

training_args = TrainingArguments(
    output_dir="./lora-output",
    num_train_epochs=3,              # 3 epochs is usually enough
    per_device_train_batch_size=4,   # Adjust based on VRAM
    gradient_accumulation_steps=4,    # Effective batch = 16
    learning_rate=2e-4,              # LoRA uses higher LR than full FT
    warmup_steps=100,
    logging_steps=25,
    save_strategy="epoch",
    fp16=True,                       # Mixed precision
    report_to="none",                # Or "wandb" for W&B monitoring
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    tokenizer=tokenizer,
)

trainer.train()

# Save just the adapter (tiny file, ~10-50 MB)
model.save_pretrained("./my-pakistan-adapter")
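
The training_data.json loaded above is expected to be a JSON list of Alpaca-format records. A minimal sketch of creating one; the record content here is made up for illustration:

```python
import json

# One hypothetical training record in Alpaca format
record = {
    "instruction": "Answer the customer's delivery question politely in Roman Urdu.",
    "input": "Delivery kab ayegi?",
    "output": "Aap ka order 2-3 working days mein deliver ho jayega, shukriya!",
}

# The file is a list of such records; real datasets have hundreds or thousands
with open("training_data.json", "w", encoding="utf-8") as f:
    json.dump([record], f, ensure_ascii=False, indent=2)
```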

Using Your Fine-Tuned Model

python
from peft import PeftModel

# Load base model + your adapter
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=bnb_config,
    device_map="auto"
)
model = PeftModel.from_pretrained(base_model, "./my-pakistan-adapter")

# Generate text with your fine-tuned model
inputs = tokenizer("DHA Phase 5 mein plot ka rate kya hai?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Pakistani Use Cases for Fine-Tuning

Base models are trained predominantly on English web text. They don't know that "DHA Phase 5 mein plot" refers to Defence Housing Authority in Lahore, or that "EOBI payment" relates to workers' social security. Fine-tuning bridges this gap.

Three Revenue Channels

code
CHANNEL 1: CUSTOMER SERVICE BOTS (PKR 15,000-40,000/month)
├── Fine-tune on: Company FAQ + past chat logs + product catalog
├── Data needed: 500-1,000 Q&A pairs
├── Training time: 2-3 hours on RTX 3060
├── Result: Bot understands "delivery kab ayegi?" and "size chart dikhao"
├── Clients: Karachi retail, F&B, ecommerce sellers
└── Revenue: PKR 15-40K/month per client, 5 clients = PKR 75-200K/month

CHANNEL 2: LEGAL/REGULATORY DOCUMENT PROCESSING (PKR 30,000-80,000/project)
├── Fine-tune on: SECP filings, FBR forms, NADRA documents, contract templates
├── Data needed: 200-500 document examples
├── Training time: 3-4 hours
├── Result: Model extracts entities, summarizes clauses, flags risks
├── Clients: Law firms, corporate registrars, tax consultants
└── Revenue: PKR 30-80K per project, recurring monthly retainer

CHANNEL 3: HR SCREENING FOR LOCAL JOB PORTALS (PKR 20,000-50,000/month)
├── Fine-tune on: Pakistani CV formats, Rozee.pk job postings, salary norms
├── Data needed: 1,000+ CV-job pairs with match scores
├── Training time: 2-3 hours
├── Result: Model screens CVs understanding "NUST grad" vs "LUMS grad" context
├── Clients: Recruitment agencies, HR departments
└── Revenue: PKR 20-50K/month per client

Training Data Sources for Pakistan

| Data Source | Where to Get It | Use Case |
|-------------|-----------------|----------|
| WhatsApp business logs | Export from WhatsApp Business | Customer service fine-tuning |
| Zameen.pk listings | Scrape property descriptions | Real estate chatbot |
| Daraz product reviews | Public reviews page | Sentiment analysis |
| Dawn/Geo articles | RSS feeds or news archives | News summarization |
| PakWheels listings | Scrape car listings | Automotive chatbot |
| Rozee.pk job posts | Job posting data | HR screening model |
| FBR tax guides | Public PDFs | Tax advisory bot |
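
For the WhatsApp route, the raw export is a plain text file, one message per line. A minimal parsing sketch, assuming the common "date, time - sender: message" layout (the exact format varies by phone locale, and to_qa_pairs is a hypothetical helper, not a library function):

```python
import re

# Assumes export lines like "12/05/2024, 10:15 - Name: message"
LINE_RE = re.compile(r"^\d{1,2}/\d{1,2}/\d{2,4}, \d{1,2}:\d{2} - ([^:]+): (.*)$")

def to_qa_pairs(lines, business_name):
    """Pair each customer message with the business reply that follows it."""
    pairs, pending = [], None
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # skip system messages, media placeholders, etc.
        sender, text = m.group(1), m.group(2)
        if sender != business_name:
            pending = text                      # customer message, wait for reply
        elif pending is not None:
            pairs.append({"input": pending, "output": text})
            pending = None
    return pairs

chat = [
    "12/05/2024, 10:15 - Customer: Delivery kab ayegi?",
    "12/05/2024, 10:17 - Shop: 2-3 working days mein, insha'Allah.",
]
print(to_qa_pairs(chat, "Shop"))
```

Real exports need more cleaning (multi-line messages, media omitted markers, greetings), but this is the core of turning chat logs into instruction-tuning pairs.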

VRAM Requirements Guide

code
MODEL + QUANTIZATION → VRAM NEEDED:

Llama 3 8B:
├── Half precision (FP16):     16 GB → RTX 4080 or better
├── 8-bit quantized:           10 GB → RTX 3060 12GB
├── 4-bit (QLoRA):              6 GB → RTX 3050 8GB ✓
└── Training with QLoRA rank 8: 8 GB → RTX 3060 12GB ✓

Llama 3 70B:
├── Half precision (FP16):     140 GB → 2× A100 80GB
├── 4-bit (QLoRA):             40 GB → A100 80GB or A6000 48GB
└── Training with QLoRA:       48 GB → Rented GPU only

Qwen 2.5 7B:
├── 4-bit (QLoRA):              5 GB → RTX 3050 8GB ✓
└── Training with QLoRA:        7 GB → RTX 3060 12GB ✓

PAKISTANI HARDWARE OPTIONS:
├── RTX 3060 12GB: PKR 65,000-85,000 (Hafeez Centre Lahore / Saddar Karachi)
├── RTX 4060 8GB: PKR 80,000-100,000 (good for inference, tight for training)
├── RTX 3090 24GB: PKR 150,000-180,000 (gold standard for local training)
├── Google Colab T4: Free (15 GB VRAM, limited hours)
├── Google Colab A100: PKR 2,800/month (Colab Pro+)
└── PaperSpace Gradient: PKR 2,200/month (A4000 16GB)
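
The numbers above follow a simple rule of thumb: weight memory is roughly parameters × bits / 8 bytes. A quick sketch (weights only; the table's figures add headroom for the KV cache, activations, and CUDA overhead):

```python
def weights_gb(params_billions, bits):
    # Weight memory only: params × (bits / 8) bytes, expressed in GB
    return params_billions * bits / 8

# Llama 3 8B at different precisions
for bits in (16, 8, 4):
    print(f"Llama 3 8B @ {bits}-bit: ~{weights_gb(8, bits):.0f} GB weights")
```

This prints roughly 16, 8, and 4 GB; compare each against your GPU's VRAM, leaving a few GB spare for inference, and more for training.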

Practice Lab

Exercise 1: Environment Setup
Install transformers, peft, bitsandbytes, and datasets via pip. Verify your GPU VRAM with nvidia-smi. If you have less than 8 GB, use Google Colab's free T4 GPU. Load any small model (e.g., Qwen/Qwen2-1.5B) with a LoRA config at rank 4. Run print_trainable_parameters() and confirm the trainable percentage is below 1%.

Exercise 2: Rank Comparison Experiment
Load the same model with rank 4, 8, 16, and 64. For each, note the trainable parameter count, estimated VRAM usage, and adapter file size. Create a comparison table. At what rank does your GPU run out of VRAM? That is your practical ceiling.

Exercise 3: Mini Fine-Tune
Create a 50-example training dataset in Alpaca format (instruction, input, output). Topic: Pakistani customer service responses (use ChatGPT to generate synthetic data). Run a QLoRA fine-tune for 1 epoch. Then compare before and after: prompt the base model and the fine-tuned model with the same Pakistani question. Is the fine-tuned version better?

Pakistan Case Study

Meet Asad — ML engineer in Karachi, runs a small AI consulting practice.

His problem: Clients wanted AI chatbots that understood Pakistani context — local slang, city names, business terminology. GPT-4o and Claude worked great for English but stumbled on "DHA Phase 5 mein 10 marla ka plot kitne ka hai?" or "EOBI ka form kahan se milega?"

His fine-tuning business:

Client 1 — Karachi Electronics Store (WhatsApp bot):

  • Training data: 800 Q&A pairs from WhatsApp Business export
  • Model: Llama 3 8B + QLoRA rank 8
  • Training: 2.5 hours on his RTX 3090
  • Result: Bot handles "Samsung A54 ka rate?" perfectly (base model didn't know PKR prices)
  • Revenue: PKR 25,000/month retainer

Client 2 — Lahore Law Firm (Document processor):

  • Training data: 300 SECP filing examples + 200 contract templates
  • Model: Qwen 2.5 7B + QLoRA rank 16
  • Training: 3 hours on Google Colab A100
  • Result: Extracts company registration details with 94% accuracy (base model: 61%)
  • Revenue: PKR 60,000 one-time + PKR 15,000/month maintenance

Client 3 — Islamabad Recruitment Agency (CV screener):

  • Training data: 1,200 CV-job pairs from Rozee.pk data
  • Model: Llama 3 8B + QLoRA rank 8
  • Training: 4 hours on RTX 3090
  • Result: Screens 500 CVs/day, understands "NUST BSCS" vs "Punjab University MBA"
  • Revenue: PKR 35,000/month retainer

Total fine-tuning revenue: PKR 135,000/month
Hardware investment: RTX 3090 (PKR 170,000) — paid back in 1.3 months
Adapter files: each is 15-40 MB; he stores all client adapters on a single 256 GB SSD

His key insight: "Base models are general purpose — woh sab kuch thoda thoda jaante hain. Fine-tuning se model specialist ban jata hai. Mera electronics store bot ab Pakistani prices, product names, aur 'bhai discount do' jaise requests perfectly samajhta hai. Generic ChatGPT wrapper ye nahi kar sakta." (Roughly: "Base models are general purpose; they know a little of everything. Fine-tuning turns the model into a specialist. My electronics store bot now perfectly understands Pakistani prices, product names, and requests like 'bhai, give me a discount'. A generic ChatGPT wrapper can't do this.")

Key Takeaways

  • LoRA trains a tiny adapter (< 1% of parameters) instead of the full model — making fine-tuning feasible on consumer GPUs
  • QLoRA combines 4-bit quantization + LoRA, reducing VRAM from 16 GB to 6-8 GB for 8B parameter models
  • Rank is the key hyperparameter: rank 4-8 for tone/style, rank 16-32 for domain knowledge injection
  • Your RTX 3060 (PKR 65-85K) is a legitimate fine-tuning machine with QLoRA
  • Pakistani market use cases: customer service bots (PKR 15-40K/month), legal doc processing (PKR 30-80K), HR screening (PKR 20-50K/month)
  • Training data for Pakistan: WhatsApp exports, Zameen.pk, Daraz reviews, Dawn/Geo news, Rozee.pk
  • Adapter files are tiny (15-50 MB) — you can store dozens of client-specific adapters on one drive
  • The adapter is portable: train once, deploy anywhere (VPS, API, local server)
  • Fine-tuning is a premium service — base model wrappers are commodity, fine-tuned models are specialized

Next lesson: Dataset preparation — cleaning, formatting, and augmenting training data for Pakistani use cases.

Lesson Summary

Includes a hands-on practice lab, 8 runnable code examples, and a 4-question knowledge check below.

Quiz: LoRA & QLoRA — Fine-Tuning on Consumer GPUs

4 questions to test your understanding. Score 60% or higher to pass.