4.2 — Dataset Preparation — Cleaning & Formatting for Training
The quality of your fine-tuned model is directly proportional to the quality of your training data. Garbage in, garbage out — this rule is as true in AI as it is in life. Building a clean, well-formatted dataset is often 80% of the work in a successful fine-tuning project.
The Two Types of Training Data
Instruction-following datasets are the most common format for fine-tuning chat/assistant models. Each record has three fields: a system prompt (context), a user message (the instruction), and an assistant response (the correct output). This mirrors the conversational structure models like Llama and Mistral were trained on.
Completion datasets are simpler — just input text and the expected continuation. These work well for style transfer, document summarization, or code completion tasks where you don't need instruction-following behaviour.
For most Pakistani business use cases, instruction-following is the right choice. You want a model that responds correctly to commands, not one that just continues text.
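To make the contrast concrete, here is a sketch of one record in each style. The field names in the completion example ("prompt", "completion") are illustrative — exact keys vary by framework:

```python
# A completion-style record: no instruction scaffolding, just text in, text out.
# Field names ("prompt", "completion") are illustrative; check your framework's docs.
completion_example = {
    "prompt": "Dear valued customer, thank you for shopping with",
    "completion": " us at Khan Electronics. Your order #1052 has been dispatched.",
}

# An instruction-style record for comparison: the model learns to follow a command.
instruction_example = {
    "instruction": "Summarize this customer review in one sentence.",
    "input": "Delivery was fast but the packaging was badly damaged on arrival.",
    "output": "Fast delivery, but the item arrived in damaged packaging.",
}
```

Notice that the completion record only teaches the model to continue text, while the instruction record teaches it to respond to a task.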
The Alpaca Format
The most widely supported fine-tuning format is Alpaca-style JSON, developed by Stanford. Each training example is a dictionary:
{
  "instruction": "Write a professional WhatsApp message to follow up with a client about a pending invoice.",
  "input": "Client name: Ahmed Traders. Invoice amount: PKR 75,000. Due since: 15 days ago.",
  "output": "Assalam o Alaikum Ahmed bhai, hope you're doing well! Just a gentle reminder about Invoice #2847 for PKR 75,000 which has been pending for 15 days. Kindly let us know if there's any issue with the payment. JazakAllah Khair!"
}
The input field is optional — use it when the instruction needs additional context. For tasks like translation, summarization, or data extraction, input holds the source material.
Building a Pakistani Domain Dataset
A dataset doesn't need to be massive. Research has shown that 500-2,000 high-quality examples often outperform 50,000 noisy ones. For a customer service bot targeting Karachi's restaurant sector, you need:
- 100-200 examples of query-response pairs in Pakistani English/Roman Urdu
- 50-100 examples of menu inquiries, reservation handling, and complaint resolution
- 50 edge cases — rude customers, unclear questions, requests outside scope
Sources for Pakistani training data:
- Export WhatsApp Business conversation logs (anonymized)
- Manually write gold-standard examples (expensive but highest quality)
- Augment with AI: generate synthetic examples using ChatGPT, then human-review them
- Scrape public Pakistani forum data (PakWheels, Zameen.pk forums) — with proper licensing checks
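Whichever sources you combine, it pays to validate records before cleaning them. A minimal validation sketch — the function name and rules here are my own, not a standard API:

```python
# Required fields for an Alpaca-style record; "input" is allowed to be absent or empty.
REQUIRED_KEYS = {"instruction", "output"}

def is_valid_record(record):
    """Check that a raw record has the fields an Alpaca-style trainer expects."""
    if not isinstance(record, dict):
        return False
    if not REQUIRED_KEYS.issubset(record):
        return False
    # Instruction and output must be non-empty strings.
    return all(isinstance(record[k], str) and record[k].strip() for k in REQUIRED_KEYS)

raw = [
    {"instruction": "Greet the customer.", "output": "Assalam o Alaikum! How can I help?"},
    {"instruction": "Broken record"},            # missing "output"
    {"instruction": "", "output": "orphaned"},   # empty instruction
]
valid = [r for r in raw if is_valid_record(r)]
print(len(valid))  # 1
```

Running a check like this first means the cleaning pipeline below only ever sees structurally sound records.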
Data Cleaning Pipeline
Raw data is always messy. A production cleaning pipeline includes:
import json
import re

def clean_example(example):
    # 1. Minimum length check
    if len(example['output']) < 20:
        return None
    # 2. Remove phone numbers and personal info
    example['output'] = re.sub(r'\b\d{10,11}\b', '[PHONE]', example['output'])
    # 3. Normalize whitespace
    example['output'] = ' '.join(example['output'].split())
    # 4. Filter profanity or off-topic outputs
    forbidden = ['competitor_name', 'personal_email']
    if any(f in example['output'].lower() for f in forbidden):
        return None
    return example

# Apply to dataset
cleaned = [clean_example(ex) for ex in raw_data]
cleaned = [ex for ex in cleaned if ex is not None]
print(f"Kept {len(cleaned)}/{len(raw_data)} examples after cleaning")
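The bare 10-11 digit rule above misses formatted numbers like 0300-1234567. A stricter pattern for common Pakistani mobile formats might look like the sketch below — the regex is an assumption to adapt to your own data, not an exhaustive rule:

```python
import re

# Common Pakistani mobile formats: 03001234567, 0300-1234567, +92 300 1234567.
# Illustrative assumption only; extend for landlines or other layouts you see in your data.
PK_PHONE = re.compile(r'(?:\+92[\s-]?|0)3\d{2}[\s-]?\d{7}')

def redact_phones(text):
    """Replace anything matching PK_PHONE with a [PHONE] placeholder."""
    return PK_PHONE.sub('[PHONE]', text)

print(redact_phones("Call me at 0300-1234567 or +92 321 7654321."))
# Call me at [PHONE] or [PHONE].
```

Always spot-check redacted output by hand — an over-eager pattern can also swallow order numbers or invoice IDs.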
Train/Validation Split
Always hold out 10-20% of your data for validation. The validation set tells you whether the model is actually learning or just memorizing. Use the Hugging Face datasets library's train_test_split() method with a fixed seed for reproducible splits.
A common beginner mistake is training on everything and evaluating only on the training set — you'll see excellent numbers, but the model will fail on real inputs. Always test on examples the model has never seen.
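If you'd rather not pull in the datasets library for a small JSON file, a seeded split in plain Python works just as well. A minimal sketch (the function name is my own):

```python
import random

def train_val_split(examples, val_frac=0.2, seed=42):
    """Shuffle with a fixed seed, then hold out the last val_frac as validation."""
    shuffled = examples[:]               # copy so the original order is untouched
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - val_frac))
    return shuffled[:cut], shuffled[cut:]

data = [{"instruction": f"q{i}", "output": f"a{i}"} for i in range(20)]
train, val = train_val_split(data)
print(len(train), len(val))  # 16 4
```

Because the seed is fixed, rerunning the script produces the same split — essential when you compare training runs against each other.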
Format Conversion for Different Libraries
Different training frameworks (TRL, Axolotl, LLaMA-Factory) have slightly different input formats. The most flexible approach is to store your dataset in Alpaca JSON format and write a conversion script for each framework. The TRL (Transformer Reinforcement Learning) library, the most common choice, expects a text field with the full formatted prompt:
def format_for_trl(example):
    # Omit the "### Input:" section entirely when the example has no input,
    # matching the standard Alpaca prompt convention.
    if example.get('input'):
        return {
            "text": f"### Instruction:\n{example['instruction']}\n\n"
                    f"### Input:\n{example['input']}\n\n"
                    f"### Response:\n{example['output']}"
        }
    return {
        "text": f"### Instruction:\n{example['instruction']}\n\n"
                f"### Response:\n{example['output']}"
    }
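Recent TRL versions also accept a conversational "messages" format (lists of role/content dicts) instead of a flat text field — check the docs for your installed version. A sketch of converting an Alpaca record into that shape:

```python
def alpaca_to_messages(example):
    """Convert an Alpaca-style record to a chat-style message list.

    Assumes the common {"role": ..., "content": ...} schema; verify against
    your TRL version's documentation before relying on it.
    """
    user_content = example["instruction"]
    if example.get("input"):
        user_content += "\n\n" + example["input"]
    return {
        "messages": [
            {"role": "user", "content": user_content},
            {"role": "assistant", "content": example["output"]},
        ]
    }

record = {
    "instruction": "Translate to Urdu (Roman script).",
    "input": "Your order is on the way.",
    "output": "Aap ka order raaste mein hai.",
}
msgs = alpaca_to_messages(record)["messages"]
print(msgs[0]["role"], "->", msgs[1]["role"])  # user -> assistant
```

The advantage of the messages format is that the trainer applies the model's own chat template, so the same dataset works across Llama-style and Mistral-style prompt formats.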
Practice Lab
1. Create a 20-example dataset for a hypothetical Karachi grocery delivery chatbot. Write 20 realistic customer queries (in mixed English/Roman Urdu) and ideal responses. Save as karachi_grocery.json in Alpaca format.
2. Run the cleaning pipeline on your dataset. Introduce 3 intentional "bad" examples (too short, contains a phone number, off-topic) and verify the pipeline removes them.
3. Generate the train/val split: 16 training examples, 4 validation. Verify there is no overlap between the splits.
Key Takeaways
- Instruction-following (Alpaca format) is the standard for fine-tuning assistant models
- 500-2,000 high-quality examples beat 50,000 noisy ones — quality over quantity
- Always clean raw data: remove PII, check minimum length, filter off-topic examples
- Hold out 10-20% for validation — never evaluate on your training set