4.2 — Dataset Preparation — Cleaning & Formatting for Training
The quality of your fine-tuned model is directly proportional to the quality of your training data. Garbage in, garbage out — this rule is as true in AI as it is in life. Building a clean, well-formatted dataset is often 80% of the work in a successful fine-tuning project.
The Two Types of Training Data
Instruction-following datasets are the most common format for fine-tuning chat/assistant models. Each record has three fields: a system prompt (context), a user message (the instruction), and an assistant response (the correct output). This mirrors the conversational structure models like Llama and Mistral were trained on.
Completion datasets are simpler — just input text and the expected continuation. These work well for style transfer, document summarization, or code completion tasks where you don't need instruction-following behaviour.
For most Pakistani business use cases, instruction-following is the right choice. You want a model that responds correctly to commands, not one that just continues text.
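To make the contrast concrete, here is a sketch of one record in each style. The field names in the completion example ("prompt", "completion") are illustrative — exact keys vary by framework:

```python
# A completion-style record: no instruction scaffolding, just text in, text out.
# Field names ("prompt", "completion") are illustrative; check your framework's docs.
completion_example = {
    "prompt": "Dear valued customer, thank you for shopping with",
    "completion": " us at Khan Electronics. Your order #1052 has been dispatched.",
}

# An instruction-style record for comparison: the model learns to follow a command.
instruction_example = {
    "instruction": "Summarize this customer review in one sentence.",
    "input": "Delivery was fast but the packaging was badly damaged on arrival.",
    "output": "Fast delivery, but the item arrived in damaged packaging.",
}
```

Notice that the completion record only teaches the model to continue text, while the instruction record teaches it to respond to a task.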
The Alpaca Format
The most widely supported fine-tuning format is Alpaca-style JSON, developed by Stanford. Each training example is a dictionary:
{
  "instruction": "Write a professional WhatsApp message to follow up with a client about a pending invoice.",
  "input": "Client name: Ahmed Traders. Invoice amount: PKR 75,000. Due since: 15 days ago.",
  "output": "Assalam o Alaikum Ahmed bhai, hope you're doing well! Just a gentle reminder about Invoice #2847 for PKR 75,000 which has been pending for 15 days. Kindly let us know if there's any issue with the payment. JazakAllah Khair!"
}
The input field is optional — use it when the instruction needs additional context. For tasks like translation, summarization, or data extraction, input holds the source material.
Building a Pakistani Domain Dataset
A dataset doesn't need to be massive. Research has shown that 500-2,000 high-quality examples often outperform 50,000 noisy ones. For a customer service bot targeting Karachi's restaurant sector, you need:
- 100-200 examples of query-response pairs in Pakistani English/Roman Urdu
- 50-100 examples of menu inquiries, reservation handling, and complaint resolution
- 50 edge cases — rude customers, unclear questions, requests outside scope
Sources for Pakistani training data:
- Export WhatsApp Business conversation logs (anonymized)
- Manually write gold-standard examples (expensive but highest quality)
- Augment with AI: generate synthetic examples using ChatGPT, then human-review them
- Scrape public Pakistani forum data (PakWheels, Zameen.pk forums) — with proper licensing checks
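Whichever sources you combine, it pays to validate records before cleaning them. A minimal validation sketch — the function name and rules here are my own, not a standard API:

```python
# Required fields for an Alpaca-style record; "input" is allowed to be absent or empty.
REQUIRED_KEYS = {"instruction", "output"}

def is_valid_record(record):
    """Check that a raw record has the fields an Alpaca-style trainer expects."""
    if not isinstance(record, dict):
        return False
    if not REQUIRED_KEYS.issubset(record):
        return False
    # Instruction and output must be non-empty strings.
    return all(isinstance(record[k], str) and record[k].strip() for k in REQUIRED_KEYS)

raw = [
    {"instruction": "Greet the customer.", "output": "Assalam o Alaikum! How can I help?"},
    {"instruction": "Broken record"},            # missing "output"
    {"instruction": "", "output": "orphaned"},   # empty instruction
]
valid = [r for r in raw if is_valid_record(r)]
print(len(valid))  # 1
```

Running a check like this first means the cleaning pipeline below only ever sees structurally sound records.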
Data Cleaning Pipeline
Raw data is always messy. A production cleaning pipeline includes:
import json
import re

def clean_example(example):
    # 1. Minimum length check
    if len(example['output']) < 20:
        return None
    # 2. Remove phone numbers and personal info
    example['output'] = re.sub(r'\b\d{10,11}\b', '[PHONE]', example['output'])
    # 3. Normalize whitespace
    example['output'] = ' '.join(example['output'].split())
    # 4. Filter profanity or off-topic outputs
    forbidden = ['competitor_name', 'personal_email']
    if any(f in example['output'].lower() for f in forbidden):
        return None
    return example

# Apply to dataset
cleaned = [clean_example(ex) for ex in raw_data]
cleaned = [ex for ex in cleaned if ex is not None]
print(f"Kept {len(cleaned)}/{len(raw_data)} examples after cleaning")
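The bare 10-11 digit rule above misses formatted numbers like 0300-1234567. A stricter pattern for common Pakistani mobile formats might look like the sketch below — the regex is an assumption to adapt to your own data, not an exhaustive rule:

```python
import re

# Common Pakistani mobile formats: 03001234567, 0300-1234567, +92 300 1234567.
# Illustrative assumption only; extend for landlines or other layouts you see in your data.
PK_PHONE = re.compile(r'(?:\+92[\s-]?|0)3\d{2}[\s-]?\d{7}')

def redact_phones(text):
    """Replace anything matching PK_PHONE with a [PHONE] placeholder."""
    return PK_PHONE.sub('[PHONE]', text)

print(redact_phones("Call me at 0300-1234567 or +92 321 7654321."))
# Call me at [PHONE] or [PHONE].
```

Always spot-check redacted output by hand — an over-eager pattern can also swallow order numbers or invoice IDs.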
Train/Validation Split
Always hold out 10-20% of your data for validation. The validation set tells you whether the model is actually learning or just memorizing. Use the Hugging Face datasets library's train_test_split() method with a fixed seed for reproducible splits.
A common beginner mistake is training on everything and evaluating only on the training set — you'll see excellent numbers, but the model will fail on real inputs. Always test on examples the model has never seen.
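If you'd rather not pull in the datasets library for a small JSON file, a seeded split in plain Python works just as well. A minimal sketch (the function name is my own):

```python
import random

def train_val_split(examples, val_frac=0.2, seed=42):
    """Shuffle with a fixed seed, then hold out the last val_frac as validation."""
    shuffled = examples[:]               # copy so the original order is untouched
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - val_frac))
    return shuffled[:cut], shuffled[cut:]

data = [{"instruction": f"q{i}", "output": f"a{i}"} for i in range(20)]
train, val = train_val_split(data)
print(len(train), len(val))  # 16 4
```

Because the seed is fixed, rerunning the script produces the same split — essential when you compare training runs against each other.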
Format Conversion for Different Libraries
Different training frameworks (TRL, Axolotl, LLaMA-Factory) have slightly different input formats. The most flexible approach is to store your dataset in Alpaca JSON format and write a conversion script for each framework. The TRL (Transformer Reinforcement Learning) library, the most common choice, expects a text field with the full formatted prompt:
def format_for_trl(example):
    # Omit the "### Input:" section entirely when the example has no input,
    # matching the standard Alpaca prompt convention.
    if example.get('input'):
        return {
            "text": f"### Instruction:\n{example['instruction']}\n\n"
                    f"### Input:\n{example['input']}\n\n"
                    f"### Response:\n{example['output']}"
        }
    return {
        "text": f"### Instruction:\n{example['instruction']}\n\n"
                f"### Response:\n{example['output']}"
    }
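Recent TRL versions also accept a conversational "messages" format (lists of role/content dicts) instead of a flat text field — check the docs for your installed version. A sketch of converting an Alpaca record into that shape:

```python
def alpaca_to_messages(example):
    """Convert an Alpaca-style record to a chat-style message list.

    Assumes the common {"role": ..., "content": ...} schema; verify against
    your TRL version's documentation before relying on it.
    """
    user_content = example["instruction"]
    if example.get("input"):
        user_content += "\n\n" + example["input"]
    return {
        "messages": [
            {"role": "user", "content": user_content},
            {"role": "assistant", "content": example["output"]},
        ]
    }

record = {
    "instruction": "Translate to Urdu (Roman script).",
    "input": "Your order is on the way.",
    "output": "Aap ka order raaste mein hai.",
}
msgs = alpaca_to_messages(record)["messages"]
print(msgs[0]["role"], "->", msgs[1]["role"])  # user -> assistant
```

The advantage of the messages format is that the trainer applies the model's own chat template, so the same dataset works across Llama-style and Mistral-style prompt formats.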
Practice Lab
1. Create a 20-example dataset for a hypothetical Karachi grocery delivery chatbot. Write 20 realistic customer queries (in mixed English/Roman Urdu) and ideal responses. Save as karachi_grocery.json in Alpaca format.
2. Run the cleaning pipeline on your dataset. Introduce 3 intentional "bad" examples (too short, contains a phone number, off-topic) and verify the pipeline removes them.
3. Generate the train/val split: 16 training examples, 4 validation. Verify there is no overlap between the splits.
Key Takeaways
- Instruction-following (Alpaca format) is the standard for fine-tuning assistant models
- 500-2,000 high-quality examples beat 50,000 noisy ones — quality over quantity
- Always clean raw data: remove PII, check minimum length, filter off-topic examples
- Hold out 10-20% for validation — never evaluate on your training set