1.2 — AI Property Valuation — Zameen.pk Data Analysis with ChatGPT
AI-powered property valuation can estimate prices with roughly 85% accuracy (about 15% mean absolute percentage error) by analyzing 200+ factors: location, amenities, comparable sales, and market trends. This lesson teaches you to build valuation models using Zameen.pk data and AI. In Pakistan's dynamic real estate market, where prices can fluctuate significantly based on micro-local factors, AI offers a crucial edge, moving beyond traditional "gut-feel" valuations to data-driven insights. This can be a game-changer for investors, real estate agents, and individuals looking to buy or sell property in cities like Karachi, Lahore, and Islamabad.
Zameen.pk Data: Your Goldmine
Zameen.pk is Pakistan's largest property portal (50M+ visits/month). It contains a treasure trove of data vital for accurate valuations:
- 500k+ active listings
- Historical sale prices
- Rental data
- Property photos and attributes (e.g., plot size in Marla/Kanal, covered area in sq ft)
- Broker information
- Neighborhood demographics and amenities data
Accessing this data ethically and efficiently is the first step to building robust AI models. Always ensure you comply with Zameen.pk's terms of service and robots.txt file when scraping.
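You can check what a site's robots.txt permits programmatically with Python's standard library before sending any requests. A minimal sketch; the rules parsed below are hypothetical, for illustration only (in practice you would load the live file with set_url and read):

```python
from urllib.robotparser import RobotFileParser

# Parse a hypothetical robots.txt policy. For the real file you would use:
#   rp.set_url("https://www.zameen.com/robots.txt"); rp.read()
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# Check specific URLs before scraping them.
print(rp.can_fetch("MyScraper/1.0", "https://www.zameen.com/search/"))    # allowed
print(rp.can_fetch("MyScraper/1.0", "https://www.zameen.com/private/x"))  # disallowed
```

Running this check at scraper startup, and re-checking periodically, keeps your crawler aligned with the site's published policy.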
Data Scraping Flow Diagram:
+----------------+ +-------------------+ +---------------------+
| Zameen.pk | | Web Scraper | | Data Storage (CSV) |
| (HTML Pages) |------>| (Python Requests, |------>| (Price, Size, Loc, |
| | | BeautifulSoup) | | Amenities, Type) |
+----------------+ +-------------------+ +---------------------+
^ |
| (Pagination & |
| Error Handling) |
+------------------------------------------------------+
Free data access strategies:
- Scrape publicly available listings (ethical, within terms of service). Focus on specific cities or property types to manage data volume.
- Parse property attributes (price, size, location, amenities, number of bedrooms/bathrooms, construction year, etc.). Pay attention to local units like Marla, Kanal, or square yards which are common in Pakistan.
- Build dataset of 10,000+ properties. The larger and more diverse your dataset, the more accurate your model will be.
- Train AI valuation model.
Comparison of Pakistani Property Portals:
| Feature | Zameen.pk | Graana.com | OLX Property (part of OLX Pakistan) |
|---|---|---|---|
| Market Share | Largest, dominant | Growing, strong presence in Islamabad | Significant, broader classifieds reach |
| Listings Volume | 500k+ active listings | ~100k+ active listings | Millions of general listings, incl. property |
| Data Depth | Rich historical, rental, agent data | Focus on new developments, project details | Basic property details, user-generated |
| API Access | Limited, commercial partnerships | Limited, direct inquiries | None specific for property data aggregation |
| Target Audience | Agents, buyers, sellers, investors | Developers, high-end buyers, investors | General public, individual buyers/sellers |
Building a Property Valuation Model
Step 1: Data Collection
The Python snippet below is a good start. For real-world scraping, you'll need to handle pagination, rotate user agents, add delays, and parse more detailed attributes.
import requests
from bs4 import BeautifulSoup
import time
import random
import json
def scrape_zameen(location: str, property_type: str, num_pages: int = 2):
    all_properties = []
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
    }
    print(f"Starting scrape for {property_type} in {location}...")
    for page in range(1, num_pages + 1):
        url = f"https://www.zameen.com/search/?city={location}&purpose=sale&type={property_type}&page={page}"
        print(f"Scraping page {page}: {url}")
        try:
            response = requests.get(url, headers=headers, timeout=10)
            response.raise_for_status()  # Raise an exception for HTTP errors
            soup = BeautifulSoup(response.content, 'html.parser')
            # Zameen.pk's structure changes; this is a simplified example.
            # You'd need to inspect the current HTML structure for precise class names.
            # This targets a common card structure found in search results.
            listings = soup.find_all('li', class_='_391a8e53')  # Example class, inspect Zameen.pk
            if not listings:
                print(f"No listings found on page {page} or structure changed. Stopping.")
                break
            for listing in listings:
                try:
                    price_elem = listing.find('span', class_='_9d05e325')  # Example class
                    price = price_elem.text.strip() if price_elem else 'N/A'
                    size_elem = listing.find('span', class_='_826477e2')  # Example class
                    size = size_elem.text.strip() if size_elem else 'N/A'
                    location_elem = listing.find('div', class_='_162e6469')  # Example class
                    location_detail = location_elem.text.strip() if location_elem else 'N/A'
                    bedrooms_elem = listing.find('span', {'aria-label': 'Beds'})  # Example
                    bedrooms = bedrooms_elem.text.strip() if bedrooms_elem else 'N/A'
                    bathrooms_elem = listing.find('span', {'aria-label': 'Baths'})  # Example
                    bathrooms = bathrooms_elem.text.strip() if bathrooms_elem else 'N/A'
                    all_properties.append({
                        'price': price,
                        'size': size,
                        'location': location_detail,
                        'bedrooms': bedrooms,
                        'bathrooms': bathrooms,
                        'property_type': property_type,
                        'source_url': url  # Link back to the source page
                    })
                except AttributeError as e:
                    print(f"Error parsing listing: {e}")
                    continue
            time.sleep(random.uniform(2, 5))  # Be polite, add random delay
        except requests.exceptions.RequestException as e:
            print(f"Request failed for page {page}: {e}")
            break
    print(f"Scraping completed. Found {len(all_properties)} properties.")
    return all_properties
# Example usage (would need to save to CSV for further steps)
# karachi_houses = scrape_zameen(location='Karachi', property_type='House', num_pages=3)
# print(json.dumps(karachi_houses[:2], indent=2)) # Print first 2 properties
Example of raw scraped data (JSON format):
[
{
"price": "PKR 3.5 Crore",
"size": "2 Kanal",
"location": "DHA Phase 6, Lahore",
"bedrooms": "5",
"bathrooms": "6",
"property_type": "House",
"source_url": "https://www.zameen.com/..."
},
{
"price": "PKR 1.2 Crore",
"size": "250 Sq. Yd.",
"location": "Scheme 33, Karachi",
"bedrooms": "3",
"bathrooms": "3",
"property_type": "House",
"source_url": "https://www.zameen.com/..."
}
]
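To persist records like these for Step 2, the scraped list of dicts can be written to properties.csv with the standard library. A sketch; the field names match the scraper above, and the sample record is illustrative:

```python
import csv

# Sample records in the same shape the scraper returns (values illustrative).
records = [
    {"price": "PKR 3.5 Crore", "size": "2 Kanal", "location": "DHA Phase 6, Lahore",
     "bedrooms": "5", "bathrooms": "6", "property_type": "House",
     "source_url": "https://www.zameen.com/..."},
]

fieldnames = ["price", "size", "location", "bedrooms", "bathrooms",
              "property_type", "source_url"]
with open("properties.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()   # Column names on the first row
    writer.writerows(records)
```

Appending to the same file across scraping runs (mode "a", writing the header only once) lets the dataset grow toward the 10,000+ properties targeted above.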
Step 2: Feature Engineering
Transforming raw scraped data into meaningful numerical features is crucial. This step involves cleaning, normalizing, and creating new features from existing ones.
import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import numpy as np
# Load data (assuming 'properties.csv' is generated from scraping and cleaned)
# Ensure columns like 'price', 'size', 'bedrooms', 'bathrooms', 'construction_year', 'distance_main_road', 'phase_dha' exist
df = pd.read_csv('properties.csv')
# --- Data Cleaning and Conversion ---
# Convert price to numeric (remove PKR, Crore, Lakh, etc.)
def clean_price(price_str):
    if isinstance(price_str, str):
        price_str = price_str.lower().replace('pkr', '').replace(',', '').strip()
        if 'crore' in price_str:
            return float(price_str.replace('crore', '').strip()) * 10_000_000
        elif 'lakh' in price_str:
            return float(price_str.replace('lakh', '').strip()) * 100_000
        else:
            try:
                return float(price_str)
            except ValueError:
                return np.nan
    return np.nan
df['price_numeric'] = df['price'].apply(clean_price)
df.dropna(subset=['price_numeric'], inplace=True) # Drop rows where price couldn't be parsed
# Convert size to a standard unit (e.g., sq ft). Handle Marla, Kanal, Sq. Yd.
def convert_size_to_sqft(size_str):
    if isinstance(size_str, str):
        size_str = size_str.lower().replace(',', '').strip()
        if 'marla' in size_str:
            value = float(size_str.replace('marla', '').strip())
            return value * 272.25  # 1 Marla = ~272.25 sq ft (varies by region)
        elif 'kanal' in size_str:
            value = float(size_str.replace('kanal', '').strip())
            return value * 5445  # 1 Kanal = 20 Marla = ~5445 sq ft
        elif 'sq. yd.' in size_str or 'square yards' in size_str:
            value = float(size_str.replace('sq. yd.', '').replace('square yards', '').strip())
            return value * 9  # 1 Sq. Yd. = 9 sq ft
        elif 'sq. ft.' in size_str or 'square feet' in size_str:
            return float(size_str.replace('sq. ft.', '').replace('square feet', '').strip())
        else:  # Assume sq ft if no unit is specified
            try:
                return float(size_str)
            except ValueError:
                return np.nan
    return np.nan
df['size_sqft'] = df['size'].apply(convert_size_to_sqft)
df.dropna(subset=['size_sqft'], inplace=True) # Drop rows where size couldn't be parsed
# --- Feature Engineering ---
# Assume 'construction_year' is a column in your CSV, 'distance_main_road', 'parking', 'garden', 'gym'
# For 'phase_dha', let's assume it's a categorical feature like 'DHA Phase 1', 'DHA Phase 5'
# If it's a numerical proximity score, it's simpler.
# Create 'amenities_count' (assuming binary 0/1 columns for parking, garden, gym)
amenity_cols = [c for c in ['parking', 'garden', 'gym'] if c in df.columns]
df['amenities_count'] = df[amenity_cols].fillna(0).sum(axis=1) if amenity_cols else 0
# Example for specific location features (e.g., for Lahore)
# Example location flags (substring checks on the scraped location text)
df['location'] = df['location'].fillna('')
df['is_dha_lahore'] = df['location'].apply(lambda x: 1 if ('dha' in x.lower() and 'lahore' in x.lower()) else 0)
df['is_bahria_town'] = df['location'].apply(lambda x: 1 if 'bahria town' in x.lower() else 0)
if 'distance_to_uni_km' in df.columns:  # Example: only if this column was collected
    df['proximity_to_university'] = df['distance_to_uni_km'].fillna(df['distance_to_uni_km'].mean())
# Define numerical and categorical features
numerical_features = ['size_sqft', 'bedrooms', 'bathrooms', 'age_years', 'distance_to_main_road_km', 'amenities_count', 'proximity_to_university']
categorical_features = ['city', 'property_type', 'is_dha_lahore', 'is_bahria_town'] # Example categorical features
# Ensure all required features are present, fill NaNs for numerical features if necessary
for col in numerical_features:
    if col not in df.columns:
        df[col] = 0  # Default if column not found, or use mean/median imputation
    df[col] = pd.to_numeric(df[col], errors='coerce')
    df[col] = df[col].fillna(df[col].mean())  # Impute NaNs with the mean of the coerced values
for col in categorical_features:
    if col not in df.columns:
        df[col] = 'Unknown'  # Default if column not found
# Create preprocessor for scaling numerical and one-hot encoding categorical features
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ])
# Apply preprocessing
X = preprocessor.fit_transform(df[numerical_features + categorical_features])
y = df['price_numeric']
# Now X and y are ready for model training.
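One more derived feature worth computing before training: price per square foot. It normalizes prices across property sizes and helps flag data-entry errors (for example, a Lakh figure entered as Crore). A sketch with two illustrative rows standing in for the cleaned DataFrame from above:

```python
import pandas as pd

# Two illustrative rows mirroring the cleaned columns from Step 2.
demo = pd.DataFrame({
    "price_numeric": [35_000_000.0, 12_000_000.0],  # PKR 3.5 Crore, 1.2 Crore
    "size_sqft": [10_890.0, 2_250.0],               # 2 Kanal, 250 sq. yd.
})
demo["price_per_sqft"] = demo["price_numeric"] / demo["size_sqft"]

# Listings far outside the locality's typical PKR/sq ft band are suspect.
median_pps = demo["price_per_sqft"].median()
demo["suspect"] = ((demo["price_per_sqft"] > 5 * median_pps) |
                   (demo["price_per_sqft"] < median_pps / 5))
print(demo[["price_per_sqft", "suspect"]])
```

Dropping or manually reviewing suspect rows before training typically tightens the model's error band noticeably.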
Common Features for Property Valuation in Pakistan:
| Feature Category | Specific Features (Examples) | Impact on Price |
|---|---|---|
| Location | DHA Phase (e.g., DHA Phase 5, Lahore), Bahria Town, F-sectors (Islamabad), Clifton (Karachi), Gulshan-e-Iqbal | High demand, better infrastructure, security, amenities |
| | Proximity to main roads, commercial hubs, universities, hospitals | Convenience, accessibility, rental potential |
| | Neighborhood crime rate, development plans | Safety, future appreciation |
| Property Specs | Size (sq ft, Marla, Kanal), Bedrooms, Bathrooms, Kitchens | Larger size and more rooms generally mean higher prices |
| | Construction Year (Age), Condition (New, Renovated) | Newer, well-maintained homes fetch higher prices |
| | Property Type (House, Apartment, Plot, Commercial) | Different market segments, varying demand and pricing |
| Amenities | Parking, Garden, Gym, Swimming Pool, Servant Quarters, Balcony | Adds value, improves living standards |
| | Gated community, 24/7 security, underground wiring | Premium for safety and modern infrastructure |
| Market Factors | Recent comparable sales in the area | Direct benchmark for market value |
| | Supply & demand in the specific locality | High demand and low supply drive prices up |
| | Economic indicators (interest rates, inflation, PKR value) | Broader market influence, investor sentiment |
Step 3: Train Model
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_percentage_error, mean_squared_error
import numpy as np
# Assuming X and y are already prepared from Step 2
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Random Forest Regressor
rf_model = RandomForestRegressor(n_estimators=100, max_depth=15, random_state=42, n_jobs=-1)
rf_model.fit(X_train, y_train)
# Gradient Boosting Regressor (often performs well on tabular data)
gb_model = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=5, random_state=42)
gb_model.fit(X_train, y_train)
# Evaluate Random Forest Model
rf_predictions = rf_model.predict(X_test)
rf_mape = mean_absolute_percentage_error(y_test, rf_predictions) * 100
rf_rmse = np.sqrt(mean_squared_error(y_test, rf_predictions))
print(f"Random Forest MAPE: {rf_mape:.1f}% (accuracy ~{100 - rf_mape:.1f}%)")
print(f"Random Forest RMSE: PKR {rf_rmse:,.0f}")
# Evaluate Gradient Boosting Model
gb_predictions = gb_model.predict(X_test)
gb_mape = mean_absolute_percentage_error(y_test, gb_predictions) * 100
gb_rmse = np.sqrt(mean_squared_error(y_test, gb_predictions))
print(f"Gradient Boosting MAPE: {gb_mape:.1f}% (accuracy ~{100 - gb_mape:.1f}%)")
print(f"Gradient Boosting RMSE: PKR {gb_rmse:,.0f}")
# For final deployment, choose the better-performing model, or ensemble both.
model = rf_model  # Keeping Random Forest for consistency with the rest of the lesson
# On a well-cleaned Zameen dataset, expect roughly 82-87% accuracy (13-18% MAPE)
# when evaluated on the held-out test set, as above.
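A single train/test split can be optimistic or pessimistic depending on which properties land in the test set. k-fold cross-validation gives a steadier error estimate; a sketch using synthetic data in place of the preprocessed X and y from Step 2:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the preprocessed features/prices from Step 2.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 5))
y_demo = 100 + X_demo @ np.array([3.0, -2.0, 1.5, 0.5, 4.0]) + rng.normal(scale=0.1, size=200)

# 5-fold CV: every property serves in the test fold exactly once.
model = RandomForestRegressor(n_estimators=50, random_state=42, n_jobs=-1)
scores = cross_val_score(model, X_demo, y_demo, cv=5,
                         scoring="neg_mean_absolute_percentage_error")
print(f"CV MAPE: {-scores.mean() * 100:.1f}% (+/- {scores.std() * 100:.1f}%)")
```

If the per-fold MAPE varies widely, the dataset is probably too small or too heterogeneous; stratifying by city or price band can help.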
ML Model Architecture Diagram:
+-----------------+ +-------------------+ +------------------+ +---------------------+
| Raw Zameen Data |----->| Data Preprocessing|----->| Feature |----->| Machine Learning |
| (HTML, CSV) | | (Cleaning, Parse) | | Engineering | | Model (Random Forest)|
+-----------------+ +-------------------+ | (Numerical, Cat) | | |
+------------------+ +----------+----------+
|
v
+---------------------+
| Property Valuation |
| (Estimated Price) |
+---------------------+
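The diagram's final stage, producing an estimated price for one property, can be wrapped in a helper that pushes a single row through the same fitted preprocessor and model. A self-contained sketch on toy data; the column names are assumed from Steps 2-3 and the toy prices are illustrative:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import OneHotEncoder, StandardScaler

num_cols, cat_cols = ["size_sqft", "bedrooms"], ["city"]

# Toy training data standing in for the cleaned Zameen dataset.
train = pd.DataFrame({
    "size_sqft": [2250.0, 5445.0, 10890.0, 3000.0],
    "bedrooms": [3, 4, 5, 3],
    "city": ["Karachi", "Lahore", "Lahore", "Islamabad"],
})
prices = pd.Series([12_000_000, 25_000_000, 48_000_000, 18_000_000])

pre = ColumnTransformer([
    ("num", StandardScaler(), num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])
model = RandomForestRegressor(n_estimators=50, random_state=42)
model.fit(pre.fit_transform(train), prices)

def estimate_price(property_row: dict) -> float:
    """Run one property through the identical preprocessing used in training."""
    X_row = pre.transform(pd.DataFrame([property_row]))
    return float(model.predict(X_row)[0])

est = estimate_price({"size_sqft": 5000.0, "bedrooms": 4, "city": "Lahore"})
print(f"Estimated price: PKR {est:,.0f}")
```

The key design point: transform a new property with the already-fitted preprocessor (transform, never fit_transform), so the scaling and one-hot columns match what the model was trained on.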
AI Valuation: Claude-Powered Analysis
A powerful complement to traditional ML: use Large Language Models (LLMs) like Claude to analyze comparable properties and provide nuanced, human-like reasoning. LLMs excel at understanding context, handling unstructured data (like property descriptions), and explaining their rationale, which is highly valuable for clients and agents.
from anthropic import Anthropic
import os
# Initialize Anthropic client (replace with your actual API key)
# client = Anthropic(api_key=os.environ.get("ANTHROPIC_API_KEY")) # Or directly provide
def fetch_comparables(location: str, size: float, bedrooms: int, num_comparables: int = 5):
    """
    Placeholder function to fetch comparable properties from your cleaned dataset.
    In a real scenario, this would query a database of scraped Zameen data:
    filter by location and a similar size range, match the bedroom count, then
    sort by proximity to the target property or recency of sale.
    """
    # Dummy data for demonstration
    comparable_data = [
        {"location": f"{location} - DHA Phase 5", "size_sqft": size * 1.05, "bedrooms": bedrooms, "bathrooms": bedrooms + 1, "age_years": 8, "amenities": "Garden, Parking, AC", "price_pkr": 48_000_000, "date_sold": "2023-11-15"},
        {"location": f"{location} - Model Town", "size_sqft": size * 0.98, "bedrooms": bedrooms, "bathrooms": bedrooms, "age_years": 12, "amenities": "Parking", "price_pkr": 32_000_000, "date_sold": "2023-10-28"},
        {"location": f"{location} - Cantt", "size_sqft": size * 1.10, "bedrooms": bedrooms + 1, "bathrooms": bedrooms + 2, "age_years": 5, "amenities": "Garden, Gym, Pool, Parking", "price_pkr": 65_000_000, "date_sold": "2023-12-01"},
        {"location": f"{location} - Gulberg", "size_sqft": size * 0.95, "bedrooms": bedrooms, "bathrooms": bedrooms - 1, "age_years": 18, "amenities": "Parking", "price_pkr": 38_000_000, "date_sold": "2023-09-20"},
        {"location": f"{location} - Johar Town", "size_sqft": size * 1.02, "bedrooms": bedrooms, "bathrooms": bedrooms, "age_years": 10, "amenities": "Parking, Store Room", "price_pkr": 30_000_000, "date_sold": "2023-11-05"},
    ]
    # Format comparables for the prompt
    formatted_comparables = ""
    for i, prop in enumerate(comparable_data[:num_comparables]):
        formatted_comparables += (
            f"  {i + 1}. Location: {prop['location']}, Size: {prop['size_sqft']:.0f} sq ft, "
            f"Beds: {prop['bedrooms']}, Baths: {prop['bathrooms']}, Age: {prop['age_years']} years, "
            f"Amenities: {prop['amenities']}, Sold Price: PKR {prop['price_pkr']:,.0f}, Sold Date: {prop['date_sold']}\n"
        )
    return formatted_comparables
def claude_valuation(property_details: dict):
    comparable_properties = fetch_comparables(
        location=property_details['location'],
        size=property_details['size'],
        bedrooms=property_details['bedrooms']
    )
    prompt = f"""
I'm valuing a property in {property_details['location']}:
- Property Type: {property_details.get('property_type', 'House')}
- Size: {property_details['size']} sq ft
- Bedrooms: {property_details['bedrooms']}
- Bathrooms: {property_details['bathrooms']}
- Age: {property_details['age']} years
- Amenities: {property_details['amenities']}
- Specific features: {property_details.get('specific_features', 'None')}

Here are 5 comparable properties recently sold in similar areas:
{comparable_properties}

Based on these comparable sales, current market trends in Pakistan (e.g., inflation, PKR devaluation, interest rates impacting real estate investment), and the specific location factors (e.g., security, proximity to commercial areas like Liberty Market in Lahore or Dolmen Mall in Karachi, educational institutions), estimate the fair market value of the target property in PKR.

Provide your estimated value as a single figure first, then explain your valuation reasoning step-by-step, highlighting key factors that influence the price up or down. Consider the condition of the property and its unique selling points.
"""
    # Assumes ANTHROPIC_API_KEY is set in your environment
    client = Anthropic(api_key=os.environ.get("ANTHROPIC_API_KEY"))
    response = client.messages.create(
        model="claude-3-opus-20240229",  # Substitute a current Claude model ID
        max_tokens=800,  # Allow room for a detailed explanation
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text
# Example usage:
# property_to_value = {
# 'location': 'DHA Phase 7, Lahore',
# 'property_type': 'House',
# 'size': 5000, # sq ft
# 'bedrooms': 5,
# 'bathrooms': 6,
# 'age': 7,
# 'amenities': 'Garden, Swimming Pool, Double Car Parking, Servant Quarter',
# 'specific_features': 'Corner plot, facing park'
# }
# valuation_report = claude_valuation(property_to_value)
# print(valuation_report)
Comparison: Traditional ML vs. LLM for Valuation
| Feature | Traditional ML Models (e.g., Random Forest) | LLM (e.g., Claude) |
|---|---|---|
| Data Type | Primarily structured, numerical data | Structured and unstructured text (descriptions, reviews) |
| Transparency | Feature importance can be analyzed, but rationale is implicit | Provides explicit, natural language reasoning and explanation |
| Flexibility | Requires re-training for new features/data types | Adapts to new information via prompt engineering |
| Contextualization | Limited to explicit features, struggles with nuance | Excellent at understanding complex context, local sentiment |
| Accuracy | High for quantifiable factors (typically 10-20% MAPE, i.e., 80-90% accuracy) | Can be very high, especially with good comparables and prompt design |
| Cost | Compute resources for training, less for inference | API calls (token usage) for inference |
| Best For | Large-scale automated valuations, quantitative analysis | Detailed, nuanced valuations, qualitative insights, client-facing reports |
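In practice the two approaches combine well: the ML model supplies a quantitative anchor, and the LLM reviews and explains it against comparables. A sketch of a prompt builder that injects the ML estimate; the function name and dict fields are illustrative:

```python
def build_hybrid_prompt(property_details: dict, ml_estimate_pkr: float,
                        comparables_text: str) -> str:
    """Combine the ML model's point estimate with comparables for LLM review."""
    return (
        f"A Random Forest model estimates this {property_details['location']} property "
        f"at PKR {ml_estimate_pkr:,.0f}.\n"
        f"Comparable sales:\n{comparables_text}\n"
        "Review the estimate against the comparables. State whether you would adjust it "
        "up or down, by how much, and why."
    )

# Example usage with illustrative values.
prompt = build_hybrid_prompt(
    {"location": "DHA Phase 7, Lahore"},
    42_500_000,
    "1. DHA Phase 5, 5,250 sq ft, sold PKR 48,000,000",
)
print(prompt)
```

The resulting prompt can be passed to client.messages.create exactly as in the claude_valuation function above, giving clients both a number and a defensible narrative.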
Zameen.pk API Integration
Zameen.pk provides limited official API access. This is primarily for large-scale commercial partners or real estate agencies. For individual developers or smaller startups, alternatives are often necessary.
Alternative strategies for data access or integration:
- Contact Zameen.pk for commercial partnership: If your project has significant potential, a direct partnership might open doors to their data.
- Use their Ads API: This is for managing listings, not typically for data extraction for valuation models.
- Partner with property data aggregators: Globally, companies like PropertyShark or CoreLogic exist. In Pakistan, dedicated property data aggregators are less common, but some real estate consultancies might offer data services.
- Leverage other public sources: Supplement Zameen.pk data with information from Graana.com, OLX Property, or local government land records (if accessible).
- Develop advanced scraping techniques: With careful ethical consideration, more sophisticated scraping (e.g., using headless browsers like Selenium) can gather richer data, but requires more maintenance.
Pakistan Example: Real Estate Valuation SaaS
Bilal, a developer in Karachi, builds "PropValue.pk"—an AI valuation tool for Pakistani property brokers and individual sellers. He leveraged the burgeoning freelance culture in Pakistan, hiring remote talent from platforms like Fiverr.com and Upwork for UI/UX design and supplementary data analysis.
PropValue.pk Architecture Diagram:
+--------------------+ +--------------------+ +-------------------+
| User (Broker/Seller)|----->| PropValue.pk Web App|----->| API Gateway |
| (Input Property | | (Frontend & Backend)| | |
| Details, Photos) | +----------+---------+ +--------+----------+
+--------------------+ | |
v v
+-----------+-----------+ +-----------+-----------+
| Zameen Data Lake | | LLM Service (Claude) |
| (Scraped Data, DB) | | (Valuation Reasoning) |
+-----------+-----------+ +-----------------------+
|
v
+-----------+-----------+
|   ML Model Service    |
|  (Price Prediction)   |
+-----------------------+
Lesson Summary
- Zameen.pk's public listings are a rich, free data source; scrape them ethically, respecting robots.txt and the terms of service.
- Feature engineering must handle Pakistani conventions: convert PKR Crore/Lakh prices and Marla/Kanal/square-yard sizes into consistent numeric units.
- Tree-based models (Random Forest, Gradient Boosting) can reach roughly 82-87% accuracy on a well-cleaned dataset; LLMs like Claude add comparable-sales reasoning and client-ready explanations.
- Official API access is limited, so partnerships, supplementary portals, and careful scraping are the practical data strategies.
AI Property Valuation Quiz
4 questions to test your understanding. Score 60% or higher to pass.