AI for Real Estate Pakistan — Module 1

1.2 AI Property Valuation — Zameen.pk Data Analysis with ChatGPT

30 min · 8 code blocks · Quiz (4Q)

AI Property Valuation — Zameen.pk

AI-powered property valuation can estimate prices with roughly 85% accuracy by analyzing 200+ factors: location, amenities, comparable sales, and market trends. This lesson teaches you to build valuation models using Zameen.pk data and AI. In Pakistan's dynamic real estate market, where prices can fluctuate significantly based on micro-local factors, AI offers a crucial edge, moving beyond traditional "gut-feel" valuations to data-driven insights. This is a game-changer for investors, real estate agents, and anyone buying or selling property in cities like Karachi, Lahore, and Islamabad.

Zameen.pk Data: Your Goldmine

Zameen.pk is Pakistan's largest property portal (50M+ visits/month). It contains a treasure trove of data vital for accurate valuations:

  • 500k+ active listings
  • Historical sale prices
  • Rental data
  • Property photos and attributes (e.g., plot size in Marla/Kanal, covered area in sq ft)
  • Broker information
  • Neighborhood demographics and amenities data

Accessing this data ethically and efficiently is the first step to building robust AI models. Always ensure you comply with Zameen.pk's terms of service and robots.txt file when scraping.
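Before any scraping, you can check a URL against a site's robots.txt programmatically. A minimal sketch using Python's standard library; the sample rules below are illustrative, not Zameen.pk's actual policy:

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())  # Parse rules from text instead of fetching
    return rp.can_fetch(user_agent, url)

# Illustrative rules only -- fetch the real file from https://www.zameen.com/robots.txt
sample_rules = """User-agent: *
Disallow: /private/
"""

print(is_allowed(sample_rules, "https://www.zameen.com/search/"))    # allowed
print(is_allowed(sample_rules, "https://www.zameen.com/private/x"))  # disallowed
```

In production you would point `RobotFileParser.set_url()` at the live robots.txt and call `read()` instead of parsing a string.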

Data Scraping Flow Diagram:

code
+----------------+       +-------------------+       +---------------------+
| Zameen.pk      |       | Web Scraper       |       | Data Storage (CSV)  |
| (HTML Pages)   |------>| (Python Requests, |------>| (Price, Size, Loc,  |
|                |       |  BeautifulSoup)   |       | Amenities, Type)    |
+----------------+       +-------------------+       +---------------------+
       ^                                                      |
       | (Pagination &                                        |
       |  Error Handling)                                     |
       +------------------------------------------------------+

Free data access strategies:

  1. Scrape publicly available listings (ethical, within terms of service). Focus on specific cities or property types to manage data volume.
  2. Parse property attributes (price, size, location, amenities, number of bedrooms/bathrooms, construction year, etc.). Pay attention to local units like Marla, Kanal, or square yards which are common in Pakistan.
  3. Build dataset of 10,000+ properties. The larger and more diverse your dataset, the more accurate your model will be.
  4. Train AI valuation model.
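Building a 10,000+ property dataset usually spans many scraping runs, so it helps to accumulate results incrementally. A sketch, assuming the column names produced by the Step 1 scraper (`price`, `size`, `location`) and a hypothetical `properties.csv` path:

```python
import pandas as pd

def append_listings(new_rows: list, csv_path: str = "properties.csv") -> pd.DataFrame:
    """Append freshly scraped rows to the growing dataset, dropping duplicates."""
    new_df = pd.DataFrame(new_rows)
    try:
        existing = pd.read_csv(csv_path)
        combined = pd.concat([existing, new_df], ignore_index=True)
    except FileNotFoundError:
        combined = new_df  # First run: nothing to merge yet
    # Treat identical price + size + location as the same listing (an assumption;
    # a listing ID from the page URL would be a more robust key)
    combined = combined.drop_duplicates(subset=["price", "size", "location"])
    combined.to_csv(csv_path, index=False)
    return combined
```

Calling this after each `scrape_zameen()` run keeps `properties.csv` deduplicated as it grows.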

Comparison of Pakistani Property Portals:

| Feature | Zameen.pk | Graana.com | OLX Property (part of OLX Pakistan) |
| --- | --- | --- | --- |
| Market Share | Largest, dominant | Growing, strong presence in Islamabad | Significant, broader classifieds reach |
| Listings Volume | 500k+ active listings | ~100k+ active listings | Millions of general listings, incl. property |
| Data Depth | Rich historical, rental, agent data | Focus on new developments, project details | Basic property details, user-generated |
| API Access | Limited, commercial partnerships | Limited, direct inquiries | None specific for property data aggregation |
| Target Audience | Agents, buyers, sellers, investors | Developers, high-end buyers, investors | General public, individual buyers/sellers |

Building Property Valuation Model

Step 1: Data Collection

The provided Python snippet is a good start. For real-world scraping, you'll need to handle pagination, rotate user agents, add delays, and parse more detailed attributes.

python
import requests
from bs4 import BeautifulSoup
import time
import random
import json

def scrape_zameen(location: str, property_type: str, num_pages: int = 2):
    all_properties = []
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
    }

    print(f"Starting scrape for {property_type} in {location}...")
    for page in range(1, num_pages + 1):
        url = f"https://www.zameen.com/search/?city={location}&purpose=sale&type={property_type}&page={page}"
        print(f"Scraping page {page}: {url}")
        try:
            response = requests.get(url, headers=headers, timeout=10)
            response.raise_for_status() # Raise an exception for HTTP errors
            soup = BeautifulSoup(response.content, 'html.parser')

            # Zameen.pk's structure changes; this is a simplified example.
            # You'd need to inspect the current HTML structure for precise class names.
            # This targets a common card structure found in search results.
            listings = soup.find_all('li', class_='_391a8e53') # Example class, inspect Zameen.pk

            if not listings:
                print(f"No listings found on page {page} or structure changed. Stopping.")
                break

            for listing in listings:
                try:
                    price_elem = listing.find('span', class_='_9d05e325') # Example class
                    price = price_elem.text.strip() if price_elem else 'N/A'

                    size_elem = listing.find('span', class_='_826477e2') # Example class
                    size = size_elem.text.strip() if size_elem else 'N/A'

                    location_elem = listing.find('div', class_='_162e6469') # Example class
                    location_detail = location_elem.text.strip() if location_elem else 'N/A'
                    
                    bedrooms_elem = listing.find('span', {'aria-label': 'Beds'}) # Example
                    bedrooms = bedrooms_elem.text.strip() if bedrooms_elem else 'N/A'

                    bathrooms_elem = listing.find('span', {'aria-label': 'Baths'}) # Example
                    bathrooms = bathrooms_elem.text.strip() if bathrooms_elem else 'N/A'

                    all_properties.append({
                        'price': price,
                        'size': size,
                        'location': location_detail,
                        'bedrooms': bedrooms,
                        'bathrooms': bathrooms,
                        'property_type': property_type,
                        'source_url': url # Link back to the source page
                    })
                except AttributeError as e:
                    print(f"Error parsing listing: {e}")
                    continue
            time.sleep(random.uniform(2, 5)) # Be polite, add random delay
        except requests.exceptions.RequestException as e:
            print(f"Request failed for page {page}: {e}")
            break
    
    print(f"Scraping completed. Found {len(all_properties)} properties.")
    return all_properties

# Example usage (would need to save to CSV for further steps)
# karachi_houses = scrape_zameen(location='Karachi', property_type='House', num_pages=3)
# print(json.dumps(karachi_houses[:2], indent=2)) # Print first 2 properties

Example of raw scraped data (JSON format):

json
[
  {
    "price": "PKR 3.5 Crore",
    "size": "2 Kanal",
    "location": "DHA Phase 6, Lahore",
    "bedrooms": "5",
    "bathrooms": "6",
    "property_type": "House",
    "source_url": "https://www.zameen.com/..."
  },
  {
    "price": "PKR 1.2 Crore",
    "size": "250 Sq. Yd.",
    "location": "Scheme 33, Karachi",
    "bedrooms": "3",
    "bathrooms": "3",
    "property_type": "House",
    "source_url": "https://www.zameen.com/..."
  }
]
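Raw records like the ones above often come back with missing fields (the scraper fills these with 'N/A'). A small validation helper, using the same field names as the scraper's output, can flag incomplete records before cleaning:

```python
def validate_record(rec: dict) -> list:
    """Return the names of required fields that are missing or 'N/A' in a scraped record."""
    required = ["price", "size", "location", "bedrooms", "bathrooms"]
    return [f for f in required if rec.get(f) in (None, "", "N/A")]

# A record with a missing size would be flagged for review or imputation
sample = {"price": "PKR 1.2 Crore", "size": "N/A", "location": "Scheme 33, Karachi",
          "bedrooms": "3", "bathrooms": "3"}
print(validate_record(sample))  # ['size']
```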

Step 2: Feature Engineering

Transforming raw scraped data into meaningful numerical features is crucial. This step involves cleaning, normalizing, and creating new features from existing ones.

python
import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import numpy as np

# Load data (assuming 'properties.csv' is generated from scraping and cleaned)
# Ensure columns like 'price', 'size', 'bedrooms', 'bathrooms', 'construction_year', 'distance_main_road', 'phase_dha' exist
df = pd.read_csv('properties.csv')

# --- Data Cleaning and Conversion ---
# Convert price to numeric (remove PKR, Crore, Lakh, etc.)
def clean_price(price_str):
    if isinstance(price_str, str):
        price_str = price_str.lower().replace('pkr', '').replace(',', '').strip()
        if 'crore' in price_str:
            return float(price_str.replace('crore', '').strip()) * 10_000_000
        elif 'lakh' in price_str:
            return float(price_str.replace('lakh', '').strip()) * 100_000
        else:
            try:
                return float(price_str)
            except ValueError:
                return np.nan
    return np.nan

df['price_numeric'] = df['price'].apply(clean_price)
df.dropna(subset=['price_numeric'], inplace=True) # Drop rows where price couldn't be parsed

# Convert size to a standard unit (e.g., sq ft). Handle Marla, Kanal, Sq. Yd.
def convert_size_to_sqft(size_str):
    if isinstance(size_str, str):
        size_str = size_str.lower().replace(',', '').strip()
        if 'marla' in size_str:
            value = float(size_str.replace('marla', '').strip())
            return value * 272.25 # 1 Marla = ~272.25 sq ft
        elif 'kanal' in size_str:
            value = float(size_str.replace('kanal', '').strip())
            return value * 5445 # 1 Kanal = ~5445 sq ft (20 Marla)
        elif 'sq. yd.' in size_str or 'square yards' in size_str:
            value = float(size_str.replace('sq. yd.', '').replace('square yards', '').strip())
            return value * 9 # 1 Sq. Yd. = 9 sq ft
        elif 'sq. ft.' in size_str or 'square feet' in size_str:
            return float(size_str.replace('sq. ft.', '').replace('square feet', '').strip())
        else: # Assume default is sq ft if no unit specified, or handle as NaN
            try:
                return float(size_str)
            except ValueError:
                return np.nan
    return np.nan

df['size_sqft'] = df['size'].apply(convert_size_to_sqft)
df.dropna(subset=['size_sqft'], inplace=True) # Drop rows where size couldn't be parsed

# --- Feature Engineering ---
# Assume 'construction_year' is a column in your CSV, 'distance_main_road', 'parking', 'garden', 'gym'
# For 'phase_dha', let's assume it's a categorical feature like 'DHA Phase 1', 'DHA Phase 5'
# If it's a numerical proximity score, it's simpler.

# Create 'amenities_count' (assuming binary columns for parking, garden, gym)
# df['amenities_count'] = df['parking'].fillna(0) + df['garden'].fillna(0) + df['gym'].fillna(0)

# Example for specific location features (e.g., for Lahore)
df['is_dha_lahore'] = df['location'].apply(lambda x: 1 if 'DHA Lahore' in str(x) else 0)
df['is_bahria_town'] = df['location'].apply(lambda x: 1 if 'Bahria Town' in str(x) else 0)
if 'distance_to_uni_km' in df.columns:  # Example column; only derive it if it was scraped
    df['proximity_to_university'] = df['distance_to_uni_km'].fillna(df['distance_to_uni_km'].mean())

# Define numerical and categorical features
numerical_features = ['size_sqft', 'bedrooms', 'bathrooms', 'age_years', 'distance_to_main_road_km', 'amenities_count', 'proximity_to_university']
categorical_features = ['city', 'property_type', 'is_dha_lahore', 'is_bahria_town']  # Example categorical features

# Ensure all required features are present; impute missing numerical values
for col in numerical_features:
    if col not in df.columns:
        df[col] = 0  # Default if column not found, or use mean/median imputation
    coerced = pd.to_numeric(df[col], errors='coerce')
    df[col] = coerced.fillna(coerced.mean())  # Impute NaNs with the mean of the coerced column

for col in categorical_features:
    if col not in df.columns:
        df[col] = 'Unknown' # Default if column not found

# Create preprocessor for scaling numerical and one-hot encoding categorical features
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ])

# Apply preprocessing
X = preprocessor.fit_transform(df[numerical_features + categorical_features])
y = df['price_numeric']

# Now X and y are ready for model training.

Common Features for Property Valuation in Pakistan:

| Feature Category | Specific Features (Examples) | Impact on Price |
| --- | --- | --- |
| Location | DHA Phase (e.g., DHA Phase 5, Lahore), Bahria Town, F-sectors (Islamabad), Clifton (Karachi), Gulshan-e-Iqbal | High demand, better infrastructure, security, amenities |
| | Proximity to main roads, commercial hubs, universities, hospitals | Convenience, accessibility, rental potential |
| | Neighborhood crime rate, development plans | Safety, future appreciation |
| Property Specs | Size (sq ft, Marla, Kanal), Bedrooms, Bathrooms, Kitchens | Larger, more rooms generally higher price |
| | Construction Year (Age), Condition (New, Renovated) | Newer, well-maintained homes fetch higher prices |
| | Property Type (House, Apartment, Plot, Commercial) | Different market segments, varying demand and pricing |
| Amenities | Parking, Garden, Gym, Swimming Pool, Servant Quarters, Balcony | Adds value, improves living standards |
| | Gated community, 24/7 security, underground wiring | Premium for safety and modern infrastructure |
| Market Factors | Recent comparable sales in the area | Direct benchmark for market value |
| | Supply & demand in the specific locality | High demand, low supply drives prices up |
| | Economic indicators (interest rates, inflation, PKR value) | Broader market influence, investor sentiment |
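One derived feature worth adding from the table above is price per square foot, the standard comparable benchmark in the "Market Factors" row. A sketch on a hypothetical mini-dataset (column names follow Step 2's `price_numeric` and `size_sqft`):

```python
import pandas as pd

# Hypothetical cleaned listings, as produced by Step 2
df = pd.DataFrame({
    "location": ["DHA Phase 6, Lahore", "DHA Phase 6, Lahore", "Scheme 33, Karachi"],
    "price_numeric": [35_000_000.0, 42_000_000.0, 12_000_000.0],
    "size_sqft": [10_890.0, 10_890.0, 2_250.0],
})

df["price_per_sqft"] = df["price_numeric"] / df["size_sqft"]

# Median price per sq ft by locality: a quick comparables benchmark and a
# useful way to flag over- or under-priced listings
benchmark = df.groupby("location")["price_per_sqft"].median()
print(benchmark)
```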

Step 3: Train Model

python
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_percentage_error, mean_squared_error
import numpy as np

# Assuming X and y are already prepared from Step 2
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Random Forest Regressor
rf_model = RandomForestRegressor(n_estimators=100, max_depth=15, random_state=42, n_jobs=-1)
rf_model.fit(X_train, y_train)

# Gradient Boosting Regressor (often performs well on tabular data)
gb_model = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=5, random_state=42)
gb_model.fit(X_train, y_train)

# Evaluate Random Forest Model
rf_predictions = rf_model.predict(X_test)
rf_mape = mean_absolute_percentage_error(y_test, rf_predictions) * 100
rf_rmse = np.sqrt(mean_squared_error(y_test, rf_predictions))

print(f"Random Forest accuracy (100 - MAPE): {100 - rf_mape:.1f}%")
print(f"Random Forest RMSE: PKR {rf_rmse:,.0f}")

# Evaluate Gradient Boosting Model
gb_predictions = gb_model.predict(X_test)
gb_mape = mean_absolute_percentage_error(y_test, gb_predictions) * 100
gb_rmse = np.sqrt(mean_squared_error(y_test, gb_predictions))

print(f"Gradient Boosting accuracy (100 - MAPE): {100 - gb_mape:.1f}%")
print(f"Gradient Boosting RMSE: PKR {gb_rmse:,.0f}")

# For deployment, keep the better-performing model (or ensemble the two)
model = rf_model if rf_mape <= gb_mape else gb_model
# On a clean, sufficiently large dataset, expect test-set accuracy
# (100 - MAPE) in the 82-87% range.
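A single train/test split can be optimistic or pessimistic by chance, so cross-validation gives a steadier accuracy estimate. A sketch on synthetic stand-in data (the real `X` and `y` come from Step 2):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in: 200 "properties" with 5 preprocessed features, prices
# around PKR 10 million driven mainly by the first two features
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 5))
y_demo = (10_000_000
          + 2_000_000 * X_demo[:, 0]
          - 1_000_000 * X_demo[:, 1]
          + rng.normal(scale=200_000, size=200))

model = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)

# 5-fold cross-validation; sklearn returns negated MAPE (higher is better)
scores = cross_val_score(model, X_demo, y_demo, cv=5,
                         scoring="neg_mean_absolute_percentage_error")
print(f"Cross-validated MAPE: {-scores.mean():.2%}")
```

With the real Zameen dataset, substitute the `X` and `y` produced in Step 2 for `X_demo` and `y_demo`.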

ML Model Architecture Diagram:

code
+-----------------+      +-------------------+      +------------------+      +---------------------+
| Raw Zameen Data |----->| Data Preprocessing|----->| Feature          |----->| Machine Learning    |
| (HTML, CSV)     |      | (Cleaning, Parse) |      | Engineering      |      | Model (Random Forest)|
+-----------------+      +-------------------+      | (Numerical, Cat) |      |                     |
                                                     +------------------+      +----------+----------+
                                                                                          |
                                                                                          v
                                                                             +---------------------+
                                                                             | Property Valuation  |
                                                                             | (Estimated Price)   |
                                                                             +---------------------+
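For the model to serve valuations as in the diagram's final stage, it must be persisted and reloaded by the serving layer. A sketch using `joblib`, with a small stand-in model and an illustrative file name:

```python
import numpy as np
import joblib
from sklearn.ensemble import RandomForestRegressor

# Fit a tiny stand-in model (the real one comes from Step 3)
X_demo = np.random.rand(50, 3)
y_demo = X_demo @ np.array([1.0, 2.0, 3.0])
model = RandomForestRegressor(n_estimators=10, random_state=0).fit(X_demo, y_demo)

# Persist the model; in practice, save the Step 2 preprocessor alongside it
# so incoming requests can be transformed identically at inference time
joblib.dump(model, "valuation_model.joblib")
loaded = joblib.load("valuation_model.joblib")

# The reloaded model reproduces the original predictions exactly
assert np.allclose(loaded.predict(X_demo), model.predict(X_demo))
```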

AI Valuation: Claude-Powered Analysis

A powerful complement to traditional ML: use Large Language Models (LLMs) like Claude to analyze comparable properties and provide nuanced, human-like reasoning. LLMs excel at understanding context, handling unstructured data (like property descriptions), and explaining their rationale, which is highly valuable for clients and agents.

python
from anthropic import Anthropic
import os

# Initialize Anthropic client (replace with your actual API key)
# client = Anthropic(api_key=os.environ.get("ANTHROPIC_API_KEY")) # Or directly provide

def fetch_comparables(location: str, size: float, bedrooms: int, num_comparables: int = 5):
    """
    Placeholder function to fetch comparable properties from your cleaned dataset.
    In a real scenario, this would query a database of scraped Zameen data.
    """
    # This is a simplified example. You'd implement a proper search logic here.
    # For instance, filter properties by location, similar size range, same number of bedrooms.
    # Then sort by proximity to target property or recent sale date.
    
    # Dummy data for demonstration
    comparable_data = [
        {"location": f"{location} - DHA Phase 5", "size_sqft": size * 1.05, "bedrooms": bedrooms, "bathrooms": bedrooms+1, "age_years": 8, "amenities": "Garden, Parking, AC", "price_pkr": 48_000_000, "date_sold": "2023-11-15"},
        {"location": f"{location} - Model Town", "size_sqft": size * 0.98, "bedrooms": bedrooms, "bathrooms": bedrooms, "age_years": 12, "amenities": "Parking", "price_pkr": 32_000_000, "date_sold": "2023-10-28"},
        {"location": f"{location} - Cantt", "size_sqft": size * 1.10, "bedrooms": bedrooms+1, "bathrooms": bedrooms+2, "age_years": 5, "amenities": "Garden, Gym, Pool, Parking", "price_pkr": 65_000_000, "date_sold": "2023-12-01"},
        {"location": f"{location} - Gulberg", "size_sqft": size * 0.95, "bedrooms": bedrooms, "bathrooms": bedrooms-1, "age_years": 18, "amenities": "Parking", "price_pkr": 38_000_000, "date_sold": "2023-09-20"},
        {"location": f"{location} - Johar Town", "size_sqft": size * 1.02, "bedrooms": bedrooms, "bathrooms": bedrooms, "age_years": 10, "amenities": "Parking, Store Room", "price_pkr": 30_000_000, "date_sold": "2023-11-05"},
    ]
    
    # Format comparables for prompt
    formatted_comparables = ""
    for i, prop in enumerate(comparable_data[:num_comparables]):
        formatted_comparables += (
            f"  {i+1}. Location: {prop['location']}, Size: {prop['size_sqft']} sq ft, "
            f"Beds: {prop['bedrooms']}, Baths: {prop['bathrooms']}, Age: {prop['age_years']} years, "
            f"Amenities: {prop['amenities']}, Sold Price: PKR {prop['price_pkr']:,.0f}, Sold Date: {prop['date_sold']}\n"
        )
    return formatted_comparables

def claude_valuation(property_details: dict):
    comparable_properties = fetch_comparables(
        location=property_details['location'],
        size=property_details['size'],
        bedrooms=property_details['bedrooms']
    )

    prompt = f"""
    I'm valuing a property in {property_details['location']}:
    - Property Type: {property_details.get('property_type', 'House')}
    - Size: {property_details['size']} sq ft
    - Bedrooms: {property_details['bedrooms']}
    - Bathrooms: {property_details['bathrooms']}
    - Age: {property_details['age']} years
    - Amenities: {property_details['amenities']}
    - Specific features: {property_details.get('specific_features', 'None')}

    Here are 5 comparable properties recently sold in similar areas:
    {comparable_properties}

    Based on these comparable sales, current market trends in Pakistan (e.g., inflation, PKR devaluation, interest rates impacting real estate investment), and the specific location factors (e.g., security, proximity to commercial areas like Liberty Market in Lahore or Dolmen Mall in Karachi, educational institutions), estimate the fair market value of the target property in PKR.
    
    Provide your estimated value as a single figure first, then explain your valuation reasoning step-by-step, highlighting key factors that influence the price up or down. Consider the condition of the property and its unique selling points.
    """

    # Ensure client is initialized before calling
    client = Anthropic(api_key=os.environ.get("ANTHROPIC_API_KEY")) # Re-initialize for demonstration if not global
    
    response = client.messages.create(
        model="claude-3-opus-20240229", # Using a more recent model
        max_tokens=800, # Increased max tokens for detailed explanation
        messages=[{"role": "user", "content": prompt}]
    )

    return response.content[0].text

# Example usage:
# property_to_value = {
#     'location': 'DHA Phase 7, Lahore',
#     'property_type': 'House',
#     'size': 5000, # sq ft
#     'bedrooms': 5,
#     'bathrooms': 6,
#     'age': 7,
#     'amenities': 'Garden, Swimming Pool, Double Car Parking, Servant Quarter',
#     'specific_features': 'Corner plot, facing park'
# }
# valuation_report = claude_valuation(property_to_value)
# print(valuation_report)
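The prompt asks Claude to lead with a single figure, but in practice you still need to parse that figure out of the free-text reply. A hedged helper; the phrasing patterns below are assumptions about typical model output, not guarantees:

```python
import re

def extract_pkr_estimate(text: str):
    """Pull the first PKR figure (supporting Crore/Lakh shorthand) from an LLM reply.

    Returns the value in rupees, or None if no figure is found.
    """
    m = re.search(r"PKR\s*([\d,.]+)\s*(crore|lakh)?", text, re.IGNORECASE)
    if not m:
        return None
    value = float(m.group(1).replace(",", ""))
    unit = (m.group(2) or "").lower()
    if unit == "crore":
        value *= 10_000_000   # 1 Crore = 10 million
    elif unit == "lakh":
        value *= 100_000      # 1 Lakh = 100 thousand
    return value

print(extract_pkr_estimate("Estimated fair market value: PKR 4.2 Crore."))
```

A more robust alternative is to instruct the model to return structured JSON and parse that instead of free text.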

Comparison: Traditional ML vs. LLM for Valuation

| Feature | Traditional ML Models (e.g., Random Forest) | LLM (e.g., Claude) |
| --- | --- | --- |
| Data Type | Primarily structured, numerical data | Structured and unstructured text (descriptions, reviews) |
| Transparency | Feature importance can be analyzed, but rationale is implicit | Provides explicit, natural language reasoning and explanation |
| Flexibility | Requires re-training for new features/data types | Adapts to new information via prompt engineering |
| Contextualization | Limited to explicit features, struggles with nuance | Excellent at understanding complex context, local sentiment |
| Accuracy | High for quantifiable factors (MAPE of 10-20%, i.e., 80-90% accuracy) | Can be very high, especially with good comparables and prompt design |
| Cost | Compute resources for training, less for inference | API calls (token usage) for inference |
| Best For | Large-scale automated valuations, quantitative analysis | Detailed, nuanced valuations, qualitative insights, client-facing reports |
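One practical takeaway from this comparison is that the two approaches can be blended: a weighted average of the ML point estimate and the LLM-derived figure. A minimal sketch; the 0.3 weight is an illustrative assumption, not a tuned value:

```python
def hybrid_valuation(ml_estimate: float, llm_estimate: float,
                     llm_weight: float = 0.3) -> float:
    """Blend a traditional ML prediction with an LLM-derived estimate.

    The default weight of 0.3 is illustrative only; in practice you would
    calibrate it against held-out sale prices.
    """
    return (1 - llm_weight) * ml_estimate + llm_weight * llm_estimate

# ML says PKR 4.0 Crore, the LLM says PKR 5.0 Crore
print(f"PKR {hybrid_valuation(40_000_000, 50_000_000):,.0f}")
```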

Zameen.pk API Integration

Zameen.pk provides limited official API access. This is primarily for large-scale commercial partners or real estate agencies. For individual developers or smaller startups, alternatives are often necessary.

Alternative strategies for data access or integration:

  • Contact Zameen.pk for commercial partnership: If your project has significant potential, a direct partnership might open doors to their data.
  • Use their Ads API: This is for managing listings, not typically for data extraction for valuation models.
  • Partner with property data aggregators: Globally, companies like PropertyShark or CoreLogic exist. In Pakistan, dedicated property data aggregators are less common, but some real estate consultancies might offer data services.
  • Leverage other public sources: Supplement Zameen.pk data with information from Graana.com, OLX Property, or local government land records (if accessible).
  • Develop advanced scraping techniques: With careful ethical consideration, more sophisticated scraping (e.g., using headless browsers like Selenium) can gather richer data, but requires more maintenance.
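For the "leverage other public sources" strategy above, listings from several portals can be merged into one training dataset. A sketch, assuming each portal's data has already been normalized to the Step 2 column schema (`location`, `size_sqft`, etc.):

```python
import pandas as pd

def merge_portal_data(frames: dict) -> pd.DataFrame:
    """Combine listings from several portals, tagging each row with its source.

    `frames` maps a portal name (e.g. 'zameen', 'graana') to a DataFrame that
    already uses the common column schema from Step 2.
    """
    tagged = [df.assign(source=name) for name, df in frames.items()]
    combined = pd.concat(tagged, ignore_index=True)
    # Heuristic: identical location + size across portals usually means the
    # same property listed twice (an assumption worth verifying on real data)
    return combined.drop_duplicates(subset=["location", "size_sqft"])

zameen = pd.DataFrame({"location": ["DHA Phase 5", "Gulberg"], "size_sqft": [2722.5, 2250.0]})
graana = pd.DataFrame({"location": ["Gulberg", "Model Town"], "size_sqft": [2250.0, 4500.0]})
merged = merge_portal_data({"zameen": zameen, "graana": graana})
print(merged)
```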

Pakistan Example: Real Estate Valuation SaaS

Bilal, a developer in Karachi, builds "PropValue.pk"—an AI valuation tool for Pakistani property brokers and individual sellers. He leveraged the burgeoning freelance culture in Pakistan, hiring remote talent from platforms like Fiverr.com and Upwork for UI/UX design and supplementary data analysis.

PropValue.pk Architecture Diagram:

code
+--------------------+      +--------------------+      +-------------------+
| User (Broker/Seller)|----->| PropValue.pk Web App|----->| API Gateway       |
| (Input Property    |      | (Frontend & Backend)|      |                   |
|  Details, Photos)  |      +----------+---------+      +--------+----------+
+--------------------+                 |                           |
                                       v                           v
                           +-----------+-----------+     +-----------+-----------+
                           | Zameen Data Lake      |     | LLM Service (Claude)  |
                           | (Scraped Data, DB)    |     | (Valuation Reasoning) |
                           +-----------+-----------+     +-----------------------+
                                       |
                                       v
                           +-----------+-----------+
                           | ML Model Service      |
                           +-----------------------+

Lesson Summary

8 runnable code examples · 4-question knowledge check below

AI Property Valuation Quiz

4 questions to test your understanding. Score 60% or higher to pass.