Data Darbar — Pakistan in Numbers

The Project

Data Darbar is an interactive data explorer for Pakistan, bringing together district-level statistics from multiple official sources published by the Pakistan Bureau of Statistics (PBS).

The platform visualises demographic, economic, education, and social indicators across all districts of Pakistan, enabling researchers, policymakers, journalists, and citizens to explore spatial patterns and changes over time. The name "Data Darbar" is a nod to the famous shrine in Lahore — reimagined here as a place of gathering for Pakistan's data.

Built By

Data Darbar was created by Hiba Sameen, an economist and data scientist based in London, UK. Hiba holds a PhD in Economics and has worked on economic policy across academia, government, and think tanks. She is also interested in data science and data engineering, with a focus on making public data more accessible and useful for evidence-based decision-making. The project grew out of frustration with how difficult it is to explore and compare district-level statistics from PBS publications, which are often buried in PDF tables and scattered across multiple reports.

Data Sources

Population & Housing Census 2017 and 2023 — Tables 1, 5, 12, 15, 16 and education/employment breakdowns
PSLM 2019-20 — Pakistan Social and Living Standards Measurement Survey (district-level microdata aggregates)
LFS 2020-21 & 2024-25 — Labour Force Survey (employment, LFPR, industry breakdown)
HIES 2024-25 — Household Integrated Economic Survey (expenditure, food security, housing)
Economic Census 2023 — Establishment counts, workforce size, industry composition

Coverage

The map covers 141 districts across all provinces and territories: Punjab, Sindh, Khyber Pakhtunkhwa, Balochistan, Islamabad Capital Territory, Azad Jammu & Kashmir, Gilgit-Baltistan, and the former Federally Administered Tribal Areas (FATA). Of these, 124 districts currently have matched data; the remaining 17 (primarily in AJK and Gilgit-Baltistan) appear on the map but are not covered by the PBS datasets used.

Data Pipeline

Raw data is sourced from official PBS publications including PDF tables (parsed programmatically), CSV releases, and survey microdata files. A Python ETL pipeline (build_dataset.py, ~1,800 lines) parses, cleans, and normalises all sources into a single unified JSON file (districts.json) keyed by normalised district name.

District Name Matching & Crosswalk

One of the core challenges in working with Pakistani administrative data is the inconsistency in district naming across sources. The same district may appear as "D.G. Khan", "Dera Ghazi Khan", "DG Khan", or "D. G. Khan" depending on the publication. The pipeline addresses this through a two-step process:

1. Normalisation: All district names are lowercased, stripped of punctuation, and collapsed to a canonical form (e.g. "d.g. khan" and "dera ghazi khan" both become "dera ghazi khan").

2. Crosswalk table: A manually curated lookup of over 1,800 name variants maps alternative spellings, abbreviations, and historical names to a single canonical name that matches the GeoJSON boundary file. Examples include "Abbotabad" → "Abbottabad", "Naushero Feroze" → "Naushahro Firoz", and spelling variants of "Mianwali".
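The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not the code in build_dataset.py: the function names, the tiny CROSSWALK excerpt, and the choice to expand abbreviations in the crosswalk rather than in the normaliser are all assumptions made for the example. Canonical names are kept lowercase here for simplicity.

```python
import re

# Hypothetical crosswalk excerpt: normalised variant -> canonical name.
# Illustrative entries only; the real table is manually curated.
CROSSWALK = {
    "dg khan": "dera ghazi khan",
    "d g khan": "dera ghazi khan",
    "abbotabad": "abbottabad",
    "naushero feroze": "naushahro firoz",
}


def normalise(name: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    lowered = name.lower()
    no_punct = re.sub(r"[^\w\s]", " ", lowered)
    return re.sub(r"\s+", " ", no_punct).strip()


def canonical_name(name: str) -> str:
    """Normalise a raw name, then map known variants to the canonical key."""
    key = normalise(name)
    return CROSSWALK.get(key, key)


variants = ["D.G. Khan", "Dera Ghazi Khan", "DG Khan", "D. G. Khan"]
canonical = {canonical_name(v) for v in variants}
```

With this excerpt, all four spellings of Dera Ghazi Khan collapse to one key, so records from different publications can be joined on it.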

Multi-District Aggregation

Several districts in the census are reported at sub-district level but appear as a single polygon in the boundary data. The pipeline aggregates these automatically:

Karachi — 7 sub-districts (Central, East, West, South, Malir, Korangi, Kemari) summed into one
Kohistan — Upper and Lower Kohistan merged
Chitral — Upper and Lower Chitral merged

For count-based indicators (population, literate count, establishments), sub-district values are summed. For rate-based indicators (literacy ratio, unemployment rate), the pipeline recomputes the rate from aggregated numerator and denominator counts rather than averaging percentages, which would wrongly give sub-districts of very different sizes equal weight.
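The difference between pooling counts and averaging percentages is easiest to see with numbers. The sketch below uses made-up counts for two hypothetical sub-districts; the field names (population_10_plus, literate) are assumptions for illustration and may not match the pipeline's own keys.

```python
def aggregate_subdistricts(records):
    """Sum counts across sub-districts and recompute the rate from the totals."""
    total_population = sum(r["population_10_plus"] for r in records)
    total_literate = sum(r["literate"] for r in records)
    rate = 100.0 * total_literate / total_population if total_population else None
    return {
        "population_10_plus": total_population,
        "literate": total_literate,
        "literacy_rate": rate,
    }


# Made-up figures: a small, highly literate sub-district and a large,
# less literate one.
sub_districts = [
    {"population_10_plus": 1_000, "literate": 900},    # 90 percent
    {"population_10_plus": 9_000, "literate": 4_500},  # 50 percent
]

merged = aggregate_subdistricts(sub_districts)

# The naive alternative: an unweighted mean of the two percentages.
naive_average = sum(
    100.0 * r["literate"] / r["population_10_plus"] for r in sub_districts
) / len(sub_districts)
```

Here the pooled rate is 54 percent, while the unweighted average of the two percentages is 70 percent, because the average ignores that one sub-district is nine times larger than the other.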

Mapping Data to GeoJSON

The GeoJSON boundary file contains 141 district polygons sourced from publicly available shapefiles at the ADM2 administrative level. Each polygon has a districts property containing the official district name. The pipeline matches each data record to a GeoJSON feature by normalising both the data district name and the GeoJSON property name through the same crosswalk, then joining on the canonical key.

A match report is generated on each build showing how many data districts successfully mapped to GeoJSON features, which data districts had no matching boundary (CSV-only), and which GeoJSON features had no data (GeoJSON-only, typically AJK and Gilgit-Baltistan districts not covered by PBS surveys).
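A report of this kind reduces to three set operations on canonical names. The sketch below assumes both name lists have already been passed through the crosswalk; the function name, the dictionary keys, and the example names (including the placeholder "example new district") are hypothetical.

```python
def match_report(data_names, geojson_names):
    """Compare canonical district names from the data and the boundary file."""
    data_keys = set(data_names)
    geo_keys = set(geojson_names)
    return {
        "matched": sorted(data_keys & geo_keys),
        "csv_only": sorted(data_keys - geo_keys),      # data without a polygon
        "geojson_only": sorted(geo_keys - data_keys),  # polygon without data
    }


# Illustrative input only.
report = match_report(
    data_names=["lahore", "quetta", "example new district"],
    geojson_names=["lahore", "quetta", "gilgit"],
)
```

Running a check like this on every build makes a new unmatched name show up immediately, rather than as a silently grey district on the map.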

Choropleth Colour Mapping

Indicator values are mapped to fill colours using quantile breaks (5 classes) computed via Chroma.js. Quantile classification ensures each colour class contains roughly the same number of districts, which is useful for revealing spatial patterns in skewed distributions (e.g. population, where a few large cities dominate).
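To show what quantile classification does, here is a small Python sketch. The site itself computes breaks with Chroma.js; this is a stand-in written for illustration, and the interpolation rule (linear interpolation between order statistics) is an assumption, so the exact break values may differ slightly from those Chroma.js produces.

```python
import bisect


def quantile_breaks(values, n_classes=5):
    """Return the n_classes - 1 interior break points of a quantile scheme."""
    ordered = sorted(values)
    last = len(ordered) - 1
    breaks = []
    for i in range(1, n_classes):
        pos = i * last / n_classes          # fractional index of the quantile
        lower = int(pos)
        upper = min(lower + 1, last)
        frac = pos - lower
        breaks.append(ordered[lower] + frac * (ordered[upper] - ordered[lower]))
    return breaks


def classify(value, breaks):
    """Index of the colour class (0 .. len(breaks)) that a value falls into."""
    return bisect.bisect_right(breaks, value)


# Made-up skewed data: nine small values and one very large one.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]
breaks = quantile_breaks(values)
classes = [classify(v, breaks) for v in values]
```

With this input each of the five classes receives two values, even though one value is two orders of magnitude larger than the rest. An equal-interval scheme on the same data would put nine of the ten values in the lowest class.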

For change-over-time views ("Change" toggle), a diverging colour scale centred at zero is used: red shades indicate decline, green shades indicate growth, and near-white indicates little change. The scale is symmetric, anchored to the maximum absolute value in either direction.
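The symmetric diverging scale described above can be sketched as follows. The endpoint colours and function names are hypothetical, and plain RGB interpolation is an assumption chosen for brevity; the site's actual palette and interpolation are not specified here.

```python
# Hypothetical endpoint colours for the example.
RED = (215, 48, 39)
WHITE = (255, 255, 255)
GREEN = (26, 152, 80)


def symmetric_limit(changes):
    """Largest absolute change, ignoring missing values."""
    return max((abs(c) for c in changes if c is not None), default=0.0)


def _blend(start, end, weight):
    """Linear RGB interpolation from start (weight 0) to end (weight 1)."""
    return tuple(round(s + (e - s) * weight) for s, e in zip(start, end))


def diverging_colour(change, limit):
    """Map a change to a hex colour on a scale spanning [-limit, +limit]."""
    if limit == 0:
        return "#ffffff"
    t = max(-1.0, min(1.0, change / limit))  # position in [-1, 1]
    target = GREEN if t > 0 else RED
    r, g, b = _blend(WHITE, target, abs(t))
    return f"#{r:02x}{g:02x}{b:02x}"


# Made-up changes; None stands for a district with missing data.
changes = [-4.0, 1.0, None, 2.0]
limit = symmetric_limit(changes)
```

Because the limit is taken from the largest absolute change, a decline of 4 and a growth of 4 would receive equally saturated colours, and zero always maps to white.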

Employment Classification (2017 vs 2023)

The 2017 census (Table 16) reported five mutually exclusive categories for the population aged 10+: Worked, Seeking Work, Students, House Keeping, and Others. The 2023 census (Table 14) adopted a different classification: Employed (with sub-categories: Paid Employee, Own Account Agriculture/Non-Agriculture, Employer, Unpaid Family Helper), Unemployed, and “Not in Labour Force & Students (15–24)”. House Keeping is no longer reported as a separate category.

To enable cross-census comparison, the pipeline computes derived rates (LFPR, employment ratio, unemployment rate) from the raw counts for both years. These rates are comparable across the classification change. Count-based indicators that exist in only one year (e.g. House Keeping for 2017, employment composition for 2023) show as “—” for the other year.
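A sketch of how such derived rates can be computed from raw counts is given below. The formulas are the conventional labour-statistics definitions and are assumed here, not confirmed from the pipeline. The mapping of 2017 "Worked" to employed and "Seeking Work" to unemployed is likewise an assumption, and all counts are made up.

```python
def labour_rates(employed, unemployed, population):
    """Derive LFPR, employment ratio and unemployment rate (all in percent)."""
    labour_force = employed + unemployed
    if population == 0 or labour_force == 0:
        return None
    return {
        "lfpr": 100.0 * labour_force / population,
        "employment_ratio": 100.0 * employed / population,
        "unemployment_rate": 100.0 * unemployed / labour_force,
    }


# Made-up 2017-style counts for one district (five exclusive categories).
counts_2017 = {
    "worked": 400,
    "seeking_work": 100,
    "students": 200,
    "house_keeping": 250,
    "others": 50,
}

rates_2017 = labour_rates(
    employed=counts_2017["worked"],
    unemployed=counts_2017["seeking_work"],
    population=sum(counts_2017.values()),
)
```

The same function can be applied to 2023 counts (Employed, Unemployed, and the not-in-labour-force group as the remainder of the population), which is what makes the rates comparable even though the underlying categories differ.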

Change Computation

The "Change" view for census indicators computes a simple difference: value_2023 − value_2017. For percentage-point indicators (literacy rate, urban proportion), the difference is expressed in percentage points. This approach does not adjust for boundary changes or definitional shifts between census rounds. Users should interpret large changes in smaller districts with caution, as boundary revisions may explain some variation.
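As a minimal sketch, the difference can be computed as below. The function name is hypothetical, and returning None when either year is missing is an assumption about how gaps are handled; the figures are made up.

```python
def census_change(value_2017, value_2023):
    """Simple difference between census rounds; None if either year is missing."""
    if value_2017 is None or value_2023 is None:
        return None
    return value_2023 - value_2017


# Made-up literacy rates (percent): the change is in percentage points.
literacy_change = census_change(58.0, 61.5)

# An indicator reported in only one census has no computable change.
missing_change = census_change(None, 61.5)
```

Note that a rise from 58.0 to 61.5 percent is a change of 3.5 percentage points, not 3.5 percent.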

Survey Statistical Adjustments (LFS & HIES)

The Labour Force Survey and HIES are designed to be representative at the provincial level, not at the district level. To produce district-level estimates, the pipeline applies two adjustments:

1. Minimum sample-size filter (n<30): Districts with fewer than 30 survey observations have all derived indicators suppressed and are flagged on the map with a distinct gold dashed border. This reflects the standard convention that small samples produce unreliable ratio and proportion estimates. In the current data, this affects 6 HIES districts in remote areas with limited survey coverage: Dera Bugti, Khuzdar, Mastung, Orakzai Agency, Panjgur, and Ziarat. The sample size (n) is shown in tooltips for all survey-based indicators.

2. Post-stratification to 2023 Census totals: Survey weights are recalibrated so that weighted district-level totals align with known population counts from Census 2023 (Table 1). For LFS (individual-level microdata), this takes the form of a sex-ratio adjustment: male and female observations within each district are reweighted so that the weighted sex composition matches the census male/female population shares. For HIES (household-level data), a population calibration factor scales household weights so the weighted population total per district matches the census figure. This corrects for the fact that PBS survey sampling frames may not reflect post-census population shifts across districts, and reduces bias from differential non-response by sex.
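The two adjustments above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's code: the function names, the record layout (dicts with "sex" and "weight"), and the choice to keep each district's total weight unchanged during the sex-ratio adjustment are all hypothetical. All figures are made up.

```python
MIN_OBSERVATIONS = 30  # threshold stated in the text above


def is_suppressed(n_observations):
    """True if a district has too few observations to report indicators."""
    return n_observations < MIN_OBSERVATIONS


def sex_ratio_reweight(observations, census_male, census_female):
    """Rescale weights so weighted sex shares match the census shares.

    The district's total weight is left unchanged (an assumption).
    """
    weighted = {"male": 0.0, "female": 0.0}
    for obs in observations:
        weighted[obs["sex"]] += obs["weight"]
    total_weight = weighted["male"] + weighted["female"]
    census_total = census_male + census_female
    target_share = {
        "male": census_male / census_total,
        "female": census_female / census_total,
    }
    adjusted = []
    for obs in observations:
        sex = obs["sex"]
        if weighted[sex] > 0:
            factor = target_share[sex] * total_weight / weighted[sex]
        else:
            factor = 1.0  # nothing to rescale for zero-weight groups
        adjusted.append({**obs, "weight": obs["weight"] * factor})
    return adjusted


def population_calibration_factor(weighted_population, census_population):
    """Factor that scales household weights to match the census total."""
    return census_population / weighted_population


# Made-up LFS-style sample: men are 60 percent of the weighted sample,
# but 52 percent of the census population.
sample = [
    {"sex": "male", "weight": 30.0},
    {"sex": "male", "weight": 30.0},
    {"sex": "female", "weight": 40.0},
]
reweighted = sex_ratio_reweight(sample, census_male=52_000, census_female=48_000)
hies_factor = population_calibration_factor(80_000, 100_000)
```

After reweighting, the male observations carry 52 percent of the total weight and the female observation 48 percent, matching the census shares. In the HIES example, every household weight in the district would be multiplied by 1.25.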

These adjustments improve the plausibility of district-level estimates but do not eliminate the fundamental limitation that provincial-level surveys have limited statistical power at finer geographies. Users should interpret district-level survey indicators as approximate and treat cross-district rankings with caution.

Limitations

Survey-based indicators (PSLM, LFS, HIES) are sample estimates and carry sampling error, particularly for smaller districts with fewer primary sampling units.
Not all indicators are available for every district. Missing data is shown as a grey fill with a dash (—) in the tooltip and sidebar.
Census 2023 results used are provisional and may differ from final published figures.
District boundaries may not perfectly align with the administrative units used in all data sources, especially for newly created districts.
Gender-disaggregated literacy rates for 2023 are available for all districts, including merged districts (Karachi, Kohistan, Chitral), where they are recomputed from aggregated gender-level counts across sub-districts.