Data Darbar is an interactive data explorer for Pakistan, bringing together district-level statistics from multiple official sources published by the Pakistan Bureau of Statistics (PBS).
The platform visualises demographic, economic, education, and social indicators across all districts of Pakistan, enabling researchers, policymakers, journalists, and citizens to explore spatial patterns and changes over time. The name "Data Darbar" is a nod to the famous shrine in Lahore — reimagined here as a place of gathering for Pakistan's data.
Data Darbar was created by Hiba Sameen, an economist and data scientist based in London, UK. Hiba holds a PhD in Economics and has worked on economic policy across academia, government, and think tanks. She is also interested in data science and data engineering, with a focus on making public data more accessible and useful for evidence-based decision-making. This project was born out of frustration with how difficult it is to explore and compare district-level statistics from PBS publications, which are often buried in PDF tables and scattered across multiple reports.
The map covers 141 districts across all provinces and territories: Punjab, Sindh, Khyber Pakhtunkhwa, Balochistan, Islamabad Capital Territory, Azad Jammu & Kashmir (AJK), Gilgit-Baltistan, and the former Federally Administered Tribal Areas (FATA). Of these, 124 districts currently have matched data; the remaining 17 (primarily in AJK and Gilgit-Baltistan) appear on the map but lack PBS coverage in the datasets used.
Raw data is sourced from official PBS publications including PDF tables (parsed
programmatically), CSV releases, and survey microdata files. A Python ETL pipeline
(build_dataset.py, ~1,800 lines) parses, cleans, and normalises all
sources into a single unified JSON file (districts.json) keyed by
normalised district name.
One of the core challenges in working with Pakistani administrative data is the inconsistency in district naming across sources. The same district may appear as "D.G. Khan", "Dera Ghazi Khan", "DG Khan", or "D. G. Khan" depending on the publication. The pipeline addresses this through a two-step process:
1. Normalisation: All district names are lowercased, stripped of punctuation, and collapsed to a canonical form (e.g. "d.g. khan" and "dera ghazi khan" both become "dera ghazi khan").
2. Crosswalk table: A manually curated lookup of over 1,800 name variants maps alternative spellings, abbreviations, and historical names to a single canonical name that matches the GeoJSON boundary file. Examples include "Abbotabad" → "Abbottabad", "Naushero Feroze" → "Naushahro Firoz", and several variant spellings of "Mianwali".
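The two steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the code in build_dataset.py: the function names and the four-entry CROSSWALK dictionary are hypothetical stand-ins for the curated table, and canonical names are kept in lowercase here purely for simplicity.

```python
import re
import string

# Hypothetical excerpt of the crosswalk; the real table has over 1,800 entries.
CROSSWALK = {
    "dg khan": "dera ghazi khan",
    "d g khan": "dera ghazi khan",
    "abbotabad": "abbottabad",
    "naushero feroze": "naushahro firoz",
}

# Translation table that deletes every ASCII punctuation character.
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def normalise(name: str) -> str:
    """Step 1: lowercase, drop punctuation, collapse runs of whitespace."""
    cleaned = name.lower().translate(_STRIP_PUNCT)
    return re.sub(r"\s+", " ", cleaned).strip()


def canonical(name: str) -> str:
    """Step 2: look the normalised name up in the crosswalk.

    Names without a crosswalk entry are assumed to be canonical already.
    """
    key = normalise(name)
    return CROSSWALK.get(key, key)


variants = ["D.G. Khan", "Dera Ghazi Khan", "DG Khan", "D. G. Khan"]
print({canonical(v) for v in variants})
```

All four spellings from the example resolve to one key, which is what allows records from different publications to be joined.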
Several districts in the census are reported at sub-district level but appear as a single polygon in the boundary data. The pipeline aggregates these automatically.
For count-based indicators (population, literate count, establishments), sub-district values are summed. For rate-based indicators (literacy ratio, unemployment rate), the pipeline recomputes the rate from aggregated numerator and denominator counts rather than averaging percentages, which would be statistically incorrect.
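A small worked example shows why the rate is recomputed from counts. The sketch below assumes each sub-district row carries a literate count and a population denominator; the field names, the parent mapping, and the "Example" district are all hypothetical and the numbers are made up for illustration.

```python
from collections import defaultdict


def aggregate(rows, parent_of):
    """Sum counts per parent district, then recompute the rate from the sums."""
    totals = defaultdict(lambda: {"population_10_plus": 0, "literate": 0})
    for row in rows:
        # Rows that are not sub-districts map to themselves.
        parent = parent_of.get(row["name"], row["name"])
        totals[parent]["population_10_plus"] += row["population_10_plus"]
        totals[parent]["literate"] += row["literate"]

    result = {}
    for district, t in totals.items():
        denominator = t["population_10_plus"]
        ratio = 100 * t["literate"] / denominator if denominator else None
        result[district] = {**t, "literacy_ratio": ratio}
    return result


# Illustrative numbers only: a small, highly literate unit and a large,
# less literate one that share a single boundary polygon.
rows = [
    {"name": "Example East", "population_10_plus": 1000, "literate": 800},
    {"name": "Example West", "population_10_plus": 9000, "literate": 2700},
]
parent_of = {"Example East": "Example", "Example West": "Example"}

combined = aggregate(rows, parent_of)["Example"]
print(combined["literacy_ratio"])  # recomputed from summed counts
print((80.0 + 30.0) / 2)           # naive average of the two percentages
```

With these made-up inputs the recomputed ratio is 35%, while averaging the two percentages would give 55%, because the naive average ignores that one unit is nine times larger than the other.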
The GeoJSON boundary file contains 141 district polygons sourced from publicly
available shapefiles at the ADM2 administrative level. Each polygon has a
districts property containing the official district name. The
pipeline matches each data record to a GeoJSON feature by normalising both
the data district name and the GeoJSON property name through the same crosswalk,
then joining on the canonical key.
A match report is generated on each build showing how many data districts successfully mapped to GeoJSON features, which data districts had no matching boundary (CSV-only), and which GeoJSON features had no data (GeoJSON-only, typically AJK and Gilgit-Baltistan districts not covered by PBS surveys).
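The join and its report reduce to set operations on canonical keys. The following sketch assumes a canonicalising function is available; a trivial lowercase-and-strip stand-in is used here, and the function name, report keys, and placeholder district names are hypothetical.

```python
def simple_canonical(name):
    """Stand-in for the real normalisation and crosswalk lookup."""
    return " ".join(name.lower().split())


def match_report(data_names, geojson_names, canonical=simple_canonical):
    """Compare canonical keys from the data with those from the boundary file."""
    data_keys = {canonical(n) for n in data_names}
    geo_keys = {canonical(n) for n in geojson_names}
    return {
        "matched": sorted(data_keys & geo_keys),       # joined successfully
        "csv_only": sorted(data_keys - geo_keys),      # data without a boundary
        "geojson_only": sorted(geo_keys - data_keys),  # boundary without data
    }


# Placeholder names, not real results from the build.
report = match_report(
    data_names=["District A", "district  b", "District C"],
    geojson_names=["DISTRICT A", "District B", "District D"],
)
for section, names in report.items():
    print(f"{section}: {len(names)} {names}")
```

Running the same canonicalisation on both sides is the important design choice: a spelling difference then shows up in the report as an unmatched name on each side rather than silently dropping a district.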
Indicator values are mapped to fill colours using quantile breaks (5 classes) computed via Chroma.js. Quantile classification ensures each colour class contains roughly the same number of districts, which is useful for revealing spatial patterns in skewed distributions (e.g. population, where a few large cities dominate).
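To illustrate what quantile classification does to a skewed distribution, here is a rough Python equivalent using the standard library. It is not the Chroma.js code used on the site, and the exact break values may differ slightly from Chroma.js because interpolation rules vary between implementations. The function names and sample values are made up.

```python
import bisect
import statistics
from collections import Counter


def quantile_breaks(values, classes=5):
    """Return the interior cut points that split values into equal-count classes."""
    return statistics.quantiles(values, n=classes, method="inclusive")


def classify(value, breaks):
    """Return the class index (0 .. len(breaks)) that a value falls into."""
    return bisect.bisect_right(breaks, value)


# Made-up indicator values with one extreme outlier, as with large cities.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]
breaks = quantile_breaks(values)
counts = Counter(classify(v, breaks) for v in values)

print(breaks)
print(sorted(counts.items()))
```

Each of the five classes receives two of the ten values even though one value is over a hundred times larger than the rest. Equal-interval breaks on the same data would place nine values in the lowest class and leave the middle classes empty.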
For year-on-year change views ("Change" toggle), a diverging colour scale centred at zero is used: red shades indicate decline, green shades indicate growth, and near-white indicates little change. The scale is symmetric, anchored to the maximum absolute value in either direction.
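The symmetric anchoring can be sketched as follows. This is an illustrative Python version, not the site's Chroma.js code: the end colours are arbitrary hypothetical choices, the midpoint is plain white, and the colours are interpolated linearly in RGB.

```python
# Hypothetical end colours; the site's actual palette may differ.
RED = (178, 24, 43)
WHITE = (255, 255, 255)
GREEN = (27, 120, 55)


def symmetric_domain(changes):
    """Anchor the scale at +/- the largest absolute change so zero is central."""
    max_abs = max(abs(c) for c in changes)
    return -max_abs, max_abs


def change_colour(change, max_abs):
    """Blend from white towards red (decline) or green (growth)."""
    if max_abs == 0:
        return "#ffffff"
    t = max(-1.0, min(1.0, change / max_abs))  # position in [-1, 1]
    end = GREEN if t >= 0 else RED
    rgb = tuple(round(w + (e - w) * abs(t)) for w, e in zip(WHITE, end))
    return "#{:02x}{:02x}{:02x}".format(*rgb)


changes = [-2.0, 0.0, 1.0, 5.0]  # made-up changes
low, high = symmetric_domain(changes)
print(low, high)
print([change_colour(c, high) for c in changes])
```

Because the domain is set by the largest absolute change, a decline of 2 is drawn as a paler red than a growth of 5 is drawn green, so colour intensity is comparable in both directions.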
The 2017 census (Table 16) reported five mutually exclusive categories for population aged 10+: Worked, Seeking Work, Students, House Keeping, and Others. The 2023 census (Table 14) changed to a different classification: Employed (with sub-categories: Paid Employee, Own Account Agriculture/Non-Agriculture, Employer, Unpaid Family Helper), Unemployed, and “Not in Labour Force & Students (15–24)”. House Keeping is no longer reported as a separate category.
To enable cross-census comparison, the pipeline computes derived rates (LFPR, employment ratio, unemployment rate) from the raw counts for both years. These rates are comparable across the classification change. Count-based indicators that exist in only one year (e.g. House Keeping for 2017, employment composition for 2023) show as “—” for the other year.
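One plausible way to derive the three rates from the raw counts is sketched below. It rests on assumptions that are not confirmed by the pipeline: "Worked" is treated as equivalent to "Employed" and "Seeking Work" to "Unemployed", the population aged 10+ is taken as the sum of all reported categories, and the rates follow the standard definitions (labour force = employed + unemployed). Function names, dictionary keys, and counts are illustrative.

```python
def labour_rates(employed, unemployed, population_10_plus):
    """Return LFPR, employment ratio and unemployment rate as percentages."""
    labour_force = employed + unemployed
    if population_10_plus <= 0 or labour_force <= 0:
        return None  # rates are undefined without a denominator
    return {
        "lfpr": 100 * labour_force / population_10_plus,
        "employment_ratio": 100 * employed / population_10_plus,
        "unemployment_rate": 100 * unemployed / labour_force,
    }


def rates_2017(counts):
    # Assumed mapping: "Worked" -> employed, "Seeking Work" -> unemployed.
    population = sum(counts.values())
    return labour_rates(counts["Worked"], counts["Seeking Work"], population)


def rates_2023(counts):
    # Assumes the three top-level categories cover the whole population 10+.
    population = sum(counts.values())
    return labour_rates(counts["Employed"], counts["Unemployed"], population)


# Made-up counts for a single district.
census_2017 = {"Worked": 450, "Seeking Work": 50, "Students": 200,
               "House Keeping": 250, "Others": 50}
census_2023 = {"Employed": 450, "Unemployed": 50, "Not in Labour Force": 500}

print(rates_2017(census_2017))
print(rates_2023(census_2023))
```

Both calls reduce each census to the same three quantities (employed, unemployed, population aged 10+), which is why the derived rates can be compared even though the category lists differ.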
The "Change" view for census indicators computes a simple difference:
value_2023 − value_2017. For percentage-point indicators
(literacy rate, urban proportion), the difference is expressed in percentage
points. This approach does not adjust for boundary changes or definitional
shifts between census rounds. Users should interpret large changes in smaller
districts with caution, as boundary revisions may explain some variation.
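As a minimal sketch, the change calculation with the missing-year case handled explicitly might look like this. The function name is hypothetical and the values are made up.

```python
def census_change(value_2017, value_2023):
    """Simple difference; None when the indicator is missing in either year."""
    if value_2017 is None or value_2023 is None:
        return None
    return value_2023 - value_2017


# Made-up literacy rates in percent: the result is in percentage points.
print(census_change(58.0, 61.5))
# An indicator reported in only one census has no change value.
print(census_change(None, 12000))
```

A rise from 58.0% to 61.5% is reported as 3.5 percentage points, not as a relative increase of about 6%.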
The Labour Force Survey and HIES are designed to be representative at the provincial level, not at the district level. To produce district-level estimates, the pipeline applies two adjustments:
1. Minimum sample-size filter (n<30): Districts with fewer than 30 survey observations have all derived indicators suppressed and are flagged on the map with a distinct gold dashed border. This reflects the standard convention that small samples produce unreliable ratio and proportion estimates. In the current data, this affects 6 HIES districts in remote areas with limited survey coverage: Dera Bugti, Khuzdar, Mastung, Orakzai Agency, Panjgur, and Ziarat. The sample size (n) is shown in tooltips for all survey-based indicators.
2. Post-stratification to 2023 Census totals: Survey weights are recalibrated so that weighted district-level totals align with known population counts from Census 2023 (Table 1). For LFS (individual-level microdata), this takes the form of a sex-ratio adjustment: male and female observations within each district are reweighted so that the weighted sex composition matches the census male/female population shares. For HIES (household-level data), a population calibration factor scales household weights so the weighted population total per district matches the census figure. This corrects for the fact that PBS survey sampling frames may not reflect post-census population shifts across districts, and reduces bias from differential non-response by sex.
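The two adjustments can be illustrated with a short sketch. It shows one plausible form of each step under stated assumptions, not the pipeline's actual code: the record layout (a "sex" field coded "M"/"F", a "weight" field, a "members" field for household size), the function names, and all numbers are hypothetical.

```python
MIN_SAMPLE = 30  # threshold stated above


def is_suppressed(n_observations):
    """Districts with fewer than 30 observations have indicators suppressed."""
    return n_observations < MIN_SAMPLE


def sex_ratio_adjust(records, census_male, census_female):
    """Rescale weights so weighted sex shares match the census shares.

    The total weight in the district is left unchanged.
    """
    census_total = census_male + census_female
    target = {"M": census_male / census_total, "F": census_female / census_total}
    weighted = {"M": 0.0, "F": 0.0}
    for r in records:
        weighted[r["sex"]] += r["weight"]
    total = weighted["M"] + weighted["F"]
    factors = {s: target[s] / (weighted[s] / total)
               for s in weighted if weighted[s] > 0}
    return [{**r, "weight": r["weight"] * factors.get(r["sex"], 1.0)}
            for r in records]


def population_calibration_factor(households, census_population):
    """Factor that scales household weights to the census population total."""
    weighted_population = sum(h["weight"] * h["members"] for h in households)
    return census_population / weighted_population


# LFS-style example: men are over-represented in the made-up sample.
people = [{"sex": "M", "weight": 100.0}] * 3 + [{"sex": "F", "weight": 100.0}]
adjusted = sex_ratio_adjust(people, census_male=500, census_female=500)
male_total = sum(r["weight"] for r in adjusted if r["sex"] == "M")
female_total = sum(r["weight"] for r in adjusted if r["sex"] == "F")
print(round(male_total, 6), round(female_total, 6))

# HIES-style example with two made-up households.
households = [{"weight": 10.0, "members": 5}, {"weight": 20.0, "members": 4}]
print(population_calibration_factor(households, census_population=260))
```

In the first example the weighted sample is 75% male before adjustment and 50% male afterwards, matching the assumed census split. In the second, the weighted population of 130 is scaled by a factor of 2 to reach the assumed census total of 260.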
These adjustments improve the plausibility of district-level estimates but do not eliminate the fundamental limitation that provincial-level surveys have limited statistical power at finer geographies. Users should interpret district-level survey indicators as approximate and treat cross-district rankings with caution.
Have a question or suggestion, or found a data issue? Send a message and it will reach Hiba directly.