About the Project
Our mission to transform urban planning through data-driven insights.
India's rapid urbanization presents a significant challenge: the absence of real-time, reliable building data. Traditional manual surveys are slow, expensive, and quickly become obsolete. This project addresses this nationwide problem by creating a dynamic digital building stock for the entire country, essential for sustainable development.
Built with GIS and machine learning, this foundational data layer enables a new generation of data-driven applications.
About Our Data: Assumptions, Definitions and Limitations
We are pleased to provide this comprehensive building stock dataset for public use. Our goal is to offer a valuable resource for researchers, planners, and the public. To ensure this data is used responsibly and effectively, we believe in full transparency. Users should be aware of the following key definitions, assumptions, estimations, and limitations before beginning their analysis.
We provide this dataset as a foundational resource and encourage the scientific community to build upon this work. The inclusion of provenance fields allows all users to validate our results, refine the models, and contribute to an iteratively improved dataset.
Building Footprints and Scope
- Source: The building footprints in this dataset are sourced from the Google Open Buildings dataset (v3). Please note that these are not based on official surveys. They are detected outlines of buildings derived from high-resolution satellite imagery.
- Scope: The dataset includes all detected structures (e.g., derelict, abandoned, informal buildings), reflecting the full physical footprint of the built environment.
Building Height and Floor Count (2.5D Data)
- Height Source: Building height was extracted from the Google Open Buildings Temporal dataset (v1).
- Floor Count Estimation: The `building_floor` attribute is an estimate, calculated by dividing the extracted building height by a standard floor height of 3 metres. This 3-metre standard is a useful general estimate.
- Custom Rounding: The final integer floor count was determined using a custom rounding logic. A number was rounded up only if the decimal part was 0.8 or greater; otherwise, it was rounded down.
- User Discretion: We provide the raw `building_height` data to allow end-users to apply more nuanced assumptions, such as using different floor-to-height ratios for commercial or industrial buildings.
- Important Data Caveat: The raw dataset was computed based on an average floor height of 3 metres for all land use types. Therefore, the aggregated results might differ from what is presented on the dashboard, which uses AEEE internal assumptions for visualisation.
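The floor estimate and rounding rule described above can be sketched in a few lines of Python. This is a minimal illustration and not the project's own code: the function name `estimate_floors`, the `EPSILON` tolerance, and the alternative floor heights in `CUSTOM_FLOOR_HEIGHTS_M` are assumptions made for the example. The tolerance is included because a height of 2.4 m divided by 3 m does not evaluate to exactly 0.8 in floating-point arithmetic.

```python
import math

FLOOR_HEIGHT_M = 3.0       # standard floor height stated in this document
ROUND_UP_THRESHOLD = 0.8   # decimal part at or above this value rounds up
EPSILON = 1e-9             # assumption: tolerance for floating-point error


def estimate_floors(height_m, floor_height_m=FLOOR_HEIGHT_M):
    """Estimate a floor count from a height using the 0.8 round-up rule."""
    ratio = height_m / floor_height_m
    whole = math.floor(ratio)
    fraction = ratio - whole
    if fraction >= ROUND_UP_THRESHOLD - EPSILON:
        return whole + 1
    return whole


# Hypothetical per-class floor heights a user might substitute;
# these values are illustrative and do not come from the dataset.
CUSTOM_FLOOR_HEIGHTS_M = {"Commercial / Retail": 3.5, "Industrial": 4.5}

for height in (2.4, 8.3, 8.4, 9.0):
    print(height, "m ->", estimate_floors(height), "floor(s)")

# Re-estimating with a different floor height, as the User Discretion note allows.
print(estimate_floors(9.0, CUSTOM_FLOOR_HEIGHTS_M["Industrial"]))
```

Passing a different `floor_height_m` is one way to apply the user-discretion note to the raw `building_height` values without changing the rounding rule.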
Land Use (Building Typology)
- Proxy Data: Official, digitised building function data is not broadly available. This dataset uses OpenStreetMap (OSM) data as a proxy.
- Core Assumption: The methodology relies on the underlying assumption that the landuse tag in OSM provides a strong, albeit indirect, indicator of the likely function of buildings within that area.
- Class Simplification: The raw OSM landuse tags are highly granular. To create a functional classification model, these tags were mapped to a smaller, more functional set of seven primary classes (e.g., Residential, Commercial / Retail, Industrial).
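The class simplification step can be pictured as a lookup table from raw OSM `landuse` values to the harmonized classes. The sketch below is illustrative only: the mapping `OSM_TO_CLASS` and the function `simplify_landuse` are hypothetical, and the table covers only the three classes named above because the full seven-class mapping is not listed here.

```python
# Hypothetical mapping from raw OSM landuse values to harmonized classes.
# Only the three classes named in this document are covered.
OSM_TO_CLASS = {
    "residential": "Residential",
    "commercial": "Commercial / Retail",
    "retail": "Commercial / Retail",
    "industrial": "Industrial",
}


def simplify_landuse(osm_tag):
    """Return the harmonized class for an OSM landuse value, or None if unmapped."""
    if osm_tag is None:
        return None
    return OSM_TO_CLASS.get(osm_tag.strip().lower())


for tag in ("residential", "Retail", "meadow", None):
    print(repr(tag), "->", simplify_landuse(tag))
```

Returning `None` for unmapped or missing tags keeps unlabelled buildings distinguishable, which is the subset a classifier would later need to fill in.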
Machine Learning and Transparency
- Imputation of Missing Data: To address inherent incompleteness in OSM data, we employed a supervised machine learning model (Random Forest) to predict the land use for all unlabelled buildings.
- Training Data Limitations: The model was trained only on the subset of buildings that had an original landuse tag from OSM, acknowledging the real-world biases in the data (e.g., a predominance of residential buildings).
- Transparency Fields: We have included two columns in the final dataset: `is_predicted` (a Boolean flag indicating ML assignment) and `prediction_confidence` (the model's confidence score). We encourage users to use these fields to filter the dataset for their specific analysis.
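As a rough illustration of how the transparency fields might be used, the sketch below keeps every original OSM label and only those predicted labels that meet a chosen confidence threshold. The function `select_records`, the 0.8 threshold, and the sample records are assumptions for the example; only the field names come from the dataset.

```python
def select_records(records, min_confidence=0.8, include_predicted=True):
    """Keep original labels, plus predicted labels at or above min_confidence."""
    selected = []
    for rec in records:
        if not rec["is_predicted"]:
            selected.append(rec)  # original OSM label: always kept
        elif include_predicted and rec["prediction_confidence"] >= min_confidence:
            selected.append(rec)  # ML-imputed label that passes the threshold
    return selected


# Made-up sample records, for illustration only.
sample = [
    {"landuse": "Residential", "is_predicted": False, "prediction_confidence": 1.0},
    {"landuse": "Industrial", "is_predicted": True, "prediction_confidence": 0.91},
    {"landuse": "Residential", "is_predicted": True, "prediction_confidence": 0.55},
]

print(len(select_records(sample)))                           # original + confident predictions
print(len(select_records(sample, include_predicted=False)))  # original labels only
```

A stricter threshold trades coverage for reliability, so the appropriate value depends on the analysis.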
Data Filtering and Geographic Disclaimer
- Filtering Exclusions: To ensure high data quality, the dataset excludes buildings with Confidence Level (CL) below 0.75, and buildings with height less than 2.4 metres or greater than 100 metres (due to source dataset limitations).
- Geographic Disclaimer: The state/municipal boundary maps used for visualisation are sourced from a publicly available GitHub repository, because up-to-date digital layers of municipal boundaries are not available in the public domain.
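The filtering exclusions listed above amount to a simple predicate on two fields. The sketch below is one possible reading, not the project's pipeline: the function name `passes_quality_filter` is hypothetical, and the boundary values (0.75, 2.4 m, 100 m) are treated as inclusive because the text excludes only values below or above them.

```python
MIN_CONFIDENCE = 0.75  # buildings with CL below this are excluded
MIN_HEIGHT_M = 2.4     # buildings lower than this are excluded
MAX_HEIGHT_M = 100.0   # buildings taller than this are excluded


def passes_quality_filter(confidence, building_height):
    """Return True if a building would be retained under the stated exclusions."""
    return (
        confidence >= MIN_CONFIDENCE
        and MIN_HEIGHT_M <= building_height <= MAX_HEIGHT_M
    )


print(passes_quality_filter(0.75, 2.4))   # retained: both values sit on the boundary
print(passes_quality_filter(0.74, 10.0))  # excluded: confidence too low
print(passes_quality_filter(0.90, 120.0)) # excluded: above the height limit
```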
Data Dictionary
The GOBS dataset includes the following features for each building record.
| Feature Name | Data Type | Description |
|---|---|---|
| latitude | Float | Describes the latitude of the building's centroid. |
| longitude | Float | Describes the longitude of the building's centroid. |
| area_in_meters | Float | The area of the building's footprint polygon in square meters. |
| confidence | Float | The confidence level (CL) of the polygon being a building, provided by the source dataset. CL ∈ [0.75, 1]. |
| full_plus_code | String | A global, high-precision address code from Google that represents a specific geographic area. |
| perimeter | Float | The calculated perimeter of the building's footprint polygon in meters. |
| building_height | Float | The building's height in meters, derived from the Google Open Buildings 2.5D Temporal dataset. |
| building_floor | Integer | The estimated number of floors, calculated from 'building_height' using a custom rounding logic. |
| landuse | String | The functional classification of the land (e.g., 'Residential', 'Industrial'), sourced from OpenStreetMap and harmonized into standard categories. |
| total_built_up | Float | The total estimated built-up area in square meters, engineered by multiplying 'area_in_meters' by 'building_floor'. |
| state_name | String | The name of the state where the building is located. |
| district_name | String | The name of the district where the building is located. |
| is_predicted | Boolean | A flag indicating if the 'landuse' value was imputed by the machine learning model (True) or is an original label (False). |
| prediction_confidence | Float | For predicted 'landuse' values, this is the model's confidence score in the range [0, 1]. For original labels, the value is 1.0. |
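Because `total_built_up` is derived from two other fields in the table, users can recompute it as a sanity check after loading the data. The following is a minimal sketch under that reading: the helper names `expected_total_built_up` and `is_consistent`, the relative tolerance, and the sample record are assumptions for illustration.

```python
import math


def expected_total_built_up(area_in_meters, building_floor):
    """Built-up area as defined in the data dictionary: footprint area x floors."""
    return area_in_meters * building_floor


def is_consistent(record, rel_tol=1e-6):
    """Check a record's total_built_up against the value recomputed from its fields."""
    expected = expected_total_built_up(
        record["area_in_meters"], record["building_floor"]
    )
    return math.isclose(record["total_built_up"], expected, rel_tol=rel_tol)


# Made-up record, for illustration only.
record = {"area_in_meters": 120.5, "building_floor": 3, "total_built_up": 361.5}
print(is_consistent(record))
```

The same helper can be reused to rebuild `total_built_up` after a user re-estimates `building_floor` with different floor-height assumptions.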
Our Data Creation Process
A three-step methodology for turning raw data into actionable intelligence
Geometric Enrichment
Our process begins with AI-extracted building footprints. We then integrate building height data to create a 2.5D model of every structure, allowing us to calculate key metrics like perimeter, floors, and total built-up area after a series of rigorous data validation and cleaning steps.
Semantic Enrichment
To understand a building's function, we fuse its geometry with land-use data sourced from OpenStreetMap (OSM). A Random Forest machine learning model then classifies each structure, predicting the function (e.g., residential, commercial) for buildings with missing information, ensuring a complete dataset.
Actionable Aggregation
Finally, the enriched, building-level data from all districts is unified into a foundational dataset. This raw dataset is made available for download and serves as a critical resource for diverse use cases, from modeling and climate studies to informing national policy. It supports custom analysis and aggregation at state, city, or district levels.
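Custom aggregation of the building-level records can be done with a simple group-and-sum. The sketch below is an illustration only: the function `aggregate_built_up`, the placeholder state and district names, and the sample values are hypothetical, while the field names follow the data dictionary.

```python
from collections import defaultdict


def aggregate_built_up(records, keys=("state_name", "district_name")):
    """Sum building counts and built-up area for each combination of key fields."""
    totals = defaultdict(lambda: {"buildings": 0, "total_built_up": 0.0})
    for rec in records:
        group = tuple(rec[key] for key in keys)
        totals[group]["buildings"] += 1
        totals[group]["total_built_up"] += rec["total_built_up"]
    return dict(totals)


# Made-up records with placeholder names, for illustration only.
sample = [
    {"state_name": "State A", "district_name": "District 1", "total_built_up": 100.0},
    {"state_name": "State A", "district_name": "District 1", "total_built_up": 250.0},
    {"state_name": "State A", "district_name": "District 2", "total_built_up": 80.0},
]

by_district = aggregate_built_up(sample)
by_state = aggregate_built_up(sample, keys=("state_name",))
print(by_district)
print(by_state)
```

Changing `keys` switches the aggregation level, for example to `("state_name", "landuse")` for a per-state breakdown by building function.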
Core Datasets Used
This project stands on the shoulders of giants, integrating several powerful open-source datasets.
Open Buildings
This dataset provides the foundational 2D building footprints derived from high-resolution satellite imagery using AI. It's one of the largest open-access building datasets, crucial for humanitarian work and environmental science.
Open Buildings 2.5D Temporal Dataset
This complementary dataset offers insights into when each building was likely constructed and supplies the building heights used in this project. It helps track urban growth and changes in the built environment, adding a vital historical context to the spatial data.
OpenStreetMap (OSM)
A global, collaborative mapping project. It provides the essential crowdsourced data on land use and points of interest, which we use to classify buildings by their function (e.g., residential, commercial).
Citation & Publication
This work is associated with a Data Descriptor paper currently in preparation, which will provide extensive documentation on the database and methodology. All code used to generate this data will be openly available on GitHub.
If you use the GOBS data for your project, please use the following citation: Jindal, R., Johnson, J. & Kumar, S. (2025). A geospatial dataset of building stock and height for building-level analysis in India.