Dataset Processing and Harmonization

Landslide inventories included in the CRESCENT Ground Failure Repository originate from multiple state and provincial geological surveys and are produced using different mapping practices, classification schemes, spatial representations, and attribute conventions. As a result, raw inventories are not directly comparable across regions without careful preprocessing.

To address this, all datasets in the repository are passed through a transparent, reproducible processing pipeline implemented as a series of Jupyter notebooks. These notebooks document how each source inventory is ingested, cleaned, standardized, and prepared for visualization and distribution.

The processing workflow is divided into two main stages:

PreProcessing: source-specific ingestion and normalization
PostProcessing: cross-dataset harmonization and final schema alignment

All processing notebooks are provided in full to support reproducibility and reuse.

PreProcessing

The PreProcessing stage focuses on converting each source inventory from its native format into a common geospatial representation while preserving the original information content and attribution.

Because each agency distributes data differently, preprocessing steps are necessarily dataset-specific. Typical tasks performed at this stage include:

Downloading or loading source data (e.g., shapefiles, geodatabases, CSVs)
Converting data into GeoPandas-compatible formats
Reprojecting geometries to a common coordinate reference system
Cleaning invalid or empty geometries
Renaming and documenting original attribute fields
Adding required identifiers and provenance metadata
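
The tasks listed above can be sketched in a few lines of GeoPandas. This is a minimal illustration, not the code from any of the repository's notebooks: the function name `preprocess`, the `source` and `record_id` columns, the field name `LS_TYPE`, and the choice of EPSG:4326 as the common CRS are all assumptions made for the example. It also drops problem geometries rather than repairing them, which is only one possible cleaning policy.

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon

# Assumed common CRS for this sketch; the repository's actual choice may differ.
TARGET_CRS = "EPSG:4326"


def preprocess(gdf, source_name, rename_map, target_crs=TARGET_CRS):
    """Normalize one source inventory without reinterpreting its attributes."""
    # Keep only rows whose geometry is present, non-empty, and valid.
    geom = gdf.geometry
    keep = geom.notna() & ~geom.is_empty & geom.is_valid
    out = gdf.loc[keep].copy()

    # Reproject to the common coordinate reference system.
    out = out.to_crs(target_crs)

    # Rename original fields; the mapping doubles as documentation of the change.
    out = out.rename(columns=rename_map)

    # Add provenance metadata and a per-record identifier (hypothetical format).
    out = out.reset_index(drop=True)
    out["source"] = source_name
    out["record_id"] = [f"{source_name}-{i:06d}" for i in range(len(out))]
    return out


# Tiny in-memory stand-in for a source file that would normally be
# loaded with gpd.read_file(...).
raw = gpd.GeoDataFrame(
    {"LS_TYPE": ["slide", "flow", "fall"]},
    geometry=[Point(0, 0), None, Polygon()],
    crs="EPSG:3857",
)
clean = preprocess(raw, "DEMO", {"LS_TYPE": "orig_type"})
print(clean[["record_id", "source", "orig_type"]])
```

Only the first demo row survives, because the second has no geometry and the third is an empty polygon. The original classification value is carried through unchanged under its new column name.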

Each source inventory has its own dedicated preprocessing notebook:

California Geological Survey (CGS)
processing_notebooks/PreProcessing/01_cgs_download_to_geopandas.ipynb
Washington Department of Natural Resources (WADNR)
processing_notebooks/PreProcessing/washington_preprocessing.ipynb
Oregon Department of Geology and Mineral Industries (DOGAMI)
processing_notebooks/PreProcessing/oregon_preprocessing.ipynb
Natural Resources Canada (NRCan)
processing_notebooks/PreProcessing/canada_preprocessing.ipynb

These notebooks document source-specific assumptions, data limitations, and processing choices and should be read alongside the resulting datasets.

Note: PreProcessing does not reinterpret landslide classifications or alter scientific meaning. The goal is normalization and documentation, not reanalysis.

PostProcessing

The PostProcessing stage runs after every source inventory has been individually preprocessed into a consistent geospatial format.

At this stage, datasets are combined and harmonized to support region-wide analysis and visualization. Typical post-processing steps include:

Mapping source-specific classifications to a common controlled vocabulary
Standardizing attribute names, data types, and units
Adding derived fields used by the Ground Failure Viewer
Enforcing a shared schema across all inventories
Validating geometry types and attribute completeness
Exporting final, standardized data products for distribution
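
As a rough illustration of the harmonization steps above, the sketch below maps source labels to a common vocabulary, enforces a shared schema, and checks attribute completeness. Everything specific in it is hypothetical: the crosswalk entries, the vocabulary terms, the column names, and the `harmonize` function are placeholders rather than the repository's actual schema. It uses a plain pandas DataFrame and leaves the geometry column out for brevity.

```python
import pandas as pd

# Hypothetical per-source crosswalks into a hypothetical controlled vocabulary.
CROSSWALK = {
    "CGS": {"Rock Slide": "slide", "Debris Flow": "flow"},
    "WADNR": {"RS": "slide", "DF": "flow"},
}

# Hypothetical shared schema: column name -> pandas dtype.
SCHEMA = {
    "record_id": "string",
    "source": "string",
    "orig_type": "string",
    "movement_type": "string",
    "area_m2": "float64",
}


def harmonize(df):
    """Apply the crosswalk, then enforce the shared schema."""
    out = df.copy()

    # Map source-specific labels to the common vocabulary. Labels with no
    # crosswalk entry are flagged as "unclassified" instead of being guessed.
    out["movement_type"] = [
        CROSSWALK.get(src, {}).get(label, "unclassified")
        for src, label in zip(out["source"], out["orig_type"])
    ]

    # Enforce the schema: fail on missing columns, cast types, drop extras.
    missing = [col for col in SCHEMA if col not in out.columns]
    if missing:
        raise ValueError(f"missing schema columns: {missing}")
    out = out[list(SCHEMA)].astype(SCHEMA)

    # Validate completeness of the fields that must never be null.
    if out[["record_id", "source"]].isna().any().any():
        raise ValueError("record_id and source must be populated")
    return out


combined = pd.DataFrame(
    {
        "record_id": ["CGS-000000", "WADNR-000000", "WADNR-000001"],
        "source": ["CGS", "WADNR", "WADNR"],
        "orig_type": ["Debris Flow", "RS", "XX"],
        "area_m2": [1200, 350, 80],
        "notes": ["a", "b", "c"],  # not in the schema, so it is dropped
    }
)
final = harmonize(combined)
print(final)
```

Keeping the original label (`orig_type`) next to the harmonized one is a design choice in this sketch that keeps every crosswalk decision traceable to its source value.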

PostProcessing is implemented in a shared notebook:

Unified Post-Processing and Harmonization
processing_notebooks/PostProcessing/post_processing.ipynb

This separation ensures that:

Source-specific logic remains clearly isolated, and
Cross-dataset decisions are applied consistently and transparently.

Reproducibility and Extension

All processing notebooks are version-controlled and designed to be re-run as source datasets evolve or new inventories are added. Contributors who wish to add new datasets are encouraged to follow the same two-stage structure:

Create a new PreProcessing notebook for the source dataset
Integrate the output into the existing PostProcessing workflow
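
One way to picture the integration step is as a contract check followed by a concatenation. The sketch below is an assumption about how such a hand-off could look, not a description of the existing notebooks: the `REQUIRED` column list, the `combine` function, and the source name `NEWSRC` are all hypothetical.

```python
import pandas as pd

# Hypothetical contract that every preprocessed output must satisfy
# before it enters the shared post-processing step.
REQUIRED = ["record_id", "source", "orig_type"]


def combine(preprocessed):
    """Check each per-source table against the contract, then concatenate."""
    frames = []
    for name, df in preprocessed.items():
        missing = [col for col in REQUIRED if col not in df.columns]
        if missing:
            raise ValueError(f"{name}: missing required columns {missing}")
        frames.append(df)
    return pd.concat(frames, ignore_index=True)


existing = pd.DataFrame(
    {
        "record_id": ["CGS-000000", "CGS-000001"],
        "source": ["CGS", "CGS"],
        "orig_type": ["Debris Flow", "Rock Slide"],
    }
)
new_source = pd.DataFrame(
    {
        "record_id": ["NEWSRC-000000"],
        "source": ["NEWSRC"],
        "orig_type": ["earthflow"],
    }
)
merged = combine({"CGS": existing, "NEWSRC": new_source})
print(merged)
```

With a check like this, a new preprocessing notebook that omits a required field fails with a clear error at hand-off time instead of producing a silently incomplete combined table.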

This design allows the repository to grow while maintaining consistency, traceability, and scientific integrity.
