Methodology
This page summarizes what the pipeline does at a high level. For the detailed write‑up, see
METHODOLOGY.md and CODE_GUIDE.md in the repo.
Pipeline overview
Cluster + Rank → Monte Carlo stability → Stress‑test ladder → Spacing → Twin similarity
- Cluster + Rank: group block groups (BGs) by feature similarity (robust z‑scores), then rank clusters by a LocationScore.
- Monte Carlo: re‑fit clustering under perturbations (bootstrap + noise) to measure selection stability (pass‑rate).
- Stress‑test ladder: run strict→relaxed feasibility regimes and aggregate candidates by frequency + earliest regime.
- Spacing: enforce a minimum distance between candidates and anchor BGs (the public demo uses synthetic anchors).
- Twin similarity: connect candidates to the most similar anchor profile (e.g., Mahalanobis distance).
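The Spacing step in the list above can be illustrated with a small sketch. It assumes each BG is represented by a centroid in latitude/longitude and that distance is measured as great-circle (haversine) distance; the function names, the sample coordinates, and the 5 km threshold are hypothetical and are not taken from the repo.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def enforce_spacing(candidates, anchors, min_km):
    """Keep the IDs of candidates at least min_km away from every anchor."""
    kept = []
    for bg_id, lat, lon in candidates:
        if all(haversine_km(lat, lon, a_lat, a_lon) >= min_km
               for a_lat, a_lon in anchors):
            kept.append(bg_id)
    return kept


# Hypothetical data: one synthetic anchor and two candidate BG centroids.
anchors = [(34.05, -118.25)]
candidates = [
    ("bg_near", 34.06, -118.25),  # roughly 1.1 km north of the anchor
    ("bg_far", 34.20, -118.25),   # roughly 16.7 km north of the anchor
]
print(enforce_spacing(candidates, anchors, min_km=5.0))  # ['bg_far']
```

A projected coordinate system with Euclidean distances would work equally well for a single metro area; haversine is used here only because it needs no extra dependencies.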
Key features
share_college_plus — education attainment share
share_commute_60p — long‑commute share
pop_density — population per km² (BG)
occ_units_density — occupied housing units per km²
potbus_per_1k — “PotentialBus” per 1,000 residents (proxy for business intensity)
median_income — median household income
Most scripts operate on robust z‑scores (suffix _z) so variables are comparable and outliers are dampened.
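As a rough illustration of what a robust z‑score looks like, the sketch below uses one common definition: center on the median and scale by the median absolute deviation (MAD), multiplied by 1.4826 so the result is comparable to a standard z‑score for normally distributed data. This definition is an assumption; the repo may use a different variant (for example, IQR‑based scaling or clipping), and the function name and sample values are hypothetical.

```python
from statistics import median

MAD_SCALE = 1.4826  # makes the MAD comparable to a standard deviation for normal data


def robust_z(values):
    """Robust z-scores: (x - median) / (1.4826 * MAD)."""
    values = list(values)
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # Degenerate case (at least half the values equal the median):
        # return zeros instead of dividing by zero. This is a choice made
        # for the sketch, not necessarily what the pipeline does.
        return [0.0 for _ in values]
    return [(v - med) / (MAD_SCALE * mad) for v in values]


# Hypothetical median_income values with one extreme block group.
median_income = [40_000, 52_000, 55_000, 61_000, 250_000]
median_income_z = robust_z(median_income)
for raw, z in zip(median_income, median_income_z):
    print(f"{raw:>8} -> {z:.2f}")
```

Because the median and MAD barely move when a single extreme value is present, the typical block groups keep sensible scores instead of being compressed toward zero, as they would be if the mean and standard deviation were used.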
How to interpret the maps
- Cluster Ranking Map: shows BGs colored by cluster rank (Rank 1 = most promising cluster group).
- Stress‑Test Aggregated Map: shows candidates that survive across regimes; tooltips include earliest regime and frequency.
- Grey BGs: excluded from modeling (missing data, open‑space, etc.). Water‑dominant BGs are rendered transparent so the ocean stays unshaded.
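The "earliest regime" and "frequency" values shown in the Stress‑Test Aggregated Map tooltips could be derived along the following lines. This is a minimal sketch, assuming each regime yields a list of surviving candidate IDs and that regimes are supplied in strict‑to‑relaxed order; the regime names, candidate IDs, and function name are hypothetical.

```python
def aggregate_regimes(regime_results):
    """Aggregate candidates across a stress-test ladder.

    regime_results: list of (regime_name, candidate_ids), ordered strict -> relaxed.
    Returns (bg_id, info) pairs sorted by frequency (descending), then by how
    early the candidate first appears, then by ID for a stable order.
    """
    summary = {}
    for order, (regime, ids) in enumerate(regime_results):
        for bg_id in set(ids):  # count each candidate at most once per regime
            # Regimes are visited strict -> relaxed, so the first time a
            # candidate is seen records its earliest regime.
            entry = summary.setdefault(
                bg_id,
                {"frequency": 0, "earliest_regime": regime, "earliest_order": order},
            )
            entry["frequency"] += 1
    return sorted(
        summary.items(),
        key=lambda kv: (-kv[1]["frequency"], kv[1]["earliest_order"], kv[0]),
    )


# Hypothetical ladder: candidate A survives every regime, C only the most relaxed.
ladder = [
    ("strict", ["A"]),
    ("moderate", ["A", "B"]),
    ("relaxed", ["A", "B", "C"]),
]
result = aggregate_regimes(ladder)
for bg_id, info in result:
    print(bg_id, info["frequency"], info["earliest_regime"])
# A 3 strict
# B 2 moderate
# C 1 relaxed
```

A candidate that first appears under a strict regime and recurs in every later one is the most robust; one that appears only under the most relaxed regime is the most marginal.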
Run everything with python scripts/run_all.py and then publish maps with python scripts/publish_maps.py.