Methodology

This page summarizes the pipeline at a high level. For the detailed write‑up, see METHODOLOGY.md and CODE_GUIDE.md in the repo.

Pipeline overview

Cluster + Rank → Monte Carlo stability → Stress‑test ladder → Spacing → Twin similarity

Cluster + Rank: group block groups (BGs) by feature similarity (computed on robust z‑scores), then rank the clusters by a LocationScore.
Monte Carlo: re‑fit the clustering under perturbations (bootstrap resampling plus added noise) to measure how stable each selection is, reported as a pass rate.
Stress‑test ladder: run a sequence of feasibility regimes from strict to relaxed, then aggregate candidates by how often they appear (frequency) and by the earliest regime in which they appear.
Spacing: enforce a minimum distance between candidates and anchor BGs (the public demo uses synthetic anchors).
Twin similarity: match each candidate to its most similar anchor profile (e.g., by Mahalanobis distance).
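As an illustration of the twin‑similarity step, here is a minimal sketch of matching candidates to anchors by Mahalanobis distance. The function name nearest_twin, the choice of reference sample for the covariance, and the toy data are all assumptions made for this example; the repo's scripts may be organized differently.

```python
import numpy as np

def nearest_twin(candidates, anchors, reference):
    """Return, per candidate, the index of the closest anchor and its distance.

    candidates: (n_c, d) array of feature values (e.g. the *_z columns)
    anchors:    (n_a, d) array of anchor profiles
    reference:  (n, d) array used to estimate the covariance
                (assumption: all modeled block groups)
    """
    candidates = np.asarray(candidates, dtype=float)
    anchors = np.asarray(anchors, dtype=float)
    cov = np.cov(np.asarray(reference, dtype=float), rowvar=False)
    cov_inv = np.linalg.pinv(cov)  # pseudo-inverse tolerates collinear features

    # Pairwise differences, shape (n_c, n_a, d)
    diff = candidates[:, None, :] - anchors[None, :, :]
    # Squared Mahalanobis distance for every candidate/anchor pair
    d2 = np.einsum("cad,de,cae->ca", diff, cov_inv, diff)
    d2 = np.maximum(d2, 0.0)  # guard against tiny negative round-off

    idx = d2.argmin(axis=1)
    return idx, np.sqrt(d2[np.arange(len(candidates)), idx])

# Toy data (made up): two features, two anchors, two candidates
reference = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
anchors = np.array([[1.0, 0.0], [3.0, 4.0]])
candidates = np.array([[0.0, 0.0], [3.0, 3.5]])

twin_idx, twin_dist = nearest_twin(candidates, anchors, reference)
```

Unlike plain Euclidean distance, the Mahalanobis distance accounts for correlation between features, so two strongly correlated variables do not count twice toward similarity.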

share_college_plus: share with a college degree or higher (educational attainment)
share_commute_60p: share with a commute of 60 minutes or more (long commute)
pop_density: population per km² of the block group (BG)
occ_units_density: occupied housing units per km²
potbus_per_1k: “PotentialBus” per 1,000 residents (a proxy for business intensity)
median_income: median household income

Most scripts operate on robust z‑scores of these features (suffix _z), so that variables are on a comparable scale and the influence of outliers is dampened.
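For readers unfamiliar with robust z‑scores, the sketch below uses the common median/MAD definition. This is an assumption: the repo may use a different robust scaling (for example, one based on the interquartile range), and the function name robust_z and the sample values are made up for illustration.

```python
import numpy as np

def robust_z(values, scale=1.4826):
    """Median/MAD robust z-score (an assumed definition, not necessarily the repo's).

    The factor 1.4826 makes the MAD comparable to a standard deviation
    for normally distributed data.
    """
    x = np.asarray(values, dtype=float)
    med = np.nanmedian(x)
    mad = np.nanmedian(np.abs(x - med))
    if mad == 0:
        # Degenerate case: at least half the values equal the median.
        # Return zeros, keeping missing values as NaN.
        return np.where(np.isnan(x), np.nan, 0.0)
    return (x - med) / (scale * mad)

# Made-up incomes with one extreme value
median_income = [52_000, 61_000, 58_000, 49_000, 400_000]
median_income_z = robust_z(median_income)
```

Because the center and scale come from the median and the MAD, the single extreme value does not inflate the scale the way it would with a mean and standard deviation, so the remaining values keep sensible scores.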

Cluster Ranking Map: shows BGs colored by cluster rank (Rank 1 = most promising cluster group).
Stress‑Test Aggregated Map: shows candidates that survive across regimes; tooltips include earliest regime and frequency.
Grey BGs: excluded from modeling (missing data, open space, etc.). Water‑dominant BGs are rendered transparent so the ocean stays unshaded.
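The earliest regime and frequency shown in the Stress‑Test Aggregated Map tooltips could be derived along the following lines. This is only a sketch under stated assumptions: the regime names, the BG identifiers, the function aggregate_ladder, the definition of frequency as the share of regimes in which a candidate appears, and the sort order are all hypothetical.

```python
from collections import defaultdict

def aggregate_ladder(regime_results):
    """Aggregate candidates across regimes.

    regime_results: list of (regime_name, candidate_ids), ordered strict -> relaxed.
    Returns one row per candidate, sorted by earliest regime (strictest first),
    then by descending frequency, then by id.
    """
    n_regimes = len(regime_results)
    hits = defaultdict(list)  # candidate id -> positions of regimes where it appears
    for position, (_, candidate_ids) in enumerate(regime_results):
        for bg_id in set(candidate_ids):  # count each candidate once per regime
            hits[bg_id].append(position)

    rows = []
    for bg_id, positions in hits.items():
        first = min(positions)
        rows.append({
            "bg_id": bg_id,
            "earliest_regime": regime_results[first][0],
            "frequency": len(positions) / n_regimes,  # assumed: share of regimes
            "_position": first,
        })

    rows.sort(key=lambda r: (r["_position"], -r["frequency"], r["bg_id"]))
    for r in rows:
        del r["_position"]  # helper key used only for sorting
    return rows

# Hypothetical ladder with three regimes and made-up BG ids
ladder = [
    ("strict", ["A"]),
    ("moderate", ["A", "B"]),
    ("relaxed", ["A", "B", "C"]),
]
summary = aggregate_ladder(ladder)
```

In this toy ladder, candidate A survives every regime and is first seen under the strictest one, so it sorts to the top of the summary.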

Run everything with python scripts/run_all.py and then publish maps with python scripts/publish_maps.py.