AI-powered epidemic forecasting, hotspot detection, and risk classification across 201 countries using epidemiologically grounded machine learning.
| Country | Risk | Rₜ | 7d Growth |
|---|---|---|---|
Hover over a region to see details. Click to open a country panel. Pulsing circles mark active hotspots where Rₜ > 2.0.
Risk scored on Rₜ proxy, 7-day growth rate, and cases per million. Thresholds from WHO multi-indicator outbreak definitions. Click another region to compare.
The model predicts 7 days ahead, feeds the prediction back as input, and repeats 4× (28 days in total) to give the roughly 30-day projection. The confidence band comes from 50 Monte Carlo runs with ±2% feature noise.
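The iterative projection can be sketched as below. This is a minimal illustration, not the project's code: `toy_model` is a hypothetical stand-in for the trained 7-day-ahead model, the noise is assumed to be uniform and applied once to the starting features, and the band is taken as the 10th to 90th percentile across runs (an 80% band).

```python
import numpy as np

def toy_model(features):
    """Hypothetical stand-in for the trained model: 10% growth per 7-day step."""
    return 1.10 * features[0]

def recursive_forecast(model, features, steps=4):
    """Predict 7 days ahead, feed the prediction back in, repeat `steps` times."""
    feats = np.array(features, dtype=float)  # copy so the caller's data is untouched
    path = []
    for _ in range(steps):
        pred = model(feats)
        path.append(pred)
        feats[0] = pred  # the prediction becomes the next step's case-count input
    return np.array(path)

def monte_carlo_band(model, features, steps=4, runs=50, noise=0.02, seed=0):
    """Central path plus a 10th-90th percentile band from noisy reruns."""
    rng = np.random.default_rng(seed)
    base = np.asarray(features, dtype=float)
    paths = np.array([
        # Assumption: multiplicative uniform noise of +/- `noise` on every feature.
        recursive_forecast(model, base * (1 + rng.uniform(-noise, noise, base.shape)), steps)
        for _ in range(runs)
    ])
    lower, upper = np.percentile(paths, [10, 90], axis=0)
    return lower, recursive_forecast(model, base, steps), upper

# Illustrative inputs: first feature is the current 7-day case count.
lower, central, upper = monte_carlo_band(toy_model, [100.0, 0.5])
```

Because each step consumes the previous prediction, errors compound, so the band should be read as a lower bound on the real uncertainty at later steps.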
Predicting infectious disease spread requires integrating epidemic dynamics, demographic vulnerability, and policy response. Our pipeline builds four layers: time-series forecasting for case counts, risk classification for severity, compartmental SIR modeling for transmission dynamics, and K-Means clustering for hotspot detection.
JHU CSSE provides wide-format daily time series of cumulative case counts (one column per date). We transform these to long format via melt(), aggregate sub-national provinces to country level, then left-join with OWID on country + date. Dynamic columns (vaccination, stringency) are forward-filled within each country. Critically, the fill runs only after rows are sorted by date, so values propagate forward in time and future data cannot leak backwards into past rows.
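A minimal sketch of these steps on toy data follows. The column names (`Province/State`, `Country/Region`, `country`, `stringency_index`) and the helper `build_panel` are assumptions for illustration; the real tables and the project's own function names may differ.

```python
import pandas as pd

# Toy JHU-style wide table: one row per province, one column per date.
jhu_wide = pd.DataFrame({
    "Province/State": ["Hubei", "Beijing", None],
    "Country/Region": ["China", "China", "Italy"],
    "1/22/20": [444, 14, 0],
    "1/23/20": [444, 22, 0],
    "1/24/20": [549, 36, 2],
})

# Toy OWID-style table with gaps in a dynamic column.
owid = pd.DataFrame({
    "country": ["China", "China", "Italy", "Italy", "Italy"],
    "date": pd.to_datetime(
        ["2020-01-22", "2020-01-24", "2020-01-22", "2020-01-23", "2020-01-24"]
    ),
    "stringency_index": [10.0, 40.0, None, 5.0, None],
})

def build_panel(jhu_wide, owid, dynamic_cols):
    id_cols = ["Province/State", "Country/Region"]
    # Wide -> long: one row per (province, date).
    long = jhu_wide.melt(id_vars=id_cols, var_name="date", value_name="cases")
    long["date"] = pd.to_datetime(long["date"], format="%m/%d/%y")
    # Collapse provinces to country level.
    country = (
        long.groupby(["Country/Region", "date"], as_index=False)["cases"].sum()
        .rename(columns={"Country/Region": "country"})
    )
    merged = country.merge(owid, on=["country", "date"], how="left")
    # Sort first, then forward-fill per country, so values only move forward in time.
    merged = merged.sort_values(["country", "date"]).reset_index(drop=True)
    merged[dynamic_cols] = merged.groupby("country")[dynamic_cols].ffill()
    return merged

panel = build_panel(jhu_wide, owid, ["stringency_index"])
```

Rows that precede a country's first observed value stay missing rather than being back-filled, which is the behaviour the leakage argument requires.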
All 21 features have direct biological justification:
Random Forest over LSTM: Tree-based models tend to outperform sequence models when the series is non-stationary (multiple waves with different parameters) and the dataset is wide (21 features) relative to per-country length (~1000 rows). RF also gives interpretable feature importances, which is critical for biological insight.
Time-based split: Non-negotiable for time series. Random splits cause data leakage: the model trains on future rows and is then scored on past rows, which artificially inflates R². We train on data before 2022-01-01 and test on data from 2022-01-01 onwards.
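The split itself is a date filter rather than a shuffle. A minimal sketch, assuming a `date` column of timestamps (the column name and the helper `time_split` are illustrative, not taken from the project):

```python
import pandas as pd

def time_split(df, cutoff="2022-01-01", date_col="date"):
    """Rows strictly before the cutoff go to train; the cutoff date onwards goes to test."""
    cutoff = pd.Timestamp(cutoff)
    train = df[df[date_col] < cutoff]
    test = df[df[date_col] >= cutoff]
    return train, test

# Toy frame straddling the cutoff.
dates = pd.date_range("2021-12-29", "2022-01-03", freq="D")
frame = pd.DataFrame({"date": dates, "new_cases": range(len(dates))})
train, test = time_split(frame)

# Sanity check: no training row is later than any test row.
no_overlap = train["date"].max() < test["date"].min()
```

Note that lagged or rolling features computed before splitting can still carry a few days of test-period information into late training rows, so it is worth checking how such features are built.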
SIR per wave, not full timeline: Fitting the full timeline averages over multiple waves with different β and γ, yielding R₀ = β/γ ≈ 1. Fitting wave 1 alone isolates biological transmission before interventions, giving R₀ = 2.5–3.0, consistent with published literature.
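A single-wave fit can be sketched with `scipy.optimize.curve_fit` as below. This is an illustration under stated assumptions: the population is normalised to 1, the fitted quantity is the infected fraction, the initial infected fraction `I0` is fixed, and the "observed" wave is synthetic data generated from known parameters rather than real case counts. The names `sir_infected` and `fit_wave` are hypothetical.

```python
import numpy as np
from scipy.integrate import odeint
from scipy.optimize import curve_fit

I0 = 1e-5  # assumed initial infected fraction

def sir_infected(t, beta, gamma):
    """Infected fraction I(t) from the SIR equations with population normalised to 1."""
    def deriv(y, _t):
        s, i, _r = y
        return [-beta * s * i, beta * s * i - gamma * i, gamma * i]
    # Tight tolerances keep the finite-difference Jacobian in curve_fit smooth.
    sol = odeint(deriv, [1.0 - I0, I0, 0.0], t, rtol=1e-10, atol=1e-12)
    return sol[:, 1]

def fit_wave(t, infected_fraction):
    """Fit beta and gamma to one wave and return (beta, gamma, R0 = beta / gamma)."""
    (beta, gamma), _cov = curve_fit(
        sir_infected, t, infected_fraction,
        p0=(0.4, 0.1),
        bounds=([1e-6, 1e-6], [2.0, 1.0]),
    )
    return beta, gamma, beta / gamma

# Synthetic wave with beta = 0.5 and gamma = 0.2, so the true R0 is 2.5.
t = np.arange(80.0)
observed = sir_infected(t, 0.5, 0.2)
beta_hat, gamma_hat, r0_hat = fit_wave(t, observed)
```

On real data the window passed to the fit would be restricted to the first wave, which is what keeps later interventions from pulling the estimate towards 1.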
The 4-level system (Low / Medium / High / Critical) scores each country-week on three epidemiological signals, each worth up to 3 points: Rₜ proxy, 7-day growth rate, and cases per million. The total maps to a level: 0–1 = Low, 2–4 = Medium, 5–6 = High, 7–9 = Critical. This mirrors the WHO multi-indicator approach, which avoids single-metric thresholds.
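The scoring can be sketched as follows. The score-to-level bands come from the description above; everything else is an assumption for illustration: each signal is taken to score 0 to 3 points so that the whole 0–9 range is reachable, and the cut-off values in `RT_CUTS`, `GROWTH_CUTS`, and `CPM_CUTS` are hypothetical placeholders, not the project's thresholds.

```python
from bisect import bisect_right

# Hypothetical cut-offs, for illustration only.
RT_CUTS = (1.0, 1.5, 2.0)        # Rt proxy
GROWTH_CUTS = (0.0, 0.10, 0.25)  # 7-day growth as a fraction
CPM_CUTS = (10.0, 100.0, 500.0)  # cases per million

def points(value, cuts):
    """Number of cut-offs the value meets or exceeds, giving 0 to 3 points."""
    return bisect_right(cuts, value)

def risk_level(rt, growth_7d, cases_per_million):
    """Return (score, level) using the 0-1 / 2-4 / 5-6 / 7-9 bands."""
    score = (
        points(rt, RT_CUTS)
        + points(growth_7d, GROWTH_CUTS)
        + points(cases_per_million, CPM_CUTS)
    )
    if score <= 1:
        level = "Low"
    elif score <= 4:
        level = "Medium"
    elif score <= 6:
        level = "High"
    else:
        level = "Critical"
    return score, level

quiet = risk_level(0.8, -0.05, 5.0)
surging = risk_level(2.3, 0.30, 800.0)
```

Summing points across signals means no single metric can push a country into Critical on its own, which is the point of the multi-indicator design.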
Built for the Epidemic Spread Prediction hackathon track, this project combines epidemiological science with machine learning to produce biologically meaningful insights.
| Criterion | Implementation | Status |
|---|---|---|
| Outbreak prediction | RF + XGBoost 7-day forecasting, time-split validated | Done |
| Hotspot detection | K-Means on per-capita burden features → 4 burden groups | Done |
| Transmission modeling | Wave-1 SIR fit with scipy curve_fit → realistic R₀ 2.5–3.0 | Done |
| Historical outbreak analysis | EDA: 201 countries, wave detection, CFR, Rₜ trends | Done |
| Demographic/mobility factors | OWID: vaccination, stringency, median age, pop density, hospital beds | Done |
| Risk visualization | Interactive maps, 4-level classification, global dashboard | Done |
| Biological insights | Methodology section with epi rationale for every feature + references | Done |
| 30-day forecast | Iterative prediction with 80% Monte Carlo confidence band | Done |