🔬 JHU + OWID · 225,321 Records · 21 Features

Predicting the Epidemic
Before it Begins

AI-powered epidemic forecasting, hotspot detection, and risk classification across 201 countries using epidemiologically grounded machine learning.

Critical Risk: 14 countries

High Risk: 22 countries

Model R²: 0.89 (Random Forest)

Countries: 201 tracked globally

Records: 225K daily data points

14 countries at CRITICAL risk — Rₜ > 2.0 and weekly growth > 50%. RF MAPE ≈ 15%, time-split validated on 2022–2023 holdout.

Epidemic Trends

Daily New Cases — Top 4 Countries (7-day avg, millions)

Effective Reproduction Number Rₜ Over Time

Country Risk Assessment

Risk Ranking — Latest Assessment

Country	Risk	Rₜ	7d Growth

Risk Distribution Over Time (countries per level)

SIR Transmission Modeling

Wave-1 Basic Reproduction Number R₀ per Country

Key Insights

Vaccination Effect

Countries with >60% coverage show ~40% lower cases per million in subsequent waves

Age Vulnerability

Median age >40 correlates with 2.3× higher CFR — consistent with Levin et al. IFR data

Policy Lag

Stringency increase precedes Rₜ reduction by ~14 days — matching incubation + reporting delay window

Wave R₀ Range

Wave-1 R₀ of 2.5–3.0 consistent with published COVID-19 literature (Liu et al., 2020)

Epidemic Burden Clusters (cases per million)

Global Risk Map

Hover over a region to see details. Click to open a country panel. Pulsing circles mark active hotspots where Rₜ > 2.0.

Critical: 14 countries

High: 22 countries

Medium: 54 countries

Low: 111 countries

● Pulsing = active hotspot (Rₜ > 2.0)

Risk scored on Rₜ proxy, 7-day growth rate, and cases per million. Thresholds from WHO multi-indicator outbreak definitions. Click another region to compare.

Cluster Analysis

Burden Clusters (cases per million)

Vaccination vs Cases Per Million

30-Day Iterative Forecast

The model predicts 7 days ahead, feeds the prediction back as input, and repeats for four steps in total, giving a 28-day (roughly 30-day) projection. The confidence band comes from 50 Monte Carlo runs with ±2% feature noise.

India — 30-Day Forecast

US — 30-Day Forecast

Brazil — 30-Day Forecast

Forecast Summary — Top Countries

How It Works

1. Take the last known row per country: confirmed cases, Rₜ proxy, growth rate, vaccination rate, and the rest of the 21 features.

2. Feed it into the Random Forest model to predict cumulative confirmed cases 7 days ahead.

3. Update the row: shift all lag features forward and recalculate Rₜ and growth rate from the predicted value.

4. Repeat 3 more times for a 28-day projection. Run 50× with ±2% noise to get the confidence band.
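
The four steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the project's code: `predict_7d` is a hypothetical stand-in for the trained Random Forest (here a simple growth multiplier), the row carries only three illustrative features rather than all 21, and taking the 10th and 90th percentiles of the runs is one assumed way to form an 80% band.

```python
import random

def predict_7d(row):
    """Hypothetical stand-in for the trained model: cumulative cases 7 days ahead."""
    return row["confirmed"] * (1.0 + row["growth_7d"])

def roll_forward(row, predicted):
    """Step 3: shift the 7-day lag forward and recompute growth from the prediction."""
    return {
        "confirmed": predicted,
        "lag_7d": row["confirmed"],
        "growth_7d": predicted / row["confirmed"] - 1.0,
    }

def forecast_path(last_row, steps=4, noise=0.0, rng=None):
    """Steps 1-4: predict, feed back, repeat. Returns one prediction per 7-day step."""
    rng = rng or random.Random(0)
    row, path = dict(last_row), []
    for _ in range(steps):
        # Perturb every feature by up to +/- noise before predicting.
        noisy = {k: v * (1.0 + rng.uniform(-noise, noise)) for k, v in row.items()}
        predicted = predict_7d(noisy)
        path.append(predicted)
        row = roll_forward(row, predicted)
    return path

def monte_carlo_band(last_row, runs=50, noise=0.02, seed=0):
    """Run the iterative forecast `runs` times and return (low, high) per step."""
    rng = random.Random(seed)
    paths = [forecast_path(last_row, noise=noise, rng=rng) for _ in range(runs)]
    band = []
    for step_values in zip(*paths):
        ordered = sorted(step_values)
        low = ordered[int(0.10 * (runs - 1))]   # approximate 10th percentile
        high = ordered[int(0.90 * (runs - 1))]  # approximate 90th percentile
        band.append((low, high))
    return band

# Illustrative starting row (made-up numbers).
last_row = {"confirmed": 1000.0, "lag_7d": 900.0, "growth_7d": 0.10}
path = forecast_path(last_row)
band = monte_carlo_band(last_row)
```

Because each step reuses the previous prediction, noise injected early is carried into every later step, which is why the band at the fourth step is wider than at the first.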

Data Cutoff

JHU dataset ends March 2023. All forecasts demonstrate the architecture on historical holdout — not live prediction.

Error Propagation

Each step compounds prediction error. Confidence band widens sharply at day 21+. Treat as directional, not point-exact.

Future Work

Live deployment: WHO/ECDC data feeds. SEIR model with vaccination compartment. Mobility data from Google/Apple.

Model Performance

Random Forest

R² score: 0.8921

Raw MAE: 421,532

MAE per million pop.: ~320

MAPE: ~15%

Median country MAE: ~8,400

n_estimators: 200

XGBoost

R² score: 0.7634

Raw MAE: 1,034,359

Learning rate: 0.05

n_estimators: 300

Assessment: Needs tuning

Winner: RF preferred
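
For readers comparing the two regressors, the sketch below shows how R², MAE, and MAPE are conventionally computed. It is a plain-Python illustration with made-up numbers, not the project's evaluation code; skipping zero targets in MAPE is an assumption made here to avoid division by zero.

```python
def mae(y_true, y_pred):
    """Mean absolute error, in the same units as the target."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mape(y_true, y_pred):
    """Mean absolute percentage error; zero targets are skipped (assumption)."""
    ratios = [abs(t - p) / abs(t) for t, p in zip(y_true, y_pred) if t != 0]
    return 100.0 * sum(ratios) / len(ratios)

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Made-up example values, for illustration only.
y_true = [100, 200, 300, 400]
y_pred = [110, 190, 330, 380]
example_mae = mae(y_true, y_pred)
example_mape = mape(y_true, y_pred)
example_r2 = r2_score(y_true, y_pred)
```

MAE scales with the size of the target, so a large raw MAE on cumulative counts can coexist with a modest MAPE.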

Risk Classifier (RF)

Critical detected: 14

High detected: 22

Medium detected: 54

Low detected: 111

class_weight: balanced

Total tracked: 201

Feature Importance & Engineering

Feature Importance — Random Forest (top 14)

Epidemiological Rationale

Confirmed cases + lags (1,3,7,14d) — captures incubation period dynamics. COVID-19 median incubation is about 5 days, so the 7-day lag spans incubation plus a short reporting delay.

Rₜ proxy — fundamental epidemic control parameter. Rₜ > 1 → outbreak grows. Standard WHO monitoring metric.

Growth rate (7d) — acceleration/deceleration of spread. Positive = emerging outbreak, negative = containment.

Stringency index — government policy effect on β (contact rate). ~14 day lag to measurable case impact.

Vaccination % — reduces susceptible pool S in SIR. Direct biological protection signal integrated into features.

Median age + pop density — proxy for IFR (infection-fatality rate) and contact rate β respectively.

SIR Transmission Model

Wave-1 SIR Fit — β (transmission rate), γ (recovery rate), R₀ = β/γ

Actual vs Predicted — India Test Set

Error Distribution Across Countries (log MAE)

Methodology

Problem Statement

Predicting infectious disease spread requires integrating epidemic dynamics, demographic vulnerability, and policy response. Our pipeline builds four layers: time-series forecasting for case counts, risk classification for severity, compartmental SIR modeling for transmission dynamics, and K-Means clustering for hotspot detection.

Data Pipeline

JHU CSSE provides wide-format daily case counts (one column per date). We transform to long format via melt(), aggregate sub-national provinces to country level, then left-join with OWID on country + date. Dynamic columns (vaccination, stringency) are forward-filled within each country in time order — critically, this is done after sorting by date to prevent future data leaking backwards into past rows.
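
A minimal pandas sketch of the reshaping, aggregation, join, and forward-fill described above. The two toy tables and the OWID-side column names (`country`, `stringency_index`) are assumptions for illustration; the real files have more columns and need country-name harmonization, which is not shown.

```python
import pandas as pd

# Toy JHU-style wide table: one row per province, one column per date.
jhu_wide = pd.DataFrame({
    "Province/State": ["A", "B", None],
    "Country/Region": ["X", "X", "Y"],
    "1/22/20": [1, 2, 5],
    "1/23/20": [3, 4, 6],
    "1/24/20": [6, 7, 8],
})

# Toy OWID-style table with gaps in a dynamic column.
owid = pd.DataFrame({
    "country": ["X", "X", "Y"],
    "date": pd.to_datetime(["2020-01-22", "2020-01-24", "2020-01-23"]),
    "stringency_index": [10.0, 30.0, 50.0],
})

# Wide to long: one row per (province, date).
long_df = jhu_wide.melt(
    id_vars=["Province/State", "Country/Region"],
    var_name="date",
    value_name="confirmed",
)
long_df["date"] = pd.to_datetime(long_df["date"], format="%m/%d/%y")

# Aggregate provinces to country level.
country = (
    long_df.groupby(["Country/Region", "date"], as_index=False)["confirmed"]
    .sum()
    .rename(columns={"Country/Region": "country"})
)

# Left-join, then sort by date BEFORE forward-filling within each country,
# so values only ever propagate from the past into the future.
merged = country.merge(owid, on=["country", "date"], how="left")
merged = merged.sort_values(["country", "date"]).reset_index(drop=True)
merged["stringency_index"] = merged.groupby("country")["stringency_index"].ffill()
```

Rows that precede a country's first observed value stay missing after the forward-fill, since no earlier value exists to carry forward.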

Feature Engineering — Epidemiological Grounding

All 21 features have direct biological justification:

Rₜ proxy — 7d avg cases ÷ 7d avg from 7 days prior. When Rₜ > 1, epidemic expands. Standard WHO surveillance metric.
Growth rate (7d) — Captures acceleration of spread. 20% weekly = emerging outbreak signal.
Doubling time — log(2) / log(1 + daily_growth). 7-day doubling = public health emergency threshold.
CFR — Deaths / confirmed. Rising CFR signals healthcare stress or variant shift.
Lag features (1,3,7,14d) — COVID-19 median incubation is about 5 days, so the 7-day lag spans incubation plus a short reporting delay. 14-day = full contact-tracing window.
Stringency index — Proxies contact rate reduction (β in SIR). ~14-day lag to case impact (incubation + reporting delay).
Vaccination coverage — Reduces susceptible pool S. Direct biological protection mechanism in SIR framework.
Median age, population density, hospital beds — Demographic vulnerability (IFR proxy), transmission rate, and healthcare capacity.
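
The formula-based features in this list can be written as small functions. This is a sketch, not the project's implementation: the Rₜ proxy, doubling time, and CFR follow the definitions given above, while defining the 7-day growth rate as the relative change between consecutive weekly averages is an assumption, since the text does not spell it out.

```python
import math

def rt_proxy(daily_cases):
    """7-day average of the latest week divided by the 7-day average one week earlier."""
    if len(daily_cases) < 14:
        raise ValueError("need at least 14 days of data")
    recent = sum(daily_cases[-7:]) / 7
    prior = sum(daily_cases[-14:-7]) / 7
    return recent / prior if prior > 0 else float("nan")

def growth_rate_7d(daily_cases):
    """Relative change between consecutive weekly averages (assumed definition)."""
    return rt_proxy(daily_cases) - 1.0

def doubling_time(daily_growth):
    """log(2) / log(1 + daily_growth); infinite when cases are not growing."""
    if daily_growth <= 0:
        return float("inf")
    return math.log(2) / math.log(1.0 + daily_growth)

def cfr(deaths, confirmed):
    """Case fatality ratio: deaths / confirmed."""
    return deaths / confirmed if confirmed > 0 else float("nan")

# Made-up series: 100 cases/day for a week, then 150 cases/day.
daily = [100] * 7 + [150] * 7
example_rt = rt_proxy(daily)
example_growth = growth_rate_7d(daily)
```

With the made-up series, the proxy is 1.5 and the weekly growth is 50%, which sits at the growth cutoff quoted for the CRITICAL level.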

Model Selection Rationale

Random Forest over LSTM: Tree-based models often outperform sequence models when the series is non-stationary (multiple waves with different parameters) and the dataset is wide (21 features) relative to per-country length (~1000 rows). RF also gives interpretable feature importances, which matter for biological insight.

Time-based split: Non-negotiable for time series. Random splits cause data leakage — the model sees future rows while predicting past rows, artificially inflating R². We train on pre-2022-01-01 and test on 2022 onwards.
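
A minimal sketch of the time-based split, using the 2022-01-01 cutoff stated above. The row layout (a list of dicts with a `date` key) is an assumption for illustration.

```python
from datetime import date, timedelta

CUTOFF = date(2022, 1, 1)

def time_split(rows, cutoff=CUTOFF):
    """Train on rows strictly before the cutoff, test on rows from the cutoff onwards."""
    train = [r for r in rows if r["date"] < cutoff]
    test = [r for r in rows if r["date"] >= cutoff]
    return train, test

# Made-up rows spanning the cutoff: 2021-12-30 through 2022-01-04.
rows = [
    {"date": date(2021, 12, 30) + timedelta(days=i), "confirmed": 1000 + 10 * i}
    for i in range(6)
]
train, test = time_split(rows)
```

Every training row is older than every test row, so the model never sees information from the period it is evaluated on.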

SIR per wave, not full timeline: Full-timeline fitting averages over multiple waves with different β/γ, yielding R₀ ≈ 1. Wave-1 fitting isolates biological transmission before interventions, giving R₀ = 2.5–3.0 consistent with published literature.
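
To make the role of β, γ, and R₀ = β/γ concrete, here is a small forward simulation of the SIR equations with Euler steps. It only simulates; the parameter fitting with scipy `curve_fit` mentioned elsewhere on this page is not reproduced, and the values β = 0.5 and γ = 0.2 are illustrative choices that give R₀ = 2.5, not fitted results.

```python
def basic_reproduction_number(beta, gamma):
    """R0 = beta / gamma."""
    return beta / gamma

def simulate_sir(beta, gamma, population, infected0, days, dt=0.1):
    """Euler integration of dS/dt = -beta*S*I/N, dI/dt = beta*S*I/N - gamma*I, dR/dt = gamma*I."""
    s, i, r = population - infected0, float(infected0), 0.0
    history = [(s, i, r)]
    steps_per_day = round(1 / dt)
    for _ in range(days):
        for _ in range(steps_per_day):
            new_infections = beta * s * i / population * dt
            new_recoveries = gamma * i * dt
            s -= new_infections
            i += new_infections - new_recoveries
            r += new_recoveries
        history.append((s, i, r))  # one snapshot per day
    return history

# Illustrative parameters only.
r0 = basic_reproduction_number(0.5, 0.2)
growing = simulate_sir(0.5, 0.2, population=1_000_000, infected0=10, days=120)
fading = simulate_sir(0.1, 0.2, population=1_000_000, infected0=10, days=30)
peak_infected = max(i for _, i, _ in growing)
```

With R₀ above 1 the infected count rises to a peak before declining; with R₀ below 1 it shrinks from the start.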

Risk Classification

A 4-level system (Low / Medium / High / Critical) scores each country-week on three epidemiological signals: Rₜ proxy, 7-day growth rate, and cases per million (0–3 pts each, so totals span 0–9). Total score 0–1 = Low, 2–4 = Medium, 5–6 = High, 7–9 = Critical. This mirrors the WHO multi-indicator approach and avoids single-metric thresholds.
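
A sketch of how such a point-based score could be computed. It assumes each signal contributes 0 to 3 points so that totals cover the 0–9 range of the bands. The per-signal cutoffs below are hypothetical placeholders: only Rₜ > 2.0 and weekly growth > 50% are quoted on this page (for the CRITICAL level), and the remaining values are invented for illustration.

```python
# Hypothetical cutoffs; each one exceeded adds a point (0-3 points per signal).
RT_CUTOFFS = (1.0, 1.5, 2.0)
GROWTH_CUTOFFS = (0.10, 0.20, 0.50)   # 7-day growth as a fraction
CPM_CUTOFFS = (10.0, 100.0, 500.0)    # cases per million (placeholder values)

def points(value, cutoffs):
    """Number of cutoffs the value exceeds, from 0 to len(cutoffs)."""
    return sum(value > c for c in cutoffs)

def risk_level(rt, growth_7d, cases_per_million):
    """Return (score, level) using the bands 0-1 Low, 2-4 Medium, 5-6 High, 7-9 Critical."""
    score = (
        points(rt, RT_CUTOFFS)
        + points(growth_7d, GROWTH_CUTOFFS)
        + points(cases_per_million, CPM_CUTOFFS)
    )
    if score <= 1:
        level = "Low"
    elif score <= 4:
        level = "Medium"
    elif score <= 6:
        level = "High"
    else:
        level = "Critical"
    return score, level

# Made-up inputs for illustration.
calm = risk_level(0.8, -0.10, 5.0)
surge = risk_level(2.3, 0.60, 600.0)
```

Summing points across signals means no single metric can push a country to Critical on its own.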

References

Liu Y et al. (2020). The reproductive number of COVID-19 is higher compared to SARS coronavirus. J Travel Med. R₀ range 2.2–3.6.

Sanche S et al. (2020). High contagiousness and rapid spread of SARS-CoV-2. Emerg Infect Dis. Early Wuhan R₀ up to 5.7.

Levin AT et al. (2020). Assessing age specificity of IFR for COVID-19. Eur J Epidemiol. IFR exponential increase with age.

Hale T et al. (2021). A global panel database of pandemic policies. Nature Human Behaviour. Oxford Stringency Index.

About PredemicAI

Built for the Epidemic Spread Prediction hackathon track. Combining epidemiological science with machine learning to produce biologically meaningful insights.

Datasets

Johns Hopkins CSSE COVID-19 Time Series

Daily confirmed cases and deaths across 201 countries, Feb 2020 – Mar 2023. Primary dataset for forecasting, SIR modeling, and risk classification.

github.com/CSSEGISandData/COVID-19

225,321 records · 201 countries · Primary

Our World in Data COVID-19 Repository

Vaccination rates, government stringency index, testing data, and demographic indicators. Used for feature enrichment and demographic vulnerability analysis.

github.com/owid/covid-19-data

Vaccination · Stringency index · Demographics · Secondary

Technical Stack

🐍

Python / Jupyter

pandas, numpy, scikit-learn, xgboost, scipy, plotly, matplotlib, seaborn

🤖

ML Models

Random Forest, XGBoost, K-Means clustering, SIR compartmental model

🌐

Web Dashboard

Pure HTML/CSS/JS + Chart.js — no server needed, opens in any browser

Hackathon Criteria Coverage

Criterion	Implementation	Status
Outbreak prediction	RF + XGBoost 7-day forecasting, time-split validated	Done
Hotspot detection	K-Means on per-capita burden features → 4 burden groups	Done
Transmission modeling	Wave-1 SIR fit with scipy curve_fit → realistic R₀ 2.5–3.0	Done
Historical outbreak analysis	EDA: 201 countries, wave detection, CFR, Rₜ trends	Done
Demographic/mobility factors	OWID: vaccination, stringency, median age, pop density, hospital beds	Done
Risk visualization	Interactive maps, 4-level classification, global dashboard	Done
Biological insights	Methodology section with epi rationale for every feature + references	Done
30-day forecast	Iterative prediction with 80% Monte Carlo confidence band	Done
