🔬 JHU + OWID · 225,321 Records · 21 Features

Predicting the Epidemic
Before it Begins

AI-powered epidemic forecasting, hotspot detection, and risk classification across 201 countries using epidemiologically grounded machine learning.

Critical Risk: 14 countries
High Risk: 22 countries
Model R²: 0.89 (Random Forest)
Countries: 201 tracked globally
Records: 225K daily data points
14 countries at CRITICAL risk — Rₜ > 2.0 and weekly growth > 50%. RF MAPE ≈ 15%, time-split validated on 2022–2023 holdout.

Epidemic Trends

Daily New Cases — Top 4 Countries (7-day avg, millions)
Effective Reproduction Number Rₜ Over Time

Country Risk Assessment

Risk Ranking — Latest Assessment
Country | Risk | Rₜ | 7d Growth
Risk Distribution Over Time (countries per level)

SIR Transmission Modeling

Wave-1 Basic Reproduction Number R₀ per Country

Key Insights

Vaccination Effect
Countries with >60% coverage show ~40% lower cases per million in subsequent waves
Age Vulnerability
Median age >40 correlates with 2.3× higher CFR — consistent with Levin et al. IFR data
Policy Lag
Stringency increase precedes Rₜ reduction by ~14 days — matching incubation + reporting delay window
Wave R₀ Range
Wave-1 R₀ of 2.5–3.0 consistent with published COVID-19 literature (Liu et al., 2020)
Epidemic Burden Clusters (cases per million)

Global Risk Map

Hover over a region to see details. Click to open a country panel. Pulsing circles mark active hotspots where Rₜ > 2.0.

Critical: 14 countries
High: 22 countries
Medium: 54 countries
Low: 111 countries
Legend: Critical · High · Medium · Low · ● Pulsing = active hotspot (Rₜ > 2.0)
Risk Level · Wave-1 R₀ · Key Countries · Region

Risk scored on Rₜ proxy, 7-day growth rate, and cases per million. Thresholds from WHO multi-indicator outbreak definitions. Click another region to compare.

Cluster Analysis

Burden Clusters (cases per million)
Vaccination vs Cases Per Million

30-Day Iterative Forecast

Model predicts 7 days ahead, feeds the prediction back as input, and repeats for four steps in total, giving a 28-day (roughly 30-day) projection. The confidence band comes from 50 Monte Carlo runs with ±2% feature noise.

India — 30-Day Forecast
US — 30-Day Forecast
Brazil — 30-Day Forecast
Forecast Summary — Top Countries

How It Works

1. Take the last known row per country: confirmed cases, Rₜ proxy, growth rate, vaccination rate, and the rest of the 21 features
2. Feed into the Random Forest model → predict cumulative confirmed cases 7 days ahead
3. Update the row: shift all lag features forward, recalculate Rₜ and growth rate from the predicted value
4. Repeat 3 more times → 28-day projection. Run 50× with ±2% noise → confidence band
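
The four steps above can be sketched in plain Python. This is a minimal illustration, not the project's code: `toy_predict_7d` is a hypothetical stand-in for the trained Random Forest, the feature names (`confirmed`, `lag_7d`, `lag_14d`, `rt_proxy`, `growth_7d`) are assumed, the ±2% noise is assumed to be uniform and multiplicative, and the 80% band is taken as approximate 10th and 90th percentiles of the simulated paths.

```python
import random
from typing import Callable, Dict, List, Tuple

Row = Dict[str, float]


def roll_forward(row: Row, predicted: float) -> Row:
    """Shift the weekly lags forward and recompute the derived features."""
    new_week = predicted - row["confirmed"]       # cases added in the predicted week
    prev_week = row["confirmed"] - row["lag_7d"]  # cases added in the week before
    ratio = new_week / prev_week if prev_week > 0 else 1.0
    updated = dict(row)
    updated.update(
        lag_14d=row["lag_7d"],
        lag_7d=row["confirmed"],
        confirmed=predicted,
        rt_proxy=ratio,          # assumed definition: week-over-week case ratio
        growth_7d=ratio - 1.0,   # assumed definition: weekly relative growth
    )
    return updated


def iterative_forecast(row: Row, predict_7d: Callable[[Row], float],
                       steps: int = 4) -> List[float]:
    """Predict 7 days ahead, feed the prediction back, repeat `steps` times."""
    current = dict(row)
    path = []
    for _ in range(steps):
        predicted = predict_7d(current)
        current = roll_forward(current, predicted)
        path.append(predicted)
    return path


def monte_carlo_band(row: Row, predict_7d: Callable[[Row], float],
                     runs: int = 50, noise: float = 0.02, steps: int = 4,
                     seed: int = 0) -> Tuple[List[float], List[float]]:
    """Perturb every feature by up to ±noise and collect an ~80% band per step."""
    rng = random.Random(seed)
    paths = []
    for _ in range(runs):
        noisy = {k: v * (1.0 + rng.uniform(-noise, noise)) for k, v in row.items()}
        paths.append(iterative_forecast(noisy, predict_7d, steps))
    lower, upper = [], []
    for step in range(steps):
        values = sorted(p[step] for p in paths)
        lower.append(values[int(0.10 * (runs - 1))])  # nearest-rank 10th percentile
        upper.append(values[int(0.90 * (runs - 1))])  # nearest-rank 90th percentile
    return lower, upper


def toy_predict_7d(row: Row) -> float:
    """Hypothetical model: next week adds as many cases as the last week did."""
    return row["confirmed"] + (row["confirmed"] - row["lag_7d"])


last_row = {"confirmed": 1000.0, "lag_7d": 900.0, "lag_14d": 800.0,
            "rt_proxy": 1.0, "growth_7d": 0.0}
path = iterative_forecast(last_row, toy_predict_7d)
lower, upper = monte_carlo_band(last_row, toy_predict_7d)
```

Because each step consumes the previous prediction, any error in week 1 is carried into weeks 2 to 4, which is why the band is expected to widen toward the end of the horizon.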
Data Cutoff
JHU dataset ends March 2023. All forecasts demonstrate the architecture on historical holdout — not live prediction.
Error Propagation
Each step compounds prediction error. Confidence band widens sharply at day 21+. Treat as directional, not point-exact.
Future Work
Live deployment: WHO/ECDC data feeds. SEIR model with vaccination compartment. Mobility data from Google/Apple.

Model Performance

Random Forest
R² score: 0.8921
Raw MAE: 421,532
MAE per million pop.: ~320
MAPE: ~15%
Median country MAE: ~8,400
n_estimators: 200
XGBoost
R² score: 0.7634
Raw MAE: 1,034,359
Learning rate: 0.05
n_estimators: 300
Assessment: Needs tuning
Winner: RF preferred
Risk Classifier (RF)
Critical detected: 14
High detected: 22
Medium detected: 54
Low detected: 111
class_weight: balanced
Total tracked: 201

Feature Importance & Engineering

Feature Importance — Random Forest (top 14)
Epidemiological Rationale
Confirmed cases + lags (1,3,7,14d) — captures incubation period dynamics. COVID median incubation = 5 days; 7d lag is biologically optimal.
Rₜ proxy — fundamental epidemic control parameter. Rₜ > 1 → outbreak grows. Standard WHO monitoring metric.
Growth rate (7d) — acceleration/deceleration of spread. Positive = emerging outbreak, negative = containment.
Stringency index — government policy effect on β (contact rate). ~14 day lag to measurable case impact.
Vaccination % — reduces susceptible pool S in SIR. Direct biological protection signal integrated into features.
Median age + pop density — proxy for IFR (infection-fatality rate) and contact rate β respectively.

SIR Transmission Model

Wave-1 SIR Fit — β (transmission rate), γ (recovery rate), R₀ = β/γ
Actual vs Predicted — India Test Set
Error Distribution Across Countries (log MAE)

Methodology

Problem Statement

Predicting infectious disease spread requires integrating epidemic dynamics, demographic vulnerability, and policy response. Our pipeline builds four layers: time-series forecasting for case counts, risk classification for severity, compartmental SIR modeling for transmission dynamics, and K-Means clustering for hotspot detection.

Data Pipeline

JHU CSSE provides wide-format daily case counts (one column per date). We transform to long format via melt(), aggregate sub-national provinces to country level, then left-join with OWID on country + date. Dynamic columns (vaccination, stringency) are forward-filled within each country in time order — critically, this is done after sorting by date to prevent future data leaking backwards into past rows.
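
A minimal pandas sketch of this pipeline on toy data follows. The JHU column names `Province/State` and `Country/Region` match the public CSV layout; the OWID-side column names (`country`, `date`, `stringency_index`) and the helper name `build_country_panel` are illustrative assumptions, not the project's actual identifiers.

```python
import pandas as pd


def build_country_panel(jhu_wide: pd.DataFrame, owid: pd.DataFrame) -> pd.DataFrame:
    """Wide JHU frame -> long country-level panel joined with OWID columns."""
    # 1. Wide to long: one row per (province, country, date).
    long = jhu_wide.melt(
        id_vars=["Province/State", "Country/Region"],
        var_name="date",
        value_name="confirmed",
    )
    long["date"] = pd.to_datetime(long["date"], format="%m/%d/%y")

    # 2. Aggregate sub-national provinces to country level.
    country = (
        long.groupby(["Country/Region", "date"], as_index=False)["confirmed"]
        .sum()
        .rename(columns={"Country/Region": "country"})
    )

    # 3. Left-join so every JHU row survives even without an OWID match.
    merged = country.merge(owid, on=["country", "date"], how="left")

    # 4. Sort by date BEFORE forward-filling, so values only flow forward in time.
    merged = merged.sort_values(["country", "date"]).reset_index(drop=True)
    merged["stringency_index"] = merged.groupby("country")["stringency_index"].ffill()
    return merged


# Toy inputs: two provinces of country X, one row for country Y.
jhu_wide = pd.DataFrame({
    "Province/State": ["A", "B", None],
    "Country/Region": ["X", "X", "Y"],
    "1/22/20": [1, 2, 5],
    "1/23/20": [3, 4, 6],
})
owid = pd.DataFrame({
    "country": ["X", "Y"],
    "date": pd.to_datetime(["2020-01-22", "2020-01-22"]),
    "stringency_index": [10.0, 20.0],
})

panel = build_country_panel(jhu_wide, owid)
```

In the toy output, the second day has no OWID row, so its stringency value is carried forward from the first day within the same country and never borrowed from a later date or another country.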

Feature Engineering — Epidemiological Grounding

All 21 features have direct biological justification:

  • Rₜ proxy — 7d avg cases ÷ 7d avg from 7 days prior. When Rₜ > 1, epidemic expands. Standard WHO surveillance metric.
  • Growth rate (7d) — Captures acceleration of spread. 20% weekly = emerging outbreak signal.
  • Doubling time — log(2) / log(1 + daily_growth). 7-day doubling = public health emergency threshold.
  • CFR — Deaths / confirmed. Rising CFR signals healthcare stress or variant shift.
  • Lag features (1,3,7,14d) — COVID-19 median incubation = 5 days. 7-day lag is biologically optimal. 14-day = full contact-tracing window.
  • Stringency index — Proxies contact rate reduction (β in SIR). ~14-day lag to case impact (incubation + reporting delay).
  • Vaccination coverage — Reduces susceptible pool S. Direct biological protection mechanism in SIR framework.
  • Median age, population density, hospital beds — Demographic vulnerability (IFR proxy), transmission rate, and healthcare capacity.
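
Several of the features listed above reduce to a few lines of arithmetic. The sketch below works on a plain list of daily new cases; the function names and the guard conditions (returning `None` when there is too little history or a zero denominator) are assumptions made for illustration.

```python
import math
from typing import Optional, Sequence


def mean_window(series: Sequence[float], end: int, width: int = 7) -> float:
    """Mean of the `width` values ending at index `end` (inclusive)."""
    window = series[end - width + 1 : end + 1]
    return sum(window) / width


def rt_proxy(daily_cases: Sequence[float], t: int) -> Optional[float]:
    """7-day average at day t divided by the 7-day average 7 days earlier."""
    if t < 13:  # two full weeks of history are needed
        return None
    previous = mean_window(daily_cases, t - 7)
    return mean_window(daily_cases, t) / previous if previous > 0 else None


def doubling_time(daily_growth: float) -> Optional[float]:
    """log(2) / log(1 + daily_growth); undefined when cases are not growing."""
    if daily_growth <= 0:
        return None
    return math.log(2) / math.log(1 + daily_growth)


def cfr(deaths: float, confirmed: float) -> Optional[float]:
    """Case-fatality ratio: cumulative deaths over cumulative confirmed cases."""
    return deaths / confirmed if confirmed > 0 else None


def lag(series: Sequence[float], t: int, days: int) -> Optional[float]:
    """Value of the series `days` days before index t, if it exists."""
    return series[t - days] if t - days >= 0 else None


# Toy series: one week at 100 cases/day, then one week at 150 cases/day.
cases = [100.0] * 7 + [150.0] * 7
rt_today = rt_proxy(cases, 13)    # 150 / 100
days_to_double = doubling_time(0.10)
```

With 10% daily growth the doubling time comes out at a little over 7 days, which is close to the 7-day emergency threshold mentioned in the list.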

Model Selection Rationale

Random Forest over LSTM: Tree-based models often outperform sequence models when the series is non-stationary (multiple waves with different parameters) and the dataset is wide (21 features) relative to per-country length (~1000 rows). RF also gives interpretable feature importance, which is critical for biological insight.

Time-based split: Non-negotiable for time series. Random splits cause data leakage — the model sees future rows while predicting past rows, artificially inflating R². We train on pre-2022-01-01 and test on 2022 onwards.
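
A time-based split is a filter on the date column rather than a shuffle. A minimal sketch, assuming each row is a dict with a `date` key (the row layout and the helper name `time_split` are hypothetical):

```python
from datetime import date
from typing import Dict, List, Tuple

CUTOFF = date(2022, 1, 1)  # train strictly before, test on or after


def time_split(rows: List[Dict], cutoff: date = CUTOFF) -> Tuple[List[Dict], List[Dict]]:
    """Split rows chronologically so no test-period row reaches the training set."""
    train = [r for r in rows if r["date"] < cutoff]
    test = [r for r in rows if r["date"] >= cutoff]
    return train, test


rows = [
    {"date": date(2021, 12, 31), "confirmed": 100},
    {"date": date(2022, 1, 1), "confirmed": 110},
    {"date": date(2022, 6, 1), "confirmed": 300},
]
train, test = time_split(rows)
```

Every training date precedes every test date, which is the property a random split cannot guarantee.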

SIR per wave, not full timeline: Full-timeline fitting averages over multiple waves with different β/γ, yielding R₀ ≈ 1. Wave-1 fitting isolates biological transmission before interventions, giving R₀ = 2.5–3.0 consistent with published literature.
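
The wave-level fit can be illustrated with `scipy.optimize.curve_fit` on synthetic data. This is a sketch under stated assumptions, not the project's fitting code: the population size `N`, the initial infected count `I0`, the choice to fit the infected curve I(t), the starting guess, and the parameter bounds are all hypothetical. The "observed" curve is generated from known parameters (β = 0.5, γ = 0.2) so the recovered R₀ can be checked against 2.5.

```python
import numpy as np
from scipy.integrate import odeint
from scipy.optimize import curve_fit

N = 1_000_000  # hypothetical population size
I0 = 10        # hypothetical number of infected people on day 0


def sir_infected(t, beta, gamma):
    """Integrate the SIR equations and return the infected compartment I(t)."""
    def deriv(y, _t):
        s, i, _r = y
        new_infections = beta * s * i / N
        return [-new_infections, new_infections - gamma * i, gamma * i]

    solution = odeint(deriv, [N - I0, I0, 0.0], t)
    return solution[:, 1]


# Synthetic wave generated with known parameters, standing in for real data.
t = np.arange(0, 61, dtype=float)
observed = sir_infected(t, 0.5, 0.2)

# Fit beta and gamma from a deliberately wrong starting guess.
(beta_hat, gamma_hat), _ = curve_fit(
    sir_infected, t, observed,
    p0=[0.3, 0.1],
    bounds=([0.01, 0.01], [2.0, 1.0]),
)
r0_hat = beta_hat / gamma_hat  # R0 = beta / gamma
```

Restricting `t` and `observed` to a single wave is what keeps β and γ roughly constant over the fitted window; fitting across several waves forces one parameter pair to explain regimes with different contact rates.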

Risk Classification

A 4-level system (Low / Medium / High / Critical) scores each country-week on three separate epidemiological signals: Rₜ proxy (0–3 pts), 7-day growth rate (0–3 pts), and cases per million (0–3 pts). The total maps to a level: 0–1 = Low, 2–4 = Medium, 5–6 = High, 7–9 = Critical. This mirrors the WHO multi-indicator approach, which avoids single-metric thresholds.
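
The scoring logic can be sketched as follows, assuming each signal contributes 0 to 3 points so that totals span the 0–9 range covered by the four bands. The per-signal cut points below are hypothetical placeholders, chosen only so that the top tier lines up with the dashboard's stated Critical condition (Rₜ > 2.0 and weekly growth > 50%); the real thresholds are not given in this page.

```python
from bisect import bisect_left
from typing import Sequence, Tuple

# Hypothetical cut points: a value earns one point per cut it strictly exceeds.
RT_CUTS = (1.0, 1.5, 2.0)
GROWTH_CUTS = (0.0, 0.2, 0.5)      # 7-day growth as a fraction (0.5 = 50%)
CPM_CUTS = (10.0, 100.0, 500.0)    # cases per million


def points(value: float, cuts: Sequence[float]) -> int:
    """Number of cut points strictly below `value` (0 to 3)."""
    return bisect_left(cuts, value)


def risk_level(rt: float, growth_7d: float, cases_per_million: float) -> Tuple[int, str]:
    """Sum the three signal scores and map the total to a risk level."""
    score = (points(rt, RT_CUTS)
             + points(growth_7d, GROWTH_CUTS)
             + points(cases_per_million, CPM_CUTS))
    if score >= 7:
        return score, "Critical"
    if score >= 5:
        return score, "High"
    if score >= 2:
        return score, "Medium"
    return score, "Low"
```

For example, `risk_level(2.4, 0.6, 800)` scores the maximum on all three signals, while `risk_level(0.8, -0.1, 5)` scores zero on each.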

References

Liu Y et al. (2020). The reproductive number of COVID-19 is higher compared to SARS coronavirus. J Travel Med. R₀ range 2.2–3.6.
Sanche S et al. (2020). High contagiousness and rapid spread of SARS-CoV-2. Emerg Infect Dis. Early Wuhan R₀ up to 5.7.
Levin AT et al. (2020). Assessing age specificity of IFR for COVID-19. Eur J Epidemiol. IFR exponential increase with age.
Hale T et al. (2021). A global panel database of pandemic policies. Nature Human Behaviour. Oxford Stringency Index.
About PredemicAI

Built for the Epidemic Spread Prediction hackathon track. Combining epidemiological science with machine learning to produce biologically meaningful insights.

Datasets

Johns Hopkins CSSE COVID-19 Time Series
Daily confirmed cases and deaths across 201 countries, Feb 2020 – Mar 2023. Primary dataset for forecasting, SIR modeling, and risk classification.
github.com/CSSEGISandData/COVID-19
225,321 records · 201 countries · Primary
Our World in Data COVID-19 Repository
Vaccination rates, government stringency index, testing data, and demographic indicators. Used for feature enrichment and demographic vulnerability analysis.
github.com/owid/covid-19-data
Vaccination · Stringency index · Demographics · Secondary

Technical Stack

🐍
Python / Jupyter
pandas, numpy, scikit-learn, xgboost, scipy, plotly, matplotlib, seaborn
🤖
ML Models
Random Forest, XGBoost, K-Means clustering, SIR compartmental model
🌐
Web Dashboard
Pure HTML/CSS/JS + Chart.js — no server needed, opens in any browser

Hackathon Criteria Coverage

Criterion | Implementation | Status
Outbreak prediction | RF + XGBoost 7-day forecasting, time-split validated | Done
Hotspot detection | K-Means on per-capita burden features → 4 burden groups | Done
Transmission modeling | Wave-1 SIR fit with scipy curve_fit → realistic R₀ 2.5–3.0 | Done
Historical outbreak analysis | EDA: 201 countries, wave detection, CFR, Rₜ trends | Done
Demographic/mobility factors | OWID: vaccination, stringency, median age, pop density, hospital beds | Done
Risk visualization | Interactive maps, 4-level classification, global dashboard | Done
Biological insights | Methodology section with epi rationale for every feature + references | Done
30-day forecast | Iterative prediction with 80% Monte Carlo confidence band | Done

Team