Stroke risk factors are conditions or behaviors that increase the likelihood of having a stroke. They are usually classified into modifiable and non-modifiable categories.
Understanding and addressing modifiable risk factors is essential for the primary and secondary prevention of stroke.
In a retrospective observational study with predictive modeling, Wu et al. 1) sought to identify key clinical, biochemical, and socioeconomic risk factors for stroke and post-stroke depression, and to develop a reliable and explainable prediction model and scoring tool based on population-level data, with the aim of improving strategies for primary stroke prevention.
This study, while superficially appealing for its use of “modern” AI methods such as Shapley Additive Explanations (SHAP), ultimately collapses under the weight of methodological overreach and interpretative overconfidence.
🧱 1. Rhetorical Inflation in Purpose
The authors wish to “provide novel prevention strategies” for stroke and post-stroke depression using a retrospective cross-sectional dataset. This is a textbook case of rhetorical inflation: statistical association is mistaken for clinical innovation. National Health and Nutrition Examination Survey (NHANES) data are observational and were not designed for causal inference or dynamic modeling. Yet the conclusions read as if they were derived from a randomized trial or prospective cohort.
📊 2. Overfitting and Illusion of Precision
The model boasts an AUC of 0.82. However:
No external validation cohort is presented.
Internal validation alone, especially on a subset of ~4,000 out of ~49,000 participants, is insufficient and tends to overestimate model performance.
SHAP values do not inherently confer reliability or causality—they only explain how a given model behaves, not whether it makes sense clinically.
Using SHAP on weakly curated variables from a non-stroke-focused dataset is like explaining how a broken compass points north.
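The concern about internal validation can be illustrated with a minimal sketch. It does not reproduce the authors' pipeline. All data below are synthetic noise, and the sample sizes, feature count, and "pick the best feature" step are assumptions chosen only to show how an apparent AUC can look respectable when there is no signal at all.

```python
import random

def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.
    Ties count as half a correct pair."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def noise_dataset(n_rows, n_features, rng):
    """Balanced 0/1 labels plus features that are pure noise (no real signal)."""
    labels = [i % 2 for i in range(n_rows)]
    features = [[rng.random() for _ in range(n_rows)] for _ in range(n_features)]
    return labels, features

rng = random.Random(0)  # fixed seed so the sketch is repeatable
N_FEATURES = 200        # hypothetical number of candidate predictors

# Small "development" sample and a larger independent "external" sample.
train_y, train_X = noise_dataset(100, N_FEATURES, rng)
ext_y, ext_X = noise_dataset(1000, N_FEATURES, rng)

# "Model development": keep whichever feature looks best on the development data.
best = max(range(N_FEATURES), key=lambda j: auc(train_y, train_X[j]))

apparent_auc = auc(train_y, train_X[best])  # evaluated on the data used for selection
external_auc = auc(ext_y, ext_X[best])      # evaluated on data never seen during selection

print(f"apparent AUC: {apparent_auc:.2f}")
print(f"external AUC: {external_auc:.2f}")
```

Because the selected feature is noise, the external AUC should sit close to 0.5, while the apparent AUC is inflated by the selection step. The gap between the two is the optimism that an external cohort is meant to expose.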
🧪 3. Biochemical Noise Masquerading as Insight
Variables such as alkaline phosphatase, albumin, neutrophils, and lymphocyte percentage are included in the model, yet:
No mechanistic rationale is given.
No stratification by stroke subtype (ischemic vs. hemorrhagic) is made.
Their clinical interpretability is poor, turning the resulting “risk score” into a black box with false credibility.
This is conceptual ambiguity in action: quantitative signals presented as meaningful without context, causality, or physiological grounding.
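A small worked example may help show why attribution scores cannot supply the missing physiological grounding. The sketch below computes exact Shapley values for a toy three-variable model against a single baseline. It is not the SHAP library and not the authors' model: the function `toy_model`, its coefficients, and the input values are all hypothetical.

```python
from itertools import permutations

def shapley_values(model, x, baseline):
    """Exact Shapley values of model(x) relative to model(baseline).
    Each feature's marginal contribution is averaged over all orderings
    in which features are switched from their baseline value to x."""
    n = len(x)
    phi = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        prev = model(current)
        for i in order:
            current[i] = x[i]
            new = model(current)
            phi[i] += new - prev
            prev = new
    return [p / len(orderings) for p in phi]

def toy_model(features):
    """Hypothetical fitted risk score. The weight on alkaline phosphatase (alp)
    is assumed to come from a chance correlation in the training data."""
    age, alp, albumin = features
    return 0.02 * age + 0.5 * alp - 0.1 * albumin

patient = [70, 1.5, 4.0]    # hypothetical values: age, alp, albumin
reference = [50, 1.0, 4.2]  # hypothetical baseline ("average") participant

phi = shapley_values(toy_model, patient, reference)
for name, value in zip(["age", "alp", "albumin"], phi):
    print(f"{name}: {value:+.3f}")
```

The attributions add up to the difference between the patient's score and the baseline score, so they describe the model faithfully. The non-zero value for `alp` reflects only the fact that the model uses that variable. It says nothing about whether alkaline phosphatase plays any role in stroke.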
📉 4. False Equivalence Between Predictive and Preventive Value
The authors conflate model accuracy with clinical utility. A scoring tool built on retrospective correlations does not equal a preventive instrument. Prevention requires prospective testing, behavioral change modeling, and context-aware implementation. None of that is attempted here.
Instead, they offer a DIY self-assessment tool derived from a population-level dataset: a tempting but misleading notion that could cause false reassurance or unwarranted anxiety.
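The gap between discrimination and usefulness for an individual can be made concrete with Bayes' rule. The sensitivity, specificity, and prevalence below are illustrative assumptions, not figures from the study. They stand in for a tool with reasonable discrimination applied to a general population in which stroke is uncommon.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Positive and negative predictive value from test characteristics
    and disease prevalence, via Bayes' rule."""
    tp = sensitivity * prevalence              # true positives (as population fraction)
    fp = (1 - specificity) * (1 - prevalence)  # false positives
    fn = (1 - sensitivity) * prevalence        # false negatives
    tn = specificity * (1 - prevalence)        # true negatives
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    return ppv, npv

# Hypothetical operating point and prevalence, chosen for illustration only.
ppv, npv = predictive_values(sensitivity=0.75, specificity=0.75, prevalence=0.03)

print(f"PPV: {ppv:.3f}")  # share of 'high risk' flags that are true cases
print(f"NPV: {npv:.3f}")  # share of 'low risk' results that are truly negative
```

Under these assumptions, fewer than one in ten people flagged as high risk would be true cases, which is the scenario in which a self-assessment tool generates anxiety rather than prevention.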
🧩 5. Neglect of Stroke Complexity
Stroke is heterogeneous. It cannot be reduced to a one-size-fits-all model using NHANES data, which lack imaging, event timing, vascular anatomy, and medication data. No mention is made of key confounders such as:
Atrial fibrillation
Hypertension treatment
Prior antithrombotic use
This reflects a sample simplification fallacy, in which convenience trumps clinical complexity.
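How an unmeasured confounder can manufacture a predictor is easy to show with a stratified table. The counts below are invented for illustration and are not taken from NHANES or from the study. They describe a hypothetical biomarker that has no effect on stroke risk within either atrial fibrillation (AF) stratum, yet looks predictive when AF is ignored.

```python
# Hypothetical counts: (strokes, participants) by AF status and biomarker level.
counts = {
    ("af", "high"): (80, 400),
    ("af", "low"): (20, 100),
    ("no_af", "high"): (5, 100),
    ("no_af", "low"): (20, 400),
}

def risk(cells):
    """Pooled risk across one or more (events, total) cells."""
    events = sum(e for e, _ in cells)
    total = sum(t for _, t in cells)
    return events / total

def risk_ratio(high_cells, low_cells):
    """Risk in the high-biomarker group divided by risk in the low group."""
    return risk(high_cells) / risk(low_cells)

# Crude analysis: AF is ignored, as it would be if it were never recorded.
crude_rr = risk_ratio(
    [counts[("af", "high")], counts[("no_af", "high")]],
    [counts[("af", "low")], counts[("no_af", "low")]],
)

# Stratified analysis: compare biomarker groups within each AF stratum.
stratum_rr = {
    s: risk_ratio([counts[(s, "high")]], [counts[(s, "low")]])
    for s in ("af", "no_af")
}

print(f"crude risk ratio: {crude_rr:.3f}")
for s, rr in stratum_rr.items():
    print(f"risk ratio within {s}: {rr:.3f}")
```

In this constructed example the crude risk ratio is above 2, while the ratio within each stratum is 1. A model trained without the AF variable would assign weight to the biomarker for reasons unrelated to the biomarker itself.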
🧠 6. No Contribution to Post-Stroke Depression Understanding
Despite its inclusion in the title, the “post-stroke depression” angle is entirely speculative. The authors provide no validated psychiatric metrics, no follow-up data, and no proper modeling of mental health outcomes. It's a keyword, not a conclusion.
⚠️ Conclusion: This study commits multiple scientific sins:
Rhetorical overreach (promising prevention from cross-sectional data)
Overfitted modeling with uncritical use of SHAP
Biochemical cherry-picking without clinical plausibility
False equivalence between correlation and clinical actionability
It is dressed as data science, but fundamentally lacks the anatomical, temporal, and pathophysiological depth required for meaningful stroke research.
A flashy model on shallow ground — predictive numerology disguised as precision medicine.