GRASP: Group-SHAP Feature Selection Built for Clinical Use
Feature selection in medical prediction still falters on two fronts: consistency and clarity. GRASP (GRoup-SHAPley feature Selection for Patients) tackles both by merging Shapley value attribution with group regularisation to surface fewer, clearer, and more dependable variables for clinical models.
The result: models that hold their accuracy while using fewer, less correlated features that clinicians can actually reason about.
Why this matters for healthcare teams
Most tools can hit a target metric. Fewer make it obvious why a patient's risk score changed, or which variables are worth tracking in a care pathway. GRASP optimises for performance and interpretability at the same time, so you can tie model outputs back to clinical judgement and operational decisions.
How GRASP works (simple view)
- Train a tree-based model (e.g., XGBoost) on the training fold to establish baseline predictions.
- Compute SHAP values on a held-out validation fold to measure each feature's contribution to predictions.
- Group features into clinically coherent sets (e.g., vitals, renal labs, lipid panel, medications).
- Aggregate feature SHAP scores within each group to get a single group importance score.
- Run a group ℓ2,1-regularised (group-lasso) logistic regression with a proximal-gradient algorithm (Armijo backtracking) to select entire groups, promoting compact and interpretable models.
That structure limits redundancy, improves stability, and keeps the final variable list aligned with clinical logic rather than scattered single features.
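The pipeline's last two steps can be sketched numerically. This is a minimal illustration, not the authors' implementation: it assumes the group penalty is the standard ℓ2,1 (group-lasso) norm, aggregates group importance as the sum of per-feature mean |SHAP|, and uses plain NumPy; all function names are illustrative.

```python
import numpy as np

def group_shap_scores(shap_values, groups):
    """Roll per-feature mean |SHAP| up to one importance score per group.
    (Sum of mean |SHAP| is one plausible aggregation; the paper may use
    a different statistic.)"""
    per_feature = np.abs(shap_values).mean(axis=0)
    return {name: float(per_feature[idx].sum()) for name, idx in groups.items()}

def group_prox(w, group_index, thresh):
    """Proximal operator of the group l2,1 penalty: blockwise soft-
    thresholding that zeroes an entire group when its norm is small."""
    out = w.copy()
    for idx in group_index:
        norm = np.linalg.norm(w[idx])
        out[idx] = 0.0 if norm <= thresh else (1.0 - thresh / norm) * w[idx]
    return out

def logistic_loss_grad(w, X, y):
    """Mean negative log-likelihood of logistic regression and its gradient."""
    z = np.clip(X @ w, -30, 30)
    p = 1.0 / (1.0 + np.exp(-z))
    loss = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
    grad = X.T @ (p - y) / len(y)
    return loss, grad

def group_lasso_logreg(X, y, group_index, lam=0.1, n_iter=300):
    """Proximal gradient descent with Armijo-style backtracking:
    halve the step until the sufficient-decrease condition holds."""
    w = np.zeros(X.shape[1])
    step = 1.0
    for _ in range(n_iter):
        loss, grad = logistic_loss_grad(w, X, y)
        while True:
            w_new = group_prox(w - step * grad, group_index, step * lam)
            new_loss, _ = logistic_loss_grad(w_new, X, y)
            d = w_new - w
            if new_loss <= loss + grad @ d + (d @ d) / (2 * step):
                break
            step *= 0.5
        w = w_new
    return w
```

Because the proximal operator acts on whole coefficient blocks, a weak group is zeroed in its entirety, which is what makes the final variable list map onto clinical panels rather than scattered features.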
What the results show
- Across NHANES and UK Biobank, GRASP matched or beat established methods on accuracy while using fewer features.
- Average selected features: GRASP 23 vs LASSO 44, SHAP 43, AFS 59.
- Adjusted Stability Measure: GRASP 0.593 vs LASSO 0.382, SHAP 0.398, AFS 0.258.
- Lower redundancy: Variance Inflation Factor averaged 2.942 for GRASP (lower indicates less multicollinearity).
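VIF is straightforward to reproduce on your own feature matrix. A quick sketch using the standard definition (VIF_j = 1 / (1 − R²_j), where R²_j comes from regressing feature j on all other features), in plain NumPy:

```python
import numpy as np

def vif(X):
    """Variance Inflation Factor per column; values near 1 indicate
    little multicollinearity, large values indicate redundancy."""
    Xc = X - X.mean(axis=0)  # center so no intercept term is needed
    out = []
    for j in range(X.shape[1]):
        others = np.delete(Xc, j, axis=1)
        beta, *_ = np.linalg.lstsq(others, Xc[:, j], rcond=None)
        resid = Xc[:, j] - others @ beta
        r2 = 1.0 - (resid @ resid) / (Xc[:, j] @ Xc[:, j])
        out.append(1.0 / max(1.0 - r2, 1e-12))
    return np.array(out)
```

An average VIF near 3, as reported for GRASP, means the selected features carry mostly non-overlapping information.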
Model performance using GRASP-selected features:
- Logistic Regression - Accuracy: 0.783 (NHANES), 0.755 (UKB); F1: 0.483, 0.197.
- Random Forest - Accuracy: 0.890, 0.946; F1: 0.226, 0.016.
- XGBoost - Accuracy: 0.897, 0.942; F1: 0.437, 0.143.
Calibration curves showed closer alignment between predicted and observed risk, especially in higher-risk strata. Kaplan-Meier curves indicated similar or stronger discrimination compared with LASSO and SHAP alone.
What makes it more interpretable
Interpretability isn't an afterthought here. By coupling SHAP-based importance with group ℓ2,1 penalties, GRASP prefers feature sets that make clinical sense and are easier to audit. Entire groups are selected or excluded, which maps cleanly to how clinicians think about systems, panels, and care pathways.
Practical checklist to pilot GRASP
- Define a single clinical question and primary outcome (e.g., 30-day readmission, incident CKD stage, MACE within 1 year).
- Create feature groups with clinicians: labs (by panel), vitals, history, medications, imaging-derived metrics, social determinants, device data.
- Train a tree model on the training fold; compute SHAP on a validation fold (not the training data).
- Aggregate SHAP to group scores; fit a group ℓ2,1-regularised logistic regression via proximal gradient with Armijo backtracking.
- Benchmark against LASSO and a plain SHAP thresholding workflow; compare accuracy, F1, calibration (reliability curves), and decision-level metrics (e.g., NNE, PPV at fixed sensitivity).
- Quantify stability (bootstrap or repeated CV), redundancy (VIF), and clinical face validity (expert review).
- Run a decision impact review: what changes in care would the selected features enable or simplify?
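For the stability check in the list above, a simple starting point is the mean pairwise Jaccard similarity of the feature sets selected across bootstrap resamples or CV folds. This is only a proxy: the Adjusted Stability Measure reported in the results additionally corrects for chance agreement.

```python
import numpy as np

def selection_stability(selected_sets):
    """Mean pairwise Jaccard similarity of selected-feature sets across
    resamples: 1.0 means identical selections every time, 0.0 means
    completely disjoint selections."""
    sets = [set(s) for s in selected_sets]
    sims = []
    for i in range(len(sets)):
        for j in range(i + 1, len(sets)):
            union = sets[i] | sets[j]
            sims.append(len(sets[i] & sets[j]) / len(union) if union else 1.0)
    return float(np.mean(sims))
```

Run the full selection pipeline once per resample, collect each run's selected feature names, and pass the list of sets to this function.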
Where to use it now
- Pathway design: prioritize labs or vitals that actually move risk predictions.
- Registry modeling: stabilize variable lists so multi-site models stay consistent across cohorts.
- CDS prototyping: feed fewer, clearer inputs to bedside alerts to reduce noise and alert fatigue.
Limitations to watch
- Upstream bias: GRASP starts with SHAP from a pretrained tree model. If that model is biased, the selection can inherit it. Use careful validation and subgroup checks.
- Grouping matters: poor grouping can hide useful signals or keep weak signals. Build groups with clinical SMEs and iterate.
- Generalisation: promising results on NHANES and UK Biobank still need confirmation on noisier, high-dimensional EHRs, imaging, and device streams.
- Class imbalance: F1 scores varied across models and datasets, so monitor thresholding, reweighting, and sampling strategies.
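One concrete way to monitor thresholding under imbalance is to fix a target sensitivity and derive the operating threshold from a validation fold rather than defaulting to 0.5. The helper below is illustrative, not from the paper:

```python
import numpy as np

def threshold_at_sensitivity(y_true, y_prob, target_sens=0.80):
    """Return the highest probability threshold whose sensitivity
    (recall on the positive class) still meets the target."""
    pos_probs = np.sort(y_prob[y_true == 1])[::-1]  # positives, high to low
    k = int(np.ceil(target_sens * len(pos_probs)))  # positives that must clear it
    return float(pos_probs[k - 1])
```

Reporting PPV at this fixed-sensitivity threshold (as suggested in the benchmarking checklist) is usually more informative than a single F1 number when classes are imbalanced.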
How to evaluate before deployment
- Calibration first: is predicted risk aligned with observed outcomes in each risk band?
- Stability under resampling: do selected groups stay consistent across folds and hospitals?
- Clinician review: do selected groups match known physiology, guidelines, and data availability?
- Operational fit: are the chosen variables routinely captured, clean, and timely?
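The calibration check above can be as simple as binning predicted probabilities and comparing mean predicted risk with the observed event rate in each band. A minimal sketch (equal-width bins; libraries offer more refined reliability-curve utilities):

```python
import numpy as np

def reliability_table(y_true, y_prob, n_bins=5):
    """Per-bin (mean predicted risk, observed event rate, count).
    A well-calibrated model shows the first two columns tracking
    each other in every band, including the high-risk ones."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bins = np.clip(np.digitize(y_prob, edges) - 1, 0, n_bins - 1)
    rows = []
    for b in range(n_bins):
        m = bins == b
        if m.any():
            rows.append((float(y_prob[m].mean()), float(y_true[m].mean()), int(m.sum())))
    return rows
```

Large gaps between predicted and observed rates in the top bins are exactly the failure mode the deployment review should catch.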
For teams building or buying
- Ask vendors for grouped SHAP summaries, stability metrics, and VIFs, not just AUC.
- Require calibration plots and subgroup analyses (age, sex, ethnicity, site).
- Request the full variable-to-group map and rationale so governance can audit it.
Further reading
- SHAP documentation for computing and interpreting feature attributions.
- XGBoost docs for the tree model used before grouping and selection.
- AI for Healthcare for practical workflows that bring interpretable ML into clinical settings.
Bottom line: GRASP streamlines models to the variables that matter, keeps selections consistent, and grounds decisions in clinical logic. For healthcare teams, that means clearer handoffs from data science to the bedside, and fewer surprises in production.