Validating and Updating GRASP: A New Evidence-Based Framework for Grading and Assessment of Clinical Predictive Tools

Blanca Gallego; Farah Magrabi; Mohamed Khalifa

arxiv: 1907.11524 · v1 · pith:CUQ5HXH2new · submitted 2019-07-25 · 💻 cs.CY

Pith reviewed 2026-05-24 16:25 UTC · model grok-4.3

classification 💻 cs.CY

keywords clinical predictive tools · evidence grading framework · GRASP · validation study · expert survey · interrater reliability · predictive model assessment · clinical decision support

The pith

GRASP grades clinical predictive tools by combining the highest evaluation phase with the strongest supporting evidence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper validates and updates GRASP, an evidence-based framework for grading clinical predictive tools through expert survey and reliability testing. Experts largely endorsed the core criteria for assessing phase of evaluation, level of evidence, and direction of evidence. The framework assigns a final grade based on the best available phase backed by positive or supportive mixed evidence. This approach aims to help clinicians and guideline developers navigate the growing number of tools by focusing on published evidence quality rather than untested claims.

Core claim

The GRASP framework grades predictive tools based on the critical appraisal of the published evidence across three dimensions: 1) Phase of evaluation; 2) Level of evidence; and 3) Direction of evidence. The final grade of a tool is based on the highest phase of evaluation, supported by the highest level of positive evidence, or mixed evidence that supports positive conclusion.

What carries the argument

The GRASP framework, which assigns grades to predictive tools by combining the highest reached phase of evaluation with the strongest level and direction of supporting evidence.
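The grading rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `Evidence` class, the integer encoding of phase and level (larger meaning later phase or stronger evidence), and the direction labels are all assumptions made for the example; the paper's actual phase and level scales are not given in this summary.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical direction labels. The source names only positive evidence and
# mixed evidence that supports a positive conclusion as counting toward a grade.
SUPPORTIVE = {"positive", "mixed_supporting_positive"}


@dataclass(frozen=True)
class Evidence:
    phase: int      # assumed encoding: larger = later phase of evaluation
    level: int      # assumed encoding: larger = stronger level of evidence
    direction: str  # one of the SUPPORTIVE labels, or anything else


def final_grade(studies: List[Evidence]) -> Optional[Tuple[int, int]]:
    """Return (phase, level) of the best supportive evidence, or None."""
    supportive = [s for s in studies if s.direction in SUPPORTIVE]
    if not supportive:
        return None  # no positive or supportive mixed evidence to grade on
    # Highest phase first; ties broken by the highest level of evidence.
    best = max(supportive, key=lambda s: (s.phase, s.level))
    return (best.phase, best.level)


studies = [
    Evidence(phase=1, level=3, direction="positive"),
    Evidence(phase=2, level=1, direction="mixed_supporting_positive"),
    Evidence(phase=3, level=2, direction="negative"),
]
print(final_grade(studies))  # (2, 1)
```

In this sketch the phase-3 study is ignored because its direction is not supportive, so the grade falls back to the highest phase that does have supportive evidence.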

If this is right

Clinicians can apply GRASP grades to decide which predictive tools to implement in practice.
Guideline developers can use the grades to recommend tools with stronger evidence backing.
Tool developers gain clear targets for advancing evaluation phases and evidence quality.
The framework enables consistent comparison across tools that vary in study design and outcomes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

GRASP could be extended to grade tools in non-clinical prediction domains such as public health forecasting.
Integration with existing evidence synthesis platforms might reduce duplication in tool assessments.
Longitudinal tracking of how GRASP grades change with new publications would test its dynamic utility.
Direct head-to-head comparisons of GRASP-graded tools in clinical outcomes studies would provide external validation.

Load-bearing premise

The 81 expert responses from the survey sufficiently represent the views of the broader clinical prediction community and validate the framework criteria.

What would settle it

A new large-scale expert survey finding widespread disagreement with the GRASP criteria or repeated tests showing poor interrater reliability would undermine the validation.

Original abstract

Background: When selecting predictive tools, for implementation in clinical practice or for recommendation in guidelines, clinicians are challenged with an overwhelming and ever-growing number of tools. Many of these have never been implemented or evaluated for comparative effectiveness. The authors developed an evidence-based framework for grading and assessment of predictive tools (GRASP), based on critical appraisal of published evidence. The objective of this study is to validate, update GRASP, and evaluate its reliability.

Methods: We aimed at validating and updating GRASP through surveying a wide international group of experts then evaluating GRASP reliability.

Results: Out of 882 invited experts, 81 valid responses were received. Experts overall strongly agreed to GRASP evaluation criteria of predictive tools (4.35/5). Experts strongly agreed to six criteria; predictive performance (4.87/5), predictive performance levels (4.44/5), usability (4.68/5), potential effect (4.61/5), post-implementation impact (4.78/5) and evidence direction (4.26/5). Experts somewhat agreed to one criterion; post-implementation impact levels (4.16/5). Experts were neutral about one criterion; usability is higher than potential effect (2.97/5). Experts also provided recommendations to six open-ended questions regarding adding, removing or changing evaluation criteria. The GRASP concept and its detailed report were updated then the interrater reliability of GRASP was tested and found to be reliable.

Discussion and Conclusion: The GRASP framework grades predictive tools based on the critical appraisal of the published evidence across three dimensions: 1) Phase of evaluation; 2) Level of evidence; and 3) Direction of evidence. The final grade of a tool is based on the highest phase of evaluation, supported by the highest level of positive evidence, or mixed evidence that supports positive conclusion.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

GRASP gets expert agreement on most criteria but the 9% response rate leaves the validation claim thin.

The paper updates the GRASP framework for grading clinical prediction tools and reports a survey of 81 experts who mostly backed the criteria. Strong agreement showed up on predictive performance, usability, potential effect, and post-implementation impact, while one item on usability versus potential effect landed neutral. The authors folded in open-ended suggestions, revised the framework, and stated that interrater reliability tested out as reliable. The core grading rule (highest evaluation phase supported by highest positive evidence level, or mixed evidence that still points positive) is straightforward and could help guideline groups sort through the flood of tools.

The survey is the soft spot. An 81-out-of-882 response rate is low, and nothing is shown on whether responders differed from non-responders in expertise or location. That gap makes it hard to treat the mean scores as solid community endorsement. The reliability test is asserted without numbers or method details in the abstract.

This work is aimed at clinical informaticians and evidence-based medicine groups who need a practical grading scheme for prediction tools. A reader who wants a structured three-dimension approach might still extract value from the final framework even if the validation step stays preliminary. I would send it to peer review so the authors can get direct questions on the sample and the missing reliability data.
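The response-rate arithmetic behind the "9%" figure can be checked directly. The sketch below computes 81/882 and, as an added illustration that is not in the paper, a 95% Wilson score interval for that proportion; the function name `response_rate` is hypothetical. The interval only describes sampling uncertainty in the rate itself and says nothing about whether responders differ from non-responders.

```python
import math


def response_rate(responses: int, invited: int, z: float = 1.96):
    """Return (rate, lower, upper) using a Wilson score interval."""
    p = responses / invited
    denom = 1 + z**2 / invited
    centre = (p + z**2 / (2 * invited)) / denom
    half = z * math.sqrt(p * (1 - p) / invited + z**2 / (4 * invited**2)) / denom
    return p, centre - half, centre + half


# Figures taken from the abstract: 81 valid responses out of 882 invitations.
rate, low, high = response_rate(81, 882)
print(f"response rate: {rate:.1%}")
print(f"95% Wilson interval: {low:.1%} to {high:.1%}")
```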

Referee Report

3 major / 1 minor

Summary. The paper presents and validates GRASP, an evidence-based framework for grading clinical predictive tools. GRASP evaluates tools across three dimensions—phase of evaluation, level of evidence, and direction of evidence—and assigns a final grade based on the highest phase supported by the highest level of positive evidence or mixed evidence supporting a positive conclusion. Validation rests on an international expert survey (81 valid responses from 882 invitations) showing strong agreement on most criteria (overall mean 4.35/5, with specific scores such as 4.87/5 for predictive performance), framework updates informed by open-ended feedback, and a subsequent interrater reliability test reported as reliable.

Significance. If the expert survey is shown to be representative, GRASP would offer a practical, standardized approach to appraising the growing number of clinical predictive tools, helping clinicians and guideline developers distinguish well-supported tools from those lacking implementation or comparative-effectiveness evidence.

major comments (3)

[Results] Results section (survey response rate): The validation claim rests on 81 responses from 882 invited experts (~9% rate). No data or analysis is provided comparing respondents to non-respondents or the broader clinical prediction community on expertise, geography, or tool-evaluation experience; without this, agreement scores cannot securely establish that the survey validates GRASP as reflecting community consensus.
[Results] Results section (reliability test): The manuscript states that interrater reliability of the updated GRASP was tested and found reliable, yet reports no quantitative measures (e.g., Cohen’s kappa, intraclass correlation, sample size, or confidence intervals). This omission prevents assessment of whether the reliability finding is robust enough to support the framework’s use.
[Results] Results section (neutral criterion): Experts were neutral on the criterion “usability is higher than potential effect” (2.97/5), yet both dimensions appear retained in the final GRASP; the decision process for retaining or weighting this criterion after the survey feedback should be explicitly justified, as it directly affects grading logic.

minor comments (1)

[Abstract/Results] The abstract and results would benefit from a brief table summarizing the six open-ended question themes and the specific changes made to GRASP criteria in response.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for these constructive comments on our manuscript. We address each major point below and indicate where revisions will be made to improve clarity and transparency.

Referee: [Results] Results section (survey response rate): The validation claim rests on 81 responses from 882 invited experts (~9% rate). No data or analysis is provided comparing respondents to non-respondents or the broader clinical prediction community on expertise, geography, or tool-evaluation experience; without this, agreement scores cannot securely establish that the survey validates GRASP as reflecting community consensus.

Authors: We agree that the 9% response rate limits strong claims of representativeness and that a formal non-response analysis would be ideal. We did not collect data allowing direct comparison of respondents and non-respondents. In the revised manuscript we will (1) report the demographic characteristics of the 81 respondents in more detail, (2) explicitly state this as a limitation in the Discussion, and (3) moderate language from “validates GRASP” to “provides initial expert feedback supporting GRASP.”
revision: partial

Referee: [Results] Results section (reliability test): The manuscript states that interrater reliability of the updated GRASP was tested and found reliable, yet reports no quantitative measures (e.g., Cohen’s kappa, intraclass correlation, sample size, or confidence intervals). This omission prevents assessment of whether the reliability finding is robust enough to support the framework’s use.

Authors: We acknowledge the omission of quantitative reliability statistics. The revised manuscript will report the exact method (Cohen’s kappa), number of tools and raters, obtained kappa value with confidence interval, and interpretation threshold used.
revision: yes

Referee: [Results] Results section (neutral criterion): Experts were neutral on the criterion “usability is higher than potential effect” (2.97/5), yet both dimensions appear retained in the final GRASP; the decision process for retaining or weighting this criterion after the survey feedback should be explicitly justified, as it directly affects grading logic.

Authors: The neutral mean score on the comparative statement was noted. Open-ended comments from several experts indicated that usability and potential effect should remain distinct dimensions rather than being collapsed. We therefore retained both dimensions and the comparative criterion as an optional integration step. The revised manuscript will add a short paragraph in the Results (or Methods) section explaining this decision process and noting that the primary grading logic still rests on the highest phase and level of evidence.
revision: yes
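For readers unfamiliar with the statistic named in the second response, Cohen's kappa for two raters can be computed as below. This is a generic sketch of the standard unweighted formula, kappa = (p_o - p_e) / (1 - p_e), not the authors' analysis; the grade labels and the two rating lists are invented purely for illustration, since no reliability data appear in the text above.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(rater_a: Sequence[str], rater_b: Sequence[str]) -> float:
    """Unweighted Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1 - expected)


# Hypothetical grades assigned to six tools by two raters (made-up data).
grades_rater_1 = ["A", "A", "B", "B", "C", "C"]
grades_rater_2 = ["A", "A", "B", "C", "C", "C"]
print(round(cohens_kappa(grades_rater_1, grades_rater_2), 2))  # 0.75
```

Here the raters agree on five of six tools (p_o = 5/6) and chance agreement is 1/3, which gives kappa = 0.75. Ordinal grades might instead call for a weighted kappa, which this sketch does not implement.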

Circularity Check

0 steps flagged

No significant circularity: GRASP validation rests on external expert survey with no derivations or self-referential reductions

The paper defines the GRASP framework via three explicit dimensions (phase of evaluation, level of evidence, direction of evidence) and validates it through an independent survey of 81 experts yielding agreement scores (e.g., 4.87/5 on predictive performance). No equations, fitted parameters, predictions, or self-citations appear in the provided text; the final grade rule is a direct definition rather than a derived output. The load-bearing step is external expert input, not internal fitting or renaming, so the chain is self-contained against external benchmarks with no reduction to the paper's own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that expert survey consensus provides valid evidence for framework criteria; no free parameters or invented entities are introduced.

axioms (1)

domain assumption: Expert consensus via survey is a valid method to validate and update an evidence-based grading framework for clinical tools.
Invoked in the methods and results when using survey scores (e.g., 4.35/5 overall agreement) to confirm and modify criteria.

pith-pipeline@v0.9.0 · 5884 in / 1204 out tokens · 26340 ms · 2026-05-24T16:25:59.945215+00:00 · methodology
