pith. machine review for the scientific record.

arxiv: 2603.14273 · v3 · submitted 2026-03-15 · 📊 stat.OT

Recognition: 2 theorem links · Lean Theorem

Using large language models for sensitivity analysis in causal inference: case studies on Cornfield inequality and E-value

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 11:15 UTC · model grok-4.3

classification 📊 stat.OT
keywords large language models · sensitivity analysis · unmeasured confounding · E-value · Cornfield inequality · causal inference · observational studies

The pith

Large language models can accurately perform sensitivity analyses for unmeasured confounding when guided by structured prompts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether LLMs can extract study details and apply Cornfield inequalities and E-value calculations to evaluate how much unmeasured confounding would be needed to explain away observed associations in observational research. It runs four models on four published studies from different fields, checking calculation accuracy, qualitative robustness judgments, and suggestions for possible unmeasured confounders. The results indicate that three of the models reproduce E-values correctly and draw conclusions consistent with the reported effect sizes, while all four identify plausible confounders. A reader would care because unmeasured confounding remains a central threat to trusting observational findings, and simpler tools could help more researchers assess and design around it.

Core claim

When supplied with exposure, outcome, measured confounders, and effect estimates through structured prompts, ChatGPT, Claude, and Gemini accurately reproduce E-values from the four studies, produce qualitative interpretations that match the magnitude of those E-values and the original effect sizes, and propose biologically and epidemiologically plausible unmeasured confounders; DeepSeek exhibits small calculation biases but otherwise aligns on interpretation.

What carries the argument

Structured prompts that direct LLMs to extract study-specific information and then compute and interpret E-values and Cornfield inequalities for sensitivity analysis.
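For orientation, here is a minimal Python sketch of the two quantities the prompts ask the models to compute, using the standard published formulas (the E-value of VanderWeele and Ding, and the Cornfield-style bound). The numbers at the bottom are illustrative and are not drawn from the four studies:

    import math

    def e_value(rr: float) -> float:
        """E-value (VanderWeele & Ding, 2017): the minimum strength of
        association, on the risk-ratio scale, that an unmeasured confounder
        would need with both exposure and outcome to fully explain away an
        observed risk ratio."""
        if rr <= 0:
            raise ValueError("risk ratio must be positive")
        if rr < 1:
            rr = 1.0 / rr  # protective effect: invert toward the null first
        return rr + math.sqrt(rr * (rr - 1.0))

    def cornfield_bound(rr: float) -> float:
        """Cornfield-style condition: a binary confounder that fully explains
        an observed risk ratio must be associated with the exposure (and, in
        later extensions, with the outcome) at least as strongly as rr."""
        return rr if rr >= 1 else 1.0 / rr

    # Illustrative only: an observed risk ratio of 2.0 gives an E-value
    # of about 3.41 and a Cornfield lower bound of 2.0.
    print(round(e_value(2.0), 2))          # 3.41
    print(round(cornfield_bound(2.0), 2))  # 2.0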

If this is right

  • LLMs can help researchers identify candidate unmeasured confounders during study design.
  • Qualitative robustness assessments become feasible for clinicians and interdisciplinary teams without deep statistical training.
  • Sensitivity analysis steps can be incorporated earlier in observational research workflows.
  • Decision-making that relies on observational associations can incorporate quick checks for unmeasured confounding.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Embedding these prompts into research software could lower the barrier to routine sensitivity checks.
  • The approach might extend to other sensitivity methods beyond E-value and Cornfield inequality.
  • Repeated human auditing of outputs would still be needed before relying on results for high-stakes decisions.

Load-bearing premise

The performance seen on these four selected studies, under these particular prompts, will generalize to other datasets, fields, and prompt variations without systematic human verification of every LLM output.

What would settle it

Running the same structured prompts on a fresh collection of observational studies from additional fields and finding repeated errors in E-value arithmetic or consistently implausible suggested confounders would show the claim does not hold.
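A minimal sketch of what such a check could look like, assuming each fresh study contributes a published risk ratio and the E-value the model reported. Every study name, value, and the tolerance below is a hypothetical placeholder, not data from the paper:

    import math

    def recompute_e_value(rr: float) -> float:
        """Ground-truth E-value recomputed from the published risk ratio."""
        rr = rr if rr >= 1 else 1.0 / rr
        return rr + math.sqrt(rr * (rr - 1.0))

    # Hypothetical records; none of these values come from the paper.
    llm_outputs = [
        {"study": "study_A", "rr": 1.8, "llm_e_value": 2.99},
        {"study": "study_B", "rr": 0.6, "llm_e_value": 2.73},
    ]

    TOLERANCE = 0.01  # assumed relative-error threshold for "accurate"

    for rec in llm_outputs:
        truth = recompute_e_value(rec["rr"])
        rel_err = abs(rec["llm_e_value"] - truth) / truth
        status = "ok" if rel_err <= TOLERANCE else "arithmetic error"
        print(f"{rec['study']}: truth={truth:.3f} "
              f"llm={rec['llm_e_value']:.3f} rel_err={rel_err:.4f} [{status}]")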

read the original abstract

Sensitivity analysis methods such as the Cornfield inequality and the E-value were developed to assess the robustness of observed associations against unmeasured confounding -- a major challenge in observational studies. However, the calculation and interpretation of these methods can be difficult for clinicians and interdisciplinary researchers. Recent advances in large language models (LLMs) offer accessible tools that could assist sensitivity analyses, but their reliability in this context has not been studied. We assess four widely used LLMs, ChatGPT, Claude, DeepSeek, and Gemini, on their ability to conduct sensitivity analyses using Cornfield inequalities and E-values. We first extract study-specific information (exposures, outcomes, measured confounders, and effect estimates) from four published observational studies in different fields. Using such information, we develop structured prompts to assess the performance of the LLMs in three aspects: (1) accuracy of E-value calculation, (2) qualitative interpretation of robustness to unmeasured confounding, and (3) suggestion of possible unmeasured confounders. To our knowledge, there has been little prior work on using LLMs for sensitivity analysis, and this study is an early investigation in this area. The results show that ChatGPT, Claude, and Gemini accurately reproduce the E-values, whereas DeepSeek shows small biases. Qualitative conclusions from all the LLMs align with the magnitude of the E-values and the reported effect sizes, and all models identify biologically and epidemiologically plausible unmeasured confounders. These findings suggest that, when guided by structured prompts, LLMs can effectively assist in evaluating unmeasured confounding, and thereby can support study design and decision-making in observational studies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript evaluates four LLMs (ChatGPT, Claude, DeepSeek, Gemini) on sensitivity analysis for unmeasured confounding via Cornfield inequalities and E-values. It extracts exposure, outcome, confounder, and effect data from four published observational studies, applies structured prompts, and assesses three aspects: (1) numerical accuracy of E-value calculations, (2) qualitative interpretation of robustness, and (3) suggestions for plausible unmeasured confounders. The abstract reports that three models reproduce published E-values accurately, all four produce aligned interpretations, and all suggest biologically plausible confounders, concluding that structured-prompt LLMs can assist sensitivity analysis and support observational study design.

Significance. If the central claims hold under more rigorous validation, the work could lower the barrier to performing sensitivity analyses for clinicians and applied researchers, potentially improving the credibility of observational findings. The multi-model comparison and use of real published studies are positive features. However, the current evidence base is narrow (four hand-selected studies) and the qualitative components rest on unvalidated author judgment rather than independent metrics, so the practical utility for decision-making remains provisional.

major comments (2)
  1. [Results] Results section (aspects 2 and 3): the statements that LLM interpretations 'align with the magnitude of the E-values' and that suggested confounders are 'biologically and epidemiologically plausible' are presented without any independent validation protocol, expert panel ratings, pre-specified gold-standard confounder lists, or inter-rater reliability statistics. Because these qualitative outputs are central to the claim that LLMs can 'effectively assist' study design and decision-making, the absence of such checks leaves a load-bearing step unverified.
  2. [Methods] Methods section: no quantitative error metrics (absolute or relative differences, confidence intervals, or statistical tests) are reported for the E-value calculations, and the selection criteria or sampling frame for the four observational studies are not described. This prevents assessment of reproducibility and raises the possibility that performance is overstated by favorable case selection.
minor comments (2)
  1. [Methods] The exact structured prompts should be reproduced verbatim in an appendix or supplementary material to allow replication and extension by other researchers.
  2. [Methods] LLM version numbers and access dates should be stated explicitly, given that model behavior can change with updates.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments, which highlight important areas for improving the rigor and transparency of our exploratory study. We address each major comment point by point below, indicating the specific revisions we will implement.

read point-by-point responses
  1. Referee: [Results] Results section (aspects 2 and 3): the statements that LLM interpretations 'align with the magnitude of the E-values' and that suggested confounders are 'biologically and epidemiologically plausible' are presented without any independent validation protocol, expert panel ratings, pre-specified gold-standard confounder lists, or inter-rater reliability statistics. Because these qualitative outputs are central to the claim that LLMs can 'effectively assist' study design and decision-making, the absence of such checks leaves a load-bearing step unverified.

    Authors: We agree that the qualitative assessments lack independent validation and that this limits the strength of claims about assisting decision-making. As an early investigation, our evaluations relied on direct comparison to published effect sizes and author expertise rather than external panels or pre-specified lists. In the revision, we will expand the Results section to detail the specific criteria used to judge alignment and plausibility, add a limitations subsection explicitly noting the absence of expert review or reliability statistics, and moderate the language around practical utility for study design. We will also outline directions for future work that incorporates such validation. revision: partial

  2. Referee: [Methods] Methods section: no quantitative error metrics (absolute or relative differences, confidence intervals, or statistical tests) are reported for the E-value calculations, and the selection criteria or sampling frame for the four observational studies are not described. This prevents assessment of reproducibility and raises the possibility that performance is overstated by favorable case selection.

    Authors: We agree that quantitative error metrics and explicit study selection details are needed for reproducibility. We will revise the Methods section to describe the selection criteria: the four studies were chosen as published observational research reporting effect estimates suitable for E-value computation and spanning diverse fields to illustrate broad applicability. We will also add absolute and relative error calculations for the E-value results, including any applicable statistical comparisons between LLM outputs and published values. These changes will allow readers to assess accuracy more objectively. revision: yes
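A minimal sketch of the error summary the authors commit to adding. The pairing of published and LLM-computed E-values, and the numbers themselves, are placeholders rather than the paper's results:

    def error_metrics(published: list[float], llm: list[float]) -> dict[str, float]:
        """Absolute and relative error summaries for paired E-values."""
        abs_errs = [abs(l - p) for p, l in zip(published, llm, strict=True)]
        rel_errs = [e / p for e, p in zip(abs_errs, published)]
        n = len(abs_errs)
        return {
            "mean_abs_error": sum(abs_errs) / n,
            "max_abs_error": max(abs_errs),
            "mean_rel_error": sum(rel_errs) / n,
            "max_rel_error": max(rel_errs),
        }

    # Placeholder inputs, not the paper's results.
    print(error_metrics([3.00, 2.72, 4.10, 1.90], [3.00, 2.70, 4.15, 1.88]))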

Circularity Check

0 steps flagged

No circularity; the evaluation compares LLM outputs to externally published E-values and effect sizes.

full rationale

The paper conducts empirical case studies on four published observational studies, extracting exposures/outcomes/confounders/effect estimates and prompting LLMs to compute E-values (compared directly to the known published values), interpret robustness, and suggest confounders (judged for alignment and plausibility). No mathematical derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. All load-bearing steps reference external study results rather than reducing to the paper's own inputs or definitions by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper rests on the assumption that the four chosen observational studies are representative and that LLM outputs can be meaningfully compared to human-calculated E-values without additional validation layers.

axioms (1)
  • domain assumption: Structured prompts can elicit accurate mathematical calculations and domain-plausible suggestions from current LLMs.
    Invoked in the prompt design and performance assessment sections described in the abstract.

pith-pipeline@v0.9.0 · 5599 in / 1224 out tokens · 84522 ms · 2026-05-15T11:15:52.237565+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.