Leveraging Large Language Models to Improve Precision in Randomized Controlled Trials

Adam Sales; Jaylin Lowe; Johann A. Gagnon-Bartsch

arxiv: 2605.30157 · v1 · pith:4IUL6ECMnew · submitted 2026-05-28 · 📊 stat.AP

Pith reviewed 2026-06-28 23:47 UTC · model grok-4.3

keywords randomized controlled trials · large language models · precision improvement · covariate adjustment · statistical estimation · prediction models

The pith

LLM predictions can be incorporated into RCT analysis to safely improve precision without bias.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper explores whether large language models can be used to improve the precision of randomized controlled trials in a safe and rigorous way. It develops a pipeline to integrate LLM predictions into RCT analysis, following approaches used with observational data. The pipeline is tested on three case studies, showing that the predictions improve precision particularly when the RCT has few predictive covariates or includes text data suited to LLMs. A reader would care because RCTs are the standard for causal inference but often have limited precision due to outcome variability, and this offers a potential way to leverage external model outputs while preserving estimator validity.

Core claim

LLM predictions can be incorporated into RCT analysis to safely improve precision, with particular value when the RCT lacks predictive covariates or contains covariates such as text data that are well-suited to LLMs.

What carries the argument

A pipeline for best leveraging LLM predictions in RCT analysis that maintains the statistical properties of the estimator.
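
The abstract gives no formulas, so what follows is only a sketch of one standard construction that keeps the estimator unbiased; it should not be read as the authors' estimator. Assume complete randomization with n_1 treated and n_0 control units, observed outcome Y_i, treatment indicator Z_i, and an LLM prediction \hat{y}_i = f(x_i) computed from pre-treatment information only. All of these symbols are introduced here for illustration.

```latex
\hat{\tau}_f
  = \frac{1}{n_1}\sum_{i:\,Z_i=1}\bigl(Y_i-\hat{y}_i\bigr)
  - \frac{1}{n_0}\sum_{i:\,Z_i=0}\bigl(Y_i-\hat{y}_i\bigr)
  = \hat{\tau}_{\mathrm{DM}}
  - \Bigl(\frac{1}{n_1}\sum_{i:\,Z_i=1}\hat{y}_i
        - \frac{1}{n_0}\sum_{i:\,Z_i=0}\hat{y}_i\Bigr),
\qquad
\mathbb{E}\Bigl[\frac{1}{n_1}\sum_{i:\,Z_i=1}\hat{y}_i
        - \frac{1}{n_0}\sum_{i:\,Z_i=0}\hat{y}_i\Bigr] = 0 .
```

Because the predictions are fixed before assignment, the correction term has mean zero over the randomization, so \hat{\tau}_f has the same expectation as the plain difference in means \hat{\tau}_{\mathrm{DM}}. Its variance tends to be smaller when the residuals Y_i - \hat{y}_i vary less than the raw outcomes.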

If this is right

Precision gains occur without biasing the treatment effect estimator.
The largest improvements appear in RCTs that lack strong predictive covariates.
Text-based covariates can be processed by LLMs to yield useful adjustments.
The approach extends methods previously applied to observational data.
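
The first two implications can be illustrated with a small simulation. This is a hedged sketch on synthetic data, not a reproduction of the paper's case studies: the "LLM prediction" is a noisy stand-in generated from the same baseline signal as the outcome, the treatment effect is constant, and the function names `diff_in_means` and `simulate` are hypothetical.

```python
import random
import statistics


def diff_in_means(y, z, pred=None):
    """Difference in means of (y - pred) between treated (z=1) and control (z=0)."""
    if pred is None:
        pred = [0.0] * len(y)
    treated = [yi - pi for yi, zi, pi in zip(y, z, pred) if zi == 1]
    control = [yi - pi for yi, zi, pi in zip(y, z, pred) if zi == 0]
    return statistics.fmean(treated) - statistics.fmean(control)


def simulate(n=200, tau=1.0, reps=2000, seed=0):
    rng = random.Random(seed)
    # Fixed finite population: potential outcomes and predictions never change.
    signal = [rng.gauss(0, 2) for _ in range(n)]
    y0 = [s + rng.gauss(0, 1) for s in signal]    # control potential outcomes
    y1 = [v + tau for v in y0]                    # constant effect (assumption)
    pred = [s + rng.gauss(0, 1) for s in signal]  # noisy stand-in for an LLM prediction
    base = [1] * (n // 2) + [0] * (n - n // 2)

    unadjusted, adjusted = [], []
    for _ in range(reps):
        z = base[:]
        rng.shuffle(z)  # complete randomization: only the assignment is random
        y = [a if zi == 1 else b for a, b, zi in zip(y1, y0, z)]
        unadjusted.append(diff_in_means(y, z))
        adjusted.append(diff_in_means(y, z, pred))

    return {
        "unadjusted_mean": statistics.fmean(unadjusted),
        "adjusted_mean": statistics.fmean(adjusted),
        "unadjusted_sd": statistics.stdev(unadjusted),
        "adjusted_sd": statistics.stdev(adjusted),
    }


result = simulate()
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

In this setup both estimators should centre on the true effect of 1.0, while the adjusted one has a visibly smaller standard deviation across re-randomizations. How large the gain is in practice depends entirely on how well the prediction tracks the outcome.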

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

RCTs that already collect text at baseline could pre-specify LLM adjustment to increase statistical power.
The same pipeline might work with other predictive models if they can be shown to integrate without bias.
Routine use could reduce required sample sizes in trials where text or similar data is available.
Further case studies in new domains would clarify the range of settings where gains are reliable.
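
The sample-size point can be made concrete with the textbook two-arm formula, under the hypothetical assumption that adjustment removes a share `r2` of the outcome variance. The function name and its parameters are illustrative; none of the numbers come from the paper.

```python
import math
from statistics import NormalDist


def n_per_arm(sigma, delta, r2=0.0, alpha=0.05, power=0.80):
    """Approximate per-arm sample size for a two-sided z-test of a mean difference.

    sigma: outcome standard deviation
    delta: effect size to detect
    r2:    assumed share of outcome variance removed by adjustment (hypothetical)
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    residual_var = sigma ** 2 * (1 - r2)
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * residual_var / delta ** 2)


print(n_per_arm(sigma=1.0, delta=0.5))          # no adjustment
print(n_per_arm(sigma=1.0, delta=0.5, r2=0.5))  # half the variance explained
```

Under this approximation the required sample scales with 1 - r2, so a prediction that explains half the outcome variance roughly halves the trial size needed for the same power.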

Load-bearing premise

That LLM predictions can be integrated via the pipeline in a safe and rigorous way that avoids introducing bias or unreliability into the RCT estimator.

What would settle it

A case study or simulation in which adding the LLM predictions produces a treatment effect estimate that differs from the unadjusted estimate by more than sampling variability would predict.
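
One way to operationalize this check, assuming the simple residualized difference-in-means estimator (an assumption; the abstract does not specify the estimator): the adjusted estimate then differs from the unadjusted one by exactly the negative treated-minus-control imbalance in the predictions, which can be compared against its re-randomization distribution. The names `mean_diff` and `adjustment_shift_check` are hypothetical.

```python
import random
import statistics


def mean_diff(values, z):
    """Treated-minus-control difference in means of `values`."""
    treated = [v for v, zi in zip(values, z) if zi == 1]
    control = [v for v, zi in zip(values, z) if zi == 0]
    return statistics.fmean(treated) - statistics.fmean(control)


def adjustment_shift_check(pred, z, reps=2000, seed=0):
    """Return (shift, p_value).

    shift:   adjusted minus unadjusted estimate, i.e. -mean_diff(pred, z)
    p_value: share of re-randomized assignments whose prediction imbalance
             is at least as large in magnitude as the observed one
    """
    rng = random.Random(seed)
    observed = -mean_diff(pred, z)
    perm = list(z)
    extreme = 0
    for _ in range(reps):
        rng.shuffle(perm)
        if abs(mean_diff(pred, perm)) >= abs(observed) - 1e-12:
            extreme += 1
    return observed, (extreme + 1) / (reps + 1)


# Toy inputs: predictions 0..19 with a balanced and a badly unbalanced assignment.
pred = [float(i) for i in range(20)]
print(adjustment_shift_check(pred, [i % 2 for i in range(20)]))
print(adjustment_shift_check(pred, [0] * 10 + [1] * 10))
```

A small p-value here would flag a shift larger than randomization alone would typically produce, which is the pattern described above as evidence against the safety claim.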

Original abstract

Large language models (LLMs) are increasingly used in statistical research and applications. However, they are also notorious for unreliable or biased information. Here, we explore whether LLMs can be used to improve the precision of randomized controlled trials (RCTs) in a safe and rigorous way. Following similar work on leveraging observational data, we incorporate LLM predictions into an RCT analysis. While incorporating external predictions to improve precision is not new, the value of using LLM predictions in this manner is an open question. We develop a pipeline for best leveraging LLM predictions in this context and apply it to three different case studies. We find that these predictions can safely improve precision, particularly when the RCT lacks predictive covariates or contains covariates, such as text data, that are well-suited to LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

The paper tests a pipeline for folding LLM predictions into RCT estimators to cut variance without bias, and claims it works on three case studies, but the abstract alone leaves the actual gains and checks unclear.

The main point is that LLM predictions can be added to standard RCT estimators to improve precision in a way that stays unbiased, at least according to the three case studies. This is positioned as an extension of existing work on external predictions rather than a wholly new statistical trick.

What stands out as new is the concrete pipeline for handling LLM outputs as fixed pre-treatment functions and the direct test on three different RCTs. The abstract notes that this helps most when the trial has little in the way of predictive covariates or when the covariates are text that LLMs can process. That framing is consistent with prior methods like augmented IPW or covariate adjustment, so there is no obvious internal contradiction.
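
For readers less familiar with the covariate-adjustment framing, here is a minimal sketch of an ANCOVA-style estimator that uses the prediction as a single covariate with a common within-arm slope. It is a generic textbook construction, not the paper's pipeline, and the function name is hypothetical. Unlike subtracting the prediction with a fixed coefficient, an estimated slope can introduce a small finite-sample bias, although the estimator remains consistent under randomization.

```python
import statistics


def regression_adjusted_estimate(y, z, pred):
    """Return (estimate, slope) for the regression y ~ 1 + z + pred.

    The estimate equals the difference in mean outcomes minus the pooled
    within-arm slope times the difference in mean predictions.
    """
    arms = {0: ([], []), 1: ([], [])}
    for yi, zi, pi in zip(y, z, pred):
        arms[zi][0].append(yi)
        arms[zi][1].append(pi)

    sxy = sxx = 0.0
    means = {}
    for arm, (ys, ps) in arms.items():
        mean_y, mean_p = statistics.fmean(ys), statistics.fmean(ps)
        means[arm] = (mean_y, mean_p)
        # Pool within-arm cross-products so the treatment effect does not leak into the slope.
        sxy += sum((p - mean_p) * (v - mean_y) for v, p in zip(ys, ps))
        sxx += sum((p - mean_p) ** 2 for p in ps)

    slope = sxy / sxx
    diff_y = means[1][0] - means[0][0]
    diff_pred = means[1][1] - means[0][1]
    return diff_y - slope * diff_pred, slope


# Toy data in which y = pred + 2 * z exactly, so the estimate should be 2.
z = [0, 0, 0, 0, 1, 1, 1, 1]
pred = [1.0, 2.0, 3.0, 4.0, 2.0, 3.0, 4.0, 5.0]
y = [p + 2.0 * zi for p, zi in zip(pred, z)]
print(regression_adjusted_estimate(y, z, pred))
```

The estimated slope lets the data discount a poorly calibrated prediction, at the price of giving up exact finite-sample unbiasedness; which trade-off the paper makes cannot be determined from the abstract.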

The paper does a reasonable job identifying the practical settings where the approach might pay off. It also avoids overclaiming by saying the value of LLMs specifically was still an open question before their tests.

The soft spots are straightforward: only the abstract is in front of us, so there are no equations for the pipeline, no reported precision gains, no error analysis, and no details on how they verified the predictions did not introduce bias. Without those pieces it is difficult to judge whether the safety claim actually holds up or how large the efficiency gains are. The soundness score in the reader's note reflects exactly this gap.

This is for statisticians and trialists who already work on covariate adjustment or data augmentation in RCTs and want to see whether LLMs fit into that toolkit. A reader who needs a fully worked example with numbers and checks will not get much yet.

If the full manuscript supplies the pipeline details, the quantitative results, and clear bias diagnostics, it is worth sending to peer review. Right now the idea is plausible but the evidence is still too thin to evaluate.

Referee Report

2 major / 1 minor

Summary. The paper claims to develop a pipeline for incorporating LLM-generated predictions into RCT estimators (following methods for external predictions such as covariates or augmented IPW) to improve precision without introducing bias, and demonstrates this via three case studies, concluding that the approach is safe and effective particularly when RCTs lack strong predictive covariates or involve text data well-suited to LLMs.

Significance. If the pipeline is shown to preserve unbiasedness of the RCT estimator while delivering measurable precision gains in the case studies, the work would provide a practical extension of existing prediction-augmented RCT methods to LLMs, with potential value in settings with limited covariates or unstructured data.

major comments (2)

[Abstract] Abstract: the central claim that LLM predictions 'can safely improve precision' rests on the pipeline avoiding bias, but no equations, identification assumptions, or estimator formulas are provided to confirm that LLM outputs are treated strictly as fixed pre-treatment functions (as required to maintain RCT validity).
[Abstract] Abstract: the three case studies are positioned as empirical support, yet no quantitative results, error analysis, or comparison to baseline RCT estimators (e.g., precision gains or bias checks) are reported, making it impossible to evaluate whether the safety and improvement claims hold.

minor comments (1)

[Abstract] The abstract mentions 'following similar work on leveraging observational data' but does not cite specific references for the established methods being adapted.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful comments. We address the two major comments on the abstract below and will revise the abstract to incorporate the requested details while preserving its conciseness.

Referee: [Abstract] Abstract: the central claim that LLM predictions 'can safely improve precision' rests on the pipeline avoiding bias, but no equations, identification assumptions, or estimator formulas are provided to confirm that LLM outputs are treated strictly as fixed pre-treatment functions (as required to maintain RCT validity).

Authors: The manuscript develops the pipeline by extending established methods for incorporating external predictions (treated as fixed pre-treatment functions) into RCT estimators such as covariate-adjusted or augmented IPW estimators. This structure preserves the unbiasedness guaranteed by randomization. We agree the abstract would be strengthened by briefly referencing these assumptions and the estimator form; we will revise the abstract accordingly.

revision: yes

Referee: [Abstract] Abstract: the three case studies are positioned as empirical support, yet no quantitative results, error analysis, or comparison to baseline RCT estimators (e.g., precision gains or bias checks) are reported, making it impossible to evaluate whether the safety and improvement claims hold.

Authors: The abstract summarizes the overall finding without numbers for brevity, but the full manuscript reports quantitative results from the three case studies, including precision gains relative to the unadjusted estimator and explicit checks confirming no bias is introduced. We will revise the abstract to include key quantitative highlights (e.g., reported precision improvements and bias verification) to make the empirical support more transparent.

revision: yes

Circularity Check

0 steps flagged

No significant circularity

The paper develops and applies a pipeline for incorporating LLM predictions as fixed pre-treatment functions into RCT estimators (following established external-prediction methods), then validates the approach empirically via three case studies. No load-bearing step reduces by construction to a fitted parameter renamed as prediction, a self-citation chain, or a self-definitional relation; the central claim rests on the case-study results rather than any internal re-derivation of its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides insufficient detail to identify any free parameters, axioms, or invented entities; no equations or modeling choices are visible.

Reference graph

Works this paper leans on

2 extracted references · 2 canonical work pages

[1] Aronow, Peter M. and Joel A. Middleton (May 2013). “A Class of Unbiased Estimators of the Average Treatment Effect in Randomized Experiments”. In: Journal of Causal Inference 1.1, pp. 135–154. DOI: 10.1515/jci-2012-0009. URL: https://www.degruyter.com/document/doi/10.1515/jci-2012-0009/html.

[2] Kraft, Matthew A. (May 2020). “Interpreting Effect Sizes of Education Interventions”. In: Educational Researcher 49.4, pp. 241–253. DOI: 10.3102/0013189X20912798. URL: https://doi.org/10.3102/0013189X20912798.