IV Co-Scientist: Multi-Agent LLM Framework for Causal Instrumental Variable Discovery

Bryan Wilder; Dominik Janzing; Ivaxi Sheth; Mario Fritz; Zhijing Jin

arxiv: 2602.07943 · v2 · submitted 2026-02-08 · 💻 cs.AI

Ivaxi Sheth, Zhijing Jin, Bryan Wilder, Dominik Janzing, Mario Fritz

Pith reviewed 2026-05-16 06:13 UTC · model grok-4.3

classification 💻 cs.AI

keywords: instrumental variables · causal inference · large language models · multi-agent systems · observational data · confounding · causal effects · consistency test

The pith

A multi-agent LLM system can discover valid instrumental variables from observational data by proposing, critiquing, and refining candidates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

When confounding prevents direct estimation of the causal effect of a treatment on an outcome, valid instrumental variables are required to isolate that effect. Identifying such instruments demands domain knowledge, creativity, and the ability to reject invalid candidates. The paper tests large language models on these tasks by checking whether they recover established instruments from the literature and avoid discredited ones. The authors then introduce IV Co-Scientist, a multi-agent framework in which agents propose instruments for a given treatment-outcome pair, critique the proposals, and refine them iteratively. They add a statistical consistency test to assess reliability when no ground truth is available. The results suggest that LLMs have the potential to surface valid instruments from large observational databases.

Core claim

The paper claims that large language models, structured as the IV Co-Scientist multi-agent system, can propose candidate instruments, subject them to critique and refinement, and pass a statistical consistency test that contextualizes their reliability without ground truth, thereby recovering well-established instruments while rejecting empirically or theoretically invalid ones.

What carries the argument

IV Co-Scientist, the multi-agent LLM system that proposes, critiques, and refines instrumental-variable candidates for a treatment-outcome pair, augmented by a statistical consistency test that evaluates proposals in the absence of ground truth.
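
The propose, critique, and refine loop described above can be sketched in a few lines. This is only an illustrative skeleton under stated assumptions: the function `refine_instruments`, the `propose` and `critique` callables, the stopping rule, and the placeholder candidate names are all hypothetical and are not taken from the paper, where the agents would be LLM calls rather than toy functions.

```python
from typing import Callable, List, Tuple

# Hypothetical agent signatures (assumptions, not the paper's API).
Proposer = Callable[[str, str, List[str]], List[str]]  # (treatment, outcome, rejected) -> candidates
Critic = Callable[[str, str, str], bool]               # (treatment, outcome, candidate) -> keep?

def refine_instruments(
    treatment: str,
    outcome: str,
    propose: Proposer,
    critique: Critic,
    max_rounds: int = 3,
    target: int = 3,
) -> Tuple[List[str], List[str]]:
    """Iterate proposal and critique until enough candidates survive."""
    accepted: List[str] = []
    rejected: List[str] = []
    for _ in range(max_rounds):
        # The proposer sees earlier rejections so it can avoid repeating them.
        for cand in propose(treatment, outcome, rejected):
            if cand in accepted or cand in rejected:
                continue
            if critique(treatment, outcome, cand):
                accepted.append(cand)
            else:
                rejected.append(cand)
        if len(accepted) >= target:
            break
    return accepted, rejected

# Toy stand-ins for the LLM agents, with placeholder candidate names.
def toy_propose(treatment: str, outcome: str, rejected: List[str]) -> List[str]:
    pool = ["z_a", "z_b", "z_bad", "z_c"]
    return [c for c in pool if c not in rejected]

def toy_critique(treatment: str, outcome: str, cand: str) -> bool:
    return not cand.endswith("_bad")

accepted, rejected = refine_instruments("treatment", "outcome", toy_propose, toy_critique)
print(accepted, rejected)
```

Passing the rejected list back to the proposer is one simple way to model refinement; a real system would more plausibly pass the critic's written reasoning as well.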

If this is right

LLMs can replicate standard literature-based reasoning to recover established instruments.
The system can identify and exclude instruments that have been empirically or theoretically discredited.
Valid instruments become discoverable from large observational databases without manual expert search.
The statistical consistency test supplies a practical substitute for ground truth when evaluating proposals.
The multi-agent workflow automates a step in causal inference that normally requires interdisciplinary expertise.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Similar multi-agent structures could be applied to other causal tasks such as identifying potential confounders or mediators from text descriptions of studies.
The approach might scale to entirely new domains where no prior instruments are documented, provided the consistency test remains informative.
Integration with empirical validation methods could further strengthen the proposals by checking statistical properties directly on data.
The framework may reduce reliance on single-model outputs by using critique agents to surface and correct flawed reasoning steps.

Load-bearing premise

LLMs can reliably separate valid from invalid instruments in new domains using only their pre-trained knowledge and the multi-agent consistency test, without any external ground truth.

What would settle it

Run the IV Co-Scientist on a dataset containing both documented valid instruments and known invalid instruments for the same treatment-outcome pair and measure whether the system consistently retains only the valid ones across repeated trials.

Figures

Figures reproduced from arXiv: 2602.07943 by Bryan Wilder, Dominik Janzing, Ivaxi Sheth, Mario Fritz, Zhijing Jin.

**Figure 1.** Overview of the IV Co-Scientist framework, which integrates LLM-based agents with traditional statistical tools. From the accompanying text (Section 5, IV Co-Scientist): "Having validated the capabilities of LLMs in recovering canonical IVs (subsection 4.1) and avoiding discredited ones (subsection 4.2), we now evaluate the performance of the system in a fully open-ended setting. The goal here is to test whether LLMs can generate meaningful and…"

**Figure 2.** Comparison of the ATE density while using two different IVs: (a) LLM-proposed and (b) random, shown for Sanitation → Mortality with GPT-4o. From the accompanying text: "This approach is inspired by the self-compatibility test introduced in causal discovery (Faller et al., 2024), for evaluation in the absence of ground truth. While it does not confirm the validity of the instrument, it provides indirect evidence of the quality. Consis…"
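
To give a feel for what a comparison of ATE densities under two instruments can look like, here is a minimal sketch on simulated data. It is not the paper's consistency test, whose statistic is not specified in this summary. It assumes a linear model, uses the ratio (Wald) IV estimator, and bootstraps the estimate once with a relevant instrument and once with an irrelevant random one; all variable names and coefficients are invented for illustration.

```python
import numpy as np

def wald_iv(z, t, y):
    """Ratio (Wald) IV estimate: cov(Z, Y) / cov(Z, T)."""
    return np.cov(z, y)[0, 1] / np.cov(z, t)[0, 1]

def bootstrap_estimates(z, t, y, n_boot=300, seed=0):
    """Re-estimate the effect on rows resampled with replacement."""
    rng = np.random.default_rng(seed)
    n = len(t)
    out = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)
        out[b] = wald_iv(z[idx], t[idx], y[idx])
    return out

def iqr(x):
    """Interquartile range, a spread measure robust to extreme draws."""
    q75, q25 = np.percentile(x, [75, 25])
    return q75 - q25

# Simulated data (assumption): true effect of T on Y is 2.0, U is a hidden confounder.
rng = np.random.default_rng(1)
n = 2000
u = rng.normal(size=n)
z_proposed = rng.normal(size=n)   # drives T, so it is a relevant instrument
z_random = rng.normal(size=n)     # unrelated to T, so it is an irrelevant instrument
t = z_proposed + u + rng.normal(size=n)
y = 2.0 * t + 2.0 * u + rng.normal(size=n)

est_proposed = bootstrap_estimates(z_proposed, t, y)
est_random = bootstrap_estimates(z_random, t, y)
spread_proposed = iqr(est_proposed)
spread_random = iqr(est_random)
print(np.median(est_proposed), spread_proposed, spread_random)
```

In this toy setting the relevant instrument yields a tight distribution near the true effect, while the random one yields a widely dispersed distribution because its covariance with the treatment is close to zero. A tight density is indirect evidence only: an instrument that violates the exclusion restriction can also produce a tight but biased distribution.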

Original abstract

In the presence of confounding between an endogenous variable and the outcome, instrumental variables (IVs) are used to isolate the causal effect of the endogenous variable. Identifying valid instruments requires interdisciplinary knowledge, creativity, and contextual understanding, making it a non-trivial task. In this paper, we investigate whether large language models (LLMs) can aid in this task. We perform a two-stage evaluation framework. First, we test whether LLMs can recover well-established instruments from the literature, assessing their ability to replicate standard reasoning. Second, we evaluate whether LLMs can identify and avoid instruments that have been empirically or theoretically discredited. Building on these results, we introduce IV Co-Scientist, a multi-agent system that proposes, critiques, and refines IVs for a given treatment-outcome pair. We also introduce a statistical test to contextualize consistency in the absence of ground truth. Our results show the potential of LLMs to discover valid instrumental variables from a large observational database.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The multi-agent IV Co-Scientist workflow is a concrete new engineering step for LLM-assisted causal work, but the consistency test lacks synthetic validation so the discovery claims stay preliminary.

The main takeaway is that this paper introduces IV Co-Scientist, a multi-agent LLM system that proposes, critiques, and refines candidate instrumental variables for a treatment-outcome pair, along with a statistical consistency test for cases without ground truth. That architecture is new relative to the prior LLM causal papers they cite. The two-stage evaluation (recovering established instruments from the literature and avoiding discredited ones) gives a reasonable way to check whether the agents are following standard reasoning patterns on familiar examples. That part is executed cleanly and provides a practical baseline for the workflow. The multi-agent loop itself is a specific design choice that has not been applied to IV discovery before, so the engineering contribution is real.

The soft spot is the statistical consistency test. The paper evaluates the system on real observational cases drawn from the literature, which is useful for replication but does not show whether the consistency test actually ranks or selects valid instruments when ground truth is hidden. No controlled synthetic experiments are described, such as data generated from linear structural causal models with planted valid and invalid instruments. Without that check, it is hard to know whether the consistency metric adds reliable signal or simply echoes what the LLMs already know from training data. The central claim about discovering valid IVs from large databases therefore rests on an unanchored assumption.

This work is aimed at researchers building LLM tools for causal inference or applied social science and medicine. A reader interested in agent workflows for observational data could extract useful design ideas from the propose-critique-refine structure. It deserves a serious referee because the idea is timely, the evaluation framework is organized, and the architecture is reproducible enough to iterate on. I would send it to review with a request for synthetic validation of the consistency test.

Referee Report

2 major / 2 minor

Summary. The paper proposes IV Co-Scientist, a multi-agent LLM framework that proposes, critiques, and refines candidate instrumental variables for a given treatment-outcome pair. It first conducts a two-stage evaluation: recovering well-established instruments from the literature, then avoiding empirically or theoretically discredited ones. Building on these results, the authors introduce the framework together with a statistical consistency test intended to contextualize the reliability of proposed IVs in the absence of ground truth, and claim that the results demonstrate the potential of LLMs to discover valid instruments from large observational databases.

Significance. If the framework and test can be shown to generalize beyond literature cases, the work could meaningfully lower the barrier to valid IV identification in causal inference tasks that currently demand substantial domain expertise. The multi-agent design and the consistency test represent engineering contributions that could be extended to other causal discovery settings, but the current evaluation does not yet establish this generalization.

major comments (2)

[Evaluation] Evaluation section (two-stage framework): the statistical consistency test is introduced to 'contextualize consistency in the absence of ground truth,' yet no controlled experiment on synthetic data with planted valid and invalid instruments (e.g., linear SCMs with known confounders) is described that would verify whether the test's metric correctly ranks or selects the valid instruments. Recovery of literature examples alone does not demonstrate this capability for genuine discovery.
[Abstract and Results] Results and abstract: the two-stage evaluation is described without any quantitative metrics, error bars, or details on how the statistical test was validated against known cases, leaving the central claim that LLMs can 'discover valid instrumental variables' unanchored by measurable performance.
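
The controlled experiment requested in the first major comment could look roughly like the following sketch: a linear SCM with a hidden confounder, one planted valid instrument, and one planted invalid instrument that violates the exclusion restriction. The coefficients, sample size, and function names are assumptions chosen for illustration; the paper describes no such experiment.

```python
import numpy as np

def simulate(n=50_000, beta=2.0, seed=0):
    """Linear SCM with a planted valid and a planted invalid instrument."""
    rng = np.random.default_rng(seed)
    u = rng.normal(size=n)           # unobserved confounder of T and Y
    z_valid = rng.normal(size=n)     # affects Y only through T
    z_invalid = rng.normal(size=n)   # also affects Y directly (exclusion violated)
    t = z_valid + z_invalid + u + rng.normal(size=n)
    y = beta * t + 1.5 * z_invalid + 2.0 * u + rng.normal(size=n)
    return z_valid, z_invalid, t, y

def iv_estimate(z, t, y):
    """Ratio (Wald) IV estimate: cov(Z, Y) / cov(Z, T)."""
    return np.cov(z, y)[0, 1] / np.cov(z, t)[0, 1]

def ols_estimate(t, y):
    """Naive regression slope of Y on T, biased by the confounder."""
    return np.cov(t, y)[0, 1] / np.var(t, ddof=1)

z_valid, z_invalid, t, y = simulate()
est_valid = iv_estimate(z_valid, t, y)      # should sit near the true beta
est_invalid = iv_estimate(z_invalid, t, y)  # shifted by the direct path into Y
est_ols = ols_estimate(t, y)                # shifted by confounding
print(est_valid, est_invalid, est_ols)
```

Because the true effect is known by construction, any ranking or selection rule can be scored directly against it, which is the anchoring the literature-based evaluation cannot provide.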

minor comments (2)

[Abstract] The abstract would benefit from a brief statement of the specific metrics or success criteria used in the two-stage evaluation.
[Method] Notation for the consistency test statistic should be defined explicitly with a formula or pseudocode to allow replication.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address the major comments point by point below and commit to revisions that strengthen the evaluation without overstating current results.

Referee: [Evaluation] Evaluation section (two-stage framework): the statistical consistency test is introduced to 'contextualize consistency in the absence of ground truth,' yet no controlled experiment on synthetic data with planted valid and invalid instruments (e.g., linear SCMs with known confounders) is described that would verify whether the test's metric correctly ranks or selects the valid instruments. Recovery of literature examples alone does not demonstrate this capability for genuine discovery.

Authors: We agree that the absence of synthetic-data validation with known ground truth limits the strength of the claims about the consistency test. The two-stage literature-based evaluation shows the multi-agent system can replicate established expert reasoning and reject discredited instruments, but it does not directly test ranking performance under controlled confounding. In the revised manuscript we will add a dedicated synthetic-data experiment using linear SCMs with planted valid and invalid instruments; we will report how the consistency metric ranks and selects instruments and include quantitative recovery rates. revision: yes
Referee: [Abstract and Results] Results and abstract: the two-stage evaluation is described without any quantitative metrics, error bars, or details on how the statistical test was validated against known cases, leaving the central claim that LLMs can 'discover valid instrumental variables' unanchored by measurable performance.

Authors: We accept that the current abstract and results lack explicit quantitative metrics and error bars. We will revise both sections to report success rates (e.g., fraction of literature-valid instruments recovered and fraction of discredited instruments rejected) across repeated runs, include standard-error bars, and add a concise description of how the consistency test was applied to the known cases, including the exact scoring procedure and threshold used. revision: yes
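
The success rates and standard errors promised in this response could be computed along these lines. The sketch assumes each run returns a set of accepted instrument names and that reference lists of literature-valid and discredited instruments exist; the helper names and toy data are hypothetical, and the paper's actual scoring procedure is not given here.

```python
from math import sqrt
from statistics import mean, stdev

def score_run(accepted, valid, discredited):
    """Per-run fraction of valid instruments recovered and discredited ones rejected."""
    accepted = set(accepted)
    recovered = len(accepted & set(valid)) / len(valid)
    rejected = len(set(discredited) - accepted) / len(discredited)
    return recovered, rejected

def rate_with_se(per_run_rates):
    """Mean rate across runs and its standard error (sample std / sqrt(runs))."""
    m = mean(per_run_rates)
    se = stdev(per_run_rates) / sqrt(len(per_run_rates))
    return m, se

# Toy data with placeholder names (assumption): four repeated runs of the system.
valid = ["v1", "v2"]
discredited = ["d1", "d2"]
runs = [["v1", "v2"], ["v1", "d1"], ["v1", "v2", "d2"], ["v2"]]

scores = [score_run(r, valid, discredited) for r in runs]
rec_mean, rec_se = rate_with_se([s[0] for s in scores])
rej_mean, rej_se = rate_with_se([s[1] for s in scores])
print(rec_mean, rec_se, rej_mean, rej_se)
```

Reporting the two rates separately matters because a system that accepts everything would score perfectly on recovery while failing on rejection.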

Circularity Check

0 steps flagged

No circularity: engineering framework evaluated on external literature cases

The paper describes a multi-agent LLM system (IV Co-Scientist) that proposes, critiques, and refines candidate instruments for a treatment-outcome pair, followed by a statistical consistency test introduced explicitly for settings without ground truth. No equations, fitted parameters, or derivations are presented that reduce to their own inputs by construction. Evaluation consists of recovering established instruments and avoiding discredited ones from prior literature, which constitutes external benchmarking rather than self-referential prediction. No self-citation chains, uniqueness theorems, or ansatzes are invoked as load-bearing justifications for the core claims. The framework is therefore self-contained as an applied engineering contribution whose validity rests on observable performance against independent examples.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on standard causal assumptions plus the new multi-agent system as the primary addition.

axioms (1)

domain assumption: Valid instruments must satisfy relevance, the exclusion restriction, and independence from confounders.
Invoked throughout the evaluation of LLM-proposed IVs.
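
For reference, the three conditions can be written out for the simplest linear case. The symbols below are assumptions introduced for illustration: $T$ is the treatment, $Y$ the outcome, $Z$ the candidate instrument, $U$ an unobserved confounder, and $\varepsilon$ outcome noise. The paper may state the conditions in a more general, non-linear form.

```latex
\begin{align*}
\text{Outcome model (no direct $Z$ term, i.e. exclusion):}\quad
  & Y = \beta T + \gamma U + \varepsilon, \\
\text{Relevance:}\quad
  & \operatorname{Cov}(Z, T) \neq 0, \\
\text{Independence:}\quad
  & \operatorname{Cov}(Z, U) = 0, \qquad \operatorname{Cov}(Z, \varepsilon) = 0, \\
\text{Hence:}\quad
  & \operatorname{Cov}(Z, Y)
    = \beta \operatorname{Cov}(Z, T) + \gamma \operatorname{Cov}(Z, U) + \operatorname{Cov}(Z, \varepsilon)
    = \beta \operatorname{Cov}(Z, T), \\
\text{so that:}\quad
  & \beta = \frac{\operatorname{Cov}(Z, Y)}{\operatorname{Cov}(Z, T)}.
\end{align*}
```

Only relevance can be checked from data alone. Exclusion and independence involve the unobserved $U$, which is why proposing and vetting instruments relies on domain reasoning.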

invented entities (1)

IV Co-Scientist multi-agent system (no independent evidence)
purpose: To propose, critique, and refine instrumental variable candidates via LLM agents
New framework introduced by the paper; no independent evidence provided beyond the described workflow.
