arxiv: 2605.08197 · v1 · submitted 2026-05-05 · 💻 cs.LG · cs.AI

ReplaySCM: A Benchmark for Executable Causal Mechanism Induction from Interventions

Pith reviewed 2026-05-12 01:15 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords: causal mechanism induction · structural causal models · interventional data · executable replay · benchmark · language models · Boolean DSL · held-out generalization

The pith

Frontier LLMs recover portions of causal parent functions from interventions but held-out replay accuracy falls sharply when order or root structure is withheld.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ReplaySCM, a benchmark of 1,300 items in which each task supplies finite interventional data generated by a latent acyclic Boolean structural causal model and requires the system to output an executable mechanism in a restricted Boolean DSL. The mechanism is parsed, validated for acyclicity, and scored by whether it replays the observed outcomes correctly on both the supplied training interventions and additional held-out ones. This evaluation matters because it measures whether a system has induced the underlying functional dependencies in a form that generalizes, rather than matching surface patterns or known graph edges. Results show that current large language models capture some functional-parent relationships yet exhibit clear degradation in held-out replay when the variable ordering or the set of root variables is not disclosed in the prompt. A matched support-audit ladder that supplies extra worlds and counterexamples raises local predecessor-pattern coverage to 1.0 while the ordered-versus-hidden-order gap remains.

Core claim

ReplaySCM evaluates executable causal mechanism induction by requiring submitted mechanisms to be executed on both training and held-out interventional worlds drawn from latent acyclic Boolean SCMs. The benchmark supplies four levels of structural disclosure (Ordered, Block-order, Hidden-order, Hidden-roots) and includes Alternative-SCM tasks that ask for a semantically distinct mechanism consistent with the training data together with a separating intervention. Frontier LLMs recover parts of the functional-parent structure, but held-out replay drops sharply when order or root structure is hidden. Under audited searches that reach complete local coverage, no discovered semantic alternative remains consistent with the training worlds.
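The acyclicity check applied to a submission can be illustrated with a short sketch. It assumes a hypothetical representation in which each variable maps to the set of variables its mechanism reads; the paper's actual parser and DSL are not reproduced here. The standard-library `graphlib` module is used to find an evaluation order or detect a cycle.

```python
from graphlib import CycleError, TopologicalSorter


def topological_order(parents):
    """Return an evaluation order (parents first), or None if a cycle exists.

    `parents` is a hypothetical parent map: variable -> set of variables its
    mechanism reads. Roots map to an empty set.
    """
    try:
        # static_order() yields each node only after all of its predecessors.
        return list(TopologicalSorter(parents).static_order())
    except CycleError:
        return None


acyclic = {"X1": set(), "X2": set(), "X3": {"X1", "X2"}, "X4": {"X3"}}
cyclic = {"X1": {"X4"}, "X3": {"X1"}, "X4": {"X3"}}

order = topological_order(acyclic)
rejected = topological_order(cyclic)
```

A submission whose parent map returns `None` here would be rejected before any replay takes place.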

What carries the argument

The replay scoring procedure that validates a submitted Boolean mechanism by executing it on held-out interventional worlds and comparing the produced outcomes to the observed data.
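A minimal sketch of replay scoring follows. It assumes mechanisms are plain Python callables rather than expressions in the paper's DSL, that root values are read from the observed row unless intervened on, and that a world counts as correct only when every row is reproduced exactly. All names (`replay_row`, `exact_rate`, the world layout) are hypothetical.

```python
def replay_row(mechanisms, order, observed_row, intervention):
    """Recompute one unit row under an intervention.

    mechanisms: endogenous variable -> function(values) -> 0 or 1
    order: evaluation order over all variables, parents first
    """
    values = {}
    for var in order:
        if var in intervention:
            values[var] = intervention[var]          # forced by the intervention
        elif var in mechanisms:
            values[var] = int(mechanisms[var](values))  # evaluate on assigned parents
        else:
            values[var] = observed_row[var]          # non-intervened root
    return values


def exact_rate(mechanisms, order, worlds):
    """Fraction of worlds in which every row is replayed exactly."""
    hits = 0
    for world in worlds:
        ok = all(
            replay_row(mechanisms, order, row, world["intervention"]) == row
            for row in world["rows"]
        )
        hits += ok
    return hits / len(worlds)


order = ["X1", "X2", "X3", "X4"]
# Toy latent SCM: X3 = X1 AND X2, X4 = NOT X3.
good = {"X3": lambda v: v["X1"] and v["X2"], "X4": lambda v: 1 - v["X3"]}
# Shortcut that ignores X2; it fits training because X2 is always 1 there.
shortcut = {"X3": lambda v: v["X1"], "X4": lambda v: 1 - v["X3"]}

train = [{"intervention": {"X2": 1}, "rows": [
    {"X1": 0, "X2": 1, "X3": 0, "X4": 1},
    {"X1": 1, "X2": 1, "X3": 1, "X4": 0},
]}]
heldout = [{"intervention": {"X2": 0}, "rows": [
    {"X1": 1, "X2": 0, "X3": 0, "X4": 1},
]}]
```

In this toy case both candidates replay the training world, but only `good` replays the held-out world, which is the distinction replay scoring is meant to expose.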

If this is right

  • Replay-based scoring distinguishes mechanisms that merely fit the training data from those that generalize to new interventions.
  • The performance gap between ordered and hidden-order settings shows that explicit disclosure of variable ordering aids mechanism induction.
  • Audited searches that reach full local coverage still find no consistent alternative SCMs, indicating the training worlds tightly constrain the space of possible mechanisms.
  • The support-audit ladder demonstrates that additional interventional evidence can raise local predecessor-pattern coverage from 0.8949 to 1.0.
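The coverage figure in the last bullet can be made concrete with a sketch. The definition used here is an assumption, since this page does not spell out the paper's metric: for each endogenous variable, take the fraction of its parent-value patterns that appear in training rows where the variable is not intervened on, then average over variables. The function name and data layout are hypothetical.

```python
def pattern_coverage(parents, worlds):
    """Mean fraction of parent-value patterns observed per endogenous variable.

    Assumed definition, not necessarily the paper's exact metric.
    """
    fractions = []
    for var, pars in parents.items():
        if not pars:
            continue  # roots have no mechanism to cover
        pars = sorted(pars)
        seen = set()
        for world in worlds:
            if var in world["intervention"]:
                continue  # mechanism is overridden, so rows say nothing about it
            for row in world["rows"]:
                seen.add(tuple(row[p] for p in pars))
        fractions.append(len(seen) / 2 ** len(pars))
    return sum(fractions) / len(fractions) if fractions else 1.0


parents = {"X1": set(), "X2": set(), "X3": {"X1", "X2"}}
original = [{"intervention": {}, "rows": [
    {"X1": 0, "X2": 1, "X3": 0},
    {"X1": 1, "X2": 1, "X3": 1},
    {"X1": 1, "X2": 0, "X3": 0},
]}]
# An added world that forces the one missing pattern (X1=0, X2=0).
extra = {"intervention": {"X1": 0, "X2": 0},
         "rows": [{"X1": 0, "X2": 0, "X3": 0}]}

before = pattern_coverage(parents, original)
after = pattern_coverage(parents, original + [extra])
```

Under this assumed definition, adding one targeted world moves the toy example from partial to complete coverage, mirroring the direction of the reported ladder.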

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The replay method could be adapted to evaluate causal induction in domains with continuous or probabilistic variables where exact Boolean replay is not feasible.
  • If future models close the ordered-versus-hidden-order gap, it would indicate improved ability to infer latent ordering directly from interventional patterns.
  • The benchmark's emphasis on executable output rather than graph structure suggests a way to test whether causal reasoning supports downstream tasks such as planning under intervention.

Load-bearing premise

That correct replay behavior on held-out interventions reliably indicates the model has induced the latent causal mechanism rather than memorizing or overfitting patterns within the specific DSL and world distribution.

What would settle it

A model that achieves perfect held-out replay scores on all benchmark items yet produces incorrect predictions on a fresh set of interventions generated from an equivalent but syntactically different SCM that fits the same training worlds.

Figures

Figures reproduced from arXiv: 2605.08197 by Serafim Batzoglou.

Figure 1: Same-latent disclosure ladder for LLMs. Left: TrainExact; right: HeldoutWorldExact, labeled HeldoutWorld in the panel title. Markers are means over the 100 matched latent SCMs; vertical bars are 95% bootstrap intervals over matched problem IDs. Alternative-SCM supplies a valid reference SCM, so the model does not have to infer the structure from the intervention worlds. The task is to construct a semantica…
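The caption's 95% bootstrap intervals over problem IDs can be sketched as a percentile bootstrap of the mean. This is an illustration under assumptions: the paper's resampling count, interval type, and seed are not given here, and the scores below are made up for the example.

```python
import random


def bootstrap_mean_ci(scores_by_id, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean, resampling problem IDs."""
    rng = random.Random(seed)
    ids = sorted(scores_by_id)
    means = []
    for _ in range(n_resamples):
        sample = rng.choices(ids, k=len(ids))  # resample IDs with replacement
        means.append(sum(scores_by_id[i] for i in sample) / len(ids))
    means.sort()
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi


# Illustrative 0/1 exact-replay scores for 100 hypothetical problem IDs.
scores = {f"p{i:03d}": 1 if i % 5 < 3 else 0 for i in range(100)}
point = sum(scores.values()) / len(scores)
lo, hi = bootstrap_mean_ci(scores)
```

Because the same problem IDs appear in every disclosure setting, one resample of IDs could be reused across settings to obtain an interval for a paired gap such as Ordered minus Hidden-order.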
Abstract

Most causal benchmarks for language models score local answers or graph structure. We introduce ReplaySCM, a 1,300 item benchmark for executable causal mechanism induction from finite interventional evidence. Each item contains binary worlds generated by a latent fully observed acyclic Boolean structural causal model (SCM). A system must output a mechanism map in a restricted Boolean DSL; the submission is parsed, checked for legality and acyclicity, and replayed on training and held-out intervention worlds. Scoring uses replay behavior rather than formula strings, so syntactically different mechanisms receive credit when they behave correctly. ReplaySCM varies the structural information disclosed to the model through Ordered, Block-order, Hidden-order, and Hidden-roots settings, and includes Alternative-SCM tasks that supply a valid reference SCM and ask for a semantically distinct alternative that fits the training worlds, together with a separating intervention and witness. Frontier LLMs infer parts of the functional-parent structure, but held-out replay drops sharply when order or root structure is hidden. We also evaluate a matched support-audit ladder: Original, Extra Worlds, and Counterexample Audit (CEx), that raises mean local predecessor-pattern coverage from 0.8949 to 0.9815 to 1.0; under the audited searches, no discovered semantic alternative remains consistent with the training worlds. The Ordered/Hidden-order gap persists under this stronger evidence. ReplaySCM complements answer-level causal reasoning and graph-discovery benchmarks by evaluating executable replay generalization from finite interventional evidence, without claiming unique identification of the latent SCM.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces ReplaySCM, a 1,300-item benchmark for evaluating executable causal mechanism induction in LLMs from finite interventional evidence on binary worlds generated by latent acyclic Boolean SCMs. Models output mechanism maps in a restricted Boolean DSL; submissions are parsed, legality/acyclicity-checked, and scored by replay accuracy on training and held-out interventions rather than syntactic match. The benchmark varies disclosed structural information across Ordered, Block-order, Hidden-order, and Hidden-roots conditions, includes Alternative-SCM tasks requiring semantically distinct alternatives plus separating interventions, and evaluates a support-audit ladder (Original, Extra Worlds, Counterexample Audit) that raises predecessor-pattern coverage to 1.0 with no consistent alternatives found. Frontier LLMs show partial inference of functional-parent structure but sharp held-out replay drops when order or roots are hidden; the Ordered/Hidden-order gap persists under auditing.

Significance. If the replay-based scoring and audit results hold, ReplaySCM supplies a reproducible, falsifiable complement to existing causal benchmarks by directly testing generalization of executable mechanisms from interventions rather than local answers or graph recovery. The controlled variation in structural disclosure and the finding that stronger evidence audits do not close the performance gap provide concrete, quantifiable evidence of current LLM limitations in causal induction.

major comments (3)
  1. [Abstract and Alternative-SCM tasks description] The central claim that replay success on held-out interventions indicates induction of the latent functional-parent structure (rather than DSL-specific pattern fitting within the finite Boolean worlds) is load-bearing for the benchmark's validity. The support-audit ladder reaches 1.0 coverage with no consistent alternatives, yet this does not rule out overfitting to regularities in the restricted DSL and world distribution; a concrete control (e.g., testing replay on out-of-distribution interventions generated from the same DSL but different root distributions) is needed to separate these explanations.
  2. [Results on structural information settings] The reported sharp drop in held-out replay from Ordered to Hidden-order settings is presented as evidence of failure to induce mechanisms when order is hidden. However, without an ablation that holds the DSL and world distribution fixed while varying only the order disclosure (or an analysis of whether models exploit positional cues in the prompt), it remains possible that the gap reflects increased pattern-matching difficulty rather than a specific deficit in causal induction.
  3. [Alternative-SCM tasks] In the Alternative-SCM tasks, the requirement that the submitted alternative be semantically distinct yet still replay-correct on training worlds, together with a separating intervention and witness, is a strong design choice. The paper reports no such alternatives survive the audit, but the generation procedure for reference SCMs and the precise definition of 'semantically distinct' (e.g., differing in at least one functional parent) must be specified in sufficient detail to allow independent verification that the 'no alternatives' result is not an artifact of the search procedure.
minor comments (2)
  1. [Abstract] The abstract states the benchmark contains '1,300 item' but does not break down the distribution across the four structural settings or the Alternative-SCM subset; a table or explicit count would improve clarity.
  2. [Method] Notation for the DSL primitives and the exact legality/acyclicity checks performed on submitted mechanism maps should be defined once in a dedicated subsection rather than scattered across the method description.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on ReplaySCM. We address each major comment point by point below, providing justifications for our design choices and indicating where we will revise the manuscript for greater clarity and rigor.

  1. Referee: [Abstract and Alternative-SCM tasks description] The central claim that replay success on held-out interventions indicates induction of the latent functional-parent structure (rather than DSL-specific pattern fitting within the finite Boolean worlds) is load-bearing for the benchmark's validity. The support-audit ladder reaches 1.0 coverage with no consistent alternatives, yet this does not rule out overfitting to regularities in the restricted DSL and world distribution; a concrete control (e.g., testing replay on out-of-distribution interventions generated from the same DSL but different root distributions) is needed to separate these explanations.

    Authors: We agree that separating causal structure induction from DSL-specific pattern fitting is central to the benchmark's interpretation. The held-out interventions apply previously unseen intervention values to the same latent SCMs, testing whether induced mechanisms generalize functionally rather than replaying memorized training observations. The support audit further shows that no other mechanism in the DSL space is consistent with the training data. We acknowledge that this does not exhaustively rule out overfitting to the specific root-variable distribution used in world generation. In the revision we will add a dedicated limitations paragraph clarifying this distinction and explicitly listing the suggested OOD control (different root distributions under the same DSL) as valuable future work. We do not claim the current results uniquely identify the latent SCM, consistent with the manuscript's closing statement. revision: partial

  2. Referee: [Results on structural information settings] The reported sharp drop in held-out replay from Ordered to Hidden-order settings is presented as evidence of failure to induce mechanisms when order is hidden. However, without an ablation that holds the DSL and world distribution fixed while varying only the order disclosure (or an analysis of whether models exploit positional cues in the prompt), it remains possible that the gap reflects increased pattern-matching difficulty rather than a specific deficit in causal induction.

    Authors: The benchmark design already holds the underlying SCMs, DSL, and world distributions fixed while varying only the structural information disclosed in the prompt (Ordered vs. Hidden-order conditions). The performance gap therefore isolates the effect of removing order information. This directly supports our interpretation that models rely on the provided ordering cue to recover functional-parent structure. To address potential positional confounds, we will add an appendix analysis in the revision that correlates model accuracy with variable position in the prompt across conditions. We maintain that the controlled variation in disclosure already constitutes the requested ablation on order information. revision: partial

  3. Referee: [Alternative-SCM tasks] In the Alternative-SCM tasks, the requirement that the submitted alternative be semantically distinct yet still replay-correct on training worlds, together with a separating intervention and witness, is a strong design choice. The paper reports no such alternatives survive the audit, but the generation procedure for reference SCMs and the precise definition of 'semantically distinct' (e.g., differing in at least one functional parent) must be specified in sufficient detail to allow independent verification that the 'no alternatives' result is not an artifact of the search procedure.

    Authors: We agree that reproducibility requires explicit detail on these elements. In the revised manuscript we will expand the Methods and Appendix sections to specify: (1) the exact sampling procedure for reference SCMs, including how parent sets and Boolean functions are drawn; (2) the formal definition of semantic distinctness as any difference in at least one variable's functional parent or its Boolean function; and (3) the enumeration and consistency-checking steps used in the audit search. These additions will enable independent replication of the 'no alternatives' result. revision: yes
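The distinctness definition in this response lends itself to a brute-force check. The sketch below assumes mechanisms are Python callables and that an intervention fixes every root; it enumerates root assignments and returns the first one on which two SCMs disagree, together with a witness variable. It is not the authors' audit search, and all names are hypothetical.

```python
from itertools import product


def simulate(mechanisms, order, forced):
    """Evaluate all variables in order; forced variables take their set value."""
    values = {}
    for var in order:
        if var in forced:
            values[var] = forced[var]
        else:
            values[var] = int(mechanisms[var](values))
    return values


def separating_intervention(ref, alt, order, roots):
    """Return (intervention, witness variable) where ref and alt differ, else None."""
    for bits in product([0, 1], repeat=len(roots)):
        forced = dict(zip(roots, bits))
        out_ref = simulate(ref, order, forced)
        out_alt = simulate(alt, order, forced)
        for var in order:
            if out_ref[var] != out_alt[var]:
                return forced, var
    return None


order, roots = ["X1", "X2", "X3"], ["X1", "X2"]
ref = {"X3": lambda v: v["X1"] and v["X2"]}
# Semantically distinct: drops X2 as a functional parent.
alt = {"X3": lambda v: v["X1"]}
# Syntactically different but equivalent (De Morgan form of AND).
same = {"X3": lambda v: not ((not v["X1"]) or (not v["X2"]))}

found = separating_intervention(ref, alt, order, roots)
none_found = separating_intervention(ref, same, order, roots)
```

The second call shows why scoring on behavior matters: a rewritten formula with the same truth table yields no separating intervention and so does not count as a semantic alternative.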

Circularity Check

0 steps flagged

No circularity: empirical benchmark proposal with external replay evaluation

The paper introduces ReplaySCM as a benchmark for executable causal mechanism induction. It generates finite interventional worlds from latent Boolean SCMs, requires models to output DSL mechanism maps, and scores via replay on held-out interventions after legality/acyclicity checks. No mathematical derivation chain, fitted parameters, or self-referential predictions exist. The Ordered/Hidden-order gap, Alternative-SCM tasks, and support-audit ladder (raising predecessor coverage to 1.0) are empirical measurements against generated data, not reductions of outputs to inputs by construction. Evaluation is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The paper relies on standard assumptions about acyclic Boolean SCMs and introduces the benchmark itself as the primary new object; no free parameters are fitted to data and no new physical entities are postulated.

axioms (2)
  • domain assumption: Binary worlds are generated by latent fully observed acyclic Boolean structural causal models.
    Explicitly stated as the generative process for all benchmark items.
  • domain assumption: Replay fidelity on held-out interventions measures quality of causal mechanism induction.
    Central premise of the scoring method described in the abstract.
invented entities (1)
  • ReplaySCM benchmark (no independent evidence)
    purpose: To evaluate executable causal mechanism induction from finite interventional evidence
    Newly defined in this work; no independent evidence outside the paper is provided.

pith-pipeline@v0.9.0 · 5570 in / 1361 out tokens · 52700 ms · 2026-05-12T01:15:41.367243+00:00 · methodology
