ReplaySCM: A Benchmark for Executable Causal Mechanism Induction from Interventions
Pith reviewed 2026-05-12 01:15 UTC · model grok-4.3
The pith
Frontier LLMs recover portions of causal parent functions from interventions but held-out replay accuracy falls sharply when order or root structure is withheld.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ReplaySCM evaluates executable causal mechanism induction by requiring submitted mechanisms to be executed on both training and held-out interventional worlds drawn from latent acyclic Boolean SCMs. The benchmark supplies four levels of structural disclosure (Ordered, Block-order, Hidden-order, Hidden-roots) and includes Alternative-SCM tasks that ask for a semantically distinct mechanism consistent with the training data together with a separating intervention. Frontier LLMs recover parts of the functional-parent structure, but held-out replay drops sharply when order or root structure is hidden. Under audited searches that reach complete local coverage, no discovered semantic alternative remains consistent with the training worlds.
What carries the argument
The replay scoring procedure that validates a submitted Boolean mechanism by executing it on held-out interventional worlds and comparing the produced outcomes to the observed data.
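A minimal sketch of replay scoring may help make this concrete. It assumes mechanisms are plain Python callables rather than expressions in the paper's DSL, that a world is a dict with hypothetical keys `do` (the intervention) and `rows` (observed unit rows), and that a row counts as correct only when every variable matches; the paper's exact scoring granularity is not stated here.

```python
from typing import Callable, Dict, List

Row = Dict[str, int]              # one unit row: variable name -> 0/1
Mechanism = Callable[[Row], int]  # computes one variable from already assigned values


def replay_row(order: List[str], mechanisms: Dict[str, Mechanism],
               observed: Row, intervention: Row) -> Row:
    """Execute a mechanism map on one unit under one intervention."""
    row: Row = {}
    for var in order:                    # `order` must be topological
        if var in intervention:          # do(var = value) overrides the mechanism
            row[var] = intervention[var]
        elif var in mechanisms:          # endogenous: evaluate on assigned values
            row[var] = int(mechanisms[var](row))
        else:                            # root: copy the observed root value
            row[var] = observed[var]
    return row


def replay_accuracy(order: List[str], mechanisms: Dict[str, Mechanism],
                    worlds: List[dict]) -> float:
    """Fraction of observed rows reproduced exactly (assumed scoring rule)."""
    hits = total = 0
    for world in worlds:
        for observed in world["rows"]:
            predicted = replay_row(order, mechanisms, observed, world["do"])
            hits += predicted == observed
            total += 1
    return hits / total if total else 0.0


# Hypothetical three-variable item: the latent mechanism is C = A AND B.
order = ["A", "B", "C"]
true_mech = {"C": lambda r: r["A"] & r["B"]}
wrong_mech = {"C": lambda r: r["A"] | r["B"]}   # fits training rows only
train = [{"do": {}, "rows": [{"A": 1, "B": 1, "C": 1}, {"A": 0, "B": 0, "C": 0}]}]
held_out = [{"do": {"A": 1}, "rows": [{"A": 1, "B": 0, "C": 0}, {"A": 1, "B": 1, "C": 1}]}]

train_score = replay_accuracy(order, wrong_mech, train)
held_out_score = replay_accuracy(order, wrong_mech, held_out)
```

In this toy item the OR mechanism reproduces every training row but only half of the held-out rows, which is the kind of gap that replay on held-out interventions is meant to expose.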
If this is right
- Replay-based scoring distinguishes mechanisms that merely fit the training data from those that generalize to new interventions.
- The performance gap between ordered and hidden-order settings shows that explicit disclosure of variable ordering aids mechanism induction.
- Audited searches that reach full local coverage still find no consistent alternative SCMs, indicating the training worlds tightly constrain the space of possible mechanisms.
- The support-audit ladder demonstrates that additional interventional evidence can raise local predecessor-pattern coverage from 0.8949 to 1.0.
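The coverage figures above can be illustrated with a small sketch. The paper's exact definition of local predecessor-pattern coverage is not reproduced on this page, so the code assumes one plausible reading: for each endogenous variable, the fraction of possible value patterns over its predecessor set that appear in rows where that variable is not intervened on, averaged over variables. The function name and the world layout (`do`, `rows`) are hypothetical.

```python
from typing import Dict, List


def pattern_coverage(predecessors: Dict[str, List[str]], worlds: List[dict]) -> float:
    """Mean fraction of predecessor value patterns observed per variable.

    Assumed definition, not taken from the paper: a pattern counts as covered
    when it occurs in at least one row of a world that does not intervene
    on the variable itself.
    """
    scores = []
    for var, preds in predecessors.items():
        seen = set()
        for world in worlds:
            if var in world["do"]:        # intervened rows say nothing about f_var
                continue
            for row in world["rows"]:
                seen.add(tuple(row[p] for p in preds))
        scores.append(len(seen) / 2 ** len(preds))
    return sum(scores) / len(scores) if scores else 1.0


# Hypothetical item: C has predecessors A and B, so four patterns are possible.
predecessors = {"C": ["A", "B"]}
original = [{"do": {}, "rows": [
    {"A": 1, "B": 1, "C": 1},
    {"A": 0, "B": 0, "C": 0},
    {"A": 1, "B": 0, "C": 0},
]}]
extra_world = {"do": {"A": 0}, "rows": [{"A": 0, "B": 1, "C": 0}]}

before = pattern_coverage(predecessors, original)
after = pattern_coverage(predecessors, original + [extra_world])
```

Here one added intervention world supplies the missing pattern (A=0, B=1), lifting coverage from 0.75 to 1.0 in the same spirit as the ladder's move from Original to audited evidence.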
Where Pith is reading between the lines
- The replay method could be adapted to evaluate causal induction in domains with continuous or probabilistic variables where exact Boolean replay is not feasible.
- If future models close the ordered-versus-hidden-order gap, it would indicate improved ability to infer latent ordering directly from interventional patterns.
- The benchmark's emphasis on executable output rather than graph structure suggests a way to test whether causal reasoning supports downstream tasks such as planning under intervention.
Load-bearing premise
That correct replay behavior on held-out interventions reliably indicates the model has induced the latent causal mechanism rather than memorizing or overfitting patterns within the specific DSL and world distribution.
What would settle it
A model that achieves perfect held-out replay scores on all benchmark items yet produces incorrect predictions on a fresh set of interventions generated from an equivalent but syntactically different SCM that fits the same training worlds.
Original abstract
Most causal benchmarks for language models score local answers or graph structure. We introduce ReplaySCM, a 1,300 item benchmark for executable causal mechanism induction from finite interventional evidence. Each item contains binary worlds generated by a latent fully observed acyclic Boolean structural causal model (SCM). A system must output a mechanism map in a restricted Boolean DSL; the submission is parsed, checked for legality and acyclicity, and replayed on training and held-out intervention worlds. Scoring uses replay behavior rather than formula strings, so syntactically different mechanisms receive credit when they behave correctly. ReplaySCM varies the structural information disclosed to the model through Ordered, Block-order, Hidden-order, and Hidden-roots settings, and includes Alternative-SCM tasks that supply a valid reference SCM and ask for a semantically distinct alternative that fits the training worlds, together with a separating intervention and witness. Frontier LLMs infer parts of the functional-parent structure, but held-out replay drops sharply when order or root structure is hidden. We also evaluate a matched support-audit ladder: Original, Extra Worlds, and Counterexample Audit (CEx), that raises mean local predecessor-pattern coverage from 0.8949 to 0.9815 to 1.0; under the audited searches, no discovered semantic alternative remains consistent with the training worlds. The Ordered/Hidden-order gap persists under this stronger evidence. ReplaySCM complements answer-level causal reasoning and graph-discovery benchmarks by evaluating executable replay generalization from finite interventional evidence, without claiming unique identification of the latent SCM.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ReplaySCM, a 1,300-item benchmark for evaluating executable causal mechanism induction in LLMs from finite interventional evidence on binary worlds generated by latent acyclic Boolean SCMs. Models output mechanism maps in a restricted Boolean DSL; submissions are parsed, legality/acyclicity-checked, and scored by replay accuracy on training and held-out interventions rather than syntactic match. The benchmark varies disclosed structural information across Ordered, Block-order, Hidden-order, and Hidden-roots conditions, includes Alternative-SCM tasks requiring semantically distinct alternatives plus separating interventions, and evaluates a support-audit ladder (Original, Extra Worlds, Counterexample Audit) that raises predecessor-pattern coverage to 1.0 with no consistent alternatives found. Frontier LLMs show partial inference of functional-parent structure but sharp held-out replay drops when order or roots are hidden; the Ordered/Hidden-order gap persists under auditing.
Significance. If the replay-based scoring and audit results hold, ReplaySCM supplies a reproducible, falsifiable complement to existing causal benchmarks by directly testing generalization of executable mechanisms from interventions rather than local answers or graph recovery. The controlled variation in structural disclosure and the finding that stronger evidence audits do not close the performance gap provide concrete, quantifiable evidence of current LLM limitations in causal induction.
major comments (3)
- [Abstract and Alternative-SCM tasks description] The central claim that replay success on held-out interventions indicates induction of the latent functional-parent structure (rather than DSL-specific pattern fitting within the finite Boolean worlds) is load-bearing for the benchmark's validity. The support-audit ladder reaches 1.0 coverage with no consistent alternatives, yet this does not rule out overfitting to regularities in the restricted DSL and world distribution; a concrete control (e.g., testing replay on out-of-distribution interventions generated from the same DSL but different root distributions) is needed to separate these explanations.
- [Results on structural information settings] The reported sharp drop in held-out replay from Ordered to Hidden-order settings is presented as evidence of failure to induce mechanisms when order is hidden. However, without an ablation that holds the DSL and world distribution fixed while varying only the order disclosure (or an analysis of whether models exploit positional cues in the prompt), it remains possible that the gap reflects increased pattern-matching difficulty rather than a specific deficit in causal induction.
- [Alternative-SCM tasks] In the Alternative-SCM tasks, the requirement that the submitted alternative be semantically distinct yet still replay-correct on training worlds, together with a separating intervention and witness, is a strong design choice. The paper reports no such alternatives survive the audit, but the generation procedure for reference SCMs and the precise definition of 'semantically distinct' (e.g., differing in at least one functional parent) must be specified in sufficient detail to allow independent verification that the 'no alternatives' result is not an artifact of the search procedure.
minor comments (2)
- [Abstract] The abstract describes a '1,300 item' benchmark but does not break down the distribution across the four structural settings or the Alternative-SCM subset; a table or explicit count would improve clarity.
- [Method] Notation for the DSL primitives and the exact legality/acyclicity checks performed on submitted mechanism maps should be defined once in a dedicated subsection rather than scattered across the method description.
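As an illustration of what the legality and acyclicity checks could look like, here is a minimal sketch. It assumes a submission has already been parsed into a map from each variable to the set of variables its formula references; the function name, the specific legality rules, and the error messages are hypothetical, not the paper's.

```python
from typing import Dict, List, Set


def check_mechanism_map(deps: Dict[str, Set[str]], variables: Set[str],
                        roots: Set[str]) -> List[str]:
    """Return a list of problems; an empty list means the map is accepted."""
    problems: List[str] = []
    # Legality (assumed rules): known names only, and no mechanism for a root.
    for var, parents in deps.items():
        if var not in variables:
            problems.append(f"unknown variable {var}")
        if var in roots:
            problems.append(f"root {var} must not have a mechanism")
        for parent in sorted(parents - variables):
            problems.append(f"{var} references unknown parent {parent}")
    # Acyclicity via Kahn's algorithm on the declared dependency graph.
    indegree = {v: 0 for v in variables}
    children: Dict[str, List[str]] = {v: [] for v in variables}
    for var, parents in deps.items():
        if var not in variables:
            continue
        for parent in parents & variables:
            indegree[var] += 1
            children[parent].append(var)
    queue = [v for v in variables if indegree[v] == 0]
    visited = 0
    while queue:
        node = queue.pop()
        visited += 1
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    if visited != len(variables):       # some node never reached indegree 0
        problems.append("dependency cycle detected")
    return problems


variables = {"X1", "X2", "X3", "X4"}
roots = {"X1", "X2"}
ok_map = {"X3": {"X1", "X2"}, "X4": {"X3"}}
cyclic_map = {"X3": {"X4"}, "X4": {"X3"}}

ok_problems = check_mechanism_map(ok_map, variables, roots)
cyclic_problems = check_mechanism_map(cyclic_map, variables, roots)
```

A setting such as Ordered would add one more legality rule on top of this (each formula may reference only earlier variables in the disclosed order), which is omitted here to keep the sketch short.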
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on ReplaySCM. We address each major comment point by point below, providing justifications for our design choices and indicating where we will revise the manuscript for greater clarity and rigor.
Point-by-point responses
Referee: [Abstract and Alternative-SCM tasks description] The central claim that replay success on held-out interventions indicates induction of the latent functional-parent structure (rather than DSL-specific pattern fitting within the finite Boolean worlds) is load-bearing for the benchmark's validity. The support-audit ladder reaches 1.0 coverage with no consistent alternatives, yet this does not rule out overfitting to regularities in the restricted DSL and world distribution; a concrete control (e.g., testing replay on out-of-distribution interventions generated from the same DSL but different root distributions) is needed to separate these explanations.
Authors: We agree that separating causal structure induction from DSL-specific pattern fitting is central to the benchmark's interpretation. The held-out worlds are simulated from the same latent SCMs under intervention signatures withheld from the prompt, testing whether induced mechanisms generalize functionally rather than replaying memorized training observations. The support audit further finds no semantic alternative, within the audited search bounds, that remains consistent with the training data. We acknowledge that this does not exhaustively rule out overfitting to the specific root-variable distribution used in world generation. In the revision we will add a dedicated limitations paragraph clarifying this distinction and explicitly listing the suggested OOD control (different root distributions under the same DSL) as valuable future work. We do not claim the current results uniquely identify the latent SCM, consistent with the manuscript's closing statement.
revision: partial
Referee: [Results on structural information settings] The reported sharp drop in held-out replay from Ordered to Hidden-order settings is presented as evidence of failure to induce mechanisms when order is hidden. However, without an ablation that holds the DSL and world distribution fixed while varying only the order disclosure (or an analysis of whether models exploit positional cues in the prompt), it remains possible that the gap reflects increased pattern-matching difficulty rather than a specific deficit in causal induction.
Authors: The benchmark design already holds the underlying SCMs, DSL, and world distributions fixed while varying only the structural information disclosed in the prompt (Ordered vs. Hidden-order conditions). The performance gap therefore isolates the effect of removing order information. This directly supports our interpretation that models rely on the provided ordering cue to recover functional-parent structure. To address potential positional confounds, we will add an appendix analysis in the revision that correlates model accuracy with variable position in the prompt across conditions. We maintain that the controlled variation in disclosure already constitutes the requested ablation on order information.
revision: partial
Referee: [Alternative-SCM tasks] In the Alternative-SCM tasks, the requirement that the submitted alternative be semantically distinct yet still replay-correct on training worlds, together with a separating intervention and witness, is a strong design choice. The paper reports no such alternatives survive the audit, but the generation procedure for reference SCMs and the precise definition of 'semantically distinct' (e.g., differing in at least one functional parent) must be specified in sufficient detail to allow independent verification that the 'no alternatives' result is not an artifact of the search procedure.
Authors: We agree that reproducibility requires explicit detail on these elements. In the revised manuscript we will expand the Methods and Appendix sections to specify: (1) the exact sampling procedure for reference SCMs, including how parent sets and Boolean functions are drawn; (2) the formal definition of semantic distinctness as any difference in at least one variable's functional parent or its Boolean function; and (3) the enumeration and consistency-checking steps used in the audit search. These additions will enable independent replication of the 'no alternatives' result.
revision: yes
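To make the notion of semantic distinctness concrete, the following sketch searches for a separating intervention between a reference and a candidate alternative. It assumes both are given as Python callables over the same variable names and treats "semantically distinct" as "some truth-table entry differs"; the function name is hypothetical, and the check that the alternative also fits the training worlds is deliberately left out.

```python
from itertools import product
from typing import Callable, Dict, List, Optional, Tuple

Mechanism = Callable[[Dict[str, int]], int]


def find_separation(variables: List[str], ref: Dict[str, Mechanism],
                    alt: Dict[str, Mechanism]
                    ) -> Optional[Tuple[str, Dict[str, int]]]:
    """Return (witness variable, intervention) where ref and alt disagree.

    The intervention sets every variable except the witness, so the witness
    value is determined by its mechanism alone. Returns None when the two
    maps agree on every assignment, i.e. they differ only syntactically.
    """
    for target in ref:
        others = [v for v in variables if v != target]
        for values in product((0, 1), repeat=len(others)):
            assignment = dict(zip(others, values))
            if ref[target](assignment) != alt[target](assignment):
                return target, assignment
    return None


variables = ["A", "B", "C"]
reference = {"C": lambda r: r["A"] & r["B"]}
# Same behaviour, different formula string (De Morgan form).
syntactic_variant = {"C": lambda r: int(not (not r["A"] or not r["B"]))}
# Different functional parents: B is dropped.
semantic_alternative = {"C": lambda r: r["A"]}

no_separation = find_separation(variables, reference, syntactic_variant)
separation = find_separation(variables, reference, semantic_alternative)
```

Exhaustive enumeration is only feasible because the items are small (6 to 10 binary variables according to the excerpts further down); a bounded audit search like the one the authors describe would need a different strategy at larger scale.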
Circularity Check
No circularity: empirical benchmark proposal with external replay evaluation
The paper introduces ReplaySCM as a benchmark for executable causal mechanism induction. It generates finite interventional worlds from latent Boolean SCMs, requires models to output DSL mechanism maps, and scores via replay on held-out interventions after legality/acyclicity checks. No mathematical derivation chain, fitted parameters, or self-referential predictions exist. The Ordered/Hidden-order gap, Alternative-SCM tasks, and support-audit ladder (raising predecessor coverage to 1.0) are empirical measurements against generated data, not reductions of outputs to inputs by construction. Evaluation is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Binary worlds are generated by latent fully observed acyclic Boolean structural causal models.
- domain assumption: Replay fidelity on held-out interventions measures quality of causal mechanism induction.
invented entities (1)
- ReplaySCM benchmark: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem.
  Passage: ReplaySCM... output a mechanism map in a restricted Boolean DSL; the submission is parsed, checked for legality and acyclicity, and replayed on training and held-out intervention worlds. Scoring uses replay behavior rather than formula strings
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem.
  Passage: support-audit ladder... raises mean local predecessor-pattern coverage from 0.8949 to 0.9815 to 1.0
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Hugging Face model card. Accessed May 4, 2026. Stephen Muggleton. Inductive logic programming. New Generation Computing, 8(4):295–318, 1991. doi: 10.1007/BF03037089. Stephen H. Muggleton and Luc De Raedt. Inductive logic programming: theory and methods. Journal of Logic Programming, 19–20:629–679, 1994. OpenAI. Introducing GPT-5.4. https://openai.com/index...
- [2] Sample n in {6,...,10}; create n latent slots
- [3] Mark the first three latent slots as roots and the remaining slots as endogenous
- [4] Randomly permute visible labels X1,...,Xn over the latent slots.
  Component: Specification
  Variables: 6–10 observed binary variables; 3 roots; 3–7 endogenous variables.
  Variable labels: Observed labels X1,...,Xn are randomly permuted over latent slots, so numeric suffixes contain no topological-order information.
  Topological order: Roots precede endogen...
- [5] For each endogenous variable V in latent order:
  a. Let C(V) be the bounded set of earlier latent variables.
  b. Sample a parent count uniformly from the admissible range.
  c. Sample that many parents uniformly from C(V).
  d. Sample a Boolean DSL expression over those parents.
  e. Reject and resample unless every declared parent is semantically active and the ...
- [6] Return the acyclic Boolean SCM.
  B.2.2 Intervention-world construction
  A world is a table of unit rows under one intervention. A unit is a row identifier with latent root thresholds. Ordinary generated worlds contain 10–12 rows. Unit IDs share latent root thresholds across worlds, but non-intervened root values can change across worlds because the world-le...
- [7] For each root R: if R is intervened on, set R to its intervention value; otherwise set R from the unit threshold and world environment
- [8] For each endogenous variable V in latent order: if V is intervened on, set V to its intervention value; otherwise evaluate f_V on the already assigned parent values
- [9] Record the complete row over observed variables. Return the world table and intervention metadata.
  Training worlds are selected and then strengthened by the filters in Section B.2.3. Eight held-out worlds are simulated from the same latent SCM under intervention signatures withheld from the prompt. These held-out worlds are withheld from the model and use...
- [10] Reject M if any mechanism has inactive parents or a constant truth table
- [11] Simulate candidate training worlds from M
- [12] Check local support, scored exposure, intervention coverage, and held-out balance
- [13] Enumerate bounded shortcut formulas that fit W
- [14] While shortcut survivors remain above threshold and the world-addition budget remains:
  a. propose candidate intervention worlds;
  b. simulate each candidate from M;
  c. add the world that rules out the most surviving shortcuts
- [15] Enumerate bounded local semantic alternatives
- [16] Add compact separating worlds when they rule out many alternatives
- [17] Run bounded local and coordinated ambiguity audits
- [18] Accept the instance if all checks pass; otherwise reject or retry.
  B.2.4 Benchmark variants
  Ordered reveals the root/endogenous partition and a full topological order, so mechanisms may reference only earlier variables. Block-order reveals the roots and coarse precedence blocks; within-block order is hidden, but submitted dependencies must remain compati...
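The SCM sampling and row simulation steps excerpted in this list can be sketched roughly as follows. Several simplifications are assumptions rather than the paper's procedure: mechanisms are random truth tables instead of sampled DSL expressions, the candidate set C(V) is taken to be all earlier slots, the parent-count range 1 to `max_parents` is hypothetical, and root values are passed in directly instead of being derived from unit thresholds and a world environment.

```python
import random
from itertools import product
from typing import Dict, List, Tuple

Table = Dict[Tuple[int, ...], int]


def is_active(table: Table, k: int, i: int) -> bool:
    """Parent i is active if flipping it changes the output for some pattern."""
    for bits in product((0, 1), repeat=k):
        flipped = list(bits)
        flipped[i] ^= 1
        if table[bits] != table[tuple(flipped)]:
            return True
    return False


def sample_scm(rng: random.Random, max_parents: int = 3):
    """Sample labels and mechanisms; slot index is the latent order."""
    n = rng.randint(6, 10)                      # n in {6,...,10}
    labels = [f"X{i + 1}" for i in range(n)]
    rng.shuffle(labels)                         # labels[slot] hides the order
    mechanisms: Dict[int, Tuple[List[int], Table]] = {}
    for slot in range(3, n):                    # first three slots are roots
        earlier = list(range(slot))             # assumed C(V): all earlier slots
        while True:                             # reject and resample
            k = rng.randint(1, min(max_parents, len(earlier)))
            parents = rng.sample(earlier, k)
            table = {bits: rng.randint(0, 1)
                     for bits in product((0, 1), repeat=k)}
            # All parents active also implies a non-constant truth table.
            if all(is_active(table, k, i) for i in range(k)):
                break
        mechanisms[slot] = (parents, table)
    return labels, mechanisms


def simulate_row(labels, mechanisms, root_values: List[int],
                 do: Dict[str, int]) -> Dict[str, int]:
    """Assign variables in latent order under the intervention `do`."""
    values: Dict[int, int] = {}
    for slot, label in enumerate(labels):
        if label in do:                         # intervention overrides
            values[slot] = do[label]
        elif slot < 3:                          # root: given value (simplified)
            values[slot] = root_values[slot]
        else:                                   # endogenous: look up f_V
            parents, table = mechanisms[slot]
            values[slot] = table[tuple(values[p] for p in parents)]
    return {labels[slot]: value for slot, value in values.items()}


rng = random.Random(0)
labels, mechanisms = sample_scm(rng)
last = labels[-1]
row = simulate_row(labels, mechanisms, [1, 0, 1], {last: 1})
```

Using a seeded `random.Random` keeps the sketch reproducible; the filtering stages in steps [10] to [18] (shortcut enumeration, greedy world addition, ambiguity audits) are not covered here.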