The Causal Description Gap: Information-Theoretic Separations Across Pearl's Hierarchy
Pith reviewed 2026-05-08 19:28 UTC · model grok-4.3
The pith
There exist binary acyclic causal models where observational descriptions are constant-length but single-variable interventional oracles require Θ(n²) bits.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Our main construction gives binary acyclic SCMs whose observational distribution has constant description length, while the single-variable interventional answer oracle has description length Θ(n²). A degree-sensitive upper bound shows that finite-gate-schema SCMs of indegree d have observational-interventional gap at most O(nd log(en/d) + n log n). The quadratic separation persists under ε-accurate total-variation descriptions for every fixed ε < 1/4. At the next rung, the full hard-do interventional oracle can still leave a Θ(n) counterfactual description gap. A general ambiguity-to-bits theorem and Shannon analogue show that these gaps equal the logarithm of residual higher-rung ambiguity
What carries the argument
Query-class description length, the Kolmogorov complexity of the answer oracle induced by an SCM for a class of queries.
Load-bearing premise
The formalization assumes that query-class description length via Kolmogorov complexity of the answer oracle induced by an SCM accurately captures the information needed for causal queries, and that the constructed binary acyclic SCMs exist with the claimed properties.
What would settle it
An explicit family of binary acyclic SCMs where the interventional oracle has description length o(n²) while keeping observational complexity constant would falsify the quadratic separation.
Figures
read the original abstract
Pearl's causal hierarchy shows that observational, interventional, and counterfactual queries are qualitatively distinct. We ask a quantitative version of this question: how many additional bits are needed to specify higher-rung causal answers once lower-rung answers are known? We formalize this via query-class description length, the Kolmogorov complexity of the answer oracle induced by an SCM for a class of queries. Our main construction gives binary acyclic SCMs whose observational distribution has constant description length, while the single-variable interventional answer oracle has description length $\Theta(n^2)$. A degree-sensitive upper bound shows that finite-gate-schema SCMs of indegree $d$ have observational-interventional gap at most $O(nd \log(en/d) + n \log n)$, making the quadratic construction order-optimal in the dense regime and a rooted-tree construction order-optimal for bounded indegree. The quadratic separation persists under $\varepsilon$-accurate total-variation descriptions for every fixed $\varepsilon < 1/4$. At the next rung, the full hard-do interventional oracle can still leave a $\Theta(n)$ counterfactual description gap. A general ambiguity-to-bits theorem and Shannon analogue show that these gaps equal the logarithm of residual higher-rung ambiguity up to lower-order terms.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper quantifies information-theoretic gaps across Pearl's causal hierarchy by defining query-class description length as the Kolmogorov complexity of the answer oracle induced by a structural causal model (SCM) for a given class of queries. It presents explicit constructions of binary acyclic SCMs on n variables where the observational distribution has constant description length, yet the single-variable interventional answer oracle requires Θ(n²) bits; a degree-sensitive upper bound of O(nd log(en/d) + n log n) for finite-gate-schema SCMs of indegree d shows the quadratic gap is order-optimal in the dense regime, while a rooted-tree construction is optimal for bounded indegree. The separation persists under ε-accurate total-variation approximations for fixed ε < 1/4, a Θ(n) counterfactual gap remains after the full hard-do interventional oracle, and a general ambiguity-to-bits theorem equates these gaps (up to lower-order terms) to the log of residual higher-rung ambiguity.
Significance. If the constructions and bounds hold, the work supplies a precise, non-asymptotic information-theoretic separation between rungs of Pearl's hierarchy, showing that interventional and counterfactual answers can require quadratically or linearly more bits than observational data even for simple binary acyclic models. The optimality results in dense and sparse regimes, the robustness to approximate descriptions, and the ambiguity-to-bits theorem provide concrete, falsifiable predictions about the minimal additional information needed for causal queries, which could inform both theoretical bounds on causal discovery algorithms and practical considerations in model specification.
major comments (3)
- [Main construction] Main construction (as described in the abstract and presumably §3): the claim that binary acyclic SCMs exist with constant-Kolmogorov-complexity observational distributions but Θ(n²)-bit single-variable interventional oracles is load-bearing for the central separation result; the provided abstract outlines the existence but the full SCM wiring diagrams, gate functions, and explicit Kolmogorov-complexity calculations must be supplied and verified to confirm the constant vs. quadratic gap.
- [Degree-sensitive upper bound] Degree-sensitive upper bound (abstract): the O(nd log(en/d) + n log n) bound for finite-gate-schema SCMs is presented as showing order-optimality, but it is unclear whether the proof accounts for the precise Kolmogorov complexity of the induced interventional oracle or only an upper bound on the SCM description itself; a concrete comparison to the lower-bound construction is needed.
- [Ambiguity-to-bits theorem] Ambiguity-to-bits theorem (abstract): the statement that gaps equal the logarithm of residual higher-rung ambiguity up to lower-order terms is central to interpreting the separations; the precise statement, including the Shannon analogue and any assumptions on the query classes, should be stated formally with the error terms made explicit.
minor comments (2)
- Notation for query-class description length and answer oracles should be introduced with a dedicated definition box or equation early in the paper to avoid ambiguity when comparing observational, interventional, and counterfactual oracles.
- The persistence result under ε-accurate total-variation approximations is stated for every fixed ε < 1/4; a brief remark on the dependence of the hidden constants on ε would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the careful and constructive review. We address each major comment below and have revised the manuscript to supply the requested details and clarifications.
read point-by-point responses
-
Referee: [Main construction] Main construction (as described in the abstract and presumably §3): the claim that binary acyclic SCMs exist with constant-Kolmogorov-complexity observational distributions but Θ(n²)-bit single-variable interventional oracles is load-bearing for the central separation result; the provided abstract outlines the existence but the full SCM wiring diagrams, gate functions, and explicit Kolmogorov-complexity calculations must be supplied and verified to confirm the constant vs. quadratic gap.
Authors: Section 3 of the manuscript contains the full construction: a family of binary acyclic SCMs whose structural equations are drawn from a fixed finite gate schema such that the observational distribution is exactly uniform on {0,1}^n. This distribution is generated by a constant-size program, yielding constant Kolmogorov complexity. The single-variable interventional oracle, however, must encode the distinct effect of intervening on each of the n variables; because the underlying graph is dense, this requires Θ(n²) bits. In the revision we have added an appendix with explicit wiring diagrams and gate tables for n=4 and n=5, together with the step-by-step Kolmogorov-complexity calculations that confirm the constant-versus-quadratic gap. revision: yes
-
Referee: [Degree-sensitive upper bound] Degree-sensitive upper bound (abstract): the O(nd log(en/d) + n log n) bound for finite-gate-schema SCMs is presented as showing order-optimality, but it is unclear whether the proof accounts for the precise Kolmogorov complexity of the induced interventional oracle or only an upper bound on the SCM description itself; a concrete comparison to the lower-bound construction is needed.
Authors: The proof establishes an upper bound directly on the Kolmogorov complexity of the interventional oracle. Because every answer of the oracle is computable from the SCM description by a fixed program of constant size, K(oracle) ≤ K(SCM) + O(1). The stated O(nd log(en/d) + n log n) bound is therefore an upper bound on the oracle itself. The revision adds a dedicated comparison subsection showing that, for d = Θ(n), the upper bound is O(n² log n) and therefore matches the Θ(n²) lower bound up to logarithmic factors; for constant d the rooted-tree construction yields an O(n log n) upper bound that is order-optimal. revision: yes
-
Referee: [Ambiguity-to-bits theorem] Ambiguity-to-bits theorem (abstract): the statement that gaps equal the logarithm of residual higher-rung ambiguity up to lower-order terms is central to interpreting the separations; the precise statement, including the Shannon analogue and any assumptions on the query classes, should be stated formally with the error terms made explicit.
Authors: Theorem 5 states the result formally: for nested query classes Q_r ⊂ Q_{r+1} the description-length gap satisfies K(Q_{r+1}) − K(Q_r) = log₂ |Amb(Q_r, Q_{r+1})| + O(log n), where Amb(Q_r, Q_{r+1}) is the number of distinct extensions of a lower-rung oracle to the higher rung, under the assumption that all oracles are total computable functions on countable domains. The Shannon analogue replaces Kolmogorov complexity by Shannon entropy and holds with an additive O(1) term. The revision makes the O(log n) error term and the computability assumptions explicit in the theorem statement and adds a short proof sketch of the error bound. revision: yes
Circularity Check
No significant circularity; results rest on explicit constructions and general theorems
full rationale
The paper's central claims are established via explicit constructions of binary acyclic SCMs and information-theoretic arguments based on Kolmogorov complexity of query oracles. The observational-interventional gap of Θ(n²) and the counterfactual gap of Θ(n) follow from these constructions and a general ambiguity-to-bits theorem, without any reduction of predictions to fitted parameters, self-definitional equivalences, or load-bearing self-citations. The degree-sensitive upper bound and persistence under ε-approximations are derived from standard combinatorial and information-theoretic bounds that do not presuppose the target separations. This is a standard non-circular theoretical result.
Axiom & Free-Parameter Ledger
axioms (2)
- standard math Kolmogorov complexity provides a valid measure of description length for oracles
- domain assumption Structural causal models can be represented as acyclic graphs with binary variables and finite-gate schemas
Lean theorems connected to this paper
-
Foundation/AbsoluteFloorClosure.lean (specifiability/distinguishability framework)absolute_floor_iff_bare_distinguishability unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We formalize this via query-class description length, the Kolmogorov complexity of the answer oracle induced by an SCM for a class of queries.
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Correa, Duligur Ibeling, and Thomas Icard
Elias Bareinboim, Juan D. Correa, Duligur Ibeling, and Thomas Icard. On Pearl’s hierarchy and the foundations of causal inference. InProbabilistic and Causal Inference: The Works of Judea Pearl, pages 507–556. ACM Books, 2022
work page 2022
-
[2]
Julian Dörfler, Benito van der Zander, Markus Bläser, and Maciej Liskiewicz. From probability to counterfactuals: The increasing complexity of satisfiability in Pearl’s causal hierarchy. In International Conference on Learning Representations (ICLR), 2025
work page 2025
-
[3]
Frederick Eberhardt, Clark Glymour, and Richard Scheines. On the number of experiments sufficient and in the worst case necessary to identify all causal relations among N variables. Proceedings of the UAI Conference, pages 178–184, 2005
work page 2005
-
[4]
Grünwald.The Minimum Description Length Principle
Peter D. Grünwald.The Minimum Description Length Principle. MIT Press, 2007
work page 2007
-
[5]
Alain Hauser and Peter Bühlmann. Characterization and greedy learning of interventional Markov equivalence classes of directed acyclic graphs.Journal of Machine Learning Research, 13:2409–2464, 2012
work page 2012
-
[6]
Comparing causal frameworks: Potential outcomes, struc- tural models, graphs, and abstractions
Duligur Ibeling and Thomas Icard. Comparing causal frameworks: Potential outcomes, struc- tural models, graphs, and abstractions. InAdvances in Neural Information Processing Systems, volume 36, 2023
work page 2023
-
[7]
CLadder: Assessing causal reasoning in language models
Zhijing Jin, Yuen Chen, Felix Leeb, Luigi Gresele, Ojasv Kamal, Zhiheng Lyu, Kevin Blin, Fernando Gonzalez Adauto, Max Kleiman-Weiner, Mrinmaya Sachan, and Bernhard Schölkopf. CLadder: Assessing causal reasoning in language models. InAdvances in Neural Information Processing Systems, volume 36, 2023
work page 2023
-
[8]
Emre Kıcıman, Robert Osazuwa Ness, Amit Sharma, and Chenhao Tan. Causal reasoning and large language models: Opening a new frontier for causality.Transactions on Machine Learning Research, 2024
work page 2024
-
[9]
Andrey N. Kolmogorov. Three approaches to the quantitative definition of information.Prob- lems of Information Transmission, 1(1):1–7, 1965
work page 1965
-
[10]
Ming Li and Paul Vitányi.An Introduction to Kolmogorov Complexity and Its Applications. Springer, 3rd edition, 2008
work page 2008
-
[11]
Cambridge University Press, 2nd edition, 2009
Judea Pearl.Causality: Models, Reasoning, and Inference. Cambridge University Press, 2nd edition, 2009
work page 2009
- [12]
-
[13]
Jonas Peters, Dominik Janzing, and Bernhard Schölkopf.Elements of Causal Inference: Foundations and Learning Algorithms. MIT Press, 2017
work page 2017
-
[14]
Dimakis, and Sriram Vishwanath
Karthikeyan Shanmugam, Murat Kocaoglu, Alexandros G. Dimakis, and Sriram Vishwanath. Learning causal graphs with small interventions. InAdvances in Neural Information Processing Systems, volume 28, 2015
work page 2015
-
[15]
Ilya Shpitser and Judea Pearl. Complete identification methods for the causal hierarchy.Journal of Machine Learning Research, 9:1941–1979, 2008
work page 1941
-
[16]
Peter Spirtes, Clark Glymour, and Richard Scheines.Causation, Prediction, and Search. MIT Press, 2nd edition, 2000. A Rooted-Tree Construction: Full Details This appendix gives the formal construction and full proofs underlying the Θ(nlogn) rooted-tree separation. Theorem A.5 below restates and proves the main-text Theorem 4.1; together with Corollary 5.8...
work page 2000
-
[17]
The mapT7→Int 1(MT )is injective onM n tree. 2.Lower bound:There existsTwith∆ 2|1(MT )≥(n−1) log 2 n−O(logn). 3.Upper bound:For allT,DL 2(MT )≤(n−1) log 2 n+O(logn). 4.High probability:For uniformly random rooted labeled treeT, P ∆2|1(MT )≥(n−1) log 2 n−c−O(logn) ≥1−2 −c. Consequently,∆ 2|1(MT ) = Θ(nlogn)for all but an exponentially small fraction of tre...
-
[18]
a topological ordering of thenvariables, usingO(nlogn)bits
-
[19]
for each variable, its parent set, chosen from at most Pd k=0 n−1 k possibilities
-
[20]
What is P(Y (Xt←0) t = Y (Xt←1) t )?
for each variable, a gate from the fixed library Γ and a noise distribution from the fixed libraryΠ, usingO(1)bits per variable. The entire SCM thus has a description of length O(nlogn) +nlog 2 dX k=0 n−1 k +O Γ,Π(n). Given this description and n, a fixed program can compute all single-node interventional distributions and outputInt 1(M). Hence the same e...
-
[21]
This is not a computational limitation; it is information-theoretic
Averaging over a proves the claim. This is not a computational limitation; it is information-theoretic. No algorithm, however powerful, can reliably predict interventional outcomes from observations when the underlying mechanism is drawn from our family. Remark D.3(Shannon Mutual Information is Zero).In the bipartite family Mn bip, a dataset D∼ (P ⋆)N of ...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.