pith. sign in

arxiv: 2605.02177 · v1 · submitted 2026-05-04 · 📊 stat.ML · cs.AI· cs.IT· cs.LG· math.IT

The Causal Description Gap: Information-Theoretic Separations Across Pearl's Hierarchy

Pith reviewed 2026-05-08 19:28 UTC · model grok-4.3

classification 📊 stat.ML cs.AIcs.ITcs.LGmath.IT
keywords causal hierarchydescription lengthKolmogorov complexitystructural causal modelsinterventional queriescounterfactual queriesinformation theoryPearl's hierarchy
0
0 comments X

The pith

There exist binary acyclic causal models where observational descriptions are constant-length but single-variable interventional oracles require Θ(n²) bits.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks how many more bits are needed to answer higher-level causal queries once lower-level answers are known, formalizing this as the Kolmogorov complexity of query oracles induced by structural causal models. It constructs models with constant observational complexity but quadratic interventional complexity, showing large gaps across Pearl's hierarchy levels. Upper bounds for low-degree models confirm the separations are tight in dense and sparse regimes. The gaps persist under approximations and extend to counterfactual queries with a linear separation remaining. This matters because it quantifies the information cost of moving up the causal ladder, showing that observational data alone cannot efficiently encode interventional knowledge in general.

Core claim

Our main construction gives binary acyclic SCMs whose observational distribution has constant description length, while the single-variable interventional answer oracle has description length Θ(n²). A degree-sensitive upper bound shows that finite-gate-schema SCMs of indegree d have observational-interventional gap at most O(nd log(en/d) + n log n). The quadratic separation persists under ε-accurate total-variation descriptions for every fixed ε < 1/4. At the next rung, the full hard-do interventional oracle can still leave a Θ(n) counterfactual description gap. A general ambiguity-to-bits theorem and Shannon analogue show that these gaps equal the logarithm of residual higher-rung ambiguity

What carries the argument

Query-class description length, the Kolmogorov complexity of the answer oracle induced by an SCM for a class of queries.

Load-bearing premise

The formalization assumes that query-class description length via Kolmogorov complexity of the answer oracle induced by an SCM accurately captures the information needed for causal queries, and that the constructed binary acyclic SCMs exist with the claimed properties.

What would settle it

An explicit family of binary acyclic SCMs where the interventional oracle has description length o(n²) while keeping observational complexity constant would falsify the quadratic separation.

Figures

Figures reproduced from arXiv: 2605.02177 by Seyed Morteza Emadi.

Figure 1
Figure 1. Figure 1: The Bipartite-Graph Construction. The root r feeds into layer A (copy), and layer A feeds into layer B via AND gates controlled by graph G (red edges). Observationally, all 2 m2 choices of G produce the same distribution. Interventions on layer A reveal which layer-B nodes are neighbors. By the counting bound for Kolmogorov complexity: among N distinct strings, at least one has complexity ≥ log2 N − O(1). … view at source ↗
Figure 2
Figure 2. Figure 2: The Causal Description Gap grows quadratically. Observational description length (blue) is constant. The tree construction (orange) achieves Θ(n log n). The bipartite construction (red) achieves Θ(n 2 ), a quadratic gap between knowing “what happens” and knowing “why.” Case Ur = 0: Then Xr = 0. Since each Xai = Xr, we have Xai = 0 for all i. For each bj , the equation is: Xbj = Xr ∧ ^ ai:(ai,bj )∈G Xai = 0… view at source ↗
read the original abstract

Pearl's causal hierarchy shows that observational, interventional, and counterfactual queries are qualitatively distinct. We ask a quantitative version of this question: how many additional bits are needed to specify higher-rung causal answers once lower-rung answers are known? We formalize this via query-class description length, the Kolmogorov complexity of the answer oracle induced by an SCM for a class of queries. Our main construction gives binary acyclic SCMs whose observational distribution has constant description length, while the single-variable interventional answer oracle has description length $\Theta(n^2)$. A degree-sensitive upper bound shows that finite-gate-schema SCMs of indegree $d$ have observational-interventional gap at most $O(nd \log(en/d) + n \log n)$, making the quadratic construction order-optimal in the dense regime and a rooted-tree construction order-optimal for bounded indegree. The quadratic separation persists under $\varepsilon$-accurate total-variation descriptions for every fixed $\varepsilon < 1/4$. At the next rung, the full hard-do interventional oracle can still leave a $\Theta(n)$ counterfactual description gap. A general ambiguity-to-bits theorem and Shannon analogue show that these gaps equal the logarithm of residual higher-rung ambiguity up to lower-order terms.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper quantifies information-theoretic gaps across Pearl's causal hierarchy by defining query-class description length as the Kolmogorov complexity of the answer oracle induced by a structural causal model (SCM) for a given class of queries. It presents explicit constructions of binary acyclic SCMs on n variables where the observational distribution has constant description length, yet the single-variable interventional answer oracle requires Θ(n²) bits; a degree-sensitive upper bound of O(nd log(en/d) + n log n) for finite-gate-schema SCMs of indegree d shows the quadratic gap is order-optimal in the dense regime, while a rooted-tree construction is optimal for bounded indegree. The separation persists under ε-accurate total-variation approximations for fixed ε < 1/4, a Θ(n) counterfactual gap remains after the full hard-do interventional oracle, and a general ambiguity-to-bits theorem equates these gaps (up to lower-order terms) to the log of residual higher-rung ambiguity.

Significance. If the constructions and bounds hold, the work supplies a precise, non-asymptotic information-theoretic separation between rungs of Pearl's hierarchy, showing that interventional and counterfactual answers can require quadratically or linearly more bits than observational data even for simple binary acyclic models. The optimality results in dense and sparse regimes, the robustness to approximate descriptions, and the ambiguity-to-bits theorem provide concrete, falsifiable predictions about the minimal additional information needed for causal queries, which could inform both theoretical bounds on causal discovery algorithms and practical considerations in model specification.

major comments (3)
  1. [Main construction] Main construction (as described in the abstract and presumably §3): the claim that binary acyclic SCMs exist with constant-Kolmogorov-complexity observational distributions but Θ(n²)-bit single-variable interventional oracles is load-bearing for the central separation result; the provided abstract outlines the existence but the full SCM wiring diagrams, gate functions, and explicit Kolmogorov-complexity calculations must be supplied and verified to confirm the constant vs. quadratic gap.
  2. [Degree-sensitive upper bound] Degree-sensitive upper bound (abstract): the O(nd log(en/d) + n log n) bound for finite-gate-schema SCMs is presented as showing order-optimality, but it is unclear whether the proof accounts for the precise Kolmogorov complexity of the induced interventional oracle or only an upper bound on the SCM description itself; a concrete comparison to the lower-bound construction is needed.
  3. [Ambiguity-to-bits theorem] Ambiguity-to-bits theorem (abstract): the statement that gaps equal the logarithm of residual higher-rung ambiguity up to lower-order terms is central to interpreting the separations; the precise statement, including the Shannon analogue and any assumptions on the query classes, should be stated formally with the error terms made explicit.
minor comments (2)
  1. Notation for query-class description length and answer oracles should be introduced with a dedicated definition box or equation early in the paper to avoid ambiguity when comparing observational, interventional, and counterfactual oracles.
  2. The persistence result under ε-accurate total-variation approximations is stated for every fixed ε < 1/4; a brief remark on the dependence of the hidden constants on ε would improve clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful and constructive review. We address each major comment below and have revised the manuscript to supply the requested details and clarifications.

read point-by-point responses
  1. Referee: [Main construction] Main construction (as described in the abstract and presumably §3): the claim that binary acyclic SCMs exist with constant-Kolmogorov-complexity observational distributions but Θ(n²)-bit single-variable interventional oracles is load-bearing for the central separation result; the provided abstract outlines the existence but the full SCM wiring diagrams, gate functions, and explicit Kolmogorov-complexity calculations must be supplied and verified to confirm the constant vs. quadratic gap.

    Authors: Section 3 of the manuscript contains the full construction: a family of binary acyclic SCMs whose structural equations are drawn from a fixed finite gate schema such that the observational distribution is exactly uniform on {0,1}^n. This distribution is generated by a constant-size program, yielding constant Kolmogorov complexity. The single-variable interventional oracle, however, must encode the distinct effect of intervening on each of the n variables; because the underlying graph is dense, this requires Θ(n²) bits. In the revision we have added an appendix with explicit wiring diagrams and gate tables for n=4 and n=5, together with the step-by-step Kolmogorov-complexity calculations that confirm the constant-versus-quadratic gap. revision: yes

  2. Referee: [Degree-sensitive upper bound] Degree-sensitive upper bound (abstract): the O(nd log(en/d) + n log n) bound for finite-gate-schema SCMs is presented as showing order-optimality, but it is unclear whether the proof accounts for the precise Kolmogorov complexity of the induced interventional oracle or only an upper bound on the SCM description itself; a concrete comparison to the lower-bound construction is needed.

    Authors: The proof establishes an upper bound directly on the Kolmogorov complexity of the interventional oracle. Because every answer of the oracle is computable from the SCM description by a fixed program of constant size, K(oracle) ≤ K(SCM) + O(1). The stated O(nd log(en/d) + n log n) bound is therefore an upper bound on the oracle itself. The revision adds a dedicated comparison subsection showing that, for d = Θ(n), the upper bound is O(n² log n) and therefore matches the Θ(n²) lower bound up to logarithmic factors; for constant d the rooted-tree construction yields an O(n log n) upper bound that is order-optimal. revision: yes

  3. Referee: [Ambiguity-to-bits theorem] Ambiguity-to-bits theorem (abstract): the statement that gaps equal the logarithm of residual higher-rung ambiguity up to lower-order terms is central to interpreting the separations; the precise statement, including the Shannon analogue and any assumptions on the query classes, should be stated formally with the error terms made explicit.

    Authors: Theorem 5 states the result formally: for nested query classes Q_r ⊂ Q_{r+1} the description-length gap satisfies K(Q_{r+1}) − K(Q_r) = log₂ |Amb(Q_r, Q_{r+1})| + O(log n), where Amb(Q_r, Q_{r+1}) is the number of distinct extensions of a lower-rung oracle to the higher rung, under the assumption that all oracles are total computable functions on countable domains. The Shannon analogue replaces Kolmogorov complexity by Shannon entropy and holds with an additive O(1) term. The revision makes the O(log n) error term and the computability assumptions explicit in the theorem statement and adds a short proof sketch of the error bound. revision: yes

Circularity Check

0 steps flagged

No significant circularity; results rest on explicit constructions and general theorems

full rationale

The paper's central claims are established via explicit constructions of binary acyclic SCMs and information-theoretic arguments based on Kolmogorov complexity of query oracles. The observational-interventional gap of Θ(n²) and the counterfactual gap of Θ(n) follow from these constructions and a general ambiguity-to-bits theorem, without any reduction of predictions to fitted parameters, self-definitional equivalences, or load-bearing self-citations. The degree-sensitive upper bound and persistence under ε-approximations are derived from standard combinatorial and information-theoretic bounds that do not presuppose the target separations. This is a standard non-circular theoretical result.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on the formal definition of query-class description length and constructions of specific SCMs. No free parameters are fitted to data; the results are asymptotic. Axioms include standard information theory and causality assumptions.

axioms (2)
  • standard math Kolmogorov complexity provides a valid measure of description length for oracles
    This is the foundation for defining query-class description length in the paper.
  • domain assumption Structural causal models can be represented as acyclic graphs with binary variables and finite-gate schemas
    Used directly in the main construction and upper bound for indegree d.

pith-pipeline@v0.9.0 · 5521 in / 1493 out tokens · 31375 ms · 2026-05-08T19:28:48.348743+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages

  1. [1]

    Correa, Duligur Ibeling, and Thomas Icard

    Elias Bareinboim, Juan D. Correa, Duligur Ibeling, and Thomas Icard. On Pearl’s hierarchy and the foundations of causal inference. InProbabilistic and Causal Inference: The Works of Judea Pearl, pages 507–556. ACM Books, 2022

  2. [2]

    From probability to counterfactuals: The increasing complexity of satisfiability in Pearl’s causal hierarchy

    Julian Dörfler, Benito van der Zander, Markus Bläser, and Maciej Liskiewicz. From probability to counterfactuals: The increasing complexity of satisfiability in Pearl’s causal hierarchy. In International Conference on Learning Representations (ICLR), 2025

  3. [3]

    On the number of experiments sufficient and in the worst case necessary to identify all causal relations among N variables

    Frederick Eberhardt, Clark Glymour, and Richard Scheines. On the number of experiments sufficient and in the worst case necessary to identify all causal relations among N variables. Proceedings of the UAI Conference, pages 178–184, 2005

  4. [4]

    Grünwald.The Minimum Description Length Principle

    Peter D. Grünwald.The Minimum Description Length Principle. MIT Press, 2007

  5. [5]

    Characterization and greedy learning of interventional Markov equivalence classes of directed acyclic graphs.Journal of Machine Learning Research, 13:2409–2464, 2012

    Alain Hauser and Peter Bühlmann. Characterization and greedy learning of interventional Markov equivalence classes of directed acyclic graphs.Journal of Machine Learning Research, 13:2409–2464, 2012

  6. [6]

    Comparing causal frameworks: Potential outcomes, struc- tural models, graphs, and abstractions

    Duligur Ibeling and Thomas Icard. Comparing causal frameworks: Potential outcomes, struc- tural models, graphs, and abstractions. InAdvances in Neural Information Processing Systems, volume 36, 2023

  7. [7]

    CLadder: Assessing causal reasoning in language models

    Zhijing Jin, Yuen Chen, Felix Leeb, Luigi Gresele, Ojasv Kamal, Zhiheng Lyu, Kevin Blin, Fernando Gonzalez Adauto, Max Kleiman-Weiner, Mrinmaya Sachan, and Bernhard Schölkopf. CLadder: Assessing causal reasoning in language models. InAdvances in Neural Information Processing Systems, volume 36, 2023

  8. [8]

    Causal reasoning and large language models: Opening a new frontier for causality.Transactions on Machine Learning Research, 2024

    Emre Kıcıman, Robert Osazuwa Ness, Amit Sharma, and Chenhao Tan. Causal reasoning and large language models: Opening a new frontier for causality.Transactions on Machine Learning Research, 2024

  9. [9]

    Kolmogorov

    Andrey N. Kolmogorov. Three approaches to the quantitative definition of information.Prob- lems of Information Transmission, 1(1):1–7, 1965

  10. [10]

    Springer, 3rd edition, 2008

    Ming Li and Paul Vitányi.An Introduction to Kolmogorov Complexity and Its Applications. Springer, 3rd edition, 2008

  11. [11]

    Cambridge University Press, 2nd edition, 2009

    Judea Pearl.Causality: Models, Reasoning, and Inference. Cambridge University Press, 2nd edition, 2009

  12. [12]

    Basic Books, 2018

    Judea Pearl and Dana Mackenzie.The Book of Why. Basic Books, 2018

  13. [13]

    MIT Press, 2017

    Jonas Peters, Dominik Janzing, and Bernhard Schölkopf.Elements of Causal Inference: Foundations and Learning Algorithms. MIT Press, 2017

  14. [14]

    Dimakis, and Sriram Vishwanath

    Karthikeyan Shanmugam, Murat Kocaoglu, Alexandros G. Dimakis, and Sriram Vishwanath. Learning causal graphs with small interventions. InAdvances in Neural Information Processing Systems, volume 28, 2015

  15. [15]

    Complete identification methods for the causal hierarchy.Journal of Machine Learning Research, 9:1941–1979, 2008

    Ilya Shpitser and Judea Pearl. Complete identification methods for the causal hierarchy.Journal of Machine Learning Research, 9:1941–1979, 2008

  16. [16]

    all equal

    Peter Spirtes, Clark Glymour, and Richard Scheines.Causation, Prediction, and Search. MIT Press, 2nd edition, 2000. A Rooted-Tree Construction: Full Details This appendix gives the formal construction and full proofs underlying the Θ(nlogn) rooted-tree separation. Theorem A.5 below restates and proves the main-text Theorem 4.1; together with Corollary 5.8...

  17. [17]

    what happens

    The mapT7→Int 1(MT )is injective onM n tree. 2.Lower bound:There existsTwith∆ 2|1(MT )≥(n−1) log 2 n−O(logn). 3.Upper bound:For allT,DL 2(MT )≤(n−1) log 2 n+O(logn). 4.High probability:For uniformly random rooted labeled treeT, P ∆2|1(MT )≥(n−1) log 2 n−c−O(logn) ≥1−2 −c. Consequently,∆ 2|1(MT ) = Θ(nlogn)for all but an exponentially small fraction of tre...

  18. [18]

    a topological ordering of thenvariables, usingO(nlogn)bits

  19. [19]

    for each variable, its parent set, chosen from at most Pd k=0 n−1 k possibilities

  20. [20]

    What is P(Y (Xt←0) t = Y (Xt←1) t )?

    for each variable, a gate from the fixed library Γ and a noise distribution from the fixed libraryΠ, usingO(1)bits per variable. The entire SCM thus has a description of length O(nlogn) +nlog 2 dX k=0 n−1 k +O Γ,Π(n). Given this description and n, a fixed program can compute all single-node interventional distributions and outputInt 1(M). Hence the same e...

  21. [21]

    This is not a computational limitation; it is information-theoretic

    Averaging over a proves the claim. This is not a computational limitation; it is information-theoretic. No algorithm, however powerful, can reliably predict interventional outcomes from observations when the underlying mechanism is drawn from our family. Remark D.3(Shannon Mutual Information is Zero).In the bipartite family Mn bip, a dataset D∼ (P ⋆)N of ...