arXiv: 2604.05829 · v1 · submitted 2026-04-07 · 💻 cs.LG · stat.ML


Bivariate Causal Discovery Using Rate-Distortion MDL: An Information Dimension Approach

Mário A.T. Figueiredo, Tiago Brogueira


Pith reviewed 2026-05-10 19:12 UTC · model grok-4.3


keywords bivariate causal discovery, minimum description length, rate-distortion theory, information dimension, causal inference, Kolmogorov complexity approximation


The pith

A rate-distortion version of minimum description length correctly estimates the complexity of the cause variable itself in bivariate causal discovery.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that existing MDL methods for deciding which of two observed variables causes the other fail to properly count the description length needed for the cause variable. As a result, their direction choice is driven almost entirely by how simply the mapping from cause to effect can be described. The authors replace that missing piece with a rate-distortion calculation: the shortest code length required to reproduce the cause variable up to a small distortion level chosen by standard histogram density rules. They obtain the rate itself from the asymptotic information dimension of the variable. When this term is added to a conventional description of the mechanism, the combined score is called RDMDL and is shown to reach competitive accuracy on the Tübingen causal pairs collection. A reader would care because many real-world questions involve only observational pairs and need a principled way to break the symmetry between the two directions.

Core claim

Approaches to bivariate causal discovery based on the minimum description length principle approximate the Kolmogorov complexity of the models in each causal direction and select the direction with lower total complexity. The premise is that nature's mechanisms are simpler in their true causal order. The total complexity in each direction includes both the description of the cause variable and that of the causal mechanism. Prior MDL methods do not correctly estimate the description length of the cause, leaving the decision to the mechanism term alone. The new rate-distortion MDL approach measures the cause description length as the minimum rate needed to achieve a distortion level that is representative of the underlying distribution.

What carries the argument

Rate-distortion MDL (RDMDL) for the cause variable, obtained by selecting a representative distortion via histogram density estimation rules and computing the required rate via an asymptotic information-dimension approximation.
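The rate half of this construction can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The function names, the box-counting style estimator, the epsilon grid, and the base-2 logarithm are all assumptions. The rate formula L(X) = N · (dim_I(X)/2) · log(1/D) is the one quoted in the Lean-theorem excerpt on this page. The distortion D is passed in as an argument rather than derived from a histogram rule.

```python
import math
import random
from collections import Counter


def quantized_entropy(x, eps):
    """Shannon entropy (bits) of x after quantization to cells of width eps."""
    counts = Counter(math.floor(v / eps) for v in x)
    n = len(x)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def information_dimension(x, eps_grid):
    """Least-squares slope of H(eps) against log2(1/eps) (assumed estimator)."""
    u = [math.log2(1.0 / e) for e in eps_grid]
    h = [quantized_entropy(x, e) for e in eps_grid]
    u_bar = sum(u) / len(u)
    h_bar = sum(h) / len(h)
    num = sum((ui - u_bar) * (hi - h_bar) for ui, hi in zip(u, h))
    den = sum((ui - u_bar) ** 2 for ui in u)
    return num / den


def cause_description_length(x, distortion, eps_grid):
    """L(X) = N * (dim_I(X) / 2) * log2(1 / D), in bits (base 2 is an assumption)."""
    dim = information_dimension(x, eps_grid)
    return len(x) * 0.5 * dim * math.log2(1.0 / distortion)


rng = random.Random(0)
grid = [2.0 ** -k for k in range(2, 7)]          # hypothetical resolution grid
uniform = [rng.random() for _ in range(20000)]   # continuous variable on [0, 1)
binary = [float(rng.random() < 0.5) for _ in range(2000)]  # two-valued variable

dim_uniform = information_dimension(uniform, grid)  # slope near 1
dim_binary = information_dimension(binary, grid)    # slope near 0
```

On these toy inputs the estimated dimension separates a continuous variable from a discrete one, which is the property the rate term relies on. How well the same estimate behaves at benchmark sample sizes is exactly what the referee report below questions.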

Load-bearing premise

The distortion level chosen by histogram density estimation rules stands in for the true underlying distribution, and the asymptotic information-dimension formula then supplies the exact minimum rate needed at that distortion.

What would settle it

The central claim would be falsified by a collection of synthetic bivariate pairs with known causal direction in which the cause variable has a higher intrinsic rate at the chosen distortion than the effect variable does, yet RDMDL still selects the wrong direction.
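A test of this kind needs pairs with known ground truth and a way to score any direction-picking rule against them. The sketch below shows one possible harness under loud assumptions: the mechanism (`tanh` plus Gaussian noise), the sample sizes, and the `decide` callable are all hypothetical, and the extra condition on the intrinsic rate of the cause is not checked here.

```python
import math
import random


def make_pair(n, rng):
    """Draw (x, y) with known ground truth X -> Y: y = tanh(2x) + small noise."""
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    y = [math.tanh(2.0 * v) + rng.gauss(0.0, 0.1) for v in x]
    return x, y


def accuracy(decide, pairs):
    """Fraction of pairs for which decide(a, b) returns the true label."""
    hits = sum(decide(a, b) == label for a, b, label in pairs)
    return hits / len(pairs)


rng = random.Random(0)
pairs = []
for _ in range(20):
    x, y = make_pair(200, rng)
    # Randomly swap the columns so the true direction is not always "->".
    if rng.random() < 0.5:
        pairs.append((x, y, "->"))
    else:
        pairs.append((y, x, "<-"))

# Baseline: a rule that ignores the data and always answers "->".
baseline = accuracy(lambda a, b: "->", pairs)
```

Any candidate method can be plugged in as `decide(a, b)` returning `"->"` or `"<-"`; a method that cannot beat the constant baseline on such pairs is not using the data.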

Figures

Figures reproduced from arXiv: 2604.05829 by Mário A.T. Figueiredo, Tiago Brogueira.

**Figure 1.** Displays the AUDRC for the four versions of RDMDL, compared to other methods on the Tübingen benchmark. RDMDL performs consistently above average across all decision rates. This is an even stronger result than the values of AUROC, accuracy, and AUDRC shown in …
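For readers unfamiliar with the metric in the caption, the sketch below computes a decision-rate curve and its area. It assumes the common definition in which decisions are sorted by confidence and AUDRC is the mean accuracy over the top-k most confident decisions for k = 1..M. The page itself does not define the metric, and the per-pair weights often used with the Tübingen benchmark are ignored here. Function names and the toy inputs are hypothetical.

```python
def decision_rate_curve(confidences, correct):
    """Accuracy on the k most confident decisions, for k = 1..M."""
    order = sorted(range(len(confidences)), key=lambda i: -confidences[i])
    curve, hits = [], 0
    for k, i in enumerate(order, start=1):
        hits += int(correct[i])
        curve.append(hits / k)
    return curve


def audrc(confidences, correct):
    """Unweighted mean of the decision-rate curve (assumed definition)."""
    curve = decision_rate_curve(confidences, correct)
    return sum(curve) / len(curve)


# Toy example: four decisions with made-up confidences.
conf = [0.9, 0.1, 0.5, 0.7]
ok = [True, False, True, False]
curve = decision_rate_curve(conf, ok)   # [1.0, 0.5, 2/3, 0.5]
score = audrc(conf, ok)                 # 2/3
```

A method that is right mostly on its high-confidence decisions scores well on this metric even when its overall accuracy is modest, which is why the caption treats a curve that stays above average at all decision rates as the stronger result.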

read the original abstract

Approaches to bivariate causal discovery based on the minimum description length (MDL) principle approximate the (uncomputable) Kolmogorov complexity of the models in each causal direction, selecting the one with the lower total complexity. The premise is that nature's mechanisms are simpler in their true causal order. Inherently, the description length (complexity) in each direction includes the description of the cause variable and that of the causal mechanism. In this work, we argue that current state-of-the-art MDL-based methods do not correctly address the problem of estimating the description length of the cause variable, effectively leaving the decision to the description length of the causal mechanism. Based on rate-distortion theory, we propose a new way to measure the description length of the cause, corresponding to the minimum rate required to achieve a distortion level representative of the underlying distribution. This distortion level is deduced using rules from histogram-based density estimation, while the rate is computed using the related concept of information dimension, based on an asymptotic approximation. Combining it with a traditional approach for the causal mechanism, we introduce a new bivariate causal discovery method, termed rate-distortion MDL (RDMDL). We show experimentally that RDMDL achieves competitive performance on the Tübingen dataset. All the code and experiments are publicly available at github.com/tiagobrogueira/Causal-Discovery-In-Exchangeable-Data.
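The two-part comparison described in the abstract can be made concrete with a small sketch. The function name, the tie handling, and every number below are illustrative assumptions, not values from the paper. The point is only to show why the cause term matters: when both directions assign the cause the same length, the mechanism term alone decides.

```python
def choose_direction(len_x, len_y_given_x, len_y, len_x_given_y):
    """Pick the direction with the shorter two-part description (lengths in bits)."""
    total_xy = len_x + len_y_given_x   # describe X, then the mechanism X -> Y
    total_yx = len_y + len_x_given_y   # describe Y, then the mechanism Y -> X
    if total_xy < total_yx:
        return "X->Y", total_yx - total_xy
    if total_yx < total_xy:
        return "Y->X", total_xy - total_yx
    return "undecided", 0.0


# Equal cause terms: only the mechanism terms (900 vs 950) matter.
mechanism_only = choose_direction(1000.0, 900.0, 1000.0, 950.0)

# Unequal cause terms: the cause description can flip the decision.
with_cause_term = choose_direction(1100.0, 900.0, 1000.0, 950.0)
```

The second return value is the gap between the two totals, which could serve as a confidence score when ranking decisions.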

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper proposes RDMDL to better penalize the cause variable in MDL causal discovery via rate-distortion and information dimension, but the finite-sample justification for the approximation is thin.


The paper's main contribution is a new way to estimate the description length of the cause variable in MDL-based bivariate causal discovery, by picking a distortion level from histogram rules and then using information dimension to approximate the minimum rate needed. This is actually new relative to the cited MDL literature, and the authors do a decent job making the code public so others can check the experiments on the Tübingen set. They get competitive performance, which is something. The weak part is the reliance on the asymptotic information-dimension formula for what is supposed to be a practical description length. The stress-test concern holds up: for the finite discrete samples in typical benchmarks, there's no clear reason this approximation tracks Kolmogorov complexity closely enough to change the causal decision reliably. The paper also skips error bars and detailed ablations, so it's tough to judge if the gains are stable or just from the mechanism term as before. The distortion choice via histograms could also sneak in some dependence on the data that affects the causal score. This is for researchers already deep in information-theoretic causal methods who are looking for tweaks to the MDL formulation. A general reader or someone needing solid theory for finite data won't get much. It deserves a serious referee because the gap it identifies is legitimate and the proposal is specific enough to critique in detail, even though the current evidence is preliminary. Recommendation: Send it for peer review with requests for more justification on the rate approximation and better experimental reporting.

Referee Report

3 major / 1 minor

Summary. The manuscript argues that existing MDL-based bivariate causal discovery methods fail to properly estimate the description length of the cause variable, leaving decisions driven mainly by the causal mechanism complexity. It proposes RDMDL, which selects a representative distortion level via histogram-based density estimation rules and approximates the minimum rate using the asymptotic information dimension, then combines this with a standard mechanism description length to decide causal direction. Experiments claim competitive performance on the Tübingen benchmark, with code released publicly.

Significance. If the finite-sample accuracy of the information-dimension rate approximation holds and the distortion choice is independent of the causal decision, RDMDL could provide a more balanced MDL penalty that accounts for cause-variable complexity, addressing a gap in prior methods. The public code strengthens reproducibility. Significance is limited by the absence of validation for the core approximation at benchmark sample sizes.

major comments (3)

[Abstract] Abstract and experimental section: The claim of competitive performance on the Tübingen dataset is stated without error bars, ablation details, number of runs, or statistical tests, making it impossible to evaluate whether the improvement over SOTA MDL methods is reliable or driven by the new cause-length term.
[Method] Method section on rate computation: The replacement of the Kolmogorov complexity term for the cause with the asymptotic information-dimension formula at the histogram-derived distortion is presented without derivation showing monotonicity to true description length or accuracy for finite discrete samples typical of the benchmark.
[Method] Distortion selection paragraph: The histogram rules used to deduce the distortion level are not shown to be independent of data properties later used in causal evaluation, leaving open the circularity risk that the rate approximation implicitly depends on the target decision variable.

minor comments (1)

[Abstract] The abstract states that 'all the code and experiments are publicly available' at the given GitHub link; the manuscript should include a direct pointer to the exact commit or folder containing the RDMDL implementation and Tübingen evaluation scripts.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each of the major comments in detail below, indicating the changes we intend to implement in the revised version. Our responses aim to clarify the methodological choices and strengthen the empirical validation.


Referee: [Abstract] Abstract and experimental section: The claim of competitive performance on the Tübingen dataset is stated without error bars, ablation details, number of runs, or statistical tests, making it impossible to evaluate whether the improvement over SOTA MDL methods is reliable or driven by the new cause-length term.

Authors: We agree with this observation. The current manuscript does not provide sufficient statistical details in the experimental results. In the revised manuscript, we will augment the experimental section with error bars computed over multiple independent runs, specify the exact number of runs performed, include ablation studies that isolate the contribution of the rate-distortion term for the cause variable, and apply statistical significance tests (such as McNemar's test or paired Wilcoxon signed-rank tests) to compare RDMDL against existing MDL-based methods. These additions will enable a more rigorous evaluation of the performance claims. revision: yes

Referee: [Method] Method section on rate computation: The replacement of the Kolmogorov complexity term for the cause with the asymptotic information-dimension formula at the histogram-derived distortion is presented without derivation showing monotonicity to true description length or accuracy for finite discrete samples typical of the benchmark.

Authors: The information dimension is used as an asymptotic proxy for the rate in the rate-distortion function, which in turn approximates the description length for small distortions. While we cannot prove monotonicity with respect to the uncomputable Kolmogorov complexity, the approximation is derived from the properties of information dimension for distributions with finite dimension. We will revise the method section to include a more detailed derivation linking the information dimension to the rate-distortion function and discuss the conditions under which it serves as a reasonable approximation for finite samples. Additionally, we will report empirical checks on the benchmark data to assess its practical accuracy. revision: partial

Referee: [Method] Distortion selection paragraph: The histogram rules used to deduce the distortion level are not shown to be independent of data properties later used in causal evaluation, leaving open the circularity risk that the rate approximation implicitly depends on the target decision variable.

Authors: The distortion level is determined solely from the marginal distribution of the putative cause variable using established histogram density estimation rules (e.g., Sturges' rule or similar, based on sample size and data range). This choice does not involve the joint distribution or the conditional mechanism, which are handled separately in the mechanism complexity term. To mitigate concerns of circularity, we will explicitly clarify this independence in the revised manuscript and provide a step-by-step argument demonstrating that the distortion selection depends only on univariate properties of the cause, independent of the causal direction decision. revision: yes
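The paired significance test named in the first response can be illustrated with an exact McNemar test on per-pair correctness flags. This is a sketch under assumptions: the function name and the toy correctness lists are made up, the test is unweighted (any per-pair weights in the benchmark are ignored), and nothing here reflects results from the paper.

```python
from math import comb


def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar p-value from paired per-item correctness flags."""
    only_a = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    only_b = sum(1 for a, b in zip(correct_a, correct_b) if b and not a)
    n = only_a + only_b          # discordant items are the only ones that count
    if n == 0:
        return only_a, only_b, 1.0
    k = min(only_a, only_b)
    # Binomial(n, 1/2) tail probability of seeing k or fewer on the smaller side.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return only_a, only_b, min(1.0, 2.0 * tail)


# Toy data: method A alone is right on 8 items, method B alone on 2, both on 5.
flags_a = [True] * 8 + [False] * 2 + [True] * 5
flags_b = [False] * 8 + [True] * 2 + [True] * 5
only_a, only_b, p_value = mcnemar_exact(flags_a, flags_b)
```

Because the test only looks at pairs where the two methods disagree, it addresses the referee's question directly: whether the pairs RDMDL gets right and a baseline gets wrong outnumber the reverse by more than chance.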

Circularity Check

0 steps flagged

No significant circularity; derivation uses independent standard tools

full rationale

The paper derives the cause-variable description length from rate-distortion theory by selecting a distortion level via established histogram-based density estimation rules and approximating the minimum rate via the information-dimension limit. These steps are drawn from external statistical and information-theoretic concepts, applied uniformly without reference to the causal direction or the final model selection. No equation or step reduces the RDMDL score to a fitted parameter, self-definition, or self-citation chain that presupposes the target causal conclusion. The mechanism description length is handled by a conventional approach, and experimental results on the Tübingen benchmark serve as an external check. The derivation chain therefore remains self-contained.
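The claim that distortion selection does not look at the causal direction can be illustrated with a histogram rule that takes only the univariate sample as input. The sketch uses Sturges' rule, which this page mentions only as an example in the simulated rebuttal, so its use here is an assumption. Converting the bin width to a mean-squared distortion via width²/12 (the error of a uniform quantizer with uniformly distributed within-cell values) is a further assumption about how the paper maps histograms to distortion.

```python
import math


def sturges_bin_width(x):
    """Bin width under Sturges' rule: range / (ceil(log2(n)) + 1) bins."""
    bins = math.ceil(math.log2(len(x))) + 1
    return (max(x) - min(x)) / bins


def mse_of_uniform_quantizer(width):
    """Mean-squared error of a uniform quantizer with this cell width (assumption)."""
    return width ** 2 / 12.0


# The function signature takes only the putative cause, never the paired variable.
cause = [i / 1023 for i in range(1024)]      # 1024 evenly spaced points on [0, 1]
width = sturges_bin_width(cause)             # 11 bins over a range of 1
distortion = mse_of_uniform_quantizer(width)
```

Under a rule of this form the distortion depends on the sample size and range of one variable only. Note that both quantities still differ between X and Y, so the rule is applied once per candidate cause.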

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The histogram distortion rule and information-dimension asymptotic are treated as background but their applicability to causal MDL is not derived here.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear
We propose to approximate L(X) by R(X,D) ... distortion level D ... deduced using rules from histogram-based density estimation ... rate is computed using ... information dimension, based on an asymptotic approximation.
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear
L(X) = N · R(X,D) = N · (dim_I(X)/2) · log(1/D)
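A worked instance of this formula may help fix the scale. The values N = 1000, dim_I(X) = 1, D = 0.01 and the base-2 logarithm are illustrative assumptions, not numbers from the paper.

```latex
\begin{align*}
L(X) &= N \cdot R(X, D) = N \cdot \frac{\dim_I(X)}{2} \cdot \log_2 \frac{1}{D} \\
     &= 1000 \cdot \frac{1}{2} \cdot \log_2 100
      \approx 1000 \cdot 0.5 \cdot 6.644
      \approx 3322 \text{ bits}
\end{align*}
```

For a purely discrete variable the information dimension is 0, so the same formula assigns L(X) = 0 regardless of N and D. That edge case is one reason the finite-sample behaviour of the approximation deserves the scrutiny requested in the referee report.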
