The Good, the Bad, and the Ugly of Markov Boundary for Tabular Prediction

Abhinav Gorantla; Huan Liu; K. Sel\c{c}uk Candan; Shu Wan

arxiv: 2605.29411 · v1 · pith:E55OYR2Knew · submitted 2026-05-28 · 💻 cs.LG · cs.AI· stat.ME· stat.ML

The Good, the Bad, and the Ugly of Markov Boundary for Tabular Prediction

Shu Wan , Abhinav Gorantla , Huan Liu , K. Sel\c{c}uk Candan This is my paper

Pith reviewed 2026-06-29 08:34 UTC · model grok-4.3

classification 💻 cs.LG cs.AIstat.MEstat.ML

keywords Markov boundarytabular predictioncausal discoveryfeature selectionsynthetic benchmarksstructural causal modelsconditional independenceregressors

0 comments

The pith

Restricting tabular regressors to the true Markov boundary improves accuracy, especially in large sparse tables, but causal discovery rarely recovers a useful boundary.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether the Markov boundary—the smallest feature set that makes a target variable independent of all others—can serve as an effective feature mask for tabular prediction. On a benchmark of 3450 synthetic tasks spanning six structural causal model families and feature counts up to 1000, models trained only on the oracle boundary outperform those using the full feature set, with the gap widening as dimensionality and sparsity increase. Pipelines that first apply causal discovery to recover a boundary and then train on the resulting mask fail to beat full-feature baselines. The shortfall traces to three mismatches: discovery algorithms optimize for structural correctness rather than predictive utility, false negatives and false positives carry sharply different costs for downstream error, and many feature sets besides the exact boundary already outperform the complete table.

Core claim

Under standard graphical assumptions the Markov boundary is the minimal feature set rendering the target conditionally independent of all remaining variables. On the SCM3K benchmark of 3450 tasks, training six regressors on the oracle boundary substantially improves prediction over the full feature set, and the improvement grows with feature count. Boundaries recovered by existing causal discovery methods produce masks that rarely improve upon the full feature set because discovery optimizes structural recovery rather than prediction, false negatives and positives incur asymmetric predictive costs, and the exact boundary is only one of many feature sets that beat all features.

What carries the argument

The Markov boundary: the smallest set of features that renders the target conditionally independent of all other features under standard graphical assumptions.

If this is right

Restricting regressors to the oracle Markov boundary improves prediction accuracy over full-feature training on high-dimensional tabular data.
The accuracy gain from the oracle boundary grows as the number of features increases and the table becomes sparser.
Existing causal discovery estimators exhaust compute before reaching the high-dimensional regime where boundary use would help most.
False negatives in boundary recovery carry higher predictive cost than false positives.
Many feature sets besides the exact Markov boundary already outperform training on all columns.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Feature selection for prediction may need algorithms that optimize directly for downstream model error rather than graph recovery metrics.
Tabular models could incorporate internal mechanisms to detect and ignore redundant features without recovering an explicit causal boundary.
The existence of many sufficient feature sets suggests room for approximate or heuristic selection methods tuned to a specific regressor.
Hybrid causal-predictive pipelines might close the gap between oracle and recovered boundaries on real tabular tasks.

Load-bearing premise

Standard graphical assumptions hold such that the Markov boundary is the minimal set of features rendering the target conditionally independent of all others.

What would settle it

On SCM3K tasks with 500 or more features, compare test error of regressors trained on boundaries recovered by standard causal discovery algorithms against the same regressors trained on the full feature set; if the recovered-mask models do not underperform the full set, the claim that discovery pipelines fail is falsified.

Figures

Figures reproduced from arXiv: 2605.29411 by Abhinav Gorantla, Huan Liu, K. Sel\c{c}uk Candan, Shu Wan.

**Figure 2.** Figure 2: Oracle MB gap by downstream regressor. The boundary helps most for regressors that pay a high finite-sample cost [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Regressor-conditioned attribution of the MB gap. The gap is explained primarily by redundancy, feature dimension, [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 4.** Figure 4: Causal boundary estimators are least available [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗

**Figure 5.** Figure 5: False negatives and false positives have different downstream costs. The TabPFN panel shown here follows the same [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗

**Figure 6.** Figure 6: Layered blankets versus target proximity as mask families. The top row expands by accumulated Markov-boundary [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗

**Figure 7.** Figure 7: Test RMSE of two mask-expansion strategies for [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗

**Figure 8.** Figure 8: Prediction gain maps for Ridge and TabPFN. The axes are mask precision and recall against the oracle boundary; the [PITH_FULL_IMAGE:figures/full_fig_p009_8.png] view at source ↗

**Figure 9.** Figure 9: Why the exact boundary can be too exclusive. The exact boundary is the minimal sufficient set, but layer-2, mixed [PITH_FULL_IMAGE:figures/full_fig_p010_9.png] view at source ↗

read the original abstract

Under standard graphical assumptions, the Markov boundary of a target variable is the smallest set of features that renders every other feature redundant. Once the boundary is observed, the target is conditionally independent of the rest of the table. This is a tempting object for tabular prediction, since it names exactly the columns a model should need. Yet modern regressors are still trained on the full feature set. We ask whether the Markov boundary is genuinely useful for prediction on SCM3K, a 3,450-task synthetic SCM benchmark with feature counts from 40 to 1000 and six SCM families, evaluated with six regressors. The answer is more nuanced than the theory suggests. Restricting a regressor to the oracle boundary often improves prediction substantially, and the improvement grows as the feature space becomes larger and sparser. But the natural pipeline of recovering the boundary with causal discovery and training on the recovered mask does not deliver. Existing estimators exhaust the compute budget before reaching the regime where the boundary helps most, and even where they run they rarely beat the full feature set. We trace this to three causes. Discovery optimizes structural recovery rather than prediction. False negatives and false positives carry sharply asymmetric predictive cost. The exact boundary is only one of many feature sets that beat all features. We then develop what these facts imply for prediction-aligned feature selection and for tabular models that learn to use causal structure.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Oracle Markov boundaries help on this synthetic benchmark but recovered ones do not, and the paper gives three concrete reasons why.

read the letter

The main thing to know is that on their SCM3K benchmark of 3450 synthetic tasks, feeding a regressor only the true Markov boundary often beats training on the full feature set, and the advantage grows with larger and sparser feature spaces. Trying to recover the boundary first with causal discovery and then training on the recovered mask does not beat the full set.

The paper does the empirical work cleanly. It generates data from six known SCM families, varies feature count from 40 to 1000, and tests six regressors. The patterns are reported directly: the oracle gap exists and widens in the high-dimensional regime. The authors then list three post-hoc explanations that match the results—discovery optimizes structure rather than prediction error, false negatives and positives have asymmetric costs, and many feature subsets already beat the full set. Because the data are generated from the models, the standard conditional-independence definition of the boundary holds by construction.

The obvious limit is that everything stays inside synthetic SCMs. No real tabular datasets appear, so the size of the practical gap outside these families is unknown. The causal discovery methods are off-the-shelf; the paper does not test whether methods tuned for downstream prediction would narrow the difference.

This is useful for people who do feature selection on tabular data or who want evidence on why causal pipelines underperform in practice. The benchmark and the breakdown of the three failure modes are new enough to be worth referee time. The thinking is straightforward and the claims stay within what the synthetic experiments show.

Send it to review. The scale and the specific oracle-versus-recovered comparison are worth having checked even with the synthetic restriction.

Referee Report

1 major / 2 minor

Summary. The paper claims that on the SCM3K synthetic benchmark (3,450 tasks from six SCM families, feature counts 40-1000), restricting regressors to the oracle Markov boundary often improves prediction substantially, with gains increasing in larger and sparser regimes. However, the pipeline of recovering the boundary via causal discovery and training on the recovered mask fails to deliver comparable gains. The authors trace the failure to three factors: discovery optimizing structural recovery rather than prediction, asymmetric predictive costs of false negatives versus positives, and the existence of multiple feature sets that outperform the full set.

Significance. If the results hold, this provides a controlled large-scale empirical demonstration of the gap between the theoretical utility of Markov boundaries for conditional independence and their practical value when recovered by existing causal discovery methods. The synthetic benchmark with known ground-truth SCMs is a clear strength, enabling direct oracle-vs-recovered comparisons and falsifiable observations about when boundaries help prediction. This can inform development of prediction-aligned rather than structure-aligned feature selection.

major comments (1)

[Abstract] Abstract: the central claim that 'the improvement grows as the feature space becomes larger and sparser' is load-bearing for the headline result, yet the abstract provides no reference to a specific table, figure, or statistical test (e.g., trend across dimension bins or p-values) that would allow verification of the scaling behavior.

minor comments (2)

[Abstract] Abstract: the six SCM families and six regressors are referenced but not named; naming them would improve immediate readability without requiring the reader to reach the methods section.
[Abstract] Abstract: the sentence beginning 'Existing estimators exhaust the compute budget...' would be clearer if it quantified the budget (e.g., wall-clock or memory limits) or the dimension threshold at which the boundary becomes beneficial.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the positive evaluation, the recognition of the benchmark's controlled nature, and the recommendation of minor revision. We address the single major comment below.

read point-by-point responses

Referee: [Abstract] Abstract: the central claim that 'the improvement grows as the feature space becomes larger and sparser' is load-bearing for the headline result, yet the abstract provides no reference to a specific table, figure, or statistical test (e.g., trend across dimension bins or p-values) that would allow verification of the scaling behavior.

Authors: We agree that the abstract claim would benefit from an explicit pointer to the supporting evidence for verifiability. The scaling behavior (larger gains in higher-dimensional and sparser regimes) is shown via binned results in Figure 5 (mean R2 improvement vs. feature count bins) and Table 4 (stratified by sparsity), with standard errors across the six SCM families. In the revised manuscript we will append a brief parenthetical '(see Figure 5)' to the relevant abstract sentence. This is a minor textual change that preserves abstract length while addressing the concern. revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

This paper presents an empirical evaluation on the external SCM3K synthetic benchmark rather than any derivation chain. The headline result (oracle Markov boundaries improve regressors while recovered ones do not) is measured directly on data generated from known SCMs, where the invoked graphical assumptions hold by construction. No equations, fitted parameters, or self-citations are used to derive the central claims; the three explanatory factors are presented as post-hoc observations. The study is therefore self-contained against external benchmarks with no load-bearing reductions to inputs or prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that standard graphical models apply to the generated SCM families and that the benchmark captures relevant properties of tabular prediction tasks.

axioms (1)

domain assumption Standard graphical assumptions hold such that the Markov boundary is the minimal set rendering the target conditionally independent of all others
Invoked in the first sentence of the abstract to define the object of study.

pith-pipeline@v0.9.1-grok · 5800 in / 1238 out tokens · 36486 ms · 2026-06-29T08:34:30.709262+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

49 extracted references · 10 canonical work pages

[1]

Constantin F Aliferis, Alexander Statnikov, Ioannis Tsamardinos, Subramani Mani, and Xenofon D Koutsoukos. 2010. Local causal and Markov blanket in- duction for causal discovery and feature selection for classification part I: Algo- rithms and empirical evaluation.Journal of Machine Learning Research11 (2010), 171–234

2010
[2]

Constantin F Aliferis, Ioannis Tsamardinos, and Alexander Statnikov. 2003. HI- TON: a novel Markov Blanket algorithm for optimal variable selection. InAMIA annual symposium proceedings, Vol. 2003. 21

2003
[3]

Cresswell, and Rahul Krishnan

Vahid Balazadeh, Hamidreza Kamkari, Valentin Thomas, Junwei Ma, Bingru Li, Jesse C. Cresswell, and Rahul Krishnan. 2026. CausalPFN: Amortized Causal Effect Estimation via In-Context Learning. InThe Thirty-ninth Annual Confer- ence on Neural Information Processing Systems. https://openreview.net/forum? id=RblaNJGx8C

2026
[4]

Philippe Brouillard, Chandler Squires, Jonas Wahl, Konrad K"ording, Karen Sachs, Alexandre Drouin, and Dhanya Sridhar. 2025. The Landscape of Causal Discovery Data: Grounding Causal Discovery in Real-World Applications. In Proceedings of the Fourth Conference on Causal Learning and Reasoning (Proceed- ings of Machine Learning Research, Vol. 275), Biwei Hu...

2025
[5]

Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A Scalable Tree Boosting Sys- tem. InProceedings of the 22nd ACM SIGKDD International Conference on Knowl- edge Discovery and Data Mining(San Francisco, California, USA)(KDD ’16). As- sociation for Computing Machinery, New York, NY, USA, 785–794. doi:10.1145/ 2939672.2939785

work page arXiv 2016
[6]

David Maxwell Chickering. 2003. Optimal structure identification with greedy search.J. Mach. Learn. Res.3, null (March 2003), 507–554. doi:10.1162/ 153244303321897717

2003
[7]

Paul Erdős and Alfréd Rényi. 1960. On the evolution of random graphs.Publica- tions of the Mathematical Institute of the Hungarian Academy of Sciences5 (1960), 17–61

1960
[8]

2024.causalAssembly: Generating Realistic Pro- duction Data for Benchmarking Causal Discovery

Konstantin Göbler, Tobias Windisch, Mathias Drton, Tim Pychynski, Martin Roth, and Steffen Sonntag. 2024.causalAssembly: Generating Realistic Pro- duction Data for Benchmarking Causal Discovery. InProceedings of the Third Conference on Causal Learning and Reasoning (Proceedings of Machine Learning Research, Vol. 236), Francesco Locatello and Vanessa Didel...

2024
[9]

https://proceedings.mlr.press/v236/gobler24a.html
[10]

The Elements of Statistical Learning: Data Mining, Inference, and Prediction

Trevor Hastie, Robert Tibshirani, and Jerome Friedman. 2009.The Elements of Statistical Learning: Data Mining, Inference, and Prediction(2 ed.). Springer, New York, NY. doi:10.1007/978-0-387-84858-7

work page doi:10.1007/978-0-387-84858-7 2009
[11]

Hoerl and Robert W

Arthur E. Hoerl and Robert W. Kennard. 1970. Ridge Regression: Biased Estimation for Nonorthogonal Problems.Technometrics12, 1 (1970), 55–

1970
[12]

Hoerl and Robert W

arXiv:https://doi.org/10.1080/00401706.1970.10488634 doi:10.1080/00401706. 1970.10488634

work page doi:10.1080/00401706.1970.10488634 1970
[13]

Noah Hollmann, Samuel Müller, Katharina Eggensperger, and Frank Hutter
[14]

InThe Eleventh International Conference on Learning Representations

TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second. InThe Eleventh International Conference on Learning Representations. https://openreview.net/forum?id=cp5PvcI6w8_
[15]

Kurt Hornik, Maxwell Stinchcombe, and Halbert White. 1989. Multilayer feed- forward networks are universal approximators.Neural Networks2, 5 (1989), 359–366. doi:10.1016/0893-6080(89)90020-8

work page doi:10.1016/0893-6080(89)90020-8 1989
[16]

2009.Probabilistic Graphical Models: Principles and Techniques - Adaptive Computation and Machine Learning

Daphne Koller and Nir Friedman. 2009.Probabilistic Graphical Models: Principles and Techniques - Adaptive Computation and Machine Learning. The MIT Press

2009
[17]

Laird and James H

Nan M. Laird and James H. Ware. 1982. Random-Effects Models for Longitudinal Data.Biometrics38, 4 (1982), 963–974. http://www.jstor.org/stable/2529876

work page arXiv 1982
[18]

Yuchen Ma, Dennis Frauen, Emil Javurek, and Stefan Feuerriegel. 2026. Foun- dation Models for Causal Inference via Prior-Data Fitted Networks. InThe Four- teenth International Conference on Learning Representations. https://openreview. net/forum?id=d2L1ndOKjq

2026
[19]

Daniel Marbach, James C Costello, Robert Küffner, Nicole M Vega, Robert J Prill, Diogo M Camacho, Kyle R Allison, Manolis Kellis, James J Collins, et al. 2012. Wisdom of crowds for robust gene network inference.Nature methods9, 8 (2012), 796–804

2012
[20]

Prill, Thomas Schaffter, Claudio Mat- tiussi, Dario Floreano, and Gustavo Stolovitzky

Daniel Marbach, Robert J. Prill, Thomas Schaffter, Claudio Mat- tiussi, Dario Floreano, and Gustavo Stolovitzky. 2010. Revealing strengths and weaknesses of methods for gene network inference.Pro- ceedings of the National Academy of Sciences107, 14 (2010), 6286–

2010
[21]

arXiv:https://www.pnas.org/doi/pdf/10.1073/pnas.0913357107 doi:10.1073/pnas.0913357107

work page doi:10.1073/pnas.0913357107
[22]

Dimitris Margaritis and Sebastian Thrun. 1999. Bayesian Network Induction via Local Neighborhoods. InAdvances in Neural Informa- tion Processing Systems, S. Solla, T. Leen, and K. Müller (Eds.), Vol. 12. MIT Press. https://proceedings.neurips.cc/paper_files/paper/1999/file/ 5d79099fcdf499f12b79770834c0164a-Paper.pdf

1999
[23]

Christopher Meek. 1995. Causal inference and causal explanation with back- ground knowledge.Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence(1995), 403–410

1995
[24]

Wang Miao, Zhi Geng, and Eric J Tchetgen Tchetgen. 2018. Identifying causal effects with proxy variables of an unmeasured confounder.Biometrika 105, 4 (12 2018), 987–993. arXiv:https://academic.oup.com/biomet/article- pdf/105/4/987/27121264/asy038.pdf doi:10.1093/biomet/asy038

work page doi:10.1093/biomet/asy038 2018
[25]

Samuel Müller, Noah Hollmann, Sebastian Pineda Arango, Josif Grabocka, and Frank Hutter. 2022. Transformers Can Do Bayesian Inference. InInterna- tional Conference on Learning Representations. https://openreview.net/forum? id=KSugKcbNf9

2022
[26]

Samuel Müller, Arik Reuter, Noah Hollmann, David Rügamer, and Frank Hutter
[27]

InProceedings of the 42nd International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol

Position: The Future of Bayesian Prediction Is Prior-Fitted. InProceedings of the 42nd International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 267). PMLR, Vancouver, Canada. https://proceedings.mlr. press/v267/muller25d.html
[28]

Rainer Opgen-Rhein and Korbinian Strimmer. 2007. From correlation to causa- tion networks: a simple approximate learning algorithm and its application to high-dimensional plant gene expression data.BMC systems biology1, 1 (2007), 37

2007
[29]

1988.Probabilistic Reasoning in Intelligent Systems: Networks of Plau- sible Inference

Judea Pearl. 1988.Probabilistic Reasoning in Intelligent Systems: Networks of Plau- sible Inference. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA

1988
[30]

2009.Causality: Models, Reasoning and Inference(2nd ed.)

Judea Pearl. 2009.Causality: Models, Reasoning and Inference(2nd ed.). Cam- bridge University Press, USA

2009
[31]

2017.Elements of Causal Inference: Foundations and Learning Algorithms

Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. 2017.Elements of Causal Inference: Foundations and Learning Algorithms. MIT Press, Cambridge, MA

2017
[32]

Jingang Qu, David Holzmüller, Gaël Varoquaux, and Marine Le Morvan. 2025. TabICL: A Tabular Foundation Model for In-Context Learning on Large Data. In Proceedings of the 42nd International Conference on Machine Learning (Proceed- ings of Machine Learning Research, Vol. 267). PMLR, Vancouver, Canada, 50817– 50847. https://proceedings.mlr.press/v267/qu25d.html

2025
[33]

Jingang Qu, David Holzmüller, Gaël Varoquaux, and Marine Le Morvan
[34]

TabICLv2: A better, faster, scalable, and open tabular foundation model.arXiv:2602.11139, 2026

TabICLv2: A better, faster, scalable, and open tabular foundation model. arXiv:2602.11139 [cs.LG] https://arxiv.org/abs/2602.11139

work page arXiv
[35]

Alexander Reisach, Christof Seiler, and Sebastian Weichwald. 2021. Beware of the simulated dag! causal discovery benchmarks may be easy to game.Advances in Neural Information Processing Systems34 (2021), 27772–27784

2021
[36]

Jake Robertson, Arik Reuter, Siyuan Guo, Noah Hollmann, Frank Hutter, and Bernhard Schölkopf. 2025. Do-PFN: In-Context Learning for Causal Effect Esti- mation. arXiv:2506.06039 [cs.LG] https://arxiv.org/abs/2506.06039

work page arXiv 2025
[37]

Marco Scutari. 2010. Learning Bayesian networks with the bnlearn R package. Journal of statistical software35 (2010), 1–22

2010
[38]

2000.Causation, Prediction, and Search(2 ed.)

Peter Spirtes, Clark Glymour, and Richard Scheines. 2000.Causation, Prediction, and Search(2 ed.). MIT Press, Cambridge, MA

2000
[39]

Robert Tibshirani. 1996. Regression Shrinkage and Selection Via the Lasso.Jour- nal of the Royal Statistical Society: Series B (Methodological)58, 1 (01 1996), 267–

1996
[40]

doi:10.1111/j.2517-6161.1996.tb02080.x

work page doi:10.1111/j.2517-6161.1996.tb02080.x 1996
[41]

Aliferis, and Alexander Statnikov

Ioannis Tsamardinos, Constantin F. Aliferis, and Alexander Statnikov. 2003. Al- gorithms for Large Scale Markov Blanket Discovery. InProceedings of the Six- teenth International Florida Artificial Intelligence Research Society Conference (FLAIRS 2003). AAAI Press, Menlo Park, CA, 376–381. https://aaai.org/papers/ flairs-2003-073/

2003
[42]

Ioannis Tsamardinos, Laura E Brown, and Constantin F Aliferis. 2006. The max-min hill-climbing Bayesian network structure learning algorithm.Machine Learning65, 1 (2006), 31–78

2006
[43]

Tim Van den Bulcke, Koenraad Van Leemput, Bart Naudts, Piet van Remortel, Hongwu Ma, Alain Verschoren, Bart De Moor, and Kathleen Marchal. 2006. Syn- TReN: a generator of synthetic gene expression data for design and analysis of structure learning algorithms.BMC bioinformatics7, 1 (2006), 43

2006
[44]

Thomas Verma and Judea Pearl. 1990. Equivalence and synthesis of causal mod- els.Proceedings of the Sixth Conference on Uncertainty in Artificial Intelligence (1990), 220–227

1990
[45]

Wooldridge

Jefrey M. Wooldridge. 2010.Econometric Analysis of Cross Section and Panel Data. The MIT Press. http://www.jstor.org/stable/j.ctt5hhcfr

2010
[46]

Liyuan Xu, Heishiro Kanagawa, and Arthur Gretton. 2021. Deep Proxy Causal Learning and its Application to Confounded Bandit Policy Evaluation. InAd- vances in Neural Information Processing Systems, M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (Eds.), Vol. 34. Curran As- sociates, Inc., 26264–26275. https://proceedings.neurips...

2021
[47]

Naiyu Yin, Hanjing Wang, Yue Yu, Tian Gao, Amit Dhurandhar, and Qiang Ji
[48]

InEuropean Conference on Computer Vision

Integrating Markov blanket discovery into causal representation learning for domain generalization. InEuropean Conference on Computer Vision. Springer, 271–288
[49]

Kui Yu, Lin Liu, Jiuyong Li, Weiping Ding, and Thuc Duy Le. 2020. Causality- based feature selection: Methods and evaluations.Comput. Surveys53, 5 (2020), 1–36

2020

[1] [1]

Constantin F Aliferis, Alexander Statnikov, Ioannis Tsamardinos, Subramani Mani, and Xenofon D Koutsoukos. 2010. Local causal and Markov blanket in- duction for causal discovery and feature selection for classification part I: Algo- rithms and empirical evaluation.Journal of Machine Learning Research11 (2010), 171–234

2010

[2] [2]

Constantin F Aliferis, Ioannis Tsamardinos, and Alexander Statnikov. 2003. HI- TON: a novel Markov Blanket algorithm for optimal variable selection. InAMIA annual symposium proceedings, Vol. 2003. 21

2003

[3] [3]

Cresswell, and Rahul Krishnan

Vahid Balazadeh, Hamidreza Kamkari, Valentin Thomas, Junwei Ma, Bingru Li, Jesse C. Cresswell, and Rahul Krishnan. 2026. CausalPFN: Amortized Causal Effect Estimation via In-Context Learning. InThe Thirty-ninth Annual Confer- ence on Neural Information Processing Systems. https://openreview.net/forum? id=RblaNJGx8C

2026

[4] [4]

Philippe Brouillard, Chandler Squires, Jonas Wahl, Konrad K"ording, Karen Sachs, Alexandre Drouin, and Dhanya Sridhar. 2025. The Landscape of Causal Discovery Data: Grounding Causal Discovery in Real-World Applications. In Proceedings of the Fourth Conference on Causal Learning and Reasoning (Proceed- ings of Machine Learning Research, Vol. 275), Biwei Hu...

2025

[5] [5]

Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A Scalable Tree Boosting Sys- tem. InProceedings of the 22nd ACM SIGKDD International Conference on Knowl- edge Discovery and Data Mining(San Francisco, California, USA)(KDD ’16). As- sociation for Computing Machinery, New York, NY, USA, 785–794. doi:10.1145/ 2939672.2939785

work page arXiv 2016

[6] [6]

David Maxwell Chickering. 2003. Optimal structure identification with greedy search.J. Mach. Learn. Res.3, null (March 2003), 507–554. doi:10.1162/ 153244303321897717

2003

[7] [7]

Paul Erdős and Alfréd Rényi. 1960. On the evolution of random graphs.Publica- tions of the Mathematical Institute of the Hungarian Academy of Sciences5 (1960), 17–61

1960

[8] [8]

2024.causalAssembly: Generating Realistic Pro- duction Data for Benchmarking Causal Discovery

Konstantin Göbler, Tobias Windisch, Mathias Drton, Tim Pychynski, Martin Roth, and Steffen Sonntag. 2024.causalAssembly: Generating Realistic Pro- duction Data for Benchmarking Causal Discovery. InProceedings of the Third Conference on Causal Learning and Reasoning (Proceedings of Machine Learning Research, Vol. 236), Francesco Locatello and Vanessa Didel...

2024

[9] [9]

https://proceedings.mlr.press/v236/gobler24a.html

[10] [10]

The Elements of Statistical Learning: Data Mining, Inference, and Prediction

Trevor Hastie, Robert Tibshirani, and Jerome Friedman. 2009.The Elements of Statistical Learning: Data Mining, Inference, and Prediction(2 ed.). Springer, New York, NY. doi:10.1007/978-0-387-84858-7

work page doi:10.1007/978-0-387-84858-7 2009

[11] [11]

Hoerl and Robert W

Arthur E. Hoerl and Robert W. Kennard. 1970. Ridge Regression: Biased Estimation for Nonorthogonal Problems.Technometrics12, 1 (1970), 55–

1970

[12] [12]

Hoerl and Robert W

arXiv:https://doi.org/10.1080/00401706.1970.10488634 doi:10.1080/00401706. 1970.10488634

work page doi:10.1080/00401706.1970.10488634 1970

[13] [13]

Noah Hollmann, Samuel Müller, Katharina Eggensperger, and Frank Hutter

[14] [14]

InThe Eleventh International Conference on Learning Representations

TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second. InThe Eleventh International Conference on Learning Representations. https://openreview.net/forum?id=cp5PvcI6w8_

[15] [15]

Kurt Hornik, Maxwell Stinchcombe, and Halbert White. 1989. Multilayer feed- forward networks are universal approximators.Neural Networks2, 5 (1989), 359–366. doi:10.1016/0893-6080(89)90020-8

work page doi:10.1016/0893-6080(89)90020-8 1989

[16] [16]

2009.Probabilistic Graphical Models: Principles and Techniques - Adaptive Computation and Machine Learning

Daphne Koller and Nir Friedman. 2009.Probabilistic Graphical Models: Principles and Techniques - Adaptive Computation and Machine Learning. The MIT Press

2009

[17] [17]

Laird and James H

Nan M. Laird and James H. Ware. 1982. Random-Effects Models for Longitudinal Data.Biometrics38, 4 (1982), 963–974. http://www.jstor.org/stable/2529876

work page arXiv 1982

[18] [18]

Yuchen Ma, Dennis Frauen, Emil Javurek, and Stefan Feuerriegel. 2026. Foun- dation Models for Causal Inference via Prior-Data Fitted Networks. InThe Four- teenth International Conference on Learning Representations. https://openreview. net/forum?id=d2L1ndOKjq

2026

[19] [19]

Daniel Marbach, James C Costello, Robert Küffner, Nicole M Vega, Robert J Prill, Diogo M Camacho, Kyle R Allison, Manolis Kellis, James J Collins, et al. 2012. Wisdom of crowds for robust gene network inference.Nature methods9, 8 (2012), 796–804

2012

[20] [20]

Prill, Thomas Schaffter, Claudio Mat- tiussi, Dario Floreano, and Gustavo Stolovitzky

Daniel Marbach, Robert J. Prill, Thomas Schaffter, Claudio Mat- tiussi, Dario Floreano, and Gustavo Stolovitzky. 2010. Revealing strengths and weaknesses of methods for gene network inference.Pro- ceedings of the National Academy of Sciences107, 14 (2010), 6286–

2010

[21] [21]

arXiv:https://www.pnas.org/doi/pdf/10.1073/pnas.0913357107 doi:10.1073/pnas.0913357107

work page doi:10.1073/pnas.0913357107

[22] [22]

Dimitris Margaritis and Sebastian Thrun. 1999. Bayesian Network Induction via Local Neighborhoods. InAdvances in Neural Informa- tion Processing Systems, S. Solla, T. Leen, and K. Müller (Eds.), Vol. 12. MIT Press. https://proceedings.neurips.cc/paper_files/paper/1999/file/ 5d79099fcdf499f12b79770834c0164a-Paper.pdf

1999

[23] [23]

Christopher Meek. 1995. Causal inference and causal explanation with back- ground knowledge.Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence(1995), 403–410

1995

[24] [24]

Wang Miao, Zhi Geng, and Eric J Tchetgen Tchetgen. 2018. Identifying causal effects with proxy variables of an unmeasured confounder.Biometrika 105, 4 (12 2018), 987–993. arXiv:https://academic.oup.com/biomet/article- pdf/105/4/987/27121264/asy038.pdf doi:10.1093/biomet/asy038

work page doi:10.1093/biomet/asy038 2018

[25] [25]

Samuel Müller, Noah Hollmann, Sebastian Pineda Arango, Josif Grabocka, and Frank Hutter. 2022. Transformers Can Do Bayesian Inference. InInterna- tional Conference on Learning Representations. https://openreview.net/forum? id=KSugKcbNf9

2022

[26] [26]

Samuel Müller, Arik Reuter, Noah Hollmann, David Rügamer, and Frank Hutter

[27] [27]

InProceedings of the 42nd International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol

Position: The Future of Bayesian Prediction Is Prior-Fitted. InProceedings of the 42nd International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 267). PMLR, Vancouver, Canada. https://proceedings.mlr. press/v267/muller25d.html

[28] [28]

Rainer Opgen-Rhein and Korbinian Strimmer. 2007. From correlation to causa- tion networks: a simple approximate learning algorithm and its application to high-dimensional plant gene expression data.BMC systems biology1, 1 (2007), 37

2007

[29] [29]

1988.Probabilistic Reasoning in Intelligent Systems: Networks of Plau- sible Inference

Judea Pearl. 1988.Probabilistic Reasoning in Intelligent Systems: Networks of Plau- sible Inference. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA

1988

[30] [30]

2009.Causality: Models, Reasoning and Inference(2nd ed.)

Judea Pearl. 2009.Causality: Models, Reasoning and Inference(2nd ed.). Cam- bridge University Press, USA

2009

[31] [31]

2017.Elements of Causal Inference: Foundations and Learning Algorithms

Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. 2017.Elements of Causal Inference: Foundations and Learning Algorithms. MIT Press, Cambridge, MA

2017

[32] [32]

Jingang Qu, David Holzmüller, Gaël Varoquaux, and Marine Le Morvan. 2025. TabICL: A Tabular Foundation Model for In-Context Learning on Large Data. In Proceedings of the 42nd International Conference on Machine Learning (Proceed- ings of Machine Learning Research, Vol. 267). PMLR, Vancouver, Canada, 50817– 50847. https://proceedings.mlr.press/v267/qu25d.html

2025

[33] [33]

Jingang Qu, David Holzmüller, Gaël Varoquaux, and Marine Le Morvan

[34] [34]

TabICLv2: A better, faster, scalable, and open tabular foundation model.arXiv:2602.11139, 2026

TabICLv2: A better, faster, scalable, and open tabular foundation model. arXiv:2602.11139 [cs.LG] https://arxiv.org/abs/2602.11139

work page arXiv

[35] [35]

Alexander Reisach, Christof Seiler, and Sebastian Weichwald. 2021. Beware of the simulated dag! causal discovery benchmarks may be easy to game.Advances in Neural Information Processing Systems34 (2021), 27772–27784

2021

[36] [36]

Jake Robertson, Arik Reuter, Siyuan Guo, Noah Hollmann, Frank Hutter, and Bernhard Schölkopf. 2025. Do-PFN: In-Context Learning for Causal Effect Esti- mation. arXiv:2506.06039 [cs.LG] https://arxiv.org/abs/2506.06039

work page arXiv 2025

[37] [37]

Marco Scutari. 2010. Learning Bayesian networks with the bnlearn R package. Journal of statistical software35 (2010), 1–22

2010

[38] [38]

2000.Causation, Prediction, and Search(2 ed.)

Peter Spirtes, Clark Glymour, and Richard Scheines. 2000.Causation, Prediction, and Search(2 ed.). MIT Press, Cambridge, MA

2000

[39] [39]

Robert Tibshirani. 1996. Regression Shrinkage and Selection Via the Lasso.Jour- nal of the Royal Statistical Society: Series B (Methodological)58, 1 (01 1996), 267–

1996

[40] [40]

doi:10.1111/j.2517-6161.1996.tb02080.x

work page doi:10.1111/j.2517-6161.1996.tb02080.x 1996

[41] [41]

Aliferis, and Alexander Statnikov

Ioannis Tsamardinos, Constantin F. Aliferis, and Alexander Statnikov. 2003. Al- gorithms for Large Scale Markov Blanket Discovery. InProceedings of the Six- teenth International Florida Artificial Intelligence Research Society Conference (FLAIRS 2003). AAAI Press, Menlo Park, CA, 376–381. https://aaai.org/papers/ flairs-2003-073/

2003

[42] [42]

Ioannis Tsamardinos, Laura E Brown, and Constantin F Aliferis. 2006. The max-min hill-climbing Bayesian network structure learning algorithm.Machine Learning65, 1 (2006), 31–78

2006

[43] [43]

Tim Van den Bulcke, Koenraad Van Leemput, Bart Naudts, Piet van Remortel, Hongwu Ma, Alain Verschoren, Bart De Moor, and Kathleen Marchal. 2006. Syn- TReN: a generator of synthetic gene expression data for design and analysis of structure learning algorithms.BMC bioinformatics7, 1 (2006), 43

2006

[44] [44]

Thomas Verma and Judea Pearl. 1990. Equivalence and synthesis of causal mod- els.Proceedings of the Sixth Conference on Uncertainty in Artificial Intelligence (1990), 220–227

1990

[45] [45]

Wooldridge

Jefrey M. Wooldridge. 2010.Econometric Analysis of Cross Section and Panel Data. The MIT Press. http://www.jstor.org/stable/j.ctt5hhcfr

2010

[46] [46]

Liyuan Xu, Heishiro Kanagawa, and Arthur Gretton. 2021. Deep Proxy Causal Learning and its Application to Confounded Bandit Policy Evaluation. InAd- vances in Neural Information Processing Systems, M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (Eds.), Vol. 34. Curran As- sociates, Inc., 26264–26275. https://proceedings.neurips...

2021

[47] [47]

Naiyu Yin, Hanjing Wang, Yue Yu, Tian Gao, Amit Dhurandhar, and Qiang Ji

[48] [48]

InEuropean Conference on Computer Vision

Integrating Markov blanket discovery into causal representation learning for domain generalization. InEuropean Conference on Computer Vision. Springer, 271–288

[49] [49]

Kui Yu, Lin Liu, Jiuyong Li, Weiping Ding, and Thuc Duy Le. 2020. Causality- based feature selection: Methods and evaluations.Comput. Surveys53, 5 (2020), 1–36

2020