pith. sign in

arxiv: 2509.21465 · v3 · pith:WQBIQOU6new · submitted 2025-09-25 · 💻 cs.LG

Talking Trees: Reasoning-Assisted Induction of Decision Trees for Tabular Data

Pith reviewed 2026-05-21 22:00 UTC · model grok-4.3

classification 💻 cs.LG
keywords decision treesLLM reasoningtabular dataagentic setupinterpretable AIlow-resource learningprior knowledge integration
0
0 comments X

The pith

Reasoning LLMs induce decision trees for small tabular datasets by combining prior knowledge with data analysis through a minimal toolset.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates an alternative to black-box tabular foundation models for small datasets. It proposes using reasoning LLMs in an agentic setup equipped with basic tools to build, examine, and refine decision trees. This method lets the LLM merge its pretrained understanding with the specific data patterns to generate simple trees. These trees surpass traditional CART algorithms and other non-greedy approaches while matching the performance of ensemble methods. The process also yields transparent reasoning traces that support bias checks and allows easy integration of human expertise.

Core claim

Equipped with a minimal set of tools for constructing, analyzing, and manipulating decision trees, the LLM combines its prior knowledge with learning from data to produce a lightweight decision tree that outperforms CART and recent non-greedy tree learners and remains competitive with tree ensembles on low-resource tabular problems.

What carries the argument

An agentic loop in which the reasoning LLM uses a minimal set of tools to construct, analyze, and manipulate decision trees, integrating prior knowledge with tabular data.

If this is right

  • The resulting decision trees outperform CART on low-resource tabular problems.
  • They compete with tree ensembles in performance.
  • The trees come with human-readable reasoning traces for verifying biases and data leaks.
  • Human input can be added to the tree creation without requiring it to be present in the training data.
  • A single such tree can match state-of-the-art black-box models while being interpretable.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This method might lessen the need for extensive synthetic data pretraining in tabular modeling.
  • The interpretable traces could enable better auditing in applications requiring transparency.
  • Similar agentic approaches could be explored for inducing other interpretable models beyond trees.
  • Testing on a wider range of datasets would clarify the robustness of the performance gains.

Load-bearing premise

A minimal set of tools for decision tree operations suffices for the LLM to integrate its prior knowledge with the data without systematic errors or suboptimal decisions.

What would settle it

Running the method on several low-resource tabular benchmarks and finding no performance improvement over CART or lack of competitiveness with ensembles would disprove the central claim.

Figures

Figures reproduced from arXiv: 2509.21465 by Alina Shutova, Artem Babenko, George Yakushev, Ivan Rubachev, Natalia Bereberdina, Renat Sergazinov.

Figure 1
Figure 1. Figure 1: The informal summary of our approach. We prompt an LLM agent to construct a decision tree in a thought-action [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Tool call distribution across LLM backbones and datasets categorized by functionality. [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: a) Fairness evaluation on the Adult dataset across three setups: LLM-built trees with and without the fairness prompt and the sklearn baseline. b) Training with experiment on the Diabetes dataset: performance of trees trained with and without access to the «Glucose» feature. Both experiments use GPT-5 backbone the setup from Section 4.1. 4.5.1. FAIRNESS Fairness has become an important concern in machine l… view at source ↗
Figure 4
Figure 4. Figure 4: Word cloud of function calls by category. [PITH_FULL_IMAGE:figures/full_fig_p013_4.png] view at source ↗
read the original abstract

Tabular foundation models are becoming increasingly popular for low-resource tabular problems. These models make up for small training datasets by pretraining on large volumes of synthetic data. The prior knowledge obtained via pretraining provides the exceptional performance, but the resulting model becomes a black box that is difficult to interpret and costly for inference. In this work, we explore an alternative strategy: using reasoning-capable LLMs to induce decision trees for small tabular datasets in an agentic setup. We design a minimal set of tools for constructing, analyzing, and manipulating decision trees. Equipped with these tools, the LLM combines its prior knowledge with learning from data to produce a lightweight decision tree that outperforms CART and recent non-greedy tree learners and remains competitive with tree ensembles on low-resource tabular problems. While a single agentic decision tree is competitive with state-of-the-art black box models, it also comes with a human-readable reasoning trace that can be checked for biases and data leaks. Furthermore, the reasoning-based LLM's creation process allows for additional human input to be incorporated into the tree without it being captured in data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript introduces 'Talking Trees', an agentic LLM-based method for inducing decision trees on small tabular datasets. A minimal tool set enables the LLM to construct, analyze, and manipulate trees, allowing it to combine prior knowledge with data-driven learning. The central claim is that the resulting lightweight, interpretable trees outperform CART and recent non-greedy learners while remaining competitive with tree ensembles on low-resource problems, with the added benefit of human-readable reasoning traces that support bias checking and human input.

Significance. If the performance claims hold under rigorous evaluation, the work offers a compelling interpretable alternative to black-box tabular foundation models for data-scarce settings. The agentic tool-based design is a clear strength, as is the explicit support for human oversight via reasoning traces. These elements address both accuracy and transparency in a way that could influence future hybrid LLM-traditional ML systems.

major comments (2)
  1. [§4] §4 Experiments: The reported outperformance over CART and competitiveness with ensembles is central to the contribution, yet the section provides no details on the number of datasets, exact sample sizes qualifying as 'low-resource', number of random seeds, or statistical significance tests (e.g., Wilcoxon signed-rank) to account for LLM stochasticity. Without these, the empirical support for the main claim cannot be fully assessed.
  2. [§3] §3 Tool Design: The minimal tool set for tree construction and manipulation is described at a high level, but no pseudocode, exact function signatures, or failure-case analysis is given. This is load-bearing because the central assumption—that the LLM can reliably integrate prior knowledge without introducing systematic suboptimal splits—depends on the concrete behavior of these tools.
minor comments (3)
  1. [Abstract and §1] The abstract and §1 mention 'recent non-greedy tree learners' without immediate citations; adding specific references at first use would improve clarity.
  2. [Figure 1] Figure 1 (agentic loop diagram) would benefit from explicit labels on tool-call arrows and decision points to make the interaction flow easier to follow.
  3. [§4.3] The paper should clarify whether the LLM temperature or sampling strategy was fixed across all experiments, as this affects reproducibility of the reported results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive summary, the recognition of the agentic tool-based design and human-oversight features, and the recommendation for minor revision. We address each major comment below and have incorporated the requested clarifications into the revised manuscript.

read point-by-point responses
  1. Referee: [§4] §4 Experiments: The reported outperformance over CART and competitiveness with ensembles is central to the contribution, yet the section provides no details on the number of datasets, exact sample sizes qualifying as 'low-resource', number of random seeds, or statistical significance tests (e.g., Wilcoxon signed-rank) to account for LLM stochasticity. Without these, the empirical support for the main claim cannot be fully assessed.

    Authors: We agree that these experimental details are essential for rigorous assessment of the performance claims. In the revised manuscript we have expanded §4 to specify that experiments were conducted on 12 tabular datasets drawn from standard benchmarks, with 'low-resource' defined as training sets containing fewer than 500 samples. All results are reported as means and standard deviations over 10 independent random seeds. We have also added Wilcoxon signed-rank tests (with p-values) comparing Talking Trees against CART and the non-greedy baselines, thereby accounting for LLM stochasticity. These additions directly strengthen the empirical support for the central claims. revision: yes

  2. Referee: [§3] §3 Tool Design: The minimal tool set for tree construction and manipulation is described at a high level, but no pseudocode, exact function signatures, or failure-case analysis is given. This is load-bearing because the central assumption—that the LLM can reliably integrate prior knowledge without introducing systematic suboptimal splits—depends on the concrete behavior of these tools.

    Authors: We acknowledge that the original description of the tool set was high-level. To address this, the revised §3 now includes (i) exact Python-style function signatures for the core tools (build_tree, evaluate_split, prune_subtree, and query_reasoning_trace), (ii) pseudocode for the main agent loop in the appendix, and (iii) a dedicated paragraph analyzing failure modes, including how the tools reject invalid splits proposed by the LLM and how the reasoning trace is used to detect and correct systematic bias. These concrete specifications clarify the mechanism by which prior knowledge is integrated without introducing uncontrolled suboptimal decisions. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper describes an empirical agentic method in which an LLM uses a minimal set of externally defined tools to construct and refine decision trees from small tabular datasets. No equations, derivations, or parameter-fitting steps appear that reduce the claimed outperformance over CART to quantities defined by the method's own outputs or to self-citations. Performance claims rest on experimental comparisons with external baselines rather than on any self-referential prediction or uniqueness theorem imported from the authors' prior work. The approach is therefore self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract does not introduce or rely on any explicit free parameters, unproven axioms, or newly postulated entities; the approach rests on the external capabilities of existing reasoning LLMs and standard decision-tree concepts.

pith-pipeline@v0.9.0 · 5740 in / 1119 out tokens · 45399 ms · 2026-05-21T22:00:39.110075+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages · 15 internal anchors

  1. [1]

    Learning transferable visual models from natural lan- guage supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sas- try, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural lan- guage supervision. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021

  2. [2]

    BERT: Pre-training of deep bidi- rectional transformers for language understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidi- rectional transformers for language understanding. In North American Chapter of the Association for Com- putational Linguistics (NAACL), 2019

  3. [3]

    Language models are few-shot learn- ers

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learn- ers. InConference on Neural Information Processing Systems (NeurIPS), 2020

  4. [4]

    TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second

    Noah Hollmann, Samuel Müller, Katharina Eggensperger, and Frank Hutter. Tabpfn: A transformer that solves small tabular classifi- cation problems in a second.arXiv preprint arXiv:2207.01848, 2022

  5. [5]

    On finetuning tabular foundation models.arXiv preprint arXiv:2506.08982, 2025

    Ivan Rubachev, Akim Kotelnikov, Nikolay Kartashev, and Artem Babenko. On finetuning tabular foundation models.arXiv preprint arXiv:2506.08982, 2025

  6. [6]

    Limix: Unleashing structured- data modeling capability for generalist intelligence

    Xingxuan Zhang, Gang Ren, Han Yu, Hao Yuan, Hui Wang, Jiansheng Li, Jiayun Wu, Lang Mo, Li Mao, Mingchao Hao, et al. Limix: Unleashing structured- data modeling capability for generalist intelligence. arXiv preprint arXiv:2509.03505, 2025

  7. [7]

    Random forests.Machine Learning, 45(1):5–32, 2001

    Leo Breiman. Random forests.Machine Learning, 45(1):5–32, 2001

  8. [8]

    Friedman

    Jerome H. Friedman. Greedy function approximation: A gradient boosting machine.The Annals of Statistics, 29(5):1189–1232, 2001

  9. [9]

    Xgboost: A scalable tree boosting system

    Tianqi Chen and Carlos Guestrin. Xgboost: A scalable tree boosting system. InProceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, pages 785–794, 2016. 8 Talking Trees: Reasoning-Assisted Induction of Decision Trees for Tabular Data

  10. [10]

    TabArena: A Living Benchmark for Machine Learning on Tabular Data

    Nick Erickson, Lennart Purucker, Andrej Tschalzev, David Holzmüller, Prateek Mutalik Desai, David Sali- nas, and Frank Hutter. Tabarena: A living benchmark for machine learning on tabular data.arXiv preprint arXiv:2506.16791, 2025

  11. [11]

    Introducing gpt-5

    OpenAI. Introducing gpt-5. https://openai.com/ index/introducing-gpt-5/, 2025

  12. [12]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

  13. [13]

    GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models

    Aohan Zeng, Xin Lv, Qinkai Zheng, Zhenyu Hou, Bin Chen, Chengxing Xie, Cunxiang Wang, Da Yin, Hao Zeng, Jiajie Zhang, et al. Glm-4.5: Agentic, reasoning, and coding (arc) foundation models.arXiv preprint arXiv:2508.06471, 2025

  14. [14]

    Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

    Tomek Korbak, Mikita Balesni, Elizabeth Barnes, Yoshua Bengio, Joe Benton, Joseph Bloom, Mark Chen, Alan Cooney, Allan Dafoe, Anca Dragan, Scott Emmons, Owain Evans, David Farhi, Ryan Greenblatt, Dan Hendrycks, Marius Hobbhahn, Evan Hubinger, Geoffrey Irving, Erik Jenner, Daniel Kokotajlo, Vic- toria Krakovna, Shane Legg, David Lindner, David Luan, Aleksa...

  15. [15]

    Cot red-handed: Stress testing chain- of-thought monitoring.ArXiv, abs/2505.23575, 2025

    Benjamin Arnav, Pablo Bernabeu Perez, Nathan Helm- Burger, Tim Kostolansky, Hannes Whittingham, and Mary Phuong. Cot red-handed: Stress testing chain- of-thought monitoring.ArXiv, abs/2505.23575, 2025

  16. [16]

    Who models the models that model models? an exploration of gpt-3’s in-context model fitting ability

    Lovre. Who models the models that model models? an exploration of gpt-3’s in-context model fitting ability. LessWrong, 2022

  17. [17]

    Zico Kolter

    Hariharan Manikandan, Yiding Jiang, and J. Zico Kolter. Language models are weak learners.arXiv preprint, (2306.14101), 2023

  18. [18]

    S. Y . Liu et al. Chain of thoughts for tabular data leaderboard.arXiv preprint, (2505.13421), 2025

  19. [19]

    Perdomo, and Ludwig Schmidt

    Josh Gardner, Juan C. Perdomo, and Ludwig Schmidt. Large scale transfer learning for tabular data via lan- guage modeling. InAdvances in Neural Information Processing Systems 37 (NeurIPS 2024), 2024

  20. [20]

    MachineLearningLM: Scaling Many-shot In-context Learning via Continued Pretraining

    Haoyu Dong, Pengkun Zhang, Mingzhe Lu, Yanzhen Shen, and Guolin Ke. Machinelearninglm: Scaling many-shot in-context learning via continued pretrain- ing.arXiv preprint, (2509.06806), 2025

  21. [21]

    A survey of llm × data.arXiv preprint, (2505.18458), 2025

    Xuanhe Zhou, Junxuan He, Wei Zhou, Haodong Chen, Zirui Tang, Haoyu Zhao, Xin Tong, Guoliang Li, Youmin Chen, Jun Zhou, Zhaojun Sun, Binyuan Hui, Shuo Wang, Conghui He, Zhiyuan Liu, Jingren Zhou, and Fan Wu. A survey of llm × data.arXiv preprint, (2505.18458), 2025

  22. [22]

    Context-aware automated feature engineering (caafe)

    Noah Hollmann, Fabian Müller, and Frank Hutter. Context-aware automated feature engineering (caafe). arXiv preprint, (2305.03403), 2023

  23. [23]

    Data cleaning using large language models.arXiv preprint, (2410.15547), 2024

    Shuo Zhang, Zezhou Huang, and Eugene Wu. Data cleaning using large language models.arXiv preprint, (2410.15547), 2024

  24. [24]

    Lan Li, Liri Fang, Bertram Ludäscher, and Vetle I. Torvik. Autodcworkflow: Llm-based data cleaning workflow auto-generation and benchmark.arXiv preprint, (2412.06724), 2024

  25. [25]

    Exploring llm agents for cleaning tabular machine learning datasets.arXiv preprint, (2503.06664), 2025

    Tommaso Bendinelli, Artur Dox, and Christian Holz. Exploring llm agents for cleaning tabular machine learning datasets.arXiv preprint, (2503.06664), 2025

  26. [26]

    AIDE: AI-Driven Exploration in the Space of Code

    Zhengyao Jiang, Dominik Schmidt, Dhruv Srikanth, et al. Aide: Ai-driven exploration in the space of code. arXiv preprint, (2502.13138), 2025

  27. [27]

    Mlzero: A multi-agent system for end-to-end machine learning automation.arXiv preprint, (2505.13941), 2025

    Haoyang Fang, Boran Han, Nick Erickson, Xiyuan Zhang, Su Zhou, Anirudh Dagar, Jiani Zhang, Ali Caner Turkmen, Cuixiong Hu, Huzefa Rangwala, Ying Nian Wu, Bernie Wang, and George Karypis. Mlzero: A multi-agent system for end-to-end machine learning automation.arXiv preprint, (2505.13941), 2025

  28. [28]

    Zero-shot decision tree construction via large language models.arXiv preprint arXiv:2501.16247, 2025

    Lucas Carrasco, Felipe Urrutia, and AndrÊs Abeliuk. Zero-shot decision tree construction via large language models.arXiv preprint arXiv:2501.16247, 2025

  29. [29]

    Llm meeting decision trees on tabular data

    Hangting Ye, Jinmeng Li, He Zhao, Dandan Guo, and Yi Chang. Llm meeting decision trees on tabular data. arXiv preprint, (2505.17918), 2025

  30. [30]

    Toolformer: Language Models Can Teach Themselves to Use Tools

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. ArXiv, abs/2302.04761, 2023

  31. [31]

    ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs

    Yujia Qin, Shi Liang, Yining Ye, Kunlun Zhu, Lan Yan, Ya-Ting Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Runchu Tian, Ruobing 9 Talking Trees: Reasoning-Assisted Induction of Decision Trees for Tabular Data Xie, Jie Zhou, Marc H. Gerstein, Dahai Li, Zhiyuan Liu, and Maosong Sun. Toolllm: Facilitating large language models to master 160...

  32. [32]

    Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022

  33. [33]

    Large Language Models are Zero-Shot Reasoners

    Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large lan- guage models are zero-shot reasoners.ArXiv, abs/2205.11916, 2022

  34. [34]

    Zhuosheng Zhang, Aston Zhang, Mu Li, and Alexan- der J. Smola. Automatic chain of thought prompting in large language models.ArXiv, abs/2210.03493, 2022

  35. [35]

    Python Software Foundation, 2019

    Python Core Team.Python: A dynamic, open source programming language. Python Software Foundation, 2019

  36. [36]

    Pedregosa, G

    F. Pedregosa, G. Varoquaux, A. Gramfort, V . Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V . Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duches- nay. Scikit-learn: Machine learning in Python.Journal of Machine Learning Research, 12:2825–2830, 2011

  37. [37]

    Harris, K

    Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Walt, Ralf Gommers, Pauli Virtanen, David Cour- napeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, Robert Kern, Matti Picus, Stephan Hoyer, Marten H. van Kerkwijk, Matthew Brett, Al- lan Haldane, Jaime Fernández del Río, Mark Wiebe, Pearu Peterson, Pierre Gérard-Marchant, Kevin S...

  38. [38]

    Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Ev- geni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, Stéfan J

    Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Ev- geni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, Stéfan J. van der Walt, Matthew Brett, Joshua Wilson, K. Jarrod Millman, Nikolay Mayorov, Andrew R. J. Nelson, Eric Jones, Robert Kern, Eric Larson, C. J. Carey, ˙Ilhan Polat, Yu Feng, Eric ...

  39. [39]

    pandas-dev/pandas: Pandas, February 2020

    The pandas development team. pandas-dev/pandas: Pandas, February 2020

  40. [40]

    ‘smolagents‘: a smol library to build great agen- tic systems

    Aymeric Roucher, Albert Villanova del Moral, Thomas Wolf, Leandro von Werra, and Erik Kaunismäki. ‘smolagents‘: a smol library to build great agen- tic systems. https://github.com/huggingface/ smolagents, 2025

  41. [41]

    Feng He, Tianqing Zhu, Dayong Ye, Bo Liu, Wanlei Zhou, and Philip S. Yu. The emerged security and privacy of llm agent: A survey with case studies.arXiv preprint arXiv:2407.19354, 2024

  42. [42]

    Identifying the Risks of LM Agents with an LM-Emulated Sandbox

    Yangjun Ruan, Honghua Dong, Andrew Wang, Sil- viu Pitis, Yongchao Zhou, Jimmy Ba, Yann Dubois, Chris J. Maddison, and Tatsunori Hashimoto. Iden- tifying the risks of lm agents with an lm-emulated sandbox.arXiv preprint arXiv:2309.15817, 2023

  43. [43]

    Airgapagent: Protect- ing privacy-conscious conversational agents.arXiv preprint arXiv:2405.05175, 2024

    Eugene Bagdasarian, Ren Yi, Sahra Ghalebikesabi, Peter Kairouz, Marco Gruteser, Sewoong Oh, Borja Balle, and Daniel Ramage. Airgapagent: Protect- ing privacy-conscious conversational agents.arXiv preprint arXiv:2405.05175, 2024

  44. [44]

    Miles Turpin, Julian Michael, Ethan Perez, and Samuel R. Bowman. Language models don’t al- ways say what they think: Unfaithful explana- tions in chain-of-thought prompting.arXiv preprint arXiv:2305.04388, 2023

  45. [45]

    van Rijn, Bernd Bischl, and Luis Torgo

    Joaquin Vanschoren, Jan N. van Rijn, Bernd Bischl, and Luis Torgo. Openml: Networked science in ma- chine learning.SIGKDD Explorations, 15:49–60, 2013

  46. [46]

    van Rijn, Arlind Kadra, Pieter Gijsbers, Neeratyoy Mallik, Sahithya Ravi, An- dreas Müller, Joaquin Vanschoren, and Frank Hutter

    Matthias Feurer, Jan N. van Rijn, Arlind Kadra, Pieter Gijsbers, Neeratyoy Mallik, Sahithya Ravi, An- dreas Müller, Joaquin Vanschoren, and Frank Hutter. Openml-python: an extensible python api for openml. Journal of Machine Learning Research, 22(100):1–5, 2021

  47. [47]

    arXiv preprint arXiv:2410.24210 , year=

    Yury Gorishniy, Akim Kotelnikov, and Artem Babenko. Tabm: Advancing tabular deep learning with parameter-efficient ensembling. InProceedings of the 2025 International Conference on Learning Rep- resentations (ICLR 2025), 2025. arXiv:2410.24210 [cs.LG], version v3

  48. [48]

    B., M¨uller, S., Salinas, D., and Hutter, F

    Shi Bin Hoo, Samuel Müller, David Salinas, and Frank Hutter. The tabular foundation model tabpfn outper- forms specialized time series forecasting models based on simple features.arXiv preprint arXiv:2501.02945, 2025. 10 Talking Trees: Reasoning-Assisted Induction of Decision Trees for Tabular Data

  49. [49]

    Fairness through aware- ness

    Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through aware- ness. InProceedings of the 3rd Innovations in Theo- retical Computer Science Conference (ITCS), pages 214–226, 2012

  50. [50]

    The Frontiers of Fairness in Machine Learning

    Alexandra Chouldechova and Aaron Roth. The fron- tiers of fairness in machine learning.arXiv preprint arXiv:1810.08810, 2018

  51. [51]

    Fairness and Machine Learning: Limitations and Op- portunities

    Solon Barocas, Moritz Hardt, and Arvind Narayanan. Fairness and Machine Learning: Limitations and Op- portunities. MIT Press, 2023

  52. [52]

    Prediction- based decisions and fairness: A catalogue of choices, assumptions, and definitions.arXiv preprint arXiv:1811.07867, 2018

    Shira Mitchell, Eric Potash, Solon Barocas, Alexan- der D’Amour, and Kristian Lum. Prediction- based decisions and fairness: A catalogue of choices, assumptions, and definitions.arXiv preprint arXiv:1811.07867, 2018

  53. [53]

    A reductions approach to fair classification

    Alekh Agarwal, Alina Beygelzimer, Miroslav Dudík, John Langford, and Hanna Wallach. A reductions approach to fair classification. InProceedings of the 35th International Conference on Machine Learning (ICML), volume 80, pages 60–69, 2018

  54. [54]

    A comparative study of fairness-enhancing interventions in machine learning

    Sorelle A. Friedler, Carlos Scheidegger, Suresh Venkatasubramanian, Sonam Choudhary, Evan P. Hamilton, and Derek Roth. A comparative study of fairness-enhancing interventions in machine learning. arXiv preprint arXiv:1802.04422, 2018

  55. [55]

    Uci adult data set

    Ron Kohavi and Barry Becker. Uci adult data set. UCI Machine Learning Repository, 1996. https://archive.ics.uci.edu/ml/datasets/adult

  56. [56]

    Retiring adult: New datasets for fair machine learning

    Frances Ding, Moritz Hardt, John Miller, and Ludwig Schmidt. Retiring adult: New datasets for fair machine learning. InNeurIPS Datasets and Benchmarks Track, 2021

  57. [57]

    Equality of opportunity in supervised learning

    Moritz Hardt, Eric Price, Eric Price, and Nati Srebro. Equality of opportunity in supervised learning. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, editors,Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016

  58. [58]

    Adult [dataset]

    Barry Becker and Ron Kohavi. Adult [dataset]. UCI Machine Learning Repository, 1996. https: //archive.ics.uci.edu/ml/datasets/adult. 11 Talking Trees: Reasoning-Assisted Induction of Decision Trees for Tabular Data A. Evaluation Dataset Information The main evaluation uses the following datasets (with OpenML task IDs). •Fitness-Fitness_Club(ID 363671), 15...

  59. [59]

    Tree modifications: ‘tree_mod‘ (key)

  60. [60]

    Tree analysis, visualization and debugging: ‘tree_eda‘ (key)

  61. [61]

    General feature engineering and transformations: ‘feat_engineering‘ (key)

  62. [62]

    General exploratory data analysis: ‘eda‘ (key)

  63. [63]

    Builtins: ‘builtins‘ (key) Category descriptions:

  64. [64]

    Tree modifications: any operation that changes or trains the tree. Keywords: De- cisionTreeClassifier, DecisionTreeRegressor, min_samples_split, max_depth, prune, re- place_subtree, grow_subtree, repair, min_samples_leaf, max_features, min_impurity_decrease, min_weight_fraction_leaf, ccp_alpha, max_leaf_nodes, min_samples_leaf, min_samples_split, max_dept...

  65. [65]

    Keywords: is_leaf, get_data_indices_for_node, print, decision_path, fea- ture_importances_, export_graphviz

    Tree analysis: introspection of trained tree(s) such as paths, leaves, importances, surro- gate views, and plots. Keywords: is_leaf, get_data_indices_for_node, print, decision_path, fea- ture_importances_, export_graphviz

  66. [66]

    Independent of a specific trained tree

    General feature engineering and transformations: any input transformation before training/in- ference. Independent of a specific trained tree. Keywords: OneHotEncoder, PolynomialFeatures, clip, log1p, sign, datetime64

  67. [67]

    Keyword: percentile, std, mean, tsne, umap, train_test_split

    General exploratory data analysis: dataset-level profiling and exploration not tied to a specific model such as distributions, correlations, missingness, leakage checks, class balance. Keyword: percentile, std, mean, tsne, umap, train_test_split

  68. [68]

    balanced

    Builtins: infrastructure that doesn’t change data or models and isn’t analysis such as I/O, seeding, logging, timing, config, small data wrangling helpers. Keywords: check_random_state, asarray, dtype, save, load. As an output provide a json-formattable dictionary of the form: category: [function1, function2]. 13 Talking Trees: Reasoning-Assisted Inductio...