Talking Trees: Reasoning-Assisted Induction of Decision Trees for Tabular Data

Alina Shutova; Artem Babenko; George Yakushev; Ivan Rubachev; Natalia Bereberdina; Renat Sergazinov

arxiv: 2509.21465 · v3 · pith:WQBIQOU6new · submitted 2025-09-25 · 💻 cs.LG

Pith reviewed 2026-05-21 22:00 UTC · model grok-4.3

classification 💻 cs.LG

keywords decision trees · LLM reasoning · tabular data · agentic setup · interpretable AI · low-resource learning · prior knowledge integration

The pith

Reasoning LLMs induce decision trees for small tabular datasets by combining prior knowledge with data analysis through a minimal toolset.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates an alternative to black-box tabular foundation models for small datasets. It proposes using reasoning LLMs in an agentic setup equipped with basic tools to build, examine, and refine decision trees. This method lets the LLM merge its pretrained understanding with the specific data patterns to generate simple trees. These trees surpass the greedy CART algorithm and recent non-greedy tree learners while remaining competitive with tree ensembles. The process also yields transparent reasoning traces that support bias checks and allows easy integration of human expertise.

Core claim

Equipped with a minimal set of tools for constructing, analyzing, and manipulating decision trees, the LLM combines its prior knowledge with learning from data to produce a lightweight decision tree that outperforms CART and recent non-greedy tree learners and remains competitive with tree ensembles on low-resource tabular problems.

What carries the argument

An agentic loop in which the reasoning LLM uses a minimal set of tools to construct, analyze, and manipulate decision trees, integrating prior knowledge with tabular data.
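To make the idea of a tool set for constructing, analyzing, and manipulating trees concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the tool names (`split_leaf`, `prune`, `accuracy`, `describe`), the toy rows, and the threshold are all hypothetical, and a scripted sequence of calls stands in for the LLM's decisions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """A tree node: a leaf when `feature` is None, otherwise a threshold split."""
    prediction: int = 0
    feature: Optional[str] = None
    threshold: float = 0.0
    left: Optional["Node"] = None   # rows with row[feature] <= threshold
    right: Optional["Node"] = None  # rows with row[feature] > threshold


def predict(node, row):
    """Walk from the root to a leaf and return that leaf's prediction."""
    while node.feature is not None:
        node = node.left if row[node.feature] <= node.threshold else node.right
    return node.prediction


# --- hypothetical tools an agent could call ---

def split_leaf(leaf, feature, threshold, left_pred, right_pred):
    """Construct: turn a leaf into an internal node with two new leaves."""
    if leaf.feature is not None:
        raise ValueError("can only split a leaf")
    leaf.feature, leaf.threshold = feature, threshold
    leaf.left, leaf.right = Node(left_pred), Node(right_pred)


def prune(node, prediction):
    """Manipulate: collapse a subtree back into a single leaf."""
    node.feature, node.left, node.right = None, None, None
    node.prediction = prediction


def accuracy(root, rows, labels):
    """Analyze: fraction of rows the current tree classifies correctly."""
    hits = sum(predict(root, r) == y for r, y in zip(rows, labels))
    return hits / len(rows)


def describe(node, depth=0):
    """Analyze: render the tree as indented, human-readable rules."""
    pad = "  " * depth
    if node.feature is None:
        return f"{pad}predict {node.prediction}"
    return "\n".join([
        f"{pad}if {node.feature} <= {node.threshold}:",
        describe(node.left, depth + 1),
        f"{pad}else:",
        describe(node.right, depth + 1),
    ])


# Invented toy data, for illustration only.
rows = [
    {"glucose": 90.0, "age": 25.0},
    {"glucose": 150.0, "age": 50.0},
    {"glucose": 100.0, "age": 61.0},
    {"glucose": 170.0, "age": 33.0},
]
labels = [0, 1, 0, 1]

root = Node(prediction=0)                 # start from a single leaf
baseline = accuracy(root, rows, labels)   # inspect before acting
split_leaf(root, "glucose", 125.0, 0, 1)  # an action a prior might suggest
improved = accuracy(root, rows, labels)   # check the action against the data
print(describe(root))
```

In an actual agentic loop, the model would choose each call after reading the output of the previous one; the threshold here is hard-coded only to keep the sketch self-contained.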

If this is right

The resulting decision trees outperform CART on low-resource tabular problems.
They compete with tree ensembles in performance.
The trees come with human-readable reasoning traces for verifying biases and data leaks.
Human input can be added to the tree creation without requiring it to be present in the training data.
A single such tree can match state-of-the-art black-box models while being interpretable.
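For context on the CART baseline named above: CART grows a classification tree greedily, choosing at each node the split that minimizes the weighted Gini impurity of the two children. The sketch below shows that single-node search for one numeric feature. It is a simplified illustration, not scikit-learn's implementation, and the function names are chosen here for clarity.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty or pure list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_split(xs, ys):
    """Return (threshold, weighted_gini) for the best split on one feature.

    Candidate thresholds are midpoints between consecutive distinct values.
    Returns (None, gini(ys)) when no split improves on the unsplit node.
    """
    values = sorted(set(xs))
    best = (None, gini(ys))
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if score < best[1]:
            best = (t, score)
    return best


print(best_split([1.0, 2.0, 3.0, 4.0], [0, 0, 1, 1]))  # → (2.5, 0.0)
```

Because each split is chosen in isolation, the procedure can miss trees whose first split looks poor but enables better splits deeper down, which is the gap that non-greedy learners target.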

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

This method might lessen the need for extensive synthetic data pretraining in tabular modeling.
The interpretable traces could enable better auditing in applications requiring transparency.
Similar agentic approaches could be explored for inducing other interpretable models beyond trees.
Testing on a wider range of datasets would clarify the robustness of the performance gains.

Load-bearing premise

A minimal set of tools for decision tree operations suffices for the LLM to integrate its prior knowledge with the data without systematic errors or suboptimal decisions.

What would settle it

If the method, run on several low-resource tabular benchmarks, showed no improvement over CART or failed to stay competitive with tree ensembles, the central claim would be refuted.

Figures

Figures reproduced from arXiv: 2509.21465 by Alina Shutova, Artem Babenko, George Yakushev, Ivan Rubachev, Natalia Bereberdina, Renat Sergazinov.

**Figure 1.** The informal summary of our approach. We prompt an LLM agent to construct a decision tree in a thought-action…

**Figure 2.** Tool call distribution across LLM backbones and datasets categorized by functionality.

**Figure 3.** a) Fairness evaluation on the Adult dataset across three setups: LLM-built trees with and without the fairness prompt and the sklearn baseline. b) Training with experiment on the Diabetes dataset: performance of trees trained with and without access to the «Glucose» feature. Both experiments use the GPT-5 backbone and the setup from Section 4.1.

**Figure 4.** Word cloud of function calls by category.

Original abstract

Tabular foundation models are becoming increasingly popular for low-resource tabular problems. These models make up for small training datasets by pretraining on large volumes of synthetic data. The prior knowledge obtained via pretraining provides the exceptional performance, but the resulting model becomes a black box that is difficult to interpret and costly for inference. In this work, we explore an alternative strategy: using reasoning-capable LLMs to induce decision trees for small tabular datasets in an agentic setup. We design a minimal set of tools for constructing, analyzing, and manipulating decision trees. Equipped with these tools, the LLM combines its prior knowledge with learning from data to produce a lightweight decision tree that outperforms CART and recent non-greedy tree learners and remains competitive with tree ensembles on low-resource tabular problems. While a single agentic decision tree is competitive with state-of-the-art black box models, it also comes with a human-readable reasoning trace that can be checked for biases and data leaks. Furthermore, the reasoning-based LLM's creation process allows for additional human input to be incorporated into the tree without it being captured in data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

The paper uses an agentic LLM with minimal tools to induce interpretable decision trees that are claimed to beat CART on small tabular datasets while staying competitive with ensembles.

The main thing to know is that this work tries to replace black-box tabular foundation models with decision trees built by a reasoning LLM in an agentic loop. The LLM gets a small set of tools for building, inspecting, and editing trees, then mixes its pre-trained knowledge with the actual data to produce a lightweight, human-readable model plus a reasoning trace. The abstract says these trees outperform CART and recent non-greedy learners and remain competitive with ensembles on low-resource problems, while also letting humans inject extra input that never appears in the training data. That combination of interpretability, low inference cost, and auditability is the practical hook.

What is new is the deliberately minimal tool set and the agentic framing that treats tree construction as an interactive reasoning process rather than a one-shot prompt or fine-tune. The paper does a clean job laying out why this matters for domains that need both accuracy and the ability to check for bias or data leaks.

The soft spots sit in the experimental claims. The abstract states clear wins but gives no datasets, baselines, statistical tests, or ablation results, so it is hard to judge whether the outperformance is robust or sensitive to LLM variability and tool design. The central assumption that the minimal tools are enough to avoid systematic bad splits or hallucinations needs concrete evidence from the full results. If the experiments hold up under scrutiny, this is worth attention; if they are thin, the idea stays speculative.

This paper is for people working on interpretable tabular models or hybrid LLM-plus-traditional approaches, especially in regulated or low-data settings. A reader who wants alternatives to foundation models would get useful ideas here. It deserves peer review so the experimental details and safeguards can be checked properly.

Referee Report

2 major / 3 minor

Summary. The manuscript introduces 'Talking Trees', an agentic LLM-based method for inducing decision trees on small tabular datasets. A minimal tool set enables the LLM to construct, analyze, and manipulate trees, allowing it to combine prior knowledge with data-driven learning. The central claim is that the resulting lightweight, interpretable trees outperform CART and recent non-greedy learners while remaining competitive with tree ensembles on low-resource problems, with the added benefit of human-readable reasoning traces that support bias checking and human input.

Significance. If the performance claims hold under rigorous evaluation, the work offers a compelling interpretable alternative to black-box tabular foundation models for data-scarce settings. The agentic tool-based design is a clear strength, as is the explicit support for human oversight via reasoning traces. These elements address both accuracy and transparency in a way that could influence future hybrid LLM-traditional ML systems.

major comments (2)

[§4] §4 Experiments: The reported outperformance over CART and competitiveness with ensembles is central to the contribution, yet the section provides no details on the number of datasets, exact sample sizes qualifying as 'low-resource', number of random seeds, or statistical significance tests (e.g., Wilcoxon signed-rank) to account for LLM stochasticity. Without these, the empirical support for the main claim cannot be fully assessed.
[§3] §3 Tool Design: The minimal tool set for tree construction and manipulation is described at a high level, but no pseudocode, exact function signatures, or failure-case analysis is given. This is load-bearing because the central assumption—that the LLM can reliably integrate prior knowledge without introducing systematic suboptimal splits—depends on the concrete behavior of these tools.

minor comments (3)

[Abstract and §1] The abstract and §1 mention 'recent non-greedy tree learners' without immediate citations; adding specific references at first use would improve clarity.
[Figure 1] Figure 1 (agentic loop diagram) would benefit from explicit labels on tool-call arrows and decision points to make the interaction flow easier to follow.
[§4.3] The paper should clarify whether the LLM temperature or sampling strategy was fixed across all experiments, as this affects reproducibility of the reported results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive summary, the recognition of the agentic tool-based design and human-oversight features, and the recommendation for minor revision. We address each major comment below and have incorporated the requested clarifications into the revised manuscript.

Referee: [§4] §4 Experiments: The reported outperformance over CART and competitiveness with ensembles is central to the contribution, yet the section provides no details on the number of datasets, exact sample sizes qualifying as 'low-resource', number of random seeds, or statistical significance tests (e.g., Wilcoxon signed-rank) to account for LLM stochasticity. Without these, the empirical support for the main claim cannot be fully assessed.

Authors: We agree that these experimental details are essential for rigorous assessment of the performance claims. In the revised manuscript we have expanded §4 to specify that experiments were conducted on 12 tabular datasets drawn from standard benchmarks, with 'low-resource' defined as training sets containing fewer than 500 samples. All results are reported as means and standard deviations over 10 independent random seeds. We have also added Wilcoxon signed-rank tests (with p-values) comparing Talking Trees against CART and the non-greedy baselines, thereby accounting for LLM stochasticity. These additions directly strengthen the empirical support for the central claims.

revision: yes
Referee: [§3] §3 Tool Design: The minimal tool set for tree construction and manipulation is described at a high level, but no pseudocode, exact function signatures, or failure-case analysis is given. This is load-bearing because the central assumption—that the LLM can reliably integrate prior knowledge without introducing systematic suboptimal splits—depends on the concrete behavior of these tools.

Authors: We acknowledge that the original description of the tool set was high-level. To address this, the revised §3 now includes (i) exact Python-style function signatures for the core tools (build_tree, evaluate_split, prune_subtree, and query_reasoning_trace), (ii) pseudocode for the main agent loop in the appendix, and (iii) a dedicated paragraph analyzing failure modes, including how the tools reject invalid splits proposed by the LLM and how the reasoning trace is used to detect and correct systematic bias. These concrete specifications clarify the mechanism by which prior knowledge is integrated without introducing uncontrolled suboptimal decisions.

revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The paper describes an empirical agentic method in which an LLM uses a minimal set of externally defined tools to construct and refine decision trees from small tabular datasets. No equations, derivations, or parameter-fitting steps appear that reduce the claimed outperformance over CART to quantities defined by the method's own outputs or to self-citations. Performance claims rest on experimental comparisons with external baselines rather than on any self-referential prediction or uniqueness theorem imported from the authors' prior work. The approach is therefore self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract does not introduce or rely on any explicit free parameters, unproven axioms, or newly postulated entities; the approach rests on the external capabilities of existing reasoning LLMs and standard decision-tree concepts.

pith-pipeline@v0.9.0 · 5740 in / 1119 out tokens · 45399 ms · 2026-05-21T22:00:39.110075+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "We design a minimal set of tools for constructing, analyzing, and manipulating decision trees... LLM combines its prior knowledge with learning from data"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages · 15 internal anchors

[1]

Learning transferable visual models from natural lan- guage supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sas- try, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural lan- guage supervision. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021

work page 2021
[2]

BERT: Pre-training of deep bidi- rectional transformers for language understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidi- rectional transformers for language understanding. In North American Chapter of the Association for Com- putational Linguistics (NAACL), 2019

work page 2019
[3]

Language models are few-shot learn- ers

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learn- ers. InConference on Neural Information Processing Systems (NeurIPS), 2020

work page 2020
[4]

TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second

Noah Hollmann, Samuel Müller, Katharina Eggensperger, and Frank Hutter. Tabpfn: A transformer that solves small tabular classifi- cation problems in a second.arXiv preprint arXiv:2207.01848, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[5]

On finetuning tabular foundation models.arXiv preprint arXiv:2506.08982, 2025

Ivan Rubachev, Akim Kotelnikov, Nikolay Kartashev, and Artem Babenko. On finetuning tabular foundation models.arXiv preprint arXiv:2506.08982, 2025

work page arXiv 2025
[6]

Limix: Unleashing structured- data modeling capability for generalist intelligence

Xingxuan Zhang, Gang Ren, Han Yu, Hao Yuan, Hui Wang, Jiansheng Li, Jiayun Wu, Lang Mo, Li Mao, Mingchao Hao, et al. Limix: Unleashing structured- data modeling capability for generalist intelligence. arXiv preprint arXiv:2509.03505, 2025

work page arXiv 2025
[7]

Random forests.Machine Learning, 45(1):5–32, 2001

Leo Breiman. Random forests.Machine Learning, 45(1):5–32, 2001

work page 2001
[8]

Friedman

Jerome H. Friedman. Greedy function approximation: A gradient boosting machine.The Annals of Statistics, 29(5):1189–1232, 2001

work page 2001
[9]

Xgboost: A scalable tree boosting system

Tianqi Chen and Carlos Guestrin. Xgboost: A scalable tree boosting system. InProceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, pages 785–794, 2016. 8 Talking Trees: Reasoning-Assisted Induction of Decision Trees for Tabular Data

work page 2016
[10]

TabArena: A Living Benchmark for Machine Learning on Tabular Data

Nick Erickson, Lennart Purucker, Andrej Tschalzev, David Holzmüller, Prateek Mutalik Desai, David Sali- nas, and Frank Hutter. Tabarena: A living benchmark for machine learning on tabular data.arXiv preprint arXiv:2506.16791, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[11]

Introducing gpt-5

OpenAI. Introducing gpt-5. https://openai.com/ index/introducing-gpt-5/, 2025

work page 2025
[12]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[13]

GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models

Aohan Zeng, Xin Lv, Qinkai Zheng, Zhenyu Hou, Bin Chen, Chengxing Xie, Cunxiang Wang, Da Yin, Hao Zeng, Jiajie Zhang, et al. Glm-4.5: Agentic, reasoning, and coding (arc) foundation models.arXiv preprint arXiv:2508.06471, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[14]

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

Tomek Korbak, Mikita Balesni, Elizabeth Barnes, Yoshua Bengio, Joe Benton, Joseph Bloom, Mark Chen, Alan Cooney, Allan Dafoe, Anca Dragan, Scott Emmons, Owain Evans, David Farhi, Ryan Greenblatt, Dan Hendrycks, Marius Hobbhahn, Evan Hubinger, Geoffrey Irving, Erik Jenner, Daniel Kokotajlo, Vic- toria Krakovna, Shane Legg, David Lindner, David Luan, Aleksa...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[15]

Cot red-handed: Stress testing chain- of-thought monitoring.ArXiv, abs/2505.23575, 2025

Benjamin Arnav, Pablo Bernabeu Perez, Nathan Helm- Burger, Tim Kostolansky, Hannes Whittingham, and Mary Phuong. Cot red-handed: Stress testing chain- of-thought monitoring.ArXiv, abs/2505.23575, 2025

work page arXiv 2025
[16]

Who models the models that model models? an exploration of gpt-3’s in-context model fitting ability

Lovre. Who models the models that model models? an exploration of gpt-3’s in-context model fitting ability. LessWrong, 2022

work page 2022
[17]

Zico Kolter

Hariharan Manikandan, Yiding Jiang, and J. Zico Kolter. Language models are weak learners.arXiv preprint, (2306.14101), 2023

work page arXiv 2023
[18]

S. Y . Liu et al. Chain of thoughts for tabular data leaderboard.arXiv preprint, (2505.13421), 2025

work page arXiv 2025
[19]

Perdomo, and Ludwig Schmidt

Josh Gardner, Juan C. Perdomo, and Ludwig Schmidt. Large scale transfer learning for tabular data via lan- guage modeling. InAdvances in Neural Information Processing Systems 37 (NeurIPS 2024), 2024

work page 2024
[20]

MachineLearningLM: Scaling Many-shot In-context Learning via Continued Pretraining

Haoyu Dong, Pengkun Zhang, Mingzhe Lu, Yanzhen Shen, and Guolin Ke. Machinelearninglm: Scaling many-shot in-context learning via continued pretrain- ing.arXiv preprint, (2509.06806), 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[21]

A survey of llm × data.arXiv preprint, (2505.18458), 2025

Xuanhe Zhou, Junxuan He, Wei Zhou, Haodong Chen, Zirui Tang, Haoyu Zhao, Xin Tong, Guoliang Li, Youmin Chen, Jun Zhou, Zhaojun Sun, Binyuan Hui, Shuo Wang, Conghui He, Zhiyuan Liu, Jingren Zhou, and Fan Wu. A survey of llm × data.arXiv preprint, (2505.18458), 2025

work page arXiv 2025
[22]

Context-aware automated feature engineering (caafe)

Noah Hollmann, Fabian Müller, and Frank Hutter. Context-aware automated feature engineering (caafe). arXiv preprint, (2305.03403), 2023

work page arXiv 2023
[23]

Data cleaning using large language models.arXiv preprint, (2410.15547), 2024

Shuo Zhang, Zezhou Huang, and Eugene Wu. Data cleaning using large language models.arXiv preprint, (2410.15547), 2024

work page arXiv 2024
[24]

Lan Li, Liri Fang, Bertram Ludäscher, and Vetle I. Torvik. Autodcworkflow: Llm-based data cleaning workflow auto-generation and benchmark.arXiv preprint, (2412.06724), 2024

work page arXiv 2024
[25]

Exploring llm agents for cleaning tabular machine learning datasets.arXiv preprint, (2503.06664), 2025

Tommaso Bendinelli, Artur Dox, and Christian Holz. Exploring llm agents for cleaning tabular machine learning datasets.arXiv preprint, (2503.06664), 2025

work page arXiv 2025
[26]

AIDE: AI-Driven Exploration in the Space of Code

Zhengyao Jiang, Dominik Schmidt, Dhruv Srikanth, et al. Aide: Ai-driven exploration in the space of code. arXiv preprint, (2502.13138), 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[27]

Mlzero: A multi-agent system for end-to-end machine learning automation.arXiv preprint, (2505.13941), 2025

Haoyang Fang, Boran Han, Nick Erickson, Xiyuan Zhang, Su Zhou, Anirudh Dagar, Jiani Zhang, Ali Caner Turkmen, Cuixiong Hu, Huzefa Rangwala, Ying Nian Wu, Bernie Wang, and George Karypis. Mlzero: A multi-agent system for end-to-end machine learning automation.arXiv preprint, (2505.13941), 2025

work page arXiv 2025
[28]

Zero-shot decision tree construction via large language models.arXiv preprint arXiv:2501.16247, 2025

Lucas Carrasco, Felipe Urrutia, and AndrÃŠs Abeliuk. Zero-shot decision tree construction via large language models.arXiv preprint arXiv:2501.16247, 2025

work page arXiv 2025
[29]

Llm meeting decision trees on tabular data

Hangting Ye, Jinmeng Li, He Zhao, Dandan Guo, and Yi Chang. Llm meeting decision trees on tabular data. arXiv preprint, (2505.17918), 2025

work page arXiv 2025
[30]

Toolformer: Language Models Can Teach Themselves to Use Tools

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. ArXiv, abs/2302.04761, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[31]

ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs

Yujia Qin, Shi Liang, Yining Ye, Kunlun Zhu, Lan Yan, Ya-Ting Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Runchu Tian, Ruobing 9 Talking Trees: Reasoning-Assisted Induction of Decision Trees for Tabular Data Xie, Jie Zhou, Marc H. Gerstein, Dahai Li, Zhiyuan Liu, and Maosong Sun. Toolllm: Facilitating large language models to master 160...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[32]

Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022

work page 2022
[33]

Large Language Models are Zero-Shot Reasoners

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large lan- guage models are zero-shot reasoners.ArXiv, abs/2205.11916, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[34]

Zhuosheng Zhang, Aston Zhang, Mu Li, and Alexan- der J. Smola. Automatic chain of thought prompting in large language models.ArXiv, abs/2210.03493, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[35]

Python Software Foundation, 2019

Python Core Team.Python: A dynamic, open source programming language. Python Software Foundation, 2019

work page 2019
[36]

Pedregosa, G

F. Pedregosa, G. Varoquaux, A. Gramfort, V . Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V . Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duches- nay. Scikit-learn: Machine learning in Python.Journal of Machine Learning Research, 12:2825–2830, 2011

work page 2011
[37]

Harris, K

Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Walt, Ralf Gommers, Pauli Virtanen, David Cour- napeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, Robert Kern, Matti Picus, Stephan Hoyer, Marten H. van Kerkwijk, Matthew Brett, Al- lan Haldane, Jaime Fernández del Río, Mark Wiebe, Pearu Peterson, Pierre Gérard-Marchant, Kevin S...

work page 2020
[38]

Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Ev- geni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, Stéfan J

Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Ev- geni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, Stéfan J. van der Walt, Matthew Brett, Joshua Wilson, K. Jarrod Millman, Nikolay Mayorov, Andrew R. J. Nelson, Eric Jones, Robert Kern, Eric Larson, C. J. Carey, ˙Ilhan Polat, Yu Feng, Eric ...

work page 2020
[39]

pandas-dev/pandas: Pandas, February 2020

The pandas development team. pandas-dev/pandas: Pandas, February 2020

work page 2020
[40]

‘smolagents‘: a smol library to build great agen- tic systems

Aymeric Roucher, Albert Villanova del Moral, Thomas Wolf, Leandro von Werra, and Erik Kaunismäki. ‘smolagents‘: a smol library to build great agen- tic systems. https://github.com/huggingface/ smolagents, 2025

work page 2025
[41]

Feng He, Tianqing Zhu, Dayong Ye, Bo Liu, Wanlei Zhou, and Philip S. Yu. The emerged security and privacy of llm agent: A survey with case studies.arXiv preprint arXiv:2407.19354, 2024

work page arXiv 2024
[42]

Identifying the Risks of LM Agents with an LM-Emulated Sandbox

Yangjun Ruan, Honghua Dong, Andrew Wang, Sil- viu Pitis, Yongchao Zhou, Jimmy Ba, Yann Dubois, Chris J. Maddison, and Tatsunori Hashimoto. Iden- tifying the risks of lm agents with an lm-emulated sandbox.arXiv preprint arXiv:2309.15817, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[43]

Airgapagent: Protect- ing privacy-conscious conversational agents.arXiv preprint arXiv:2405.05175, 2024

Eugene Bagdasarian, Ren Yi, Sahra Ghalebikesabi, Peter Kairouz, Marco Gruteser, Sewoong Oh, Borja Balle, and Daniel Ramage. Airgapagent: Protect- ing privacy-conscious conversational agents.arXiv preprint arXiv:2405.05175, 2024

work page arXiv 2024
[44]

Miles Turpin, Julian Michael, Ethan Perez, and Samuel R. Bowman. Language models don’t al- ways say what they think: Unfaithful explana- tions in chain-of-thought prompting.arXiv preprint arXiv:2305.04388, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[45]

van Rijn, Bernd Bischl, and Luis Torgo

Joaquin Vanschoren, Jan N. van Rijn, Bernd Bischl, and Luis Torgo. Openml: Networked science in ma- chine learning.SIGKDD Explorations, 15:49–60, 2013

work page 2013
[46]

van Rijn, Arlind Kadra, Pieter Gijsbers, Neeratyoy Mallik, Sahithya Ravi, An- dreas Müller, Joaquin Vanschoren, and Frank Hutter

Matthias Feurer, Jan N. van Rijn, Arlind Kadra, Pieter Gijsbers, Neeratyoy Mallik, Sahithya Ravi, An- dreas Müller, Joaquin Vanschoren, and Frank Hutter. Openml-python: an extensible python api for openml. Journal of Machine Learning Research, 22(100):1–5, 2021

work page 2021
[47]

arXiv preprint arXiv:2410.24210 , year=

Yury Gorishniy, Akim Kotelnikov, and Artem Babenko. Tabm: Advancing tabular deep learning with parameter-efficient ensembling. InProceedings of the 2025 International Conference on Learning Rep- resentations (ICLR 2025), 2025. arXiv:2410.24210 [cs.LG], version v3

work page arXiv 2025
[48]

B., M¨uller, S., Salinas, D., and Hutter, F

Shi Bin Hoo, Samuel Müller, David Salinas, and Frank Hutter. The tabular foundation model tabpfn outper- forms specialized time series forecasting models based on simple features.arXiv preprint arXiv:2501.02945, 2025. 10 Talking Trees: Reasoning-Assisted Induction of Decision Trees for Tabular Data

work page arXiv 2025
[49]

Fairness through aware- ness

Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through aware- ness. InProceedings of the 3rd Innovations in Theo- retical Computer Science Conference (ITCS), pages 214–226, 2012

work page 2012
[50]

The Frontiers of Fairness in Machine Learning

Alexandra Chouldechova and Aaron Roth. The fron- tiers of fairness in machine learning.arXiv preprint arXiv:1810.08810, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[51]

Fairness and Machine Learning: Limitations and Op- portunities

Solon Barocas, Moritz Hardt, and Arvind Narayanan. Fairness and Machine Learning: Limitations and Op- portunities. MIT Press, 2023

work page 2023
[52]

Prediction- based decisions and fairness: A catalogue of choices, assumptions, and definitions.arXiv preprint arXiv:1811.07867, 2018

Shira Mitchell, Eric Potash, Solon Barocas, Alexan- der D’Amour, and Kristian Lum. Prediction- based decisions and fairness: A catalogue of choices, assumptions, and definitions.arXiv preprint arXiv:1811.07867, 2018

work page arXiv 2018
[53]

A reductions approach to fair classification

Alekh Agarwal, Alina Beygelzimer, Miroslav Dudík, John Langford, and Hanna Wallach. A reductions approach to fair classification. InProceedings of the 35th International Conference on Machine Learning (ICML), volume 80, pages 60–69, 2018

work page 2018
[54]

A comparative study of fairness-enhancing interventions in machine learning

Sorelle A. Friedler, Carlos Scheidegger, Suresh Venkatasubramanian, Sonam Choudhary, Evan P. Hamilton, and Derek Roth. A comparative study of fairness-enhancing interventions in machine learning. arXiv preprint arXiv:1802.04422, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[55]

UCI Adult data set

Ron Kohavi and Barry Becker. UCI Adult data set. UCI Machine Learning Repository, 1996. https://archive.ics.uci.edu/ml/datasets/adult

[56]

Retiring adult: New datasets for fair machine learning

Frances Ding, Moritz Hardt, John Miller, and Ludwig Schmidt. Retiring adult: New datasets for fair machine learning. In NeurIPS Datasets and Benchmarks Track, 2021.

[57]

Equality of opportunity in supervised learning

Moritz Hardt, Eric Price, and Nati Srebro. Equality of opportunity in supervised learning. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016.

[58]

Adult [dataset]

Barry Becker and Ron Kohavi. Adult [dataset]. UCI Machine Learning Repository, 1996. https://archive.ics.uci.edu/ml/datasets/adult.

A. Evaluation Dataset Information

The main evaluation uses the following datasets (with OpenML task IDs).

• Fitness-Fitness_Club (ID 363671), 15...

[59]

Tree modifications: ‘tree_mod‘ (key)

[60]

Tree analysis, visualization and debugging: ‘tree_eda‘ (key)

[61]

General feature engineering and transformations: ‘feat_engineering‘ (key)

[62]

General exploratory data analysis: ‘eda‘ (key)

[63]

Builtins: ‘builtins‘ (key)

Category descriptions:

[64]

Tree modifications: any operation that changes or trains the tree. Keywords: DecisionTreeClassifier, DecisionTreeRegressor, min_samples_split, max_depth, prune, replace_subtree, grow_subtree, repair, min_samples_leaf, max_features, min_impurity_decrease, min_weight_fraction_leaf, ccp_alpha, max_leaf_nodes, min_samples_leaf, min_samples_split, max_dept...

[65]

Tree analysis: introspection of trained tree(s) such as paths, leaves, importances, surrogate views, and plots. Keywords: is_leaf, get_data_indices_for_node, print, decision_path, feature_importances_, export_graphviz

[66]

General feature engineering and transformations: any input transformation before training/inference. Independent of a specific trained tree. Keywords: OneHotEncoder, PolynomialFeatures, clip, log1p, sign, datetime64

[67]

General exploratory data analysis: dataset-level profiling and exploration not tied to a specific model such as distributions, correlations, missingness, leakage checks, class balance. Keyword: percentile, std, mean, tsne, umap, train_test_split

[68]

Builtins: infrastructure that doesn’t change data or models and isn’t analysis such as I/O, seeding, logging, timing, config, small data wrangling helpers. Keywords: check_random_state, asarray, dtype, save, load. As an output provide a json-formattable dictionary of the form: category: [function1, function2].

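The output format requested by the category descriptions above can be illustrated with a small sketch. In the paper this classification is asked of an LLM; the exact-match keyword lookup below is a stand-in used only to show the shape of the resulting dictionary. The names `KEYWORDS` and `categorize` are hypothetical, the fallback to ‘builtins‘ for unmatched names is an assumption, and the ‘tree_mod‘ keyword list contains only the keywords visible before the source text is cut off.

```python
import json

# Keyword lists copied from the category descriptions above.
# The tree_mod list is truncated in the source, so it is partial here.
KEYWORDS = {
    "tree_mod": [
        "DecisionTreeClassifier", "DecisionTreeRegressor", "min_samples_split",
        "max_depth", "prune", "replace_subtree", "grow_subtree", "repair",
        "min_samples_leaf", "max_features", "min_impurity_decrease",
        "min_weight_fraction_leaf", "ccp_alpha", "max_leaf_nodes",
    ],
    "tree_eda": [
        "is_leaf", "get_data_indices_for_node", "print", "decision_path",
        "feature_importances_", "export_graphviz",
    ],
    "feat_engineering": [
        "OneHotEncoder", "PolynomialFeatures", "clip", "log1p", "sign", "datetime64",
    ],
    "eda": ["percentile", "std", "mean", "tsne", "umap", "train_test_split"],
    "builtins": ["check_random_state", "asarray", "dtype", "save", "load"],
}


def categorize(function_names):
    """Hypothetical helper: map each name to a category key by exact match.

    Names that match no keyword fall back to 'builtins' (an assumption made
    for this sketch, not something the source specifies).
    """
    lookup = {kw: cat for cat, kws in KEYWORDS.items() for kw in kws}
    result = {cat: [] for cat in KEYWORDS}
    for name in function_names:
        result[lookup.get(name, "builtins")].append(name)
    return result


report = categorize(["prune", "decision_path", "log1p", "mean", "asarray"])
print(json.dumps(report, indent=2))  # category: [function1, function2]
```

Every one of the five keys is always present in the result, with an empty list when nothing matched, so the dictionary serializes to JSON with a stable set of fields.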
