Multistage Defer Trees for Hybrid Interpretability: If at First You Can't Succeed, Tree Again

Cynthia Rudin; Hayden McTavish; Margo Seltzer; Zakk Heile

arxiv: 2606.30995 · v1 · pith:XXLZHJMAnew · submitted 2026-06-30 · 💻 cs.LG · stat.ML


Pith reviewed 2026-07-01 01:06 UTC · model grok-4.3


keywords: multistage defer trees, hybrid interpretability, decision trees, accuracy-interpretability tradeoff, sparse decision trees, model deferral, black box models


The pith

A sequence of sparse decision trees can match the accuracy of full tree ensembles by classifying most samples with one or two trees and deferring the rest to a black box.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents multistage defer trees as a way to resolve the accuracy-interpretability tradeoff when single decision trees fall short. Each sparse tree in the sequence predicts for the bulk of the data and passes only a small share onward, ultimately reaching a complex model if needed. This keeps the majority of outputs transparent while preserving ensemble-level performance. A reader would care because it lets practitioners apply interpretable models more widely without accepting lower accuracy on hard cases. The work shows that even when black boxes are required, they need not handle every instance.

Core claim

Multistage Defer Trees consist of a sequence of sparse decision trees where each tree makes predictions for most samples and defers a small proportion to the next tree in the sequence or ultimately to a black box. The authors demonstrate that this model class can be trained to match the performance of complex tree-based ensembles while routing most samples through only one or a small number of sparse decision trees, expanding the accuracy-interpretability frontier in settings where single-tree methods remain insufficient.

What carries the argument

Multistage Defer Trees: a sequence of sparse decision trees that each predict for most samples and defer a small share to later stages or a black box.
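The routing described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the names `mdt_predict`, `DEFER`, `stage_one`, `stage_two`, and `black_box` are hypothetical, the two stages are hand-written stand-ins for learned sparse trees, and the fallback is a toy rule standing in for an ensemble.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple

DEFER = None  # hypothetical sentinel: a stage returns None when it declines to predict

Sample = Dict[str, float]
Stage = Callable[[Sample], Optional[int]]


def mdt_predict(
    x: Sample, stages: Sequence[Stage], black_box: Callable[[Sample], int]
) -> Tuple[int, int]:
    """Return (label, stage_index); stage_index == len(stages) means the black box answered."""
    for i, stage in enumerate(stages):
        label = stage(x)
        if label is not DEFER:
            return label, i  # an interpretable stage made the prediction
    return black_box(x), len(stages)  # every stage deferred


# Toy stand-ins for sparse trees: each predicts on confident regions, defers elsewhere.
def stage_one(x: Sample) -> Optional[int]:
    if x["a"] > 0.8:
        return 1
    if x["a"] < 0.2:
        return 0
    return DEFER


def stage_two(x: Sample) -> Optional[int]:
    if x["b"] > 0.5:
        return 1
    if x["b"] < 0.1:
        return 0
    return DEFER


def black_box(x: Sample) -> int:
    # Toy placeholder for a complex model.
    return int(x["a"] + x["b"] > 1.0)


stages = [stage_one, stage_two]
print(mdt_predict({"a": 0.9, "b": 0.0}, stages, black_box))  # answered by the first stage
print(mdt_predict({"a": 0.5, "b": 0.3}, stages, black_box))  # deferred to the black box
```

Returning the stage index alongside the label makes it possible to report, per sample, which tree (if any) produced the prediction.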

If this is right

Most samples receive predictions from one or two sparse trees, preserving interpretability for the bulk of the data.
Overall accuracy reaches levels comparable to complex tree ensembles without full opacity.
The method applies in domains where single trees are insufficient but full black-box use is undesirable.
Training techniques exist that maintain model simplicity while achieving the required deferral behavior.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This structure could support compliance needs in regulated settings by exposing tree-based explanations for the majority of cases.
The deferral points might identify data regions that inherently require more complex modeling, informing future data collection.
Similar staged deferral could be tested with other interpretable base learners beyond decision trees.

Load-bearing premise

Sequences of sparse trees can be trained so that deferral decisions keep overall accuracy high without needing many stages or deferring large fractions of the data on typical datasets.

What would settle it

On standard benchmark datasets, training multistage defer trees either drops accuracy below that of tree ensembles or requires more than a few stages with high deferral rates to reach comparable performance.
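The quantities this test depends on (overall accuracy, the share of samples resolved at each stage, and the share reaching the black box) can be computed from routed predictions. The sketch below assumes each prediction is recorded as a `(label, stage_index)` pair, with the last index meaning the black box; the function name `routing_report` and that encoding are assumptions for illustration, not taken from the paper.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple, Union


def routing_report(
    routed: Sequence[Tuple[int, int]], y_true: Sequence[int], n_stages: int
) -> Dict[str, Union[float, List[float]]]:
    """Summarize accuracy and where samples were resolved.

    routed[i] = (predicted_label, stage_index); stage_index == n_stages
    means the sample fell through to the black box.
    """
    n = len(y_true)
    if n == 0 or n != len(routed):
        raise ValueError("routed and y_true must be non-empty and the same length")
    counts = Counter(stage for _, stage in routed)
    stage_share = [counts.get(i, 0) / n for i in range(n_stages + 1)]
    accuracy = sum(int(pred == y) for (pred, _), y in zip(routed, y_true)) / n
    return {
        "accuracy": accuracy,
        "stage_share": stage_share,        # one entry per tree, plus the black box last
        "black_box_rate": stage_share[-1],  # fraction of samples left uninterpreted
    }


report = routing_report([(1, 0), (0, 0), (1, 1), (0, 2)], [1, 0, 0, 0], n_stages=2)
print(report)
```

Comparing `accuracy` against an ensemble baseline while watching `black_box_rate` and the length of `stage_share` is one way to check the two failure modes named above.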

Figures

Figures reproduced from arXiv: 2606.30995 by Cynthia Rudin, Hayden McTavish, Margo Seltzer, Zakk Heile.

**Figure 2.** Test accuracy (mean ± standard deviation) as a function of deferral rate across datasets. To answer question (1), ...

**Figure 3.** CDF of deferral thresholds required to have accuracy within ...

**Figure 14.** In short, backfitting gives clear improvements on some datasets, such as Tictactoe, where ...

**Figure 5.** Comparison between MDT+XGB hyperparameter selection using test deferral rate and ...

**Figure 6.** Comparison between selecting hyperparameters for MDTs using a budget on the total ...

**Figure 7.** Legend for the accuracy-deferral trade-off plots on the following pages.

**Figure 8.** Deferral-accuracy trade-offs for all datasets.

**Figure 9.** Deferral-accuracy trade-offs for all datasets, continued.

**Figure 10.** Results of our MDT algorithm compared to just using our DeferTree algorithm. Here, ...

**Figure 11.** Results of our MDT algorithm compared to just using our DeferTree algorithm (continued).

**Figure 12.** Ablation on the impacts of rescaling τ for later stages of the MDT. The boolean rescale_tau is set to 0 for the αdefer = 0 method ...

**Figure 13.** Ablation on the distance-based reweighting via tuning ...

**Figure 14.** Comparison of MDT+XGB and MDT+XGB with backfitting across selected datasets.

**Figure 15.** Comparison of MDT+XGB and MDT+EBM across selected datasets. MDT+EBM uses an ...

**Figure 16.** Compressed MDT stages and corresponding compressed single-tree representation for ...

**Figure 17.** Compressed MDT stages and corresponding compressed single-tree representation for ...

**Figure 18.** Phishing Rule List Representation.

Original abstract

Recent work has shown that well-optimized individual decision trees can match complex black box models in some settings, primarily in noisy domains. For the remaining settings, however, complex ensembled compositions of trees often achieve higher accuracy at the cost of interpretability, leaving practitioners with difficult modeling decisions along an accuracy-interpretability tradeoff. Ideally, we would like to classify as much of the data as possible with one or a small number of trees, achieving interpretability for most samples while maintaining state-of-the-art accuracy. We introduce Multistage Defer Trees: a sequence of sparse decision trees that each make predictions for most samples, while deferring a small proportion to the next tree in the sequence or, ultimately, to a black box. We demonstrate that we can train this model class to match the performance of complex tree-based ensembles while routing most samples through only one or a small number of sparse decision trees. We discuss a range of techniques for training these models while maintaining simplicity. Our method expands the accuracy-interpretability frontier in settings where single-tree methods remain insufficient, demonstrating that even when complex models are necessary, they need not be fully opaque.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Multistage defer trees are a new staged architecture worth looking at, but the claim that they route most data through one or two sparse trees while matching ensemble accuracy rests on training results that are not shown here.

The paper introduces multistage defer trees: a sequence of sparse decision trees where each predicts for most samples and defers a small share to the next tree or ultimately a black box. This is distinct from single-tree or standard ensemble work. The goal is to expand the accuracy-interpretability frontier in cases where one tree is not enough.

The architecture itself is the clear contribution. It formalizes a way to chain sparse trees with learned deferral and discusses training techniques that aim to preserve simplicity. If the deferral decisions actually keep the bulk of the data in the first one or two trees without a drop in accuracy, the hybrid benefit follows.

The soft spot is empirical. The abstract states that the model class can be trained to match complex tree ensembles with most samples routed through few trees, but the visible details on datasets, metrics, deferral fractions, stage counts, and how the training avoids either accuracy loss or high deferral are thin. The load-bearing assumption, that deferral can be made selective enough without inflating stages or sending too much downstream, remains the part that needs concrete numbers to evaluate. The stress-test note identifies this correctly.

This is for people working on interpretable models who already know single trees and ensembles. A reader in that area could get value from the model class definition and training discussion even if the experiments need tightening. It deserves peer review because the idea is new and the problem it targets is practical, though the referee will likely ask for fuller results on the routing behavior.

Referee Report

2 major / 0 minor

Summary. The paper introduces Multistage Defer Trees as a sequence of sparse decision trees, each classifying most samples while deferring a small fraction to the next stage or ultimately a black-box model. It claims that techniques exist to train this class such that it matches the accuracy of complex tree ensembles while routing the bulk of data through only one or a small number of the sparse trees, thereby expanding the accuracy-interpretability frontier.

Significance. If the claimed training procedure and routing behavior were empirically validated, the approach would meaningfully extend hybrid interpretability methods beyond single sparse trees by allowing most predictions to remain interpretable even when ensembles are needed for accuracy.

major comments (2)

[Abstract] The assertion that 'we can train this model class to match the performance of complex tree-based ensembles while routing most samples through only one or a small number of sparse decision trees' is presented with no experimental details, datasets, metrics, training procedures, or results, leaving the central empirical claim without any visible support.
[Abstract] The load-bearing assumption that deferral decisions can be trained to preserve overall accuracy without requiring many stages or large deferral fractions on typical datasets receives no analysis, ablation, or evidence, directly undermining the hybrid-interpretability benefit.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their comments. We respond point-by-point to the major comments below and indicate where revisions are appropriate.

Referee: [Abstract] The assertion that 'we can train this model class to match the performance of complex tree-based ensembles while routing most samples through only one or a small number of sparse decision trees' is presented with no experimental details, datasets, metrics, training procedures, or results, leaving the central empirical claim without any visible support.

Authors: The abstract is a concise summary; the full manuscript contains dedicated experimental sections reporting results across multiple datasets, accuracy metrics, deferral fractions, and the training procedures used (including optimization of tree sparsity and deferral thresholds). We will revise the abstract to include a brief reference to the empirical validation and key datasets to make the support more visible at the summary level.
Revision: yes

Referee: [Abstract] The load-bearing assumption that deferral decisions can be trained to preserve overall accuracy without requiring many stages or large deferral fractions on typical datasets receives no analysis, ablation, or evidence, directly undermining the hybrid-interpretability benefit.

Authors: The manuscript describes a range of training techniques intended to keep deferral fractions small and stage counts low while preserving accuracy. We acknowledge that dedicated ablations quantifying the sensitivity to stage count and deferral rate would provide stronger direct evidence for the assumption. We will add such analysis (or expanded discussion of existing results) in revision.
Revision: yes

Circularity Check

0 steps flagged

No circularity in multistage defer trees derivation or claims


The paper introduces a new model class (sequences of sparse decision trees with deferral to later stages or a black box) and states an empirical claim that it can be trained to match ensemble accuracy while routing most samples through one or few trees. No equations, fitted parameters renamed as predictions, or self-citations appear in the provided text. The load-bearing step is a training procedure whose success is presented as a demonstration rather than a definitional identity or reduction to prior self-cited results. The derivation chain is therefore self-contained and does not reduce any claimed result to its inputs by construction.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The approach rests on the domain assumption that sparse trees remain effective on deferred subsets and on several training hyperparameters whose values are not specified in the abstract.

free parameters (2)

number of stages
The length of the deferral sequence is a design choice that must be selected per dataset.
deferral threshold per stage
The fraction or confidence level at which samples are deferred is a tunable parameter.
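To make the second parameter concrete, here is one common way a confidence threshold can drive deferral at a tree leaf. The deferral rule the paper actually uses is not shown on this page, so this is only a sketch under that assumption; the names `leaf_decision` and `tau` are hypothetical.

```python
from typing import Dict, Optional


def leaf_decision(class_counts: Dict[int, int], tau: float) -> Optional[int]:
    """Predict the leaf's majority class if its purity is at least tau, else defer (None).

    class_counts maps class label -> number of training samples in this leaf.
    This purity-threshold rule is an illustrative assumption, not the paper's method.
    """
    total = sum(class_counts.values())
    if total == 0:
        return None  # empty leaf: nothing to base a prediction on, so defer
    label, count = max(class_counts.items(), key=lambda kv: kv[1])
    return label if count / total >= tau else None


print(leaf_decision({0: 9, 1: 1}, tau=0.8))  # purity 0.9 meets the threshold
print(leaf_decision({0: 6, 1: 4}, tau=0.8))  # purity 0.6 falls short, so the leaf defers
```

Under this rule, raising `tau` sends more samples onward and lowering it keeps more of them in the current tree, which is the trade-off the threshold is meant to tune.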

axioms (1)

domain assumption: Sparse decision trees can achieve competitive accuracy on appropriately chosen subsets of data.
Invoked implicitly when claiming that deferral preserves overall performance.


Reference graph

Works this paper leans on

78 extracted references · 18 canonical work pages · 1 internal anchor

[1]

UCI Machine Learning Repository, 2017

In-Vehicle Coupon Recommendation. UCI Machine Learning Repository, 2017. DOI: https://doi.org/10.24432/C5GS4P

work page doi:10.24432/c5gs4p 2017
[3]

Learning optimal decision trees using caching branch-and-bound search

Gaël Aglin, Siegfried Nijssen, and Pierre Schaus. Learning optimal decision trees using caching branch-and-bound search. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 3146–3153, 2020

2020
[4]

Tic-Tac-Toe Endgame

David Aha. Tic-Tac-Toe Endgame. UCI Machine Learning Repository, 1991. DOI: https://doi.org/10.24432/C5688J

work page doi:10.24432/c5688j 1991
[5]

Near-optimal decision trees in a SPLIT second

Varun Babbar, Hayden McTavish, Cynthia Rudin, and Margo Seltzer. Near-optimal decision trees in a SPLIT second. In International Conference on Machine Learning, 2025

2025
[6]

Michelle Bao, Angela Zhou, A. S. Zottola, Brian Brubach, Sarah Desmarais, Seth A. Horowitz, Kristian Lum, and Suresh Venkatasubramanian. It’s compaslicated: The messy relationship between rai datasets and algorithmic fairness benchmarks. In Proceedings of the Thirty-Fifth Conference on Neural Information Processing Systems (NeurIPS), 2021. Datasets and Be...

2021
[7]

Barry Becker and Ronny Kohavi. Adult. UCI Machine Learning Repository, 1996. DOI: https://doi.org/10.24432/C5XW20

work page doi:10.24432/c5xw20 1996
[8]

Optimal classification trees

Dimitris Bertsimas and Jack Dunn. Optimal classification trees. Machine Learning, 106: 1039–1082, 2017

2017
[9]

Harnessing the power of choices in decision tree learning

Guy Blanc, Jane Lange, Chirag Pabbaraju, Colin Sullivan, Li-Yang Tan, and Mo Tiwari. Harnessing the power of choices in decision tree learning. Advances in Neural Information Processing Systems, 36, 2024

2024
[10]

R. Bock. MAGIC Gamma Telescope. UCI Machine Learning Repository, 2004. DOI: https://doi.org/10.24432/C52C8B

work page doi:10.24432/c52c8b 2004
[11]

Using noise to infer aspects of simplicity without learning

Zachery Boner, Harry Chen, Lesia Semenova, Ronald Parr, and Cynthia Rudin. Using noise to infer aspects of simplicity without learning. In Advances In Neural Information Processing Systems, 2024

2024
[12]

Random forests

Leo Breiman. Random forests. Machine Learning, 45:5–32, 2001

2001
[13]

Classification and regression trees

Leo Breiman, Jerome H Friedman, Richard A Olshen, and Charles J Stone. Classification and regression trees. wadsworth & brooks, 1984

1984
[14]

Optimal classification trees for continuous feature data using dynamic programming with branch-and-bound

C˘at˘alin E Brit,a, Jacobus GM van der Linden, and Emir Demirovi´c. Optimal classification trees for continuous feature data using dynamic programming with branch-and-bound. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 11131–11139, 2025. 10

2025
[15]

Xgboost: A Scalable Tree Boosting System

Tianqi Chen and Carlos Guestrin. Xgboost: A Scalable Tree Boosting System. InProceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794, 2016

2016
[16]

On optimum recognition error and reject tradeoff

C Chow. On optimum recognition error and reject tradeoff. IEEE Transactions on information theory, 16(1):41–46, 2003

2003
[17]

Cortez and A

P. Cortez and A. M. Gonçalves Silva. Using data mining to predict secondary school student performance. 2008. URLhttps://api.semanticscholar.org/CorpusID:16621299

2008
[18]

Student Performance

Paulo Cortez. Student Performance. UCI Machine Learning Repository, 2008. DOI: https://doi.org/10.24432/C5TG7T

work page doi:10.24432/c5tg7t 2008
[19]

Murtree: Optimal decision trees via dynamic programming and search

Emir Demirovi´c, Anna Lukina, Emmanuel Hebrard, Jeffrey Chan, James Bailey, Christopher Leckie, Kotagiri Ramamohanarao, and Peter J Stuckey. Murtree: Optimal decision trees via dynamic programming and search. Journal of Machine Learning Research, 23(26):1–47, 2022

2022
[20]

Blossom: an anytime algorithm for computing optimal decision trees

Emir Demirovi´c, Emmanuel Hebrard, and Louis Jean. Blossom: an anytime algorithm for computing optimal decision trees. In International Conference on Machine Learning, pages 7533–7562. PMLR, 2023

2023
[21]

Compressing tree ensembles through level-wise optimization and pruning

Laurens Devos, Timo Martens, Deniz Can Oruc, Wannes Meert, Hendrik Blockeel, and Jesse Davis. Compressing tree ensembles through level-wise optimization and pruning. In Forty-second International Conference on Machine Learning, 2025

2025
[22]

On the foundations of noise-free selective classification.Journal of Machine Learning Research, 11(5), 2010

Ran El-Yaniv et al. On the foundations of noise-free selective classification.Journal of Machine Learning Research, 11(5), 2010

2010
[23]

TabArena: A Living Benchmark for Machine Learning on Tabular Data

Nick Erickson, Lennart Purucker, Andrej Tschalzev, David Holzmüller, Prateek Mutalik Desai, David Salinas, and Frank Hutter. Tabarena: A living benchmark for machine learning on tabular data, 2025. URLhttps://arxiv.org/abs/2506.16791

work page internal anchor Pith review Pith/arXiv arXiv 2025
[24]

Bike Sharing

Hadi Fanaee-T. Bike Sharing. UCI Machine Learning Repository, 2013. DOI: https://doi.org/10.24432/C5W894

work page doi:10.24432/c5w894 2013
[25]

Event labeling combining ensemble detectors and back- ground knowledge

Hadi Fanaee-T and João Gama. Event labeling combining ensemble detectors and back- ground knowledge. Progress in Artificial Intelligence, 2:113 – 127, 2013. URL https: //api.semanticscholar.org/CorpusID:256282956

2013
[26]

Learning hybrid interpretable models: Theory, taxonomy, and methods

Julien Ferry, Gabriel Laberge, and Ulrich Aïvodji. Learning hybrid interpretable models: Theory, taxonomy, and methods. arXiv preprint arXiv:2303.04437, 2023

work page arXiv 2023
[27]

Home equity line of credit (heloc) dataset, 2018

FICO. Home equity line of credit (heloc) dataset, 2018. URL https://community.fico. com/s/explainable-machine-learning-challenge . FICO Explainable Machine Learn- ing Challenge

2018
[28]

Using and combining predictors that specialize

Yoav Freund, Robert E Schapire, Yoram Singer, and Manfred K Warmuth. Using and combining predictors that specialize. In Proceedings of the twenty-ninth annual ACM symposium on Theory of computing, pages 334–343, 1997

1997
[29]

Partially interpretable models with guarantees on coverage and accuracy

Nave Frost, Zachary Lipton, Yishay Mansour, and Michal Moshkovitz. Partially interpretable models with guarantees on coverage and accuracy. In International conference on algorithmic learning theory, pages 590–613. PMLR, 2024

2024
[30]

Analysis of the automl challenge series 2015-

Isabelle Guyon, Lisheng Sun-Hosoya, Marc Boullé, Hugo Jair Escalante, Sergio Escalera, Zhengying Liu, Damir Jajetic, Bisakha Ray, Mehreen Saeed, Michéle Sebag, Alexander Statnikov, WeiWei Tu, and Evelyne Viegas. Analysis of the automl challenge series 2015-

2015
[31]

URL https: //www.automl.org/wp-content/uploads/2018/09/chapter10-challenge.pdf

In AutoML, Springer series on Challenges in Machine Learning, 2019. URL https: //www.automl.org/wp-content/uploads/2018/09/chapter10-challenge.pdf

2019
[32]

From rashomon theory to PRAXIS: Efficient decision tree rashomon sets

Zakk Heile, Hayden McTavish, Varun Babbar, Margo Seltzer, and Cynthia Rudin. From rashomon theory to PRAXIS: Efficient decision tree rashomon sets. In Forty-third International Conference on Machine Learning, 2026. URL https://openreview.net/forum?id= Sgwd0l1u2V. 11

2026
[33]

Spambase

Mark Hopkins, Erik Reeber, George Forman, and Jaap Suermondt. Spambase. UCI Machine Learning Repository, 1999. DOI: https://doi.org/10.24432/C53G6X

work page doi:10.24432/c53g6x 1999
[34]

Optimal sparse decision trees

Xiyang Hu, Cynthia Rudin, and Margo Seltzer. Optimal sparse decision trees. In Advances in Neural Information Processing Systems, volume 32, pages 7265–7273, 2019

2019
[35]

Time constrained dl8.5 using limited discrepancy search

Harold Kiossou, Pierre Schaus, Siegfried Nijssen, and Vinasetan Ratheil Houndji. Time constrained dl8.5 using limited discrepancy search. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 443–459. Springer, 2022

2022
[36]

Efficient lookahead decision trees

Harold Kiossou, Pierre Schaus, Siegfried Nijssen, and Gaël Aglin. Efficient lookahead decision trees. In International Symposium on Intelligent Data Analysis, pages 133–144. Springer, 2024

2024
[37]

A generic complete anytime beam search for optimal decision tree

Harold Silvere Kiossou, Siegfried Nijssen, and Pierre Schaus. A generic complete anytime beam search for optimal decision tree. arXiv preprint arXiv:2508.06064, 2025

work page arXiv 2025
[38]

Interpret when possible: A tree-based hybrid framework for interpretable classification

Yifan Li, Shuhan Qi, Lei Cui, Chao Xing, Lei Zhang, and Xuan Wang. Interpret when possible: A tree-based hybrid framework for interpretable classification. Big Data Mining and Analytics, 9(1):263–283, 2026. doi: 10.26599/BDMA.2025.9020055. URL https: //www.sciopen.com/article/10.26599/BDMA.2025.9020055

work page doi:10.26599/bdma.2025.9020055 2026
[39]

Generalized and scalable optimal sparse decision trees

Jimmy Lin, Chudi Zhong, Diane Hu, Cynthia Rudin, and Margo Seltzer. Generalized and scalable optimal sparse decision trees. In International Conference on Machine Learning, pages 6150–6160. PMLR, 2020

2020
[40]

Accurate intelligible models with pairwise interactions

Yin Lou, Rich Caruana, Johannes Gehrke, and Giles Hooker. Accurate intelligible models with pairwise interactions. Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, 2013. URL https://api.semanticscholar.org/ CorpusID:11246170

2013
[41]

Predict responsibly: improving fairness and accuracy by learning to defer

David Madras, Toni Pitassi, and Richard Zemel. Predict responsibly: improving fairness and accuracy by learning to defer. Advances in neural information processing systems, 31, 2018

2018
[42]

Malani, Jeffrey Kullgren, and Erica Solway

Preeti N. Malani, Jeffrey Kullgren, and Erica Solway. National poll on healthy aging (npha), united states, april 2017, 2019. URL https://doi.org/10.3886/ICPSR37305.v1. Distrib- utor

work page doi:10.3886/icpsr37305.v1 2017
[43]

Marcoulides

George A. Marcoulides. Discovering Knowledge in Data: An Introduction to Data Mining. Wiley, 2005. Churn dataset

2005
[44]

NATICUSdroid (Android Permissions)

Akshay Mathur. NATICUSdroid (Android Permissions). UCI Machine Learning Repository,
[45]

DOI: https://doi.org/10.24432/C5FS64

work page doi:10.24432/c5fs64
[46]

Akshay Mathur, Mounika Podila, Keyur Kulkarni, Quamar Niyaz, and Ahmad Y . Javaid. Poster: Naticusdroid: A malware detection framework for android using native and custom. 2021. URL https://api.semanticscholar.org/CorpusID:232063483

2021
[47]

Quant-BnB: A scalable branch-and-bound method for optimal decision trees with continuous features

Rahul Mazumder, Xiang Meng, and Haoyue Wang. Quant-BnB: A scalable branch-and-bound method for optimal decision trees with continuous features. In International Conference on Machine Learning, volume 162, pages 15255–15277. PMLR, 17–23 Jul 2022

2022
[48]

Fast sparse decision tree optimization via reference ensembles

Hayden McTavish, Chudi Zhong, Reto Achermann, Ilias Karimalis, Jacques Chen, Cynthia Rudin, and Margo Seltzer. Fast sparse decision tree optimization via reference ensembles. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 9604–9613, 2022

2022
[49]

Phishing Websites

Rami Mohammad and Lee McCluskey. Phishing Websites. UCI Machine Learning Repository,
[50]

DOI: https://doi.org/10.24432/C51W2X

work page doi:10.24432/c51w2x
[51]

Mohammad, Fadi A

Rami Mustafa A. Mohammad, Fadi A. Thabtah, and Lee Mccluskey. An assessment of features related to phishing websites using an automated technique. 2012 International Conference for Internet Technology and Secured Transactions, pages 492–497, 2012. URL https://api. semanticscholar.org/CorpusID:5716727. 12

2012
[52]

Consistent estimators for learning to defer to an expert

Hussein Mozannar and David Sontag. Consistent estimators for learning to defer to an expert. In International conference on machine learning, pages 7076–7087. PMLR, 2020

2020
[53]

Jasmine dataset, 2018

OpenML. Jasmine dataset, 2018. URL https://www.openml.org/d/41143. OpenML Dataset ID 41143; dataset from the ChaLearn Automatic Machine Learning (AutoML) Chal- lenge

2018
[54]

Madeline dataset, 2018

OpenML. Madeline dataset, 2018. URL https://www.openml.org/d/41144. OpenML Dataset ID 41144; dataset from the ChaLearn Automatic Machine Learning (AutoML) Chal- lenge

2018
[55]

Diamonds dataset, 2019

OpenML. Diamonds dataset, 2019. URL https://www.openml.org/d/42225. OpenML Dataset ID 42225; dataset containing prices and attributes of nearly 54,000 diamonds

2019
[56]

Abalone dataset, 2022

OpenML. Abalone dataset, 2022. URL https://www.openml.org/d/44956. OpenML Dataset ID 44956; dataset for predicting abalone age from physical measurements

2022
[57]

California dataset, 2022

OpenML. California dataset, 2022. URL https://www.openml.org/d/44090. OpenML Dataset ID 44090; dataset used in the tabular data benchmark and derived from the California Housing dataset

2022
[58]

Pol dataset, 2022

OpenML. Pol dataset, 2022. URL https://www.openml.org/d/44082. OpenML Dataset ID 44082; dataset used in the tabular data benchmark and derived from a binarized version of the original regression dataset

2022
[59]

Rl dataset, 2022

OpenML. Rl dataset, 2022. URL https://www.openml.org/d/43949. OpenML Dataset ID 43949; dataset used in the tabular data benchmark and derived from the ChaLearn AutoML Challenge

2022
[60]

Wine dataset, 2025

OpenML. Wine dataset, 2025. URL https://www.openml.org/d/47041. OpenML Dataset ID 47041

2025
[61]

Interpretable companions for black-box models

Danqing Pan, Tong Wang, and Satoshi Hara. Interpretable companions for black-box models. In International conference on artificial intelligence and statistics, pages 2444–2454. PMLR, 2020

2020
[62]

Ross Quinlan

J. Ross Quinlan. Induction of decision trees. Machine learning, 1(1):81–106, 1986

1986
[63]

Approximating xgboost with an interpretable decision tree

Omer Sagi and Lior Rokach. Approximating xgboost with an interpretable decision tree. Information Sciences, 572:522–542, 2021

2021
[64]

Sakar and Yomi Kastro

C. Sakar and Yomi Kastro. Online Shoppers Purchasing Intention Dataset. UCI Machine Learning Repository, 2018. DOI: https://doi.org/10.24432/C5F88Q

work page doi:10.24432/c5f88q 2018
[65]

Real-time prediction of online shoppers’ purchasing intention using multilayer perceptron and lstm recurrent neural networks

Cemal Okan Sakar, Suleyman Olcay Polat, Mete Katircioglu, and Yomi Kastro. Real-time prediction of online shoppers’ purchasing intention using multilayer perceptron and lstm recurrent neural networks. Neural Computing and Applications, 31:6893 – 6908, 2018. URL https://api.semanticscholar.org/CorpusID:13682776

2018
[66]

Santos-Pereira and Ana M

Carla M. Santos-Pereira and Ana M. Pires. On optimal reject rules and roc curves. Pattern Recognition Letters, 26(7):943–952, 2005. ISSN 0167-8655. doi: https://doi.org/10.1016/ j.patrec.2004.09.042. URL https://www.sciencedirect.com/science/article/pii/ S0167865504002892

2005
[67]

Schapire and Yoram Singer

Robert E. Schapire and Yoram Singer. Improved boosting algorithms using confidence-rated predictions. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, COLT’ 98, page 80–91, New York, NY , USA, 1998. Association for Computing Machinery. ISBN 1581130570. doi: 10.1145/279943.279960. URL https://doi.org/10. 1145/279943.279960

work page doi:10.1145/279943.279960 1998
[68]

On the existence of simpler machine learning models

Lesia Semenova, Cynthia Rudin, and Ronald Parr. On the existence of simpler machine learning models. In 2022 ACM Conference on Fairness, Accountability, and Transparency, pages 1827–1858, 2022. 13

2022
[69]

A path to simpler models starts with noise

Lesia Semenova, Harry Chen, Ronald Parr, and Cynthia Rudin. A path to simpler models starts with noise. Advances in Neural Information Processing Systems, 36, 2023

2023
[70] Yan Shuo Tan, Chandan Singh, Keyan Nasseri, Abhineet Agarwal, James Duncan, Omer Ronen, Matthew Epland, Aaron Kornblith, and Bin Yu. Fast interpretable greedy-tree sums. Proceedings of the National Academy of Sciences, 122(7):e2310151122, 2025. doi: 10.1073/pnas.2310151122

[71] Sebastian Thrun. The monk’s problems-a performance comparison of different learning algorithms, CMU-CS-91-197, sch. 1991. URL https://api.semanticscholar.org/CorpusID:59699060

[72] Sicco Verwer and Yingqian Zhang. Learning optimal classification trees using a binary linear program formulation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 1625–1632, 2019

[73] Thibaut Vidal and Maximilian Schiffer. Born-again tree ensembles. In International Conference on Machine Learning, pages 9743–9753. PMLR, 2020

[74] Tong Wang. Gaining no or low-cost transparency with interpretable partial substitute. In International Conference on Machine Learning, 2019

[75] Tong Wang and Qihang Lin. Hybrid predictive models: when an interpretable model collaborates with a black-box model. J. Mach. Learn. Res., 22(1), January 2021. ISSN 1532-4435

[76] J. Wnek. MONK’s Problems. UCI Machine Learning Repository, 1993. DOI: https://doi.org/10.24432/C5R30R

[77] I-Cheng Yeh. Default of Credit Card Clients. UCI Machine Learning Repository, 2009. DOI: https://doi.org/10.24432/C55S3H

[78] I-Cheng Yeh and Che hui Lien. The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients. Expert Syst. Appl., 36:2473–2480. URL https://api.semanticscholar.org/CorpusID:15696161

Appendix Contents

A Proofs
Proposition 1: Top-down improvement
...
