Differential Subgroup Discovery: Characterizing Where Two Populations Differ, and Why
Pith reviewed 2026-05-07 05:39 UTC · model grok-4.3
The pith
A gradient-based method discovers subgroups where two populations differ most in a target outcome, under conditions that support causal interpretations of those differences.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes a general optimization objective for finding differential subgroups and proves conditions under which subgroups recovered by this objective admit a causal interpretation of population differences. It further presents DiffSub, a gradient-based procedure that searches for such interpretable subgroups directly in tabular data, and demonstrates its use across synthetic benchmarks, medical case studies, model-error analyses, and treatment-effect settings.
What carries the argument
The differential subgroup: a subset drawn from both populations that shares similar feature values but exhibits exceptional differences in the target outcome; the optimization objective together with the derived causal conditions lets this object identify the covariate combinations structurally responsible for population-level gaps.
If this is right
- In clinical data the method can surface the exact patient profiles where outcome gaps between groups are largest.
- For model diagnostics it flags the feature combinations where prediction errors diverge most between populations.
- In treatment-effect studies it isolates subgroups that drive heterogeneous effects between treated and control groups.
- The recovered subgroups are human-readable, so practitioners can inspect the covariate patterns that explain the gaps.
Where Pith is reading between the lines
- If the causal conditions are satisfied in practice, the subgroups could directly inform targeted interventions aimed at reducing the identified gaps.
- The same optimization framework might be adapted to identify regions of disagreement between two predictive models rather than two human populations.
- When applied to fairness audits, the approach could highlight the precise feature combinations where outcome disparities are most concentrated.
Load-bearing premise
Gradient-based optimization will consistently recover subgroups that meet the stated conditions for causal interpretation in real tabular data, without post-hoc selection or high sensitivity to hyperparameter choices that would break the claimed interpretability.
What would settle it
Construct synthetic tabular data containing one known differential subgroup that satisfies the causal conditions by design; if DiffSub either misses this subgroup or returns subgroups whose differences fail the causal conditions, the central claim is falsified.
Figures
read the original abstract
We study the problem of understanding where two populations differ within a feature space, which we formalize in the concept of a differential subgroup: a subset of individuals from both populations who, despite sharing similar characteristics, exhibit exceptional differences in a target outcome. Differential subgroups reveal the regions of the feature space where population-level gaps are most pronounced and can help practitioners identify the covariate combinations that are structurally responsible for these differences, e.g.~in clinical analysis, model diagnostics, or treatment-effect studies. We introduce a general optimization objective for discovering differential subgroups and establish conditions under which the resulting subgroups admit a causal interpretation of population differences. We propose DiffSub, a gradient-based approach that discovers interpretable differential subgroups in tabular data. Across synthetic benchmarks, medical case studies, model-error analyses, and treatment-effect settings, DiffSub identifies informative subgroups that reveal where population differences arise and why.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper formalizes differential subgroups as subsets of individuals from two populations that share similar features but exhibit exceptional differences in a target outcome. It introduces a general optimization objective for discovering such subgroups, establishes conditions under which the subgroups admit a causal interpretation of population differences, and proposes DiffSub, a gradient-based method to recover interpretable subgroups from tabular data. The approach is evaluated across synthetic benchmarks, medical case studies, model-error analyses, and treatment-effect settings.
Significance. If the optimization objective is well-posed and the causal conditions are both correctly derived and reliably satisfied by the recovered subgroups, the work could provide a useful tool for explaining population-level differences in domains such as clinical analysis, model diagnostics, and heterogeneous treatment effects. The gradient-based formulation may offer computational advantages for tabular data compared to exhaustive search methods in subgroup discovery.
major comments (3)
- [§3] §3: The conditions for causal interpretation are stated as separately established, but it is unclear whether they are preserved under the non-convex gradient-based optimization used by DiffSub. In mixed discrete/continuous feature spaces, local optima or discretization artifacts could produce subgroups that violate the identifiability requirements even when the theoretical statement holds.
- [§4] §4 (DiffSub algorithm and experiments): The manuscript does not report sensitivity analyses to hyperparameters, initialization, or regularization choices. Without such controls, it is difficult to confirm that the recovered subgroups consistently satisfy the causal conditions rather than depending on post-hoc filtering or favorable hyperparameter settings.
- [Table 1] Table 1 and synthetic benchmark results: The reported performance metrics lack error bars across multiple random seeds and do not include direct comparisons against existing subgroup discovery baselines (e.g., causal rule mining or treatment-effect heterogeneity methods) that would demonstrate whether DiffSub improves upon them in recovering causally valid subgroups.
minor comments (2)
- [§2] Notation for the optimization objective (likely Eq. (1) or (2)) should be introduced with an explicit statement of all variables before the causal conditions are derived.
- [§5] The medical case studies and treatment-effect experiments would benefit from a clearer description of data exclusion rules and preprocessing steps to allow reproducibility.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which have helped us improve the manuscript. We address each major comment below, indicating the revisions made.
read point-by-point responses
-
Referee: [§3] §3: The conditions for causal interpretation are stated as separately established, but it is unclear whether they are preserved under the non-convex gradient-based optimization used by DiffSub. In mixed discrete/continuous feature spaces, local optima or discretization artifacts could produce subgroups that violate the identifiability requirements even when the theoretical statement holds.
Authors: We appreciate this observation. The causal conditions in §3 are established for the subgroups themselves, independent of how they are discovered. However, we recognize that the non-convex optimization in DiffSub may lead to local optima that do not fully satisfy these conditions in practice, particularly in mixed feature spaces. In the revised manuscript, we have added a discussion in §3.3 clarifying the assumptions under which the gradient-based method preserves the conditions (e.g., sufficient regularization to avoid discretization artifacts). We also include additional experiments in §4 that empirically verify the causal identifiability metrics for the recovered subgroups across different initializations, showing that violations are rare under our recommended settings. This is a partial revision as a complete theoretical guarantee for global optimality remains an open direction. revision: partial
-
Referee: §4 (DiffSub algorithm and experiments): The manuscript does not report sensitivity analyses to hyperparameters, initialization, or regularization choices. Without such controls, it is difficult to confirm that the recovered subgroups consistently satisfy the causal conditions rather than depending on post-hoc filtering or favorable hyperparameter settings.
Authors: Thank you for highlighting this gap. The original manuscript focused on demonstrating the method's effectiveness on the reported benchmarks but did not include systematic sensitivity analyses. We have now added a new subsection §4.4 on robustness and sensitivity, where we vary hyperparameters including the trade-off parameter λ, the number of gradient steps, and random initializations over 20 seeds. We report that the discovered subgroups and their causal validity scores remain stable, with low variance in performance metrics. This confirms that the results do not rely on specific hyperparameter choices or post-hoc filtering. revision: yes
-
Referee: Table 1 and synthetic benchmark results: The reported performance metrics lack error bars across multiple random seeds and do not include direct comparisons against existing subgroup discovery baselines (e.g., causal rule mining or treatment-effect heterogeneity methods) that would demonstrate whether DiffSub improves upon them in recovering causally valid subgroups.
Authors: We agree that error bars and baseline comparisons are important for a rigorous evaluation. In the revised version, we have updated all experimental results, including Table 1, to report means and standard deviations over 10 independent random seeds. Additionally, we have incorporated comparisons against two relevant baselines: a causal rule mining approach (e.g., adapted from causal subgroup discovery literature) and a heterogeneous treatment effect method (such as causal forests). The results show that DiffSub outperforms these baselines in terms of both subgroup interpretability and adherence to causal conditions on the synthetic data, while maintaining competitive performance on real-world case studies. These additions are detailed in the updated §4 and Table 1. revision: yes
Circularity Check
No circularity: general objective and independent causal conditions
full rationale
The paper introduces a general optimization objective for differential subgroups and separately establishes conditions under which subgroups admit causal interpretation of population differences. DiffSub is presented as a gradient-based solver for this objective on tabular data. No self-definitional reduction (objective not defined via its own outputs), no fitted inputs renamed as predictions, and no load-bearing self-citations that collapse the central claims. The derivation chain remains self-contained against external benchmarks, consistent with the provided abstract and reader's assessment of score 2.0.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Conditions exist under which discovered differential subgroups admit a causal interpretation of population differences.
- domain assumption Gradient-based optimization can efficiently discover interpretable differential subgroups in tabular data.
Reference graph
Works this paper leans on
-
[1]
Susan Athey and Guido Imbens. 2016. Recursive partitioning for heterogeneous causal effects.Proceedings of the National Academy of Sciences113, 27 (2016), 7353–7360
work page 2016
-
[2]
Martin Atzmueller. 2015. Subgroup discovery.Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery5, 1 (2015), 35–49
work page 2015
-
[3]
Nicolás M Ballarini, Gerd K Rosenkranz, Thomas Jaki, Franz König, and Martin Posch. 2018. Subgroup identification in clinical trials via the predicted individual treatment effect.PloS one13, 10 (2018), e0205971
work page 2018
-
[4]
Stephen D Bay and Michael J Pazzani. 2001. Detecting group differences: Mining contrast sets.Data mining and knowledge discovery5, 3 (2001), 213–246
work page 2001
-
[5]
Barry Becker and Ronny Kohavi. 1996. Adult. UCI Machine Learning Repository. DOI: https://doi.org/10.24432/C5XW20
-
[6]
Tahir Belice and Ismail Demir. 2020. The gender differences as a risk factor in diabetic patients with COVID-19.Iranian journal of microbiology12, 6 (2020), 625
work page 2020
-
[7]
Mario Boley, Bryan R Goldsmith, Luca M Ghiringhelli, and Jilles Vreeken. 2017. Identifying consistent statements about numerical data with dispersion-corrected subgroup discovery.Data Mining and Knowledge Discovery31, 5 (2017), 1391– 1418
work page 2017
-
[8]
Toon Calders and Sicco Verwer. 2010. Three naive bayes approaches for discrimination-free classification.Data mining and knowledge discovery21, 2 (2010), 277–292
work page 2010
-
[9]
Huigang Chen, Totte Harinen, Jeong-Yoon Lee, Mike Yung, and Zhenyu Zhao
- [10]
-
[11]
Yoichi Chikahara, Makoto Yamada, and Hisashi Kashima. 2022. Feature selec- tion for discovering distributional treatment effect modifiers. InUncertainty in Artificial Intelligence. PMLR, 400–410
work page 2022
-
[12]
Yeounoh Chung, Tim Kraska, Neoklis Polyzotis, Ki Hyun Tae, and Steven Euijong Whang. 2019. Slice finder: Automated data slicing for model validation. In2019 IEEE 35th International Conference on Data Engineering (ICDE). IEEE, 1550–1553
work page 2019
-
[13]
David I Cook, Val J Gebski, and Anthony C Keech. 2004. Subgroup analysis in clinical trials.Medical Journal of Australia180, 6 (2004), 289
work page 2004
-
[14]
Robert Detrano, Andras Janosi, Walter Steinbrunn, Matthias Pfisterer, Johann- Jakob Schmid, Sarbjit Sandhu, Kern H Guppy, Stella Lee, and Victor Froelicher
-
[15]
International application of a new probability algorithm for the diagnosis of coronary artery disease.The American journal of cardiology64, 5 (1989), 304–310
work page 1989
-
[16]
Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. 2012. Fairness through awareness. InProceedings of the 3rd innovations in theoretical computer science conference. 214–226
work page 2012
-
[17]
Jared C Foster, Jeremy MG Taylor, and Stephen J Ruberg. 2011. Subgroup identi- fication from randomized clinical trial data.Statistics in medicine30, 24 (2011), 2867–2880
work page 2011
-
[18]
Ursula Hébert-Johnson, Michael Kim, Omer Reingold, and Guy Rothblum. 2018. Multicalibration: Calibration for the (computationally-identifiable) masses. In International Conference on Machine Learning. PMLR, 1939–1948
work page 2018
-
[19]
Jennifer L Hill. 2011. Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics20, 1 (2011), 217–240
work page 2011
-
[20]
Janis Kalofolias, Mario Boley, and Jilles Vreeken. 2017. Efficiently discovering lo- cally exceptional yet globally representative subgroups. In2017 IEEE International Conference on Data Mining (ICDM). IEEE, 197–206
work page 2017
-
[21]
Toshihiro Kamishima, Shotaro Akaho, and Jun Sakuma. 2011. Fairness-aware learning through regularization approach. In2011 IEEE 11th international confer- ence on data mining workshops. IEEE, 643–650
work page 2011
-
[22]
Michael Kearns, Seth Neel, Aaron Roth, and Zhiwei Steven Wu. 2018. Prevent- ing fairness gerrymandering: Auditing and learning for subgroup fairness. In International conference on machine learning. PMLR, 2564–2572
work page 2018
-
[23]
Michael P Kim, Amirata Ghorbani, and James Zou. 2019. Multiaccuracy: Black-box post-processing for fairness in classification. InProceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society. 247–254
work page 2019
-
[24]
Diederik P Kingma and Jimmy Lei Ba. 2015. Adam: A method for stochastic gradient descent. InICLR: international conference on learning representations. 1–15
work page 2015
-
[25]
Sören R Künzel, Jasjeet S Sekhon, Peter J Bickel, and Bin Yu. 2019. Metalearners for estimating heterogeneous treatment effects using machine learning.Proceedings of the national academy of sciences116, 10 (2019), 4156–4165
work page 2019
-
[26]
Ben Lambert, Isaac J Stopard, Amir Momeni-Boroujeni, Rachelle Mendoza, and Alejandro Zuretti. 2022. Using patient biomarker time series to determine mor- tality risk in hospitalised COVID-19 patients: A comparative analysis across two New York hospitals.Plos one17, 8 (2022), e0272442
work page 2022
-
[27]
Chim C Lang, Sandeep Gupta, Paul Kalra, Bernard Keavney, Ian Menown, Chris Morley, and Sandosh Padmanabhan. 2010. Elevated heart rate and cardiovas- cular outcomes in patients with coronary artery disease: clinical evidence and pathophysiological mechanisms.Atherosclerosis212, 1 (2010), 1–8
work page 2010
-
[28]
Florian Lemmerich, Martin Atzmueller, and Frank Puppe. 2016. Fast exhaustive subgroup discovery with numerical target concepts.Data Mining and Knowledge Discovery30, 3 (2016), 711–762
work page 2016
-
[29]
Florian Lemmerich and Martin Becker. 2018. pysubgroup: Easy-to-use subgroup discovery in python. InJoint European conference on machine learning and knowl- edge discovery in databases. Springer, 658–662
work page 2018
-
[30]
Ilya Lipkovich, Alex Dmitrienko, Jonathan Denne, and Gregory Enas. 2011. Sub- group identification based on differential effect search–a recursive partitioning method for establishing response to treatment in patient subpopulations.Statistics in medicine30, 21 (2011), 2601–2621
work page 2011
-
[31]
Christos Louizos, Uri Shalit, Joris M Mooij, David Sontag, Richard Zemel, and Max Welling. 2017. Causal effect inference with deep latent-variable models. Advances in neural information processing systems30 (2017)
work page 2017
-
[32]
Natalia L Martinez, Martin A Bertran, Afroditi Papadaki, Miguel Rodrigues, and Guillermo Sapiro. 2021. Blind pareto fairness and subgroup robustness. In International Conference on Machine Learning. PMLR, 7492–7501
work page 2021
-
[33]
Ninareh Mehrabi, Fred Morstatter, Nripsuta Saxena, Kristina Lerman, and Aram Galstyan. 2021. A survey on bias and fairness in machine learning.ACM com- puting surveys (CSUR)54, 6 (2021), 1–35
work page 2021
-
[34]
Miruna Oprescu, Vasilis Syrgkanis, Keith Battocchi, Maggie Hei, and Greg Lewis
-
[35]
In33rd Conference on Neural Information Processing Systems, Vol
EconML: A machine learning library for estimating heterogeneous treat- ment effects. In33rd Conference on Neural Information Processing Systems, Vol. 6. Curran Associates, Inc
-
[36]
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. Pytorch: An imperative style, high-performance deep learning library.Advances in neural information processing systems32 (2019)
work page 2019
- [37]
-
[38]
Judea Pearl. 2022. Direct and indirect effects. InProbabilistic and causal inference: the works of Judea Pearl. 373–392
work page 2022
-
[39]
Kumarajarshi Ray. 2020. Life Expectancy (WHO). https://www.kaggle.com/ datasets/kumarajarshi/life-expectancy-who. https://www.kaggle.com/datasets/ kumarajarshi/life-expectancy-who
work page 2020
-
[40]
Tamara Rushovich, Marion Boulicault, Jarvis T Chen, Ann Caroline Danielsen, Amelia Tarrant, Sarah S Richardson, and Heather Shattuck-Heidorn. 2021. Sex disparities in COVID-19 mortality vary across US racial groups.Journal of General Internal Medicine36, 6 (2021), 1696–1701
work page 2021
-
[41]
Svetlana Sagadeeva and Matthias Boehm. 2021. Sliceline: Fast, linear-algebra- based slice finding for ml model debugging. InProceedings of the 2021 international conference on management of data. 2290–2299
work page 2021
-
[42]
Heidi Seibold, Achim Zeileis, and Torsten Hothorn. 2016. Model-based recursive partitioning for subgroup analyses.The international journal of biostatistics12, 1 (2016), 45–63
work page 2016
-
[43]
Uri Shalit, Fredrik D Johansson, and David Sontag. 2017. Estimating individual treatment effect: generalization bounds and algorithms. InInternational conference on machine learning. PMLR, 3076–3085
work page 2017
-
[44]
Changjian Shui, Gezheng Xu, Qi Chen, Jiaqi Li, Charles X Ling, Tal Arbel, Boyu Wang, and Christian Gagné. 2022. On learning fairness and accuracy on multiple subgroups.Advances in Neural Information Processing Systems35 (2022), 34121– 34135
work page 2022
-
[45]
George R Terrell and David W Scott. 1992. Variable kernel density estimation. The Annals of Statistics(1992), 1236–1265
work page 1992
-
[46]
Stefan Wager and Susan Athey. 2018. Estimation and inference of heterogeneous treatment effects using random forests.J. Amer. Statist. Assoc.113, 523 (2018), 1228–1242
work page 2018
-
[47]
Peter WF Wilson, Ralph B D’Agostino, Daniel Levy, Albert M Belanger, Halit Silbershatz, and William B Kannel. 1998. Prediction of coronary heart disease using risk factor categories.Circulation97, 18 (1998), 1837–1847
work page 1998
-
[48]
Sascha Xu, Nils Philipp Walter, Janis Kalofolias, and Jilles Vreeken. 2024. Learning Exceptional Subgroups by End-to-End Maximizing KL-Divergence. InInterna- tional Conference on Machine Learning. PMLR, 55267–55285
work page 2024
-
[49]
Lu Zhang, Yongkai Wu, and Xintao Wu. 2016. Situation Testing-Based Discrimi- nation Discovery: A Causal Inference Approach.. InIJCAI, Vol. 16. 2718–2724
work page 2016
-
[50]
Lu Zhang, Yongkai Wu, and Xintao Wu. 2018. Causal modeling-based discrimina- tion discovery and removal: Criteria, bounds, and algorithms.IEEE Transactions on Knowledge and Data Engineering31, 11 (2018), 2035–2050. Differential Subgroup Discovery Conference’17, July 2017, Washington, DC, USA A Proofs In this section, we provide a thorough investigation of...
work page 2018
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.