Counterfactual Modeling with Fine-Tuned LLMs for Health Intervention Design and Sensor Data Augmentation
Pith reviewed 2026-05-16 12:22 UTC · model grok-4.3
The pith
Fine-tuned LLMs generate plausible counterfactuals for health sensor data that also restore classifier performance under label scarcity.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Fine-tuned LLMs produce counterfactual explanations with high plausibility and validity for sensor-based health models; when these counterfactuals are added as augmented training samples, they recover an average of 20% F1 in label-scarce regimes while remaining more clinically actionable than optimization-based counterfactual methods.
What carries the argument
A fine-tuned LLM that maps clinical sensor features to minimal, semantically coherent feature edits capable of flipping a downstream classifier's prediction.
If this is right
- Health models can output both a risk score and the smallest realistic change a patient can make to lower that score.
- Scarce sensor datasets can be balanced without collecting new patient records by adding LLM-generated minority-class examples.
- Counterfactual generation becomes model-agnostic, allowing the same LLM pipeline to serve different underlying classifiers.
- Intervention design moves from generic advice to individualized, data-driven adjustments derived directly from the trained model.
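The augmentation idea in the second bullet can be illustrated with a minimal sketch. It is not the paper's pipeline: `toy_classifier`, `augment_minority`, the feature names, and the thresholds are hypothetical, and the candidate counterfactuals are hard-coded stand-ins for LLM outputs. The one design choice shown is that a candidate is added to the training set only if the downstream classifier actually assigns it to the target class.

```python
from typing import Callable, Dict, List, Tuple

Sample = Dict[str, float]


def toy_classifier(x: Sample) -> int:
    """Hypothetical stand-in for the downstream model: 1 = abnormal, 0 = normal."""
    return 1 if x["mean_glucose"] > 140.0 or x["daily_steps"] < 3000 else 0


def augment_minority(
    data: List[Tuple[Sample, int]],
    candidates: List[Sample],
    target_label: int,
    predict: Callable[[Sample], int],
) -> List[Tuple[Sample, int]]:
    """Append only the candidates that `predict` assigns to `target_label`."""
    kept = [(cf, target_label) for cf in candidates if predict(cf) == target_label]
    return data + kept


# Toy imbalanced training set: three abnormal samples, one normal sample.
train = [
    ({"mean_glucose": 155.0, "daily_steps": 2500.0}, 1),
    ({"mean_glucose": 160.0, "daily_steps": 4000.0}, 1),
    ({"mean_glucose": 150.0, "daily_steps": 2000.0}, 1),
    ({"mean_glucose": 110.0, "daily_steps": 8000.0}, 0),
]

# Hard-coded stand-ins for LLM-generated counterfactuals (assumption).
candidates = [
    {"mean_glucose": 130.0, "daily_steps": 6000.0},  # classified as 0, kept
    {"mean_glucose": 145.0, "daily_steps": 7000.0},  # still classified as 1, dropped
]

augmented = augment_minority(train, candidates, target_label=0, predict=toy_classifier)
print(len(train), len(augmented))  # 4 5
```

Filtering by the classifier's own prediction keeps invalid counterfactuals out of the minority class, but it does nothing to check clinical plausibility, which is a separate question.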
Where Pith is reading between the lines
- Wearable devices could run lightweight fine-tuned LLMs locally to give users immediate counterfactual feedback on their sensor streams.
- The same generation process might support simulation of hypothetical patient trajectories for public-health policy testing.
- If clinical validity holds, regulatory pathways could treat LLM counterfactuals as approved decision-support outputs rather than black-box suggestions.
Load-bearing premise
That the feature edits suggested by the LLM will remain clinically safe and free of hidden artifacts when patients actually attempt the recommended changes.
What would settle it
A controlled study that measures whether patients who follow the exact feature adjustments proposed by the LLM show the predicted health improvement without new adverse effects.
Original abstract
Counterfactual explanations (CFs) provide human-centric interpretability by identifying the minimal, actionable changes required to alter a machine learning model's prediction. Therefore, CFs can be used as (i) interventions for abnormality prevention and (ii) augmented data for training robust models. We conduct a comprehensive evaluation of CF generation using large language models (LLMs), including GPT-4 (zero-shot and few-shot) and two open-source models, BioMistral-7B and LLaMA-3.1-8B, in both pretrained and fine-tuned configurations. Using the multimodal AI-READI clinical dataset, we assess CFs across three dimensions: intervention quality, feature diversity, and augmentation effectiveness. Fine-tuned LLMs, particularly LLaMA-3.1-8B, produce CFs with high plausibility (up to 99%), strong validity (up to 0.99), and realistic, behaviorally modifiable feature adjustments. When used for data augmentation under controlled label-scarcity settings, LLM-generated CFs substantially restore classifier performance, yielding an average 20% F1 recovery across three scarcity scenarios. Compared with optimization-based baselines such as DiCE, CFNOW, and NICE, LLMs offer a flexible, model-agnostic approach that generates more clinically actionable and semantically coherent counterfactuals. Overall, this work demonstrates the promise of LLM-driven counterfactuals for both interpretable intervention design and data-efficient model training in sensor-based digital health.
Impact: SenseCF fine-tunes an LLM to generate valid, representative counterfactual explanations and to supplement the minority class in an imbalanced dataset, improving model training, robustness, and predictive performance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript evaluates counterfactual explanation generation using LLMs (GPT-4 zero/few-shot, BioMistral-7B, LLaMA-3.1-8B) in pretrained and fine-tuned settings on the multimodal AI-READI clinical sensor dataset. It claims that fine-tuned LLaMA-3.1-8B produces CFs with up to 99% plausibility and 0.99 validity featuring realistic, behaviorally modifiable adjustments; when used for augmentation under label scarcity, these yield an average 20% F1 recovery across three scenarios and outperform optimization baselines (DiCE, CFNOW, NICE). The work positions the approach as model-agnostic and suitable for both interpretable health interventions and data-efficient training.
Significance. If the reported metrics prove robust under independent clinical validation and the augmentation gains generalize, the work would offer a practical advance in digital health by enabling scalable, semantically coherent counterfactuals that support both intervention design and robustness in imbalanced sensor-data settings. The explicit baseline comparisons and focus on augmentation effectiveness strengthen its applied relevance.
major comments (2)
- [Evaluation and Results] The headline validity (0.99) and plausibility (99%) scores for fine-tuned LLaMA-3.1-8B are presented without any description of blinded clinician adjudication or external medical review; the abstract and evaluation rely on automated/proxy checks whose independence from fine-tuning artifacts is unspecified. This is load-bearing because continuous sensor features (glucose, activity) can produce numerically coherent but clinically incoherent CFs that would fail real intervention use.
- [Augmentation Experiments] The 20% average F1 recovery is demonstrated only under synthetic label-scarcity on the same AI-READI cohort; no cross-cohort or external validation is reported, leaving open whether the augmentation benefit transfers to new patient populations or deployment settings.
minor comments (2)
- Baseline implementations (DiCE, CFNOW, NICE) are referenced but lack explicit hyperparameter settings, distance metrics, or feature constraints used in the comparison.
- Notation for the three scarcity scenarios and the exact F1 recovery calculation (e.g., relative to which baseline classifier) should be clarified with a table or equation.
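To make the ambiguity in the last minor comment concrete, the sketch below shows two possible readings of "F1 recovery". Both are assumptions, since the source does not give the formula: an absolute gain over the label-scarce classifier, and a gain relative to the gap between the label-scarce and full-data classifiers. The function names and the confusion-matrix counts are invented for illustration and are not results from the paper.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when the denominator is zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def absolute_recovery(f1_scarce: float, f1_augmented: float) -> float:
    """Reading 1 (assumption): raw F1 gain over the label-scarce classifier."""
    return f1_augmented - f1_scarce


def relative_recovery(f1_full: float, f1_scarce: float, f1_augmented: float) -> float:
    """Reading 2 (assumption): share of the scarce-to-full F1 gap that is closed."""
    gap = f1_full - f1_scarce
    if gap <= 0:
        raise ValueError("full-data F1 must exceed label-scarce F1")
    return (f1_augmented - f1_scarce) / gap


# Invented counts for illustration only.
f1_full = f1_score(tp=80, fp=20, fn=20)       # 0.8
f1_scarce = f1_score(tp=40, fp=20, fn=60)     # 0.5
f1_augmented = f1_score(tp=60, fp=20, fn=40)  # about 0.667

print(round(absolute_recovery(f1_scarce, f1_augmented), 2))           # 0.17
print(round(relative_recovery(f1_full, f1_scarce, f1_augmented), 2))  # 0.56
```

The two readings give quite different numbers on the same inputs, which is why the reference classifier and the exact formula matter when interpreting a headline figure.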
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We address each major comment below with clarifications on our evaluation approach and planned revisions to improve transparency.
Point-by-point responses
- Referee: [Evaluation and Results] The headline validity (0.99) and plausibility (99%) scores for fine-tuned LLaMA-3.1-8B are presented without any description of blinded clinician adjudication or external medical review; the abstract and evaluation rely on automated/proxy checks whose independence from fine-tuning artifacts is unspecified. This is load-bearing because continuous sensor features (glucose, activity) can produce numerically coherent but clinically incoherent CFs that would fail real intervention use.
  Authors: We agree that automated metrics alone are insufficient for full clinical claims. Validity is defined in Section 4.2 as the rate at which the counterfactual flips the downstream classifier prediction, while plausibility uses a combination of feature-range constraints and embedding-based semantic similarity to the original distribution. These are independent of the fine-tuning objective in the sense that they are computed post-generation using held-out validation data and fixed physiological bounds, but we acknowledge they are proxies. We will revise the abstract and Section 4, and add an explicit limitations paragraph stating the absence of blinded clinician review and the risk of numerically coherent but clinically implausible outputs. This is a partial revision: we will clarify scope and add the limitation statement without new experiments.
  Revision: partial
- Referee: [Augmentation Experiments] The 20% average F1 recovery is demonstrated only under synthetic label-scarcity on the same AI-READI cohort; no cross-cohort or external validation is reported, leaving open whether the augmentation benefit transfers to new patient populations or deployment settings.
  Authors: The augmentation experiments in Section 5.3 are performed on the AI-READI cohort under controlled synthetic scarcity to isolate the effect of counterfactual augmentation. We do not claim generalization to new cohorts in the manuscript. We will expand the discussion section to explicitly note this limitation and identify cross-cohort validation as important future work. The reported 20% F1 recovery remains accurate for the evaluated setting. This constitutes a partial revision focused on improved transparency rather than new empirical results.
  Revision: partial
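The two proxy metrics described in the first response can be sketched as follows. Validity follows the definition the authors state (the rate at which a counterfactual flips the downstream classifier's prediction). Plausibility is reduced here to its range-constraint half only; the embedding-based similarity component is omitted. The bounds, feature names, toy classifier, and sample values are all hypothetical and are not taken from the paper.

```python
from typing import Callable, Dict, List, Tuple

Sample = Dict[str, float]

# Hypothetical physiological bounds (illustrative values, not from the paper).
BOUNDS: Dict[str, Tuple[float, float]] = {
    "mean_glucose": (40.0, 400.0),
    "daily_steps": (0.0, 50000.0),
}


def predict(x: Sample) -> int:
    """Hypothetical stand-in for the downstream classifier: 1 = abnormal."""
    return 1 if x["mean_glucose"] > 140.0 else 0


def validity(pairs: List[Tuple[Sample, Sample]], predict: Callable[[Sample], int]) -> float:
    """Fraction of (original, counterfactual) pairs whose predicted labels differ."""
    if not pairs:
        return 0.0
    flipped = sum(predict(orig) != predict(cf) for orig, cf in pairs)
    return flipped / len(pairs)


def in_range_plausibility(cfs: List[Sample], bounds: Dict[str, Tuple[float, float]]) -> float:
    """Fraction of counterfactuals with every bounded feature inside its range."""
    if not cfs:
        return 0.0
    ok = sum(all(lo <= cf[f] <= hi for f, (lo, hi) in bounds.items()) for cf in cfs)
    return ok / len(cfs)


pairs = [
    ({"mean_glucose": 160.0, "daily_steps": 3000.0}, {"mean_glucose": 130.0, "daily_steps": 6000.0}),
    ({"mean_glucose": 150.0, "daily_steps": 2000.0}, {"mean_glucose": 145.0, "daily_steps": 9000.0}),
    ({"mean_glucose": 170.0, "daily_steps": 1000.0}, {"mean_glucose": 120.0, "daily_steps": 60000.0}),
    ({"mean_glucose": 155.0, "daily_steps": 4000.0}, {"mean_glucose": 135.0, "daily_steps": 5000.0}),
]

print(validity(pairs, predict))                                  # 0.75
print(in_range_plausibility([cf for _, cf in pairs], BOUNDS))    # 0.75
```

In this toy data the third counterfactual flips the prediction yet proposes 60,000 daily steps, so it counts toward validity while failing the range check. Passing both checks still says nothing about clinical coherence, which is the gap the referee's comment targets.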
Circularity Check
No significant circularity; empirical results rest on external baselines and held-out metrics
Full rationale
The paper evaluates fine-tuned LLMs (LLaMA-3.1-8B etc.) on the AI-READI dataset by generating counterfactuals, measuring plausibility/validity via standard metrics, and testing augmentation under label scarcity against independent baselines (DiCE, CFNOW, NICE). Fine-tuning uses the dataset but performance is quantified on held-out evaluations and external comparators rather than reducing to fitted parameters by construction. No self-definitional equations, renamed predictions, or load-bearing self-citations appear in the derivation; the chain is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Fine-tuned LLMs produce more plausible and valid counterfactuals than optimization baselines on health sensor data
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Fine-tuned LLMs... produce CFs with high plausibility (up to 99%), strong validity (up to 0.99)... average 20% F1 recovery
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Jcost not mentioned; no recognition cost, golden-ratio ladder, or 8-tick periodicity
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.