arxiv: 2601.14590 · v2 · submitted 2026-01-21 · 💻 cs.LG

Counterfactual Modeling with Fine-Tuned LLMs for Health Intervention Design and Sensor Data Augmentation

Pith reviewed 2026-05-16 12:22 UTC · model grok-4.3

classification 💻 cs.LG
keywords counterfactual explanations · large language models · data augmentation · health interventions · sensor data · machine learning interpretability · label scarcity

The pith

Fine-tuned LLMs generate plausible counterfactuals for health sensor data that also restore classifier performance under label scarcity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates large language models for producing counterfactual explanations from multimodal clinical sensor data. Fine-tuned models such as LLaMA-3.1-8B identify minimal, behaviorally modifiable feature changes that flip model predictions while preserving clinical plausibility and validity. These outputs serve two practical roles: they outline concrete interventions to prevent health abnormalities, and they supply synthetic samples that augment scarce labeled training data. In controlled tests on the AI-READI dataset, the generated counterfactuals reach up to 99 percent plausibility and yield an average F1 recovery of 20 percent across three label-scarcity settings, and the paper reports them as more clinically actionable than optimization baselines such as DiCE, CFNOW, and NICE. The approach therefore supplies a single model-agnostic mechanism for both interpretable intervention design and data-efficient model improvement in digital health.

Core claim

Fine-tuned LLMs produce counterfactual explanations with high plausibility and validity for sensor-based health models; when these explanations are added to the training set as augmented samples, they yield an average 20 percent F1 recovery in label-scarce regimes while remaining more clinically actionable than optimization-based counterfactual methods.

What carries the argument

A fine-tuned LLM that maps clinical sensor features to minimal, semantically coherent feature edits capable of flipping a downstream classifier's prediction.
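To make "minimal feature edit that flips a prediction" concrete, here is a small sketch. Everything in it is an assumption for illustration: the feature names, the linear toy classifier, and the hand-written counterfactual stand in for the paper's real classifier and LLM output, neither of which is reproduced on this page.

```python
from typing import Callable, Dict, Tuple

Features = Dict[str, float]


def toy_classifier(x: Features) -> int:
    """Hypothetical stand-in classifier: returns 1 ("abnormal") above a score threshold."""
    score = 0.02 * x["mean_glucose"] - 0.0002 * x["daily_steps"] + 0.01 * x["stress"]
    return int(score > 2.0)


def changed_features(factual: Features, counterfactual: Features,
                     tol: float = 1e-9) -> Dict[str, Tuple[float, float]]:
    """Return {feature: (old, new)} for every feature the counterfactual edits."""
    return {
        name: (factual[name], counterfactual[name])
        for name in factual
        if abs(factual[name] - counterfactual[name]) > tol
    }


def flips_prediction(clf: Callable[[Features], int],
                     factual: Features, counterfactual: Features) -> bool:
    """A counterfactual is valid when the classifier's label changes."""
    return clf(factual) != clf(counterfactual)


# Illustrative values only; not taken from AI-READI.
factual = {"mean_glucose": 130.0, "daily_steps": 3000.0, "stress": 40.0}
# A counterfactual of the kind an LLM might propose: edit one modifiable feature.
counterfactual = dict(factual, daily_steps=6000.0)

edits = changed_features(factual, counterfactual)
valid = flips_prediction(toy_classifier, factual, counterfactual)
print(edits, valid)
```

The number of entries in `edits` is a simple sparsity measure: fewer edited features means a smaller, easier-to-follow intervention.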

If this is right

  • Health models can output both a risk score and the smallest realistic change a patient can make to lower that score.
  • Scarce sensor datasets can be balanced without collecting new patient records by adding LLM-generated minority-class examples.
  • Counterfactual generation becomes model-agnostic, allowing the same LLM pipeline to serve different underlying classifiers.
  • Intervention design moves from generic advice to individualized, data-driven adjustments derived directly from the trained model.
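The balancing idea in the second bullet can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's pipeline: `generate_cf` is a hypothetical stand-in for the fine-tuned LLM, `clf` is a toy threshold classifier, and the data are invented. Only counterfactuals that the classifier actually assigns to the minority label are kept.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

Sample = Dict[str, float]
Labeled = Tuple[Sample, int]


def augment_minority(data: List[Labeled],
                     clf: Callable[[Sample], int],
                     generate_cf: Callable[[Sample], Sample],
                     minority_label: int) -> List[Labeled]:
    """Add valid counterfactuals of other-class samples until classes are balanced."""
    counts = Counter(label for _, label in data)
    deficit = max(counts.values()) - counts[minority_label]
    augmented = list(data)
    for x, label in data:
        if deficit <= 0:
            break
        if label == minority_label:
            continue
        cf = generate_cf(x)
        # Keep the synthetic sample only if it really lands in the minority class.
        if clf(cf) == minority_label:
            augmented.append((cf, minority_label))
            deficit -= 1
    return augmented


# Hypothetical stand-ins: a threshold classifier and a fixed-step "generator".
clf = lambda x: int(x["daily_steps"] >= 5000)
generate_cf = lambda x: dict(x, daily_steps=x["daily_steps"] + 3000.0)

data = [({"daily_steps": s}, 0) for s in (2500.0, 3000.0, 1000.0, 4000.0)]
data.append(({"daily_steps": 7000.0}, 1))

balanced = augment_minority(data, clf, generate_cf, minority_label=1)
label_counts = Counter(label for _, label in balanced)
print(label_counts)
```

Filtering on `clf(cf) == minority_label` matters: an invalid counterfactual added with the minority label would inject label noise rather than useful signal.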

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Wearable devices could run lightweight fine-tuned LLMs locally to give users immediate counterfactual feedback on their sensor streams.
  • The same generation process might support simulation of hypothetical patient trajectories for public-health policy testing.
  • If clinical validity holds, regulatory pathways could treat LLM counterfactuals as approved decision-support outputs rather than black-box suggestions.

Load-bearing premise

That the feature edits suggested by the LLM will remain clinically safe and free of hidden artifacts when patients actually attempt the recommended changes.

What would settle it

A controlled study that measures whether patients who follow the exact feature adjustments proposed by the LLM show the predicted health improvement without new adverse effects.

Figures

Figures reproduced from arXiv: 2601.14590 by Asiful Arefeen, Hassan Ghasemzadeh, Melanie Hingle, Shovito Barua Soumma, Stephanie M. Carpenter.

Figure 2. SenseCF pipeline: LLM-generated counterfactuals are used both for augmenting imbalanced training data (left) and for model …
Figure 3. Counterfactual generation using LLMs from sensor-derived …
Figure 4. Prompt template for counterfactual generation.
Figure 5. Overview of the feature extraction pipeline. Patients are first …
Figure 6. Feature diversity in the generated CFs for AI-Readi data. Avg: …
Figure 7. Impact of LLM-Generated Counterfactual Augmentation Under Class-Specific Label Scarcity
Figure 8. Latent Space Distribution of Factual and Counterfactual …
Original abstract

Counterfactual explanations (CFs) provide human-centric interpretability by identifying the minimal, actionable changes required to alter a machine learning model's prediction. Therefore, CFs can be used as (i) interventions for abnormality prevention and (ii) augmented data for training robust models. We conduct a comprehensive evaluation of CF generation using large language models (LLMs), including GPT-4 (zero-shot and few-shot) and two open-source models, BioMistral-7B and LLaMA-3.1-8B, in both pretrained and fine-tuned configurations. Using the multimodal AI-READI clinical dataset, we assess CFs across three dimensions: intervention quality, feature diversity, and augmentation effectiveness. Fine-tuned LLMs, particularly LLaMA-3.1-8B, produce CFs with high plausibility (up to 99%), strong validity (up to 0.99), and realistic, behaviorally modifiable feature adjustments. When used for data augmentation under controlled label-scarcity settings, LLM-generated CFs substantially restore classifier performance, yielding an average 20% F1 recovery across three scarcity scenarios. Compared with optimization-based baselines such as DiCE, CFNOW, and NICE, LLMs offer a flexible, model-agnostic approach that generates more clinically actionable and semantically coherent counterfactuals. Overall, this work demonstrates the promise of LLM-driven counterfactuals for both interpretable intervention design and data-efficient model training in sensor-based digital health. Impact: SenseCF fine-tunes an LLM to generate valid, representative counterfactual explanations and to supplement the minority class in an imbalanced dataset, improving model training, robustness, and predictive performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript evaluates counterfactual explanation generation using LLMs (GPT-4 zero/few-shot, BioMistral-7B, LLaMA-3.1-8B) in pretrained and fine-tuned settings on the multimodal AI-READI clinical sensor dataset. It claims that fine-tuned LLaMA-3.1-8B produces CFs with up to 99% plausibility and 0.99 validity featuring realistic, behaviorally modifiable adjustments; when used for augmentation under label scarcity, these yield an average 20% F1 recovery across three scenarios and outperform optimization baselines (DiCE, CFNOW, NICE). The work positions the approach as model-agnostic and suitable for both interpretable health interventions and data-efficient training.

Significance. If the reported metrics prove robust under independent clinical validation and the augmentation gains generalize, the work would offer a practical advance in digital health by enabling scalable, semantically coherent counterfactuals that support both intervention design and robustness in imbalanced sensor-data settings. The explicit baseline comparisons and focus on augmentation effectiveness strengthen its applied relevance.

major comments (2)
  1. [Evaluation and Results] The headline validity (0.99) and plausibility (99%) scores for fine-tuned LLaMA-3.1-8B are presented without any description of blinded clinician adjudication or external medical review; the abstract and evaluation rely on automated/proxy checks whose independence from fine-tuning artifacts is unspecified. This is load-bearing because continuous sensor features (glucose, activity) can produce numerically coherent but clinically incoherent CFs that would fail real intervention use.
  2. [Augmentation Experiments] The 20% average F1 recovery is demonstrated only under synthetic label-scarcity on the same AI-READI cohort; no cross-cohort or external validation is reported, leaving open whether the augmentation benefit transfers to new patient populations or deployment settings.
minor comments (2)
  1. Baseline implementations (DiCE, CFNOW, NICE) are referenced but lack explicit hyperparameter settings, distance metrics, or feature constraints used in the comparison.
  2. Notation for the three scarcity scenarios and the exact F1 recovery calculation (e.g., relative to which baseline classifier) should be clarified with a table or equation.
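The ambiguity raised in minor comment 2 can be shown with a short calculation. The sketch below gives two plausible readings of "F1 recovery"; neither is confirmed as the paper's definition, and the confusion-matrix counts are invented for illustration.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 from confusion-matrix counts: 2*TP / (2*TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def absolute_gain(f1_scarce: float, f1_aug: float) -> float:
    """Reading 1: F1 points gained over the label-scarce classifier."""
    return f1_aug - f1_scarce


def recovered_fraction(f1_full: float, f1_scarce: float, f1_aug: float) -> float:
    """Reading 2: share of the F1 lost to scarcity that augmentation wins back."""
    lost = f1_full - f1_scarce
    if lost <= 0:
        raise ValueError("scarcity did not reduce F1; nothing to recover")
    return (f1_aug - f1_scarce) / lost


# Hypothetical counts, not results from the paper.
f1_full = f1_score(tp=90, fp=10, fn=10)    # trained on all labels
f1_scarce = f1_score(tp=60, fp=40, fn=40)  # trained under label scarcity
f1_aug = f1_score(tp=75, fp=25, fn=25)     # scarcity plus counterfactual augmentation

gain = absolute_gain(f1_scarce, f1_aug)
fraction = recovered_fraction(f1_full, f1_scarce, f1_aug)
print(round(gain, 2), round(fraction, 2))
```

With these numbers the same experiment reads as a 0.15 gain under the first definition and a 0.5 recovery under the second, which is why the reference classifier needs to be stated.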

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below with clarifications on our evaluation approach and planned revisions to improve transparency.

point-by-point responses
  1. Referee: [Evaluation and Results] The headline validity (0.99) and plausibility (99%) scores for fine-tuned LLaMA-3.1-8B are presented without any description of blinded clinician adjudication or external medical review; the abstract and evaluation rely on automated/proxy checks whose independence from fine-tuning artifacts is unspecified. This is load-bearing because continuous sensor features (glucose, activity) can produce numerically coherent but clinically incoherent CFs that would fail real intervention use.

    Authors: We agree that automated metrics alone are insufficient for full clinical claims. Validity is defined in Section 4.2 as the rate at which the counterfactual flips the downstream classifier prediction, while plausibility uses a combination of feature-range constraints and embedding-based semantic similarity to the original distribution. These are independent of the fine-tuning objective in the sense that they are computed post-generation using held-out validation data and fixed physiological bounds, but we acknowledge they are proxies. We will revise the abstract, Section 4, and add an explicit limitations paragraph stating the absence of blinded clinician review and the risk of numerically coherent but clinically implausible outputs. This is a partial revision: we will clarify scope and add the limitation statement without new experiments.

    Revision: partial

  2. Referee: [Augmentation Experiments] The 20% average F1 recovery is demonstrated only under synthetic label-scarcity on the same AI-READI cohort; no cross-cohort or external validation is reported, leaving open whether the augmentation benefit transfers to new patient populations or deployment settings.

    Authors: The augmentation experiments in Section 5.3 are performed on the AI-READI cohort under controlled synthetic scarcity to isolate the effect of counterfactual augmentation. We do not claim generalization to new cohorts in the manuscript. We will expand the discussion section to explicitly note this limitation and identify cross-cohort validation as important future work. The reported 20% F1 recovery remains accurate for the evaluated setting. This constitutes a partial revision focused on improved transparency rather than new empirical results.

    Revision: partial
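The two proxy metrics described in the first response can be sketched as below. This is a simplified illustration under stated assumptions: the bounds are placeholders rather than clinical reference ranges, the classifier is a toy threshold, and the embedding-based similarity component of plausibility is omitted entirely.

```python
from typing import Callable, Dict, List, Tuple

Sample = Dict[str, float]
Bounds = Dict[str, Tuple[float, float]]


def validity_rate(clf: Callable[[Sample], int],
                  pairs: List[Tuple[Sample, Sample]]) -> float:
    """Fraction of (factual, counterfactual) pairs whose predicted label flips."""
    flipped = sum(clf(cf) != clf(x) for x, cf in pairs)
    return flipped / len(pairs)


def range_plausibility(cfs: List[Sample], bounds: Bounds) -> float:
    """Fraction of counterfactuals with every bounded feature inside its range."""
    ok = sum(all(lo <= cf[name] <= hi for name, (lo, hi) in bounds.items())
             for cf in cfs)
    return ok / len(cfs)


# Hypothetical classifier and bounds, for illustration only.
clf = lambda x: int(x["mean_glucose"] > 140)
bounds = {"mean_glucose": (40.0, 400.0), "daily_steps": (0.0, 50000.0)}

pairs = [
    ({"mean_glucose": 160.0, "daily_steps": 2000.0},
     {"mean_glucose": 130.0, "daily_steps": 2000.0}),   # flips, in range
    ({"mean_glucose": 150.0, "daily_steps": 3000.0},
     {"mean_glucose": 145.0, "daily_steps": 60000.0}),  # no flip, steps out of range
    ({"mean_glucose": 170.0, "daily_steps": 1000.0},
     {"mean_glucose": 20.0, "daily_steps": 1000.0}),    # flips, glucose out of range
    ({"mean_glucose": 155.0, "daily_steps": 4000.0},
     {"mean_glucose": 135.0, "daily_steps": 8000.0}),   # flips, in range
]

validity = validity_rate(clf, pairs)
plausibility = range_plausibility([cf for _, cf in pairs], bounds)
print(validity, plausibility)
```

The third pair illustrates the referee's concern: a counterfactual can flip the prediction (and so count as valid) while proposing a value no patient could safely reach, which a range check catches only if the bounds are clinically meaningful.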

Circularity Check

0 steps flagged

No significant circularity; empirical results rest on external baselines and held-out metrics

full rationale

The paper evaluates fine-tuned LLMs (LLaMA-3.1-8B etc.) on the AI-READI dataset by generating counterfactuals, measuring plausibility/validity via standard metrics, and testing augmentation under label scarcity against independent baselines (DiCE, CFNOW, NICE). Fine-tuning uses the dataset but performance is quantified on held-out evaluations and external comparators rather than reducing to fitted parameters by construction. No self-definitional equations, renamed predictions, or load-bearing self-citations appear in the derivation; the chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper rests on standard ML assumptions about counterfactual validity and the benefits of synthetic augmentation; no new entities or ad-hoc parameters are introduced beyond typical fine-tuning.

axioms (1)
  • Domain assumption: Fine-tuned LLMs produce more plausible and valid counterfactuals than optimization baselines on health sensor data.
    Invoked in the evaluation of intervention quality and augmentation effectiveness.



