pith. machine review for the scientific record.

arXiv: 2605.14113 · v1 · submitted 2026-05-13 · 💻 cs.CV · cs.AI · cs.LG · cs.MA

Recognition: 2 theorem links

ProtoMedAgent: Multimodal Clinical Interpretability via Privacy-Aware Agentic Workflows

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 05:05 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG · cs.MA
keywords ProtoMedAgent · clinical interpretability · prototype networks · neuro-symbolic bottleneck · RAG hallucinations · privacy preservation · multimodal medical reporting · agentic workflows

The pith

ProtoMedAgent constrains LLM clinical reports to a neuro-symbolic bottleneck to reach 91.2% faithfulness while cutting membership inference risks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ProtoMedAgent to generate semantic medical documentation from prototype networks without the post-hoc rationalizations that plague standard retrieval-augmented generation. It treats report creation as an iterative zero-gradient optimization over a frozen backbone, distilling features into discrete memory and enforcing every sentence through set-theoretic rules and a Scribe-Critic loop. On a 4,160-patient cohort this produces 91.2 percent Comparison Set Faithfulness, more than double the 46.2 percent of unconstrained RAG, while an ℓ-diversity privacy gate lowers artifact-level membership inference by 9.8 percent. A sympathetic reader cares because the approach directly tackles the gap between accurate visual predictions and trustworthy, private clinical text that physicians can actually use.

Core claim

ProtoMedAgent formalizes multimodal clinical reporting as zero-gradient test-time optimization over a strict neuro-symbolic bottleneck on a frozen prototype backbone. It distills latent features into discrete semantic memory and constrains online generation with exact set-theoretic differentials and a reflective Scribe-Critic loop, mathematically precluding unsupported claims. On a 4,160-patient cohort this yields 91.2 percent Comparison Set Faithfulness, while a binding ℓ-diversity phase transition reduces artifact-level membership inference risk by an absolute 9.8 percent.

What carries the argument

The neuro-symbolic bottleneck enforced by iterative zero-gradient test-time optimization, set-theoretic differentials, and the Scribe-Critic loop, augmented by a semantic privacy gate that applies k-anonymity and ℓ-diversity.
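The paper does not reproduce the differential computation in this summary, but a minimal sketch suggests what "exact set-theoretic differentials" could look like, assuming each case and prototype is distilled to a set of discrete semantic attributes. All attribute names and the differential layout below are illustrative assumptions, not the authors' definitions:

```python
# Hypothetical sketch: discrete semantic memory as attribute sets.
# Attribute names and the differential structure are assumptions,
# not taken from the paper.

def set_differentials(query: set, prototype: set) -> dict:
    """Exact set operations between a query case and one retrieved prototype."""
    return {
        "shared": query & prototype,      # evidence both cases support
        "query_only": query - prototype,  # findings the prototype lacks
        "proto_only": prototype - query,  # prototype traits absent in query
    }

query = {"low_bmd", "lumbar_l1_l4", "female", "age_70s"}
proto = {"low_bmd", "lumbar_l1_l4", "female", "age_60s"}

d = set_differentials(query, proto)
assert d["shared"] == {"low_bmd", "lumbar_l1_l4", "female"}
assert d["query_only"] == {"age_70s"}

# Any narrative claim may cite only attributes present in these sets;
# a claim outside their union is, by construction, unsupported.
allowed = d["shared"] | d["query_only"] | d["proto_only"]
assert "fracture_history" not in allowed
```

Under this reading, "exact" means the constraint is pure set algebra on discrete symbols, with no learned scoring involved at generation time.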

If this is right

  • Clinical reports become reliably grounded in prototype predictions without sycophantic rationalizations that misalign with visual evidence.
  • Privacy protection is achieved through controlled disclosure that still permits diagnostically useful detail.
  • The framework applies directly to existing frozen prototype models without retraining or gradient updates.
  • Faithfulness scores improve dramatically over unconstrained LLM generation on the same clinical cohort.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same constrained-optimization pattern could be tested in other high-stakes domains where outputs must stay strictly derivable from structured evidence.
  • The observed ℓ-diversity phase transition points to a general mechanism for trading off privacy and utility that might apply beyond medical imaging.
  • Real-world validation would need to measure whether the 91.2 percent faithfulness holds across patient populations with different demographic distributions.

Load-bearing premise

The Scribe-Critic loop and neuro-symbolic constraints can mathematically preclude unsupported narrative claims in all cases, and the semantic privacy gate bounds disclosure without compromising report utility.

What would settle it

A generated report containing at least one narrative claim that cannot be derived from the exact set-theoretic differentials of the prototype features, or an experiment showing that patient identity can still be inferred at rates higher than the reported 9.8 percent reduction despite the ℓ-diversity controls.

Figures

Figures reproduced from arXiv: 2605.14113 by Alvaro Lopez Pellicer, Eduardo Soares, Jemma Kerns, Marwan Bukhari, Plamen Angelov, Yi Li.

Figure 1
Figure 1: Overview of an Agentic Clinical Report Generation framework. A multimodal patient case, comprising a lumbar DEXA scan and a patient record, is first processed by a frozen prototype backbone that retrieves raw visual exemplars and tabular statistics. Reconciling these outputs through unconstrained prototype semantics via a generic LLM yields an ungrounded, hallucination-prone clinical report. In contrast, b… view at source ↗
Figure 2
Figure 2: Detailed architecture of ProtoMedAgent. view at source ↗
Figure 3
Figure 3: Privacy-utility frontier of the exportable prototype evidence surface. Sweeping semantic k-anonymity (k ∈ {3, 5, 7, 9}) and ℓ-diversity (ℓ ∈ {1, 2, 3}). (a) Under ℓ = 1 (orange), increasing k provides a smooth trade-off between utility and exposure. Imposing ℓ = 2 (green) triggers a phase transition, collapsing all k values to a single bounded regime, indicating diversity binds before k-anonymity. (b) The … view at source ↗
Figure 4
Figure 4: Final ProtoMedAgent qualitative outputs. Each row pairs a final ProtoMedAgent report at left with the displayed same-class ProtoCard reference at right. The header line shows the query summary and compact ProtoMedX output; p1 is the ProtoCard shown beside the report, while p2/p3 summarize the remaining retrieved neighborhood. Here N = NORMAL, Op = OSTEOPENIA, and Os = OSTEOPOROSIS. The three examples show … view at source ↗
Original abstract

While interpretable prototype networks offer compelling case-based reasoning for clinical diagnostics, their raw continuous outputs lack the semantic structure required for medical documentation. Bridging this gap via standard Retrieval-Augmented Generation (RAG) routinely triggers ``retrieval sycophancy,'' where Large Language Models (LLMs) hallucinate post-hoc rationalizations to align with visual predictions. We introduce ProtoMedAgent, a framework that formalizes multimodal clinical reporting as an iterative, zero-gradient test-time optimization problem over a strict neuro-symbolic bottleneck. Operating on a frozen prototype backbone, we distill latent visual and tabular features into a discrete semantic memory. Online generation is strictly constrained by exact set-theoretic differentials and a reflective Scribe-Critic loop, mathematically precluding unsupported narrative claims. To safely bound data disclosure, we introduce a semantic privacy gate governed by $k$-anonymity and $\ell$-diversity. Evaluated on a 4,160-patient clinical cohort, ProtoMedAgent achieves 91.2\% Comparison Set Faithfulness where it fundamentally outperforms standard RAG (46.2\%). ProtoMedAgent additionally leverages a binding $\ell$-diversity phase transition to systematically reduce artifact-level membership inference risks by an absolute 9.8\%.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces ProtoMedAgent, a framework that formalizes multimodal clinical reporting as an iterative zero-gradient test-time optimization over a neuro-symbolic bottleneck on a frozen prototype backbone. Latent visual and tabular features are distilled into discrete semantic memory, with generation strictly constrained by set-theoretic differentials and a reflective Scribe-Critic loop that is claimed to mathematically preclude unsupported narrative claims. A semantic privacy gate based on k-anonymity and ℓ-diversity is added to bound disclosure. On a 4,160-patient cohort, the system reports 91.2% Comparison Set Faithfulness (vs. 46.2% for standard RAG) and an absolute 9.8% reduction in artifact-level membership inference risk via a binding ℓ-diversity phase transition.

Significance. If the central mathematical guarantee holds and the faithfulness/privacy metrics are independently validated, the work would offer a concrete advance in combining prototype-based case reasoning with controlled LLM generation for clinical documentation. The reported performance delta and privacy reduction would be practically relevant for reducing hallucination while preserving interpretability and meeting regulatory constraints on data disclosure.

major comments (3)
  1. [Abstract] Abstract: the claim that the Scribe-Critic loop together with exact set-theoretic differentials 'mathematically preclud[es] unsupported narrative claims' is load-bearing for the 91.2% vs. 46.2% faithfulness result, yet no theorem, invariant, soundness/completeness argument, or exhaustive case analysis is supplied showing that every generated token is confined to the discrete semantic memory for arbitrary prototype feature combinations.
  2. [Evaluation] Evaluation section: the Comparison Set Faithfulness metric is defined with reference to the system's own outputs and comparisons; this circularity must be addressed by an independent ground-truth annotation protocol or human evaluation protocol before the headline delta can be accepted as evidence of superiority over RAG.
  3. [Privacy Mechanism] Privacy Mechanism: the 9.8% absolute reduction in artifact-level membership inference risk is attributed to the binding ℓ-diversity phase transition, but no explicit measurement protocol, baseline comparison, or statistical test is described that would confirm the reduction is not an artifact of the metric's internal definition.
minor comments (2)
  1. [Method] The free parameters k (k-anonymity) and ℓ (ℓ-diversity) are listed as free but their concrete values and sensitivity analysis are not reported; add these to the experimental section.
  2. [Method] The iterative zero-gradient test-time optimization and Scribe-Critic loop would benefit from a pseudocode listing or explicit algorithmic description to support reproducibility.
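The requested pseudocode can be anticipated with a hedged sketch. The control flow below is inferred from the abstract's description (no gradient updates; a Scribe that drafts and a Critic that rejects claims not in the semantic memory); every function name and the toy claim extractor are hypothetical, not the paper's implementation:

```python
# Illustrative Scribe-Critic loop: no gradient updates, only iterated
# draft-and-check against a fixed discrete semantic memory.
# All names here are hypothetical stand-ins.

def extract_claims(report: str) -> set:
    # Stand-in claim extractor: treat each semicolon-separated span as an
    # atomic claim. A real system would parse clinical sentences properly.
    return {tok.strip() for tok in report.lower().split(";") if tok.strip()}

def scribe(memory: set, rejected: set) -> str:
    # Stand-in generator: emit only memory items not previously rejected.
    return "; ".join(sorted(memory - rejected))

def critic(report: str, memory: set) -> set:
    # Return unsupported claims; an empty set means the report passes.
    return extract_claims(report) - memory

def scribe_critic_loop(memory: set, max_rounds: int = 3) -> str:
    rejected: set = set()
    report = scribe(memory, rejected)
    for _ in range(max_rounds):
        violations = critic(report, memory)
        if not violations:
            return report  # every remaining claim is memory-derivable
        rejected |= violations
        report = scribe(memory, rejected)
    return report

memory = {"osteopenia", "lumbar spine", "t-score -1.8"}
final = scribe_critic_loop(memory)
assert critic(final, memory) == set()
```

Note that in this toy loop the preclusion guarantee holds only because the stand-in Scribe draws directly from memory; with a free-form LLM Scribe, the guarantee would rest entirely on the Critic's claim extraction being sound, which is exactly the gap the major comment identifies.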

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, indicating where revisions will be made to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that the Scribe-Critic loop together with exact set-theoretic differentials 'mathematically preclud[es] unsupported narrative claims' is load-bearing for the 91.2% vs. 46.2% faithfulness result, yet no theorem, invariant, soundness/completeness argument, or exhaustive case analysis is supplied showing that every generated token is confined to the discrete semantic memory for arbitrary prototype feature combinations.

    Authors: We agree that the abstract claim requires formal support. In the revised manuscript we will add a dedicated subsection in the Methods section providing a soundness argument: we prove that the exact set-theoretic differentials restrict the token vocabulary to the discrete semantic memory, and that the Scribe-Critic loop enforces an invariant that no token outside this memory can be emitted. The proof will be accompanied by an exhaustive case analysis covering representative prototype feature combinations. revision: yes

  2. Referee: [Evaluation] Evaluation section: the Comparison Set Faithfulness metric is defined with reference to the system's own outputs and comparisons; this circularity must be addressed by an independent ground-truth annotation protocol or human evaluation protocol before the headline delta can be accepted as evidence of superiority over RAG.

    Authors: The referee correctly notes the risk of circularity. We will revise the Evaluation section to include an independent human evaluation protocol: two board-certified clinicians will annotate faithfulness on a stratified random sample of 200 generated reports against the original clinical notes (ground truth). We will report inter-annotator agreement (Cohen's kappa) and the resulting faithfulness scores to corroborate the automated 91.2% figure. revision: yes
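Cohen's kappa, which the rebuttal proposes for inter-annotator agreement, is a standard quantity and can be computed directly. A minimal stdlib implementation for binary faithful/unfaithful labels (the annotation vectors below are fabricated for illustration only):

```python
# Cohen's kappa for two annotators over the same items. Binary labels
# here, but the formula is label-agnostic.

from collections import Counter

def cohens_kappa(a, b):
    assert len(a) == len(b) and a
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n           # observed agreement
    ca, cb = Counter(a), Counter(b)
    labels = set(a) | set(b)
    p_e = sum((ca[l] / n) * (cb[l] / n) for l in labels)  # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Two clinicians labelling 10 reports as faithful (1) or not (0):
# illustrative data, not from the paper.
r1 = [1, 1, 1, 0, 0, 1, 1, 0, 1, 1]
r2 = [1, 1, 0, 0, 0, 1, 1, 1, 1, 1]
k = cohens_kappa(r1, r2)   # ~0.52: moderate agreement, above chance
assert 0 < k < 1
```

A kappa well above roughly 0.6-0.8 would be the usual bar for trusting the human corroboration of the automated 91.2% figure.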

  3. Referee: [Privacy Mechanism] Privacy Mechanism: the 9.8% absolute reduction in artifact-level membership inference risk is attributed to the binding ℓ-diversity phase transition, but no explicit measurement protocol, baseline comparison, or statistical test is described that would confirm the reduction is not an artifact of the metric's internal definition.

    Authors: We acknowledge that the current description of the membership-inference evaluation lacks sufficient detail. In the revision we will expand the Privacy Analysis subsection to specify the full protocol: a shadow-model membership inference attack with 5-fold cross-validation, standard RAG as the explicit baseline, and a paired t-test (p < 0.01) to establish significance of the 9.8% reduction. Pseudocode for the attack and evaluation pipeline will be added. revision: partial
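At evaluation time, any shadow-model protocol of this kind reduces to scoring an attack over member and non-member artifacts. A toy sketch of the advantage metric follows; the threshold attack and all scores are fabricated for illustration, not drawn from the paper's experiments:

```python
# Toy confidence-threshold membership-inference evaluation.
# member_scores / nonmember_scores would come from a real shadow-model
# attack; here they are fabricated to illustrate the metric only.

def attack_advantage(member_scores, nonmember_scores, threshold):
    tpr = sum(s >= threshold for s in member_scores) / len(member_scores)
    fpr = sum(s >= threshold for s in nonmember_scores) / len(nonmember_scores)
    return tpr - fpr   # 0 = chance-level attack, 1 = perfect attack

members    = [0.91, 0.88, 0.95, 0.70, 0.85]
nonmembers = [0.60, 0.72, 0.55, 0.88, 0.64]

adv = attack_advantage(members, nonmembers, threshold=0.80)
assert 0 <= adv <= 1
# An "absolute 9.8% reduction" would mean this advantage (or attack
# accuracy) drops by 0.098 with the privacy gate enabled versus without.
```

Reporting the full attack advantage curve across thresholds, rather than a single operating point, would make the 9.8% figure harder to attribute to metric internals.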

Circularity Check

3 steps flagged

Faithfulness metric and ℓ-diversity phase transition reduce to self-defined constructs; preclusion claim lacks independent theorem

specific steps
  1. self definitional [Abstract]
    "Online generation is strictly constrained by exact set-theoretic differentials and a reflective Scribe-Critic loop, mathematically precluding unsupported narrative claims."

    The preclusion is presented as a direct mathematical consequence of the constraints and loop that the framework itself defines and enforces; no separate theorem, invariant, or exhaustive verification is supplied to show soundness beyond the definition of the neuro-symbolic bottleneck.

  2. self definitional [Abstract]
    "ProtoMedAgent additionally leverages a binding ℓ-diversity phase transition to systematically reduce artifact-level membership inference risks by an absolute 9.8%."

    The 'binding ℓ-diversity phase transition' is introduced by the paper as part of its semantic privacy gate; the specific 9.8% risk reduction is then attributed directly to this transition, making the reported gain a consequence of how the phase transition is defined and applied within the same system.

  3. fitted input called prediction [Abstract]
    "ProtoMedAgent achieves 91.2% Comparison Set Faithfulness where it fundamentally outperforms standard RAG (46.2%)."

    Comparison Set Faithfulness is measured against the system's own distilled prototype features and discrete semantic memory; the large delta versus RAG is therefore produced by construction of the evaluation metric and the neuro-symbolic constraints rather than an independent external benchmark.

full rationale

The derivation chain centers on the neuro-symbolic bottleneck and Scribe-Critic loop being asserted to 'mathematically preclude' unsupported claims, with performance (91.2% vs 46.2%) and privacy reduction (9.8%) tied to internally introduced mechanisms like the binding ℓ-diversity phase transition and Comparison Set Faithfulness. These reduce to the paper's own definitions and constraints without external theorem or independent validation shown in the abstract. This produces partial circularity (score 6) where load-bearing guarantees are by construction of the introduced components rather than derived from prior independent results.
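The circularity concern can be made concrete. If Comparison Set Faithfulness is, as the audit's reading assumes (the paper's formal definition is not reproduced here), the fraction of report claims found in the system's own evidence set, then a generator constrained to that set scores highly by construction:

```python
# Illustrative faithfulness metric under the audit's reading: the share
# of report claims contained in the comparison (evidence) set. This
# definition is an assumption; the paper's formal metric may differ.

def comparison_set_faithfulness(claims, evidence):
    return sum(c in evidence for c in claims) / len(claims)

evidence = {"osteopenia", "t-score -1.8", "lumbar spine"}

constrained = ["osteopenia", "t-score -1.8"]            # claims within evidence
free_form   = ["osteopenia", "fracture risk elevated"]  # one unsupported claim

assert comparison_set_faithfulness(constrained, evidence) == 1.0
assert comparison_set_faithfulness(free_form, evidence) == 0.5
# A decoder that can only emit evidence items scores 1.0 regardless of
# clinical quality; the delta over RAG needs an external ground truth.
```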

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 2 invented entities

The approach relies on several new invented components and domain assumptions about the effectiveness of the constraints.

free parameters (2)
  • k in k-anonymity
    Parameter for the semantic privacy gate; specific value not provided in the abstract.
  • ℓ in ℓ-diversity
    Diversity parameter for the privacy gate; value not specified.
axioms (2)
  • domain assumption The prototype backbone remains frozen and provides stable latent features for distillation.
    The framework operates on a frozen prototype backbone as stated.
  • ad hoc to paper Set-theoretic differentials can enforce strict constraints on generated narratives.
    Central to the claim of precluding unsupported claims.
invented entities (2)
  • Scribe-Critic loop no independent evidence
    purpose: To reflectively ensure generated reports are supported by the semantic memory.
    Introduced as part of the online generation process.
  • semantic privacy gate no independent evidence
    purpose: To enforce k-anonymity and ℓ-diversity for bounding data disclosure.
    New component for privacy awareness.
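The semantic privacy gate is described only at the level of its k-anonymity and ℓ-diversity parameters. A minimal check of both properties over exported evidence records can be sketched as follows, with the record layout and field names assumed for illustration:

```python
# Hypothetical privacy-gate check: group exported records by their
# quasi-identifier tuple, then require every equivalence class to hold
# at least k records (k-anonymity) and at least l distinct sensitive
# values (l-diversity). The field layout is an assumption, not the
# paper's actual evidence schema.

from collections import defaultdict

def gate_passes(records, quasi_keys, sensitive_key, k, l):
    classes = defaultdict(list)
    for r in records:
        classes[tuple(r[q] for q in quasi_keys)].append(r[sensitive_key])
    return all(len(vals) >= k and len(set(vals)) >= l
               for vals in classes.values())

records = [
    {"age_band": "60-69", "sex": "F", "dx": "osteopenia"},
    {"age_band": "60-69", "sex": "F", "dx": "osteoporosis"},
    {"age_band": "60-69", "sex": "F", "dx": "normal"},
]
# One equivalence class of size 3 with 3 distinct diagnoses:
# passes k=3, l=2 ...
assert gate_passes(records, ("age_band", "sex"), "dx", k=3, l=2)
# ... but fails k=5, within the paper's swept regime k in {3, 5, 7, 9}.
assert not gate_passes(records, ("age_band", "sex"), "dx", k=5, l=2)
```

Figure 3's reported phase transition corresponds to the ℓ constraint, not k, becoming the binding condition in this kind of check once ℓ ≥ 2.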

pith-pipeline@v0.9.0 · 5544 in / 1645 out tokens · 71663 ms · 2026-05-15T05:05:37.166680+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

    Relation between the paper passage and the cited Recognition theorem: unclear.

    Matched passage: formalizes multimodal clinical reporting as an iterative, zero-gradient test-time optimization problem over a strict neuro-symbolic bottleneck... exact set-theoretic differentials... Scribe-Critic loop, mathematically precluding unsupported narrative claims... semantic privacy gate governed by k-anonymity and ℓ-diversity

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
