pith. machine review for the scientific record.

arXiv: 2605.14113 · v1 · submitted 2026-05-13 · 💻 cs.CV · cs.AI · cs.LG · cs.MA

Recognition: 2 theorem links

ProtoMedAgent: Multimodal Clinical Interpretability via Privacy-Aware Agentic Workflows

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 05:05 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG · cs.MA
keywords ProtoMedAgent · clinical interpretability · prototype networks · neuro-symbolic bottleneck · RAG hallucinations · privacy preservation · multimodal medical reporting · agentic workflows

The pith

ProtoMedAgent constrains LLM clinical reports to a neuro-symbolic bottleneck to reach 91.2% faithfulness while cutting membership inference risks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ProtoMedAgent to generate semantic medical documentation from prototype networks without the post-hoc rationalizations that plague standard retrieval-augmented generation. It treats report creation as an iterative zero-gradient optimization over a frozen backbone, distilling features into discrete memory and enforcing every sentence through set-theoretic rules and a Scribe-Critic loop. On a 4,160-patient cohort this produces 91.2 percent Comparison Set Faithfulness, more than double the 46.2 percent of unconstrained RAG, while an ℓ-diversity privacy gate lowers artifact-level membership inference by 9.8 percent. A sympathetic reader cares because the approach directly tackles the gap between accurate visual predictions and trustworthy, private clinical text that physicians can actually use.

Core claim

ProtoMedAgent formalizes multimodal clinical reporting as zero-gradient test-time optimization over a strict neuro-symbolic bottleneck on a frozen prototype backbone. It distills latent features into discrete semantic memory and constrains online generation with exact set-theoretic differentials and a reflective Scribe-Critic loop, mathematically precluding unsupported claims. On a 4,160-patient cohort this yields 91.2 percent Comparison Set Faithfulness, while a binding ℓ-diversity phase transition reduces artifact-level membership inference risk by an absolute 9.8 percent.

What carries the argument

The neuro-symbolic bottleneck enforced by iterative zero-gradient test-time optimization, set-theoretic differentials, and the Scribe-Critic loop, augmented by a semantic privacy gate that applies k-anonymity and ℓ-diversity.
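The paper does not reproduce the differential computation in this summary, but a minimal sketch suggests what "exact set-theoretic differentials" could look like, assuming each case and prototype is distilled to a set of discrete semantic attributes. All attribute names and the differential layout below are illustrative assumptions, not the authors' definitions:

```python
# Hypothetical sketch: discrete semantic memory as attribute sets.
# Attribute names and the differential structure are assumptions,
# not taken from the paper.

def set_differentials(query: set, prototype: set) -> dict:
    """Exact set operations between a query case and one retrieved prototype."""
    return {
        "shared": query & prototype,      # evidence both cases support
        "query_only": query - prototype,  # findings the prototype lacks
        "proto_only": prototype - query,  # prototype traits absent in query
    }

query = {"low_bmd", "lumbar_l1_l4", "female", "age_70s"}
proto = {"low_bmd", "lumbar_l1_l4", "female", "age_60s"}

d = set_differentials(query, proto)
assert d["shared"] == {"low_bmd", "lumbar_l1_l4", "female"}
assert d["query_only"] == {"age_70s"}

# Any narrative claim may cite only attributes present in these sets;
# a claim outside their union is, by construction, unsupported.
allowed = d["shared"] | d["query_only"] | d["proto_only"]
assert "fracture_history" not in allowed
```

Under this reading, "exact" means the constraint is pure set algebra on discrete symbols, with no learned scoring involved at generation time.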

If this is right

  • Clinical reports become reliably grounded in prototype predictions without sycophantic rationalizations that misalign with visual evidence.
  • Privacy protection is achieved through controlled disclosure that still permits diagnostically useful detail.
  • The framework applies directly to existing frozen prototype models without retraining or gradient updates.
  • Faithfulness scores improve dramatically over unconstrained LLM generation on the same clinical cohort.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same constrained-optimization pattern could be tested in other high-stakes domains where outputs must stay strictly derivable from structured evidence.
  • The observed ℓ-diversity phase transition points to a general mechanism for trading off privacy and utility that might apply beyond medical imaging.
  • Real-world validation would need to measure whether the 91.2 percent faithfulness holds across patient populations with different demographic distributions.

Load-bearing premise

The Scribe-Critic loop and neuro-symbolic constraints can mathematically preclude unsupported narrative claims in all cases, and the semantic privacy gate bounds disclosure without compromising report utility.

What would settle it

A generated report containing at least one narrative claim that cannot be derived from the exact set-theoretic differentials of the prototype features, or an experiment showing that patient identity can still be inferred at rates higher than the reported 9.8 percent reduction despite the ℓ-diversity controls.

Figures

Figures reproduced from arXiv: 2605.14113 by Alvaro Lopez Pellicer, Eduardo Soares, Jemma Kerns, Marwan Bukhari, Plamen Angelov, Yi Li.

Figure 1
Figure 1: Overview of an Agentic Clinical Report Generation framework. A multimodal patient case, comprising a lumbar DEXA scan and a patient record, is first processed by a frozen prototype backbone that retrieves raw visual exemplars and tabular statistics. Reconciling these outputs through unconstrained prototype semantics via a generic LLM yields an ungrounded, hallucination-prone clinical report. In contrast, b… view at source ↗
Figure 2
Figure 2: Detailed architecture of ProtoMedAgent. view at source ↗
Figure 3
Figure 3: Privacy-utility frontier of the exportable prototype evidence surface. Sweeping semantic k-anonymity (k ∈ {3, 5, 7, 9}) and ℓ-diversity (ℓ ∈ {1, 2, 3}). (a) Under ℓ = 1 (orange), increasing k provides a smooth trade-off between utility and exposure. Imposing ℓ = 2 (green) triggers a phase transition, collapsing all k values to a single bounded regime, indicating diversity binds before k-anonymity. (b) The … view at source ↗
Figure 4
Figure 4: Final ProtoMedAgent qualitative outputs. Each row pairs a final ProtoMedAgent report at left with the displayed same-class ProtoCard reference at right. The header line shows the query summary and compact ProtoMedX output; p1 is the ProtoCard shown beside the report, while p2/p3 summarize the remaining retrieved neighborhood. Here N = NORMAL, Op = OSTEOPENIA, and Os = OSTEOPOROSIS. The three examples show … view at source ↗
Original abstract

While interpretable prototype networks offer compelling case-based reasoning for clinical diagnostics, their raw continuous outputs lack the semantic structure required for medical documentation. Bridging this gap via standard Retrieval-Augmented Generation (RAG) routinely triggers ``retrieval sycophancy,'' where Large Language Models (LLMs) hallucinate post-hoc rationalizations to align with visual predictions. We introduce ProtoMedAgent, a framework that formalizes multimodal clinical reporting as an iterative, zero-gradient test-time optimization problem over a strict neuro-symbolic bottleneck. Operating on a frozen prototype backbone, we distill latent visual and tabular features into a discrete semantic memory. Online generation is strictly constrained by exact set-theoretic differentials and a reflective Scribe-Critic loop, mathematically precluding unsupported narrative claims. To safely bound data disclosure, we introduce a semantic privacy gate governed by $k$-anonymity and $\ell$-diversity. Evaluated on a 4,160-patient clinical cohort, ProtoMedAgent achieves 91.2\% Comparison Set Faithfulness where it fundamentally outperforms standard RAG (46.2\%). ProtoMedAgent additionally leverages a binding $\ell$-diversity phase transition to systematically reduce artifact-level membership inference risks by an absolute 9.8\%.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces ProtoMedAgent, a framework that formalizes multimodal clinical reporting as an iterative zero-gradient test-time optimization over a neuro-symbolic bottleneck on a frozen prototype backbone. Latent visual and tabular features are distilled into discrete semantic memory, with generation strictly constrained by set-theoretic differentials and a reflective Scribe-Critic loop that is claimed to mathematically preclude unsupported narrative claims. A semantic privacy gate based on k-anonymity and ℓ-diversity is added to bound disclosure. On a 4,160-patient cohort, the system reports 91.2% Comparison Set Faithfulness (vs. 46.2% for standard RAG) and an absolute 9.8% reduction in artifact-level membership inference risk via a binding ℓ-diversity phase transition.

Significance. If the central mathematical guarantee holds and the faithfulness/privacy metrics are independently validated, the work would offer a concrete advance in combining prototype-based case reasoning with controlled LLM generation for clinical documentation. The reported performance delta and privacy reduction would be practically relevant for reducing hallucination while preserving interpretability and meeting regulatory constraints on data disclosure.

major comments (3)
  1. [Abstract] Abstract: the claim that the Scribe-Critic loop together with exact set-theoretic differentials 'mathematically preclud[es] unsupported narrative claims' is load-bearing for the 91.2% vs. 46.2% faithfulness result, yet no theorem, invariant, soundness/completeness argument, or exhaustive case analysis is supplied showing that every generated token is confined to the discrete semantic memory for arbitrary prototype feature combinations.
  2. [Evaluation] Evaluation section: the Comparison Set Faithfulness metric is defined with reference to the system's own outputs and comparisons; this circularity must be addressed by an independent ground-truth annotation protocol or human evaluation protocol before the headline delta can be accepted as evidence of superiority over RAG.
  3. [Privacy Mechanism] Privacy Mechanism: the 9.8% absolute reduction in artifact-level membership inference risk is attributed to the binding ℓ-diversity phase transition, but no explicit measurement protocol, baseline comparison, or statistical test is described that would confirm the reduction is not an artifact of the metric's internal definition.
minor comments (2)
  1. [Method] The free parameters k (k-anonymity) and ℓ (ℓ-diversity) are listed as free but their concrete values and sensitivity analysis are not reported; add these to the experimental section.
  2. [Method] The iterative zero-gradient test-time optimization and Scribe-Critic loop would benefit from a pseudocode listing or explicit algorithmic description to support reproducibility.
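The requested pseudocode can be anticipated with a hedged sketch. The control flow below is inferred from the abstract's description (no gradient updates; a Scribe that drafts and a Critic that rejects claims not in the semantic memory); every function name and the toy claim extractor are hypothetical, not the paper's implementation:

```python
# Illustrative Scribe-Critic loop: no gradient updates, only iterated
# draft-and-check against a fixed discrete semantic memory.
# All names here are hypothetical stand-ins.

def extract_claims(report: str) -> set:
    # Stand-in claim extractor: treat each semicolon-separated span as an
    # atomic claim. A real system would parse clinical sentences properly.
    return {tok.strip() for tok in report.lower().split(";") if tok.strip()}

def scribe(memory: set, rejected: set) -> str:
    # Stand-in generator: emit only memory items not previously rejected.
    return "; ".join(sorted(memory - rejected))

def critic(report: str, memory: set) -> set:
    # Return unsupported claims; an empty set means the report passes.
    return extract_claims(report) - memory

def scribe_critic_loop(memory: set, max_rounds: int = 3) -> str:
    rejected: set = set()
    report = scribe(memory, rejected)
    for _ in range(max_rounds):
        violations = critic(report, memory)
        if not violations:
            return report  # every remaining claim is memory-derivable
        rejected |= violations
        report = scribe(memory, rejected)
    return report

memory = {"osteopenia", "lumbar spine", "t-score -1.8"}
final = scribe_critic_loop(memory)
assert critic(final, memory) == set()
```

Note that in this toy loop the preclusion guarantee holds only because the stand-in Scribe draws directly from memory; with a free-form LLM Scribe, the guarantee would rest entirely on the Critic's claim extraction being sound, which is exactly the gap the major comment identifies.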

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, indicating where revisions will be made to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that the Scribe-Critic loop together with exact set-theoretic differentials 'mathematically preclud[es] unsupported narrative claims' is load-bearing for the 91.2% vs. 46.2% faithfulness result, yet no theorem, invariant, soundness/completeness argument, or exhaustive case analysis is supplied showing that every generated token is confined to the discrete semantic memory for arbitrary prototype feature combinations.

    Authors: We agree that the abstract claim requires formal support. In the revised manuscript we will add a dedicated subsection in the Methods section providing a soundness argument: we prove that the exact set-theoretic differentials restrict the token vocabulary to the discrete semantic memory, and that the Scribe-Critic loop enforces an invariant that no token outside this memory can be emitted. The proof will be accompanied by an exhaustive case analysis covering representative prototype feature combinations. revision: yes

  2. Referee: [Evaluation] Evaluation section: the Comparison Set Faithfulness metric is defined with reference to the system's own outputs and comparisons; this circularity must be addressed by an independent ground-truth annotation protocol or human evaluation protocol before the headline delta can be accepted as evidence of superiority over RAG.

    Authors: The referee correctly notes the risk of circularity. We will revise the Evaluation section to include an independent human evaluation protocol: two board-certified clinicians will annotate faithfulness on a stratified random sample of 200 generated reports against the original clinical notes (ground truth). We will report inter-annotator agreement (Cohen's kappa) and the resulting faithfulness scores to corroborate the automated 91.2% figure. revision: yes
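Cohen's kappa, which the rebuttal proposes for inter-annotator agreement, is a standard quantity and can be computed directly. A minimal stdlib implementation for binary faithful/unfaithful labels (the annotation vectors below are fabricated for illustration only):

```python
# Cohen's kappa for two annotators over the same items. Binary labels
# here, but the formula is label-agnostic.

from collections import Counter

def cohens_kappa(a, b):
    assert len(a) == len(b) and a
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n           # observed agreement
    ca, cb = Counter(a), Counter(b)
    labels = set(a) | set(b)
    p_e = sum((ca[l] / n) * (cb[l] / n) for l in labels)  # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Two clinicians labelling 10 reports as faithful (1) or not (0):
# illustrative data, not from the paper.
r1 = [1, 1, 1, 0, 0, 1, 1, 0, 1, 1]
r2 = [1, 1, 0, 0, 0, 1, 1, 1, 1, 1]
k = cohens_kappa(r1, r2)   # ~0.52: moderate agreement, above chance
assert 0 < k < 1
```

A kappa well above roughly 0.6-0.8 would be the usual bar for trusting the human corroboration of the automated 91.2% figure.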

  3. Referee: [Privacy Mechanism] Privacy Mechanism: the 9.8% absolute reduction in artifact-level membership inference risk is attributed to the binding ℓ-diversity phase transition, but no explicit measurement protocol, baseline comparison, or statistical test is described that would confirm the reduction is not an artifact of the metric's internal definition.

    Authors: We acknowledge that the current description of the membership-inference evaluation lacks sufficient detail. In the revision we will expand the Privacy Analysis subsection to specify the full protocol: a shadow-model membership inference attack with 5-fold cross-validation, standard RAG as the explicit baseline, and a paired t-test (p < 0.01) to establish significance of the 9.8% reduction. Pseudocode for the attack and evaluation pipeline will be added. revision: partial
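At evaluation time, any shadow-model protocol of this kind reduces to scoring an attack over member and non-member artifacts. A toy sketch of the advantage metric follows; the threshold attack and all scores are fabricated for illustration, not drawn from the paper's experiments:

```python
# Toy confidence-threshold membership-inference evaluation.
# member_scores / nonmember_scores would come from a real shadow-model
# attack; here they are fabricated to illustrate the metric only.

def attack_advantage(member_scores, nonmember_scores, threshold):
    tpr = sum(s >= threshold for s in member_scores) / len(member_scores)
    fpr = sum(s >= threshold for s in nonmember_scores) / len(nonmember_scores)
    return tpr - fpr   # 0 = chance-level attack, 1 = perfect attack

members    = [0.91, 0.88, 0.95, 0.70, 0.85]
nonmembers = [0.60, 0.72, 0.55, 0.88, 0.64]

adv = attack_advantage(members, nonmembers, threshold=0.80)
assert 0 <= adv <= 1
# An "absolute 9.8% reduction" would mean this advantage (or attack
# accuracy) drops by 0.098 with the privacy gate enabled versus without.
```

Reporting the full attack advantage curve across thresholds, rather than a single operating point, would make the 9.8% figure harder to attribute to metric internals.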

Circularity Check

3 steps flagged

Faithfulness metric and ℓ-diversity phase transition reduce to self-defined constructs; preclusion claim lacks independent theorem

specific steps
  1. self definitional [Abstract]
    "Online generation is strictly constrained by exact set-theoretic differentials and a reflective Scribe-Critic loop, mathematically precluding unsupported narrative claims."

    The preclusion is presented as a direct mathematical consequence of the constraints and loop that the framework itself defines and enforces; no separate theorem, invariant, or exhaustive verification is supplied to show soundness beyond the definition of the neuro-symbolic bottleneck.

  2. self definitional [Abstract]
    "ProtoMedAgent additionally leverages a binding ℓ-diversity phase transition to systematically reduce artifact-level membership inference risks by an absolute 9.8%."

    The 'binding ℓ-diversity phase transition' is introduced by the paper as part of its semantic privacy gate; the specific 9.8% risk reduction is then attributed directly to this transition, making the reported gain a consequence of how the phase transition is defined and applied within the same system.

  3. fitted input called prediction [Abstract]
    "ProtoMedAgent achieves 91.2% Comparison Set Faithfulness where it fundamentally outperforms standard RAG (46.2%)."

    Comparison Set Faithfulness is measured against the system's own distilled prototype features and discrete semantic memory; the large delta versus RAG is therefore produced by construction of the evaluation metric and the neuro-symbolic constraints rather than an independent external benchmark.

full rationale

The derivation chain centers on the neuro-symbolic bottleneck and Scribe-Critic loop being asserted to 'mathematically preclude' unsupported claims, with performance (91.2% vs 46.2%) and privacy reduction (9.8%) tied to internally introduced mechanisms like the binding ℓ-diversity phase transition and Comparison Set Faithfulness. These reduce to the paper's own definitions and constraints without external theorem or independent validation shown in the abstract. This produces partial circularity (score 6) where load-bearing guarantees are by construction of the introduced components rather than derived from prior independent results.
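The circularity concern can be made concrete. If Comparison Set Faithfulness is, as the audit's reading assumes (the paper's formal definition is not reproduced here), the fraction of report claims found in the system's own evidence set, then a generator constrained to that set scores highly by construction:

```python
# Illustrative faithfulness metric under the audit's reading: the share
# of report claims contained in the comparison (evidence) set. This
# definition is an assumption; the paper's formal metric may differ.

def comparison_set_faithfulness(claims, evidence):
    return sum(c in evidence for c in claims) / len(claims)

evidence = {"osteopenia", "t-score -1.8", "lumbar spine"}

constrained = ["osteopenia", "t-score -1.8"]            # claims within evidence
free_form   = ["osteopenia", "fracture risk elevated"]  # one unsupported claim

assert comparison_set_faithfulness(constrained, evidence) == 1.0
assert comparison_set_faithfulness(free_form, evidence) == 0.5
# A decoder that can only emit evidence items scores 1.0 regardless of
# clinical quality; the delta over RAG needs an external ground truth.
```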

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 2 invented entities

The approach relies on several new invented components and domain assumptions about the effectiveness of the constraints.

free parameters (2)
  • k in k-anonymity
    Parameter for the semantic privacy gate; specific value not provided in the abstract.
  • ℓ in ℓ-diversity
    Diversity parameter for the privacy gate; value not specified.
axioms (2)
  • domain assumption The prototype backbone remains frozen and provides stable latent features for distillation.
    The framework operates on a frozen prototype backbone as stated.
  • ad hoc to paper Set-theoretic differentials can enforce strict constraints on generated narratives.
    Central to the claim of precluding unsupported claims.
invented entities (2)
  • Scribe-Critic loop no independent evidence
    purpose: To reflectively ensure generated reports are supported by the semantic memory.
    Introduced as part of the online generation process.
  • semantic privacy gate no independent evidence
    purpose: To enforce k-anonymity and ℓ-diversity for bounding data disclosure.
    New component for privacy awareness.
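The semantic privacy gate is described only at the level of its k-anonymity and ℓ-diversity parameters. A minimal check of both properties over exported evidence records can be sketched as follows, with the record layout and field names assumed for illustration:

```python
# Hypothetical privacy-gate check: group exported records by their
# quasi-identifier tuple, then require every equivalence class to hold
# at least k records (k-anonymity) and at least l distinct sensitive
# values (l-diversity). The field layout is an assumption, not the
# paper's actual evidence schema.

from collections import defaultdict

def gate_passes(records, quasi_keys, sensitive_key, k, l):
    classes = defaultdict(list)
    for r in records:
        classes[tuple(r[q] for q in quasi_keys)].append(r[sensitive_key])
    return all(len(vals) >= k and len(set(vals)) >= l
               for vals in classes.values())

records = [
    {"age_band": "60-69", "sex": "F", "dx": "osteopenia"},
    {"age_band": "60-69", "sex": "F", "dx": "osteoporosis"},
    {"age_band": "60-69", "sex": "F", "dx": "normal"},
]
# One equivalence class of size 3 with 3 distinct diagnoses:
# passes k=3, l=2 ...
assert gate_passes(records, ("age_band", "sex"), "dx", k=3, l=2)
# ... but fails k=5, within the paper's swept regime k in {3, 5, 7, 9}.
assert not gate_passes(records, ("age_band", "sex"), "dx", k=5, l=2)
```

Figure 3's reported phase transition corresponds to the ℓ constraint, not k, becoming the binding condition in this kind of check once ℓ ≥ 2.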

pith-pipeline@v0.9.0 · 5544 in / 1645 out tokens · 71663 ms · 2026-05-15T05:05:37.166680+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

    Relation between the paper passage and the cited Recognition theorem: unclear.

    Matched passage: formalizes multimodal clinical reporting as an iterative, zero-gradient test-time optimization problem over a strict neuro-symbolic bottleneck... exact set-theoretic differentials... Scribe-Critic loop, mathematically precluding unsupported narrative claims... semantic privacy gate governed by k-anonymity and ℓ-diversity

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
