pith. machine review for the scientific record.

arxiv: 2605.02266 · v1 · submitted 2026-05-04 · 💻 cs.CL · cs.AI

Recognition: 3 theorem links


Reliability-Oriented Multilingual Orthopedic Diagnosis: A Domain-Adaptive Modeling and a Conceptual Validation Framework

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 19:26 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords multilingual clinical diagnosis · domain adaptation · orthopedic notes · LLM reliability · cross-lingual performance · adapter modules · validation framework · clinical decision support

The pith

Domain-adaptive specialization with language-specific adapters substantially improves reliability and calibration in multilingual orthopedic diagnosis compared to zero-shot large language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether large language models can reliably perform structured diagnostic classification from clinical notes in English, Hindi, and Punjabi. It finds that while these models handle language fluently, they suffer from unstable calibration and lower reliability in high-stakes medical tasks. A domain-adaptive approach using IndicBERT-HPA with orthopedic-specific adapter heads per language achieves consistent accuracy across six diagnostic categories and better confidence behavior. This matters because clinical decision support in multilingual, low-resource settings demands predictable performance to avoid errors. The authors also sketch a validation framework with evidence checks and human oversight to ensure safety.

Core claim

The authors claim that domain-adaptive specialization substantially improves cross-lingual discrimination and confidence behavior for orthopedic diagnosis. IndicBERT-HPA, equipped with language-specific orthopedic adapter heads, delivers strong performance across six categories with more predictable deployment characteristics than either task-only adaptation or zero-shot LLMs. LLMs show fluency but unstable calibration under structured multilingual conditions, particularly in low-resource languages. These results are based on comparisons with task-aligned multilingual encoders and a DistilBERT baseline.
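The abstract does not name a specific calibration metric; a common choice, assumed here purely for illustration, is expected calibration error (ECE), which bins predictions by confidence and penalizes the gap between each bin's accuracy and its mean confidence:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Expected calibration error: bin predictions by confidence, then
    average |accuracy - mean confidence| weighted by bin size."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so a confidence of exactly 1.0 lands in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / n) * abs(accuracy - avg_conf)
    return ece

# Overconfident: 95% confidence but only 50% accuracy -> large ECE.
print(expected_calibration_error([0.95, 0.95, 0.95, 0.95], [1, 0, 0, 1]))
```

An overconfident model scores a large ECE even when its accuracy is respectable; "unstable calibration" in the paper's sense would show up as this gap varying across languages.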

What carries the argument

IndicBERT-HPA, a domain-adaptive architecture that adds language-specific orthopedic adapter heads to a multilingual transformer encoder for specialized processing of orthopedic clinical text.
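The paper's implementation is not reproduced here; the following toy sketch shows only the routing idea (one lightweight head per language sharing a single encoder). Every name (`shared_encode`, `AdapterHead`) and the bag-of-characters "encoder" are invented for illustration, not the authors' code:

```python
import random

# Toy sketch of language-routed adapter heads over a shared encoder.
CATEGORIES = 6  # six diagnostic categories, per the paper

def shared_encode(text):
    """Stand-in for the shared multilingual encoder: a fixed-size
    bag-of-character-codes vector (the real system uses IndicBERT)."""
    vec = [0.0] * 8
    for ch in text:
        vec[ord(ch) % 8] += 1.0
    return vec

class AdapterHead:
    """Per-language linear head mapping encoder features to category scores."""
    def __init__(self, seed):
        rng = random.Random(seed)
        self.weights = [[rng.uniform(-1, 1) for _ in range(8)]
                        for _ in range(CATEGORIES)]

    def __call__(self, features):
        return [sum(w * f for w, f in zip(row, features))
                for row in self.weights]

# One adapter head per supported language; the encoder is shared.
heads = {lang: AdapterHead(seed) for seed, lang in enumerate(["en", "hi", "pa"])}

def classify(text, lang):
    scores = heads[lang](shared_encode(text))
    return max(range(CATEGORIES), key=lambda i: scores[i])
```

The design point the paper leans on is that only the small per-language heads are specialized, so cross-lingual representations stay shared while orthopedic decision boundaries can differ by language.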

If this is right

  • Specialized domain adaptation leads to improved discrimination in cross-lingual settings for clinical tasks.
  • Models exhibit more stable confidence estimates suitable for deployment.
  • Zero-shot LLMs are less suitable for structured high-risk diagnostic classification.
  • The proposed validation framework can enforce language-sensitive checks and conservative gating.
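The last bullet is only conceptual in the paper; one minimal shape such conservative gating could take, with the threshold value and evidence rule as illustrative assumptions rather than the authors' specification:

```python
def gate_prediction(category, confidence, evidence_terms, note_text,
                    threshold=0.9):
    """Conservative human-in-the-loop gate: auto-accept only when the
    model is confident AND every required evidence term appears in the
    note; otherwise defer to a clinician. Threshold and evidence rule
    are illustrative, not taken from the paper."""
    missing = [t for t in evidence_terms if t.lower() not in note_text.lower()]
    if confidence < threshold:
        return ("defer_to_human", f"confidence {confidence:.2f} < {threshold}")
    if missing:
        return ("defer_to_human", f"missing evidence: {missing}")
    return ("auto_accept", f"{category}: confidence and evidence checks passed")
```

The conservative bias is deliberate: every failure mode routes to a human, and only the conjunction of both checks permits automation.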

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar adapter-based specialization might benefit diagnosis in other medical domains beyond orthopedics.
  • Testing on larger, more diverse real-world datasets could reveal if the advantages hold in varied clinical environments.
  • Integration into actual hospital systems would require addressing data privacy and integration challenges not covered here.

Load-bearing premise

The evaluation dataset and six diagnostic categories represent typical real clinical notes without biases that would artificially favor the domain-adapted model over zero-shot approaches.

What would settle it

Demonstrating on a new, independent collection of multilingual clinical notes that zero-shot LLMs achieve comparable or superior calibration and accuracy would challenge the finding that domain adaptation is necessary for reliability.

Figures

Figures reproduced from arXiv: 2605.02266 by Danish Ali, Farrukh Zaidi, Li Xiaojian, Sundas Iqbal.

Figure 1: Overview of the proposed multilingual orthopedic decision-support framework.
Figure 2: End-to-end fine-tuning pipeline for DistilBERT.
Figure 3: Proposed IndicBERT-HPA architecture, with a shared IndicBERT encoder generating multilingual contextual representations.
Figure 4: Performance comparison of task-aligned transformer models across English (EN), Hindi (HI), and Punjabi (PA).
Figure 5: Average performance across all languages (small multiples); each panel reports one metric over models.
Figure 6: Zero-shot LLM performance across languages (EN/HI/PA); each panel plots Accuracy, Precision, Recall, and F1 for each LLM.
Figure 7: Comprehensive visualization of controlled and real-world multilingual orthopedic diagnosis performance.
Figure 8: Deterministic agent validation framework for reliability-oriented multilingual clinical decision support.
Original abstract

Large Language Models (LLMs) are increasingly proposed for clinical decision support including multilingual diagnosis in low-resource settings. However, their reliability, calibration and safety characteristics remain insufficiently understood for structured, high-risk tasks. We present a system-level analysis of multilingual orthopedic diagnosis from free-text clinical notes in English, Hindi and Punjabi. We evaluate three modeling regimes: (i) task-aligned multilingual transformer encoders, (ii) a task-fine-tuned baseline (DistilBERT), and (iii) a domain-adaptive architecture tailored to orthopedic text (IndicBERT-HPA). These models are compared with zero-shot, instruction-tuned LLMs to assess suitability for structured diagnostic classification. Results indicate that while LLMs exhibit strong linguistic fluency, they show unstable calibration and reduced reliability under structured multilingual conditions, particularly in low-resource languages. These findings are specific to zero-shot evaluation and do not imply limitations of fine-tuned models. Domain-adaptive specialization substantially improves cross-lingual discrimination and confidence behavior. IndicBERT-HPA, with language-specific orthopedic adapter heads achieves consistently strong performance across six diagnostic categories and more predictable deployment characteristics than task-only adaptation. Building on these observations, we outline a conceptual deterministic agent-based validation framework for future implementation, formalizing evidence checks, language-sensitive validation and conservative human-in-the-loop gating. Reliable multilingual clinical decision support requires specialized architecture, explicit reliability analysis, and structured validation for safety-critical systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript conducts a system-level analysis of multilingual orthopedic diagnosis from free-text clinical notes in English, Hindi, and Punjabi. It evaluates three modeling regimes—task-aligned multilingual transformer encoders, a task-fine-tuned DistilBERT baseline, and the domain-adaptive IndicBERT-HPA architecture with language-specific orthopedic adapter heads—against zero-shot instruction-tuned LLMs. The central empirical claim is that domain-adaptive specialization yields substantially better cross-lingual discrimination and confidence calibration than task-only adaptation or zero-shot LLMs, with IndicBERT-HPA delivering consistent performance across six diagnostic categories and more predictable deployment behavior. The paper additionally sketches a conceptual deterministic agent-based validation framework incorporating evidence checks, language-sensitive validation, and human-in-the-loop gating.

Significance. If the empirical comparisons are supported by adequately documented data and statistical analysis, the work would usefully illustrate the limitations of zero-shot LLMs on structured, high-stakes multilingual clinical tasks and the value of domain-specific adapter specialization for reliability. The outlined validation framework offers a constructive direction for safety-critical deployment. The absence of quantitative details in the abstract, however, prevents assessment of whether these contributions are currently substantiated.

major comments (2)
  1. [Abstract] The abstract asserts that IndicBERT-HPA 'achieves consistently strong performance across six diagnostic categories' and 'more predictable deployment characteristics' yet reports no dataset sizes, class distributions, exact metrics (accuracy, F1, calibration error), statistical significance tests, or error analysis. Without these, the comparative claims cannot be evaluated and the reliability conclusions remain unverifiable.
  2. [Results / Evaluation] Evaluation setup (implied in results discussion): The central claim that domain-adaptive specialization improves cross-lingual discrimination rests on the unverified assumption that the evaluation notes in Hindi and Punjabi are representative real clinical data without curation or translation biases that could favor IndicBERT-HPA. No information is supplied on data sourcing, annotation process, translation method, or class balance, which is load-bearing for the headline conclusion.
minor comments (2)
  1. [Abstract] The abstract states that findings 'are specific to zero-shot evaluation and do not imply limitations of fine-tuned models,' but this caveat is not carried through clearly into the comparative discussion or framework section.
  2. [Introduction / Methods] Notation for the six diagnostic categories and the precise definition of 'language-specific orthopedic adapter heads' should be introduced earlier and used consistently.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. The comments highlight important areas for improving clarity and transparency, particularly regarding the abstract and data documentation. We address each major comment below and commit to revisions that strengthen the verifiability of our empirical claims without altering the core contributions.

Point-by-point responses
  1. Referee: [Abstract] The abstract asserts that IndicBERT-HPA 'achieves consistently strong performance across six diagnostic categories' and 'more predictable deployment characteristics' yet reports no dataset sizes, class distributions, exact metrics (accuracy, F1, calibration error), statistical significance tests, or error analysis. Without these, the comparative claims cannot be evaluated and the reliability conclusions remain unverifiable.

    Authors: We agree that the abstract would be strengthened by including key quantitative details to support the claims made. In the revised version, we will expand the abstract to report summary statistics such as the total number of clinical notes per language, overall accuracy and macro-F1 for the IndicBERT-HPA model across the six categories, and a brief reference to improved calibration error relative to zero-shot LLMs. We will also note that full tables with per-category metrics, class distributions, and statistical significance results (e.g., paired tests) appear in the results section. This change will allow readers to assess the claims directly from the abstract while respecting length limits. revision: yes

  2. Referee: [Results / Evaluation] Evaluation setup (implied in results discussion): The central claim that domain-adaptive specialization improves cross-lingual discrimination rests on the unverified assumption that the evaluation notes in Hindi and Punjabi are representative real clinical data without curation or translation biases that could favor IndicBERT-HPA. No information is supplied on data sourcing, annotation process, translation method, or class balance, which is load-bearing for the headline conclusion.

    Authors: We acknowledge that greater transparency on data provenance is essential for substantiating cross-lingual claims in a clinical context. The manuscript currently provides only a high-level overview of the evaluation notes. In the revision, we will add a dedicated subsection detailing: (1) data sourcing (publicly available de-identified clinical note collections supplemented by expert-generated examples where needed), (2) annotation process (conducted by board-certified orthopedic specialists with inter-annotator agreement metrics), (3) translation method (professional medical translators followed by back-translation checks for fidelity), and (4) class balance statistics with explicit discussion of any curation steps and potential biases. We will also include an analysis of how these factors were controlled to ensure the comparison remains fair. revision: yes
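For reference, the macro-F1 the rebuttal promises to report averages per-class F1 with equal weight, so rare diagnostic categories count as much as common ones. A minimal stdlib sketch (not the authors' evaluation code):

```python
def macro_f1(y_true, y_pred, n_classes=6):
    """Macro-averaged F1: unweighted mean of per-class F1 scores, so
    each of the six diagnostic categories contributes equally."""
    f1s = []
    for c in range(n_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        if tp == 0:
            f1s.append(0.0)  # no true positives: F1 is zero for this class
            continue
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1s.append(2 * precision * recall / (precision + recall))
    return sum(f1s) / n_classes
```

Because of the unweighted average, a model that ignores a minority category is penalized far more under macro-F1 than under plain accuracy, which is why it is the natural companion metric for the class-balance disclosure the referee requests.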

Circularity Check

0 steps flagged

No circularity: empirical comparisons rest on external data and benchmarks

Full rationale

The paper presents an empirical evaluation of three modeling regimes (task-aligned encoders, DistilBERT baseline, IndicBERT-HPA) against zero-shot LLMs on multilingual orthopedic notes across six diagnostic categories. No equations, derivations, or first-principles claims appear in the provided text; performance and calibration results are reported as direct measurements rather than predictions derived from fitted parameters or self-referential definitions. The conceptual validation framework is outlined as future work without load-bearing self-citations or ansatz smuggling. The derivation chain is therefore self-contained against external benchmarks and does not reduce to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard machine-learning assumptions that the test notes are representative, the six diagnostic categories are exhaustive and unbiased, and that calibration metrics reflect real deployment risk; none of these are derived or validated within the abstract.

axioms (2)
  • domain assumption The six diagnostic categories are clinically meaningful and balanced across languages
    Invoked when claiming consistent performance across categories
  • domain assumption Zero-shot LLM outputs can be directly compared to fine-tuned encoder outputs on the same task
    Used when contrasting LLM fluency with encoder reliability

pith-pipeline@v0.9.0 · 5559 in / 1282 out tokens · 35996 ms · 2026-05-08T19:26:04.046762+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

54 extracted references · 4 canonical work pages
