pith. machine review for the scientific record.

arxiv: 2605.12313 · v1 · submitted 2026-05-12 · 💻 cs.CL · cs.IR

Recognition: no theorem link

Overview of the MedHopQA track at BioCreative IX: track description, participation and evaluation of systems for multi-hop medical question answering

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 04:40 UTC · model grok-4.3

classification 💻 cs.CL · cs.IR
keywords multi-hop question answering · biomedical QA · shared task · retrieval-augmented generation · MedHopQA · BioCreative · large language models · rare diseases

The pith

Retrieval-augmented systems reached 89.3 percent conceptual F1 on 1,000 two-hop biomedical questions whose answers require combining facts from separate Wikipedia pages.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents the MedHopQA shared task, which built a benchmark of 1,000 questions, each designed to require integrating information from two distinct Wikipedia pages on diseases, genes, and chemicals. It reports results from 48 submissions by 13 teams and shows a clear performance gap between standard large language models and systems that add retrieval steps. The top entry scored 89.3 percent F1 on the MedCPT conceptual metric and 87.3 percent exact match, compared with 67.4 percent and 60.2 percent for the zero-shot baseline. Concept-level scoring proved useful when correct answers appeared in different surface forms. The dataset is released publicly to support further progress on multi-hop reasoning in medicine.

Core claim

The MedHopQA track supplied 1,000 questions constructed to demand two-hop reasoning across distinct Wikipedia pages and showed that retrieval-augmented generation strategies enabled participating systems to reach substantially higher scores than zero-shot baselines under both exact-match and MedCPT conceptual evaluation.

What carries the argument

The MedHopQA dataset of 1,000 two-hop QA pairs, evaluated with exact match plus MedCPT conceptual similarity scores.
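As a concrete illustration of that machinery, here is a minimal sketch of dual scoring, assuming the publicly released MedCPT query encoder (ncbi/MedCPT-Query-Encoder on Hugging Face) and a simple answer-normalization scheme; the track's exact normalization rules and its mapping from similarity to the reported F1 are not given here, so treat this as a sketch rather than the official scorer.

```python
# Sketch: exact-match plus concept-level scoring for short answers.
# Assumes the public MedCPT query encoder; the track's actual
# normalization and F1 aggregation may differ.
import string
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ncbi/MedCPT-Query-Encoder")
model = AutoModel.from_pretrained("ncbi/MedCPT-Query-Encoder")

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles before string comparison."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(w for w in text.split() if w not in {"a", "an", "the"})

def exact_match(pred: str, gold: str) -> bool:
    return normalize(pred) == normalize(gold)

@torch.no_grad()
def medcpt_similarity(pred: str, gold: str) -> float:
    """Cosine similarity between MedCPT [CLS] embeddings of two answers."""
    enc = tokenizer([pred, gold], padding=True, truncation=True, return_tensors="pt")
    emb = model(**enc).last_hidden_state[:, 0, :]
    emb = torch.nn.functional.normalize(emb, dim=-1)
    return float(emb[0] @ emb[1])

pred, gold = "Hutchinson-Gilford progeria syndrome", "progeria"
print(exact_match(pred, gold))        # False: surface forms differ
print(medcpt_similarity(pred, gold))  # high: same concept despite the mismatch
```

The point of the second scorer is exactly the behavior in the example: a prediction that names the right concept in a different surface form fails exact match but scores high on embedding similarity.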

If this is right

  • Retrieval-augmented generation is required for strong results on questions that span multiple biomedical sources (a minimal sketch follows this list).
  • Concept-level evaluation better recognizes valid answers that differ in wording from the reference.
  • The large gap between baselines and top systems points to a specific weakness in current models' ability to combine information across documents.
  • Public availability of the dataset allows direct comparison of new multi-hop methods against the reported benchmarks.
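To make the first bullet concrete, a minimal two-hop retrieve-then-read loop is sketched below. The toy corpus, the lexical retriever, and the bridge-entity heuristic are illustrative stand-ins, not any participating system's architecture; real entries used trained retrievers and LLM readers.

```python
# Sketch: a two-hop retrieve-then-read loop over a toy in-memory corpus.
# Corpus texts and heuristics are hypothetical stand-ins for real systems.
import re

TOY_CORPUS = {
    "Alkaptonuria": "Alkaptonuria is a rare disease caused by mutations in the HGD gene.",
    "HGD": "The HGD gene encodes homogentisate 1,2-dioxygenase, located on chromosome 3.",
    "Aspirin": "Aspirin is a chemical compound used as an analgesic.",
}

def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query: str) -> str:
    """Naive lexical retrieval: return the title with the most token overlap."""
    return max(TOY_CORPUS,
               key=lambda t: len(tokens(query) & tokens(t + " " + TOY_CORPUS[t])))

def answer_two_hop(question: str) -> str:
    # Hop 1: retrieve the page most relevant to the question itself.
    title1 = retrieve(question)
    page1 = TOY_CORPUS[title1]
    # Bridge: a real system would have an LLM extract the linking entity;
    # here we take the first token of page 1 that names another page.
    bridge = next(w for w in re.findall(r"[A-Za-z0-9-]+", page1)
                  if w in TOY_CORPUS and w != title1)
    # Hop 2: read the answer off the bridge entity's page.
    return TOY_CORPUS[bridge]  # a real reader would extract a short span

print(answer_two_hop("Which chromosome carries the gene mutated in alkaptonuria?"))
# -> the HGD page text, which contains "chromosome 3"
```

The structure, not the heuristics, is the point: the answer lives on a page the question never names, reachable only through an entity found on the first page.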

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same retrieval approach could be tested on clinical notes or PubMed abstracts to check whether Wikipedia-based two-hop reasoning transfers to other text types.
  • If retrieval helps most on rare diseases, it may be especially valuable wherever model pre-training data is sparse.
  • Extending the task to three-hop or longer chains would reveal whether the observed retrieval benefit scales with reasoning depth.

Load-bearing premise

The questions cannot be answered correctly from a single Wikipedia page or from the model's existing knowledge alone.

What would settle it

A zero-shot large language model achieving MedCPT scores close to the top retrieval-augmented entries on the same 1,000 questions would indicate the items do not genuinely require cross-page integration.
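One hedged way to operationalize "close to" is a paired bootstrap over per-question scores for the two systems on the same items; if the confidence interval for the gap excludes zero by a wide margin, the items plausibly demand more than parametric knowledge. The score arrays below are synthetic placeholders seeded from the reported headline rates, not track data.

```python
# Sketch: paired bootstrap CI for the per-question EM gap between a
# retrieval-augmented system and a zero-shot LLM on the same items.
import random

random.seed(0)
n = 1000
rag = [random.random() < 0.873 for _ in range(n)]  # stand-in per-item EM
zs = [random.random() < 0.602 for _ in range(n)]

def bootstrap_gap_ci(a, b, iters=2000):
    """95% CI for mean(a) - mean(b), resampling question indices jointly."""
    idx = range(len(a))
    gaps = []
    for _ in range(iters):
        sample = [random.choice(idx) for _ in idx]
        gaps.append(sum(a[i] for i in sample) / len(a)
                    - sum(b[i] for i in sample) / len(b))
    gaps.sort()
    return gaps[int(0.025 * iters)], gaps[int(0.975 * iters)]

lo, hi = bootstrap_gap_ci(rag, zs)
print(f"EM gap 95% CI: [{lo:.3f}, {hi:.3f}]")
# A zero-shot model whose CI against the top entries covered zero would
# undercut the claim that the items require cross-page integration.
```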

read the original abstract

Multi-hop question answering (QA) remains a significant challenge in the biomedical domain, requiring systems to integrate information across multiple sources to answer complex questions. To address this problem, the BioCreative IX MedHopQA shared task was designed to benchmark in multi-hop reasoning for large language models (LLMs). We developed a novel dataset of 1,000 challenging QA pairs spanning diseases, genes, and chemicals, with particular emphasis on rare diseases. Each question was constructed to require two-hop reasoning through the integration of information from two distinct Wikipedia pages. The challenge attracted 48 submissions from 13 teams. Systems were evaluated using both surface string comparison and conceptual accuracy (MedCPT score). The results showed a substantial performance gap between baseline LLMs and enhanced systems. The top-ranked submission achieved an 89.30% F1 score on the MedCPT metric and an 87.30% exact match (EM) score, compared with 67.40% and 60.20%, respectively, for the zero-shot baseline. A central finding of the challenge was that retrieval-augmented generation (RAG) and related retrieval-based strategies were critical for strong performance. In addition, concept-level evaluation improved answer assessment when correct responses differed in surface form. The MedHopQA dataset is publicly available to support continued progress in this important area. Challenge materials: https://www.ncbi.nlm.nih.gov/research/bionlp/medhopqa and benchmark https://www.codabench.org/competitions/7609/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper provides an overview of the MedHopQA shared task at BioCreative IX. It describes the construction of a novel dataset of 1,000 QA pairs on diseases, genes, and chemicals (emphasizing rare diseases), where each question is designed to require two-hop reasoning across two distinct Wikipedia pages. The track received 48 submissions from 13 teams. Systems were evaluated with exact match (EM) and a conceptual MedCPT F1 score. Results show the top submission at 89.30% MedCPT F1 and 87.30% EM versus 67.40% and 60.20% for the zero-shot baseline. The paper concludes that retrieval-augmented generation (RAG) and retrieval-based strategies were critical for strong performance, releases the dataset publicly, and notes that concept-level evaluation helps when surface forms differ.

Significance. If the dataset's multi-hop property holds, the work supplies a public benchmark and evaluation platform for biomedical multi-hop QA, with concrete participation numbers and performance deltas that can guide future LLM development in the domain. The emphasis on rare diseases and the MedCPT metric add domain-specific value beyond standard string matching.

major comments (1)
  1. [Dataset construction / track description] Dataset description (abstract and track description sections): The manuscript asserts that the 1,000 questions were constructed to require two-hop reasoning via integration across two distinct Wikipedia pages, yet provides no verification steps such as single-page retrieval tests, human annotation for answer locality, or ablations demonstrating that zero-shot/one-hop baselines fail specifically on these items. This is load-bearing for the central claim that the observed performance gap (89.30% MedCPT F1 / 87.30% EM vs. 67.40%/60.20% baseline) and the conclusion that RAG strategies are critical reflect multi-hop integration rather than improved single-fact retrieval or metric handling.
minor comments (1)
  1. [Abstract] Abstract: 'benchmark in multi-hop reasoning' appears to be a minor phrasing issue and should read 'benchmarking multi-hop reasoning'.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive review and for highlighting the importance of verifying the multi-hop character of the MedHopQA dataset. We agree that this verification is central to interpreting the performance results and have revised the manuscript to include additional details and supporting analyses on dataset construction.

read point-by-point responses
  1. Referee: Dataset description (abstract and track description sections): The manuscript asserts that the 1,000 questions were constructed to require two-hop reasoning via integration across two distinct Wikipedia pages, yet provides no verification steps such as single-page retrieval tests, human annotation for answer locality, or ablations demonstrating that zero-shot/one-hop baselines fail specifically on these items. This is load-bearing for the central claim that the observed performance gap (89.30% MedCPT F1 / 87.30% EM vs. 67.40%/60.20% baseline) and the conclusion that RAG strategies are critical reflect multi-hop integration rather than improved single-fact retrieval or metric handling.

    Authors: We acknowledge that the original manuscript did not present explicit verification experiments for the two-hop requirement. The questions were designed by domain experts following a protocol that required each item to depend on facts from two distinct Wikipedia pages, with the second fact only accessible after retrieving the first. In the revised version we have added: (1) a detailed description of the annotation guidelines and quality-control steps used to enforce two-hop dependency; (2) results of a human annotation study in which annotators were restricted to a single page and could not locate the answer for the large majority of questions; and (3) an ablation comparing a one-hop retrieval baseline against the submitted two-hop RAG systems, confirming a substantial performance drop when multi-hop integration is removed. These additions directly address the concern that the observed gap (and the value of RAG) might reflect single-fact retrieval or metric effects alone. We have also clarified that the MedCPT conceptual metric was applied uniformly and still shows the same ordering, further supporting that the gains arise from reasoning across hops. revision: yes
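If the added one-hop ablation yields paired per-question outcomes, an exact McNemar test is the standard check that the drop is systematic rather than noise. A small sketch follows, with hypothetical discordant counts rather than numbers from the paper.

```python
# Sketch: exact McNemar test on paired outcomes from a one-hop vs.
# two-hop ablation. The discordant counts below are hypothetical.
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact binomial p-value.
    b = items only the two-hop system answered correctly,
    c = items only the one-hop system answered correctly."""
    n = b + c
    tail = sum(comb(n, k) for k in range(min(b, c) + 1))
    return min(1.0, 2 * tail / 2**n)

print(mcnemar_exact(b=240, c=35))  # vanishingly small p -> systematic drop
```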

Circularity Check

0 steps flagged

No circularity in empirical reporting of shared-task results

full rationale

The paper is an overview of a BioCreative shared task that reports participant submissions, baseline scores, and dataset construction details without any mathematical derivations, parameter fittings, or predictive equations. Central claims rest on externally evaluated F1/EM/MedCPT metrics from 48 submissions rather than any self-referential reduction or ansatz smuggled via citation. No load-bearing step equates a claimed result to its own inputs by construction, and the analysis is self-contained against the reported benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical shared-task overview paper rather than a theoretical derivation, so it introduces no free parameters, axioms, or invented entities.

pith-pipeline@v0.9.0 · 5677 in / 1179 out tokens · 71899 ms · 2026-05-13T04:40:45.993233+00:00 · methodology


Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · 1 internal anchor

  1. [1]

    (2016) Community challenges in biomedical text mining over 10 years: success, failure and the future

    Huang, C.C., Lu, Z. (2016) Community challenges in biomedical text mining over 10 years: success, failure and the future. Brief Bioinform, 17, 132-144

  2. [2]

    (2017) Information Retrieval and Text Mining Technologies for Chemistry

    Krallinger, M., Rabal, O., Lourenco, A., et al. (2017) Information Retrieval and Text Mining Technologies for Chemistry. Chem Rev, 117, 7673-7761

  3. [3]

    (2016) BioCreative V CDR task corpus: a resource for chemical disease relation extraction

    Li, J., Sun, Y., Johnson, R.J., et al. (2016) BioCreative V CDR task corpus: a resource for chemical disease relation extraction. Database (Oxford), 2016

  4. [4]

    (2005) Overview of BioCreAtIvE: critical assessment of information extraction for biology

    Hirschman, L., Yeh, A., Blaschke, C., et al. (2005) Overview of BioCreAtIvE: critical assessment of information extraction for biology. BMC Bioinformatics, 6 Suppl 1, S1

  5. [5]

    (2008) Evaluation of text-mining systems for biology: overview of the Second BioCreative community challenge

    Krallinger, M., Morgan, A., Smith, L., et al. (2008) Evaluation of text-mining systems for biology: overview of the Second BioCreative community challenge. Genome Biol, 9 Suppl 2, S1

  6. [6]

    (2011) Overview of the BioCreative III Workshop

    Arighi, C.N., Lu, Z., Krallinger, M., et al. (2011) Overview of the BioCreative III Workshop. BMC Bioinformatics, 12 Suppl 8, S1

  7. [7]

    (2011) The Protein-Protein Interaction tasks of BioCreative III: classification/ranking of articles and linking bio-ontology concepts to full text

    Krallinger, M., Vazquez, M., Leitner, F., et al. (2011) The Protein-Protein Interaction tasks of BioCreative III: classification/ranking of articles and linking bio-ontology concepts to full text. BMC Bioinformatics, 12 Suppl 8, S3

  8. [8]

    (2008) Overview of BioCreative II gene normalization

    Morgan, A.A., Lu, Z., Wang, X., et al. (2008) Overview of BioCreative II gene normalization. Genome Biol, 9 Suppl 2, S3

  9. [9]

    (2019) Overview of the BioCreative VI Precision Medicine Track: mining protein interactions and mutations for precision medicine

    Islamaj Dogan, R., Kim, S., Chatr-Aryamontri, A., et al. (2019) Overview of the BioCreative VI Precision Medicine Track: mining protein interactions and mutations for precision medicine. Database (Oxford), 2019

  10. [10]

    (2023) Chemical identification and indexing in full-text articles: an overview of the NLM-Chem track at BioCreative VII

    Leaman, R., Islamaj, R., Adams, V., et al. (2023) Chemical identification and indexing in full-text articles: an overview of the NLM-Chem track at BioCreative VII. Database (Oxford), 2023

  11. [11]

    (2019) The extraction of complex relationships and their conversion to biological expression language (BEL) overview of the BioCreative VI (2017) BEL track

    Madan, S., Szostak, J., Komandur Elayavilli, R., et al. (2019) The extraction of complex relationships and their conversion to biological expression language (BEL) overview of the BioCreative VI (2017) BEL track. Database (Oxford), 2019

  12. [12]

    (2023) Overview of DrugProt task at BioCreative VII: data and methods for large-scale text mining and knowledge graph generation of heterogenous chemical-protein relations

    Miranda-Escalada, A., Mehryary, F., Luoma, J., et al. (2023) Overview of DrugProt task at BioCreative VII: data and methods for large-scale text mining and knowledge graph generation of heterogenous chemical-protein relations. Database (Oxford), 2023

  13. [13]

    (2024) The overview of the BioRED (Biomedical Relation Extraction Dataset) track at BioCreative VIII

    Islamaj, R., Lai, P.T., Wei, C.H., et al. (2024) The overview of the BioRED (Biomedical Relation Extraction Dataset) track at BioCreative VIII. Database (Oxford), 2024

  14. [15]

    (2026) Synthesizing scientific literature with retrieval-augmented language models

    Asai, A., He, J., Shao, R., et al. (2026) Synthesizing scientific literature with retrieval-augmented language models. Nature, 650, 857-863

  15. [16]

    (2025) Toward expert-level medical question answering with large language models

    Singhal, K., Tu, T., Gottweis, J., et al. (2025) Toward expert-level medical question answering with large language models. Nat Med, 31, 943-950

  16. [17]

    (2025) Overview of the MedHopQA track at BioCreative IX: track description, participation and evaluation of systems for multi-hop medical question answering

    Islamaj, R., Chan, J., Leaman, R., et al. (2025) Overview of the MedHopQA track at BioCreative IX: track description, participation and evaluation of systems for multi-hop medical question answering. Proceedings of the BioCreative IX Challenge and Workshop (BC9): Large Language Models for Clinical and Biomedical NLP at the International Joint Conference ...

  17. [18]

    (2022) Biomedical Question Answering: A Survey of Approaches and Challenges

    Jin, Q., Yuan, Z., Xiong, G., et al. (2022) Biomedical Question Answering: A Survey of Approaches and Challenges. ACM Comput. Surv., 55, Article 35

  18. [19]

    (2015) An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition

    Tsatsaronis, G., Balikas, G., Malakasiotis, P., et al. (2015) An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition. BMC Bioinformatics, 16, 138

  19. [20]

    (2019) PubMedQA: A Dataset for Biomedical Research Question Answering

    Jin, Q., Dhingra, B., Liu, Z., et al. (2019) PubMedQA: A Dataset for Biomedical Research Question Answering. Association for Computational Linguistics, Hong Kong, China, pp. 2567-2577

  20. [21]

    (2021) What Disease Does This Patient Have? A Large-Scale Open Domain Question Answering Dataset from Medical Exams

    Jin, D., Pan, E., Oufattole, N., et al. (2021) What Disease Does This Patient Have? A Large-Scale Open Domain Question Answering Dataset from Medical Exams. Applied Sciences, 11, 6421

  21. [22]

    (2022) MedMCQA: A Large-scale Multi-Subject Multi-Choice Dataset for Medical domain Question Answering

    Pal, A., Umapathi, L.K., Sankarasubbu, M. (2022) MedMCQA: A Large-scale Multi-Subject Multi-Choice Dataset for Medical domain Question Answering. In Gerardo, F., George, H.C., Tom, P., et al. (eds.), Proceedings of the Conference on Health, Inference, and Learning. PMLR, Proceedings of Machine Learning Research, Vol. 174, pp. 248-260

  22. [23]

    (2023) Large language models encode clinical knowledge

    Singhal, K., Azizi, S., Tu, T., et al. (2023) Large language models encode clinical knowledge. Nature, 620, 172-180

  23. [24]

    (2018) Constructing Datasets for Multi-hop Reading Comprehension Across Documents

    Welbl, J., Stenetorp, P., Riedel, S. (2018) Constructing Datasets for Multi-hop Reading Comprehension Across Documents. Transactions of the Association for Computational Linguistics, 6, 287-302

  24. [25]

    (2018) emrqa: A large corpus for question answering on electronic medical records

    Pampari, A., Raghavan, P., Liang, J., et al. (2018) emrqa: A large corpus for question answering on electronic medical records. Proceedings of the 2018 conference on empirical methods in natural language processing, pp. 2357-2368

  25. [26]

    (2021) Overview of the MEDIQA 2021 Shared Task on Summarization in the Medical Domain

    Ben Abacha, A., Mrabet, Y., Zhang, Y., et al. (2021) Overview of the MEDIQA 2021 Shared Task on Summarization in the Medical Domain. Association for Computational Linguistics, Online, pp. 74-85

  26. [27]

    (2016) Overview of the TREC 2016 clinical decision support track

    Roberts, K., Demner-Fushman, D., Voorhees, E., et al. (2016) Overview of the TREC 2016 clinical decision support track. Proceedings of the 15th text retrieval conference

  27. [28]

    Islamaj, R., Lima López, S., Xu, D., et al. (2025) Proceedings of the BioCreative IX Challenge and Workshop (BC9): Large Language Models for Clinical and Biomedical NLP at the International Joint Conference on Artificial Intelligence (IJCAI). In Islamaj, R. and Lima López, S. (eds.), BioCreative IX Challenge and Workshop (BC9): Large Language Models for C...

  28. [29]

    (2008) Biomedical ontologies in action: role in knowledge management, data integration and decision support

    Bodenreider, O. (2008) Biomedical ontologies in action: role in knowledge management, data integration and decision support. Yearb Med Inform, 67-79

  29. [30]

    (2023) MedCPT: Contrastive Pre-trained Transformers with large-scale PubMed search logs for zero-shot biomedical information retrieval

    Jin, Q., Kim, W., Chen, Q., et al. (2023) MedCPT: Contrastive Pre-trained Transformers with large-scale PubMed search logs for zero-shot biomedical information retrieval. Bioinformatics, 39

  30. [31]

    (2025) DMIS Lab at MedHopQA-2025: Ensemble Multi-Retrieval Methodologies with Reasoning Language Model Decision

    Jung, J., Hwang, H., Park, Y., et al. (2025) DMIS Lab at MedHopQA-2025: Ensemble Multi-Retrieval Methodologies with Reasoning Language Model Decision. Proceedings of the BioCreative IX Challenge and Workshop (BC9): Large Language Models for Clinical and Biomedical NLP at the Internatio...

  31. [32]

    (2025) UETQuintet at BioCreative IX – MedHopQA: Enhancing Biomedical QA with Selective Multi-hop Reasoning and Contextual Retrieval

    Nguyen, Q.-A., Vu, T.-M.-T., Nguyen, B.-D., et al. (2025) UETQuintet at BioCreative IX – MedHopQA: Enhancing Biomedical QA with Selective Multi-hop Reasoning and Contextual Retrieval. Proceedings of the BioCreative IX Challenge and Workshop (BC9): Large Language Models for Clinical and Biomedical NLP at the International Joint Conference on...

  32. [33]

    Phasook, P., Kinchagawat, J., Chinkamol, A., et al. (2025) NHSRAG: Addressing Multi-Hop Medical QA with Named-entity Heuristic Search Retrieval-Augmented Generation. Proceedings of the BioCreative IX Challenge and Workshop (BC9): Large Language Models for Clinical and Biomedical NLP at th...

  33. [34]

    Alliheedi, M., et al. (2025) Evaluating Advanced Prompting on Gemini Flash for Multi-Hop Biomedical QA. Proceedings of the BioCreative IX Challenge and Workshop (BC9): Large Language Models for Clinical and Biomedical NLP at the International Joint Conference on Artificial Intelligence (IJCAI)

  34. [35]

    Saisudha, H.G., et al. (2025) Agentic and Non-Agentic Multi-Hop Systems for Medical Question Answering. Proceedings of the BioCreative IX Challenge and Workshop (BC9): Large Language Models for Clinical and Biomedical NLP at the International Joint Conference on Artificial Intelligence (IJCAI)

  35. [36]

    (2025) Wikipedia-based hybrid-search RAG with prompt decomposition for MedHopQA

    Taktashov, R.R., Tarasova, O.A., Dmitriev, A.V., et al. (2025) Wikipedia-based hybrid-search RAG with prompt decomposition for MedHopQA. Proceedings of the BioCreative IX Challenge and Workshop (BC9): Large Language Models for Clinical and Biomedical NLP at the International Joint Conference on Artificial Intelligence (IJCAI)

  36. [37]

    Conceição, S.I.R., et al. (2025) lasigeBioTM at MedHop track: Can a Lean RAG-Enhanced Model Compete with MedGemma. Proceedings of the BioCreative IX Challenge and Workshop (BC9): Large Language Models for Clinical and Biomedical NLP at the International Joint Conference on Artificial Intelligence (IJCAI)

  37. [38]

    Abdel-Salam, R., et al. (2025) CaresAI at BioCreative IX Track 1 - LLM for Biomedical QA. Proceedings of the BioCreative IX Challenge and Workshop (BC9): Large Language Models for Clinical and Biomedical NLP at the International Joint Conference on Artificial Intelligence (IJCAI)

  38. [39]

    Ji, Y., Verma, S., Ji, H., et al. (2025) DeepRAG: Integrating Hierarchical Reasoning and Process Supervision for Biomedical Multi-Hop QA. Proceedings of the BioCreative IX Challenge and Workshop (BC9): Large Language Models for Clinical and Biomedical NLP at the International Joint Conference on Artificial Intell...

  39. [40]

    (2023) Query2doc: Query Expansion with Large Language Models

    Wang, L., Yang, N., Wei, F. (2023) Query2doc: Query Expansion with Large Language Models. Association for Computational Linguistics, Singapore, pp. 9414-9423

  40. [41]

    (2024) Improving medical reasoning through retrieval and self-reflection with retrieval-augmented large language models

    Jeong, M., Sohn, J., Sung, M., et al. (2024) Improving medical reasoning through retrieval and self-reflection with retrieval-augmented large language models. Bioinformatics, 40, i119-i129

  41. [42]

    (2025) Rationale-Guided Retrieval Augmented Generation for Medical Question Answering

    Sohn, J., Park, Y., Yoon, C., et al. (2025) Rationale-Guided Retrieval Augmented Generation for Medical Question Answering. Association for Computational Linguistics, Albuquerque, New Mexico, pp. 12739-12753

  42. [43]

    (2023) Efficient Memory Management for Large Language Model Serving with PagedAttention

    Kwon, W., Li, Z., Zhuang, S., et al. (2023) Efficient Memory Management for Large Language Model Serving with PagedAttention. Proceedings of the 29th Symposium on Operating Systems Principles. Association for Computing Machinery, Koblenz, Germany, pp. 611–626

  43. [44]

    (2025) `smolagents`: a smol library to build great agentic systems

    Roucher, A., del Moral, A.V., Wolf, T., et al. (2025) `smolagents`: a smol library to build great agentic systems

  44. [45]

    (2022) React: Synergizing reasoning and acting in language models

    Yao, S., Zhao, J., Yu, D., et al. (2022) React: Synergizing reasoning and acting in language models. The eleventh international conference on learning representations

  45. [46]

    (2024) Bm25s: Orders of magnitude faster lexical search via eager sparse scoring

    Lù, X.H. (2024) Bm25s: Orders of magnitude faster lexical search via eager sparse scoring. arXiv preprint arXiv:2407.03618

  46. [47]

    (2009) Reciprocal rank fusion outperforms condorcet and individual rank learning methods

    Cormack, G.V., Clarke, C.L.A., Buettcher, S. (2009) Reciprocal rank fusion outperforms condorcet and individual rank learning methods. Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval. Association for Computing Machinery, Boston, MA, USA, pp. 758–759

  47. [48]

    (2024) Med42-v2: A suite of clinical LLMs

    Christophe, C., Kanithi, P.K., Raha, T., et al. (2024) Med42-v2: A suite of clinical LLMs. arXiv preprint arXiv:2408.06142

  48. [49]

    (2023) 6G non-terrestrial networks enabled low-altitude economy: Opportunities and challenges

    Jiang, Y., Li, X., Zhu, G., et al. (2023) 6G non-terrestrial networks enabled low-altitude economy: Opportunities and challenges. arXiv preprint arXiv:2311.09047

  49. [50]

    (2022) Mondo: Unifying diseases for the world, by the world

    Vasilevsky, N.A., Matentzoglu, N.A., Toro, S., et al. (2022) Mondo: Unifying diseases for the world, by the world. medRxiv, 2022.04.13.22273750

  50. [51]

    MedGemma Technical Report

    Sellergren, A., Kazemzadeh, S., Jaroensri, T., et al. (2025) MedGemma Technical Report. arXiv preprint arXiv:2507.05201