pith. machine review for the scientific record.

arxiv: 2605.12313 · v1 · submitted 2026-05-12 · 💻 cs.CL · cs.IR

Recognition: no theorem link

Overview of the MedHopQA track at BioCreative IX: track description, participation and evaluation of systems for multi-hop medical question answering

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 04:40 UTC · model grok-4.3

classification 💻 cs.CL · cs.IR
keywords multi-hop question answering · biomedical QA · shared task · retrieval-augmented generation · MedHopQA · BioCreative · large language models · rare diseases

The pith

Retrieval-augmented systems reached 89.3 percent conceptual F1 on 1,000 two-hop biomedical questions whose answers require combining facts from separate Wikipedia pages.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents the MedHopQA shared task, which built a benchmark of 1,000 questions, each designed to require integrating information from two distinct Wikipedia pages on diseases, genes, and chemicals. It reports results from 48 submissions by 13 teams and shows a clear performance gap between standard large language models and systems that add retrieval steps. The top entry scored 89.3 percent F1 on the MedCPT conceptual metric and 87.3 percent exact match, compared with 67.4 percent and 60.2 percent for the zero-shot baseline. Concept-level scoring proved useful when correct answers appeared in different surface forms. The dataset is released publicly to support further progress on multi-hop reasoning in medicine.

Core claim

The MedHopQA track supplied 1,000 questions constructed to demand two-hop reasoning across distinct Wikipedia pages and showed that retrieval-augmented generation strategies enabled participating systems to reach substantially higher scores than zero-shot baselines under both exact-match and MedCPT conceptual evaluation.

What carries the argument

The MedHopQA dataset of 1,000 two-hop QA pairs, evaluated with exact match plus MedCPT conceptual similarity scores.
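As a concrete illustration of that machinery, here is a minimal sketch of dual scoring, assuming the publicly released MedCPT query encoder (ncbi/MedCPT-Query-Encoder on Hugging Face) and a simple answer-normalization scheme; the track's exact normalization rules and its mapping from similarity to the reported F1 are not given here, so treat this as a sketch rather than the official scorer.

```python
# Sketch: exact-match plus concept-level scoring for short answers.
# Assumes the public MedCPT query encoder; the track's actual
# normalization and F1 aggregation may differ.
import string
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ncbi/MedCPT-Query-Encoder")
model = AutoModel.from_pretrained("ncbi/MedCPT-Query-Encoder")

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles before string comparison."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(w for w in text.split() if w not in {"a", "an", "the"})

def exact_match(pred: str, gold: str) -> bool:
    return normalize(pred) == normalize(gold)

@torch.no_grad()
def medcpt_similarity(pred: str, gold: str) -> float:
    """Cosine similarity between MedCPT [CLS] embeddings of two answers."""
    enc = tokenizer([pred, gold], padding=True, truncation=True, return_tensors="pt")
    emb = model(**enc).last_hidden_state[:, 0, :]
    emb = torch.nn.functional.normalize(emb, dim=-1)
    return float(emb[0] @ emb[1])

pred, gold = "Hutchinson-Gilford progeria syndrome", "progeria"
print(exact_match(pred, gold))        # False: surface forms differ
print(medcpt_similarity(pred, gold))  # high: same concept despite the mismatch
```

The point of the second scorer is exactly the behavior in the example: a prediction that names the right concept in a different surface form fails exact match but scores high on embedding similarity.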

If this is right

  • Retrieval-augmented generation is required for strong results on questions that span multiple biomedical sources (a minimal sketch follows this list).
  • Concept-level evaluation better recognizes valid answers that differ in wording from the reference.
  • The large gap between baselines and top systems points to a specific weakness in current models' ability to combine information across documents.
  • Public availability of the dataset allows direct comparison of new multi-hop methods against the reported benchmarks.
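To make the first bullet concrete, a minimal two-hop retrieve-then-read loop is sketched below. The toy corpus, the lexical retriever, and the bridge-entity heuristic are illustrative stand-ins, not any participating system's architecture; real entries used trained retrievers and LLM readers.

```python
# Sketch: a two-hop retrieve-then-read loop over a toy in-memory corpus.
# Corpus texts and heuristics are hypothetical stand-ins for real systems.
import re

TOY_CORPUS = {
    "Alkaptonuria": "Alkaptonuria is a rare disease caused by mutations in the HGD gene.",
    "HGD": "The HGD gene encodes homogentisate 1,2-dioxygenase, located on chromosome 3.",
    "Aspirin": "Aspirin is a chemical compound used as an analgesic.",
}

def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query: str) -> str:
    """Naive lexical retrieval: return the title with the most token overlap."""
    return max(TOY_CORPUS,
               key=lambda t: len(tokens(query) & tokens(t + " " + TOY_CORPUS[t])))

def answer_two_hop(question: str) -> str:
    # Hop 1: retrieve the page most relevant to the question itself.
    title1 = retrieve(question)
    page1 = TOY_CORPUS[title1]
    # Bridge: a real system would have an LLM extract the linking entity;
    # here we take the first token of page 1 that names another page.
    bridge = next(w for w in re.findall(r"[A-Za-z0-9-]+", page1)
                  if w in TOY_CORPUS and w != title1)
    # Hop 2: read the answer off the bridge entity's page.
    return TOY_CORPUS[bridge]  # a real reader would extract a short span

print(answer_two_hop("Which chromosome carries the gene mutated in alkaptonuria?"))
# -> the HGD page text, which contains "chromosome 3"
```

The structure, not the heuristics, is the point: the answer lives on a page the question never names, reachable only through an entity found on the first page.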

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same retrieval approach could be tested on clinical notes or PubMed abstracts to check whether Wikipedia-based two-hop reasoning transfers to other text types.
  • If retrieval helps most on rare diseases, it may be especially valuable wherever model pre-training data is sparse.
  • Extending the task to three-hop or longer chains would reveal whether the observed retrieval benefit scales with reasoning depth.

Load-bearing premise

The questions cannot be answered correctly from a single Wikipedia page or from the model's existing knowledge alone.

What would settle it

A zero-shot large language model achieving MedCPT scores close to the top retrieval-augmented entries on the same 1,000 questions would indicate the items do not genuinely require cross-page integration.
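One hedged way to operationalize "close to" is a paired bootstrap over per-question scores for the two systems on the same items; if the confidence interval for the gap excludes zero by a wide margin, the items plausibly demand more than parametric knowledge. The score arrays below are synthetic placeholders seeded from the reported headline rates, not track data.

```python
# Sketch: paired bootstrap CI for the per-question EM gap between a
# retrieval-augmented system and a zero-shot LLM on the same items.
import random

random.seed(0)
n = 1000
rag = [random.random() < 0.873 for _ in range(n)]  # stand-in per-item EM
zs = [random.random() < 0.602 for _ in range(n)]

def bootstrap_gap_ci(a, b, iters=2000):
    """95% CI for mean(a) - mean(b), resampling question indices jointly."""
    idx = range(len(a))
    gaps = []
    for _ in range(iters):
        sample = [random.choice(idx) for _ in idx]
        gaps.append(sum(a[i] for i in sample) / len(a)
                    - sum(b[i] for i in sample) / len(b))
    gaps.sort()
    return gaps[int(0.025 * iters)], gaps[int(0.975 * iters)]

lo, hi = bootstrap_gap_ci(rag, zs)
print(f"EM gap 95% CI: [{lo:.3f}, {hi:.3f}]")
# A zero-shot model whose CI against the top entries covered zero would
# undercut the claim that the items require cross-page integration.
```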

read the original abstract

Multi-hop question answering (QA) remains a significant challenge in the biomedical domain, requiring systems to integrate information across multiple sources to answer complex questions. To address this problem, the BioCreative IX MedHopQA shared task was designed to benchmark in multi-hop reasoning for large language models (LLMs). We developed a novel dataset of 1,000 challenging QA pairs spanning diseases, genes, and chemicals, with particular emphasis on rare diseases. Each question was constructed to require two-hop reasoning through the integration of information from two distinct Wikipedia pages. The challenge attracted 48 submissions from 13 teams. Systems were evaluated using both surface string comparison and conceptual accuracy (MedCPT score). The results showed a substantial performance gap between baseline LLMs and enhanced systems. The top-ranked submission achieved an 89.30% F1 score on the MedCPT metric and an 87.30% exact match (EM) score, compared with 67.40% and 60.20%, respectively, for the zero-shot baseline. A central finding of the challenge was that retrieval-augmented generation (RAG) and related retrieval-based strategies were critical for strong performance. In addition, concept-level evaluation improved answer assessment when correct responses differed in surface form. The MedHopQA dataset is publicly available to support continued progress in this important area. Challenge materials: https://www.ncbi.nlm.nih.gov/research/bionlp/medhopqa and benchmark https://www.codabench.org/competitions/7609/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper provides an overview of the MedHopQA shared task at BioCreative IX. It describes the construction of a novel dataset of 1,000 QA pairs on diseases, genes, and chemicals (emphasizing rare diseases), where each question is designed to require two-hop reasoning across two distinct Wikipedia pages. The track received 48 submissions from 13 teams. Systems were evaluated with exact match (EM) and a conceptual MedCPT F1 score. Results show the top submission at 89.30% MedCPT F1 and 87.30% EM versus 67.40% and 60.20% for the zero-shot baseline. The paper concludes that retrieval-augmented generation (RAG) and retrieval-based strategies were critical for strong performance, releases the dataset publicly, and notes that concept-level evaluation helps when surface forms differ.

Significance. If the dataset's multi-hop property holds, the work supplies a public benchmark and evaluation platform for biomedical multi-hop QA, with concrete participation numbers and performance deltas that can guide future LLM development in the domain. The emphasis on rare diseases and the MedCPT metric add domain-specific value beyond standard string matching.

major comments (1)
  1. [Dataset construction / track description] Dataset description (abstract and track description sections): The manuscript asserts that the 1,000 questions were constructed to require two-hop reasoning via integration across two distinct Wikipedia pages, yet provides no verification steps such as single-page retrieval tests, human annotation for answer locality, or ablations demonstrating that zero-shot/one-hop baselines fail specifically on these items. This is load-bearing for the central claim that the observed performance gap (89.30% MedCPT F1 / 87.30% EM vs. 67.40%/60.20% baseline) and the conclusion that RAG strategies are critical reflect multi-hop integration rather than improved single-fact retrieval or metric handling.
minor comments (1)
  1. [Abstract] Abstract: 'benchmark in multi-hop reasoning' appears to be a minor phrasing issue and should read 'benchmarking multi-hop reasoning'.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive review and for highlighting the importance of verifying the multi-hop character of the MedHopQA dataset. We agree that this verification is central to interpreting the performance results and have revised the manuscript to include additional details and supporting analyses on dataset construction.

read point-by-point responses
  1. Referee: Dataset description (abstract and track description sections): The manuscript asserts that the 1,000 questions were constructed to require two-hop reasoning via integration across two distinct Wikipedia pages, yet provides no verification steps such as single-page retrieval tests, human annotation for answer locality, or ablations demonstrating that zero-shot/one-hop baselines fail specifically on these items. This is load-bearing for the central claim that the observed performance gap (89.30% MedCPT F1 / 87.30% EM vs. 67.40%/60.20% baseline) and the conclusion that RAG strategies are critical reflect multi-hop integration rather than improved single-fact retrieval or metric handling.

    Authors: We acknowledge that the original manuscript did not present explicit verification experiments for the two-hop requirement. The questions were designed by domain experts following a protocol that required each item to depend on facts from two distinct Wikipedia pages, with the second fact only accessible after retrieving the first. In the revised version we have added: (1) a detailed description of the annotation guidelines and quality-control steps used to enforce two-hop dependency; (2) results of a human annotation study in which annotators were restricted to a single page and could not locate the answer for the large majority of questions; and (3) an ablation comparing a one-hop retrieval baseline against the submitted two-hop RAG systems, confirming a substantial performance drop when multi-hop integration is removed. These additions directly address the concern that the observed gap (and the value of RAG) might reflect single-fact retrieval or metric effects alone. We have also clarified that the MedCPT conceptual metric was applied uniformly and still shows the same ordering, further supporting that the gains arise from reasoning across hops. revision: yes
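If the added one-hop ablation yields paired per-question outcomes, an exact McNemar test is the standard check that the drop is systematic rather than noise. A small sketch follows, with hypothetical discordant counts rather than numbers from the paper.

```python
# Sketch: exact McNemar test on paired outcomes from a one-hop vs.
# two-hop ablation. The discordant counts below are hypothetical.
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact binomial p-value.
    b = items only the two-hop system answered correctly,
    c = items only the one-hop system answered correctly."""
    n = b + c
    tail = sum(comb(n, k) for k in range(min(b, c) + 1))
    return min(1.0, 2 * tail / 2**n)

print(mcnemar_exact(b=240, c=35))  # vanishingly small p -> systematic drop
```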

Circularity Check

0 steps flagged

No circularity in empirical reporting of shared-task results

full rationale

The paper is an overview of a BioCreative shared task that reports participant submissions, baseline scores, and dataset construction details without any mathematical derivations, parameter fittings, or predictive equations. Central claims rest on externally evaluated F1/EM/MedCPT metrics from 48 submissions rather than any self-referential reduction or ansatz smuggled via citation. No load-bearing step equates a claimed result to its own inputs by construction, and the analysis is self-contained against the reported benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical shared-task overview paper rather than a theoretical derivation, so it introduces no free parameters, axioms, or invented entities.

pith-pipeline@v0.9.0 · 5677 in / 1179 out tokens · 71899 ms · 2026-05-13T04:40:45.993233+00:00 · methodology


Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · 1 internal anchor

  1. [1]

    (2016) Community challenges in biomedical text mining over 10 years: success, failure and the future

    Huang, C.C., Lu, Z. (2016) Community challenges in biomedical text mining over 10 years: success, failure and the future. Brief Bioinform, 17, 132-144

  2. [2]

    (2017) Information Retrieval and Text Mining Technologies for Chemistry

    Krallinger, M., Rabal, O., Lourenco, A., et al. (2017) Information Retrieval and Text Mining Technologies for Chemistry. Chem Rev, 117, 7673-7761

  3. [3]

    (2016) BioCreative V CDR task corpus: a resource for chemical disease relation extraction

    Li, J., Sun, Y., Johnson, R.J., et al. (2016) BioCreative V CDR task corpus: a resource for chemical disease relation extraction. Database (Oxford), 2016

  4. [4]

    (2005) Overview of BioCreAtIvE: critical assessment of information extraction for biology

    Hirschman, L., Yeh, A., Blaschke, C., et al. (2005) Overview of BioCreAtIvE: critical assessment of information extraction for biology. BMC Bioinformatics, 6 Suppl 1, S1

  5. [5]

    (2008) Evaluation of text-mining systems for biology: overview of the Second BioCreative community challenge

    Krallinger, M., Morgan, A., Smith, L., et al. (2008) Evaluation of text-mining systems for biology: overview of the Second BioCreative community challenge. Genome Biol, 9 Suppl 2, S1

  6. [6]

    (2011) Overview of the BioCreative III Workshop

    Arighi, C.N., Lu, Z., Krallinger, M., et al. (2011) Overview of the BioCreative III Workshop. BMC Bioinformatics, 12 Suppl 8, S1

  7. [7]

    (2011) The Protein-Protein Interaction tasks of BioCreative III: classification/ranking of articles and linking bio-ontology concepts to full text

    Krallinger, M., Vazquez, M., Leitner, F., et al. (2011) The Protein-Protein Interaction tasks of BioCreative III: classification/ranking of articles and linking bio-ontology concepts to full text. BMC Bioinformatics, 12 Suppl 8, S3

  8. [8]

    (2008) Overview of BioCreative II gene normalization

    Morgan, A.A., Lu, Z., Wang, X., et al. (2008) Overview of BioCreative II gene normalization. Genome Biol, 9 Suppl 2, S3

  9. [9]

    (2019) Overview of the BioCreative VI Precision Medicine Track: mining protein interactions and mutations for precision medicine

    Islamaj Dogan, R., Kim, S., Chatr-Aryamontri, A., et al. (2019) Overview of the BioCreative VI Precision Medicine Track: mining protein interactions and mutations for precision medicine. Database (Oxford), 2019

  10. [10]

    (2023) Chemical identification and indexing in full-text articles: an overview of the NLM-Chem track at BioCreative VII

    Leaman, R., Islamaj, R., Adams, V., et al. (2023) Chemical identification and indexing in full-text articles: an overview of the NLM-Chem track at BioCreative VII. Database (Oxford), 2023

  11. [11]

    (2019) The extraction of complex relationships and their conversion to biological expression language (BEL) overview of the BioCreative VI (2017) BEL track

    Madan, S., Szostak, J., Komandur Elayavilli, R., et al. (2019) The extraction of complex relationships and their conversion to biological expression language (BEL) overview of the BioCreative VI (2017) BEL track. Database (Oxford), 2019

  12. [12]

    (2023) Overview of DrugProt task at BioCreative VII: data and methods for large-scale text mining and knowledge graph generation of heterogenous chemical-protein relations

    Miranda-Escalada, A., Mehryary, F., Luoma, J., et al. (2023) Overview of DrugProt task at BioCreative VII: data and methods for large-scale text mining and knowledge graph generation of heterogenous chemical-protein relations. Database (Oxford), 2023

  13. [13]

    (2024) The overview of the BioRED (Biomedical Relation Extraction Dataset) track at BioCreative VIII

    Islamaj, R., Lai, P.T., Wei, C.H., et al. (2024) The overview of the BioRED (Biomedical Relation Extraction Dataset) track at BioCreative VIII. Database (Oxford), 2024

  14. [15]

    (2026) Synthesizing scientific literature with retrieval-augmented language models

    Asai, A., He, J., Shao, R., et al. (2026) Synthesizing scientific literature with retrieval-augmented language models. Nature, 650, 857-863

  15. [16]

    (2025) Toward expert-level medical question answering with large language models

    Singhal, K., Tu, T., Gottweis, J., et al. (2025) Toward expert-level medical question answering with large language models. Nat Med, 31, 943-950

  16. [17]

    (2025) Overview of the MedHopQA track at BioCreative IX: track description, participation and evaluation of systems for multi-hop medical question answering

    Islamaj, R., Chan, J., Leaman, R., et al. (2025) Overview of the MedHopQA track at BioCreative IX: track description, participation and evaluation of systems for multi-hop medical question answering. Proceedings of the BioCreative IX Challenge and Workshop (BC9): Large Language Models for Clinical and Biomedical NLP at the International Joint Conference ...

  17. [18]

    (2022) Biomedical Question Answering: A Survey of Approaches and Challenges

    Jin, Q., Yuan, Z., Xiong, G., et al. (2022) Biomedical Question Answering: A Survey of Approaches and Challenges. ACM Comput. Surv., 55, Article 35

  18. [19]

    (2015) An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition

    Tsatsaronis, G., Balikas, G., Malakasiotis, P., et al. (2015) An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition. BMC Bioinformatics, 16, 138

  19. [20]

    (2019) PubMedQA: A Dataset for Biomedical Research Question Answering

    Jin, Q., Dhingra, B., Liu, Z., et al. (2019) PubMedQA: A Dataset for Biomedical Research Question Answering. Association for Computational Linguistics, Hong Kong, China, pp. 2567-2577

  20. [21]

    (2021) What Disease Does This Patient Have? A Large-Scale Open Domain Question Answering Dataset from Medical Exams

    Jin, D., Pan, E., Oufattole, N., et al. (2021) What Disease Does This Patient Have? A Large-Scale Open Domain Question Answering Dataset from Medical Exams. Applied Sciences, 11, 6421

  21. [22]

    (2022) MedMCQA: A Large-scale Multi-Subject Multi-Choice Dataset for Medical domain Question Answering

    Pal, A., Umapathi, L.K., Sankarasubbu, M. (2022) MedMCQA: A Large-scale Multi-Subject Multi-Choice Dataset for Medical domain Question Answering. In Gerardo, F., George, H.C., Tom, P., et al. (eds.), Proceedings of the Conference on Health, Inference, and Learning. PMLR, Proceedings of Machine Learning Research, Vol. 174, pp. 248-260

  22. [23]

    (2023) Large language models encode clinical knowledge

    Singhal, K., Azizi, S., Tu, T., et al. (2023) Large language models encode clinical knowledge. Nature, 620, 172-180

  23. [24]

    (2018) Constructing Datasets for Multi-hop Reading Comprehension Across Documents

    Welbl, J., Stenetorp, P., Riedel, S. (2018) Constructing Datasets for Multi-hop Reading Comprehension Across Documents. Transactions of the Association for Computational Linguistics, 6, 287-302

  24. [25]

    (2018) emrqa: A large corpus for question answering on electronic medical records

    Pampari, A., Raghavan, P., Liang, J., et al. (2018) emrqa: A large corpus for question answering on electronic medical records. Proceedings of the 2018 conference on empirical methods in natural language processing, pp. 2357-2368

  25. [26]

    (2021) Overview of the MEDIQA 2021 Shared Task on Summarization in the Medical Domain

    Ben Abacha, A., Mrabet, Y., Zhang, Y., et al. (2021) Overview of the MEDIQA 2021 Shared Task on Summarization in the Medical Domain. Association for Computational Linguistics, Online, pp. 74-85

  26. [27]

    (2016) Overview of the TREC 2016 clinical decision support track

    Roberts, K., Demner-Fushman, D., Voorhees, E., et al. (2016) Overview of the TREC 2016 clinical decision support track. Proceedings of the 15th text retrieval conference

  27. [28]

    Islamaj, R., Lima López, S., Xu, D., et al. (2025) Proceedings of the BioCreative IX Challenge and Workshop (BC9): Large Language Models for Clinical and Biomedical NLP at the International Joint Conference on Artificial Intelligence (IJCAI). In Islamaj, R. and Lima López, S. (eds.), BioCreative IX Challenge and Workshop (BC9): Large Language Models for C...

  28. [29]

    (2008) Biomedical ontologies in action: role in knowledge management, data integration and decision support

    Bodenreider, O. (2008) Biomedical ontologies in action: role in knowledge management, data integration and decision support. Yearb Med Inform, 67-79

  29. [30]

    (2023) MedCPT: Contrastive Pre-trained Transformers with large-scale PubMed search logs for zero-shot biomedical information retrieval

    Jin, Q., Kim, W., Chen, Q., et al. (2023) MedCPT: Contrastive Pre-trained Transformers with large-scale PubMed search logs for zero-shot biomedical information retrieval. Bioinformatics, 39

  30. [31]

    (2025) DMIS Lab at MedHopQA-2025: Ensemble Multi-Retrieval Methodologies with Reasoning Language Model Decision

    Jung, J., Hwang, H., Park, Y., et al. (2025) DMIS Lab at MedHopQA-2025: Ensemble Multi-Retrieval Methodologies with Reasoning Language Model Decision. Proceedings of the BioCreative IX Challenge and Workshop (BC9): Large Language Models for Clinical and Biomedical NLP at the Internatio...

  31. [32]

    (2025) UETQuintet at BioCreative IX – MedHopQA: Enhancing Biomedical QA with Selective Multi-hop Reasoning and Contextual Retrieval

    Nguyen, Q.-A., Vu, T.-M.-T., Nguyen, B.-D., et al. (2025) UETQuintet at BioCreative IX – MedHopQA: Enhancing Biomedical QA with Selective Multi-hop Reasoning and Contextual Retrieval. Proceedings of the BioCreative IX Challenge and Workshop (BC9): Large Language Models for Clinical and Biomedical NLP at the International Joint Conference on...

  32. [33]

    Phasook, P., Kinchagawat, J., Chinkamol, A., et al. (2025) NHSRAG: Addressing Multi-Hop Medical QA with Named-entity Heuristic Search Retrieval-Augmented Generation. Proceedings of the BioCreative IX Challenge and Workshop (BC9): Large Language Models for Clinical and Biomedical NLP at th...

  33. [34]

    Alliheedi, M., et al. (2025) Evaluating Advanced Prompting on Gemini Flash for Multi-Hop Biomedical QA. Proceedings of the BioCreative IX Challenge and Workshop (BC9): Large Language Models for Clinical and Biomedical NLP at the International Joint Conference on Artificial Intelligence (IJCAI)

  34. [35]

    Saisudha, H.G., et al. (2025) Agentic and Non-Agentic Multi-Hop Systems for Medical Question Answering. Proceedings of the BioCreative IX Challenge and Workshop (BC9): Large Language Models for Clinical and Biomedical NLP at the International Joint Conference on Artificial Intelligence (IJCAI)

  35. [36]

    (2025) Wikipedia-based hybrid-search RAG with prompt decomposition for MedHopQA

    Taktashov, R.R., Tarasova, O.A., Dmitriev, A.V., et al. (2025) Wikipedia-based hybrid-search RAG with prompt decomposition for MedHopQA. Proceedings of the BioCreative IX Challenge and Workshop (BC9): Large Language Models for Clinical and Biomedical NLP at the International Joint Conference on Artificial Intelligence (IJCAI)

  36. [37]

    Conceição, S.I.R., et al. (2025) lasigeBioTM at MedHop track: Can a Lean RAG-Enhanced Model Compete with MedGemma. Proceedings of the BioCreative IX Challenge and Workshop (BC9): Large Language Models for Clinical and Biomedical NLP at the International Joint Conference on Artificial Intelligence (IJCAI)

  37. [38]

    Abdel-Salam, R., et al. (2025) CaresAI at BioCreative IX Track 1 - LLM for Biomedical QA. Proceedings of the BioCreative IX Challenge and Workshop (BC9): Large Language Models for Clinical and Biomedical NLP at the International Joint Conference on Artificial Intelligence (IJCAI)

  38. [39]

    Ji, Y., Verma, S., Ji, H., et al. (2025) DeepRAG: Integrating Hierarchical Reasoning and Process Supervision for Biomedical Multi-Hop QA. Proceedings of the BioCreative IX Challenge and Workshop (BC9): Large Language Models for Clinical and Biomedical NLP at the International Joint Conference on Artificial Intell...

  39. [40]

    (2023) Query2doc: Query Expansion with Large Language Models

    Wang, L., Yang, N., Wei, F. (2023) Query2doc: Query Expansion with Large Language Models. Association for Computational Linguistics, Singapore, pp. 9414-9423

  40. [41]

    (2024) Improving medical reasoning through retrieval and self-reflection with retrieval-augmented large language models

    Jeong, M., Sohn, J., Sung, M., et al. (2024) Improving medical reasoning through retrieval and self-reflection with retrieval-augmented large language models. Bioinformatics, 40, i119-i129

  41. [42]

    (2025) Rationale-Guided Retrieval Augmented Generation for Medical Question Answering

    Sohn, J., Park, Y., Yoon, C., et al. (2025) Rationale-Guided Retrieval Augmented Generation for Medical Question Answering. Association for Computational Linguistics, Albuquerque, New Mexico, pp. 12739-12753

  42. [43]

    (2023) Efficient Memory Management for Large Language Model Serving with PagedAttention

    Kwon, W., Li, Z., Zhuang, S., et al. (2023) Efficient Memory Management for Large Language Model Serving with PagedAttention. Proceedings of the 29th Symposium on Operating Systems Principles. Association for Computing Machinery, Koblenz, Germany, pp. 611–626

  43. [44]

    (2025) `smolagents`: a smol library to build great agentic systems

    Roucher, A., del Moral, A.V., Wolf, T., et al. (2025) `smolagents`: a smol library to build great agentic systems

  44. [45]

    (2022) React: Synergizing reasoning and acting in language models

    Yao, S., Zhao, J., Yu, D., et al. (2022) React: Synergizing reasoning and acting in language models. The eleventh international conference on learning representations

  45. [46]

    (2024) Bm25s: Orders of magnitude faster lexical search via eager sparse scoring

    Lù, X.H. (2024) Bm25s: Orders of magnitude faster lexical search via eager sparse scoring. arXiv preprint arXiv:2407.03618

  46. [47]

    (2009) Reciprocal rank fusion outperforms condorcet and individual rank learning methods

    Cormack, G.V., Clarke, C.L.A., Buettcher, S. (2009) Reciprocal rank fusion outperforms condorcet and individual rank learning methods. Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval. Association for Computing Machinery, Boston, MA, USA, pp. 758–759

  47. [48]

    (2024) Med42-v2: A suite of clinical LLMs

    Christophe, C., Kanithi, P.K., Raha, T., et al. (2024) Med42-v2: A suite of clinical LLMs. arXiv preprint arXiv:2408.06142

  48. [49]

    (2023) 6G non-terrestrial networks enabled low-altitude economy: Opportunities and challenges

    Jiang, Y., Li, X., Zhu, G., et al. (2023) 6G non-terrestrial networks enabled low-altitude economy: Opportunities and challenges. arXiv preprint arXiv:2311.09047

  49. [50]

    (2022) Mondo: Unifying diseases for the world, by the world

    Vasilevsky, N.A., Matentzoglu, N.A., Toro, S., et al. (2022) Mondo: Unifying diseases for the world, by the world. medRxiv, 2022.04.13.22273750

  50. [51]

    MedGemma Technical Report

    Sellergren, A., Kazemzadeh, S., Jaroensri, T., et al. (2025) MedGemma Technical Report. arXiv preprint arXiv:2507.05201