Query pipeline optimization for cancer patient question answering systems

Brian E. Chapman; Maolin He; Mike Conway; Rena Gao

arxiv: 2412.14751 · v2 · submitted 2024-12-19 · 💻 cs.CL

Query pipeline optimization for cancer patient question answering systems

Maolin He , Rena Gao , Mike Conway , Brian E. Chapman This is my paper

Pith reviewed 2026-05-23 06:28 UTC · model grok-4.3

classification 💻 cs.CL

keywords retrieval-augmented generationRAG query pipelinecancer patient question answeringbiomedical databasesdocument retrievalpassage retrievalsemantic segmentation

0 comments

The pith

A three-aspect optimization of the RAG query pipeline improves accuracy on cancer patient questions by 5.24 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to establish that query pipelines for retrieval-augmented generation in cancer patient question-answering systems need separate, domain-specific tuning of document retrieval, passage retrieval, and semantic representation. It introduces Hybrid Semantic Real-time Document Retrieval for documents, optimal dense retriever-reranker pairings for passages, and Semantic Enhanced Overlap Segmentation for context, all drawing on PubMed and PubMed Central. A sympathetic reader would care because these changes produce measurable gains in answer accuracy for an LLM on cancer-related queries. The work shows the optimizations outperform both chain-of-thought prompting and a basic RAG baseline on a custom dataset.

Core claim

The central claim is that the three proposed optimizations—comparative analysis of NCBI resources with Hybrid Semantic Real-time Document Retrieval, identification of best dense retriever and reranker pairs, and Semantic Enhanced Overlap Segmentation—raise the answer accuracy of Claude-3-haiku by 5.24 percent over chain-of-thought prompting and roughly 3 percent over a naive RAG setup when tested on a custom dataset of cancer-related inquiries.

What carries the argument

The three-aspect optimization approach for the RAG query pipeline, consisting of document retrieval via HSRDR, passage retrieval via retriever-reranker pairings, and semantic representation via SEOS.

If this is right

Domain-specific tuning of each RAG pipeline stage is required to achieve the reported gains in CPQA systems.
Public biomedical databases such as PubMed become effective grounding sources once paired with the described retrieval methods.
The overall framework supports construction of more accurate CPQA systems than either prompting alone or untuned RAG.
The same three-aspect structure can be reused as a template for other biomedical RAG applications.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the optimizations hold on other medical topics, they could reduce the need for larger models in specialized QA tasks.
Testing the pipeline on questions drawn directly from clinical records rather than a custom set would clarify real-world transfer.
The accuracy delta might compound if the optimized retrieval is combined with model fine-tuning on biomedical text.

Load-bearing premise

The custom dataset is representative of real cancer patient questions and the measured accuracy gains are produced by the three optimizations rather than by how the dataset was built or evaluated.

What would settle it

Running the same optimized pipeline and baselines on an independently gathered collection of actual cancer patient questions and observing no accuracy improvement would falsify the claim.

Figures

Figures reproduced from arXiv: 2412.14751 by Brian E. Chapman, Maolin He, Mike Conway, Rena Gao.

**Figure 1.** Figure 1: Description of filtered cancer QA datasets used in this study. six widely used medical QA datasets to create cancer-related evaluation datasets ( [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗

**Figure 2.** Figure 2: The HSRDR employs dual retrieval strategies, then downloads and filters candidate documents. After document Retrieval, next steps and comparative analyses are conducted C. Two-Stage Passage Retrieval While MedCPT excels in document retrieval tasks, we need embedding models (dense retrievers) that excel in generating sentence-level representations to handle shorter, more specific text spans for matching qu… view at source ↗

**Figure 3.** Figure 3: Distribution comparison between Initial Document Pool and Top-5 Retrieved Evidence when HSRDR’s Retrieval Source involving PubMed Abstract, PMC Reviews and PMC Others PubMed Abstracts Dominance and Decline in Evidence: PubMed abstracts comprise a substantial portion of the top-5 evidence, likely due to their wider coverage than PMC (23.9M citations with valid abstract vs. 8M free full-text articles), sugg… view at source ↗

**Figure 4.** Figure 4: Performance of Embedding Models with rerankers Domain-Specific Feature is Crucial: Pubmedbertmatryoshka , despite its smaller size and absence from the MTEB leaderboard, achieved the second-best performance when paired with the MedCPT-reranker. This suggests that the size of the embedding model is not the only determinant of effectiveness and that domain-specific fine-tuning or training can significantly … view at source ↗

read the original abstract

Retrieval-augmented generation (RAG) mitigates hallucination in Large Language Models (LLMs) by using query pipelines to retrieve relevant external information and grounding responses in retrieved knowledge. However, query pipeline optimization for cancer patient question-answering (CPQA) systems requires separately optimizing multiple components with domain-specific considerations. We propose a novel three-aspect optimization approach for the RAG query pipeline in CPQA systems, utilizing public biomedical databases like PubMed and PubMed Central. Our optimization includes: (1) document retrieval, utilizing a comparative analysis of NCBI resources and introducing Hybrid Semantic Real-time Document Retrieval (HSRDR); (2) passage retrieval, identifying optimal pairings of dense retrievers and rerankers; and (3) semantic representation, introducing Semantic Enhanced Overlap Segmentation (SEOS) for improved contextual understanding. On a custom-developed dataset tailored for cancer-related inquiries, our optimized RAG approach improved the answer accuracy of Claude-3-haiku by 5.24% over chain-of-thought prompting and about 3% over a naive RAG setup. This study highlights the importance of domain-specific query optimization in realizing the full potential of RAG and provides a robust framework for building more accurate and reliable CPQA systems, advancing the development of RAG-based biomedical systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The reported gains on Claude-3-haiku rest on an unreleased custom dataset with no ablations, dataset stats, or metric details, so the three optimizations cannot be credited.

read the letter

The main takeaway is that this paper shows small accuracy improvements from RAG tweaks on cancer questions, but the custom dataset isn't shared and there's no breakdown of which parts helped, so the results are hard to trust or build on. They adapt existing retrieval methods for biomedical sources like PubMed and introduce two named procedures: HSRDR for document retrieval and SEOS for segmenting passages to keep context. The three-aspect view—document level, passage level, and semantic overlap—makes sense for medical QA where precision matters. They test on Claude-3-haiku and report better performance than basic chain-of-thought or plain RAG. That practical angle is the strength. People building patient-facing systems could pick up the specific pairings of retrievers and rerankers or the segmentation idea. The weak part is the evidence. The abstract mentions percentage gains but skips dataset size, how the questions were made, what 'answer accuracy' means exactly, and any statistical checks. No ablations are described, so it's possible the gains come from how the test set was put together rather than the new methods. The dataset being custom and not released adds to the problem. This work is aimed at applied researchers in biomedical NLP who are optimizing RAG pipelines for specific domains. A reader looking for new theory or strong benchmarks won't find it here. Given the missing controls and reproducibility issues, it doesn't seem ready for a full peer review process. I'd recommend against sending it to referees in its current form.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes a three-aspect optimization framework for RAG query pipelines in cancer patient question-answering systems. It introduces Hybrid Semantic Real-time Document Retrieval (HSRDR) using NCBI resources, identifies optimal dense retriever-reranker pairings for passage retrieval, and presents Semantic Enhanced Overlap Segmentation (SEOS) for semantic representation. On a custom-developed dataset of cancer-related inquiries, the optimized pipeline is reported to improve answer accuracy of Claude-3-haiku by 5.24% relative to chain-of-thought prompting and approximately 3% relative to a naive RAG baseline.

Significance. If the reported gains prove robust and causally attributable to the three proposed components, the work would supply a practical, domain-specific template for RAG optimization in biomedical QA. The emphasis on public biomedical corpora (PubMed, PubMed Central) and the explicit separation of document-level, passage-level, and segmentation-level choices are potentially useful for practitioners. At present, however, the absence of dataset statistics, metric definitions, and component ablations prevents any such assessment.

major comments (3)

[Abstract] Abstract: the headline claim of a 5.24 % accuracy lift is stated without any accompanying information on dataset size, question provenance, ground-truth construction, inter-annotator agreement, or the precise definition of “answer accuracy” (exact match, LLM-as-judge, human rating, etc.). These omissions make it impossible to determine whether the observed deltas are driven by the three optimizations or by dataset-construction or evaluation artifacts.
[Results] Results / Evaluation section: the three proposed components (HSRDR, retriever-reranker pairings, SEOS) are never ablated against one another on a fixed test set. Consequently it is impossible to isolate which component, if any, accounts for the reported improvement over the naive RAG baseline.
[Methods] Methods: no statistical tests, confidence intervals, or controls for prompt leakage or dataset leakage are described, leaving the 3 % and 5.24 % deltas without evidence of statistical reliability or causal attribution.

minor comments (2)

[Abstract] The manuscript should supply a clear, reproducible definition of the accuracy metric and release (or at minimum describe in detail) the custom dataset so that the empirical claims can be verified.
[Methods] Notation for the three optimization stages (HSRDR, SEOS) is introduced without an accompanying diagram or pseudocode that would clarify their integration into a single query pipeline.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to improve transparency, add missing analyses, and strengthen the evaluation section.

read point-by-point responses

Referee: [Abstract] Abstract: the headline claim of a 5.24 % accuracy lift is stated without any accompanying information on dataset size, question provenance, ground-truth construction, inter-annotator agreement, or the precise definition of “answer accuracy” (exact match, LLM-as-judge, human rating, etc.). These omissions make it impossible to determine whether the observed deltas are driven by the three optimizations or by dataset-construction or evaluation artifacts.

Authors: We agree that the abstract would benefit from a concise summary of the evaluation setup. Detailed information on the custom dataset (size, provenance from cancer patient inquiries, expert-constructed ground truth, and answer accuracy defined via LLM-as-judge with human verification) appears in the Methods and Results sections. We will revise the abstract to include a brief overview of these elements, dataset statistics, and the accuracy metric. Inter-annotator agreement is not applicable, as ground truth was produced by domain experts using a single-annotator protocol for consistency in this specialized biomedical domain. revision: yes
Referee: [Results] Results / Evaluation section: the three proposed components (HSRDR, retriever-reranker pairings, SEOS) are never ablated against one another on a fixed test set. Consequently it is impossible to isolate which component, if any, accounts for the reported improvement over the naive RAG baseline.

Authors: The referee is correct that component-wise ablations on a fixed test set are absent. We will add these ablations in a revised Results section, reporting performance when enabling/disabling each aspect (HSRDR, optimal retriever-reranker pairs, and SEOS) independently while holding the test set constant. This will clarify the contribution of each optimization to the observed gains over the naive RAG baseline. revision: yes
Referee: [Methods] Methods: no statistical tests, confidence intervals, or controls for prompt leakage or dataset leakage are described, leaving the 3 % and 5.24 % deltas without evidence of statistical reliability or causal attribution.

Authors: We acknowledge the omission of statistical validation and leakage controls. In the revision we will add bootstrap-derived 95% confidence intervals around the accuracy deltas and report results of paired statistical tests (e.g., McNemar’s test) to establish reliability. We will also describe the leakage safeguards already used, including temporally disjoint test questions and explicit checks that test queries do not overlap with the retrieval corpus or model training data. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical performance reporting with no derivations or self-referential reductions

full rationale

The paper proposes three RAG pipeline components (HSRDR, retriever-reranker pairings, SEOS) and reports measured accuracy gains on a custom dataset. No equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. All claims are direct empirical outcomes rather than tautological reductions of inputs; the work is self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review based on abstract only; no explicit free parameters, axioms, or invented entities are described beyond standard assumptions of RAG pipelines and the validity of the custom dataset.

pith-pipeline@v0.9.0 · 5761 in / 1100 out tokens · 44366 ms · 2026-05-23T06:28:37.583346+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Learning Where to Embed: Noise-Aware Positional Embedding for Query Retrieval in Small-Object Detection
cs.CV 2026-04 unverdicted novelty 7.0

HELP uses heatmap-guided positional embeddings and a gradient mask to suppress background noise in queries, enabling efficient small-object detection with fewer decoder layers and parameters.

Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages · cited by 1 Pith paper · 2 internal anchors

[1]

Scientific literature: Information overload,

E. Landhuis, “Scientific literature: Information overload,” Nature, vol. 535, no. 7612, pp. 457–458, 2016

work page 2016
[2]

Benchmarking retrieval- augmented generation for medicine,

G. Xiong, Q. Jin, Z. Lu, and A. Zhang, “Benchmarking retrieval- augmented generation for medicine,” in Findings of the Association for Computational Linguistics: ACL 2024 , L.-W. Ku, A. Martins, and V . Srikumar, Eds. Bangkok, Thailand: Association for Computational Linguistics, Aug. 2024, pp. 6233–6251. [Online]. Available: https: //aclanthology.org/2024...

work page 2024
[3]

Survey of hallucination in natural language generation,

Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y . Xu, E. Ishii, Y . J. Bang, A. Madotto, and P. Fung, “Survey of hallucination in natural language generation,” ACM Computing Surveys , vol. 55, no. 12, pp. 1–38, 2023

work page 2023
[4]

Towards mitigating llm hallucination via self reflection,

Z. Ji, T. Yu, Y . Xu, N. Lee, E. Ishii, and P. Fung, “Towards mitigating llm hallucination via self reflection,” in Findings of the Association for Computational Linguistics: EMNLP 2023 , 2023, pp. 1827–1843

work page 2023
[5]

Med-halt: Medical domain hallucination test for large language models,

L. K. Umapathi, A. Pal, and M. Sankarasubbu, “Med-halt: Medical domain hallucination test for large language models,” arXiv preprint arXiv:2307.15343, 2023

work page arXiv 2023
[6]

Retrieval- augmented generation for knowledge-intensive nlp tasks,

P. Lewis, E. Perez, A. Piktus, F. Petroni, V . Karpukhin, N. Goyal, H. K ¨uttler, M. Lewis, W.-t. Yih, T. Rockt ¨aschel et al. , “Retrieval- augmented generation for knowledge-intensive nlp tasks,” Advances in Neural Information Processing Systems , vol. 33, pp. 9459–9474, 2020

work page 2020
[7]

Knowledge-augmented reasoning distillation for small language models in knowledge-intensive tasks,

M. Kang, S. Lee, J. Baek, K. Kawaguchi, and S. J. Hwang, “Knowledge-augmented reasoning distillation for small language models in knowledge-intensive tasks,” Advances in Neural Information Process- ing Systems, vol. 36, 2024

work page 2024
[8]

Large language mod- els should be used as scientific reasoning engines, not knowledge databases,

D. Truhn, J. S. Reis-Filho, and J. N. Kather, “Large language mod- els should be used as scientific reasoning engines, not knowledge databases,” Nature medicine, vol. 29, no. 12, pp. 2983–2984, 2023

work page 2023
[9]

Enabling large language models to generate text with citations,

T. Gao, H. Yen, J. Yu, and D. Chen, “Enabling large language models to generate text with citations,” arXiv preprint arXiv:2305.14627, 2023

work page arXiv 2023
[10]

Superposition prompting: Improving and accelerating retrieval-augmented generation,

T. Merth, Q. Fu, M. Rastegari, and M. Najibi, “Superposition prompting: Improving and accelerating retrieval-augmented generation,” in Forty- first International Conference on Machine Learning

work page
[11]

Raptor: Recursive abstractive processing for tree-organized retrieval,

P. Sarthi, S. Abdullah, A. Tuli, S. Khanna, A. Goldie, and C. D. Manning, “Raptor: Recursive abstractive processing for tree-organized retrieval,” in The Twelfth International Conference on Learning Repre- sentations, 2023

work page 2023
[12]

Gar-meets-rag paradigm for zero-shot information re- trieval,

D. Arora, A. Kini, S. R. Chowdhury, N. Natarajan, G. Sinha, and A. Sharma, “Gar-meets-rag paradigm for zero-shot information re- trieval,” arXiv preprint arXiv:2310.20158 , 2023

work page arXiv 2023
[13]

Multihop-rag: Benchmarking retrieval-augmented generation for multi-hop queries,

Y . Tang and Y . Yang, “Multihop-rag: Benchmarking retrieval-augmented generation for multi-hop queries,” arXiv e-prints, pp. arXiv–2401, 2024

work page 2024
[14]

Making retrieval- augmented language models robust to irrelevant context,

O. Yoran, T. Wolfson, O. Ram, and J. Berant, “Making retrieval- augmented language models robust to irrelevant context,” in The Twelfth International Conference on Learning Representations , 2023

work page 2023
[15]

Entrez direct: E-utilities on the unix command line,

J. Kans, “Entrez direct: E-utilities on the unix command line,” in Entrez programming utilities help [Internet] . National Center for Biotechnology Information (US), 2024. 8 IEEE TRANSACTIONS AND JOURNALS TEMPLATE

work page 2024
[16]

Pubmed and beyond: biomedical literature search in the age of artificial intelligence,

Q. Jin, R. Leaman, and Z. Lu, “Pubmed and beyond: biomedical literature search in the age of artificial intelligence,” Ebiomedicine, vol. 100, 2024

work page 2024
[17]

Pmc-llama: toward building open-source language models for medicine,

C. Wu, W. Lin, X. Zhang, Y . Zhang, W. Xie, and Y . Wang, “Pmc-llama: toward building open-source language models for medicine,” Journal of the American Medical Informatics Association , p. ocae045, 2024

work page 2024
[18]

Bioreader: a retrieval-enhanced text-to-text transformer for biomedical literature,

G. Frisoni, M. Mizutani, G. Moro, and L. Valgimigli, “Bioreader: a retrieval-enhanced text-to-text transformer for biomedical literature,” in Proceedings of the 2022 conference on empirical methods in natural language processing, 2022, pp. 5770–5793

work page 2022
[19]

Improving health question answering with reliable and time-aware evidence retrieval,

J. Vladika and F. Matthes, “Improving health question answering with reliable and time-aware evidence retrieval,” in Findings of the Associa- tion for Computational Linguistics: NAACL 2024, 2024, pp. 4752–4763

work page 2024
[20]

Medcpt: Contrastive pre-trained transformers with large-scale pubmed search logs for zero-shot biomedical information retrieval,

Q. Jin, W. Kim, Q. Chen, D. C. Comeau, L. Yeganova, W. J. Wilbur, and Z. Lu, “Medcpt: Contrastive pre-trained transformers with large-scale pubmed search logs for zero-shot biomedical information retrieval,” Bioinformatics, vol. 39, no. 11, p. btad651, 2023

work page 2023
[21]

Semantic models for the first-stage retrieval: A comprehensive review,

J. Guo, Y . Cai, Y . Fan, F. Sun, R. Zhang, and X. Cheng, “Semantic models for the first-stage retrieval: A comprehensive review,” ACM Transactions on Information Systems (TOIS) , vol. 40, no. 4, pp. 1–42, 2022

work page 2022
[22]

The probabilistic relevance frame- work: Bm25 and beyond,

S. Robertson, H. Zaragoza et al. , “The probabilistic relevance frame- work: Bm25 and beyond,” Foundations and Trends® in Information Retrieval, vol. 3, no. 4, pp. 333–389, 2009

work page 2009
[23]

Towards robust ranker for text retrieval,

Y . Zhou, T. Shen, X. Geng, C. Tao, C. Xu, G. Long, B. Jiao, and D. Jiang, “Towards robust ranker for text retrieval,” in Findings of the Association for Computational Linguistics: ACL 2023 , 2023, pp. 5387– 5401

work page 2023
[24]

Multi-stage document ranking with bert,

R. Nogueira, W. Yang, K. Cho, and J. Lin, “Multi-stage document ranking with bert,” arXiv preprint arXiv:1910.14424 , 2019

work page arXiv 1910
[25]

Retrieval-augmented generation for large language models: A survey

“Retrieval-augmented generation for large language models: A survey.”

work page
[26]

Llamaindex,

J. Liu, “Llamaindex,” Acceso el, vol. 6, 2022

work page 2022
[27]

Text tiling: Segmenting text into multi-paragraph subtopic passages,

M. A. Hearst, “Text tiling: Segmenting text into multi-paragraph subtopic passages,” Computational linguistics, vol. 23, no. 1, pp. 33–64, 1997

work page 1997
[28]

Can generalist foundation models outcompete special-purpose tuning? case study in medicine,

H. Nori, Y . T. Lee, S. Zhang, D. Carignan, R. Edgar, N. Fusi, N. King, J. Larson, Y . Li, W. Liu et al. , “Can generalist foundation models outcompete special-purpose tuning? case study in medicine,” Medicine, vol. 84, no. 88.3, pp. 77–3, 2023

work page 2023
[29]

Large language models encode clinical knowledge,

K. Singhal, S. Azizi, T. Tu, S. S. Mahdavi, J. Wei, H. W. Chung, N. Scales, A. Tanwani, H. Cole-Lewis, S. Pfohl et al., “Large language models encode clinical knowledge,”Nature, vol. 620, no. 7972, pp. 172– 180, 2023

work page 2023
[30]

Can large language models reason about medical questions?

V . Li ´evin, C. E. Hother, A. G. Motzfeldt, and O. Winther, “Can large language models reason about medical questions?” Patterns, vol. 5, no. 3, 2024

work page 2024
[31]

Pubmedqa: A dataset for biomedical research question answering,

Q. Jin, B. Dhingra, Z. Liu, W. Cohen, and X. Lu, “Pubmedqa: A dataset for biomedical research question answering,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 2567–2577

work page 2019
[32]

Measuring massive multitask language understanding,

D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt, “Measuring massive multitask language understanding,” in International Conference on Learning Representations , 2020

work page 2020
[33]

What disease does this patient have? a large-scale open domain question answering dataset from medical exams,

D. Jin, E. Pan, N. Oufattole, W. Wei-Hung, H. Fang, and P. Szolovits, “What disease does this patient have? a large-scale open domain question answering dataset from medical exams,” Applied Sciences , vol. 11, no. 14, p. 6421, 2021

work page 2021
[34]

Medmcqa: A large- scale multi-subject multi-choice dataset for medical domain question answering,

A. Pal, L. K. Umapathi, and M. Sankarasubbu, “Medmcqa: A large- scale multi-subject multi-choice dataset for medical domain question answering,” in Conference on health, inference, and learning . PMLR, 2022, pp. 248–260

work page 2022
[35]

Towards Expert-Level Medical Question Answering with Large Language Models

K. Singhal, T. Tu, J. Gottweis, R. Sayres, E. Wulczyn, L. Hou, K. Clark, S. Pfohl, H. Cole-Lewis, D. Neal et al. , “Towards expert- level medical question answering with large language models,” arXiv preprint arXiv:2305.09617, 2023

work page internal anchor Pith review arXiv 2023
[36]

Spell checker for consumer language (cspell),

C. J. Lu, A. R. Aronson, S. E. Shooshan, and D. Demner-Fushman, “Spell checker for consumer language (cspell),”Journal of the American Medical Informatics Association , vol. 26, no. 3, pp. 211–218, 2019

work page 2019
[37]

Bridging the gap between consumers’ med- ication questions and trusted answers,

A. B. Abacha, Y . Mrabet, M. Sharp, T. R. Goodwin, S. E. Shooshan, and D. Demner-Fushman, “Bridging the gap between consumers’ med- ication questions and trusted answers,” in MEDINFO 2019: Health and Wellbeing e-Networks for All . IOS Press, 2019, pp. 25–29

work page 2019
[38]

Introduction to information retrieval,

D. MANNING, “Introduction to information retrieval,” Journal of the American Statistical Association , vol. 15, 2008

work page 2008
[39]

Domain-specific language model pretraining for biomedical natural language processing,

Y . Gu, R. Tinn, H. Cheng, M. Lucas, N. Usuyama, X. Liu, T. Naumann, J. Gao, and H. Poon, “Domain-specific language model pretraining for biomedical natural language processing,” 2020

work page 2020
[40]

Matryoshka representation learning,

A. Kusupati, G. Bhatt, A. Rege, M. Wallingford, A. Sinha, V . Ramanu- jan, W. Howard-Snyder, K. Chen, S. Kakade, P. Jain et al., “Matryoshka representation learning,” Advances in Neural Information Processing Systems, vol. 35, pp. 30 233–30 249, 2022

work page 2022
[41]

Angle-optimized text embeddings,

X. Li and J. Li, “Angle-optimized text embeddings,” arXiv preprint arXiv:2309.12871, 2023

work page arXiv 2023
[42]

C-pack: Packaged resources to advance general chinese embedding,

S. Xiao, Z. Liu, P. Zhang, and N. Muennighoff, “C-pack: Packaged resources to advance general chinese embedding,” 2023

work page 2023
[43]

Sfr-embedding-mistral:enhance text retrieval with transfer learning,

R. Meng, Y . Liu, S. R. Joty, C. Xiong, Y . Zhou, and S. Yavuz, “Sfr-embedding-mistral:enhance text retrieval with transfer learning,” Salesforce AI Research Blog, 2024. [Online]. Available: https: //blog.salesforceairesearch.com/sfr-embedded-mistral/

work page 2024
[44]

Measurement of semantic textual similarity in clinical texts: comparison of transformer- based models,

X. Yang, X. He, H. Zhang, Y . Ma, J. Bian, Y . Wuet al., “Measurement of semantic textual similarity in clinical texts: comparison of transformer- based models,” JMIR medical informatics , vol. 8, no. 11, p. e19735, 2020

work page 2020
[45]

MTEB: Massive Text Embedding Benchmark

N. Muennighoff, N. Tazi, L. Magne, and N. Reimers, “Mteb: Massive text embedding benchmark,” arXiv preprint arXiv:2210.07316 , 2022. [Online]. Available: https://arxiv.org/abs/2210.07316

work page internal anchor Pith review Pith/arXiv arXiv 2022
[46]

Boot and switch: Alter- nating distillation for zero-shot dense retrieval,

F. Jiang, Q. Xu, T. Drummond, and T. Cohn, “Boot and switch: Alter- nating distillation for zero-shot dense retrieval,” in The 2023 Conference on Empirical Methods in Natural Language Processing

work page 2023
[47]

Making large language models a better foundation for dense retrieval,

C. Li, Z. Liu, S. Xiao, and Y . Shao, “Making large language models a better foundation for dense retrieval,” 2023

work page 2023
[48]

Multi-passage bert: A globally normalized bert model for open-domain question answering,

Z. Wang, P. Ng, X. Ma, R. Nallapati, and B. Xiang, “Multi-passage bert: A globally normalized bert model for open-domain question answering,” arXiv preprint arXiv:1908.08167 , 2019

work page arXiv 1908
[49]

Improving social book search using structure semantics, bibliographic descriptions and social metadata,

I. Ullah, S. Khusro, and I. Ahmad, “Improving social book search using structure semantics, bibliographic descriptions and social metadata,” Multimedia Tools and Applications, vol. 80, no. 4, pp. 5131–5172, 2021

work page 2021
[50]

Reciprocal rank fusion outperforms condorcet and individual rank learning methods,

G. V . Cormack, C. L. Clarke, and S. Buettcher, “Reciprocal rank fusion outperforms condorcet and individual rank learning methods,” in Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval , 2009, pp. 758–759

work page 2009
[51]

Information entropy, rough entropy and knowledge granulation in incomplete information systems,

J. Liang, Z. Shi, D. Li, and M. J. Wierman, “Information entropy, rough entropy and knowledge granulation in incomplete information systems,” International Journal of general systems , vol. 35, no. 6, pp. 641–654, 2006

work page 2006
[52]

Chain-of-thought prompting elicits reasoning in large language models,

J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V . Le, D. Zhou et al. , “Chain-of-thought prompting elicits reasoning in large language models,” Advances in neural information processing systems , vol. 35, pp. 24 824–24 837, 2022

work page 2022
[53]

Approximate nearest neighbor negative contrastive learning for dense text retrieval,

L. Xiong, C. Xiong, Y . Li, K.-F. Tang, J. Liu, P. Bennett, J. Ahmed, and A. Overwijk, “Approximate nearest neighbor negative contrastive learning for dense text retrieval,” arXiv preprint arXiv:2007.00808 , 2020

work page arXiv 2007
[54]

Parameter-efficient prompt tuning makes generalized and calibrated neural text retrievers,

W. L. Tam, X. Liu, K. Ji, L. Xue, X. Zhang, Y . Dong, J. Liu, M. Hu, and J. Tang, “Parameter-efficient prompt tuning makes generalized and calibrated neural text retrievers,”arXiv preprint arXiv:2207.07087, 2022

work page arXiv 2022
[55]

Self-knowledge guided retrieval augmentation for large language models,

Y . Wang, P. Li, M. Sun, and Y . Liu, “Self-knowledge guided retrieval augmentation for large language models,” in Findings of the Association for Computational Linguistics: EMNLP 2023 , 2023, pp. 10 303–10 315

work page 2023
[56]

Self-rag: Learning to retrieve, generate, and critique through self-reflection,

A. Asai, Z. Wu, Y . Wang, A. Sil, and H. Hajishirzi, “Self-rag: Learning to retrieve, generate, and critique through self-reflection,” in The Twelfth International Conference on Learning Representations , 2023

work page 2023
[57]

Query2doc: Query expansion with large language models,

L. Wang, N. Yang, and F. Wei, “Query2doc: Query expansion with large language models,” arXiv preprint arXiv:2303.07678 , 2023

work page arXiv 2023
[58]

Rag-fusion: a new take on retrieval-augmented gener- ation,

Z. Rackauckas, “Rag-fusion: a new take on retrieval-augmented gener- ation,” arXiv preprint arXiv:2402.03367 , 2024

work page arXiv 2024

[1] [1]

Scientific literature: Information overload,

E. Landhuis, “Scientific literature: Information overload,” Nature, vol. 535, no. 7612, pp. 457–458, 2016

work page 2016

[2] [2]

Benchmarking retrieval- augmented generation for medicine,

G. Xiong, Q. Jin, Z. Lu, and A. Zhang, “Benchmarking retrieval- augmented generation for medicine,” in Findings of the Association for Computational Linguistics: ACL 2024 , L.-W. Ku, A. Martins, and V . Srikumar, Eds. Bangkok, Thailand: Association for Computational Linguistics, Aug. 2024, pp. 6233–6251. [Online]. Available: https: //aclanthology.org/2024...

work page 2024

[3] [3]

Survey of hallucination in natural language generation,

Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y . Xu, E. Ishii, Y . J. Bang, A. Madotto, and P. Fung, “Survey of hallucination in natural language generation,” ACM Computing Surveys , vol. 55, no. 12, pp. 1–38, 2023

work page 2023

[4] [4]

Towards mitigating llm hallucination via self reflection,

Z. Ji, T. Yu, Y . Xu, N. Lee, E. Ishii, and P. Fung, “Towards mitigating llm hallucination via self reflection,” in Findings of the Association for Computational Linguistics: EMNLP 2023 , 2023, pp. 1827–1843

work page 2023

[5] [5]

Med-halt: Medical domain hallucination test for large language models,

L. K. Umapathi, A. Pal, and M. Sankarasubbu, “Med-halt: Medical domain hallucination test for large language models,” arXiv preprint arXiv:2307.15343, 2023

work page arXiv 2023

[6] [6]

Retrieval- augmented generation for knowledge-intensive nlp tasks,

P. Lewis, E. Perez, A. Piktus, F. Petroni, V . Karpukhin, N. Goyal, H. K ¨uttler, M. Lewis, W.-t. Yih, T. Rockt ¨aschel et al. , “Retrieval- augmented generation for knowledge-intensive nlp tasks,” Advances in Neural Information Processing Systems , vol. 33, pp. 9459–9474, 2020

work page 2020

[7] [7]

Knowledge-augmented reasoning distillation for small language models in knowledge-intensive tasks,

M. Kang, S. Lee, J. Baek, K. Kawaguchi, and S. J. Hwang, “Knowledge-augmented reasoning distillation for small language models in knowledge-intensive tasks,” Advances in Neural Information Process- ing Systems, vol. 36, 2024

work page 2024

[8] [8]

Large language mod- els should be used as scientific reasoning engines, not knowledge databases,

D. Truhn, J. S. Reis-Filho, and J. N. Kather, “Large language mod- els should be used as scientific reasoning engines, not knowledge databases,” Nature medicine, vol. 29, no. 12, pp. 2983–2984, 2023

work page 2023

[9] [9]

Enabling large language models to generate text with citations,

T. Gao, H. Yen, J. Yu, and D. Chen, “Enabling large language models to generate text with citations,” arXiv preprint arXiv:2305.14627, 2023

work page arXiv 2023

[10] [10]

Superposition prompting: Improving and accelerating retrieval-augmented generation,

T. Merth, Q. Fu, M. Rastegari, and M. Najibi, “Superposition prompting: Improving and accelerating retrieval-augmented generation,” in Forty- first International Conference on Machine Learning

work page

[11] [11]

Raptor: Recursive abstractive processing for tree-organized retrieval,

P. Sarthi, S. Abdullah, A. Tuli, S. Khanna, A. Goldie, and C. D. Manning, “Raptor: Recursive abstractive processing for tree-organized retrieval,” in The Twelfth International Conference on Learning Repre- sentations, 2023

work page 2023

[12] [12]

Gar-meets-rag paradigm for zero-shot information re- trieval,

D. Arora, A. Kini, S. R. Chowdhury, N. Natarajan, G. Sinha, and A. Sharma, “Gar-meets-rag paradigm for zero-shot information re- trieval,” arXiv preprint arXiv:2310.20158 , 2023

work page arXiv 2023

[13] [13]

Multihop-rag: Benchmarking retrieval-augmented generation for multi-hop queries,

Y . Tang and Y . Yang, “Multihop-rag: Benchmarking retrieval-augmented generation for multi-hop queries,” arXiv e-prints, pp. arXiv–2401, 2024

work page 2024

[14] [14]

Making retrieval- augmented language models robust to irrelevant context,

O. Yoran, T. Wolfson, O. Ram, and J. Berant, “Making retrieval- augmented language models robust to irrelevant context,” in The Twelfth International Conference on Learning Representations , 2023

work page 2023

[15] [15]

Entrez direct: E-utilities on the unix command line,

J. Kans, “Entrez direct: E-utilities on the unix command line,” in Entrez programming utilities help [Internet] . National Center for Biotechnology Information (US), 2024. 8 IEEE TRANSACTIONS AND JOURNALS TEMPLATE

work page 2024

[16] [16]

Pubmed and beyond: biomedical literature search in the age of artificial intelligence,

Q. Jin, R. Leaman, and Z. Lu, “Pubmed and beyond: biomedical literature search in the age of artificial intelligence,” Ebiomedicine, vol. 100, 2024

work page 2024

[17] [17]

Pmc-llama: toward building open-source language models for medicine,

C. Wu, W. Lin, X. Zhang, Y . Zhang, W. Xie, and Y . Wang, “Pmc-llama: toward building open-source language models for medicine,” Journal of the American Medical Informatics Association , p. ocae045, 2024

work page 2024

[18] [18]

Bioreader: a retrieval-enhanced text-to-text transformer for biomedical literature,

G. Frisoni, M. Mizutani, G. Moro, and L. Valgimigli, “Bioreader: a retrieval-enhanced text-to-text transformer for biomedical literature,” in Proceedings of the 2022 conference on empirical methods in natural language processing, 2022, pp. 5770–5793

work page 2022

[19] [19]

Improving health question answering with reliable and time-aware evidence retrieval,

J. Vladika and F. Matthes, “Improving health question answering with reliable and time-aware evidence retrieval,” in Findings of the Associa- tion for Computational Linguistics: NAACL 2024, 2024, pp. 4752–4763

work page 2024

[20] [20]

Medcpt: Contrastive pre-trained transformers with large-scale pubmed search logs for zero-shot biomedical information retrieval,

Q. Jin, W. Kim, Q. Chen, D. C. Comeau, L. Yeganova, W. J. Wilbur, and Z. Lu, “Medcpt: Contrastive pre-trained transformers with large-scale pubmed search logs for zero-shot biomedical information retrieval,” Bioinformatics, vol. 39, no. 11, p. btad651, 2023

work page 2023

[21] [21]

Semantic models for the first-stage retrieval: A comprehensive review,

J. Guo, Y . Cai, Y . Fan, F. Sun, R. Zhang, and X. Cheng, “Semantic models for the first-stage retrieval: A comprehensive review,” ACM Transactions on Information Systems (TOIS) , vol. 40, no. 4, pp. 1–42, 2022

work page 2022

[22] [22]

The probabilistic relevance frame- work: Bm25 and beyond,

S. Robertson, H. Zaragoza et al. , “The probabilistic relevance frame- work: Bm25 and beyond,” Foundations and Trends® in Information Retrieval, vol. 3, no. 4, pp. 333–389, 2009

work page 2009

[23] [23]

Towards robust ranker for text retrieval,

Y . Zhou, T. Shen, X. Geng, C. Tao, C. Xu, G. Long, B. Jiao, and D. Jiang, “Towards robust ranker for text retrieval,” in Findings of the Association for Computational Linguistics: ACL 2023 , 2023, pp. 5387– 5401

work page 2023

[24] [24]

Multi-stage document ranking with bert,

R. Nogueira, W. Yang, K. Cho, and J. Lin, “Multi-stage document ranking with bert,” arXiv preprint arXiv:1910.14424 , 2019

work page arXiv 1910

[25] [25]

Retrieval-augmented generation for large language models: A survey

“Retrieval-augmented generation for large language models: A survey.”

work page

[26] [26]

Llamaindex,

J. Liu, “Llamaindex,” Acceso el, vol. 6, 2022

work page 2022

[27] [27]

Text tiling: Segmenting text into multi-paragraph subtopic passages,

M. A. Hearst, “Text tiling: Segmenting text into multi-paragraph subtopic passages,” Computational linguistics, vol. 23, no. 1, pp. 33–64, 1997

work page 1997

[28] [28]

Can generalist foundation models outcompete special-purpose tuning? case study in medicine,

H. Nori, Y . T. Lee, S. Zhang, D. Carignan, R. Edgar, N. Fusi, N. King, J. Larson, Y . Li, W. Liu et al. , “Can generalist foundation models outcompete special-purpose tuning? case study in medicine,” Medicine, vol. 84, no. 88.3, pp. 77–3, 2023

work page 2023

[29] [29]

Large language models encode clinical knowledge,

K. Singhal, S. Azizi, T. Tu, S. S. Mahdavi, J. Wei, H. W. Chung, N. Scales, A. Tanwani, H. Cole-Lewis, S. Pfohl et al., “Large language models encode clinical knowledge,”Nature, vol. 620, no. 7972, pp. 172– 180, 2023

work page 2023

[30] [30]

Can large language models reason about medical questions?

V . Li ´evin, C. E. Hother, A. G. Motzfeldt, and O. Winther, “Can large language models reason about medical questions?” Patterns, vol. 5, no. 3, 2024

work page 2024

[31] [31]

Pubmedqa: A dataset for biomedical research question answering,

Q. Jin, B. Dhingra, Z. Liu, W. Cohen, and X. Lu, “Pubmedqa: A dataset for biomedical research question answering,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 2567–2577

work page 2019

[32] [32]

Measuring massive multitask language understanding,

D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt, “Measuring massive multitask language understanding,” in International Conference on Learning Representations , 2020

work page 2020

[33] [33]

What disease does this patient have? a large-scale open domain question answering dataset from medical exams,

D. Jin, E. Pan, N. Oufattole, W. Wei-Hung, H. Fang, and P. Szolovits, “What disease does this patient have? a large-scale open domain question answering dataset from medical exams,” Applied Sciences , vol. 11, no. 14, p. 6421, 2021

work page 2021

[34] [34]

Medmcqa: A large- scale multi-subject multi-choice dataset for medical domain question answering,

A. Pal, L. K. Umapathi, and M. Sankarasubbu, “Medmcqa: A large- scale multi-subject multi-choice dataset for medical domain question answering,” in Conference on health, inference, and learning . PMLR, 2022, pp. 248–260

work page 2022

[35] [35]

Towards Expert-Level Medical Question Answering with Large Language Models

K. Singhal, T. Tu, J. Gottweis, R. Sayres, E. Wulczyn, L. Hou, K. Clark, S. Pfohl, H. Cole-Lewis, D. Neal et al. , “Towards expert- level medical question answering with large language models,” arXiv preprint arXiv:2305.09617, 2023

work page internal anchor Pith review arXiv 2023

[36] [36]

Spell checker for consumer language (cspell),

C. J. Lu, A. R. Aronson, S. E. Shooshan, and D. Demner-Fushman, “Spell checker for consumer language (cspell),”Journal of the American Medical Informatics Association , vol. 26, no. 3, pp. 211–218, 2019

work page 2019

[37] [37]

Bridging the gap between consumers’ med- ication questions and trusted answers,

A. B. Abacha, Y . Mrabet, M. Sharp, T. R. Goodwin, S. E. Shooshan, and D. Demner-Fushman, “Bridging the gap between consumers’ med- ication questions and trusted answers,” in MEDINFO 2019: Health and Wellbeing e-Networks for All . IOS Press, 2019, pp. 25–29

work page 2019

[38] [38]

Introduction to information retrieval,

D. MANNING, “Introduction to information retrieval,” Journal of the American Statistical Association , vol. 15, 2008

work page 2008

[39] [39]

Domain-specific language model pretraining for biomedical natural language processing,

Y . Gu, R. Tinn, H. Cheng, M. Lucas, N. Usuyama, X. Liu, T. Naumann, J. Gao, and H. Poon, “Domain-specific language model pretraining for biomedical natural language processing,” 2020

work page 2020

[40] [40]

Matryoshka representation learning,

A. Kusupati, G. Bhatt, A. Rege, M. Wallingford, A. Sinha, V . Ramanu- jan, W. Howard-Snyder, K. Chen, S. Kakade, P. Jain et al., “Matryoshka representation learning,” Advances in Neural Information Processing Systems, vol. 35, pp. 30 233–30 249, 2022

work page 2022

[41] [41]

Angle-optimized text embeddings,

X. Li and J. Li, “Angle-optimized text embeddings,” arXiv preprint arXiv:2309.12871, 2023

work page arXiv 2023

[42] [42]

C-pack: Packaged resources to advance general chinese embedding,

S. Xiao, Z. Liu, P. Zhang, and N. Muennighoff, “C-pack: Packaged resources to advance general chinese embedding,” 2023

work page 2023

[43] [43]

Sfr-embedding-mistral:enhance text retrieval with transfer learning,

R. Meng, Y . Liu, S. R. Joty, C. Xiong, Y . Zhou, and S. Yavuz, “Sfr-embedding-mistral:enhance text retrieval with transfer learning,” Salesforce AI Research Blog, 2024. [Online]. Available: https: //blog.salesforceairesearch.com/sfr-embedded-mistral/

work page 2024

[44] [44]

Measurement of semantic textual similarity in clinical texts: comparison of transformer- based models,

X. Yang, X. He, H. Zhang, Y . Ma, J. Bian, Y . Wuet al., “Measurement of semantic textual similarity in clinical texts: comparison of transformer- based models,” JMIR medical informatics , vol. 8, no. 11, p. e19735, 2020

work page 2020

[45] [45]

MTEB: Massive Text Embedding Benchmark

N. Muennighoff, N. Tazi, L. Magne, and N. Reimers, “Mteb: Massive text embedding benchmark,” arXiv preprint arXiv:2210.07316 , 2022. [Online]. Available: https://arxiv.org/abs/2210.07316

work page internal anchor Pith review Pith/arXiv arXiv 2022

[46] [46]

Boot and switch: Alter- nating distillation for zero-shot dense retrieval,

F. Jiang, Q. Xu, T. Drummond, and T. Cohn, “Boot and switch: Alter- nating distillation for zero-shot dense retrieval,” in The 2023 Conference on Empirical Methods in Natural Language Processing

work page 2023

[47] [47]

Making large language models a better foundation for dense retrieval,

C. Li, Z. Liu, S. Xiao, and Y . Shao, “Making large language models a better foundation for dense retrieval,” 2023

work page 2023

[48] [48]

Multi-passage bert: A globally normalized bert model for open-domain question answering,

Z. Wang, P. Ng, X. Ma, R. Nallapati, and B. Xiang, “Multi-passage bert: A globally normalized bert model for open-domain question answering,” arXiv preprint arXiv:1908.08167 , 2019

work page arXiv 1908

[49] [49]

Improving social book search using structure semantics, bibliographic descriptions and social metadata,

I. Ullah, S. Khusro, and I. Ahmad, “Improving social book search using structure semantics, bibliographic descriptions and social metadata,” Multimedia Tools and Applications, vol. 80, no. 4, pp. 5131–5172, 2021

work page 2021

[50] [50]

Reciprocal rank fusion outperforms condorcet and individual rank learning methods,

G. V . Cormack, C. L. Clarke, and S. Buettcher, “Reciprocal rank fusion outperforms condorcet and individual rank learning methods,” in Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval , 2009, pp. 758–759

work page 2009

[51] [51]

Information entropy, rough entropy and knowledge granulation in incomplete information systems,

J. Liang, Z. Shi, D. Li, and M. J. Wierman, “Information entropy, rough entropy and knowledge granulation in incomplete information systems,” International Journal of general systems , vol. 35, no. 6, pp. 641–654, 2006

work page 2006

[52] [52]

Chain-of-thought prompting elicits reasoning in large language models,

J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V . Le, D. Zhou et al. , “Chain-of-thought prompting elicits reasoning in large language models,” Advances in neural information processing systems , vol. 35, pp. 24 824–24 837, 2022

work page 2022

[53] [53]

Approximate nearest neighbor negative contrastive learning for dense text retrieval,

L. Xiong, C. Xiong, Y . Li, K.-F. Tang, J. Liu, P. Bennett, J. Ahmed, and A. Overwijk, “Approximate nearest neighbor negative contrastive learning for dense text retrieval,” arXiv preprint arXiv:2007.00808 , 2020

work page arXiv 2007

[54] [54]

Parameter-efficient prompt tuning makes generalized and calibrated neural text retrievers,

W. L. Tam, X. Liu, K. Ji, L. Xue, X. Zhang, Y . Dong, J. Liu, M. Hu, and J. Tang, “Parameter-efficient prompt tuning makes generalized and calibrated neural text retrievers,”arXiv preprint arXiv:2207.07087, 2022

work page arXiv 2022

[55] [55]

Self-knowledge guided retrieval augmentation for large language models,

Y . Wang, P. Li, M. Sun, and Y . Liu, “Self-knowledge guided retrieval augmentation for large language models,” in Findings of the Association for Computational Linguistics: EMNLP 2023 , 2023, pp. 10 303–10 315

work page 2023

[56] [56]

Self-rag: Learning to retrieve, generate, and critique through self-reflection,

A. Asai, Z. Wu, Y . Wang, A. Sil, and H. Hajishirzi, “Self-rag: Learning to retrieve, generate, and critique through self-reflection,” in The Twelfth International Conference on Learning Representations , 2023

work page 2023

[57] [57]

Query2doc: Query expansion with large language models,

L. Wang, N. Yang, and F. Wei, “Query2doc: Query expansion with large language models,” arXiv preprint arXiv:2303.07678 , 2023

work page arXiv 2023

[58] [58]

Rag-fusion: a new take on retrieval-augmented gener- ation,

Z. Rackauckas, “Rag-fusion: a new take on retrieval-augmented gener- ation,” arXiv preprint arXiv:2402.03367 , 2024

work page arXiv 2024