AI-assisted Protocol Information Extraction For Improved Accuracy and Efficiency in Clinical Trial Workflows

François Charest; Madison Wright; Ramtin Babaeipour

arXiv: 2602.00052 · v2 · submitted 2026-01-19 · cs.IR · cs.AI · cs.CL · cs.LG

Pith reviewed 2026-05-16 12:50 UTC · model grok-4.3

classification: cs.IR · cs.AI · cs.CL · cs.LG

keywords: clinical trial protocols · information extraction · retrieval-augmented generation · large language models · clinical research coordinators · AI-assisted workflows

The pith

Retrieval-augmented generation extracts clinical trial protocol information at 89% accuracy versus 62.6% for standalone LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates an AI system that combines generative large language models with retrieval-augmented generation to automatically extract structured information from complex clinical trial protocols. This RAG approach reaches 89.0% accuracy against expert-supported reference annotations, outperforming standalone LLMs with fine-tuned prompts at 62.6%. In simulated workflows, clinical research coordinators complete extraction tasks 40% faster with AI assistance, rate the work as less cognitively demanding, and strongly prefer the method. A sympathetic reader would care because rising protocol complexity and amendments create heavy burdens for trial teams around knowledge management, documentation quality, and compliance. The authors conclude that expert oversight remains essential while the results point toward scaling protocol intelligence through similar AI methods.

Core claim

The authors demonstrate that their clinical-trial-specific RAG process extracts protocol information with 89.0% accuracy compared to 62.6% for standalone LLMs, while AI-assisted extraction tasks in simulated CRC workflows are completed 40% faster, rated as less cognitively demanding, and preferred by users over manual methods.

What carries the argument

A retrieval-augmented generation (RAG) process tailored to clinical trial protocols: it retrieves the protocol sections relevant to a query and supplies them as context, grounding the generative LLM's output and improving its accuracy.
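The retrieve-then-ground idea can be illustrated with a minimal sketch. It is not the authors' pipeline: the bag-of-words `embed` function stands in for a real semantic embedding model, the in-memory list stands in for a vector database, and `chunk_text`, `retrieve`, `build_prompt`, and the sample protocol sentences are all hypothetical names and data chosen for illustration. The generation step (sending the prompt to an LLM) is deliberately left out.

```python
import math
import re
from collections import Counter


def chunk_text(text, max_words=40):
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text):
    """Toy embedding (assumption): lowercase bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(query, context_chunks):
    """Assemble a grounded prompt: retrieved excerpts first, question last."""
    context = "\n---\n".join(context_chunks)
    return (
        "Answer using only the protocol excerpts below.\n\n"
        f"{context}\n\nQuestion: {query}"
    )


# Hypothetical protocol excerpts, invented for this example.
PROTOCOL_SECTIONS = [
    "Inclusion criteria: adults aged 18 to 75 years with confirmed diagnosis.",
    "Exclusion criteria: pregnancy, or prior treatment with the study drug.",
    "Schedule of events: screening visit, baseline visit, week 4 visit, week 12 visit.",
]

if __name__ == "__main__":
    question = "What are the exclusion criteria?"
    print(build_prompt(question, retrieve(question, PROTOCOL_SECTIONS, k=1)))
```

Keeping retrieval separate from prompt assembly means the toy similarity function could be swapped for a real embedding model without touching the rest of the sketch.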

If this is right

Protocol content can be structured into standard formats more reliably to improve documentation quality and compliance support.
Clinical research coordinators can handle information extraction tasks more quickly and with lower cognitive load.
Integration of similar AI methodologies into real-world clinical workflows could enable protocol intelligence at scale.
Expert oversight would still be required even after AI assistance is introduced.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same RAG approach could be tested on other dense regulatory or medical documents beyond trial protocols to check for similar accuracy gains.
Connecting the extracted structured data directly to trial management platforms might produce efficiency gains in feasibility assessments that the current simulations do not measure.
Longitudinal use on evolving protocols could help teams track amendments systematically, an application left for future validation.

Load-bearing premise

That the simulated CRC workflows and expert reference annotations accurately reflect real-world clinical trial conditions, and that the chosen evaluation set is representative of typical protocol complexity and amendment patterns.

What would settle it

A head-to-head comparison of the RAG system's extractions against expert manual annotations on a fresh set of real, ongoing clinical trial protocols drawn from multiple sites and therapeutic areas.

Figures

Figures reproduced from arXiv: 2602.00052 by François Charest, Madison Wright, Ramtin Babaeipour.

**Figure 1.** RAG process for clinical protocol information extraction. The RAG system first processes protocol Portable Document Format (PDF) files subdividing them into meaningful chunks and storing them in a vector database with semantic embeddings. When users query for specific protocol information (e.g., inclusion/exclusion criteria), the system retrieves the most relevant chunks and provides them as context to a…

**Figure 2.** Comparison of task completion times between non-AI and AI-assisted protocol abstraction tasks (AI-assisted: RAG 2, gpt-4o-mini for generation, gpt4o for SoE). Box plot shows median completion time, interquartile range, and outliers (circles) for both conditions. Non-AI tasks required substantially longer completion times compared to AI-assisted tasks. This translates to an average time reduction of 47 min…

**Figure 3.** Distribution and comparison of item-weighted accuracy scores between AI-assisted and unassisted conditions. (A) Histogram showing count distribution of item-weighted scores, with AI-assisted scores (blue) and unassisted scores (gray) overlaid. (B) Box plot displaying median, quartiles, and range of scores for both conditions. (C) Count distribution by rounded score values on 0–5 scale, comparing frequenc…

Original abstract

Increasing clinical trial protocol complexity, amendments, and challenges around knowledge management create significant burden for trial teams. Structuring protocol content into standard formats has the potential to improve efficiency, support documentation quality, and strengthen compliance. We evaluate an Artificial Intelligence (AI) system using generative LLMs with Retrieval-Augmented Generation (RAG) for automated clinical trial protocol information extraction. We compare the extraction accuracy of our clinical-trial-specific RAG process against that of publicly available (standalone) LLMs. We also assess the operational impact of AI-assistance on simulated extraction Clinical Research Coordinator (CRC) workflows. Our RAG process shows higher extraction accuracy (89.0%) than standalone LLMs with fine-tuned prompts (62.6%) against expert-supported reference annotations. In simulated extraction workflows, AI-assisted tasks are completed 40% faster, are rated as less cognitively demanding and are strongly preferred by users. While expert oversight remains essential, this suggests that AI-assisted extraction can enable protocol intelligence at scale, motivating the integration of similar methodologies into real-world clinical workflows to further validate its impact on feasibility, study start-up, and post-activation monitoring.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

RAG lifts extraction accuracy to 89% and cuts workflow time 40% on clinical protocols, but the evaluation set and annotation process are described too thinly to support the broader claims.

The main takeaway is that a clinical-trial-tuned RAG setup beats off-the-shelf LLMs with prompts on protocol information extraction (89% vs 62.6%) and shows a 40% speed-up plus lower cognitive load in simulated CRC tasks. That is the concrete result worth noting.

The work applies existing RAG methods to a real operational pain point (structuring complex, amendment-prone protocols) and reports user preference data alongside the accuracy numbers. Those metrics are useful for anyone tracking AI deployment in trial start-up. The paper is straightforward about keeping expert oversight in the loop, which keeps the claims grounded.

The soft spot is exactly what the stress-test flagged: the test collection size, therapeutic-area mix, protocol length distribution, and amendment frequency are not characterized, and the reference annotations are only called “expert-supported” with no inter-annotator agreement, adjudication rules, or error breakdown. Without those details the 89% figure and the relative gain over baselines are hard to generalize. The simulated workflow is also narrow; it does not yet show impact on actual study timelines or downstream compliance.

This is the kind of applied paper that clinical informatics groups and trial operations teams would read for practical ideas. It is not a methods advance, but the empirical comparison is clear enough that a serious referee could usefully press on the evaluation gaps and ask for more representative data. I would send it to review rather than desk-reject; the numbers are specific and the use case matters, even if the current evidence base needs tightening.

Referee Report

3 major / 1 minor

Summary. The paper evaluates an AI system using generative LLMs with retrieval-augmented generation (RAG) for automated extraction of information from clinical trial protocols. It reports that the RAG approach achieves 89.0% extraction accuracy against expert-supported reference annotations, outperforming standalone LLMs with fine-tuned prompts at 62.6%, and that AI-assisted simulated workflows for clinical research coordinators complete tasks 40% faster while being rated as less cognitively demanding and strongly preferred by users.

Significance. If the reported accuracy and workflow gains hold under more rigorous evaluation with characterized datasets, the work could meaningfully reduce protocol management burden in clinical trials and support scalable 'protocol intelligence.' The direct empirical comparison between RAG and baseline LLMs is a strength, as is the inclusion of user preference and cognitive load metrics in the simulated CRC tasks.

major comments (3)

[Evaluation / Results] The manuscript provides no information on the size, therapeutic-area distribution, length, amendment frequency, or selection criteria of the clinical trial protocol collection used for evaluation, which is load-bearing for the generalizability of the 89.0% accuracy claim and the 40% speedup result.
[Methods] The creation of the 'expert-supported reference annotations' is described only at a high level with no reported inter-annotator agreement, adjudication protocol, or error taxonomy, leaving the accuracy comparisons (89.0% vs 62.6%) only partially supported.
[Results] No statistical tests, confidence intervals, or significance testing are mentioned for the accuracy or time-savings differences, which weakens the strength of the central performance claims.
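To make the third comment concrete, here is a minimal sketch of one way a confidence interval could be attached to an accuracy figure, using the Wilson score interval for a binomial proportion. The number of scored items is not given on this page, so the count of 1,000 items below is purely a placeholder assumption, and `wilson_interval` is a hypothetical helper name; the actual interval depends on the real sample size and on whether items are independent.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= successes <= total:
        raise ValueError("successes must lie between 0 and total")
    p = successes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half_width, center + half_width


# Placeholder assumption: 1,000 scored items per system (NOT a figure from the paper).
HYPOTHETICAL_ITEMS = 1000
rag_low, rag_high = wilson_interval(890, HYPOTHETICAL_ITEMS)    # 89.0% correct
base_low, base_high = wilson_interval(626, HYPOTHETICAL_ITEMS)  # 62.6% correct
```

With fewer items the intervals widen, which is why the referee's request for the dataset size and the request for intervals go together.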

minor comments (1)

[Abstract] The abstract would be strengthened by briefly stating dataset size and any statistical support for the headline numbers.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We agree with the identified gaps in dataset description, annotation details, and statistical reporting, and have revised the manuscript to address each point directly.

Referee: [Evaluation / Results] The manuscript provides no information on the size, therapeutic-area distribution, length, amendment frequency, or selection criteria of the clinical trial protocol collection used for evaluation, which is load-bearing for the generalizability of the 89.0% accuracy claim and the 40% speedup result.

Authors: We agree that these details are necessary to assess generalizability. In the revised manuscript we have added a dedicated Methods subsection describing the protocol collection, including its size, therapeutic-area distribution, average length, amendment frequency, and selection criteria from ClinicalTrials.gov. These additions support the scope of the reported accuracy and workflow results.

revision: yes

Referee: [Methods] The creation of the 'expert-supported reference annotations' is described only at a high level with no reported inter-annotator agreement, adjudication protocol, or error taxonomy, leaving the accuracy comparisons (89.0% vs 62.6%) only partially supported.

Authors: We acknowledge the high-level description in the original submission. The revised Methods section now provides a full account of the annotation process, including inter-annotator agreement, the adjudication protocol, and an error taxonomy. These additions strengthen the foundation for the accuracy comparisons.

revision: yes

Referee: [Results] No statistical tests, confidence intervals, or significance testing are mentioned for the accuracy or time-savings differences, which weakens the strength of the central performance claims.

Authors: We agree that statistical rigor is required. The revised Results section now includes appropriate statistical tests, 95% confidence intervals, and significance testing for both the accuracy and time-savings differences. This directly bolsters the central performance claims.

revision: yes
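The inter-annotator agreement discussed in the exchange above is often reported as Cohen's kappa for two annotators. The sketch below shows that calculation only as an illustration: the page does not say which agreement statistic the authors used, and the labels in the usage note are invented, not drawn from the paper's annotations.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labelled independently at their own rates.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout; treat as full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)
```

For example, with the invented labels `["ok", "ok", "err", "err"]` and `["ok", "ok", "err", "ok"]`, observed agreement is 0.75 and chance agreement is 0.5, so kappa is 0.5.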

Circularity Check

0 steps flagged

No circularity: empirical accuracy and workflow measurements rest on direct comparisons

The paper reports measured extraction accuracy (89.0% RAG vs 62.6% baseline) and simulated workflow speedups (40% faster) obtained by running the described RAG pipeline on an evaluation collection of protocols and comparing outputs to expert-supported reference annotations. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the load-bearing claims. The central results are produced by standard empirical evaluation against an external reference set rather than by any construction that reduces the reported numbers to the inputs by definition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review; no explicit free parameters, invented entities, or non-standard axioms are stated. The work assumes standard LLM prompting and retrieval capabilities function as described in the clinical domain.

axioms (1)

domain assumption: Expert annotations provide reliable ground truth for protocol information extraction.
Invoked when accuracy is measured against expert-supported reference annotations.

pith-pipeline@v0.9.0 · 5512 in / 1174 out tokens · 29491 ms · 2026-05-16T12:50:11.375606+00:00 · methodology

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · 1 internal anchor

[1] L. M. Friedman, C. D. Furberg, D. L. DeMets, Fundamentals of Clinical Trials, 5th Edition, Springer, 2015.

[2] C. T. Jones, P. M. Jester, M. Fitz-Gerald, Issues in research management: Protocol challenges in the era of complexity, Research Practitioner 14 (3) (2013) 122–127.

[3] F. Varse, L. Janani, Y. Moradi, M. Solaymani-Dodaran, H. R. Baradaran, S. Rimaz, Challenges in the design, conduction, analysis, and reporting of randomized clinical trial studies: A systematic review, Medical Journal of the Islamic Republic of Iran 33 (1) (2019) 37.

[4] K. A. Getz, et al., The impact of protocol amendments on clinical trial performance and cost, Therapeutic Innovation & Regulatory Science 52 (5) (2018) 577–586.

[5] D. Gryaznov, et al., Reporting quality of clinical trial protocols: a repeated cross-sectional study about the adherence to SPIRIT recommendations in Switzerland, Canada and Germany (ASPIRE-SCAGE), BMJ Open 12 (2022) e053417.

[6] K. A. Getz, et al., New benchmarks on protocol amendment practices, trends and their impact on clinical trial performance, Therapeutic Innovation & Regulatory Science 58 (3) (2024) 539–548.

[7] S. Datta, et al., AutoCriteria: a generalizable clinical trial eligibility criteria extraction system powered by large language models, Journal of the American Medical Informatics Association 31 (2) (2024) 375–385.

[8] M. Kramer, Extraction of schedules of activities tables from clinical trial protocols, https://github.com/markkramerus/publications/blob/main/2-Extraction%20of%20SoA%20Tables%20from%20PDFs.pdf (2025).

[9] M. Kargren, J. April, G. Clark, J. Mackinnon, A. Nathoo, E. Theron, Unlocking new efficiencies: How structured content authoring is streamlining the production of clinical documents for the pharmaceutical industry, Medical Writing 32 (3) (2023) 32–37.

[10] T. Georgieff, Navigating toward a digital clinical trial protocol, Applied Clinical Trials 32 (12) (2023).

[11] A. Vadakin, R. D. Kush, CDISC standards and innovations, Clinical Evaluation 40 (Suppl. 31) (2012) 217–228.

[12] A.-W. Chan, I. Boutron, S. Hopewell, D. Moher, K. Schulz, et al., SPIRIT 2025 statement: updated guideline for protocols of randomised trials, BMJ 389 (2025) e081477. URL https://dx.doi.org/10.1136/bmj-2024-081477

[13] M. Maleki, S. A. Ghahari, Clinical trials protocol authoring using LLMs, arXiv, https://arxiv.org/html/2404.05044v2 (2024).

[14] R. Babaeipour, F. Charest, M. Wright, AI-assisted protocol complexity estimation for improved clinical trial workflows, in preparation.

[15] X. Liu, et al., Clinical trial information extraction with BERT, in: IEEE 9th International Conference on Healthcare Informatics (ICHI), 2021, pp. 505–506.

[16] Snorkel AI, Augmenting the clinical trial design information extraction, Blog, https://snorkel.ai/blog/augmenting-the-clinical-trial-design-information-extraction/ (2022).

[17] I. C. Wiest, et al., A software pipeline for medical information extraction with large language models, open source and suitable for oncology, npj Precision Oncology 9 (2025) 313.

[18] P. Hosseini, I. Castro, I. Ghinassi, M. Purver, Efficient solutions for an intriguing failure of LLMs: Long context window does not mean LLMs can analyze long sequences flawlessly, in: O. Rambow, L. Wanner, M. Apidianaki, H. Al-Khalifa, B. D. Eugenio, S. Schockaert (Eds.), Proceedings of the 31st International Conference on Computational Linguistics, Asso...

[19] N. F. Liu, et al., Lost in the middle: How language models use long contexts, Transactions of the Association for Computational Linguistics 12 (2024) 157–173.

[20] P. Lewis, et al., Retrieval-augmented generation for knowledge-intensive NLP tasks, in: Advances in Neural Information Processing Systems (NeurIPS), 2020, pp. 9459–9474.

[21] P. Rajpurkar, et al., AI in health and medicine, Nature Medicine 28 (1) (2022) 31–38.

[22] Journal of Society for Clinical Data Management, Representing clinical study schedule of activities as FHIR resources: Required characteristic attributes, https://www.jscdm.org/article/id/266/ (2025).

[23] D. Ferrés, H. Saggion, F. Ronzano, À. Bravo, PDFdigest: an adaptable layout-aware PDF-to-XML textual content extractor for scientific articles, in: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018), Miyazaki, Japan, 2018, pp. 1896–1901.

[24] X. Zhong, E. ShafieiBavani, A. Jimeno Yepes, Image-based table recognition: Data, model, and evaluation, in: Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXI, Springer-Verlag, Berlin, Heidelberg, 2020, pp. 564–580. doi:10.1007/978-3-030-58589-1_34. URL https://doi.org/10.1007/978-3-030-5...

[25] J. Gu, X. Jiang, Z. Shi, H. Tan, X. Zhai, C. Xu, W. Li, Y. Shen, S. Ma, H. Liu, S. Wang, K. Zhang, Z. Lin, B. Zhang, L. Ni, W. Gao, Y. Wang, J. Guo, A survey on LLM-as-a-judge, The Innovation (2026) 101253. doi:10.1016/j.xinn.2025.101253. URL https://www.sciencedirect.com/science/article/pii/S2666675825004564

[26] E. Croxford, Y. Gao, E. First, N. Pellegrino, M. Schnier, J. Caskey, et al., Automating evaluation of AI text generation in healthcare with a large language model (LLM)-as-a-judge, medRxiv [Preprint] (2025). doi:10.1101/2025.04.22.25326219. URL https://pmc.ncbi.nlm.nih.gov/articles/PMC12045442/

[27] M. Yuan, J. Chen, Z. Xing, G. Mohammadi, A. Quigley, A case study of scalable content annotation using multi-LLM consensus and human review, arXiv (2025). arXiv:2503.17620. URL https://arxiv.org/pdf/2503.17620

[28] X. Wang, H. Kim, S. Rahman, K. Mitra, Z. Miao, Human-LLM collaborative annotation through effective verification of LLM labels, in: Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems (CHI ’24), Association for Computing Machinery, New York, NY, USA, 2024, article 303, pp. 1–21.

[29] P. Thomas, S. Spielman, N. Craswell, B. Mitra, Large language models can accurately predict searcher preferences, in: Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’24, Association for Computing Machinery, New York, NY, USA, 2024, pp. 1930–1940. ISBN 9798400704314. doi:10.1145/3626772.3657707. URL https:...

[30] National Library of Medicine, ClinicalTrials.gov, https://clinicaltrials.gov/, n.d.

[31] Cohere, Cohere embeddings, July 2025 (2025).

[32] H. Chase, et al., LangChain: Building applications with LLMs through composability, GitHub, https://github.com/langchain-ai/langchain (2022).

[33] X. Wang, Z. Wang, X. Gao, F. Zhang, Y. Wu, Z. Xu, T. Shi, Z. Wang, S. Li, Q. Qian, R. Yin, C. Lv, X. Zheng, X. Huang, Searching for best practices in retrieval-augmented generation, in: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024, pp. 17753–17788. URL https://aclanthology.org/2024.emnlp-main.981/

[34] S. Li, L. Stenzel, C. Eickhoff, S. A. Bahrainian, Enhancing retrieval-augmented generation: A study of best practices, in: Proceedings of the 31st International Conference on Computational Linguistics (COLING), 2025, pp. 6682–6698. URL https://aclanthology.org/2025.coling-main.449/

[35] OpenAI, GPT-4o, July 2025 (2025).

[36] B. Smock, R. Pesala, R. Abraham, PubTables-1M: Towards comprehensive table extraction from unstructured documents, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 4634–4642.

[37] L. M. Schulze Buschoff, E. Akata, M. Bethge, et al., Visual cognition in multimodal large language models, Nature Machine Intelligence 7 (2025) 96–106. doi:10.1038/s42256-024-00963-y

[38] G. Starace, J. Wijk, Y. Tang, S. Pearce, J. Miller, R. Weinstein, et al., PaperBench: Evaluating AI’s ability to replicate AI research, arXiv (2025). arXiv:2504.01848. URL https://arxiv.org/abs/2504.01848

[39] L. Aroyo, C. Welty, Truth is a lie: Crowd truth and the seven myths of human annotation, AI Magazine 36 (1) (2015) 15–24. doi:10.1609/aimag.v36i1.2564. URL https://ojs.aaai.org/aimagazine/index.php/aimagazine/article/view/2564

[40] F. Yu, N. Seedat, D. Herrmannova, F. Schilder, J. R. Schwarz, Beyond pointwise scores: Decomposed criteria-based evaluation of LLM responses, in: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track, 2025, pp. 1931–1954. doi:10.18653/v1/2025.emnlp-industry.136. URL https://aclanthology.org/2025.emnlp-...

[41] N. Markey, I. El-Mansouri, G. Rensonnet, C. van Langen, C. Meier, From RAGs to riches: Utilizing large language models to write documents for clinical trials, Clinical Trials 22 (5) (2025) 626–631. doi:10.1177/17407745251320806

[42] S. Waikar, A. G. Bhat, Retrieval augmented generation (RAG) for evaluating regulatory compliance of drug information and clinical trial protocols, CPT: Pharmacometrics & Systems Pharmacology (2026). doi:10.1002/psp4.70201

[43] Y. Artsi, V. Sorin, B. S. Glicksberg, P. Korfiatis, G. N. Nadkarni, E. Klang, Large language models in real-world clinical workflows: a systematic review of applications and implementation, Frontiers in Digital Health 7 (2025) 1659134. doi:10.3389/fdgth.2025.1659134

[44] A. Badani, F. Y. de Moraes, P. Vollmuth, et al., AI and innovation in clinical trials, npj Digital Medicine 8 (2025) 683. doi:10.1038/s41746-025-02048-5

A Appendices

A.1 Protocol documents selection

Starting from identifiable protocol documents (using the “Study Documents” field), the following filtering logic is applied: Listing 1: Filtering logic for...

[1] [1]

L. M. Friedman, C. D. Furberg, D. L. DeMets, Fundamentals of Clinical Trials, 5th Edition, Springer, 2015

work page 2015

[2] [2]

C. T. Jones, P. M. Jester, M. Fitz-Gerald, Issues in research management: Protocol challenges in the era of complexity, Research Practitioner 14 (3) (2013) 122–127

work page 2013

[3] [3]

Varse, L

F. Varse, L. Janani, Y. Moradi, M. Solaymani-Dodaran, H. R. Baradaran, S. Rimaz, Challenges in the design, conduction, analysis, and reporting of randomized clinical trial studies: A systematic review, Medical Journal of the Islamic Republic of Iran 33 (1) (2019) 37

work page 2019

[4] [4]

K. A. Getz, et al., The impact of protocol amendments on clinical trial performance and cost, Therapeutic Innovation & Regulatory Science 52 (5) (2018) 577–586

work page 2018

[5] [5]

D. Gryaznov, et al., Reporting quality of clinical trial protocols: a re- peated cross-sectional study about the adherence to spirit recommenda- 29 tions in switzerland, canada and germany (aspire-scage), BMJ Open 12 (2022) e053417

work page 2022

[6] [6]

K. A. Getz, et al., New benchmarks on protocol amendment practices, trends and their impact on clinical trial performance, Therapeutic Innova- tion & Regulatory Science 58 (3) (2024) 539–548

work page 2024

[7] [7]

S. Datta, et al., Autocriteria: a generalizable clinical trial eligibility cri- teria extraction system powered by large language models, Journal of the American Medical Informatics Association 31 (2) (2024) 375–385

work page 2024

[8] [8]

M. Kramer, Extraction of schedules of activities tables from clinical trial protocols, https://github.com/markkramerus/publications/blob/ma in/2-Extraction%20of%20SoA%20Tables%20from%20PDFs.pdf (2025)

work page 2025

[9] [9]

Kargren, J

M. Kargren, J. April, G. Clark, J. Mackinnon, A. Nathoo, E. Theron, Un- locking new eﬃciencies: How structured content authoring is streamlining the production of clinical documents for the pharmaceutical industry, Med- ical Writing 32 (3) (2023) 32–37

work page 2023

[10] [10]

Georgieﬀ, Navigating toward a digital clinical trial protocol, Applied Clinical Trials 32 (12) (2023)

T. Georgieﬀ, Navigating toward a digital clinical trial protocol, Applied Clinical Trials 32 (12) (2023)

work page 2023

[11] [11]

Vadakin, R

A. Vadakin, R. D. Kush, Cdisc standards and innovations, Clinical Evalu- ation 40 (Suppl. 31) (2012) 217–228

work page 2012

[12] [12]

A.-W. Chan, I. Boutron, S. Hopewell, D. Moher, K. Schulz, et al., Spirit 2025 statement: updated guideline for protocols of randomised trials , BMJ 389 (2025) e081477. URL https://dx.doi.org/10.1136/bmj-2024-081477

work page doi:10.1136/bmj-2024-081477 2025

[13] [13]

Maleki, S

M. Maleki, S. A. Ghahari, Clinical trials protocol authoring using llms, arXiv, https://arxiv.org/html/2404.05044v2 (2024)

work page arXiv 2024

[14] [14]

Babaeipour, F

R. Babaeipour, F. Charest, M. Wright, Ai-assisted protocol complexity estimation for improved clinical trial workﬂowsIn preparation

work page

[15] [15]

Liu, et al., Clinical trial information extraction with bert, in: IEEE 9th International Conference on Healthcare Informatics (ICHI), 2021, pp

X. Liu, et al., Clinical trial information extraction with bert, in: IEEE 9th International Conference on Healthcare Informatics (ICHI), 2021, pp. 505–506. 30

work page 2021

[16] [16]

Snorkel AI, Augmenting the clinical trial design information extraction, Blog, https://snorkel.ai/blog/augmenting-the-clinical-trial-d esign-information-extraction/ (2022)

work page 2022

[17] [17]

I. C. Wiest, et al., A software pipeline for medical information extraction with large language models, open source and suitable for oncology, npj Precision Oncology 9 (2025) 313

work page 2025

[18] [18]

Hosseini, I

P. Hosseini, I. Castro, I. Ghinassi, M. Purver, Eﬃcient solutions for an intriguing failure of LLMs: Long context window does not mean LLMs can analyze long sequences ﬂawlessly , in: O. Rambow, L. Wanner, M. Apidi- anaki, H. Al-Khalifa, B. D. Eugenio, S. Schockaert (Eds.), Proceedings of the 31st International Conference on Computational Linguistics, Asso...

work page 2025

[19] [19]

N. F. Liu, et al., Lost in the middle: How language models use long contexts, Transactions of the Association for Computational Linguistics 12 (2024) 157–173

work page 2024

[20] [20]

Lewis, et al., Retrieval-augmented generation for knowledge-intensive nlp tasks, in: Advances in Neural Information Processing Systems (NeurIPS), 2020, pp

P. Lewis, et al., Retrieval-augmented generation for knowledge-intensive nlp tasks, in: Advances in Neural Information Processing Systems (NeurIPS), 2020, pp. 9459–9474

work page 2020

[21] [21]

Rajpurkar, et al., Ai in health and medicine, Nature Medicine 28 (1) (2022) 31–38

P. Rajpurkar, et al., Ai in health and medicine, Nature Medicine 28 (1) (2022) 31–38

work page 2022

[22] [22]

Journal of Society for Clinical Data Management, Representing clinical study schedule of activities as fhir resources: Required characteristic at- tributes, https://www.jscdm.org/article/id/266/ (2025)

work page 2025

[23] [23]

Ferrés, H

D. Ferrés, H. Saggion, F. Ronzano, À. Bravo, Pdfdigest: an adaptable layout-aware pdf-to-xml textual content extractor for scientiﬁc articles, in: Proceedings of the Eleventh International Conference on Language Re- sources and Evaluation (LREC-2018), Miyazaki, Japan, 2018, pp. 1896– 1901

work page 2018

[24] [24]

Zhong, E

X. Zhong, E. ShaﬁeiBavani, A. Jimeno Yepes, Image-based table recogni- tion: Data, model, and evaluation , in: Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceed- ings, Part XXI, Springer-Verlag, Berlin, Heidelberg, 2020, pp. 564–580. 31 doi:10.1007/978-3-030-58589-1_34 . URL https://doi.org/10.1007/978-3-030-5...

work page doi:10.1007/978-3-030-58589-1_34 2020

[25] [25]

J. Gu, X. Jiang, Z. Shi, H. Tan, X. Zhai, C. Xu, W. Li, Y. Shen, S. Ma, H. Liu, S. Wang, K. Zhang, Z. Lin, B. Zhang, L. Ni, W. Gao, Y. Wang, J. Guo, A survey on llm-as-a-judge , The Innovation (2026) 101253 doi: https://doi.org/10.1016/j.xinn.2025.101253. URL https://www.sciencedirect.com/science/article/pii/S26666 75825004564

[26]

E. Croxford, Y. Gao, E. First, N. Pellegrino, M. Schnier, J. Caskey, et al., Automating evaluation of ai text generation in healthcare with a large language model (llm)-as-a-judge, medRxiv [Preprint] (2025). doi:10.1101/2025.04.22.25326219. URL https://pmc.ncbi.nlm.nih.gov/articles/PMC12045442/

[27]

M. Yuan, J. Chen, Z. Xing, G. Mohammadi, A. Quigley, A case study of scalable content annotation using multi-llm consensus and human review, arXiv (2025). arXiv:2503.17620. URL https://arxiv.org/pdf/2503.17620

[28]

X. Wang, H. Kim, S. Rahman, K. Mitra, Z. Miao, Human-llm collaborative annotation through effective verification of llm labels, in: Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems (CHI ’24), Association for Computing Machinery, New York, NY, USA, 2024, article 303, pp. 1–21

[29]

ISBN 9798400704314

P. Thomas, S. Spielman, N. Craswell, B. Mitra, Large language models can accurately predict searcher preferences, in: Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’24, Association for Computing Machinery, New York, NY, USA, 2024, pp. 1930–1940. doi:10.1145/3626772.3657707. URL https:...

[30]

National Library of Medicine, ClinicalTrials.gov, https://clinicaltrials.gov/, n.d.

[31]

Cohere, Cohere embeddings, July 2025 (2025).

[32]

H. Chase, et al., Langchain: Building applications with llms through composability, GitHub, https://github.com/langchain-ai/langchain (2022)

[33]

X. Wang, Z. Wang, X. Gao, F. Zhang, Y. Wu, Z. Xu, T. Shi, Z. Wang, S. Li, Q. Qian, R. Yin, C. Lv, X. Zheng, X. Huang, Searching for best practices in retrieval-augmented generation, in: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024, pp. 17753–17788. URL https://aclanthology.org/2024.emnlp-main.981/

[34]

S. Li, L. Stenzel, C. Eickhoff, S. A. Bahrainian, Enhancing retrieval-augmented generation: A study of best practices, in: Proceedings of the 31st International Conference on Computational Linguistics (COLING), 2025, pp. 6682–6698. URL https://aclanthology.org/2025.coling-main.449/

[35]

OpenAI, GPT-4o, July 2025 (2025)

[36]

B. Smock, R. Pesala, R. Abraham, Pubtables-1m: Towards comprehensive table extraction from unstructured documents, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 4634–4642

[37]

L. M. Schulze Buschoff, E. Akata, M. Bethge, et al., Visual cognition in multimodal large language models, Nature Machine Intelligence 7 (2025) 96–106. doi:10.1038/s42256-024-00963-y

[38]

G. Starace, J. Wijk, Y. Tang, S. Pearce, J. Miller, R. Weinstein, et al., PaperBench: Evaluating AI’s ability to replicate AI research, arXiv (2025). arXiv:2504.01848. URL https://arxiv.org/abs/2504.01848

[39]

L. Aroyo, C. Welty, Truth is a lie: Crowd truth and the seven myths of human annotation, AI Magazine 36 (1) (2015) 15–24. doi:10.1609/aimag.v36i1.2564. URL https://ojs.aaai.org/aimagazine/index.php/aimagazine/article/view/2564

[40]

F. Yu, N. Seedat, D. Herrmannova, F. Schilder, J. R. Schwarz, Beyond pointwise scores: Decomposed criteria-based evaluation of llm responses, in: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track, 2025, pp. 1931–1954. doi:10.18653/v1/2025.emnlp-industry.136. URL https://aclanthology.org/2025.emnlp-...

[41]

N. Markey, I. El-Mansouri, G. Rensonnet, C. van Langen, C. Meier, From RAGs to riches: Utilizing large language models to write documents for clinical trials, Clinical Trials 22 (5) (2025) 626–631. doi:10.1177/17407745251320806

[42]

S. Waikar, A. G. Bhat, Retrieval augmented generation (RAG) for evaluating regulatory compliance of drug information and clinical trial protocols, CPT: Pharmacometrics & Systems Pharmacology (2026). doi:10.1002/psp4.70201

[43]

Y. Artsi, V. Sorin, B. S. Glicksberg, P. Korfiatis, G. N. Nadkarni, E. Klang, Large language models in real-world clinical workflows: a systematic review of applications and implementation, Frontiers in Digital Health 7 (2025) 1659134. doi:10.3389/fdgth.2025.1659134

[44]

A. Badani, F. Y. de Moraes, P. Vollmuth, et al., AI and innovation in clinical trials, npj Digital Medicine 8 (2025) 683. doi:10.1038/s41746-025-02048-5

A Appendices

A.1 Protocol documents selection

Starting from identifiable protocol documents (using the “Study Documents” field), the following filtering logic is applied:

Listing 1: Filtering logic for...
