AI-assisted Protocol Information Extraction For Improved Accuracy and Efficiency in Clinical Trial Workflows
Pith reviewed 2026-05-16 12:50 UTC · model grok-4.3
The pith
Retrieval-augmented generation extracts clinical trial protocol information at 89.0% accuracy versus 62.6% for standalone LLMs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors demonstrate that their clinical-trial-specific RAG process extracts protocol information with 89.0% accuracy compared to 62.6% for standalone LLMs, while AI-assisted extraction tasks in simulated CRC workflows are completed 40% faster, rated as less cognitively demanding, and preferred by users over manual methods.
What carries the argument
A retrieval-augmented generation (RAG) process tailored to clinical trial protocols, which retrieves the relevant protocol sections and uses them to ground the generative LLM's output and improve its accuracy.
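The retrieve-then-ground pattern described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' pipeline: the section texts are invented, the bag-of-words cosine ranking stands in for whatever embedding model the paper uses, and the function names (`retrieve`, `build_prompt`) are hypothetical. The sketch stops at prompt assembly and does not call any LLM.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, sections, k=2):
    """Return the k section titles whose text best matches the question."""
    q = Counter(tokenize(question))
    ranked = sorted(
        sections,
        key=lambda title: cosine(q, Counter(tokenize(sections[title]))),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, sections, k=2):
    """Assemble a prompt that restricts the model to the retrieved text."""
    titles = retrieve(question, sections, k)
    context = "\n\n".join(f"[{t}]\n{sections[t]}" for t in titles)
    return (
        "Answer using only the protocol excerpts below. "
        "Reply 'not stated' if the answer is absent.\n\n"
        f"{context}\n\nQuestion: {question}"
    )


# Hypothetical protocol sections, invented purely for illustration.
SECTIONS = {
    "Eligibility": "Participants must be adults aged 18 years or older with confirmed diagnosis.",
    "Schedule of Activities": "Visits occur at screening, week 4, week 12 and week 24.",
    "Statistical Analysis": "The primary endpoint is analysed with a mixed model.",
}

prompt = build_prompt("At which week do study visits occur?", SECTIONS, k=1)
```

The point of the design is that the generative model only sees the retrieved excerpts, so an answer can be traced back to a specific protocol section; a standalone LLM has no such anchor.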
If this is right
- Protocol content can be structured into standard formats more reliably to improve documentation quality and compliance support.
- Clinical research coordinators can handle information extraction tasks more quickly and with lower cognitive load.
- Integration of similar AI methodologies into real-world clinical workflows could enable protocol intelligence at scale.
- Expert oversight would still be required even after AI assistance is introduced.
Where Pith is reading between the lines
- The same RAG approach could be tested on other dense regulatory or medical documents beyond trial protocols to check for similar accuracy gains.
- Connecting the extracted structured data directly to trial management platforms might produce efficiency gains in feasibility assessments that the current simulations do not measure.
- Longitudinal use on evolving protocols could help teams track amendments systematically, an application left for future validation.
Load-bearing premise
That the simulated CRC workflows and expert reference annotations accurately reflect real-world clinical trial conditions, and that the chosen evaluation set is representative of typical protocol complexity and amendment patterns.
What would settle it
A head-to-head comparison of the RAG system's extractions against expert manual annotations on a fresh set of real, ongoing clinical trial protocols drawn from multiple sites and therapeutic areas.
Original abstract
Increasing clinical trial protocol complexity, amendments, and challenges around knowledge management create significant burden for trial teams. Structuring protocol content into standard formats has the potential to improve efficiency, support documentation quality, and strengthen compliance. We evaluate an Artificial Intelligence (AI) system using generative LLMs with Retrieval-Augmented Generation (RAG) for automated clinical trial protocol information extraction. We compare the extraction accuracy of our clinical-trial-specific RAG process against that of publicly available (standalone) LLMs. We also assess the operational impact of AI-assistance on simulated extraction Clinical Research Coordinator (CRC) workflows. Our RAG process shows higher extraction accuracy (89.0%) than standalone LLMs with fine-tuned prompts (62.6%) against expert-supported reference annotations. In simulated extraction workflows, AI-assisted tasks are completed 40% faster, are rated as less cognitively demanding and are strongly preferred by users. While expert oversight remains essential, this suggests that AI-assisted extraction can enable protocol intelligence at scale, motivating the integration of similar methodologies into real-world clinical workflows to further validate its impact on feasibility, study start-up, and post-activation monitoring.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper evaluates an AI system using generative LLMs with retrieval-augmented generation (RAG) for automated extraction of information from clinical trial protocols. It reports that the RAG approach achieves 89.0% extraction accuracy against expert-supported reference annotations, outperforming standalone LLMs with fine-tuned prompts at 62.6%, and that AI-assisted simulated workflows for clinical research coordinators complete tasks 40% faster while being rated as less cognitively demanding and strongly preferred by users.
Significance. If the reported accuracy and workflow gains hold under more rigorous evaluation with characterized datasets, the work could meaningfully reduce protocol management burden in clinical trials and support scalable 'protocol intelligence.' The direct empirical comparison between RAG and baseline LLMs is a strength, as is the inclusion of user preference and cognitive load metrics in the simulated CRC tasks.
major comments (3)
- [Evaluation / Results] The manuscript provides no information on the size, therapeutic-area distribution, length, amendment frequency, or selection criteria of the clinical trial protocol collection used for evaluation, which is load-bearing for the generalizability of the 89.0% accuracy claim and the 40% speedup result.
- [Methods] The creation of the 'expert-supported reference annotations' is described only at a high level with no reported inter-annotator agreement, adjudication protocol, or error taxonomy, leaving the accuracy comparisons (89.0% vs 62.6%) only partially supported.
- [Results] No statistical tests, confidence intervals, or significance testing are mentioned for the accuracy or time-savings differences, which weakens the strength of the central performance claims.
minor comments (1)
- [Abstract] The abstract would be strengthened by briefly stating dataset size and any statistical support for the headline numbers.
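The inter-annotator agreement requested in the second major comment is often summarised with Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is a generic illustration, not the paper's annotation procedure: the two label lists are invented, and the paper does not state which agreement statistic it uses.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label rates.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Invented judgements on whether an extracted field matches the protocol.
annotator_1 = ["correct", "correct", "incorrect", "correct", "incorrect", "correct"]
annotator_2 = ["correct", "incorrect", "incorrect", "correct", "incorrect", "correct"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

In practice a library routine such as scikit-learn's `cohen_kappa_score` would normally be preferred; the hand-written version is shown only to make the chance correction explicit.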
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We agree with the identified gaps in dataset description, annotation details, and statistical reporting, and have revised the manuscript to address each point directly.
Point-by-point responses
- Referee: [Evaluation / Results] The manuscript provides no information on the size, therapeutic-area distribution, length, amendment frequency, or selection criteria of the clinical trial protocol collection used for evaluation, which is load-bearing for the generalizability of the 89.0% accuracy claim and the 40% speedup result.
  Authors: We agree that these details are necessary to assess generalizability. In the revised manuscript we have added a dedicated Methods subsection describing the protocol collection, including its size, therapeutic-area distribution, average length, amendment frequency, and selection criteria from ClinicalTrials.gov. These additions support the scope of the reported accuracy and workflow results.
  Revision: yes
- Referee: [Methods] The creation of the 'expert-supported reference annotations' is described only at a high level with no reported inter-annotator agreement, adjudication protocol, or error taxonomy, leaving the accuracy comparisons (89.0% vs 62.6%) only partially supported.
  Authors: We acknowledge the high-level description in the original submission. The revised Methods section now provides a full account of the annotation process, including inter-annotator agreement, the adjudication protocol, and an error taxonomy. These additions strengthen the foundation for the accuracy comparisons.
  Revision: yes
- Referee: [Results] No statistical tests, confidence intervals, or significance testing are mentioned for the accuracy or time-savings differences, which weakens the strength of the central performance claims.
  Authors: We agree that statistical rigor is required. The revised Results section now includes appropriate statistical tests, 95% confidence intervals, and significance testing for both the accuracy and time-savings differences. This directly bolsters the central performance claims.
  Revision: yes
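To make the confidence-interval point concrete, the sketch below computes a Wilson score interval for each headline accuracy. It is an illustration only: the number of evaluated items is not given in the material reviewed here, so `N_ITEMS = 500` is a loudly hypothetical placeholder, and the Wilson method is one reasonable choice rather than the test the authors report using.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# HYPOTHETICAL sample size: the real number of evaluated items is unknown.
N_ITEMS = 500

# Reported accuracies (89.0% RAG, 62.6% standalone) turned into counts.
rag_low, rag_high = wilson_interval(round(0.890 * N_ITEMS), N_ITEMS)
base_low, base_high = wilson_interval(round(0.626 * N_ITEMS), N_ITEMS)

# True when the two intervals share any value.
intervals_overlap = rag_low <= base_high and base_low <= rag_high
```

The width of each interval depends entirely on the assumed `N_ITEMS`, which is why the referee's request for the dataset size and the request for confidence intervals go together. If both systems were scored on the same items, a paired test such as McNemar's would be a more appropriate comparison than two independent intervals.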
Circularity Check
No circularity: empirical accuracy and workflow measurements rest on direct comparisons
The paper reports measured extraction accuracy (89.0% RAG vs 62.6% baseline) and simulated workflow speedups (40% faster) obtained by running the described RAG pipeline on an evaluation collection of protocols and comparing outputs to expert-supported reference annotations. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the load-bearing claims. The central results are produced by standard empirical evaluation against an external reference set rather than by any construction that reduces the reported numbers to the inputs by definition.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Expert annotations provide reliable ground truth for protocol information extraction.
A Appendices
A.1 Protocol documents selection
Starting from identifiable protocol documents (using the “Study Documents” field), the following filtering logic is applied:
Listing 1: Filtering logic for...