arXiv: 2602.00052 · v2 · submitted 2026-01-19 · 💻 cs.IR · cs.AI · cs.CL · cs.LG

AI-assisted Protocol Information Extraction For Improved Accuracy and Efficiency in Clinical Trial Workflows

Pith reviewed 2026-05-16 12:50 UTC · model grok-4.3

classification: 💻 cs.IR · cs.AI · cs.CL · cs.LG
keywords: clinical trial protocols · information extraction · retrieval-augmented generation · large language models · clinical research coordinators · AI-assisted workflows

The pith

Retrieval-augmented generation extracts clinical trial protocol information at 89% accuracy versus 62.6% for standalone LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates an AI system that combines generative large language models with retrieval-augmented generation to automatically extract structured information from complex clinical trial protocols. This RAG approach reaches 89.0% accuracy against expert-supported reference annotations, outperforming standalone LLMs with fine-tuned prompts at 62.6%. In simulated workflows, clinical research coordinators complete extraction tasks 40% faster with AI assistance, rate the work as less cognitively demanding, and strongly prefer the method. A sympathetic reader would care because rising protocol complexity and amendments create heavy burdens for trial teams around knowledge management, documentation quality, and compliance. The authors conclude that expert oversight remains essential while the results point toward scaling protocol intelligence through similar AI methods.

Core claim

The authors demonstrate that their clinical-trial-specific RAG process extracts protocol information with 89.0% accuracy compared to 62.6% for standalone LLMs, while AI-assisted extraction tasks in simulated CRC workflows are completed 40% faster, rated as less cognitively demanding, and preferred by users over manual methods.

What carries the argument

A retrieval-augmented generation (RAG) process tailored to clinical trial protocols: it retrieves the protocol sections relevant to a query and supplies them as context, so that the generative LLM's output is grounded in the source text.
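
The retrieve-then-ground step can be illustrated with a minimal sketch. It is not the authors' pipeline: the word-count `embed` function is a toy stand-in for the semantic embeddings and vector database the paper describes, the protocol text is invented, and every function name here (`chunk_protocol`, `retrieve`, `build_prompt`) is hypothetical. The final prompt is only printed; no LLM is called.

```python
import math
import re
from collections import Counter

# Hypothetical protocol excerpt; real input would be text parsed from a PDF.
PROTOCOL_TEXT = "\n\n".join([
    "Inclusion criteria: adults aged 18 to 65 years with a confirmed diagnosis.",
    "Exclusion criteria: pregnancy, or participation in another trial within 30 days.",
    "Schedule of events: screening visit, baseline visit, and follow-up at week 12.",
])


def chunk_protocol(text):
    """Split the document into chunks at blank lines (one chunk per paragraph)."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def embed(text):
    """Toy stand-in for a semantic embedding: a bag of lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query, best first."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_prompt(query, context_chunks):
    """Assemble the grounded prompt that would be sent to a generative LLM."""
    context = "\n---\n".join(context_chunks)
    return (
        "Answer using only the protocol excerpts below.\n\n"
        f"{context}\n\nQuestion: {query}"
    )


chunks = chunk_protocol(PROTOCOL_TEXT)
question = "What are the inclusion criteria?"
print(build_prompt(question, retrieve(question, chunks, k=1)))
```

Restricting the prompt to retrieved excerpts is what distinguishes this setup from the standalone-LLM baseline, which has to work from the whole document or from its own prior knowledge.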

If this is right

  • Protocol content can be structured into standard formats more reliably to improve documentation quality and compliance support.
  • Clinical research coordinators can handle information extraction tasks more quickly and with lower cognitive load.
  • Integration of similar AI methodologies into real-world clinical workflows could enable protocol intelligence at scale.
  • Expert oversight would still be required even after AI assistance is introduced.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same RAG approach could be tested on other dense regulatory or medical documents beyond trial protocols to check for similar accuracy gains.
  • Connecting the extracted structured data directly to trial management platforms might produce efficiency gains in feasibility assessments that the current simulations do not measure.
  • Longitudinal use on evolving protocols could help teams track amendments systematically, an application left for future validation.

Load-bearing premise

That the simulated CRC workflows and expert reference annotations accurately reflect real-world clinical trial conditions, and that the chosen evaluation set is representative of typical protocol complexity and amendment patterns.

What would settle it

A head-to-head comparison of the RAG system's extractions against expert manual annotations on a fresh set of real, ongoing clinical trial protocols drawn from multiple sites and therapeutic areas.

Figures

Figures reproduced from arXiv: 2602.00052 by François Charest, Madison Wright, Ramtin Babaeipour.

Figure 1: RAG process for clinical protocol information extraction. The RAG system first processes protocol Portable Document Format (PDF) files, subdividing them into meaningful chunks and storing them in a vector database with semantic embeddings. When users query for specific protocol information (e.g., inclusion/exclusion criteria), the system retrieves the most relevant chunks and provides them as context to a…
Figure 2: Comparison of task completion times between non-AI and AI-assisted protocol abstraction tasks (AI-assisted: RAG 2, gpt-4o-mini for generation, gpt-4o for SoE). Box plot shows median completion time, interquartile range, and outliers (circles) for both conditions. Non-AI tasks required substantially longer completion times compared to AI-assisted tasks. This translates to an average time reduction of 47 min…
Figure 3: Distribution and comparison of item-weighted accuracy scores between AI-assisted and unassisted conditions. (A) Histogram showing count distribution of item-weighted scores, with AI-assisted scores (blue) and unassisted scores (gray) overlaid. (B) Box plot displaying median, quartiles, and range of scores for both conditions. (C) Count distribution by rounded score values on 0–5 scale, comparing frequenc…
Original abstract

Increasing clinical trial protocol complexity, amendments, and challenges around knowledge management create significant burden for trial teams. Structuring protocol content into standard formats has the potential to improve efficiency, support documentation quality, and strengthen compliance. We evaluate an Artificial Intelligence (AI) system using generative LLMs with Retrieval-Augmented Generation (RAG) for automated clinical trial protocol information extraction. We compare the extraction accuracy of our clinical-trial-specific RAG process against that of publicly available (standalone) LLMs. We also assess the operational impact of AI-assistance on simulated extraction Clinical Research Coordinator (CRC) workflows. Our RAG process shows higher extraction accuracy (89.0%) than standalone LLMs with fine-tuned prompts (62.6%) against expert-supported reference annotations. In simulated extraction workflows, AI-assisted tasks are completed 40% faster, are rated as less cognitively demanding and are strongly preferred by users. While expert oversight remains essential, this suggests that AI-assisted extraction can enable protocol intelligence at scale, motivating the integration of similar methodologies into real-world clinical workflows to further validate its impact on feasibility, study start-up, and post-activation monitoring.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper evaluates an AI system using generative LLMs with retrieval-augmented generation (RAG) for automated extraction of information from clinical trial protocols. It reports that the RAG approach achieves 89.0% extraction accuracy against expert-supported reference annotations, outperforming standalone LLMs with fine-tuned prompts at 62.6%, and that AI-assisted simulated workflows for clinical research coordinators complete tasks 40% faster while being rated as less cognitively demanding and strongly preferred by users.

Significance. If the reported accuracy and workflow gains hold under more rigorous evaluation with characterized datasets, the work could meaningfully reduce protocol management burden in clinical trials and support scalable 'protocol intelligence.' The direct empirical comparison between RAG and baseline LLMs is a strength, as is the inclusion of user preference and cognitive load metrics in the simulated CRC tasks.

major comments (3)
  1. [Evaluation / Results] The manuscript provides no information on the size, therapeutic-area distribution, length, amendment frequency, or selection criteria of the clinical trial protocol collection used for evaluation, which is load-bearing for the generalizability of the 89.0% accuracy claim and the 40% speedup result.
  2. [Methods] The creation of the 'expert-supported reference annotations' is described only at a high level with no reported inter-annotator agreement, adjudication protocol, or error taxonomy, leaving the accuracy comparisons (89.0% vs 62.6%) only partially supported.
  3. [Results] No statistical tests, confidence intervals, or significance testing are mentioned for the accuracy or time-savings differences, which weakens the strength of the central performance claims.
minor comments (1)
  1. [Abstract] The abstract would be strengthened by briefly stating dataset size and any statistical support for the headline numbers.
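
For readers unfamiliar with the inter-annotator agreement statistic requested in major comment 2, a minimal sketch of Cohen's kappa for two annotators follows. The labels are invented for illustration and the function name `cohens_kappa` is hypothetical; the paper does not state which agreement measure, if any, was used.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two annotators."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labeled independently at their own rates.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    categories = count_a.keys() | count_b.keys()
    expected = sum(count_a[c] * count_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Invented judgments from two hypothetical annotators on six extracted items.
annotator_a = ["correct", "correct", "incorrect", "correct", "incorrect", "correct"]
annotator_b = ["correct", "incorrect", "incorrect", "correct", "incorrect", "correct"]
print(round(cohens_kappa(annotator_a, annotator_b), 3))
```

Kappa corrects raw percent agreement for the agreement expected by chance, which matters when one label (for example "correct") dominates the reference set.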

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We agree with the identified gaps in dataset description, annotation details, and statistical reporting, and have revised the manuscript to address each point directly.

Point-by-point responses
  1. Referee: [Evaluation / Results] The manuscript provides no information on the size, therapeutic-area distribution, length, amendment frequency, or selection criteria of the clinical trial protocol collection used for evaluation, which is load-bearing for the generalizability of the 89.0% accuracy claim and the 40% speedup result.

    Authors: We agree that these details are necessary to assess generalizability. In the revised manuscript we have added a dedicated Methods subsection describing the protocol collection, including its size, therapeutic-area distribution, average length, amendment frequency, and selection criteria from ClinicalTrials.gov. These additions support the scope of the reported accuracy and workflow results. revision: yes

  2. Referee: [Methods] The creation of the 'expert-supported reference annotations' is described only at a high level with no reported inter-annotator agreement, adjudication protocol, or error taxonomy, leaving the accuracy comparisons (89.0% vs 62.6%) only partially supported.

    Authors: We acknowledge the high-level description in the original submission. The revised Methods section now provides a full account of the annotation process, including inter-annotator agreement, the adjudication protocol, and an error taxonomy. These additions strengthen the foundation for the accuracy comparisons. revision: yes

  3. Referee: [Results] No statistical tests, confidence intervals, or significance testing are mentioned for the accuracy or time-savings differences, which weakens the strength of the central performance claims.

    Authors: We agree that statistical rigor is required. The revised Results section now includes appropriate statistical tests, 95% confidence intervals, and significance testing for both the accuracy and time-savings differences. This directly bolsters the central performance claims. revision: yes
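
As an illustration of the kind of statistical support discussed in point 3, the sketch below computes Wilson 95% confidence intervals for two accuracy proportions and a two-sided two-proportion z-test. The item counts are hypothetical: the page does not report the evaluation set size, so 500 items per system is assumed purely so that the proportions match the reported 89.0% and 62.6%. The function names are likewise hypothetical.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def two_proportion_z(s1, n1, s2, n2):
    """Two-sided z-test for a difference between two independent proportions."""
    p1, p2 = s1 / n1, s2 / n2
    pooled = (s1 + s2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value


# ASSUMPTION: 500 scored items per system; the real counts are not given.
N_ITEMS = 500
RAG_CORRECT = 445       # 445 / 500 = 89.0%
BASELINE_CORRECT = 313  # 313 / 500 = 62.6%

print("RAG 95% CI:", wilson_interval(RAG_CORRECT, N_ITEMS))
print("Baseline 95% CI:", wilson_interval(BASELINE_CORRECT, N_ITEMS))
print("z, p:", two_proportion_z(RAG_CORRECT, N_ITEMS, BASELINE_CORRECT, N_ITEMS))
```

If both systems are scored on the same items, the samples are paired rather than independent, and a paired test such as McNemar's would be the more appropriate choice; the independent-samples test is shown only because it needs no per-item data.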

Circularity Check

0 steps flagged

No circularity: empirical accuracy and workflow measurements rest on direct comparisons

Full rationale

The paper reports measured extraction accuracy (89.0% RAG vs 62.6% baseline) and simulated workflow speedups (40% faster) obtained by running the described RAG pipeline on an evaluation collection of protocols and comparing outputs to expert-supported reference annotations. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the load-bearing claims. The central results are produced by standard empirical evaluation against an external reference set rather than by any construction that reduces the reported numbers to the inputs by definition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review; no explicit free parameters, invented entities, or non-standard axioms are stated. The work assumes standard LLM prompting and retrieval capabilities function as described in the clinical domain.

axioms (1)
  • Domain assumption: expert annotations provide reliable ground truth for protocol information extraction.
    Invoked when accuracy is measured against expert-supported reference annotations.

pith-pipeline@v0.9.0 · 5512 in / 1174 out tokens · 29491 ms · 2026-05-16T12:50:11.375606+00:00 · methodology

