pith. machine review for the scientific record.

arxiv: 2604.04168 · v2 · submitted 2026-04-05 · 💻 cs.CL · cs.IR

Recognition: 2 theorem links · Lean Theorem

A Semi-Automated Annotation Workflow for Paediatric Histopathology Reports Using Small Language Models

Authors on Pith · no claims yet

Pith reviewed 2026-05-13 16:46 UTC · model grok-4.3

classification 💻 cs.CL cs.IR
keywords small language models · clinical information extraction · paediatric histopathology · semi-automated annotation · question answering · renal biopsy reports · CPU-only processing

The pith

Small language models extract structured data from paediatric renal biopsy reports at 84% accuracy on standard hardware.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a workflow that lets small language models turn free-text paediatric histopathology reports into structured information. This matters because most clinical records are locked in unstructured text, making large-scale research and analysis difficult without expensive or privacy-risky tools. The approach frames the task as question answering, uses clinician-written guidelines and a few examples to guide the models, and treats disagreement between models as the signal for which cases need human review. Evaluated against a gold standard of 400 manually annotated reports, the best small model reached 84.3 percent accuracy while running on ordinary CPUs.

Core claim

A semi-automated annotation workflow using small language models, set up as a question-answering task with clinician-guided entity guidelines and few-shot examples, extracts structured information from paediatric renal biopsy reports. The top model, Gemma 2 2B, achieves 84.3% accuracy on CPU-only infrastructure, outperforming baselines such as spaCy and BioBERT, and requires only minimal clinician involvement through a disagreement modelling framework.

What carries the argument

The question-answering formulation of information extraction, guided by entity guidelines and few-shot examples, combined with a disagreement modelling framework to select reports for clinical review.
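The disagreement mechanism lends itself to a short sketch. The function and the example answers below are illustrative only, not drawn from the paper's released code; the paper does not specify how answers are normalised before comparison, so exact string matching after lowercasing is an assumption here.

```python
# Sketch of disagreement-based triage: each model answers the same
# (report, question) pair, and any disagreement after light
# normalisation flags the field for clinician review.
# Normalisation strategy is an assumption, not the paper's method.

def needs_review(answers: list[str]) -> bool:
    """Flag a report field for clinician review when models disagree."""
    normalised = {a.strip().lower() for a in answers}
    return len(normalised) > 1

# Hypothetical answers from several SLMs for one entity question:
print(needs_review(["IgA nephropathy", "iga nephropathy", "IgA Nephropathy"]))
# → False (all agree after normalisation)

print(needs_review(["IgA nephropathy", "lupus nephritis", "IgA nephropathy"]))
# → True (disagreement, route to clinician)
```

With five models, the same predicate generalises directly; a stricter variant could require a minimum level of agreement rather than unanimity.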

If this is right

  • Structured clinical data becomes available from existing electronic patient records without cloud uploads or high-end hardware.
  • Clinician time is focused only on reports where models disagree, reducing manual effort.
  • Accuracy gains come from both entity guidelines (7-19%) and few-shot examples (6-38%), though the gains do not compound when combined.
  • This method works for domains with constrained diagnostic scope like paediatric renal biopsies.
  • Local processing keeps sensitive patient data private while still enabling automated annotation.
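The guideline-plus-few-shot prompting described above can be illustrated with a minimal prompt builder. The guideline text, the worked example, and the function name below are invented placeholders for illustration; the paper's actual guidelines and prompt templates live in its released code.

```python
# Minimal sketch of framing extraction as question answering, with a
# clinician-written entity guideline and few-shot examples prepended
# to the prompt. All strings here are invented placeholders.

GUIDELINE = (
    "Entity 'diagnosis': the primary histopathological diagnosis as "
    "stated in the report conclusion. Answer 'not stated' if absent."
)

FEW_SHOT = [
    ("Report: Conclusion: features consistent with minimal change disease.\n"
     "Question: What is the diagnosis?",
     "minimal change disease"),
]

def build_prompt(report_text: str, question: str) -> str:
    """Assemble guideline + few-shot examples + the target question."""
    parts = [GUIDELINE]
    for q, a in FEW_SHOT:
        parts.append(f"{q}\nAnswer: {a}")
    parts.append(f"Report: {report_text}\nQuestion: {question}\nAnswer:")
    return "\n\n".join(parts)

prompt = build_prompt("Conclusion: IgA nephropathy.", "What is the diagnosis?")
print(prompt)
```

The ablation result quoted above (guidelines and few-shot each help, but not additively) suggests the two components supply overlapping information to the model, which this structure makes visible: both ultimately constrain the same answer slot.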

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar workflows could be adapted for other types of medical reports with well-defined structures.
  • Institutions lacking GPU access might still build large annotated datasets using only consumer hardware.
  • The approach could lower the barrier for creating research-ready clinical databases in paediatrics.
  • Performance on reports from new hospitals or later time periods remains an open question for transferability.

Load-bearing premise

The 400 manually annotated reports are a representative and accurate gold standard that reflects the full range of reports the system will encounter.

What would settle it

Evaluating the Gemma 2 2B model on a fresh set of reports from a different hospital or time period and finding accuracy below 70 percent would indicate the workflow does not transfer reliably.

read the original abstract

Electronic Patient Record (EPR) systems contain valuable clinical information, but much of it is trapped in unstructured text, limiting its use for research and decision-making. Large language models can extract such information but require substantial computational resources to run locally, and sending sensitive clinical data to cloud-based services, even when deidentified, raises significant patient privacy concerns. In this study, we develop a resource-efficient semi-automated annotation workflow using small language models (SLMs) to extract structured information from unstructured EPR data, focusing on paediatric histopathology reports. As a proof-of-concept, we apply the workflow to paediatric renal biopsy reports, a domain chosen for its constrained diagnostic scope and well-defined underlying biology. We develop the workflow iteratively with clinical oversight across three meetings, manually annotating 400 reports from a dataset of 2,111 at Great Ormond Street Hospital as a gold standard, while developing an automated information extraction approach using SLMs. We frame extraction as a Question-Answering task grounded by clinician-guided entity guidelines and few-shot examples, evaluating five instruction-tuned SLMs with a disagreement modelling framework to prioritise reports for clinical review. Gemma 2 2B achieves the highest accuracy at 84.3%, outperforming off-the-shelf models including spaCy (74.3%), BioBERT-SQuAD (62.3%), RoBERTa-SQuAD (59.7%), and GLiNER (60.2%). Entity guidelines improved performance by 7-19% over the zero-shot baseline, and few-shot examples by 6-38%, though their benefits do not compound when combined. These results demonstrate that SLMs can extract structured information from specialised clinical domains on CPU-only infrastructure with minimal clinician involvement. Our code is available at https://github.com/gosh-dre/nlp_renal_biopsy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims to develop a semi-automated annotation workflow using small language models (SLMs) to extract structured information from unstructured paediatric histopathology reports, focusing on renal biopsies at Great Ormond Street Hospital. They manually annotate 400 reports from a 2,111-report dataset as a gold standard through iterative clinician oversight, frame extraction as a QA task augmented by entity guidelines and few-shot examples, and evaluate five instruction-tuned SLMs on CPU-only hardware. Gemma 2 2B achieves the highest accuracy of 84.3%, outperforming baselines including spaCy (74.3%), BioBERT-SQuAD (62.3%), RoBERTa-SQuAD (59.7%), and GLiNER (60.2%). Entity guidelines improve performance by 7-19% and few-shot examples by 6-38% over zero-shot baselines (though benefits do not compound), supporting the conclusion that SLMs enable reliable extraction with minimal clinician involvement and public code release.

Significance. If the reported accuracies and improvements hold under more rigorous validation, the work offers a practical demonstration of resource-efficient clinical NLP that avoids cloud-based privacy risks and heavy GPU requirements. The open code and focus on a constrained diagnostic domain strengthen its potential utility for unlocking EPR data in paediatric research settings with limited computational resources.

major comments (3)
  1. [Abstract] The 84.3% accuracy for Gemma 2 2B (and all baseline comparisons) is reported without error bars, confidence intervals, or details on train/test splits or statistical testing, leaving the central performance claim only partially supported and difficult to interpret for reliability.
  2. [Abstract] No inter-annotator agreement metric (e.g., Cohen's kappa) or details on the number of clinicians, annotation process, or disagreement resolution are provided for the 400-report gold standard, which is load-bearing for the claim that the labels constitute trustworthy ground truth for measuring SLM performance.
  3. [Abstract] All evaluation data originates from a single institution and time period (Great Ormond Street Hospital paediatric renal biopsies), with no external validation set; this directly weakens the generalizability assertion that the workflow (and reported gains from guidelines/few-shot) will transfer to other clinical reporting conventions or sites.
minor comments (2)
  1. The abstract states the workflow was developed iteratively across three meetings but provides no further specifics on clinician inputs or changes made during iteration.
  2. Consider including a summary table of all five SLMs' accuracies plus the ablation results for guidelines and few-shot combinations to improve clarity of the comparative results.
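Major comment 2 asks for an inter-annotator agreement metric. Were independent parallel annotations available, Cohen's kappa for two annotators is a few lines of stdlib Python; the labels below are invented for illustration and scikit-learn's `cohen_kappa_score` would do the same job in practice.

```python
# Cohen's kappa for two annotators over the same items:
# (observed agreement - chance agreement) / (1 - chance agreement).
from collections import Counter

def cohens_kappa(a: list[str], b: list[str]) -> float:
    assert len(a) == len(b)
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (observed - expected) / (1 - expected)

# Invented diagnosis labels for six reports from two annotators:
ann1 = ["IgA", "MCD", "IgA", "FSGS", "MCD", "IgA"]
ann2 = ["IgA", "MCD", "FSGS", "FSGS", "MCD", "IgA"]
print(round(cohens_kappa(ann1, ann2), 3))  # → 0.75
```

Note the authors' rebuttal below points out the gold standard was built by iterative consensus rather than independent labelling, in which case kappa is not directly computable from the existing process.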

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which has identified important areas for improving the clarity and transparency of our work. We address each major comment below and describe the revisions we will implement.

read point-by-point responses
  1. Referee: [Abstract] The 84.3% accuracy for Gemma 2 2B (and all baseline comparisons) is reported without error bars, confidence intervals, or details on train/test splits or statistical testing, leaving the central performance claim only partially supported and difficult to interpret for reliability.

    Authors: We acknowledge this limitation in the abstract. The full manuscript describes the evaluation as a single 80/20 split on the 400-report gold standard set (with the SLMs used in zero- and few-shot settings, so no model training occurred). To strengthen the presentation, we will add 95% bootstrap confidence intervals to all reported accuracies in the abstract and results, explicitly note the evaluation protocol and split details, and clarify that no formal statistical significance tests were conducted between models given the proof-of-concept focus on practical utility. These changes will appear in the revised abstract and methods. revision: yes

  2. Referee: [Abstract] No inter-annotator agreement metric (e.g., Cohen's kappa) or details on the number of clinicians, annotation process, or disagreement resolution are provided for the 400-report gold standard, which is load-bearing for the claim that the labels constitute trustworthy ground truth for measuring SLM performance.

    Authors: The gold standard was created by two senior paediatric pathologists via iterative consensus across three meetings, with disagreements resolved through direct discussion rather than independent parallel annotation. We did not compute Cohen's kappa because the process was designed as iterative consensus rather than independent labelling. In the revision we will expand the methods section with full details on the number of clinicians, the meeting-based process, and consensus resolution, while adding an explicit statement that formal IAA metrics were not calculated and noting this as a limitation of the gold standard. revision: yes

  3. Referee: [Abstract] All evaluation data originates from a single institution and time period (Great Ormond Street Hospital paediatric renal biopsies), with no external validation set; this directly weakens the generalizability assertion that the workflow (and reported gains from guidelines/few-shot) will transfer to other clinical reporting conventions or sites.

    Authors: We agree that the single-centre design limits claims of broad generalizability. As a proof-of-concept study using only the GOSH renal biopsy corpus, external validation data were not available. In the revised manuscript we will moderate the abstract and discussion language to clearly state the single-institution limitation, remove or qualify any overstatements of transferability, and add a dedicated paragraph outlining the need for future multi-site studies to assess performance across different reporting conventions. revision: yes
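The 95% bootstrap confidence intervals the rebuttal proposes can be sketched as follows, assuming per-report correctness flags (1 = extraction matched the gold label). The flags below are simulated to mimic a roughly 84%-accurate model on an 80-report test split (20% of 400); they are not the paper's data.

```python
# Percentile bootstrap CI for accuracy, stdlib only.
import random

def bootstrap_ci(flags: list[int], reps: int = 2000, seed: int = 0):
    """Resample correctness flags with replacement; return 95% CI."""
    rng = random.Random(seed)
    n = len(flags)
    stats = sorted(
        sum(rng.choice(flags) for _ in range(n)) / n for _ in range(reps)
    )
    return stats[int(0.025 * reps)], stats[int(0.975 * reps)]

# Simulated flags: 67 correct out of 80 (≈ 84% accuracy).
flags = [1] * 67 + [0] * 13
lo, hi = bootstrap_ci(flags)
print(f"accuracy {sum(flags) / len(flags):.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

At n = 80 the interval spans several percentage points either side of the point estimate, which is exactly why the referee's request matters: the gap between Gemma 2 2B and spaCy is wide enough to survive it, but finer model-to-model orderings may not be.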

Circularity Check

0 steps flagged

No circularity: empirical accuracy measured against independent human annotations

full rationale

The paper describes an empirical workflow in which 400 paediatric renal biopsy reports are manually annotated by clinicians as a gold standard, followed by evaluation of five instruction-tuned SLMs on that fixed set using accuracy, with ablations for entity guidelines and few-shot examples. No equations, fitted parameters, or derivation steps appear in the provided text. The reported 84.3% accuracy (Gemma 2 2B) is a direct comparison to the externally created human labels rather than a quantity that reduces to the model inputs or prior self-citations by construction. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling are present. The evaluation is therefore self-contained against the stated gold standard and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The claim rests on the accuracy and representativeness of the 400-report gold standard created by the authors and on the assumption that the chosen SLMs and prompting strategy generalize within the domain.

pith-pipeline@v0.9.0 · 5687 in / 1091 out tokens · 63233 ms · 2026-05-13T16:46:38.910751+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · 2 internal anchors
