pith. machine review for the scientific record.

arxiv: 2604.04168 · v2 · submitted 2026-04-05 · 💻 cs.CL · cs.IR

Recognition: 2 theorem links · Lean Theorem

A Semi-Automated Annotation Workflow for Paediatric Histopathology Reports Using Small Language Models

Authors on Pith · no claims yet

Pith reviewed 2026-05-13 16:46 UTC · model grok-4.3

classification 💻 cs.CL cs.IR
keywords small language models · clinical information extraction · paediatric histopathology · semi-automated annotation · question answering · renal biopsy reports · CPU-only processing

The pith

Small language models extract structured data from paediatric renal biopsy reports at 84% accuracy on standard hardware.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a workflow that lets small language models turn free-text paediatric histopathology reports into structured information. This matters because most clinical records are locked in unstructured text, making large-scale research and analysis difficult without expensive or privacy-risky tools. The approach frames the task as question answering, uses clinician-written guidelines and a few examples to guide the models, and treats disagreement between models as the signal for which cases need human review. Evaluated against a gold standard of 400 manually annotated reports, the best small model reached 84.3 percent accuracy while running on ordinary CPUs.

Core claim

A semi-automated annotation workflow using small language models, set up as a question-answering task with clinician-guided entity guidelines and few-shot examples, extracts structured information from paediatric renal biopsy reports. The top model, Gemma 2 2B, achieves 84.3% accuracy on CPU-only infrastructure, outperforming baselines such as spaCy and BioBERT, and requires only minimal clinician involvement through a disagreement modelling framework.

What carries the argument

The question-answering formulation of information extraction, guided by entity guidelines and few-shot examples, combined with a disagreement modelling framework to select reports for clinical review.
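The disagreement mechanism lends itself to a short sketch. The function and the example answers below are illustrative only, not drawn from the paper's released code; the paper does not specify how answers are normalised before comparison, so exact string matching after lowercasing is an assumption here.

```python
# Sketch of disagreement-based triage: each model answers the same
# (report, question) pair, and any disagreement after light
# normalisation flags the field for clinician review.
# Normalisation strategy is an assumption, not the paper's method.

def needs_review(answers: list[str]) -> bool:
    """Flag a report field for clinician review when models disagree."""
    normalised = {a.strip().lower() for a in answers}
    return len(normalised) > 1

# Hypothetical answers from several SLMs for one entity question:
print(needs_review(["IgA nephropathy", "iga nephropathy", "IgA Nephropathy"]))
# → False (all agree after normalisation)

print(needs_review(["IgA nephropathy", "lupus nephritis", "IgA nephropathy"]))
# → True (disagreement, route to clinician)
```

With five models, the same predicate generalises directly; a stricter variant could require a minimum level of agreement rather than unanimity.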

If this is right

  • Structured clinical data becomes available from existing electronic patient records without cloud uploads or high-end hardware.
  • Clinician time is focused only on reports where models disagree, reducing manual effort.
  • Accuracy gains come from both entity guidelines (7-19%) and few-shot examples (6-38%), though the gains do not compound when combined.
  • This method works for domains with constrained diagnostic scope like paediatric renal biopsies.
  • Local processing keeps sensitive patient data private while still enabling automated annotation.
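The guideline-plus-few-shot prompting described above can be illustrated with a minimal prompt builder. The guideline text, the worked example, and the function name below are invented placeholders for illustration; the paper's actual guidelines and prompt templates live in its released code.

```python
# Minimal sketch of framing extraction as question answering, with a
# clinician-written entity guideline and few-shot examples prepended
# to the prompt. All strings here are invented placeholders.

GUIDELINE = (
    "Entity 'diagnosis': the primary histopathological diagnosis as "
    "stated in the report conclusion. Answer 'not stated' if absent."
)

FEW_SHOT = [
    ("Report: Conclusion: features consistent with minimal change disease.\n"
     "Question: What is the diagnosis?",
     "minimal change disease"),
]

def build_prompt(report_text: str, question: str) -> str:
    """Assemble guideline + few-shot examples + the target question."""
    parts = [GUIDELINE]
    for q, a in FEW_SHOT:
        parts.append(f"{q}\nAnswer: {a}")
    parts.append(f"Report: {report_text}\nQuestion: {question}\nAnswer:")
    return "\n\n".join(parts)

prompt = build_prompt("Conclusion: IgA nephropathy.", "What is the diagnosis?")
print(prompt)
```

The ablation result quoted above (guidelines and few-shot each help, but not additively) suggests the two components supply overlapping information to the model, which this structure makes visible: both ultimately constrain the same answer slot.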

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar workflows could be adapted for other types of medical reports with well-defined structures.
  • Institutions lacking GPU access might still build large annotated datasets using only consumer hardware.
  • The approach could lower the barrier for creating research-ready clinical databases in paediatrics.
  • Performance on reports from new hospitals or later time periods remains an open question for transferability.

Load-bearing premise

The 400 manually annotated reports are a representative and accurate gold standard that reflects the full range of reports the system will encounter.

What would settle it

Evaluating the Gemma 2 2B model on a fresh set of reports from a different hospital or time period and finding accuracy below 70 percent would indicate the workflow does not transfer reliably.

read the original abstract

Electronic Patient Record (EPR) systems contain valuable clinical information, but much of it is trapped in unstructured text, limiting its use for research and decision-making. Large language models can extract such information but require substantial computational resources to run locally, and sending sensitive clinical data to cloud-based services, even when deidentified, raises significant patient privacy concerns. In this study, we develop a resource-efficient semi-automated annotation workflow using small language models (SLMs) to extract structured information from unstructured EPR data, focusing on paediatric histopathology reports. As a proof-of-concept, we apply the workflow to paediatric renal biopsy reports, a domain chosen for its constrained diagnostic scope and well-defined underlying biology. We develop the workflow iteratively with clinical oversight across three meetings, manually annotating 400 reports from a dataset of 2,111 at Great Ormond Street Hospital as a gold standard, while developing an automated information extraction approach using SLMs. We frame extraction as a Question-Answering task grounded by clinician-guided entity guidelines and few-shot examples, evaluating five instruction-tuned SLMs with a disagreement modelling framework to prioritise reports for clinical review. Gemma 2 2B achieves the highest accuracy at 84.3%, outperforming off-the-shelf models including spaCy (74.3%), BioBERT-SQuAD (62.3%), RoBERTa-SQuAD (59.7%), and GLiNER (60.2%). Entity guidelines improved performance by 7-19% over the zero-shot baseline, and few-shot examples by 6-38%, though their benefits do not compound when combined. These results demonstrate that SLMs can extract structured information from specialised clinical domains on CPU-only infrastructure with minimal clinician involvement. Our code is available at https://github.com/gosh-dre/nlp_renal_biopsy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims to develop a semi-automated annotation workflow using small language models (SLMs) to extract structured information from unstructured paediatric histopathology reports, focusing on renal biopsies at Great Ormond Street Hospital. They manually annotate 400 reports from a 2,111-report dataset as a gold standard through iterative clinician oversight, frame extraction as a QA task augmented by entity guidelines and few-shot examples, and evaluate five instruction-tuned SLMs on CPU-only hardware. Gemma 2 2B achieves the highest accuracy of 84.3%, outperforming baselines including spaCy (74.3%), BioBERT-SQuAD (62.3%), RoBERTa-SQuAD (59.7%), and GLiNER (60.2%). Entity guidelines improve performance by 7-19% and few-shot examples by 6-38% over zero-shot baselines (though benefits do not compound), supporting the conclusion that SLMs enable reliable extraction with minimal clinician involvement and public code release.

Significance. If the reported accuracies and improvements hold under more rigorous validation, the work offers a practical demonstration of resource-efficient clinical NLP that avoids cloud-based privacy risks and heavy GPU requirements. The open code and focus on a constrained diagnostic domain strengthen its potential utility for unlocking EPR data in paediatric research settings with limited computational resources.

major comments (3)
  1. [Abstract] The 84.3% accuracy for Gemma 2 2B (and all baseline comparisons) is reported without error bars, confidence intervals, or details on train/test splits or statistical testing, leaving the central performance claim only partially supported and difficult to interpret for reliability.
  2. [Abstract] No inter-annotator agreement metric (e.g., Cohen's kappa) or details on the number of clinicians, annotation process, or disagreement resolution are provided for the 400-report gold standard, which is load-bearing for the claim that the labels constitute trustworthy ground truth for measuring SLM performance.
  3. [Abstract] All evaluation data originates from a single institution and time period (Great Ormond Street Hospital paediatric renal biopsies), with no external validation set; this directly weakens the generalizability assertion that the workflow (and reported gains from guidelines/few-shot) will transfer to other clinical reporting conventions or sites.
minor comments (2)
  1. The abstract states the workflow was developed iteratively across three meetings but provides no further specifics on clinician inputs or changes made during iteration.
  2. Consider including a summary table of all five SLMs' accuracies plus the ablation results for guidelines and few-shot combinations to improve clarity of the comparative results.
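Major comment 2 asks for an inter-annotator agreement metric. Were independent parallel annotations available, Cohen's kappa for two annotators is a few lines of stdlib Python; the labels below are invented for illustration and scikit-learn's `cohen_kappa_score` would do the same job in practice.

```python
# Cohen's kappa for two annotators over the same items:
# (observed agreement - chance agreement) / (1 - chance agreement).
from collections import Counter

def cohens_kappa(a: list[str], b: list[str]) -> float:
    assert len(a) == len(b)
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (observed - expected) / (1 - expected)

# Invented diagnosis labels for six reports from two annotators:
ann1 = ["IgA", "MCD", "IgA", "FSGS", "MCD", "IgA"]
ann2 = ["IgA", "MCD", "FSGS", "FSGS", "MCD", "IgA"]
print(round(cohens_kappa(ann1, ann2), 3))  # → 0.75
```

Note the authors' rebuttal below points out the gold standard was built by iterative consensus rather than independent labelling, in which case kappa is not directly computable from the existing process.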

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which has identified important areas for improving the clarity and transparency of our work. We address each major comment below and describe the revisions we will implement.

read point-by-point responses
  1. Referee: [Abstract] The 84.3% accuracy for Gemma 2 2B (and all baseline comparisons) is reported without error bars, confidence intervals, or details on train/test splits or statistical testing, leaving the central performance claim only partially supported and difficult to interpret for reliability.

    Authors: We acknowledge this limitation in the abstract. The full manuscript describes the evaluation as a single 80/20 split on the 400-report gold standard set (with the SLMs used in zero- and few-shot settings, so no model training occurred). To strengthen the presentation, we will add 95% bootstrap confidence intervals to all reported accuracies in the abstract and results, explicitly note the evaluation protocol and split details, and clarify that no formal statistical significance tests were conducted between models given the proof-of-concept focus on practical utility. These changes will appear in the revised abstract and methods. revision: yes

  2. Referee: [Abstract] No inter-annotator agreement metric (e.g., Cohen's kappa) or details on the number of clinicians, annotation process, or disagreement resolution are provided for the 400-report gold standard, which is load-bearing for the claim that the labels constitute trustworthy ground truth for measuring SLM performance.

    Authors: The gold standard was created by two senior paediatric pathologists via iterative consensus across three meetings, with disagreements resolved through direct discussion rather than independent parallel annotation. We did not compute Cohen's kappa because the process was designed as iterative consensus rather than independent labelling. In the revision we will expand the methods section with full details on the number of clinicians, the meeting-based process, and consensus resolution, while adding an explicit statement that formal IAA metrics were not calculated and noting this as a limitation of the gold standard. revision: yes

  3. Referee: [Abstract] All evaluation data originates from a single institution and time period (Great Ormond Street Hospital paediatric renal biopsies), with no external validation set; this directly weakens the generalizability assertion that the workflow (and reported gains from guidelines/few-shot) will transfer to other clinical reporting conventions or sites.

    Authors: We agree that the single-centre design limits claims of broad generalizability. As a proof-of-concept study using only the GOSH renal biopsy corpus, external validation data were not available. In the revised manuscript we will moderate the abstract and discussion language to clearly state the single-institution limitation, remove or qualify any overstatements of transferability, and add a dedicated paragraph outlining the need for future multi-site studies to assess performance across different reporting conventions. revision: yes
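The 95% bootstrap confidence intervals the rebuttal proposes can be sketched as follows, assuming per-report correctness flags (1 = extraction matched the gold label). The flags below are simulated to mimic a roughly 84%-accurate model on an 80-report test split (20% of 400); they are not the paper's data.

```python
# Percentile bootstrap CI for accuracy, stdlib only.
import random

def bootstrap_ci(flags: list[int], reps: int = 2000, seed: int = 0):
    """Resample correctness flags with replacement; return 95% CI."""
    rng = random.Random(seed)
    n = len(flags)
    stats = sorted(
        sum(rng.choice(flags) for _ in range(n)) / n for _ in range(reps)
    )
    return stats[int(0.025 * reps)], stats[int(0.975 * reps)]

# Simulated flags: 67 correct out of 80 (≈ 84% accuracy).
flags = [1] * 67 + [0] * 13
lo, hi = bootstrap_ci(flags)
print(f"accuracy {sum(flags) / len(flags):.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

At n = 80 the interval spans several percentage points either side of the point estimate, which is exactly why the referee's request matters: the gap between Gemma 2 2B and spaCy is wide enough to survive it, but finer model-to-model orderings may not be.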

Circularity Check

0 steps flagged

No circularity: empirical accuracy measured against independent human annotations

full rationale

The paper describes an empirical workflow in which 400 paediatric renal biopsy reports are manually annotated by clinicians as a gold standard, followed by evaluation of five instruction-tuned SLMs on that fixed set using accuracy, with ablations for entity guidelines and few-shot examples. No equations, fitted parameters, or derivation steps appear in the provided text. The reported 84.3% accuracy (Gemma 2 2B) is a direct comparison to the externally created human labels rather than a quantity that reduces to the model inputs or prior self-citations by construction. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling are present. The evaluation is therefore self-contained against the stated gold standard and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The claim rests on the accuracy and representativeness of the 400-report gold standard created by the authors and on the assumption that the chosen SLMs and prompting strategy generalize within the domain.

pith-pipeline@v0.9.0 · 5687 in / 1091 out tokens · 63233 ms · 2026-05-13T16:46:38.910751+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · 2 internal anchors
