A Semi-Automated Annotation Workflow for Paediatric Histopathology Reports Using Small Language Models
Recognition: 2 theorem links · Lean Theorem
Pith reviewed 2026-05-13 16:46 UTC · model grok-4.3
The pith
Small language models extract structured data from paediatric renal biopsy reports at 84.3% accuracy on CPU-only hardware.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A semi-automated annotation workflow using small language models, set up as a question-answering task with clinician-guided entity guidelines and few-shot examples, extracts structured information from paediatric renal biopsy reports. The top model, Gemma 2 2B, achieves 84.3% accuracy on CPU-only infrastructure, outperforming baselines such as spaCy and BioBERT, and requires only minimal clinician involvement through a disagreement modelling framework.
What carries the argument
The question-answering formulation of information extraction, guided by entity guidelines and few-shot examples, combined with a disagreement modelling framework to select reports for clinical review.
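The abstract names this QA framing but does not show it. The sketch below illustrates one way such a prompt could be assembled; the entity question, guideline text, and few-shot example are hypothetical stand-ins, not the study's clinician-approved guidelines (the actual code is at https://github.com/gosh-dre/nlp_renal_biopsy).

```python
# Minimal sketch of framing information extraction as question answering.
# The guideline, few-shot example, and entity question below are
# hypothetical illustrations, not the study's actual materials.

FEW_SHOT_EXAMPLES = [
    {
        "report": "Renal biopsy: 14 glomeruli sampled, 2 globally sclerosed.",
        "question": "How many glomeruli are present in the specimen?",
        "answer": "14",
    },
]

def build_qa_prompt(report_text: str, entity_question: str, guideline: str) -> str:
    """Assemble a guideline-grounded, few-shot QA prompt for an SLM."""
    lines = [
        "You extract structured fields from paediatric renal biopsy reports.",
        f"Entity guideline: {guideline}",
        "Answer with the exact span from the report, or 'not stated'.",
        "",
    ]
    for ex in FEW_SHOT_EXAMPLES:  # few-shot grounding
        lines += [
            f"Report: {ex['report']}",
            f"Question: {ex['question']}",
            f"Answer: {ex['answer']}",
            "",
        ]
    lines += [
        f"Report: {report_text}",
        f"Question: {entity_question}",
        "Answer:",
    ]
    return "\n".join(lines)

print(build_qa_prompt(
    "Renal biopsy: 21 glomeruli, diffuse mesangial hypercellularity.",
    "How many glomeruli are present in the specimen?",
    "Glomeruli count: total number of glomeruli reported in the specimen.",
))
```

Posing one question per entity keeps each CPU-bound SLM call short and makes answers directly comparable across the five evaluated models.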
If this is right
- Structured clinical data becomes available from existing electronic patient records without cloud uploads or high-end hardware.
- Clinician time is focused only on reports where models disagree, reducing manual effort (see the disagreement-rule sketch after this list).
- Accuracy gains come from both entity guidelines (7-19%) and few-shot examples (6-38%), though the gains do not compound when combined.
- This method works for domains with constrained diagnostic scope like paediatric renal biopsies.
- Local processing keeps sensitive patient data private while still enabling automated annotation.
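The disagreement rule referenced above is described only at a high level in the abstract. One plausible reading, sketched below, flags a report for clinician review whenever any entity receives a split vote across models; the normalisation step and the any-disagreement rule are assumptions rather than the paper's specification, and the model names other than Gemma 2 2B are placeholders.

```python
from collections import Counter

def normalise(answer: str) -> str:
    """Collapse case and whitespace so formatting differences don't count as disagreement."""
    return " ".join(answer.lower().strip().split())

def needs_review(model_answers: dict[str, dict[str, str]]) -> bool:
    """Flag a report for clinician review if any entity gets a split vote.

    model_answers maps model name -> {entity: extracted answer}.
    """
    entities = {e for answers in model_answers.values() for e in answers}
    for entity in entities:
        votes = Counter(normalise(answers.get(entity, ""))
                        for answers in model_answers.values())
        if len(votes) > 1:  # models disagree on this entity
            return True
    return False

# Example: two models agree, one dissents on the glomeruli count -> review.
answers_by_model = {
    "gemma-2-2b": {"glomeruli_count": "14"},
    "model-b": {"glomeruli_count": "14"},
    "model-c": {"glomeruli_count": "12"},
}
print(needs_review(answers_by_model))  # True
```

Stricter rules (e.g., majority vote with a confidence margin) would trade review load against missed errors; the any-disagreement rule shown here is the most conservative reading.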
Where Pith is reading between the lines
- Similar workflows could be adapted for other types of medical reports with well-defined structures.
- Institutions lacking GPU access might still build large annotated datasets using only consumer hardware.
- The approach could lower the barrier for creating research-ready clinical databases in paediatrics.
- Performance on reports from new hospitals or later time periods remains an open question for transferability.
Load-bearing premise
The 400 manually annotated reports are a representative and accurate gold standard that reflects the full range of reports the system will encounter.
What would settle it
Evaluating the Gemma 2 2B model on a fresh set of reports from a different hospital or time period and finding accuracy below 70% would indicate the workflow does not transfer reliably.
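A minimal sketch of that settling test, assuming exact-match scoring against a hypothetical external gold set (the data and threshold handling below are illustrative):

```python
def exact_match_accuracy(predictions: list[str], gold: list[str]) -> float:
    """Exact-match accuracy after light normalisation (illustrative metric)."""
    assert len(predictions) == len(gold)
    hits = sum(p.strip().lower() == g.strip().lower()
               for p, g in zip(predictions, gold))
    return hits / len(gold)

# Hypothetical predictions vs. gold labels from a different hospital.
preds_external = ["14", "fsgs", "not stated"]
gold_external = ["14", "fsgs", "minimal change disease"]

acc = exact_match_accuracy(preds_external, gold_external)
print(f"external accuracy: {acc:.1%}")  # 66.7% on this toy set
print("transfers reliably" if acc >= 0.70 else "does not transfer reliably")
```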
Original abstract
Electronic Patient Record (EPR) systems contain valuable clinical information, but much of it is trapped in unstructured text, limiting its use for research and decision-making. Large language models can extract such information but require substantial computational resources to run locally, and sending sensitive clinical data to cloud-based services, even when deidentified, raises significant patient privacy concerns. In this study, we develop a resource-efficient semi-automated annotation workflow using small language models (SLMs) to extract structured information from unstructured EPR data, focusing on paediatric histopathology reports. As a proof-of-concept, we apply the workflow to paediatric renal biopsy reports, a domain chosen for its constrained diagnostic scope and well-defined underlying biology. We develop the workflow iteratively with clinical oversight across three meetings, manually annotating 400 reports from a dataset of 2,111 at Great Ormond Street Hospital as a gold standard, while developing an automated information extraction approach using SLMs. We frame extraction as a Question-Answering task grounded by clinician-guided entity guidelines and few-shot examples, evaluating five instruction-tuned SLMs with a disagreement modelling framework to prioritise reports for clinical review. Gemma 2 2B achieves the highest accuracy at 84.3%, outperforming off-the-shelf models including spaCy (74.3%), BioBERT-SQuAD (62.3%), RoBERTa-SQuAD (59.7%), and GLiNER (60.2%). Entity guidelines improved performance by 7-19% over the zero-shot baseline, and few-shot examples by 6-38%, though their benefits do not compound when combined. These results demonstrate that SLMs can extract structured information from specialised clinical domains on CPU-only infrastructure with minimal clinician involvement. Our code is available at https://github.com/gosh-dre/nlp_renal_biopsy.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents a semi-automated annotation workflow using small language models (SLMs) to extract structured information from unstructured paediatric histopathology reports, focusing on renal biopsies at Great Ormond Street Hospital. The authors manually annotate 400 reports from a 2,111-report dataset as a gold standard through iterative clinician oversight, frame extraction as a QA task augmented by entity guidelines and few-shot examples, and evaluate five instruction-tuned SLMs on CPU-only hardware. Gemma 2 2B achieves the highest accuracy of 84.3%, outperforming baselines including spaCy (74.3%), BioBERT-SQuAD (62.3%), RoBERTa-SQuAD (59.7%), and GLiNER (60.2%). Entity guidelines improve performance by 7-19% and few-shot examples by 6-38% over zero-shot baselines (though the benefits do not compound), supporting the conclusion that SLMs enable reliable extraction with minimal clinician involvement; code is publicly released.
Significance. If the reported accuracies and improvements hold under more rigorous validation, the work offers a practical demonstration of resource-efficient clinical NLP that avoids cloud-based privacy risks and heavy GPU requirements. The open code and focus on a constrained diagnostic domain strengthen its potential utility for unlocking EPR data in paediatric research settings with limited computational resources.
major comments (3)
- [Abstract] The 84.3% accuracy for Gemma 2 2B (and all baseline comparisons) is reported without error bars, confidence intervals, or details on train/test splits or statistical testing, leaving the central performance claim only partially supported and difficult to interpret for reliability.
- [Abstract] No inter-annotator agreement metric (e.g., Cohen's kappa) or details on the number of clinicians, annotation process, or disagreement resolution are provided for the 400-report gold standard, which is load-bearing for the claim that the labels constitute trustworthy ground truth for measuring SLM performance.
- [Abstract] All evaluation data originates from a single institution and time period (Great Ormond Street Hospital paediatric renal biopsies), with no external validation set; this directly weakens the generalizability assertion that the workflow (and reported gains from guidelines/few-shot) will transfer to other clinical reporting conventions or sites.
minor comments (2)
- The abstract states the workflow was developed iteratively across three meetings but provides no further specifics on clinician inputs or changes made during iteration.
- Consider including a summary table of all five SLMs' accuracies plus the ablation results for guidelines and few-shot combinations to improve clarity of the comparative results.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback, which has identified important areas for improving the clarity and transparency of our work. We address each major comment below and describe the revisions we will implement.
Point-by-point responses
Referee: [Abstract] The 84.3% accuracy for Gemma 2 2B (and all baseline comparisons) is reported without error bars, confidence intervals, or details on train/test splits or statistical testing, leaving the central performance claim only partially supported and difficult to interpret for reliability.
Authors: We acknowledge this limitation in the abstract. The full manuscript describes the evaluation as a single 80/20 split on the 400-report gold standard set (with the SLMs used in zero- and few-shot settings, so no model training occurred). To strengthen the presentation, we will add 95% bootstrap confidence intervals to all reported accuracies in the abstract and results, explicitly note the evaluation protocol and split details, and clarify that no formal statistical significance tests were conducted between models given the proof-of-concept focus on practical utility. These changes will appear in the revised abstract and methods. Revision: yes.
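The promised interval is cheap to compute from per-report correctness flags. A minimal percentile-bootstrap sketch follows; the 80-report split and hit rate are illustrative numbers consistent with an 80/20 split of 400 reports, not the study's actual outputs.

```python
import random

def bootstrap_ci(correct: list[bool], n_boot: int = 10_000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap CI for accuracy over per-report correctness flags."""
    rng = random.Random(seed)
    n = len(correct)
    # Resample with replacement and collect the accuracy of each resample.
    stats = sorted(sum(rng.choices(correct, k=n)) / n for _ in range(n_boot))
    return stats[int(alpha / 2 * n_boot)], stats[int((1 - alpha / 2) * n_boot) - 1]

# Illustrative only: an 80-report test split (20% of 400) at ~84% accuracy.
flags = [True] * 67 + [False] * 13
low, high = bootstrap_ci(flags)
print(f"accuracy {sum(flags) / len(flags):.1%}, 95% CI [{low:.1%}, {high:.1%}]")
```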
Referee: [Abstract] No inter-annotator agreement metric (e.g., Cohen's kappa) or details on the number of clinicians, annotation process, or disagreement resolution are provided for the 400-report gold standard, which is load-bearing for the claim that the labels constitute trustworthy ground truth for measuring SLM performance.
Authors: The gold standard was created by two senior paediatric pathologists via iterative consensus across three meetings, with disagreements resolved through direct discussion rather than independent parallel annotation. We did not compute Cohen's kappa because the process was designed as iterative consensus rather than independent labelling. In the revision we will expand the methods section with full details on the number of clinicians, the meeting-based process, and consensus resolution, while adding an explicit statement that formal IAA metrics were not calculated and noting this as a limitation of the gold standard. Revision: yes.
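For readers wanting the omitted metric: had the two pathologists labelled a subset independently, Cohen's kappa per entity would take only a few lines to compute. A from-scratch sketch with hypothetical labels:

```python
from collections import Counter

def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items where the annotators match.
    p_obs = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_exp = sum(freq_a[l] * freq_b[l] for l in freq_a.keys() | freq_b.keys()) / n**2
    return (p_obs - p_exp) / (1 - p_exp)

# Hypothetical independent labels for one entity across five reports.
a = ["fsgs", "mcd", "fsgs", "normal", "fsgs"]
b = ["fsgs", "mcd", "normal", "normal", "fsgs"]
print(f"kappa = {cohens_kappa(a, b):.2f}")  # 0.69 on this toy data
```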
Referee: [Abstract] All evaluation data originates from a single institution and time period (Great Ormond Street Hospital paediatric renal biopsies), with no external validation set; this directly weakens the generalizability assertion that the workflow (and reported gains from guidelines/few-shot) will transfer to other clinical reporting conventions or sites.
Authors: We agree that the single-centre design limits claims of broad generalizability. As a proof-of-concept study using only the GOSH renal biopsy corpus, external validation data were not available. In the revised manuscript we will moderate the abstract and discussion language to clearly state the single-institution limitation, remove or qualify any overstatements of transferability, and add a dedicated paragraph outlining the need for future multi-site studies to assess performance across different reporting conventions. Revision: yes.
Circularity Check
No circularity: empirical accuracy measured against independent human annotations
Full rationale
The paper describes an empirical workflow in which 400 paediatric renal biopsy reports are manually annotated by clinicians as a gold standard, followed by evaluation of five instruction-tuned SLMs on that fixed set using accuracy, with ablations for entity guidelines and few-shot examples. No equations, fitted parameters, or derivation steps appear in the provided text. The reported 84.3% accuracy (Gemma 2 2B) is a direct comparison to the externally created human labels rather than a quantity that reduces to the model inputs or prior self-citations by construction. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling are present. The evaluation is therefore self-contained against the stated gold standard and receives the default non-circularity finding.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "We frame extraction as a Question-Answering task grounded by clinician-guided entity guidelines and few-shot examples, evaluating five instruction-tuned SLMs with a disagreement modelling framework"
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · tag: unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "Gemma 2 2B achieves the highest accuracy at 84.3%"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.