pith. machine review for the scientific record.

arxiv: 2605.03103 · v1 · submitted 2026-05-04 · 💻 cs.CL · cs.AI · cs.LG

Recognition: 2 theorem links · Lean Theorem

MedStruct-S: A Benchmark for Key Discovery, Key-Conditioned QA and Semi-Structured Extraction from OCR Clinical Reports

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 18:07 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords MedStruct-S · semi-structured information extraction · OCR clinical reports · key discovery · key-conditioned QA · encoder-only models · decoder-only models · medical document processing
0 comments

The pith

The MedStruct-S benchmark shows that encoder-only models lead on key-conditioned QA over noisy OCR clinical reports even when they are substantially smaller than decoder-only models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates MedStruct-S with 3,582 real clinical report pages to test key discovery, key-conditioned question answering, and full key-value extraction when keys are unknown in advance and OCR introduces noise. It evaluates encoder-only sequence labeling models against decoder-only structured generation models ranging from 0.11B to 103B parameters. Encoder-only models deliver the highest accuracy on non-null key-conditioned QA despite much smaller size, and they still outperform comparably sized decoder models overall. Fine-tuned decoder-only models achieve the strongest results across all tasks when model scale is not held constant. This setup gives a direct way to pick architectures for turning messy scanned medical documents into structured patient histories.

Core claim

MedStruct-S supplies annotated real-world OCR clinical report pages for three tasks under unknown keys and noise: field-header discovery, key-conditioned QA, and end-to-end key-value pair extraction. Benchmarking four encoder-only and five decoder-only models shows that encoder-only models achieve the best performance for non-null-value key-conditioned QA despite being substantially smaller than decoder-only models; when model sizes are comparable, encoder-only models still perform better overall; and fine-tuned decoder-only models deliver the strongest overall results without controlling for scale.
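
To make the three task interfaces concrete, here is a minimal illustration in Python; the page text, keys, and values are invented for exposition and are not drawn from MedStruct-S.

```python
# Hypothetical OCR-derived page text (invented example, not from MedStruct-S).
# Field headers (keys) precede patient-specific values, separated by delimiters
# such as colons; OCR noise ("Dlagnosis", "O" for "0") is deliberately included.
page = "Patient Name: ZHANG SAN\nDlagnosis: lung adenocarcinoma\nAdmission Date 2023-O8-12"

# Task 1 (Key Discovery): predict the set of field headers present on the page.
gold_keys = {"Patient Name", "Diagnosis", "Admission Date"}

# Task 2 (Key-Conditioned QA): given the page and one key, return its value,
# or None when the queried key has no value on this page (the null-value case).
gold_qa = {
    "Diagnosis": "lung adenocarcinoma",
    "Discharge Date": None,  # queried key absent from this page
}

# Task 3 (End-to-end extraction): return all key-value pairs jointly.
gold_kv = {
    "Patient Name": "ZHANG SAN",
    "Diagnosis": "lung adenocarcinoma",
    "Admission Date": "2023-O8-12",  # value kept with its OCR noise
}
```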

What carries the argument

The MedStruct-S benchmark dataset and its three evaluation tasks, used to compare encoder-only sequence labeling with post-processing against decoder-only structured generation on OCR-derived clinical reports.
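
As a rough sketch of how the two paradigms yield comparable outputs, the snippet below assumes a BIO tagging scheme on the encoder-only side and a JSON-emitting prompt on the decoder-only side; both are common conventions, and the paper's exact label scheme and output format are not specified in this summary.

```python
import json

# Encoder-only paradigm (sketch): a token classifier emits BIO tags over the
# OCR text; post-processing groups tagged spans into key-value pairs.
# The tags below stand in for model predictions; they are illustrative only.
tokens = ["Diagnosis", ":", "lung", "adenocarcinoma"]
tags   = ["B-KEY",     "O", "B-VAL", "I-VAL"]

def spans(tokens, tags, label):
    """Collect contiguous B-/I- spans carrying the given label."""
    out, cur = [], []
    for tok, tag in zip(tokens, tags):
        if tag == f"B-{label}":
            if cur:
                out.append(" ".join(cur))
            cur = [tok]
        elif tag == f"I-{label}" and cur:
            cur.append(tok)
        else:
            if cur:
                out.append(" ".join(cur))
            cur = []
    if cur:
        out.append(" ".join(cur))
    return out

keys, values = spans(tokens, tags, "KEY"), spans(tokens, tags, "VAL")
encoder_pairs = dict(zip(keys, values))   # {'Diagnosis': 'lung adenocarcinoma'}

# Decoder-only paradigm (sketch): a generative model is prompted to emit the
# pairs directly as JSON; the string below stands in for the model completion.
completion = '{"Diagnosis": "lung adenocarcinoma"}'
decoder_pairs = json.loads(completion)

print(encoder_pairs == decoder_pairs)     # True for this toy page
```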

If this is right

  • Encoder-only models can be selected for efficient key-conditioned QA on clinical documents when computational resources are limited.
  • Fine-tuning decoder-only models yields the highest end-to-end extraction accuracy across heterogeneous keys and OCR noise.
  • Model comparisons for semi-structured IE must separate the effects of architecture from those of parameter count.
  • Benchmarks that ignore unknown keys and OCR artifacts will overestimate performance in actual medical record processing.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Task-specific smaller models may offer better practical trade-offs than scaling up general decoder models for targeted extraction subtasks.
  • The same evaluation approach could be applied to semi-structured documents outside medicine to test whether the encoder advantage holds in other domains.
  • Hybrid systems that combine encoder labeling for QA with decoder generation for final formatting might capture the best of both observed behaviors.

Load-bearing premise

The 3,582 annotated pages and their task definitions accurately represent real-world clinical report heterogeneity, OCR noise distributions, and annotation quality without significant selection bias or labeling inconsistencies.

What would settle it

Running the same suite of encoder-only and decoder-only models on a fresh collection of clinical reports drawn from different hospitals or scanners and finding that decoder-only models now outperform encoders on non-null key-conditioned QA.

Figures

Figures reproduced from arXiv: 2605.03103 by Haiyang Qian, Yingyun Li, Yu Wang.

Figure 1. A clinical report page and its annotated key–value pairs important for obtaining medical histories. This process begins with OCR on images, followed by model-based extraction on the OCR-derived text. Clinical reports usually have an explicit layout: clinical concepts appear as field headers, followed by patient-specific text, separated by delimiters such as colons [9, 12]. view at source ↗
Figure 2. We collect clinical reports from cancer patient care programs and run OCR on 3,582 pages. Through a 560-person-day annotation effort, this yields MedStruct-S, a semi-structured benchmark built from OCR-derived clinical reports. For privacy and compliance, a de-identified version, MedStruct-S (De-ID), is obtained by replacing sensitive information (e.g., patient IDs, birth dates, …). view at source ↗
Figure 2. MedStruct-S: data corpus, task definitions, and evaluation metrics. To ensure annotation quality, 20% of the samples were randomly selected for manual verification; MedStruct-S covers multiple categories of clinical reports. view at source ↗
Figure 3. Distribution of categories and statistics on text length in MedStruct-S. view at source ↗
Figure 4. Key frequency distribution in MedStruct-S. view at source ↗
Figure 5. Page-level text similarity between the original dataset and de-identified variants. view at source ↗
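
The captions above mention task definitions and evaluation metrics; one common scoring convention for key discovery and end-to-end extraction on benchmarks of this kind is set-level exact-match precision/recall/F1. The sketch below assumes that convention, which may differ from the metrics the paper actually uses.

```python
def set_prf(predicted, gold):
    """Exact-match precision/recall/F1 between two sets (keys or (key, value) pairs)."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy usage: score an end-to-end extraction against gold pairs (invented data).
pred = {("Diagnosis", "lung adenocarcinoma"), ("Admission Date", "2023-08-12")}
gold = {("Diagnosis", "lung adenocarcinoma"), ("Admission Date", "2023-O8-12"),
        ("Patient Name", "ZHANG SAN")}
print(set_prf(pred, gold))  # OCR noise in the gold value costs one exact match
```
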
read the original abstract

Semi-structured information extraction (IE) from OCR-derived clinical reports is crucial for efficiently reconstructing patients' longitudinal medical histories. In practice, this scenario commonly involves three tasks: (i) field-header (key) discovery, (ii) key-conditioned question answering (QA), and (iii) end-to-end key-value pair extraction. However, existing evaluations often under-model two factors: heterogeneous and incompletely known key representations, and OCR-induced noise. This makes it difficult to assess model robustness in real-world settings. We present MedStruct-S, a benchmark specifically designed to evaluate these tasks under unknown keys and OCR noise. MedStruct-S contains 3,582 annotated real-world clinical report pages. Using MedStruct-S, we benchmark two representative paradigms: encoder-only sequence labeling with post-processing and decoder-only structured generation, covering four encoder-only and five decoder-only models spanning 0.11B to 103B parameters. Our results show that encoder-only models achieve the best performance for non-null-value key-conditioned QA despite being substantially smaller than decoder-only models. When comparing models of similar order of magnitude, encoder-only models still perform better overall. Without controlling for model scale, fine-tuned decoder-only models deliver the strongest overall results. These findings show that the benchmark provides a reliable and practical basis for selecting and comparing models across different semi-structured IE settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces MedStruct-S, a benchmark consisting of 3,582 annotated real-world OCR clinical report pages for three tasks: field-header (key) discovery, key-conditioned QA, and end-to-end semi-structured key-value extraction under unknown keys and OCR noise. It evaluates four encoder-only sequence-labeling models and five decoder-only structured-generation models (0.11B–103B parameters), reporting that encoder-only models achieve the best performance on non-null-value key-conditioned QA despite smaller size, that encoder-only models still outperform at comparable scale, and that fine-tuned decoder-only models yield the strongest overall results when scale is uncontrolled.

Significance. If the benchmark's construction and annotations are shown to be representative, the work supplies a practical, task-specific resource for model selection in clinical semi-structured IE and clarifies paradigm-level trade-offs between encoder-only and decoder-only approaches. The explicit empirical comparisons across model families and sizes constitute a concrete contribution that can inform deployment decisions in medical informatics pipelines.

major comments (1)
  1. [Dataset construction / abstract] The description of the 3,582-page dataset (abstract and dataset-construction section) provides no quantitative information on source diversity, inter-annotator agreement, annotation protocol, or OCR-error statistics. Because the central empirical claims—encoder-only superiority on non-null key-conditioned QA and the overall ranking of paradigms—are derived exclusively from performance on this fixed collection, the absence of these diagnostics leaves open the possibility that observed differences are artifacts of limited report heterogeneity or labeling inconsistencies rather than robust modeling properties.
minor comments (1)
  1. [Abstract] The abstract states comparative results but omits the concrete metric values, confidence intervals, or statistical tests that would allow readers to gauge effect sizes directly.
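
Editorial aside: one lightweight way to supply the effect-size context the referee asks for, assuming per-page scores are available, is a percentile bootstrap over pages; the sketch below is illustrative and uses invented numbers, not values from the paper.

```python
import random

def bootstrap_ci(scores, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-page scores (e.g., per-page F1)."""
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(
        sum(rng.choice(scores) for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(scores) / n, (lo, hi)

# Toy usage with invented per-page F1 scores for one system.
encoder_f1 = [0.91, 0.88, 0.95, 0.79, 0.90]
print(bootstrap_ci(encoder_f1))  # mean with a 95% percentile interval
```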

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback and recommendation for major revision. We agree that the dataset description requires additional quantitative details to strengthen the paper and will revise accordingly.

read point-by-point responses
  1. Referee: [Dataset construction / abstract] The description of the 3,582-page dataset (abstract and dataset-construction section) provides no quantitative information on source diversity, inter-annotator agreement, annotation protocol, or OCR-error statistics. Because the central empirical claims—encoder-only superiority on non-null key-conditioned QA and the overall ranking of paradigms—are derived exclusively from performance on this fixed collection, the absence of these diagnostics leaves open the possibility that observed differences are artifacts of limited report heterogeneity or labeling inconsistencies rather than robust modeling properties.

    Authors: We agree that the current manuscript lacks these quantitative diagnostics in the dataset-construction section, which is a valid concern for assessing benchmark robustness. In the revised version we will expand that section with: (1) source diversity statistics (e.g., number of contributing institutions, distribution of report types and specialties); (2) inter-annotator agreement scores (e.g., Cohen's kappa on a double-annotated subset); (3) the complete annotation protocol, including guidelines for handling ambiguous or OCR-degraded keys; and (4) OCR-error statistics (e.g., average character and word error rates plus common error categories). These additions will directly address the possibility of data artifacts and support the validity of the reported model comparisons. revision: yes
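
Editorial aside: two of the diagnostics promised above, inter-annotator agreement and OCR error rates, can be computed as sketched below (Cohen's kappa on a double-annotated binary decision, and character error rate via edit distance); the data are invented and the authors' actual protocol may differ.

```python
def cohens_kappa(a, b):
    """Cohen's kappa for two annotators' binary labels on the same items."""
    assert len(a) == len(b)
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    p_a1, p_b1 = sum(a) / n, sum(b) / n
    expected = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)
    return (observed - expected) / (1 - expected)

def cer(reference, hypothesis):
    """Character error rate: Levenshtein distance divided by reference length."""
    m, n = len(reference), len(hypothesis)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = reference[i - 1] != hypothesis[j - 1]
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[m][n] / m

# Toy usage: agreement on "is this token part of a key?" and OCR noise on one field.
print(cohens_kappa([1, 0, 1, 1, 0], [1, 0, 1, 0, 0]))  # ~0.62
print(cer("Diagnosis", "Dlagnosis"))                   # ~0.11
```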

Circularity Check

0 steps flagged

No circularity: direct empirical benchmarking on newly introduced dataset

full rationale

The paper introduces the MedStruct-S benchmark containing 3,582 annotated real-world clinical report pages and evaluates encoder-only and decoder-only models on key discovery, key-conditioned QA, and end-to-end extraction tasks under OCR noise. All reported results consist of direct performance metrics from these evaluations, with no equations, fitted parameters, or derivations that reduce to self-defined quantities or prior self-citations. The central claims (e.g., encoder-only models outperforming on non-null QA) are empirical observations on the fixed benchmark rather than any self-referential construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on standard domain assumptions about clinical IE tasks and benchmarking practices without introducing fitted parameters, new physical entities, or ad-hoc inventions; it evaluates existing model paradigms on new data.

axioms (1)
  • domain assumption: The three tasks of field-header discovery, key-conditioned QA, and end-to-end key-value extraction are the primary practical needs for semi-structured IE from OCR clinical reports.
    Explicitly stated in the abstract as the common scenario in practice.

pith-pipeline@v0.9.0 · 5554 in / 1281 out tokens · 130030 ms · 2026-05-08T18:07:08.781620+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

29 extracted references · 19 canonical work pages · 2 internal anchors

  1. [1] Baidu AI Cloud: Baidu OCR technical documentation. https://ai.baidu.com/tech/ocr (2025), accessed: 2025

  2. [2] Bhattacharyya, A., Tripathi, A., Das, U., Karmakar, A., Pathak, A., Gupta, M.: Information extraction from visually rich documents using LLM-based organization of documents into independent textual segments. pp. 17241–17256. Association for Computational Linguistics, Vienna, Austria (Jul 2025). https://doi.org/10.18653/v1/2025.acl-long.844

  3. [3] Chen, W., Li, Z., Fang, H., Yao, Q., Zhong, C., Hao, J., Zhang, Q., Huang, X., Peng, J., Shao, Z.: A benchmark for automatic medical consultation system: frameworks, tasks and datasets. Bioinformatics 39(1), btac817 (12 2022). https://doi.org/10.1093/bioinformatics/btac817

  4. [4] Cui, Y., Che, W., Liu, T., Qin, B., Wang, S., Hu, G.: Revisiting pre-trained models for Chinese natural language processing. In: Findings of the Association for Computational Linguistics: EMNLP 2020. pp. 657–668 (2020)

  5. [5] Cui, Y., Che, W., Liu, T., Qin, B., Yang, Z.: Pre-training with whole word masking for Chinese BERT. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3504–3514 (2021)

  6. [6] Dai, Z., Wang, X., Ni, P., Li, Y., Li, G., Bai, X.: Named entity recognition using BERT-BiLSTM-CRF for Chinese electronic health records. pp. 1–5 (10 2019). https://doi.org/10.1109/CISP-BMEI48845.2019.8965823

  7. [7] Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805 (2018), http://arxiv.org/abs/1810.04805

  8. [8] Duan, Y., Chen, Z., Hu, Y., Wang, W., Ye, S., Shi, B., Lu, L., Hou, Q., Lu, T., Li, H., Dai, J., Wang, W.: Docopilot: Improving multimodal models for document-level understanding. In: 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 4026–4037 (2025). https://doi.org/10.1109/CVPR52734.2025.00381

  9. [9] Fu, S., Chen, D., He, H., Liu, S., Moon, S., Peterson, K.J., Shen, F., Wang, L., Wang, Y., Wen, A., Zhao, Y., Sohn, S., Liu, H.: Clinical concept extraction: A methodology review. Journal of Biomedical Informatics 109, 103526 (Sep 2020). https://doi.org/10.1016/j.jbi.2020.103526

  10. [10] Group, A.: Antangelmed: A large-scale medical MoE model. Hugging Face Repository (2025)

  11. [11] Guan, T., Wang, Q., Guo, Z., et al.: CMeIE: Construction and evaluation of Chinese medical information extraction dataset. In: Natural Language Processing and Chinese Computing (NLPCC). pp. 270–282. Springer (2020)

  12. [12] Guillaume Jaume, Hazim Kemal Ekenel, J.P.T.: FUNSD: A dataset for form understanding in noisy scanned documents. In: Accepted to ICDAR-OST (2019)

  13. [13] Jain, S., Agrawal, A., Saporta, A., Truong, S.Q., Duong, D.N., Bui, T., Chambon, P., Zhang, Y., Lungren, M.P., Ng, A.Y., et al.: RadGraph: Extracting clinical entities and relations from radiology reports. arXiv preprint arXiv:2106.14463 (2021)

  14. [14] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A robustly optimized BERT pretraining approach (2019), https://arxiv.org/abs/1907.11692

  15. [15] Lu, Y., Liu, Q., Dai, D., Xiao, X., Lin, H., Han, X., Sun, L., Wu, H.: Unified structure generation for universal information extraction. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 5755–5772 (2022)

  16. [16] Ouyang, L., Qu, Y., Zhou, H., Zhu, J., Zhang, R., Lin, Q., Wang, B., Zhao, Z., Jiang, M., Zhao, X., Shi, J., Wu, F., Chu, P., Liu, M., Li, Z., Xu, C., Zhang, B., Shi, B., Tu, Z., He, C.: OmniDocBench: Benchmarking diverse PDF document parsing with comprehensive annotations (2024), https://arxiv.org/abs/2412.07626

  17. [17] Tanwar, E., Dutta, S., Borthakur, M., Chakraborty, T.: Multilingual LLMs are better cross-lingual in-context learners with alignment. In: Rogers, A., Boyd-Graber, J., Okazaki, N. (eds.) Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 6292–

  18. [18] Multilingual LLMs are better cross-lingual in-context learners with alignment. Association for Computational Linguistics, Toronto, Canada (Jul 2023). https://doi.org/10.18653/v1/2023.acl-long.346, https://aclanthology.org/2023.acl-long.346/

  19. [19] Team, B.M.: Baichuan-M2: Scaling medical capability with large verifier system. arXiv preprint arXiv:2501.00000 (2025)

  20. [20] Team, Q.: Qwen2.5 technical report. arXiv preprint arXiv:2412.15115 (2024)

  21. [21] Tkachenko, M., Malyuk, M., Holmanyuk, A., Liubimov, N.: Label Studio: Data labeling software (2020–2025), open source software available from https://github.com/HumanSignal/label-studio

  22. [22] Wang, X., Zhou, W., Zu, C., Xia, H., Chen, T., Zhang, Y., Zheng, R., Ye, J., Zhang, Q., Gui, T., et al.: InstructUIE: Multi-task instruction tuning for unified information extraction. arXiv preprint arXiv:2304.08085 (2023)

  23. [23] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., Davison, J., Shleifer, S., von Platen, P., Ma, C., Jernite, Y., Plu, J., Xu, C., Le Scao, T., Gugger, S., Drame, M., Lhoest, Q., Rush, A.: Transformers: State-of-the-art natural language processing. In: Liu, Q., Schlangen, D. (eds.) Proceedi... https://doi.org/10.18653/v1/2020.emnlp-demos.6

  24. [24] Xu, Z., Gong, L., Ke, G., He, D., Zheng, S., Wang, L., Bian, J., Liu, T.Y.: MC-BERT: Efficient language pre-training via a meta controller (2020), https://arxiv.org/abs/2006.05744

  25. [25] Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., Zheng, C., Liu, D., Zhou, F., Huang, F., Hu, F., Ge, H., Wei, H., Lin, H., Tang, J., Yang, J., Tu, J., Zhang, J., Yang, J., Yang, J., Zhou, J., Zhou, J., Lin, J., Dang, K., Bao, K., Yang, K., Yu, L., Deng, L.C., Li, M., Xue, M., Li, M., Zhang, P., Wang, P., Zhu, ...: Qwen3 technical report.

  26. [26] Yang, X., Zhao, X., Shen, Z.: EHRStruct: A comprehensive benchmark framework for evaluating large language models on structured electronic health record tasks. arXiv abs/2511.08206 (2025), https://api.semanticscholar.org/CorpusID:282922202

  27. [27] Zhang, N., Chen, M., Bi, Z., Liang, X., Li, L., Shang, X., Xuan, K., Zhao, J., Li, H., Huang, C.H., Ni, J., Shao, G., Chen, L., Tou, H., Huang, G., Chen, H.: CBLUE: A Chinese biomedical language understanding evaluation benchmark. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 7888–...

  28. [28] Zhang, N., Chen, M., Bi, Z., Liang, X., Li, L., Shang, X., Xuan, K., Zhao, J., Li, H., Huang, C.H., et al.: CBLUE benchmark: Technical report. arXiv preprint arXiv:2106.08087 (2021)

  29. [29] Zhu, W., Hou, G., Chen, M., Zhang, N., et al.: PromptCBLUE: A Chinese prompt tuning benchmark for the medical domain. arXiv preprint arXiv:2310.14151 (2023)