pith. machine review for the scientific record.

arxiv: 2605.03103 · v1 · submitted 2026-05-04 · 💻 cs.CL · cs.AI · cs.LG

Recognition: 2 theorem links · Lean Theorem

MedStruct-S: A Benchmark for Key Discovery, Key-Conditioned QA and Semi-Structured Extraction from OCR Clinical Reports

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 18:07 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords MedStruct-S · semi-structured information extraction · OCR clinical reports · key discovery · key-conditioned QA · encoder-only models · decoder-only models · medical document processing
0 comments

The pith

The MedStruct-S benchmark shows that encoder-only models lead on key-conditioned QA over noisy OCR clinical reports even when they are substantially smaller than decoder-only models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates MedStruct-S with 3,582 real clinical report pages to test key discovery, key-conditioned question answering, and full key-value extraction when keys are unknown in advance and OCR introduces noise. It evaluates encoder-only sequence labeling models against decoder-only structured generation models ranging from 0.11B to 103B parameters. Encoder-only models deliver the highest accuracy on non-null key-conditioned QA despite much smaller size, and they still outperform comparably sized decoder models overall. Fine-tuned decoder-only models achieve the strongest results across all tasks when model scale is not held constant. This setup gives a direct way to pick architectures for turning messy scanned medical documents into structured patient histories.

Core claim

MedStruct-S supplies annotated real-world OCR clinical report pages for three tasks under unknown keys and noise: field-header discovery, key-conditioned QA, and end-to-end key-value pair extraction. Benchmarking four encoder-only and five decoder-only models shows that encoder-only models achieve the best performance for non-null-value key-conditioned QA despite being substantially smaller than decoder-only models; when model sizes are comparable, encoder-only models still perform better overall; and fine-tuned decoder-only models deliver the strongest overall results without controlling for scale.
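
To make the three task interfaces concrete, here is a minimal illustration in Python; the page text, keys, and values are invented for exposition and are not drawn from MedStruct-S.

```python
# Hypothetical OCR-derived page text (invented example, not from MedStruct-S).
# Field headers (keys) precede patient-specific values, separated by delimiters
# such as colons; OCR noise ("Dlagnosis", "O" for "0") is deliberately included.
page = "Patient Name: ZHANG SAN\nDlagnosis: lung adenocarcinoma\nAdmission Date 2023-O8-12"

# Task 1 (Key Discovery): predict the set of field headers present on the page.
gold_keys = {"Patient Name", "Diagnosis", "Admission Date"}

# Task 2 (Key-Conditioned QA): given the page and one key, return its value,
# or None when the queried key has no value on this page (the null-value case).
gold_qa = {
    "Diagnosis": "lung adenocarcinoma",
    "Discharge Date": None,  # queried key absent from this page
}

# Task 3 (End-to-end extraction): return all key-value pairs jointly.
gold_kv = {
    "Patient Name": "ZHANG SAN",
    "Diagnosis": "lung adenocarcinoma",
    "Admission Date": "2023-O8-12",  # value kept with its OCR noise
}
```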

What carries the argument

The MedStruct-S benchmark dataset and its three evaluation tasks, used to compare encoder-only sequence labeling with post-processing against decoder-only structured generation on OCR-derived clinical reports.
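
As a rough sketch of how the two paradigms yield comparable outputs, the snippet below assumes a BIO tagging scheme on the encoder-only side and a JSON-emitting prompt on the decoder-only side; both are common conventions, and the paper's exact label scheme and output format are not specified in this summary.

```python
import json

# Encoder-only paradigm (sketch): a token classifier emits BIO tags over the
# OCR text; post-processing groups tagged spans into key-value pairs.
# The tags below stand in for model predictions; they are illustrative only.
tokens = ["Diagnosis", ":", "lung", "adenocarcinoma"]
tags   = ["B-KEY",     "O", "B-VAL", "I-VAL"]

def spans(tokens, tags, label):
    """Collect contiguous B-/I- spans carrying the given label."""
    out, cur = [], []
    for tok, tag in zip(tokens, tags):
        if tag == f"B-{label}":
            if cur:
                out.append(" ".join(cur))
            cur = [tok]
        elif tag == f"I-{label}" and cur:
            cur.append(tok)
        else:
            if cur:
                out.append(" ".join(cur))
            cur = []
    if cur:
        out.append(" ".join(cur))
    return out

keys, values = spans(tokens, tags, "KEY"), spans(tokens, tags, "VAL")
encoder_pairs = dict(zip(keys, values))   # {'Diagnosis': 'lung adenocarcinoma'}

# Decoder-only paradigm (sketch): a generative model is prompted to emit the
# pairs directly as JSON; the string below stands in for the model completion.
completion = '{"Diagnosis": "lung adenocarcinoma"}'
decoder_pairs = json.loads(completion)

print(encoder_pairs == decoder_pairs)     # True for this toy page
```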

If this is right

  • Encoder-only models can be selected for efficient key-conditioned QA on clinical documents when computational resources are limited.
  • Fine-tuning decoder-only models yields the highest end-to-end extraction accuracy across heterogeneous keys and OCR noise.
  • Model comparisons for semi-structured IE must separate the effects of architecture from those of parameter count.
  • Benchmarks that ignore unknown keys and OCR artifacts will overestimate performance in actual medical record processing.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Task-specific smaller models may offer better practical trade-offs than scaling up general decoder models for targeted extraction subtasks.
  • The same evaluation approach could be applied to semi-structured documents outside medicine to test whether the encoder advantage holds in other domains.
  • Hybrid systems that combine encoder labeling for QA with decoder generation for final formatting might capture the best of both observed behaviors.

Load-bearing premise

The 3,582 annotated pages and their task definitions accurately represent real-world clinical report heterogeneity, OCR noise distributions, and annotation quality without significant selection bias or labeling inconsistencies.

What would settle it

Running the same suite of encoder-only and decoder-only models on a fresh collection of clinical reports drawn from different hospitals or scanners and finding that decoder-only models now outperform encoders on non-null key-conditioned QA.

Figures

Figures reproduced from arXiv: 2605.03103 by Haiyang Qian, Yingyun Li, Yu Wang.

Figure 1. A clinical report page and its annotated key–value pairs important for obtaining medical histories. This process begins with OCR on images, followed by model-based extraction on the OCR-derived text. Clinical reports usually have an explicit layout: clinical concepts appear as field headers, followed by patient-specific text, separated by delimiters such as colons [9, 12]. view at source ↗
Figure 2. We collect clinical reports from cancer patient care programs and run OCR on 3,582 pages. Through a 560-person-day annotation effort, this yields MedStruct-S, a semi-structured benchmark built from OCR-derived clinical reports. For privacy and compliance, a de-identified version, MedStruct-S (De-ID), is obtained by replacing sensitive information (e.g., patient IDs, birth dates, …). view at source ↗
Figure 2. MedStruct-S: data corpus, task definitions, and evaluation metrics. To ensure annotation quality, 20% of the samples were randomly selected for manual verification; MedStruct-S covers multiple categories of clinical reports. view at source ↗
Figure 3. Distribution of categories and statistics on text length in MedStruct-S. view at source ↗
Figure 4. Key frequency distribution in MedStruct-S. view at source ↗
Figure 5. Page-level text similarity between the original dataset and de-identified variants. view at source ↗
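
The captions above mention task definitions and evaluation metrics; one common scoring convention for key discovery and end-to-end extraction on benchmarks of this kind is set-level exact-match precision/recall/F1. The sketch below assumes that convention, which may differ from the metrics the paper actually uses.

```python
def set_prf(predicted, gold):
    """Exact-match precision/recall/F1 between two sets (keys or (key, value) pairs)."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy usage: score an end-to-end extraction against gold pairs (invented data).
pred = {("Diagnosis", "lung adenocarcinoma"), ("Admission Date", "2023-08-12")}
gold = {("Diagnosis", "lung adenocarcinoma"), ("Admission Date", "2023-O8-12"),
        ("Patient Name", "ZHANG SAN")}
print(set_prf(pred, gold))  # OCR noise in the gold value costs one exact match
```
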
read the original abstract

Semi-structured information extraction (IE) from OCR-derived clinical reports is crucial for efficiently reconstructing patients' longitudinal medical histories. In practice, this scenario commonly involves three tasks: (i) field-header (key) discovery, (ii) key-conditioned question answering (QA), and (iii) end-to-end key-value pair extraction. However, existing evaluations often under-model two factors: heterogeneous and incompletely known key representations, and OCR-induced noise. This makes it difficult to assess model robustness in real-world settings. We present MedStruct-S, a benchmark specifically designed to evaluate these tasks under unknown keys and OCR noise. MedStruct-S contains 3,582 annotated real-world clinical report pages. Using MedStruct-S, we benchmark two representative paradigms: encoder-only sequence labeling with post-processing and decoder-only structured generation, covering four encoder-only and five decoder-only models spanning 0.11B to 103B parameters. Our results show that encoder-only models achieve the best performance for non-null-value key-conditioned QA despite being substantially smaller than decoder-only models. When comparing models of similar order of magnitude, encoder-only models still perform better overall. Without controlling for model scale, fine-tuned decoder-only models deliver the strongest overall results. These findings show that the benchmark provides a reliable and practical basis for selecting and comparing models across different semi-structured IE settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces MedStruct-S, a benchmark consisting of 3,582 annotated real-world OCR clinical report pages for three tasks: field-header (key) discovery, key-conditioned QA, and end-to-end semi-structured key-value extraction under unknown keys and OCR noise. It evaluates four encoder-only sequence-labeling models and five decoder-only structured-generation models (0.11B–103B parameters), reporting that encoder-only models achieve the best performance on non-null-value key-conditioned QA despite smaller size, that encoder-only models still outperform at comparable scale, and that fine-tuned decoder-only models yield the strongest overall results when scale is uncontrolled.

Significance. If the benchmark's construction and annotations are shown to be representative, the work supplies a practical, task-specific resource for model selection in clinical semi-structured IE and clarifies paradigm-level trade-offs between encoder-only and decoder-only approaches. The explicit empirical comparisons across model families and sizes constitute a concrete contribution that can inform deployment decisions in medical informatics pipelines.

major comments (1)
  1. [Dataset construction / abstract] The description of the 3,582-page dataset (abstract and dataset-construction section) provides no quantitative information on source diversity, inter-annotator agreement, annotation protocol, or OCR-error statistics. Because the central empirical claims—encoder-only superiority on non-null key-conditioned QA and the overall ranking of paradigms—are derived exclusively from performance on this fixed collection, the absence of these diagnostics leaves open the possibility that observed differences are artifacts of limited report heterogeneity or labeling inconsistencies rather than robust modeling properties.
minor comments (1)
  1. [Abstract] The abstract states comparative results but omits the concrete metric values, confidence intervals, or statistical tests that would allow readers to gauge effect sizes directly.
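
Editorial aside: one lightweight way to supply the effect-size context the referee asks for, assuming per-page scores are available, is a percentile bootstrap over pages; the sketch below is illustrative and uses invented numbers, not values from the paper.

```python
import random

def bootstrap_ci(scores, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-page scores (e.g., per-page F1)."""
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(
        sum(rng.choice(scores) for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(scores) / n, (lo, hi)

# Toy usage with invented per-page F1 scores for one system.
encoder_f1 = [0.91, 0.88, 0.95, 0.79, 0.90]
print(bootstrap_ci(encoder_f1))  # mean with a 95% percentile interval
```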

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback and recommendation for major revision. We agree that the dataset description requires additional quantitative details to strengthen the paper and will revise accordingly.

read point-by-point responses
  1. Referee: [Dataset construction / abstract] The description of the 3,582-page dataset (abstract and dataset-construction section) provides no quantitative information on source diversity, inter-annotator agreement, annotation protocol, or OCR-error statistics. Because the central empirical claims—encoder-only superiority on non-null key-conditioned QA and the overall ranking of paradigms—are derived exclusively from performance on this fixed collection, the absence of these diagnostics leaves open the possibility that observed differences are artifacts of limited report heterogeneity or labeling inconsistencies rather than robust modeling properties.

    Authors: We agree that the current manuscript lacks these quantitative diagnostics in the dataset-construction section, which is a valid concern for assessing benchmark robustness. In the revised version we will expand that section with: (1) source diversity statistics (e.g., number of contributing institutions, distribution of report types and specialties); (2) inter-annotator agreement scores (e.g., Cohen's kappa on a double-annotated subset); (3) the complete annotation protocol, including guidelines for handling ambiguous or OCR-degraded keys; and (4) OCR-error statistics (e.g., average character and word error rates plus common error categories). These additions will directly address the possibility of data artifacts and support the validity of the reported model comparisons. revision: yes
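
Editorial aside: two of the diagnostics promised above, inter-annotator agreement and OCR error rates, can be computed as sketched below (Cohen's kappa on a double-annotated binary decision, and character error rate via edit distance); the data are invented and the authors' actual protocol may differ.

```python
def cohens_kappa(a, b):
    """Cohen's kappa for two annotators' binary labels on the same items."""
    assert len(a) == len(b)
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    p_a1, p_b1 = sum(a) / n, sum(b) / n
    expected = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)
    return (observed - expected) / (1 - expected)

def cer(reference, hypothesis):
    """Character error rate: Levenshtein distance divided by reference length."""
    m, n = len(reference), len(hypothesis)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = reference[i - 1] != hypothesis[j - 1]
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[m][n] / m

# Toy usage: agreement on "is this token part of a key?" and OCR noise on one field.
print(cohens_kappa([1, 0, 1, 1, 0], [1, 0, 1, 0, 0]))  # ~0.62
print(cer("Diagnosis", "Dlagnosis"))                   # ~0.11
```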

Circularity Check

0 steps flagged

No circularity: direct empirical benchmarking on newly introduced dataset

full rationale

The paper introduces the MedStruct-S benchmark containing 3,582 annotated real-world clinical report pages and evaluates encoder-only and decoder-only models on key discovery, key-conditioned QA, and end-to-end extraction tasks under OCR noise. All reported results consist of direct performance metrics from these evaluations, with no equations, fitted parameters, or derivations that reduce to self-defined quantities or prior self-citations. The central claims (e.g., encoder-only models outperforming on non-null QA) are empirical observations on the fixed benchmark rather than any self-referential construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on standard domain assumptions about clinical IE tasks and benchmarking practices without introducing fitted parameters, new physical entities, or ad-hoc inventions; it evaluates existing model paradigms on new data.

axioms (1)
  • domain assumption: The three tasks of field-header discovery, key-conditioned QA, and end-to-end key-value extraction are the primary practical needs for semi-structured IE from OCR clinical reports.
    Explicitly stated in the abstract as the common scenario in practice.

pith-pipeline@v0.9.0 · 5554 in / 1281 out tokens · 130030 ms · 2026-05-08T18:07:08.781620+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

29 extracted references · 19 canonical work pages · 2 internal anchors

  1. [1] Baidu AI Cloud: Baidu OCR technical documentation. https://ai.baidu.com/tech/ocr (2025), accessed: 2025

  2. [2] Bhattacharyya, A., Tripathi, A., Das, U., Karmakar, A., Pathak, A., Gupta, M.: Information extraction from visually rich documents using LLM-based organization of documents into independent textual segments. pp. 17241–17256. Association for Computational Linguistics, Vienna, Austria (Jul 2025). https://doi.org/10.18653/v1/2025.acl-long.844

  3. [3] Chen, W., Li, Z., Fang, H., Yao, Q., Zhong, C., Hao, J., Zhang, Q., Huang, X., Peng, J., Shao, Z.: A benchmark for automatic medical consultation system: frameworks, tasks and datasets. Bioinformatics 39(1), btac817 (12 2022). https://doi.org/10.1093/bioinformatics/btac817

  4. [4] Cui, Y., Che, W., Liu, T., Qin, B., Wang, S., Hu, G.: Revisiting pre-trained models for Chinese natural language processing. In: Findings of the Association for Computational Linguistics: EMNLP 2020. pp. 657–668 (2020)

  5. [5] Cui, Y., Che, W., Liu, T., Qin, B., Yang, Z.: Pre-training with whole word masking for Chinese BERT. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3504–3514 (2021)

  6. [6] Dai, Z., Wang, X., Ni, P., Li, Y., Li, G., Bai, X.: Named entity recognition using BERT-BiLSTM-CRF for Chinese electronic health records. pp. 1–5 (10 2019). https://doi.org/10.1109/CISP-BMEI48845.2019.8965823

  7. [7] Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805 (2018), http://arxiv.org/abs/1810.04805

  8. [8] Duan, Y., Chen, Z., Hu, Y., Wang, W., Ye, S., Shi, B., Lu, L., Hou, Q., Lu, T., Li, H., Dai, J., Wang, W.: Docopilot: Improving multimodal models for document-level understanding. In: 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 4026–4037 (2025). https://doi.org/10.1109/CVPR52734.2025.00381

  9. [9] Fu, S., Chen, D., He, H., Liu, S., Moon, S., Peterson, K.J., Shen, F., Wang, L., Wang, Y., Wen, A., Zhao, Y., Sohn, S., Liu, H.: Clinical concept extraction: A methodology review. Journal of Biomedical Informatics 109, 103526 (Sep 2020). https://doi.org/10.1016/j.jbi.2020.103526

  10. [10] Group, A.: Antangelmed: A large-scale medical MoE model. Hugging Face Repository (2025)

  11. [11] Guan, T., Wang, Q., Guo, Z., et al.: CMeIE: Construction and evaluation of Chinese medical information extraction dataset. In: Natural Language Processing and Chinese Computing (NLPCC). pp. 270–282. Springer (2020)

  12. [12] Guillaume Jaume, Hazim Kemal Ekenel, J.P.T.: FUNSD: A dataset for form understanding in noisy scanned documents. In: Accepted to ICDAR-OST (2019)

  13. [13] Jain, S., Agrawal, A., Saporta, A., Truong, S.Q., Duong, D.N., Bui, T., Chambon, P., Zhang, Y., Lungren, M.P., Ng, A.Y., et al.: RadGraph: Extracting clinical entities and relations from radiology reports. arXiv preprint arXiv:2106.14463 (2021)

  14. [14] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A robustly optimized BERT pretraining approach (2019), https://arxiv.org/abs/1907.11692

  15. [15] Lu, Y., Liu, Q., Dai, D., Xiao, X., Lin, H., Han, X., Sun, L., Wu, H.: Unified structure generation for universal information extraction. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 5755–5772 (2022)

  16. [16] Ouyang, L., Qu, Y., Zhou, H., Zhu, J., Zhang, R., Lin, Q., Wang, B., Zhao, Z., Jiang, M., Zhao, X., Shi, J., Wu, F., Chu, P., Liu, M., Li, Z., Xu, C., Zhang, B., Shi, B., Tu, Z., He, C.: OmniDocBench: Benchmarking diverse PDF document parsing with comprehensive annotations (2024), https://arxiv.org/abs/2412.07626

  17. [17] Tanwar, E., Dutta, S., Borthakur, M., Chakraborty, T.: Multilingual LLMs are better cross-lingual in-context learners with alignment. In: Rogers, A., Boyd-Graber, J., Okazaki, N. (eds.) Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 6292–

  18. [18] Multilingual LLMs are better cross-lingual in-context learners with alignment. Association for Computational Linguistics, Toronto, Canada (Jul 2023). https://doi.org/10.18653/v1/2023.acl-long.346, https://aclanthology.org/2023.acl-long.346/

  19. [19] Team, B.M.: Baichuan-M2: Scaling medical capability with large verifier system. arXiv preprint arXiv:2501.00000 (2025)

  20. [20] Team, Q.: Qwen2.5 technical report. arXiv preprint arXiv:2412.15115 (2024)

  21. [21] Tkachenko, M., Malyuk, M., Holmanyuk, A., Liubimov, N.: Label Studio: Data labeling software (2020–2025), open source software available from https://github.com/HumanSignal/label-studio

  22. [22] Wang, X., Zhou, W., Zu, C., Xia, H., Chen, T., Zhang, Y., Zheng, R., Ye, J., Zhang, Q., Gui, T., et al.: InstructUIE: Multi-task instruction tuning for unified information extraction. arXiv preprint arXiv:2304.08085 (2023)

  23. [23] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., Davison, J., Shleifer, S., von Platen, P., Ma, C., Jernite, Y., Plu, J., Xu, C., Le Scao, T., Gugger, S., Drame, M., Lhoest, Q., Rush, A.: Transformers: State-of-the-art natural language processing. In: Liu, Q., Schlangen, D. (eds.) Proceedi... https://doi.org/10.18653/v1/2020.emnlp-demos.6

  24. [24] Xu, Z., Gong, L., Ke, G., He, D., Zheng, S., Wang, L., Bian, J., Liu, T.Y.: MC-BERT: Efficient language pre-training via a meta controller (2020), https://arxiv.org/abs/2006.05744

  25. [25] Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., Zheng, C., Liu, D., Zhou, F., Huang, F., Hu, F., Ge, H., Wei, H., Lin, H., Tang, J., Yang, J., Tu, J., Zhang, J., Yang, J., Yang, J., Zhou, J., Zhou, J., Lin, J., Dang, K., Bao, K., Yang, K., Yu, L., Deng, L.C., Li, M., Xue, M., Li, M., Zhang, P., Wang, P., Zhu, ...: Qwen3 technical report.

  26. [26] Yang, X., Zhao, X., Shen, Z.: EHRStruct: A comprehensive benchmark framework for evaluating large language models on structured electronic health record tasks. arXiv abs/2511.08206 (2025), https://api.semanticscholar.org/CorpusID:282922202

  27. [27] Zhang, N., Chen, M., Bi, Z., Liang, X., Li, L., Shang, X., Xuan, K., Zhao, J., Li, H., Huang, C.H., Ni, J., Shao, G., Chen, L., Tou, H., Huang, G., Chen, H.: CBLUE: A Chinese biomedical language understanding evaluation benchmark. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 7888–...

  28. [28] Zhang, N., Chen, M., Bi, Z., Liang, X., Li, L., Shang, X., Xuan, K., Zhao, J., Li, H., Huang, C.H., et al.: CBLUE benchmark: Technical report. arXiv preprint arXiv:2106.08087 (2021)

  29. [29] Zhu, W., Hou, G., Chen, M., Zhang, N., et al.: PromptCBLUE: A Chinese prompt tuning benchmark for the medical domain. arXiv preprint arXiv:2310.14151 (2023)