Recognition: 2 theorem links
MedStruct-S: A Benchmark for Key Discovery, Key-Conditioned QA and Semi-Structured Extraction from OCR Clinical Reports
Pith reviewed 2026-05-08 18:07 UTC · model grok-4.3
The pith
The MedStruct-S benchmark shows encoder-only models leading on key-conditioned QA over noisy OCR clinical reports even when far smaller than decoder-only models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MedStruct-S supplies annotated real-world OCR clinical report pages for three tasks under unknown keys and noise: field-header discovery, key-conditioned QA, and end-to-end key-value pair extraction. Benchmarking four encoder-only and five decoder-only models shows that encoder-only models achieve the best performance for non-null-value key-conditioned QA despite being substantially smaller than decoder-only models; when model sizes are comparable, encoder-only models still perform better overall; and fine-tuned decoder-only models deliver the strongest overall results without controlling for scale.
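To make the three task interfaces concrete, a hypothetical micro-example follows; the field names, values, and OCR string are invented for illustration and are not drawn from the benchmark.

```python
# Hypothetical I/O shapes for the three MedStruct-S tasks.
# All field names and values here are invented, not benchmark data.

ocr_page = "Pat1ent Name: ZHANG San   Adm1ssion Date: 2023-01-15 ..."  # OCR noise

# (i) Field-header (key) discovery: page text -> keys present on the page.
discovered_keys = {"Patient Name", "Admission Date"}

# (ii) Key-conditioned QA: (page text, key) -> value string, or None when
# the key has no value on this page (the "null-value" case).
qa_hit = {"key": "Admission Date", "answer": "2023-01-15"}
qa_null = {"key": "Discharge Diagnosis", "answer": None}

# (iii) End-to-end extraction: page text -> full key-value mapping.
extracted = {"Patient Name": "ZHANG San", "Admission Date": "2023-01-15"}
```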
What carries the argument
MedStruct-S benchmark dataset and its three evaluation tasks, used to compare encoder-only sequence labeling with post-processing against decoder-only structured generation on OCR-derived clinical reports.
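To ground the encoder-only paradigm, a minimal sketch of sequence labeling followed by post-processing is shown below. The BIO tag set (B-KEY/I-KEY, B-VAL/I-VAL) and the nearest-preceding-key merge rule are assumptions for illustration, not the paper's confirmed implementation.

```python
from typing import List, Tuple

def bio_to_spans(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Merge BIO tags (assumed scheme: B-KEY/I-KEY, B-VAL/I-VAL, O) into spans."""
    spans: List[Tuple[str, str]] = []
    cur_type, cur_toks = None, []
    for tok, lab in zip(tokens, labels):
        if lab.startswith("B-"):
            if cur_type:
                spans.append((cur_type, " ".join(cur_toks)))
            cur_type, cur_toks = lab[2:], [tok]
        elif lab.startswith("I-") and cur_type == lab[2:]:
            cur_toks.append(tok)
        else:  # "O" or an inconsistent tag closes any open span
            if cur_type:
                spans.append((cur_type, " ".join(cur_toks)))
            cur_type, cur_toks = None, []
    if cur_type:
        spans.append((cur_type, " ".join(cur_toks)))
    return spans

def pair_key_values(spans: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Naive post-processing: attach each VAL span to the nearest preceding KEY."""
    pairs, last_key = [], None
    for typ, text in spans:
        if typ == "KEY":
            last_key = text
        elif typ == "VAL" and last_key is not None:
            pairs.append((last_key, text))
    return pairs

# Example: ["Name", ":", "ZHANG", "San"] with ["B-KEY", "O", "B-VAL", "I-VAL"]
# yields [("Name", "ZHANG San")].
```

The decoder-only side, by contrast, prompts a generative model to emit the key-value mapping directly as structured text (e.g., JSON).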
If this is right
- Encoder-only models can be selected for efficient key-conditioned QA on clinical documents when computational resources are limited.
- Fine-tuning decoder-only models yields the highest end-to-end extraction accuracy across heterogeneous keys and OCR noise.
- Model comparisons for semi-structured IE must separate the effects of architecture from those of parameter count.
- Benchmarks that ignore unknown keys and OCR artifacts will overestimate performance in actual medical record processing.
Where Pith is reading between the lines
- Task-specific smaller models may offer better practical trade-offs than scaling up general decoder models for targeted extraction subtasks.
- The same evaluation approach could be applied to semi-structured documents outside medicine to test whether the encoder advantage holds in other domains.
- Hybrid systems that combine encoder labeling for QA with decoder generation for final formatting might capture the best of both observed behaviors.
Load-bearing premise
The 3,582 annotated pages and their task definitions accurately represent real-world clinical report heterogeneity, OCR noise distributions, and annotation quality without significant selection bias or labeling inconsistencies.
What would settle it
Running the same suite of encoder-only and decoder-only models on a fresh collection of clinical reports drawn from different hospitals or scanners and finding that decoder-only models now outperform encoders on non-null key-conditioned QA.
Original abstract
Semi-structured information extraction (IE) from OCR-derived clinical reports is crucial for efficiently reconstructing patients' longitudinal medical histories. In practice, this scenario commonly involves three tasks: (i) field-header (key) discovery, (ii) key-conditioned question answering (QA), and (iii) end-to-end key-value pair extraction. However, existing evaluations often under-model two factors: heterogeneous and incompletely known key representations, and OCR-induced noise. This makes it difficult to assess model robustness in real-world settings. We present MedStruct-S, a benchmark specifically designed to evaluate these tasks under unknown keys and OCR noise. MedStruct-S contains 3,582 annotated real-world clinical report pages. Using MedStruct-S, we benchmark two representative paradigms: encoder-only sequence labeling with post-processing and decoder-only structured generation, covering four encoder-only and five decoder-only models spanning 0.11B to 103B parameters. Our results show that encoder-only models achieve the best performance for non-null-value key-conditioned QA despite being substantially smaller than decoder-only models. When comparing models of similar order of magnitude, encoder-only models still perform better overall. Without controlling for model scale, fine-tuned decoder-only models deliver the strongest overall results. These findings show that the benchmark provides a reliable and practical basis for selecting and comparing models across different semi-structured IE settings.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces MedStruct-S, a benchmark consisting of 3,582 annotated real-world OCR clinical report pages for three tasks: field-header (key) discovery, key-conditioned QA, and end-to-end semi-structured key-value extraction under unknown keys and OCR noise. It evaluates four encoder-only sequence-labeling models and five decoder-only structured-generation models (0.11B–103B parameters), reporting that encoder-only models achieve the best performance on non-null-value key-conditioned QA despite smaller size, that encoder-only models still outperform at comparable scale, and that fine-tuned decoder-only models yield the strongest overall results when scale is uncontrolled.
Significance. If the benchmark's construction and annotations are shown to be representative, the work supplies a practical, task-specific resource for model selection in clinical semi-structured IE and clarifies paradigm-level trade-offs between encoder-only and decoder-only approaches. The explicit empirical comparisons across model families and sizes constitute a concrete contribution that can inform deployment decisions in medical informatics pipelines.
major comments (1)
- [Dataset construction / abstract] The description of the 3,582-page dataset (abstract and dataset-construction section) provides no quantitative information on source diversity, inter-annotator agreement, annotation protocol, or OCR-error statistics. Because the central empirical claims—encoder-only superiority on non-null key-conditioned QA and the overall ranking of paradigms—are derived exclusively from performance on this fixed collection, the absence of these diagnostics leaves open the possibility that observed differences are artifacts of limited report heterogeneity or labeling inconsistencies rather than robust modeling properties.
minor comments (1)
- [Abstract] The abstract states comparative results but omits the concrete metric values, confidence intervals, or statistical tests that would let readers gauge effect sizes directly (a generic bootstrap sketch follows this list).
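A generic way to supply the requested effect-size information is a paired bootstrap over pages. The sketch below is illustrative and is not the paper's evaluation code; scores_a and scores_b stand for hypothetical per-page metric values for two models.

```python
import random

def paired_bootstrap_diff(scores_a, scores_b, n_resamples=10_000, seed=0):
    """95% bootstrap CI for the mean per-page score difference A - B.

    scores_a, scores_b: per-page metric values for the same pages, same order.
    """
    rng = random.Random(seed)
    n = len(scores_a)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample pages with replacement
        diffs.append(sum(scores_a[i] - scores_b[i] for i in idx) / n)
    diffs.sort()
    return diffs[int(0.025 * n_resamples)], diffs[int(0.975 * n_resamples)]
```

A difference interval that excludes zero would let readers judge whether the reported encoder-decoder gap exceeds resampling noise.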
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and recommendation for major revision. We agree that the dataset description requires additional quantitative details to strengthen the paper and will revise accordingly.
Point-by-point responses
- Referee: [Dataset construction / abstract] The description of the 3,582-page dataset (abstract and dataset-construction section) provides no quantitative information on source diversity, inter-annotator agreement, annotation protocol, or OCR-error statistics. Because the central empirical claims—encoder-only superiority on non-null key-conditioned QA and the overall ranking of paradigms—are derived exclusively from performance on this fixed collection, the absence of these diagnostics leaves open the possibility that observed differences are artifacts of limited report heterogeneity or labeling inconsistencies rather than robust modeling properties.
Authors: We agree that the current manuscript lacks these quantitative diagnostics in the dataset-construction section, which is a valid concern for assessing benchmark robustness. In the revised version we will expand that section with: (1) source diversity statistics (e.g., number of contributing institutions, distribution of report types and specialties); (2) inter-annotator agreement scores (e.g., Cohen's kappa on a double-annotated subset; a minimal sketch of this computation follows below); (3) the complete annotation protocol, including guidelines for handling ambiguous or OCR-degraded keys; and (4) OCR-error statistics (e.g., average character and word error rates plus common error categories). These additions will directly address the possibility of data artifacts and support the validity of the reported model comparisons.
Revision: yes
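For item (2) of the response, here is a minimal sketch of the proposed agreement computation using scikit-learn's cohen_kappa_score on a double-annotated subset; the label values are illustrative, not from the dataset.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels assigned independently by two annotators to the same
# candidate regions, e.g. "KEY", "VALUE", or "O" (neither).
annotator_1 = ["KEY", "VALUE", "O", "KEY", "VALUE", "O", "O"]
annotator_2 = ["KEY", "VALUE", "O", "KEY", "O",     "O", "O"]

kappa = cohen_kappa_score(annotator_1, annotator_2)
print(f"Cohen's kappa on the double-annotated subset: {kappa:.3f}")
```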
Circularity Check
No circularity: direct empirical benchmarking on newly introduced dataset
Full rationale
The paper introduces the MedStruct-S benchmark containing 3,582 annotated real-world clinical report pages and evaluates encoder-only and decoder-only models on key discovery, key-conditioned QA, and end-to-end extraction tasks under OCR noise. All reported results consist of direct performance metrics from these evaluations, with no equations, fitted parameters, or derivations that reduce to self-defined quantities or prior self-citations. The central claims (e.g., encoder-only models outperforming on non-null QA) are empirical observations on the fixed benchmark rather than any self-referential construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The three tasks of field-header discovery, key-conditioned QA, and end-to-end key-value extraction are the primary practical needs for semi-structured IE from OCR clinical reports.
Lean theorems connected to this paper
- IndisputableMonolith.Cost (J(x) = ½(x + x⁻¹) − 1) · theorem Jcost_unit0 · tag: unclear
The relation between the paper passage and the cited Recognition theorem is unclear.
Linked passage: ϕ(u, v) = 1 − d_lev(u, v) / max(|u|, |v|)
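The linked passage is a normalized Levenshtein similarity, plausibly used for fuzzy key matching under OCR noise; its exact role in the paper is not quoted on this page. A self-contained sketch:

```python
def levenshtein(u: str, v: str) -> int:
    """Dynamic-programming edit distance d_lev(u, v)."""
    prev = list(range(len(v) + 1))
    for i, cu in enumerate(u, 1):
        cur = [i]
        for j, cv in enumerate(v, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (cu != cv)))   # substitution
        prev = cur
    return prev[-1]

def phi(u: str, v: str) -> float:
    """phi(u, v) = 1 - d_lev(u, v) / max(|u|, |v|); 1.0 for identical strings."""
    if not u and not v:
        return 1.0
    return 1.0 - levenshtein(u, v) / max(len(u), len(v))

# Example: one OCR substitution in a 12-character key.
# phi("Patient Name", "Pat1ent Name") == 1 - 1/12 ≈ 0.917
```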
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Baidu AI Cloud: Baidu OCR technical documentation. https://ai.baidu.com/tech/ocr (2025), accessed 2025
- [2] Bhattacharyya, A., Tripathi, A., Das, U., Karmakar, A., Pathak, A., Gupta, M.: Information extraction from visually rich documents using LLM-based organization of documents into independent textual segments. pp. 17241–17256. Association for Computational Linguistics, Vienna, Austria (Jul 2025). https://doi.org/10.18653/v1/2025.acl-long.844
- [3] Chen, W., Li, Z., Fang, H., Yao, Q., Zhong, C., Hao, J., Zhang, Q., Huang, X., Peng, J., Shao, Z.: A benchmark for automatic medical consultation system: frameworks, tasks and datasets. Bioinformatics 39(1), btac817 (Dec 2022). https://doi.org/10.1093/bioinformatics/btac817
- [4] Cui, Y., Che, W., Liu, T., Qin, B., Wang, S., Hu, G.: Revisiting pre-trained models for Chinese natural language processing. In: Findings of the Association for Computational Linguistics: EMNLP 2020. pp. 657–668 (2020)
- [5] Cui, Y., Che, W., Liu, T., Qin, B., Yang, Z.: Pre-training with whole word masking for Chinese BERT. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, 3504–3514 (2021)
- [6] Dai, Z., Wang, X., Ni, P., Li, Y., Li, G., Bai, X.: Named entity recognition using BERT-BiLSTM-CRF for Chinese electronic health records. pp. 1–5 (Oct 2019). https://doi.org/10.1109/CISP-BMEI48845.2019.8965823
- [7] Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805 (2018), http://arxiv.org/abs/1810.04805
- [8] Duan, Y., Chen, Z., Hu, Y., Wang, W., Ye, S., Shi, B., Lu, L., Hou, Q., Lu, T., Li, H., Dai, J., Wang, W.: Docopilot: Improving multimodal models for document-level understanding. In: 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 4026–4037 (2025). https://doi.org/10.1109/CVPR52734.2025.00381
- [9] Fu, S., Chen, D., He, H., Liu, S., Moon, S., Peterson, K.J., Shen, F., Wang, L., Wang, Y., Wen, A., Zhao, Y., Sohn, S., Liu, H.: Clinical concept extraction: A methodology review. Journal of Biomedical Informatics 109, 103526 (Sep 2020). https://doi.org/10.1016/j.jbi.2020.103526
- [10] Group, A.: AntAngelMed: A large-scale medical MoE model. Hugging Face repository (2025)
- [11] Guan, T., Wang, Q., Guo, Z., et al.: CMeIE: Construction and evaluation of Chinese medical information extraction dataset. In: Natural Language Processing and Chinese Computing (NLPCC). pp. 270–282. Springer (2020)
- [12] Jaume, G., Ekenel, H.K., Thiran, J.P.: FUNSD: A dataset for form understanding in noisy scanned documents. In: Accepted to ICDAR-OST (2019)
- [13] Jain, S., Agrawal, A., Saporta, A., Truong, S.Q., Duong, D.N., Bui, T., Chambon, P., Zhang, Y., Lungren, M.P., Ng, A.Y., et al.: RadGraph: Extracting clinical entities and relations from radiology reports. arXiv preprint arXiv:2106.14463 (2021)
- [14] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A robustly optimized BERT pretraining approach (2019), https://arxiv.org/abs/1907.11692
- [15] Lu, Y., Liu, Q., Dai, D., Xiao, X., Lin, H., Han, X., Sun, L., Wu, H.: Unified structure generation for universal information extraction. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 5755–5772 (2022)
- [16] Ouyang, L., Qu, Y., Zhou, H., Zhu, J., Zhang, R., Lin, Q., Wang, B., Zhao, Z., Jiang, M., Zhao, X., Shi, J., Wu, F., Chu, P., Liu, M., Li, Z., Xu, C., Zhang, B., Shi, B., Tu, Z., He, C.: OmniDocBench: Benchmarking diverse PDF document parsing with comprehensive annotations (2024), https://arxiv.org/abs/2412.07626
- [17] Tanwar, E., Dutta, S., Borthakur, M., Chakraborty, T.: Multilingual LLMs are better cross-lingual in-context learners with alignment. In: Rogers, A., Boyd-Graber, J., Okazaki, N. (eds.) Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 6292–. Association for Computational Linguistics, Toronto, Canada (Jul 2023). https://doi.org/10.18653/v1/2023.acl-long.346, https://aclanthology.org/2023.acl-long.346/
- [19] Team, B.M.: Baichuan-M2: Scaling medical capability with large verifier system. arXiv preprint arXiv:2501.00000 (2025)
- [20] Team, Q.: Qwen2.5 technical report. arXiv preprint arXiv:2412.15115 (2024)
- [21] Tkachenko, M., Malyuk, M., Holmanyuk, A., Liubimov, N.: Label Studio: Data labeling software (2020–2025). Open-source software available from https://github.com/HumanSignal/label-studio
- [22] Wang, X., Zhou, W., Zu, C., Xia, H., Chen, T., Zhang, Y., Zheng, R., Ye, J., Zhang, Q., Gui, T., et al.: InstructUIE: Multi-task instruction tuning for unified information extraction. arXiv preprint arXiv:2304.08085 (2023)
- [23] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., Davison, J., Shleifer, S., von Platen, P., Ma, C., Jernite, Y., Plu, J., Xu, C., Le Scao, T., Gugger, S., Drame, M., Lhoest, Q., Rush, A.: Transformers: State-of-the-art natural language processing. In: Liu, Q., Schlangen, D. (eds.) Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. pp. 38–45 (2020). https://doi.org/10.18653/v1/2020.emnlp-demos.6
- [24]
- [25] Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al. (arXiv 2025)
- [26] Yang, X., Zhao, X., Shen, Z.: EHRStruct: A comprehensive benchmark framework for evaluating large language models on structured electronic health record tasks. arXiv abs/2511.08206 (2025), https://api.semanticscholar.org/CorpusID:282922202
- [27] Zhang, N., Chen, M., Bi, Z., Liang, X., Li, L., Shang, X., Xuan, K., Zhao, J., Li, H., Huang, C.H., Ni, J., Shao, G., Chen, L., Tou, H., Huang, G., Chen, H.: CBLUE: A Chinese biomedical language understanding evaluation benchmark. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 7888–… (2022)
- [28] Zhang, N., Chen, M., Bi, Z., Liang, X., Li, L., Shang, X., Xuan, K., Zhao, J., Li, H., Huang, C.H., et al.: CBLUE benchmark: Technical report. arXiv preprint arXiv:2106.08087 (2021)
- [29] Zhu, W., Hou, G., Chen, M., Zhang, N., et al.: PromptCBLUE: A Chinese prompt tuning benchmark for the medical domain. arXiv preprint arXiv:2310.14151 (2023)
discussion (0)