pith. sign in

arxiv: 2502.20295 · v2 · submitted 2025-02-27 · 💻 cs.LG · cs.AI· cs.CV

Judge a Book by its Cover: Investigating Multi-Modal LLMs for Multi-Page Handwritten Document Transcription

Pith reviewed 2026-05-23 01:59 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CV
keywords handwritten text recognitionmulti-modal LLMsmulti-page transcriptionOCR post-processingprompting strategieszero-shot methodsdocument analysis
0
0 comments X

The pith

New prompting strategies let multi-modal LLMs transcribe multi-page handwritten documents by sharing context across pages without added complexity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests ways to use multi-modal LLMs for transcribing multi-page handwritten documents in a zero-shot setting. Existing approaches treat each page separately and discard shared details such as handwriting style and semantic content that run through real documents. The authors create a benchmark from existing single-page data plus a new Malvern-Hills dataset, then evaluate combinations of OCR, LLM post-processing, and end-to-end MLLM transcription. They introduce OCR+PAGE-1 and OCR+PAGE-N prompting methods that feed selected prior-page information into the model while keeping prompts short. These strategies outperform earlier page-by-page methods on the benchmark.

Core claim

The paper claims that OCR+PAGE-1 and OCR+PAGE-N prompting strategies enable multi-modal LLMs to achieve higher accuracy on multi-page handwritten transcription tasks by selectively incorporating shared content from earlier pages, while avoiding the prompt overload that comes from simply concatenating all prior text.

What carries the argument

OCR+PAGE-1 and OCR+PAGE-N prompting strategies, which combine OCR output with selective reuse of prior-page context inside MLLM prompts to carry semantic and stylistic information forward without full document concatenation.

If this is right

  • Multi-page documents can be transcribed more accurately in zero-shot settings without retraining models on new labeled sets.
  • Prompt length stays manageable even as page count grows, because only targeted prior content is reused.
  • The same strategies can be applied to other MLLM tasks that involve sequential or related inputs.
  • Benchmarks built from single-page datasets become usable for evaluating multi-page performance.
  • Combining OCR with selective MLLM prompting outperforms both pure OCR post-processing and pure image-based MLLM transcription.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could reduce the need for page-by-page manual correction in archives of long handwritten records.
  • It may generalize to tasks such as multi-page form filling or table extraction where style and content persist across pages.
  • If the strategies scale, they might lower the cost of digitizing historical collections that currently require extensive fine-tuning.
  • Testing on documents with varying handwriting consistency would reveal how much the shared-context benefit depends on document uniformity.

Load-bearing premise

That adding shared context from prior pages through these prompts will raise transcription accuracy on actual multi-page documents rather than triggering overload, dilution, or extra hallucinations.

What would settle it

A controlled test on real multi-page handwritten documents showing no accuracy gain or a drop when using OCR+PAGE-1 or OCR+PAGE-N compared with single-page OCR-plus-LLM baselines.

Figures

Figures reproduced from arXiv: 2502.20295 by Benjamin Gutteridge, Matthew Thomas Jackson, Toni Kukurin, Xiaowen Dong.

Figure 1
Figure 1. Figure 1: An illustration of how +FIRST PAGE works; the OCR text of a multi-page document is provided to an MLLM, along with just the first page image of the docu￾ment. Blue denotes the first page. Given that separate pages from the same document will have very similar formatting — handwriting, structure, im￾age lighting/angle, etc. — it is likely that the errors made by an OCR engine will be fairly consistent over … view at source ↗
Figure 2
Figure 2. Figure 2: An example of how +FIRST PAGE propagates OCR error corrections across pages. Though the MLLM only has access to the image of the first page, it uses the corrections that the OCR (i) frequently mistakes ‘i‘ for ‘1’ and (ii) fre￾quently mistakes words for numbers to correctly transcribe the word ‘in’ on the unseen second page. See Figures 11–14 in the Appendix for further examples. text mapping to the remain… view at source ↗
Figure 3
Figure 3. Figure 3: OCR ONLY (→ LLM) illustration. OCR CONCAT [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: OCR ONLY PBP illustration. MLLM [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: VISION* illustration. OCR MLLM [PITH_FULL_IMAGE:figures/full_fig_p004_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: +ALL PAGES illustration. Error catching. As a post-processing step for MLLM methods, we perform simple common-sense checks for catastrophic MLLM error, such as repeating sections of text ad infinitum, or refusing to return an output due to an in￾advertent triggering of OpenAI’s guardrails (“Sorry, but I can’t answer that...”). To avoid such outliers unfairly drag￾ging down the overall score of a method (th… view at source ↗
Figure 7
Figure 7. Figure 7: Comparing PBP and all-at-once performance for text only, vision only and mixed methods (all using GPT￾4O) on IAM multi-page documents of varying page counts. Relative CER Improvement is the same as in [PITH_FULL_IMAGE:figures/full_fig_p005_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Performance and cost of all OCR engine, MLLM [PITH_FULL_IMAGE:figures/full_fig_p006_8.png] view at source ↗
Figure 11
Figure 11. Figure 11: With +FIRST PAGE, the correctly-transcribed oc￾currence of ‘draws’ in the first page can be extrapolated to the unseen ‘draw’ on the second page. 2nd page snippet GT: 'Healy' OCR only: 'Mealy' OCR→LLM: 'Mealy' +first page→MLLM: 'Healy' 1st page snippet GT: 'Healy' OCR only: 'Mealy' OCR→LLM: 'Mealy' +first page→MLLM: 'Healy' ❌ ❌ ✅ ❌ ❌ ✅ [PITH_FULL_IMAGE:figures/full_fig_p008_11.png] view at source ↗
Figure 9
Figure 9. Figure 9: An example of a document from the IAM Hand [PITH_FULL_IMAGE:figures/full_fig_p008_9.png] view at source ↗
Figure 12
Figure 12. Figure 12: With +FIRST PAGE, the correctly-transcribed oc￾currence of the name ‘Mr Healy’ in the first page can be extrapolated to the unseen occurence on the second page. 2nd page snippet GT: 'fairly wide area' OCR only: 'Paily miche mide wear' OCR→LLM: 'fairly large area' +first page → 'fairly wide area' ❌ ❌ ✅ [PITH_FULL_IMAGE:figures/full_fig_p008_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: +FIRST PAGE corrects where OCR ONLY→LLM gets it wrong, despite only having access only to the garbled OCR output and not the image of the word ‘wide’ shown above. Suggests some degree of reasoning using the seem￾ingly irrelevant first page text — i.e. it can see that ‘m’s on page 1 look similar to ‘w’s and reason that ‘mide’ could be ‘wide’ [PITH_FULL_IMAGE:figures/full_fig_p008_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: An unusual case where the OCR engine erro [PITH_FULL_IMAGE:figures/full_fig_p009_14.png] view at source ↗
read the original abstract

Handwriting text recognition (HTR) remains a challenging task. Existing approaches require fine-tuning on labeled data, which is impractical to obtain for real-world problems, or rely on zero-shot tools such as OCR engines and multi-modal LLMs (MLLMs). MLLMs have shown promise both as end-to-end transcribers and as OCR post-processors, but to date there is little empirical research evaluating different MLLM prompting strategies for HTR, particularly for the case of multi-page documents. Most handwritten documents are multi-page, and share context such as semantic content and handwriting style across pages, yet MLLMs are typically used for transcription at the page level, meaning they throw away this shared context. They are also typically used as either text-only post-processors or image-only OCR alternatives, rather than leveraging multiple modes. This paper investigates a suite of methods combining OCR, LLM post-processing and MLLM end-to-end transcription, for the task of zero-shot multi-page handwritten document transcription. We introduce a benchmark for this task from existing single-page datasets, including a new dataset, Malvern-Hills. Finally, we introduce OCR+PAGE-1 and OCR+PAGE-N, prompting strategies for multi-page transcription that outperform existing methods by sharing content across pages while minimizing prompt complexity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper investigates zero-shot multi-page handwritten document transcription using multi-modal LLMs (MLLMs). It argues that page-level usage of MLLMs discards shared context such as semantic content and handwriting style across pages in real multi-page documents. The authors construct a benchmark by combining existing single-page datasets with a new Malvern-Hills dataset, and introduce two prompting strategies (OCR+PAGE-1 and OCR+PAGE-N) that share context across pages while limiting prompt length. They evaluate combinations of OCR, LLM post-processing, and MLLM end-to-end transcription, claiming the new strategies outperform prior approaches.

Significance. If the results are robust, the work offers practical guidance on leveraging shared document context in MLLM prompting for HTR without fine-tuning, addressing a gap in handling authentic multi-page documents. The new Malvern-Hills dataset and multi-page benchmark constitute a concrete contribution that could support future research in zero-shot document transcription.

major comments (2)
  1. [Benchmark Construction] Benchmark section: The benchmark is assembled from existing single-page datasets. No details are provided on whether concatenated pages preserve consistent author style, topic continuity, or layout. This is load-bearing for the central claim, as the reported gains of OCR+PAGE-1 and OCR+PAGE-N are attributed to exploiting shared cross-page context; an artificial construction risks measuring artifacts rather than genuine multi-page benefits.
  2. [Experiments] Experimental evaluation: The abstract asserts outperformance by the proposed strategies, yet the manuscript must include explicit quantitative comparisons (e.g., CER/WER tables against page-level baselines) with error analysis to confirm that gains arise from context sharing rather than other factors.
minor comments (2)
  1. [Abstract] Abstract: The claim of outperformance should be accompanied by at least one concrete metric or baseline comparison to orient readers.
  2. [Methods] Methods: Provide precise prompt templates or pseudocode for OCR+PAGE-1 versus OCR+PAGE-N to ensure reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important aspects of benchmark construction and experimental reporting. We address each major comment below and will revise the manuscript to incorporate additional details and explicit comparisons.

read point-by-point responses
  1. Referee: [Benchmark Construction] Benchmark section: The benchmark is assembled from existing single-page datasets. No details are provided on whether concatenated pages preserve consistent author style, topic continuity, or layout. This is load-bearing for the central claim, as the reported gains of OCR+PAGE-1 and OCR+PAGE-N are attributed to exploiting shared cross-page context; an artificial construction risks measuring artifacts rather than genuine multi-page benefits.

    Authors: We agree that the manuscript should provide more explicit details on benchmark construction. The existing single-page datasets were grouped by original document source where metadata permitted, and the new Malvern-Hills dataset comprises authentic multi-page handwritten documents. We will add a dedicated subsection describing the concatenation procedure, any available metadata on author and topic continuity, and a discussion of limitations arising from the use of single-page sources. revision: yes

  2. Referee: [Experiments] Experimental evaluation: The abstract asserts outperformance by the proposed strategies, yet the manuscript must include explicit quantitative comparisons (e.g., CER/WER tables against page-level baselines) with error analysis to confirm that gains arise from context sharing rather than other factors.

    Authors: We acknowledge that the experimental results section would be strengthened by expanded quantitative tables and error analysis. The manuscript already reports CER/WER metrics for the proposed strategies versus baselines, but we will add comprehensive side-by-side tables, statistical significance tests, and a dedicated error analysis subsection that examines whether improvements correlate with cross-page context sharing. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical prompting comparison with no derivation chain

full rationale

The paper introduces OCR+PAGE-1 and OCR+PAGE-N prompting strategies and evaluates them empirically against baselines on a benchmark assembled from single-page datasets plus Malvern-Hills. No equations, fitted parameters, uniqueness theorems, or self-citations are used to derive results; performance claims rest on direct measurement of transcription accuracy. The central claims do not reduce to inputs by construction, satisfying the self-contained empirical case (score 0-2).

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical prompting study with no mathematical model; no free parameters, axioms, or invented entities are introduced.

pith-pipeline@v0.9.0 · 5780 in / 1019 out tokens · 51927 ms · 2026-05-23T01:59:03.702431+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · 8 internal anchors

  1. [1]

    , " * write output.state after.block = add.period write newline

    ENTRY address archivePrefix author booktitle chapter edition editor eid eprint howpublished institution isbn journal key month note number organization pages publisher school series title type volume year label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block FUNCTION init.state.consts #0 'before.a...

  2. [2]

    write newline

    " write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...

  3. [3]

    GPT-4 Technical Report

    Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F. L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774

  4. [4]

    A.; Afzal, M

    Azawi, M. A.; Afzal, M. Z.; and Breuel, T. M. 2013. Normalizing historical orthography for OCR historical documents using LSTM. In Proceedings of the 2nd International Workshop on Historical Document Imaging and Processing, 80--85

  5. [5]

    B.; Daimary, D.; Amitab, K.; and Kandar, D

    Bora, M. B.; Daimary, D.; Amitab, K.; and Kandar, D. 2020. Handwritten character recognition from images using CNN-ECOC. Procedia Computer Science, 167: 2403--2409

  6. [6]

    M.; Ul-Hasan, A.; Al-Azawi, M

    Breuel, T. M.; Ul-Hasan, A.; Al-Azawi, M. A.; and Shafait, F. 2013. High-performance OCR for printed English and Fraktur using LSTM networks. In 2013 12th international conference on document analysis and recognition, 683--687. IEEE

  7. [7]

    Carbonell, M.; Mas, J.; Villegas, M.; Forn \'e s, A.; and Llad \'o s, J. 2019. End-to-end handwritten text detection and transcription in full pages. In 2019 International conference on document analysis and recognition workshops (ICDARW), volume 5, 29--34. IEEE

  8. [8]

    Causer, T.; Grint, K.; Sichani, A.-M.; and Terras, M. 2018. ‘Making such bargain’: Transcribe Bentham and the quality and cost-effectiveness of crowdsourced transcription. Digital Scholarship in the Humanities, 33(3): 467--487

  9. [9]

    Chen, Y.; Qian, S.; Tang, H.; Lai, X.; Liu, Z.; Han, S.; and Jia, J. 2023. Longlora: Efficient fine-tuning of long-context large language models. arXiv preprint arXiv:2309.12307

  10. [10]

    Devlin, J. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805

  11. [11]

    M.; Contreras, D.; Barrios, J

    Diem, M.; Fiel, S.; Kleber, F.; Sablatnig, R.; Saavedra, J. M.; Contreras, D.; Barrios, J. M.; and Oliveira, L. S. 2014. ICFHR 2014 competition on handwritten digit string recognition in challenging datasets (HDSRC 2014). In 2014 14th International Conference on Frontiers in Handwriting Recognition, 779--784. IEEE

  12. [12]

    ScribbleLens

    Dolfing, H. J.; Bellegarda, J.; Chorowski, J.; Marxer, R.; and Laurent, A. 2020. The “ScribbleLens” Dutch historical handwriting corpus. In 2020 17th international conference on frontiers in handwriting recognition (ICFHR), 67--72. IEEE

  13. [13]

    Dong, Q.; Li, L.; Dai, D.; Zheng, C.; Ma, J.; Li, R.; Xia, H.; Xu, J.; Wu, Z.; Liu, T.; et al. 2022. A survey on in-context learning. arXiv preprint arXiv:2301.00234

  14. [14]

    Dosovitskiy, A. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929

  15. [15]

    Floridi, L.; and Chiriatti, M. 2020. GPT-3: Its nature, scope, limits, and consequences. Minds and Machines, 30: 681--694

  16. [16]

    Fujitake, M. 2024. Dtrocr: Decoder-only transformer for optical character recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 8025--8035

  17. [17]

    Grali \'n ski, F.; Stanis awek, T.; Wr \'o blewska, A.; Lipi \'n ski, D.; Kaliska, A.; Rosalska, P.; Topolski, B.; and Biecek, P. 2020. Kleister: A novel task for information extraction involving long documents with complex layout. arXiv preprint arXiv:2003.02356

  18. [18]

    Huang, Y.; Lv, T.; Cui, L.; Lu, Y.; and Wei, F. 2022. Layoutlmv3: Pre-training for document ai with unified text and image masking. In Proceedings of the 30th ACM International Conference on Multimedia, 4083--4091

  19. [19]

    Huang, Z.; Chen, K.; He, J.; Bai, X.; Karatzas, D.; Lu, S.; and Jawahar, C. 2019. Icdar2019 competition on scanned receipt ocr and information extraction. In 2019 International Conference on Document Analysis and Recognition (ICDAR), 1516--1520. IEEE

  20. [20]

    Karpinska, M.; Thai, K.; Lo, K.; Goyal, T.; and Iyyer, M. 2024. One thousand and one pairs: A" novel" challenge for long-context language models. arXiv preprint arXiv:2406.16264

  21. [21]

    Kim, G.; Hong, T.; Yim, M.; Park, J.; Yim, J.; Hwang, W.; Yun, S.; Han, D.; and Park, S. 2021. Donut: Document understanding transformer without ocr. arXiv preprint arXiv:2111.15664, 7(15): 2

  22. [22]

    Kim, Y.; Chang, Y.; Karpinska, M.; Garimella, A.; Manjunatha, V.; Lo, K.; Goyal, T.; and Iyyer, M. 2024. FABLES: Evaluating faithfulness and content selection in book-length summarization. arXiv preprint arXiv:2404.01261

  23. [23]

    Li, M.; Lv, T.; Chen, J.; Cui, L.; Lu, Y.; Florencio, D.; Zhang, C.; Li, Z.; and Wei, F. 2023. Trocr: Transformer-based optical character recognition with pre-trained models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, 13094--13102

  24. [24]

    F.; Lin, K.; Hewitt, J.; Paranjape, A.; Bevilacqua, M.; Petroni, F.; and Liang, P

    Liu, N. F.; Lin, K.; Hewitt, J.; Paranjape, A.; Bevilacqua, M.; Petroni, F.; and Liang, P. 2024. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12: 157--173

  25. [25]

    Liu, Y. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 364

  26. [26]

    Liu, Y.; Li, Z.; Yang, B.; Li, C.; Yin, X.; Liu, C.-l.; Jin, L.; and Bai, X. 2023. On the hidden mystery of ocr in large multimodal models. arXiv preprint arXiv:2305.07895

  27. [27]

    Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; and Guo, B. 2021. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, 10012--10022

  28. [28]

    B.; Walker, D

    Lund, W. B.; Walker, D. D.; and Ringger, E. K. 2011. Progressive alignment and discriminative error correction for multiple OCR engines. In 2011 International Conference on Document Analysis and Recognition, 764--768. IEEE

  29. [29]

    Marti, U.-V.; and Bunke, H. 2002. The IAM-database: an English sentence database for offline handwriting recognition. International journal on document analysis and recognition, 5: 39--46

  30. [30]

    C.; Maier, V.; and Green, P

    Morris, A. C.; Maier, V.; and Green, P. D. 2004. From WER and RIL to MER and WIL: improved evaluation measures for connected speech recognition. In Interspeech, 2765--2768

  31. [31]

    Park, S.; Shin, S.; Lee, B.; Lee, J.; Surh, J.; Seo, M.; and Lee, H. 2019. CORD: a consolidated receipt dataset for post-OCR parsing. In Workshop on Document Intelligence at NeurIPS 2019

  32. [32]

    Peer, D.; Sch \"o pf, P.; Nebendahl, V.; Rietzler, A.; and Stabinger, S. 2024. ANLS*--A Universal Document Processing Metric for Generative Large Language Models. arXiv preprint arXiv:2402.03848

  33. [33]

    Rigaud, C.; Doucet, A.; Coustaty, M.; and Moreux, J.-P. 2019. ICDAR 2019 competition on post-OCR text correction. In 2019 international conference on document analysis and recognition (ICDAR), 1588--1593. IEEE

  34. [34]

    A.; Romero, V.; Toselli, A

    S \'a nchez, J. A.; Romero, V.; Toselli, A. H.; Villegas, M.; and Vidal, E. 2019. A set of benchmarks for handwritten text recognition on historical documents. Pattern Recognition, 94: 122--134

  35. [35]

    Schaefer, R.; and Neudecker, C. 2020. A two-step approach for automatic OCR post-correction. In Proceedings of the 4th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, 52--57

  36. [36]

    Schick, T.; Dwivedi-Yu, J.; Dess \` , R.; Raileanu, R.; Lomeli, M.; Hambro, E.; Zettlemoyer, L.; Cancedda, N.; and Scialom, T. 2024. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems, 36

  37. [37]

    Serrano, N.; Castro, F.; and Juan, A. 2010. The RODRIGO Database. In LREC, 19--21

  38. [38]

    Shi, L.; Zhang, H.; Yao, Y.; Li, Z.; and Zhao, H. 2024. Keep the Cost Down: A Review on Methods to Optimize LLM's KV-Cache Consumption. arXiv preprint arXiv:2407.18003

  39. [39]

    Stanis awek, T.; Grali \'n ski, F.; Wr \'o blewska, A.; Lipi \'n ski, D.; Kaliska, A.; Rosalska, P.; Topolski, B.; and Biecek, P. 2021. Kleister: key information extraction datasets involving long documents with complex layouts. In International Conference on Document Analysis and Recognition, 564--579. Springer

  40. [40]

    Vaswani, A. 2017. Attention is all you need. arXiv preprint arXiv:1706.03762

  41. [41]

    Veninga, M. 2024. LLMs for OCR Post-Correction. Master's thesis, University of Twente

  42. [42]

    Wigington, C.; Tensmeyer, C.; Davis, B.; Barrett, W.; Price, B.; and Cohen, S. 2018. Start, follow, read: End-to-end full-page handwriting recognition. In Proceedings of the European conference on computer vision (ECCV), 367--383

  43. [43]

    U.; Zhang, D.; Ramanathan, M

    Wu, D.; Ahmad, W. U.; Zhang, D.; Ramanathan, M. K.; and Ma, X. 2024. REPOFORMER: Selective retrieval for repository-level code completion. arXiv preprint arXiv:2403.10059

  44. [44]

    Xu, Y.; Li, M.; Cui, L.; Huang, S.; Wei, F.; and Zhou, M. 2020. Layoutlm: Pre-training of text and layout for document image understanding. In Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining, 1192--1200

  45. [45]

    Yang, J.; Ren, P.; and Kong, X. 2019. Handwriting text recognition based on faster R-CNN. In 2019 Chinese Automation Congress (CAC), 2450--2454. IEEE

  46. [46]

    Yin, S.; Fu, C.; Zhao, S.; Li, K.; Sun, X.; Xu, T.; and Chen, E. 2024. A survey on multimodal large language models. National Science Review, nwae403

  47. [47]

    Yu, H.; Chen, J.; Li, B.; Ma, J.; Guan, M.; Xu, X.; Wang, X.; Qu, S.; and Xue, X. 2021. Benchmarking chinese text recognition: Datasets, baselines, and an empirical study. arXiv preprint arXiv:2112.15093

  48. [48]

    Yuan, Y.; Liu, X.; Dikubab, W.; Liu, H.; Ji, Z.; Wu, Z.; and Bai, X. 2022. Syntax-aware network for handwritten mathematical expression recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 4553--4562

  49. [49]

    Zhang, R.; Zhou, Y.; Jiang, Q.; Song, Q.; Li, N.; Zhou, K.; Wang, L.; Wang, D.; Liao, M.; Yang, M.; et al. 2019. Icdar 2019 robust reading challenge on reading chinese text on signboard. In 2019 international conference on document analysis and recognition (ICDAR), 1577--1581. IEEE

  50. [50]

    A Survey of Large Language Models

    Zhao, W. X.; Zhou, K.; Li, J.; Tang, T.; Wang, X.; Hou, Y.; Min, Y.; Zhang, B.; Zhang, J.; Dong, Z.; et al. 2023. A survey of large language models. arXiv preprint arXiv:2303.18223