arxiv: 2507.21114 · v3 · submitted 2025-07-11 · cs.IR · cs.AI · cs.CV

Page image classification for content-specific data processing

Pith reviewed 2026-05-19 05:57 UTC · model grok-4.3

classification: cs.IR, cs.AI, cs.CV
keywords: image classification, historical documents, page categorization, document digitization, content-based routing, machine learning, humanities archives

The pith

An image classification system sorts historical document pages by content to enable tailored processing pipelines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Digitization projects produce large collections of historical page images containing mixed text, graphics, tables, and layouts. Manual sorting of these images for appropriate analysis is slow and impractical at scale. The paper develops and evaluates a machine learning image classifier that automatically assigns pages to content categories chosen specifically to match different downstream techniques such as OCR for textual pages and dedicated image analysis for graphical ones.

Core claim

The authors develop an image classification system for historical document pages whose categories separate content types that require distinct analysis methods, such as optical character recognition for handwritten, typed, or printed text and separate processing for drawings, maps, or photographs.

What carries the argument

A content-specific image classifier that assigns each historical page image to one of a small set of categories, each chosen so that the page can be routed to the matching analysis pipeline.

If this is right

  • Text-heavy pages can be sent directly to OCR without wasting resources on graphics pages.
  • Graphics and map pages can be routed to image-analysis tools instead of OCR.
  • Large humanities archives can be processed with less human sorting effort.
  • Processing workflows become automated by content type rather than uniform.
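
The routing idea in the bullets above can be sketched as a lookup from predicted category to pipeline. This is a minimal illustration, not the authors' implementation. The label names follow a subset of those in the figure captions on this page (written here with underscores). The pipeline names, the `manual_review` fallback, and the helpers `route_page` and `route_batch` are hypothetical.

```python
from collections import defaultdict

# Hypothetical mapping from a predicted page category to a downstream pipeline.
# Only a subset of labels is listed; anything else falls back to manual review.
PIPELINE_BY_LABEL = {
    "TEXT": "ocr",
    "TEXT_HW": "ocr",
    "TEXT_P": "ocr",
    "DRAW": "image_analysis",
    "PHOTO": "image_analysis",
}


def route_page(label: str) -> str:
    """Return the pipeline name for one predicted label."""
    return PIPELINE_BY_LABEL.get(label, "manual_review")


def route_batch(predictions: dict) -> dict:
    """Group page identifiers into one queue per pipeline."""
    queues = defaultdict(list)
    for page, label in predictions.items():
        queues[route_page(label)].append(page)
    return dict(queues)
```

Keeping the mapping in a plain dictionary means a new category or pipeline can be added without touching the routing logic.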

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same category set might transfer to modern born-digital documents with minor retraining.
  • Combining the classifier with layout detection could further refine routing decisions.
  • Error patterns on rare page types could guide targeted data collection for retraining.

Load-bearing premise

A standard image classifier trained on historical page images can separate the chosen content types reliably enough to improve downstream specialized pipelines.

What would settle it

Running the classifier on a new collection of labeled historical pages and finding accuracy too low to reduce manual review time or error rate compared with random routing.
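
One way to make this test concrete is to compare measured accuracy against simple baselines. The sketch below assumes that "random routing" means a uniform guess over the categories, which is an interpretation rather than something the page states. The function names are illustrative.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of pages whose predicted label equals the reference label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def uniform_random_baseline(num_classes):
    """Expected accuracy of a uniform random guess over num_classes labels."""
    return 1.0 / num_classes


def majority_baseline(y_true):
    """Accuracy obtained by always predicting the most frequent label."""
    return Counter(y_true).most_common(1)[0][1] / len(y_true)


def beats_baselines(y_true, y_pred):
    """True if the classifier is more accurate than both trivial baselines."""
    acc = accuracy(y_true, y_pred)
    num_classes = len(set(y_true))
    return acc > uniform_random_baseline(num_classes) and acc > majority_baseline(y_true)
```

The majority-class baseline is included because page collections are often imbalanced, in which case beating a uniform guess alone says little.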

Figures

Figures reproduced from arXiv: 2507.21114 by Kateryna Lutsai, Pavel Straňák.

Figure 1: Number of page scans over time
    Digital archives derived from historical documents exhibit several unique characteristics that complicate their management. The collections often span significant historical periods, with document creation ranging from the early 20th century to the present day, and the data volume typically increases exponentially over time, as illustrated in …
Figure 2: Various page types examples. (a) Hand-written table on a damaged paper (b) Tiny-scale drawing (c) Article scan with a photo
    1.2 Challenges in management of scanned documents
Figure 3: Various page types examples. (a) Manually commented typed report with a small logo (b) Large-scale canvas with a map and a legend table
Figure 4: Pages problematic for OCR. (a) Hand-written text on a gray paper (b) Page from a large volume journal (c) Scanned copy with printing defects
    To address these challenges, this thesis focuses on the development of an automated system for classifying page images from historical archives based on their visual content and layout. Such a tool is highly beneficial for institutions managing large digital collection…
Figure 5: One of the oldest and one of the latest pages in our collection. (a) Notebook with a hand-drawn sketch & skewed lines (b) Modern digital-born (printed and then scanned for some reason) map
    2 EXPLORATION OF THE RAW DATA
    The primary datas…
Figure 6: Various page and content defects. (a) Low-contrast text (b) Skewed table (c) Bleed-through tables
    • Page Skew and Alignment Issues: Many pages suffer from skew, where content is not aligned horizontally (Figures 32, 47g, 49c, 50c, 50g, 51d, 51f, 52a, 54d, 54e, 57c, 57g, and 57i). This is a well-documented problem in Optical Character Recognition (OCR) literature that can arise from improper paper feeding during scannin…
Figure 8: Various page and content defects. (a) Torn page corner (b) Large volume bound & skewed table (c) Corrections & filled-in stamp
    These complex and often overlapping defects make naive document processing unreliable, underscoring the need for specialized classifiers that are robust to such visual noise.
    2.1.2 Textual Variations and Annotations
    Beyond the physical defects, the documents displayed considerabl…
Figure 9: DLA application samples. (a) Imaginary tables & ignored figure (b) Ignored text & imaginary figures (c) Ignored text paragraph
    2.2.1 OCR performance
    The Tesseract OCR engine was applied to sample pages to measure text recognition accuracy. The performance varied significantly depending on the document’s condition. High accuracy was achieved on pages with clean backgrounds and high-contrast printed text (see…
Figure 10: DeepDoctection mistakes on pages with tables and figures. (a) Table as a figure (b) Drawing as a table (c) Header as a figure
    2.2.5 Page classification based on detected elements
    In an attempt to improve upon the initial DLA results, a rule-based classifier was developed. This system used heuristics based on line counts and the output of DeepDoctection. For each page, the number of long and short horizonta…
Figure 11: DeepDoctection mistakes on pages with maps and drawings. (a) Map as a table (b) Map as a table & ignored legend (c) Figures as a table
    • Classification Consistency: Pages with similar content must be assigned to the same category. Consistency was prioritized over isolated instances of correctness, as a few accurate classifications among many errors were considered less useful.
    • Primacy of Structured Data:…
Figure 12: Distribution of categories based on the document creation year
    3.1 Classification Categories and Priorities
    The 11 target classes were defined in …
Figure 13: Low-compute models compared on the same data in cross-fold validation. Left: annotation scheme proposed by the data provider. Right: annotation scheme proposed by us.
    A Random Forest Classifier (RFC) (Breiman, 2001) was chosen for classification due to its effectiveness, interpretability, and efficiency on modest hardware. While several other low-compute models were evaluated, the RFC yielded the best prelim…
Figure 15: EfficientNetV2 confusion matrix. (a) Size S: 97.73% (b) Size M: 97.41% (c) Size L: 97.86%
    EfficientNetV2 is a convolutional neural network that balances width, depth, and resolution using compound scaling (Tan and Le, 2019), pretrained on ImageNet-21k (Ridnik et al., 2021). RegNetY is a family of ResNet-like architectures designed via network design spaces (Radosavovic et al., 2020). Specifications of model…
Figure 17: DiT confusion matrix, 10 epochs. (a) Base RVL: 97.03% (b) Large: 96.91% (c) Large RVL: 97.28%
    The Document Image Transformer (DiT) and Vision Transformer (ViT) are architectures that apply the Transformer mechanism directly to image patches, a departure from traditional CNNs. DiT is specifically pre-trained on large-scale document images, making it a strong candidate for our task. We fine-tuned several …
Figure 18: ViT confusion matrix, 10 epochs. (a) Base 224: 97.54% (b) Base 384: 97.41% (c) Large 384: 97.73%
    … allows for zero-shot classification by comparing image features to the features of textual descriptions of each category. While zero-shot performance was limited (accuracies below 46% as shown in …
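
The zero-shot mechanism described in the text fragment attached to Figure 18 (compare an image's features with the features of each category description and pick the closest) can be sketched with plain cosine similarity. This is a toy illustration under stated assumptions: in practice the vectors would come from a model's image and text encoders, whereas here they are made-up numbers, and `zero_shot_label` is a hypothetical helper name.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_label(image_vec, text_vecs):
    """Pick the category whose description embedding is closest to the image.

    text_vecs maps a category label to the embedding of its text description.
    """
    return max(text_vecs, key=lambda label: cosine(image_vec, text_vecs[label]))
```

Because no category-specific training is involved, the result depends heavily on how each category is described in text, which is consistent with the page's note that several sets of category descriptions were compared.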
Figure 19: Finetuned CLIP: comparison of classification accuracy per set of category descriptions, in the same order as Tables 13 through 20 are enumerated (summarized in …
Figure 23: Scheme of the transformers & CNNs model architecture
    • Evaluation strategy: Per-epoch evaluation
    • Saving strategy: Per-epoch model checkpoint saving
    • Best model selection: Based on accuracy metric
    These default parameters are optimized for datasets containing 10,000-50,000 page samples. The documentation advises adjusting the number of epochs based on evaluation loss to prevent overfitting and modifying…
Figure 24: Scheme of the CLIP model architecture
Figure 25: Model inference use-case
    4.5.1 PDF documents to page images
    The data preparation pipeline includes multi-platform scripts to convert PDF documents into PNG images:
    • Unix script (pdf2png.sh): uses pdftoppm for conversion with zero-padded page numbers
    • Windows script (pdf2png.bat): uses ImageMagick and Ghostscript with sequential page numbers
    These scripts create a directory structure where each PDF is …
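
The contents of pdf2png.sh are not shown on this page, so the following is only a sketch of what a pdftoppm-based conversion step might look like when driven from Python. It relies on the `-png` and `-r` options of pdftoppm. The one-subdirectory-per-PDF layout, the default resolution, and the function names are assumptions made for illustration.

```python
import subprocess
from pathlib import Path


def pdftoppm_command(pdf_path, out_root, dpi=300):
    """Build the pdftoppm argument list for one PDF.

    Assumed layout: <out_root>/<pdf stem>/<pdf stem>-<page>.png,
    where pdftoppm itself appends the page number to the prefix.
    """
    pdf = Path(pdf_path)
    prefix = Path(out_root) / pdf.stem / pdf.stem
    return ["pdftoppm", "-png", "-r", str(dpi), str(pdf), str(prefix)]


def convert_pdf(pdf_path, out_root, dpi=300):
    """Create the output directory and run pdftoppm (must be installed)."""
    cmd = pdftoppm_command(pdf_path, out_root, dpi)
    Path(cmd[-1]).parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(cmd, check=True)
```

Separating command construction from execution makes the path logic checkable without pdftoppm being available.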
Figure 26: Accuracy vs. Parameter Count across evaluated models. Models above the trendline (Finetuned CLIP) deliver superior efficiency
Figure 27: Water damage resulting in blur and overlapping ink. (a) Mixed text with round stamp, water-blurred (b) Handwritten text, water-damaged (c) Typed text blurred by water stains
Figure 28: Cornerhole damage exposing paper fibers and mixed content, yellowish paper texture. (a) Major text correction with cornerhole tear (b) Handwritten drawing next to cornerhole (c) Typed table layout with torn cornerhole
Figure 31: Bleed-through artifacts on thin paper. (a) Drawn table showing text bleed (b) Handwritten text blurred by bleed (c) Typed text with minor bleed correction
    C LABEL EXAMPLES
    All Figures 47 to 57 are summarized in …
Figure 32: Skewed table prints and alignment issues. (a) Hand-written comment on skewed color table (b) Color-printed table, slight skew (c) Typed table with visible skew (d) Arrow marking on a skewed table page
Figure 34: Large drawn tables and grained-paper scans. (a) Squared-table drawing on large page (b) Packed-table drawing on large paper (c) Hand-drawn image on grained scan (d) Extra-edge table on paper (mixed content)
Figure 35: Scans from thick volumes and journals. (a) Color table in thick journal scan (b) Black-and-white journal page with table (c) Text from a thick book scan
Figure 36: Typed text corrections. (a) Corrected typed text (b) Minor typed correction (c) Crossed-out lines & gray corners
Figure 37: Printed tables with color artifacts and pen stains. (a) Table with heavy scribbles (b) Small-sized table on a page (c) Pen stains and yellow paper under table
Figure 38: Mixed-text pages and scribbles. (a) Hand-drawn mixed-text page (b) Yellowed photo with typed caption (c) Gray paper with manual photo captions
Figure 40: DeepDoctection attempt on a table. (a) Source image (b) Recognized table cells
    --- TABLE 1 --- obrázcích č., 6 = 26 předvádím ukázky charakteristických keramických nálezů z některých signifikant- ních celků, Zprvu zde shrnuji nápln obrázků č. 6 -= 26 v tabulce: vrstva, objekt nálezové situace obr. Čs
Figure 42: DeepDoctection attempt on a photo with text. (a) Source image (b) Recognized elements
    --- TABLE 1 --- P. Dr. Kare! Závadský T. J. rektor Papežské koleje na Velehradě, přírodovědec a spisovatel. Nar. se 15. I. 1886 v Dol. Benešově u Hlučína, na kněze vysvěcen 26. července 1920 v Inšpruku, zemřel náhle po krátké nemoci 2. listopadu 1949 Modleme se: štolskými ho Karla uděl prosíme, Bože, jenž jsi mezi apo- kn…
Figure 44: DeepDoctection mistakes on pages with drawings. (a) Missed table & drawing (b) Partial table (c) Missed drawing
Figure 46: DeepDoctection mistakes on pages with plain texts. (a) Text as table (b) Missed font (c) Missed lines
Figure 47: Label DRAW examples. (a) City drawings (b) Realistic painting (c) Ground schematic (d) Territory map (e) Building plan (f) Within book scan (g) Within written notes
Figure 48: Label DRAW L examples. (a) Scheme with a legend (b) Map inside a form (c) Drawing inside a form (d) Wall drawing (e) Territory map (f) Schema in a form (g) Buildings top view
Figure 49: Label LINE HW examples. (a) Manually filled-in form (b) Gray paper form (c) Handwritten journal (d) Filled form (e) Filled object notes (f) Handwritten table (g) Front page header
Figure 50: Label LINE P examples. (a) Colored cells (b) Table within text (c) Journal page (d) Colorful header (e) Old style print (f) Full-page table (g) Widened print
Figure 51: Label LINE T examples. (a) Filled-in form (b) Stamped front page (c) Grayish paper (d) Edgeless table (e) Filled-in form header (f) Page holes on the edge (g) Full-page table
Figure 52: Label PHOTO examples. (a) Newspaper page (b) Front Page (c) Artifact photos (d) Texture cutouts (e) Schema and photo (f) Multiple object cutouts (g) No captions
Figure 53: Label PHOTO L examples. (a) Manual captions (b) Map with legend (c) Photo cutouts (d) Photo and legend (e) Near drawing (f) Within filled form (g) Object and legend
Figure 54: Label TEXT examples. (a) Newspaper cutouts (b) Journal cover (c) Manual correction (d) Text comment (e) Front page (f) Whitespace (g) Postcard scan
Figure 55: Label TEXT HW examples. (a) Clean list-like (b) Yellow paper (c) Flipped paper (d) Tiny paper (e) Dual page (f) Simple stamp (g) Wet paper
Figure 56: Label TEXT P examples. (a) Color printed (b) Journal scan (c) Book or article page (d) Title page (e) Decorative prints (f) Simple text (g) Skewed print (h) Shifted formatting (i) List-like column
Original abstract

Digitization projects in humanities often generate vast quantities of page images from historical documents, presenting significant challenges for manual sorting and analysis. These archives contain diverse content, including various text types (handwritten, typed, printed), graphical elements (drawings, maps, photos), and layouts (plain text, tables, forms). Efficiently processing this heterogeneous data requires automated methods to categorize pages based on their content, enabling tailored downstream analysis pipelines. This project addresses this need by developing and evaluating an image classification system specifically designed for historical document pages, leveraging advancements in artificial intelligence and machine learning. The set of categories was chosen to facilitate content-specific processing workflows, separating pages requiring different analysis techniques (e.g., OCR for text, image analysis for graphics)

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript describes the development of an image classification system for historical document page images. It defines a set of content categories (handwritten/typed/printed text, graphics, maps, photos, tables, forms) chosen to route pages to specialized downstream pipelines such as OCR or image analysis, and reports on the training and evaluation of a classifier leveraging modern AI/ML techniques for this purpose.

Significance. A reliable page-level classifier of this kind would materially reduce manual triage costs in large-scale humanities digitization projects and improve the precision of subsequent automated analysis. The category design is well-motivated for workflow separation, but the manuscript supplies no quantitative performance data, dataset description, or error analysis, so the practical significance cannot yet be assessed.

major comments (2)
  1. [Results / Evaluation] The manuscript contains no reported accuracy, per-class F1, confusion matrix, or any other quantitative performance metric for the classifier. This absence is load-bearing for the central claim that the system can reliably separate the chosen categories at a level useful for routing pages to specialized pipelines.
  2. [Dataset / Methods] No information is provided on dataset size, composition, train/validation/test splits, or source of the historical page images. Without these details it is impossible to judge whether the reported (or unreported) performance generalizes or whether class imbalance or domain shift undermines the workflow-separation goal.
minor comments (1)
  1. [Abstract / Introduction] The abstract and introduction repeat the motivation for content-specific processing but do not preview the concrete categories or the classifier architecture; a short enumerated list of the target classes would improve readability.
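
The metrics requested in the first major comment (a confusion matrix and per-class F1) can be computed from paired reference and predicted labels. The sketch below uses only the standard library and made-up label lists. It is one straightforward way to compute these quantities, not the evaluation code of the paper.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count (reference label, predicted label) pairs: a sparse confusion matrix."""
    return Counter(zip(y_true, y_pred))


def per_class_f1(y_true, y_pred):
    """Per-class F1 = 2*TP / (2*TP + FP + FN); 0.0 when the denominator is zero."""
    counts = confusion_counts(y_true, y_pred)
    labels = sorted(set(y_true) | set(y_pred))
    scores = {}
    for label in labels:
        tp = counts[(label, label)]
        fp = sum(c for (t, p), c in counts.items() if p == label and t != label)
        fn = sum(c for (t, p), c in counts.items() if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores
```

Per-class scores matter here because a rare page type can be routed badly while overall accuracy still looks high.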

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We agree that quantitative performance metrics and dataset details are necessary to substantiate the practical utility of the classifier for content-specific routing in digitization workflows, and we will incorporate these in the revised manuscript.

Point-by-point responses
  1. Referee: [Results / Evaluation] The manuscript contains no reported accuracy, per-class F1, confusion matrix, or any other quantitative performance metric for the classifier. This absence is load-bearing for the central claim that the system can reliably separate the chosen categories at a level useful for routing pages to specialized pipelines.

    Authors: We agree that the absence of quantitative metrics prevents a full assessment of the classifier's reliability for workflow separation. The initial manuscript emphasized the category design and system architecture to support content-specific pipelines but did not include the evaluation numbers. In revision we will add overall accuracy, per-class F1 scores, a confusion matrix, and error analysis to demonstrate performance on the chosen categories.
    revision: yes

  2. Referee: [Dataset / Methods] No information is provided on dataset size, composition, train/validation/test splits, or source of the historical page images. Without these details it is impossible to judge whether the reported (or unreported) performance generalizes or whether class imbalance or domain shift undermines the workflow-separation goal.

    Authors: We acknowledge the omission and agree that dataset provenance and split information are required to evaluate generalization and potential imbalances. The revised manuscript will describe the dataset size, category composition, sources of the historical page images, and the train/validation/test splits used during model development.
    revision: yes
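
A train/validation/test split that preserves category proportions is one common way to address the class-imbalance concern raised above. The sketch below is an assumption-laden illustration: the 80/10/10 fractions, the seed, and the function name are hypothetical, and the page does not say which split procedure the authors used.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.8, 0.1, 0.1), seed=0):
    """Split sample indices into train/val/test, separately within each label.

    The third fraction is implicit: whatever remains after train and val
    goes to test, so small classes never lose samples to rounding.
    """
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, val, test = [], [], []
    for label in sorted(by_label):
        idxs = by_label[label]
        rng.shuffle(idxs)
        n_train = int(len(idxs) * fractions[0])
        n_val = int(len(idxs) * fractions[1])
        train += idxs[:n_train]
        val += idxs[n_train:n_train + n_val]
        test += idxs[n_train + n_val:]
    return train, val, test
```

Reporting the per-category counts of each resulting subset alongside the metrics would cover the dataset details the referee asks for.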

Circularity Check

0 steps flagged

No circularity; empirical development project with no derivations or self-referential claims.

Full rationale

The manuscript presents a practical ML development effort for classifying historical document page images into content categories chosen to support downstream workflows. No equations, mathematical derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear. The central claim is an applied system description rather than a theoretical result that reduces to its own inputs by construction. This is the expected self-contained outcome for an engineering project without the enumerated circular patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the work implicitly relies on standard supervised image classification assumptions from prior ML literature.

pith-pipeline@v0.9.0 · 5652 in / 1102 out tokens · 31969 ms · 2026-05-19T05:57:34.317724+00:00 · methodology

Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages · 4 internal anchors

  1. Beyer, L., Zhai, X., and Kolesnikov, A. (2022). Better plain ViT baselines for ImageNet-1k. arXiv preprint arXiv:2205.01580
  2. Breiman, L. (2001). Random forests. Machine learning, 45:5–32
  3. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929
  4. Haralick, R. M., Shanmugam, K., and Dinstein, I. H. (1973). Textural features for image classification. IEEE Transactions on systems, man, and cybernetics, (6):610–621
  5. Harley, A. W., Ufkes, A., and Derpanis, K. G. (2015). Evaluation of deep convolutional nets for document image classification and retrieval. In International Conference on Document Analysis and Recognition (ICDAR)
  6. Hu, M.-K. (1962). Visual pattern recognition by moment invariants. IRE transactions on information theory, 8(2):179–187
  7. Lewis, D., Agam, G., Argamon, S., Frieder, O., Grossman, D., and Heard, J. (2006). Building a test collection for complex document information processing. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pages 665–666
  8. Li, J., Xu, Y., Lv, T., Cui, L., Zhang, C., and Wei, F. (2022). DiT: Self-supervised pre-training for document image transformer. In Proceedings of the 30th ACM international conference on multimedia, pages 3530–3539
  9. Liu, L., Wang, Z., Qiu, T., Chen, Q., Lu, Y., and Suen, C. Y. (2021). Document image classification: Progress over two decades. Neurocomputing, 453:223–240
  10. Lutsai, K. and Krivankova, D. (2025). Annotated page images from the (archaeological) historical archive
  11. Lutsai, K., Stranak, P., Novak, D., and Krivankova, D. (2025). ATRIUM's page classifier: Classification of historical page images using fine-tuned ViT
  12. Nikolaidou, K., Seuret, M., Mokayed, H., and Liwicki, M. (2022). A survey of historical document image datasets. International Journal on Document Analysis and Recognition (IJDAR), 25(4):305–338
  13. Paszke, A. (2019). PyTorch: An imperative style, high-performance deep learning library. arXiv preprint arXiv:1912.01703
  14. Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. (2021). Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR
  15. Radosavovic, I., Kosaraju, R. P., Girshick, R., He, K., and Dollár, P. (2020). Designing network design spaces. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10428–10436
  16. Ridnik, T., Ben-Baruch, E., Noy, A., and Zelnik-Manor, L. (2021). ImageNet-21k pretraining for the masses. arXiv preprint arXiv:2104.10972
  17. Smith, R. (2007). An overview of the Tesseract OCR engine. In Ninth international conference on document analysis and recognition (ICDAR 2007), volume 2, pages 629–633. IEEE
  18. Tan, M. and Le, Q. (2019). EfficientNet: Rethinking model scaling for convolutional neural networks. In International conference on machine learning, pages 6105–6114. PMLR
  19. Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al. (2019). HuggingFace's Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771
  20. Xu, H., Xie, S., Tan, X. E., Huang, P.-Y., Howes, R., Sharma, V., Li, S.-W., Ghosh, G., Zettlemoyer, L., and Feichtenhofer, C. (2023). Demystifying CLIP data. arXiv preprint arXiv:2309.16671
  21. Yousefi, J. (2011). Image binarization using Otsu thresholding algorithm. Ontario, Canada: University of Guelph, 10:9