Multimodal deep networks for text and image-based document classification

Catherine Herold; C\'edric Vidal; Kuider Slimani; Nicolas Audebert

arxiv: 1907.06370 · v1 · pith:4ZP2K2YJnew · submitted 2019-07-15 · 💻 cs.CV

Multimodal deep networks for text and image-based document classification

Nicolas Audebert , Catherine Herold , Kuider Slimani , C\'edric Vidal This is my paper

Pith reviewed 2026-05-24 21:40 UTC · model grok-4.3

classification 💻 cs.CV

keywords document classificationmultimodal learningOCR text embeddingsdeep neural networksimage classificationTobacco3482RVL-CDIP

0 comments

The pith

Multimodal network fusing image features with OCR word embeddings raises document classification accuracy by 3%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper designs a neural network that processes both document images and text extracted by OCR to perform classification. Visual analysis alone cannot achieve the fine-grained results needed for archival and administrative documents because key information often resides in the text. By learning from word embeddings computed on the OCR output together with image data, the network improves accuracy over image-only baselines. The improvement holds on Tobacco3482 and RVL-CDIP even when the text comes from imperfect OCR rather than clean ground truth. A new QS-OCR dataset is released to support further work on this combination of signals.

Core claim

A multimodal neural network that learns jointly from word embeddings computed on OCR-extracted text and from the document image itself improves classification accuracy by 3% over pure image models on the Tobacco3482 and RVL-CDIP datasets augmented with the QS-OCR text dataset, and the gain occurs even without clean text information.

What carries the argument

Multimodal neural network that fuses image features with text embeddings derived from OCR output.

If this is right

Document classification systems can improve by incorporating OCR text even when the OCR is noisy.
Fine-grained distinctions that depend on textual content become reachable without requiring perfectly transcribed text.
The new QS-OCR dataset provides a public resource for training and evaluating multimodal document models.
The 3% gain demonstrates that image and text modalities are not redundant for this task.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same fusion strategy could be tested on document collections in languages with different OCR error profiles.
Performance might degrade on documents where text and image content conflict rather than reinforce each other.
Future work could measure how much of the gain survives when the OCR engine is replaced by a weaker or stronger model.

Load-bearing premise

That OCR-derived text embeddings supply complementary signal the network can fuse with image features to produce reliable accuracy gains.

What would settle it

Running the identical image-only and multimodal models on Tobacco3482 or RVL-CDIP and observing no accuracy difference or a drop when the OCR text branch is added.

Figures

Figures reproduced from arXiv: 1907.06370 by Catherine Herold, C\'edric Vidal, Kuider Slimani, Nicolas Audebert.

**Figure 2.** Figure 2: Document samples from the RVL-CDIP [1] dataset with corresponding text extracted by Tesseract OCR. [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗

**Figure 3.** Figure 3: MobileNetV2 uses inverted residual blocks to reduce [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗

**Figure 4.** Figure 4: Tesseract OCR outputs noisy text that does not entirely overlap with the assumptions usually held when training word [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗

read the original abstract

Classification of document images is a critical step for archival of old manuscripts, online subscription and administrative procedures. Computer vision and deep learning have been suggested as a first solution to classify documents based on their visual appearance. However, achieving the fine-grained classification that is required in real-world setting cannot be achieved by visual analysis alone. Often, the relevant information is in the actual text content of the document. We design a multimodal neural network that is able to learn from word embeddings, computed on text extracted by OCR, and from the image. We show that this approach boosts pure image accuracy by 3% on Tobacco3482 and RVL-CDIP augmented by our new QS-OCR text dataset (https://github.com/Quicksign/ocrized-text-dataset), even without clean text information.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper delivers a modest 3% accuracy lift on two document datasets by fusing CNN image features with OCR text embeddings and releases the corresponding text data as QS-OCR.

read the letter

The paper shows that adding text embeddings from OCR to an image-based document classifier improves accuracy by 3% on Tobacco3482 and RVL-CDIP. They also release the OCR text as a new dataset called QS-OCR. What the work does well is the data release. Providing the extracted text for these standard benchmarks lets other researchers test multimodal ideas without extra preprocessing. The approach itself is a straightforward combination of CNN image features and word embeddings, and the experiments use actual OCR output, which matches real-world conditions. The gain is real but limited in scope. It applies to fine-grained classification of administrative and archival documents where text content matters. The paper does not position this as a major architectural advance, which keeps expectations in line with the results. A minor concern is whether the fusion method adds much beyond what a text-only model could achieve, or how sensitive the improvement is to OCR quality. The full paper likely includes the architecture details and tables, but those would need checking for baseline strength and any statistical tests. This is the kind of paper that fits applied computer vision work on documents. Someone building a system for scanning and sorting forms or manuscripts could use the dataset and the reported lift. It is solid enough on the empirical side to go through peer review. I would recommend sending it to peer review.

Referee Report

0 major / 2 minor

Summary. The manuscript proposes a multimodal neural network for document image classification that fuses image features with word embeddings derived from OCR-extracted text. It reports that this fusion yields a 3% accuracy improvement over image-only baselines on the Tobacco3482 and RVL-CDIP datasets when augmented with the newly released QS-OCR text dataset, and that the gain holds even with noisy OCR output rather than clean text.

Significance. If the empirical result holds, the work provides evidence that OCR-derived text embeddings supply complementary signal to visual features for fine-grained document classification tasks. The public release of the QS-OCR dataset constitutes a concrete, reusable contribution that can support further multimodal experiments on standard benchmarks.

minor comments (2)

The abstract states the 3% boost but does not name the fusion architecture (e.g., late fusion of CNN and embedding features), the exact baselines, or any error bars/statistical tests; these details should be added to the abstract or highlighted in §3–4 for immediate evaluability.
The new QS-OCR dataset is introduced with a GitHub link; the manuscript should include a brief description of its construction, size, and OCR quality statistics in the experimental section to allow readers to assess robustness to OCR noise.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary of our work and the recommendation for minor revision. No specific major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity; purely empirical result

full rationale

The paper reports an empirical accuracy improvement from late fusion of CNN image features and word-embedding text features extracted via OCR. No derivation chain, equations, or parameter-fitting steps are presented as predictions; the 3% lift on Tobacco3482 and RVL-CDIP is a measured experimental outcome after describing the architecture and training protocol. No self-citations, uniqueness theorems, or ansatzes are invoked to justify the result, and the claim does not reduce to its inputs by construction. The work is self-contained as a standard multimodal classification experiment.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical machine-learning study. The abstract contains no explicit free parameters, mathematical axioms, or invented entities; the central claim rests entirely on reported experimental outcomes.

pith-pipeline@v0.9.0 · 5667 in / 1080 out tokens · 25825 ms · 2026-05-24T21:40:47.835256+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages

[1]

Evaluation of Deep Convolutional Nets for Document Image Classification and Retrieval,

A. W . Harleyet al., “Evaluation of Deep Convolutional Nets for Document Image Classification and Retrieval,” inICDAR, Aug. 2015

work page 2015
[2]

Document Analysis System,

K. Y . W onget al., “Document Analysis System,” IBM J. Res. Dev ., Nov. 1982

work page 1982
[3]

Tesseract: An Open-Source Optical Character Recognition Engine,

A. Kay, “Tesseract: An Open-Source Optical Character Recognition Engine, ”Linux J., July 2007

work page 2007
[4]

Classification of binary document images into textual or nontextual data blocks using neural network models,

D. X. Le et al., “Classification of binary document images into textual or nontextual data blocks using neural network models, ” Mach. V is. Appl., Sept. 1995

work page 1995
[5]

Segmentation and classification for mixed text/image documents using neural network,

S. Imade et al., “Segmentation and classification for mixed text/image documents using neural network,” inICDAR, Oct. 1993

work page 1993
[6]

Gradient-based learning applied to document recognition,

Y . LeCunet al., “Gradient-based learning applied to document recognition, ”Proc. IEEE, Nov. 1998

work page 1998
[7]

A survey of document image classification,

N. Chen and D. Blostein, “ A survey of document image classification, ”Int. J. Doc. Anal. Recogn., June 2007

work page 2007
[8]

Structural similarity for document image classification and retrieval,

J. Kumar et al. , “Structural similarity for document image classification and retrieval, ”P attern Recognit. Lett., 2014

work page 2014
[9]

Analysis of CNNs for Document Image Classification,

C. Tensmeyer and T . Martinez, “ Analysis of CNNs for Document Image Classification, ” inICDAR, Nov. 2017

work page 2017
[10]

Cutting the Error by Half: Investigation of V ery Deep CNN and Advanced Training Strategies for Document Image Classification,

M. Z. Afzal et al., “Cutting the Error by Half: Investigation of V ery Deep CNN and Advanced Training Strategies for Document Image Classification, ” inICDAR, Nov. 2017

work page 2017
[11]

Document Image Classification with Intra- Domain Transfer Learning and Stacked Generalization of Deep Convolutional Neural Networks,

A. Das et al. , “Document Image Classification with Intra- Domain Transfer Learning and Stacked Generalization of Deep Convolutional Neural Networks, ” inICPR, Aug. 2018

work page 2018
[12]

Identity Documents Classification as an Image Classification Problem,

R. Sicre et al., “Identity Documents Classification as an Image Classification Problem, ” inICIAP, Sept. 2017

work page 2017
[13]

dhSegment : A generic deep-learning approach for document segmentation,

S. Ares Oliveiraet al., “dhSegment : A generic deep-learning approach for document segmentation, ” inICFHR, Aug. 2018

work page 2018
[14]

Automatic Document Classification,

H. Borko and M. Bernick, “ Automatic Document Classification, ” J. ACM, Apr. 1963

work page 1963
[15]

One-Class SVMs for Document Classification,

L. M. Manevitz and M. Y ousef, “One-Class SVMs for Document Classification, ”J. Mach. Learn. Res, Dec. 2001

work page 2001
[16]

Statistical topic models for multi-label document classification,

T . N. Rubin et al., “Statistical topic models for multi-label document classification, ”Mach. Learn., July 2012

work page 2012
[17]

Efficient Estimation of W ord Representations in V ector Space,

T . Mikolovet al., “Efficient Estimation of W ord Representations in V ector Space, ” inICLR, Jan. 2013

work page 2013
[18]

Deep Contextualized W ord Representations,

M. Peterset al., “Deep Contextualized W ord Representations,” in NAACL, June 2018

work page 2018
[19]

Hierarchical Attention Networks for Document Classification,

Z. Y anget al., “Hierarchical Attention Networks for Document Classification, ” inNAACL, 2016

work page 2016
[20]

Embedded Textual Content for Document Image Classification with CNNs,

L. Noce et al., “Embedded Textual Content for Document Image Classification with CNNs, ” inACM DocEng, 2016

work page 2016
[21]

Learning to Extract Semantic Structure from Documents Using Multimodal FCNNs,

X. Y anget al., “Learning to Extract Semantic Structure from Documents Using Multimodal FCNNs, ” inCVPR, July 2017

work page 2017
[22]

Improving Classification of an Industrial Document Image Database by Combining Visual and Textual Features,

O. Augereauet al., “Improving Classification of an Industrial Document Image Database by Combining Visual and Textual Features, ” inIAPR W orkshop, Apr. 2014

work page 2014
[23]

MobileNetV2: Inverted Residuals and Linear Bottlenecks,

M. Sandleret al., “MobileNetV2: Inverted Residuals and Linear Bottlenecks, ” inCVPR, June 2018

work page 2018
[24]

CNN Features Off-the-Shelf: An Astounding Baseline for Recognition,

A. S. Razavian et al. , “CNN Features Off-the-Shelf: An Astounding Baseline for Recognition, ” inCVPRW, June 2014

work page 2014
[25]

Deep Residual Learning for Image Recognition,

K. He et al., “Deep Residual Learning for Image Recognition, ” in CVPR, June 2016

work page 2016
[26]

A Threshold Selection Method from Gray-Level Histograms,

N. Otsu, “ A Threshold Selection Method from Gray-Level Histograms, ”IEEE Trans. Syst. Man. Cybern., Jan. 1979

work page 1979
[27]

Glove: Global V ectors for W ord Representation,

J. Pennington et al. , “Glove: Global V ectors for W ord Representation, ” inEMNLP, Oct. 2014

work page 2014
[28]

Mimicking W ord Embeddings using Subword RNNs,

Y . Pinteret al., “Mimicking W ord Embeddings using Subword RNNs, ” inEMNLP, Sept. 2017

work page 2017
[29]

Enriching W ord V ectors with Subword Information,

P . Bojanowskiet al., “Enriching W ord V ectors with Subword Information, ”Trans. Assoc. Comput. Linguist., 2017

work page 2017
[30]

Bag of Tricks for Efficient T ext Classification,

A. Joulinet al., “Bag of Tricks for Efficient T ext Classification, ” in EACL, 2017

work page 2017
[31]

Magnitude: A Fast, Efficient Universal V ector Embedding Utility Package,

A. Patelet al., “Magnitude: A Fast, Efficient Universal V ector Embedding Utility Package, ” inEMNLP, Nov. 2018

work page 2018
[32]

A Simple but T ough-to-Beat Baseline for Sentence Embeddings,

S. Arora et al. , “ A Simple but T ough-to-Beat Baseline for Sentence Embeddings, ” inICLR, Nov. 2016

work page 2016
[33]

spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing,

M. Honnibal and I. Montani, “spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing, ”T o appear, 2017

work page 2017
[34]

Multimodal deep learning for robust RGB-D object recognition,

A. Eitel et al., “Multimodal deep learning for robust RGB-D object recognition, ” inIROS, Sept. 2015

work page 2015
[35]

Delving Deep into Rectifiers,

K. He et al., “Delving Deep into Rectifiers, ” inICCV, 2015

work page 2015
[36]

Convolutional Neural Networks for Sentence Classification,

Y . Kim, “Convolutional Neural Networks for Sentence Classification, ” inEMNLP, Oct. 2014

work page 2014
[37]

Nielsen, Usability Engineering

J. Nielsen, Usability Engineering. 1993

work page 1993
[38]

Xception: Deep Learning with Depthwise Separable Convolutions,

F . Chollet, “Xception: Deep Learning with Depthwise Separable Convolutions, ” inCVPR, July 2017

work page 2017
[39]

Robust W ord V ectors: Context-Informed Embeddings for Noisy T exts,

V . Malykhet al., “Robust W ord V ectors: Context-Informed Embeddings for Noisy T exts, ” inEMNLP W-NUT, 2018

work page 2018

[1] [1]

Evaluation of Deep Convolutional Nets for Document Image Classification and Retrieval,

A. W . Harleyet al., “Evaluation of Deep Convolutional Nets for Document Image Classification and Retrieval,” inICDAR, Aug. 2015

work page 2015

[2] [2]

Document Analysis System,

K. Y . W onget al., “Document Analysis System,” IBM J. Res. Dev ., Nov. 1982

work page 1982

[3] [3]

Tesseract: An Open-Source Optical Character Recognition Engine,

A. Kay, “Tesseract: An Open-Source Optical Character Recognition Engine, ”Linux J., July 2007

work page 2007

[4] [4]

Classification of binary document images into textual or nontextual data blocks using neural network models,

D. X. Le et al., “Classification of binary document images into textual or nontextual data blocks using neural network models, ” Mach. V is. Appl., Sept. 1995

work page 1995

[5] [5]

Segmentation and classification for mixed text/image documents using neural network,

S. Imade et al., “Segmentation and classification for mixed text/image documents using neural network,” inICDAR, Oct. 1993

work page 1993

[6] [6]

Gradient-based learning applied to document recognition,

Y . LeCunet al., “Gradient-based learning applied to document recognition, ”Proc. IEEE, Nov. 1998

work page 1998

[7] [7]

A survey of document image classification,

N. Chen and D. Blostein, “ A survey of document image classification, ”Int. J. Doc. Anal. Recogn., June 2007

work page 2007

[8] [8]

Structural similarity for document image classification and retrieval,

J. Kumar et al. , “Structural similarity for document image classification and retrieval, ”P attern Recognit. Lett., 2014

work page 2014

[9] [9]

Analysis of CNNs for Document Image Classification,

C. Tensmeyer and T . Martinez, “ Analysis of CNNs for Document Image Classification, ” inICDAR, Nov. 2017

work page 2017

[10] [10]

Cutting the Error by Half: Investigation of V ery Deep CNN and Advanced Training Strategies for Document Image Classification,

M. Z. Afzal et al., “Cutting the Error by Half: Investigation of V ery Deep CNN and Advanced Training Strategies for Document Image Classification, ” inICDAR, Nov. 2017

work page 2017

[11] [11]

Document Image Classification with Intra- Domain Transfer Learning and Stacked Generalization of Deep Convolutional Neural Networks,

A. Das et al. , “Document Image Classification with Intra- Domain Transfer Learning and Stacked Generalization of Deep Convolutional Neural Networks, ” inICPR, Aug. 2018

work page 2018

[12] [12]

Identity Documents Classification as an Image Classification Problem,

R. Sicre et al., “Identity Documents Classification as an Image Classification Problem, ” inICIAP, Sept. 2017

work page 2017

[13] [13]

dhSegment : A generic deep-learning approach for document segmentation,

S. Ares Oliveiraet al., “dhSegment : A generic deep-learning approach for document segmentation, ” inICFHR, Aug. 2018

work page 2018

[14] [14]

Automatic Document Classification,

H. Borko and M. Bernick, “ Automatic Document Classification, ” J. ACM, Apr. 1963

work page 1963

[15] [15]

One-Class SVMs for Document Classification,

L. M. Manevitz and M. Y ousef, “One-Class SVMs for Document Classification, ”J. Mach. Learn. Res, Dec. 2001

work page 2001

[16] [16]

Statistical topic models for multi-label document classification,

T . N. Rubin et al., “Statistical topic models for multi-label document classification, ”Mach. Learn., July 2012

work page 2012

[17] [17]

Efficient Estimation of W ord Representations in V ector Space,

T . Mikolovet al., “Efficient Estimation of W ord Representations in V ector Space, ” inICLR, Jan. 2013

work page 2013

[18] [18]

Deep Contextualized W ord Representations,

M. Peterset al., “Deep Contextualized W ord Representations,” in NAACL, June 2018

work page 2018

[19] [19]

Hierarchical Attention Networks for Document Classification,

Z. Y anget al., “Hierarchical Attention Networks for Document Classification, ” inNAACL, 2016

work page 2016

[20] [20]

Embedded Textual Content for Document Image Classification with CNNs,

L. Noce et al., “Embedded Textual Content for Document Image Classification with CNNs, ” inACM DocEng, 2016

work page 2016

[21] [21]

Learning to Extract Semantic Structure from Documents Using Multimodal FCNNs,

X. Y anget al., “Learning to Extract Semantic Structure from Documents Using Multimodal FCNNs, ” inCVPR, July 2017

work page 2017

[22] [22]

Improving Classification of an Industrial Document Image Database by Combining Visual and Textual Features,

O. Augereauet al., “Improving Classification of an Industrial Document Image Database by Combining Visual and Textual Features, ” inIAPR W orkshop, Apr. 2014

work page 2014

[23] [23]

MobileNetV2: Inverted Residuals and Linear Bottlenecks,

M. Sandleret al., “MobileNetV2: Inverted Residuals and Linear Bottlenecks, ” inCVPR, June 2018

work page 2018

[24] [24]

CNN Features Off-the-Shelf: An Astounding Baseline for Recognition,

A. S. Razavian et al. , “CNN Features Off-the-Shelf: An Astounding Baseline for Recognition, ” inCVPRW, June 2014

work page 2014

[25] [25]

Deep Residual Learning for Image Recognition,

K. He et al., “Deep Residual Learning for Image Recognition, ” in CVPR, June 2016

work page 2016

[26] [26]

A Threshold Selection Method from Gray-Level Histograms,

N. Otsu, “ A Threshold Selection Method from Gray-Level Histograms, ”IEEE Trans. Syst. Man. Cybern., Jan. 1979

work page 1979

[27] [27]

Glove: Global V ectors for W ord Representation,

J. Pennington et al. , “Glove: Global V ectors for W ord Representation, ” inEMNLP, Oct. 2014

work page 2014

[28] [28]

Mimicking W ord Embeddings using Subword RNNs,

Y . Pinteret al., “Mimicking W ord Embeddings using Subword RNNs, ” inEMNLP, Sept. 2017

work page 2017

[29] [29]

Enriching W ord V ectors with Subword Information,

P . Bojanowskiet al., “Enriching W ord V ectors with Subword Information, ”Trans. Assoc. Comput. Linguist., 2017

work page 2017

[30] [30]

Bag of Tricks for Efficient T ext Classification,

A. Joulinet al., “Bag of Tricks for Efficient T ext Classification, ” in EACL, 2017

work page 2017

[31] [31]

Magnitude: A Fast, Efficient Universal V ector Embedding Utility Package,

A. Patelet al., “Magnitude: A Fast, Efficient Universal V ector Embedding Utility Package, ” inEMNLP, Nov. 2018

work page 2018

[32] [32]

A Simple but T ough-to-Beat Baseline for Sentence Embeddings,

S. Arora et al. , “ A Simple but T ough-to-Beat Baseline for Sentence Embeddings, ” inICLR, Nov. 2016

work page 2016

[33] [33]

spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing,

M. Honnibal and I. Montani, “spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing, ”T o appear, 2017

work page 2017

[34] [34]

Multimodal deep learning for robust RGB-D object recognition,

A. Eitel et al., “Multimodal deep learning for robust RGB-D object recognition, ” inIROS, Sept. 2015

work page 2015

[35] [35]

Delving Deep into Rectifiers,

K. He et al., “Delving Deep into Rectifiers, ” inICCV, 2015

work page 2015

[36] [36]

Convolutional Neural Networks for Sentence Classification,

Y . Kim, “Convolutional Neural Networks for Sentence Classification, ” inEMNLP, Oct. 2014

work page 2014

[37] [37]

Nielsen, Usability Engineering

J. Nielsen, Usability Engineering. 1993

work page 1993

[38] [38]

Xception: Deep Learning with Depthwise Separable Convolutions,

F . Chollet, “Xception: Deep Learning with Depthwise Separable Convolutions, ” inCVPR, July 2017

work page 2017

[39] [39]

Robust W ord V ectors: Context-Informed Embeddings for Noisy T exts,

V . Malykhet al., “Robust W ord V ectors: Context-Informed Embeddings for Noisy T exts, ” inEMNLP W-NUT, 2018

work page 2018