arxiv: 2606.23524 · v1 · pith:DJOUVCEJnew · submitted 2026-06-22 · 💻 cs.CV

Scaling State-Space Models from Lines to Paragraphs: An Ablation of Mamba-based OCR

Pith reviewed 2026-06-26 08:44 UTC · model grok-4.3

classification 💻 cs.CV
keywords OCR · Mamba · state-space models · handwriting recognition · sequence models · document transcription · transformer comparison · data efficiency

The pith

Mamba-based OCR matches transformers below 1% CER on synthetic paragraphs while running 1.4 to 4.5 times faster, but lags on real handwriting mainly because its autoregressive decoder requires more data for long sequences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests how Mamba state-space models scale in end-to-end OCR when inputs grow from short lines to full paragraphs of up to 1000 characters. A hyperparameter study on synthetic data identifies the recurrent state dimension and the expansion factor as the main controls for long-sequence accuracy. In a direct comparison, both the Mamba and Transformer models stay under 1% character error rate (CER) on clean synthetic paragraphs, and Mamba's speedup increases with length. On real IAM handwriting the Mamba model records higher error rates of 8.2% on lines and 10.0% on paragraphs, versus 4.2% and 3.5% for the Transformer; controlled ablations trace a substantial part of the difference to the SSM decoder's greater data hunger on extended sequences.

Core claim

On clean synthetic paragraphs both the Mamba SSM and the Transformer baseline achieve below 1% CER across lengths from 100 to 1000 characters, with the SSM running 1.4 to 4.5 times faster as length increases. On IAM handwriting the SSM reaches 8.2% CER on lines and 10.0% on paragraphs compared to 4.2% and 3.5% for the Transformer. Controlled experiments attribute a substantial part of the gap to data scarcity, showing the autoregressive SSM decoder is markedly data-hungry on long sequences.
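Every number in the claim above is a character error rate. As a reminder of what that measures, here is a minimal sketch of the usual definition: the Levenshtein edit distance between reference and hypothesis, divided by the reference length. The supplied text does not say how the paper normalizes or aggregates CER, so the per-string normalization below is an assumption, and the function names are illustrative.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference character
                            curr[j - 1] + 1,      # insert a hypothesis character
                            prev[j - 1] + cost))  # substitute, or match for free
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate of one hypothesis against one reference (assumed normalization)."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return levenshtein(ref, hyp) / len(ref)
```

Under this definition, a 1000-character paragraph at 1% CER contains about ten character-level edits, while the 10.0% reported on IAM paragraphs corresponds to roughly one edit in every ten characters.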

What carries the argument

The autoregressive Mamba decoder whose recurrent state dimension and expansion factor serve as the dominant levers for accuracy on long OCR sequences.
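To make the two levers concrete, the sketch below runs a heavily simplified state-space recurrence and counts the size of the state it carries. It is not the paper's decoder: it is a time-invariant, diagonal recurrence on one scalar channel, whereas Mamba makes its parameters input-dependent. The relation `d_inner = expand * d_model` follows my understanding of the public Mamba reference implementation and should be read as an assumption here; all function names are illustrative.

```python
def ssm_state_size(d_model: int, d_state: int, expand: int) -> int:
    """Entries in the recurrent state of one layer, assuming d_inner = expand * d_model."""
    d_inner = expand * d_model
    return d_inner * d_state


def scan_channel(xs, a, b, c):
    """Run h_t = a * h_{t-1} + b * x_t and y_t = sum(c * h_t) over a scalar sequence.

    a, b and c are lists of length d_state (diagonal, non-selective simplification).
    Returns the outputs and the final state.
    """
    h = [0.0] * len(a)
    ys = []
    for x in xs:
        # The state has a fixed size, however long the sequence becomes.
        h = [a_n * h_n + b_n * x for a_n, h_n, b_n in zip(a, h, b)]
        ys.append(sum(c_n * h_n for c_n, h_n in zip(c, h)))
    return ys, h
```

Because everything the decoder remembers about earlier characters has to fit into that fixed-size state, it is plausible that its two size factors dominate accuracy on long paragraphs; the sketch illustrates the mechanism only and does not reproduce the ablation.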

If this is right

  • SSMs deliver linear-time decoding whose speed advantage over quadratic attention grows with paragraph length on clean data.
  • Tuning the recurrent state dimension and expansion factor enables SSMs to maintain low error rates as sequence length increases.
  • A substantial share of the real-data accuracy gap is attributable to limited training data rather than to the SSM architecture itself, so more long-sequence data should narrow it.
  • SSMs become a practical choice for large-scale document transcription once sufficient training data for long sequences is available.
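The first point in the list above can be illustrated with a toy operation count for autoregressive decoding. It assumes that an attention decoder with a key-value cache looks at every previously generated token at each step, while an SSM decoder performs a fixed-cost state update per step. The unit costs are arbitrary, so the ratio shows the growth trend only and is not a prediction of the measured 1.4 to 4.5 times speedups, which also include the visual encoder and hardware effects.

```python
def attention_decode_ops(length: int) -> int:
    """Step t attends over t cached tokens, so total work is 1 + 2 + ... + length."""
    return length * (length + 1) // 2


def ssm_decode_ops(length: int, state_cost: int = 1) -> int:
    """Each step updates a fixed-size state, so total work is length * state_cost."""
    return length * state_cost


def ops_ratio(length: int, state_cost: int = 1) -> float:
    """Toy attention-to-SSM work ratio; grows linearly with sequence length."""
    return attention_decode_ops(length) / ssm_decode_ops(length, state_cost)
```

In this model the ratio is `(length + 1) / (2 * state_cost)`, which is why the advantage widens as paragraphs get longer even though the constant factors are unknown.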

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Larger real handwriting corpora could allow Mamba OCR to match Transformer accuracy without architectural changes.
  • Data-augmentation or efficiency techniques aimed at long-sequence SSM decoders might broaden their use in data-limited handwriting domains.
  • The same scaling and data-ablation approach could be applied to other long-sequence vision-language tasks to map where SSMs are data-efficient.

Load-bearing premise

The controlled experiments isolate data scarcity as the primary cause of the performance gap on real handwriting rather than differences in optimization or model capacity.

What would settle it

Train the Mamba OCR model on a substantially larger real handwriting paragraph dataset and measure whether its CER on IAM paragraphs approaches the Transformer's 3.5%.

Figures

Figures reproduced from arXiv: 2606.23524 by Clément Chatelain, Merveilles Agbeti-Messan, Pierrick Tranouez, Stéphane Nicolas, Thierry Paquet.

Figure 1: Mamba-OCR (AR variant) architecture. The shared CNN visual encoder extracts features enriched with 2D sinusoidal positional encoding and flattened into a sequence. The bidirectional Mamba connector models global visual context using a shared Mamba block processed forward and backward. Segment embeddings distinguish visual from text tokens before concatenation. The autoregressive Mamba decoder (4 unidirecti…
Figure 2: Inference latency scaling with sequence length on clean synthetic para…
Original abstract

End-to-end OCR increasingly relies on autoregressive sequence models, where the quadratic cost of Transformer attention limits efficient transcription of long, paragraph-level text. State-Space Models (SSMs) such as Mamba offer linear-time decoding and have recently been shown to match Transformer accuracy on printed historical lines, but their behavior as sequences grow from short lines to full paragraphs, and their generalization to handwriting, remain poorly understood. We study how a Mamba-based OCR recognizer scales from lines to paragraphs. We first conduct a systematic exploration of its four core hyperparameters (decoder depth, state dimension, expansion factor, and connector depth) on synthetic paragraphs from 100 to 1,000 characters, identifying the recurrent state dimension and the expansion factor as the dominant levers for long-sequence accuracy. We then compare the recognizer against a Transformer baseline trained under an identical protocol. On clean synthetic paragraphs, both models stay below 1% CER at every length while the SSM runs 1.4 to 4.5 times faster, the speedup growing with sequence length. On real handwriting, however, the SSM lags clearly behind: it reaches 8.2% CER on IAM lines and 10.0% on IAM paragraphs, against 4.2% and 3.5% for the Transformer baseline. Through controlled experiments we show that a substantial part of this gap stems from data scarcity rather than from an intrinsic architectural limit: the autoregressive SSM decoder is markedly data-hungry on long sequences. Our study clarifies when SSMs are a practical choice for large-scale document transcription and when they are not.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that Mamba-based SSMs for end-to-end OCR match Transformer accuracy (<1% CER) on clean synthetic paragraphs of 100-1000 characters while delivering 1.4-4.5x speedups that increase with length, after identifying state dimension and expansion factor as the dominant hyperparameters via ablation of decoder depth, state dimension, expansion factor, and connector depth. On real IAM handwriting the SSM underperforms (8.2% CER on lines, 10.0% on paragraphs vs. 4.2% and 3.5% for the Transformer), with the gap attributed to the autoregressive SSM decoder being markedly data-hungry on long sequences, as shown by controlled experiments.

Significance. If the controlled-experiment results hold, the work supplies concrete guidance on when linear-time SSMs are practical for paragraph-level document transcription versus when data requirements make them inferior to quadratic-attention models. The systematic hyperparameter study on synthetic data and the direct speed/accuracy comparison constitute reusable empirical benchmarks for long-sequence OCR architectures.

major comments (2)
  1. [Abstract] The attribution that 'a substantial part of this gap stems from data scarcity rather than from an intrinsic architectural limit' rests on 'controlled experiments' whose design (matching of optimizer, learning-rate schedule, data augmentation, and effective capacity while varying only training-set size) is not described, so the causal claim that the SSM decoder is 'markedly data-hungry' cannot be evaluated from the supplied text.
  2. [IAM results paragraph] Without the quantitative outcomes or protocol details of the controlled experiments that vary training-set size, it is impossible to confirm that the observed 8.2–10.0% vs. 3.5–4.2% CER gap is isolated to data scarcity rather than unaccounted differences in optimization or capacity, which is load-bearing for the paper's main practical conclusion.
minor comments (2)
  1. [Hyperparameter ablation section] The ranges and grid used for the four core hyperparameters should be tabulated so readers can reproduce the ablation that identified state dimension and expansion factor as dominant.
  2. [Synthetic paragraphs results] Specify the exact sequence lengths at which the 1.4x and 4.5x speedups were measured.
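The second minor comment asks at which lengths the speedups were measured. A speedup figure of this kind is normally the ratio of two latencies at the same input length, so one pair of measurements per length is needed. The harness below is a minimal sketch of that bookkeeping; the function names, the warm-up count and the use of the median are assumptions, not the paper's protocol.

```python
import statistics
import time


def median_latency(fn, n_runs: int = 5, warmup: int = 1) -> float:
    """Median wall-clock time of fn() in seconds, after discarding warm-up calls."""
    for _ in range(warmup):
        fn()
    times = []
    for _ in range(n_runs):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    return statistics.median(times)


def speedup(baseline_latency: float, candidate_latency: float) -> float:
    """How many times faster the candidate is than the baseline."""
    if candidate_latency <= 0:
        raise ValueError("candidate latency must be positive")
    return baseline_latency / candidate_latency
```

When timing models on a GPU, the device usually has to be synchronized before each clock read, otherwise asynchronous kernel launches make the measured time too short; that step is omitted here to keep the sketch framework-neutral.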

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive comments. We agree that the description of the controlled experiments is insufficient in the current manuscript and will expand the text to include the full protocol and quantitative results. This addresses the core concern about evaluating the data-scarcity claim.

  1. Referee: [Abstract] The attribution that 'a substantial part of this gap stems from data scarcity rather than from an intrinsic architectural limit' rests on 'controlled experiments' whose design (matching of optimizer, learning-rate schedule, data augmentation, and effective capacity while varying only training-set size) is not described, so the causal claim that the SSM decoder is 'markedly data-hungry' cannot be evaluated from the supplied text.

    Authors: We agree that the current text does not describe the controlled-experiment protocol. In the revision we will add a dedicated subsection (under the IAM experiments) that specifies: (i) the exact optimizer and learning-rate schedule used for both models, (ii) the data-augmentation pipeline, (iii) how effective capacity was matched (parameter count and state/expansion settings), and (iv) the training-set sizes varied while holding all other factors fixed. The subsection will also report the resulting CER curves, allowing readers to assess the claim that the SSM decoder is markedly data-hungry on long sequences. revision: yes

  2. Referee: [IAM results paragraph] Without the quantitative outcomes or protocol details of the controlled experiments that vary training-set size, it is impossible to confirm that the observed 8.2–10.0% vs. 3.5–4.2% CER gap is isolated to data scarcity rather than unaccounted differences in optimization or capacity, which is load-bearing for the paper's main practical conclusion.

    Authors: We accept this criticism. The revised IAM results paragraph will be accompanied by the quantitative outcomes (CER vs. training-set size) and the complete protocol details listed above. These additions will make explicit that the performance gap is isolated to data requirements under matched optimization and capacity conditions, thereby supporting the paper's practical conclusion. revision: yes
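One way to vary only the training-set size, as the responses above promise, is to draw nested subsets from a single fixed shuffle, so that every smaller training set is contained in every larger one and both models see identical samples at each size. The sketch below shows that sampling step only. It is a hypothetical illustration, not the authors' procedure; the function name, the fractions and the seed are assumptions.

```python
import random


def nested_subsets(n_samples: int, fractions, seed: int = 0):
    """Map each fraction to a list of sample indices; smaller subsets are prefixes of larger ones."""
    order = list(range(n_samples))
    random.Random(seed).shuffle(order)  # one fixed shuffle shared by all sizes and models
    subsets = {}
    for frac in sorted(fractions):
        if not 0.0 < frac <= 1.0:
            raise ValueError("fractions must lie in (0, 1]")
        k = max(1, round(frac * n_samples))
        subsets[frac] = order[:k]
    return subsets
```

Training both architectures on the same index lists, with optimizer, schedule and augmentation held fixed, would give the CER versus training-set-size curves the referee asks for.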

Circularity Check

0 steps flagged

No circularity: results are direct empirical measurements

The paper reports ablation studies, hyperparameter sweeps, and direct CER/speed comparisons between Mamba-based and Transformer OCR models on synthetic and real handwriting data. All load-bearing claims (accuracy scaling, speedup factors, performance gap on IAM) are presented as outcomes of training and evaluation protocols rather than derivations, fitted parameters renamed as predictions, or self-citation chains. The attribution of the real-data gap to data scarcity is framed as an interpretation of controlled experiments; no equations, uniqueness theorems, or ansatzes are invoked that would reduce the reported results to their own inputs by construction. This is a standard empirical study whose central claims remain independently falsifiable via replication of the training runs.

Axiom & Free-Parameter Ledger

4 free parameters · 2 axioms · 0 invented entities

The central empirical claims rest on standard machine learning training assumptions and hyperparameter exploration rather than new postulates or fitted constants for the main result.

free parameters (4)
  • decoder depth
    One of the four core hyperparameters explored in the ablation.
  • state dimension
    Dominant lever for long-sequence accuracy, explored rather than fixed ad hoc.
  • expansion factor
    Dominant lever identified for accuracy.
  • connector depth
    Explored hyperparameter.
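A sweep over these four hyperparameters can be enumerated as a Cartesian grid, which is also a convenient form for the table the referee requests. The values below are placeholders chosen for illustration only; the paper's actual ranges are not given in the supplied text, and the key names are assumptions.

```python
from itertools import product

# Placeholder values, NOT the paper's grid.
GRID = {
    "decoder_depth": [2, 4],
    "state_dim": [16, 64],
    "expansion_factor": [1, 2],
    "connector_depth": [1, 2],
}


def enumerate_configs(grid):
    """Return one dict per combination of hyperparameter values."""
    names = list(grid)
    return [dict(zip(names, values))
            for values in product(*(grid[name] for name in names))]
```

With two values per axis the full grid has 16 configurations. Varying one axis at a time around a reference configuration would need fewer runs, at the cost of missing interactions between state dimension and expansion factor.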
axioms (2)
  • domain assumption: The training protocol is identical and fair for both Mamba and Transformer models.
    The comparison relies on this to attribute differences to architecture and data.
  • domain assumption: Synthetic data adequately represents the scaling behavior for real-world OCR tasks.
    Used to identify hyperparameters before real data tests.

