arxiv: 2605.06475 · v1 · pith:XD5JTTSWnew · submitted 2026-05-07 · 💻 cs.AI · cs.CV

Probabilistic Dating of Historical Manuscripts via Evidential Deep Regression on Visual Script Features

Pith reviewed 2026-05-08 09:43 UTC · model grok-4.3

classification 💻 cs.AI cs.CV
keywords historical manuscript dating · evidential deep learning · uncertainty estimation · probabilistic regression · computer vision · medieval documents · script analysis

The pith

Evidential deep regression dates historical manuscripts to within about 5 years on average from visual features

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors treat the dating of historical manuscript pages as a continuous regression task rather than classifying pages into century bins. They use an evidential neural network that outputs a full distribution over possible years, together with separate aleatoric and epistemic uncertainty estimates, from a single forward pass over one image patch of the script. Tested on patches from three medieval codices, the model achieves a mean absolute error of 5.4 years and better calibration (PICP 92.6%) than MC Dropout and Deep Ensembles. This precision matters for historians because it provides dates much finer than traditional century labels and indicates when a prediction is reliable. Predicted uncertainty also grows as image quality degrades, and spatial uncertainty maps can pinpoint problematic regions of the script.

Core claim

The paper establishes that an evidential deep regression model using a Normal-Inverse-Gamma head on visual script features can predict manuscript dates continuously with 5.4 years mean absolute error and 92.6% prediction interval coverage probability on the DIVA-HisDB benchmark, while decomposing uncertainties and outperforming sampling-based methods in efficiency and calibration.

What carries the argument

The Normal-Inverse-Gamma evidential output head attached to an EfficientNet-B2 backbone, which models the predictive distribution directly and is trained with a joint negative-log-likelihood and evidence-regularization objective to enable uncertainty decomposition in regression.
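
As a minimal sketch of what such a head computes, the block below follows the standard deep evidential regression formulation of Amini et al. (2020), which the paper cites. The paper's exact loss weighting is not stated on this page, so the weight `lam`, its default value, and all function names are illustrative assumptions. Given the four NIG outputs (gamma, nu, alpha, beta) for one patch, it returns the point prediction, the aleatoric and epistemic variances, and the joint loss.

```python
import math

def nig_moments(gamma, nu, alpha, beta):
    """Point prediction and uncertainty split for one NIG output (needs alpha > 1)."""
    if alpha <= 1.0 or nu <= 0.0 or beta <= 0.0:
        raise ValueError("require alpha > 1, nu > 0, beta > 0")
    aleatoric = beta / (alpha - 1.0)         # E[sigma^2]: noise inherent in the data
    epistemic = beta / (nu * (alpha - 1.0))  # Var[mu]: uncertainty about the mean
    return gamma, aleatoric, epistemic

def nig_nll(y, gamma, nu, alpha, beta):
    """Negative log-likelihood of y under the Student-t marginal of the NIG."""
    omega = 2.0 * beta * (1.0 + nu)
    return (0.5 * math.log(math.pi / nu)
            - alpha * math.log(omega)
            + (alpha + 0.5) * math.log(nu * (y - gamma) ** 2 + omega)
            + math.lgamma(alpha) - math.lgamma(alpha + 0.5))

def evidence_reg(y, gamma, nu, alpha):
    """Penalise high evidence (2*nu + alpha) on samples with large error."""
    return abs(y - gamma) * (2.0 * nu + alpha)

def joint_loss(y, gamma, nu, alpha, beta, lam=0.01):
    """NLL plus weighted evidence regulariser; lam is a hypothetical default."""
    return nig_nll(y, gamma, nu, alpha, beta) + lam * evidence_reg(y, gamma, nu, alpha)

if __name__ == "__main__":
    # Hypothetical head output for a patch whose true year is 1210.
    year, aleatoric, epistemic = nig_moments(1200.0, 2.0, 3.0, 4.0)
    print(year, aleatoric, epistemic)
    print(joint_loss(1210.0, 1200.0, 2.0, 3.0, 4.0))
```

In this formulation the ratio of aleatoric to epistemic variance is simply nu, so a head that reports high nu is claiming that most of its remaining uncertainty comes from the data rather than from the model.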

If this is right

  • The 20% of patches with lowest uncertainty achieve 0.5 years MAE.
  • Aleatoric uncertainty predicts dating error with Spearman correlation 0.729 and rises with worsening image degradation.
  • Spatial maps of uncertainty identify script regions responsible for high aleatoric uncertainty.
  • Aggregating predictions at the page level reduces MAE to 4.5 years.
  • Single-pass inference provides better calibration than MC Dropout or Deep Ensembles at 5 times lower cost.
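
The selective-prediction and rank-correlation claims in the list above can be checked with very little machinery. The sketch below is illustrative only: the function names and the toy numbers are assumptions, not the paper's code or data. It sorts patches by predicted uncertainty, computes MAE on the most certain fraction, and computes Spearman's rho between uncertainty and absolute error.

```python
import math

def selective_mae(errors, uncertainties, keep_fraction):
    """MAE over the keep_fraction of samples with the lowest predicted uncertainty."""
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    order = sorted(range(len(errors)), key=lambda i: uncertainties[i])
    n_keep = max(1, int(round(keep_fraction * len(errors))))
    return sum(abs(errors[i]) for i in order[:n_keep]) / n_keep

def _ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = _ranks(x), _ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

if __name__ == "__main__":
    # Toy values (hypothetical): absolute dating errors in years and predicted uncertainty.
    errors = [0.5, 1.0, 8.0, 12.0, 3.0]
    uncertainty = [0.1, 0.2, 0.9, 1.5, 0.4]
    print(selective_mae(errors, uncertainty, 0.4))
    print(spearman(uncertainty, errors))
```

Sweeping `keep_fraction` from 1.0 down to 0.2 traces the kind of accuracy-coverage curve that Figure 6 is described as plotting.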

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This framework could be applied to other historical image analysis tasks where precise continuous labels are scarce but visual patterns evolve over time.
  • The uncertainty maps might assist paleographers in focusing on diagnostic script features for manual verification.
  • If scaled to larger collections, it would allow probabilistic timelines of manuscript production across archives.

Load-bearing premise

Visual script features extracted from patches of only three codices contain sufficient information to support accurate continuous year regression that generalizes, and the Normal-Inverse-Gamma evidential framework decomposes uncertainties correctly.

What would settle it

A test on manuscript pages from additional codices not seen during training, measuring if the mean absolute error exceeds 10 years or if the prediction interval coverage probability falls significantly below 90%.
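
A rough sketch of how that check could be scored, assuming per-page predictions and prediction intervals are already available for the unseen codices. The function names are hypothetical, and the plain thresholds stand in for the significance test that "significantly below" would really need.

```python
def mae(y_true, y_pred):
    """Mean absolute error in years."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def picp(y_true, lower, upper):
    """Prediction interval coverage probability: share of true dates inside [lower, upper]."""
    hits = sum(1 for t, lo, hi in zip(y_true, lower, upper) if lo <= t <= hi)
    return hits / len(y_true)

def passes_external_check(y_true, y_pred, lower, upper,
                          mae_limit=10.0, picp_floor=0.90):
    """True if MAE stays within mae_limit and coverage stays at or above picp_floor.

    Simplification: a plain threshold comparison, with no confidence interval
    around either metric.
    """
    return mae(y_true, y_pred) <= mae_limit and picp(y_true, lower, upper) >= picp_floor

if __name__ == "__main__":
    # Hypothetical held-out pages: true years, predictions, and +/- 10 year intervals.
    y_true = [1100, 1150, 1200, 1250]
    y_pred = [1104, 1148, 1215, 1251]
    lower = [p - 10 for p in y_pred]
    upper = [p + 10 for p in y_pred]
    print(mae(y_true, y_pred), picp(y_true, lower, upper))
    print(passes_external_check(y_true, y_pred, lower, upper))
```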

Figures

Figures reproduced from arXiv: 2605.06475 by Ranjith Chodavarapu.

Figure 1: Calibration comparison. Evidential model (blue, PICP=92.6%) closely tracks the perfect calibration diagonal.
Figure 2: Uncertainty vs. error for total, aleatoric, and epistemic components. Red markers: per-bin means. Aleatoric …
Figure 3: Spatial decomposition of aleatoric (hot, cols 2–3) vs. epistemic (blue, cols 4–5) uncertainty for one patch per …
Figure 4: t-SNE of EfficientNet-B2 features (3,000 test patches). Left: coloured by manuscript. Right: coloured by year …
Figure 5: GradCAM attention on manuscript patches. Each sample shows the original patch, overlay, and heatmap. The …
Figure 6: MAE vs. percentage of patches retained (most certain first, blue) and mean uncertainty (orange dashed). Filtering …
Figure 7: Per-manuscript error distributions (top row) and aleatoric uncertainty vs. error scatter (bottom row). CB55 …
Figure 8: MAE and mean uncertainty under eight degradation conditions. Uncertainty rises monotonically with blur …
Figure 9: Page-level predictions with 90% prediction intervals (18 test pages, sorted by true date). All true dates fall within …
Figure 10: Reliability diagram: mean predicted uncertainty vs. mean absolute error per bin. Points below the diagonal …
Original abstract

We introduce a probabilistic approach for dating historical manuscript pages from visual features alone. Instead of aggregating centuries into classes as is standard in the previous literature, we pose dating as an evidential deep regression problem over a continuous year axis, allowing our neural network to output a full predictive distribution with decomposed aleatoric and epistemic uncertainty in a single forward pass. Our architecture combines an EfficientNet-B2 backbone with a Normal-Inverse-Gamma (NIG) output head trained with a joint negative-log-likelihood and evidence-regularization objective. On the DIVA-HisDB benchmark (150 pages, 3 medieval codices, 151,936 patches), our model scores a test MAE of 5.4 years, well below the 50-year century-label supervision granularity, with 93% of patches within 5 years and 97% within 10 years. Our approach achieves PICP=92.6%, the best calibration among all compared methods, in a single forward pass, outperforming MC Dropout (PICP=88.2%, 50 passes) and Deep Ensembles (PICP=79.7%, 5 models) at 5× lower inference cost. Uncertainty decomposition shows aleatoric uncertainty is a strong predictor of dating error (Spearman ρ=0.729), and a selective prediction about the most certain 20% of patches can provide 0.5 years MAE. We show that predicted uncertainty increases as image degradation worsens, spatial decomposition maps explain which script regions cause aleatoric uncertainty, and page-level aggregation reduces MAE to 4.5 years with ρ=0.905 between uncertainty and page-level error.
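
The abstract does not say how patch predictions are pooled into a page-level date. One common choice, used here purely as an assumption, is inverse-variance weighting: each patch's predicted year is weighted by the reciprocal of its total predictive variance, so confident patches dominate. The function name and inputs below are hypothetical.

```python
def aggregate_page(patch_years, patch_variances):
    """Inverse-variance weighted page estimate from per-patch predictions.

    Assumption: patch errors are treated as independent, which real patches
    cut from the same page will not strictly satisfy.
    Returns (page_year, page_variance).
    """
    if len(patch_years) != len(patch_variances) or not patch_years:
        raise ValueError("need equally sized, non-empty inputs")
    if any(v <= 0.0 for v in patch_variances):
        raise ValueError("variances must be positive")
    weights = [1.0 / v for v in patch_variances]
    total = sum(weights)
    page_year = sum(w * y for w, y in zip(weights, patch_years)) / total
    page_variance = 1.0 / total  # variance of the weighted mean under independence
    return page_year, page_variance

if __name__ == "__main__":
    # Two hypothetical patches from one page: the second is four times noisier.
    print(aggregate_page([1200.0, 1210.0], [1.0, 4.0]))
```

With equal variances this reduces to a plain mean of the patch predictions, which is the simplest baseline to compare against.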

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript presents a probabilistic method for dating historical manuscript pages from visual script features alone by posing the task as evidential deep regression over a continuous year axis. An EfficientNet-B2 backbone is combined with a Normal-Inverse-Gamma (NIG) output head and trained using a joint negative-log-likelihood plus evidence-regularization objective. On the DIVA-HisDB benchmark (150 pages from 3 medieval codices, 151,936 patches), the model reports a test MAE of 5.4 years, 93% of patches within 5 years, 97% within 10 years, and PICP=92.6% (best among compared methods) in a single forward pass, outperforming MC Dropout and Deep Ensembles at lower inference cost. Additional results include uncertainty-error correlation, selective prediction, and spatial uncertainty maps.

Significance. If the reported performance and uncertainty calibration generalize beyond the three codices, the work would advance digital paleography by replacing coarse century classification with continuous, calibrated year estimates and decomposed uncertainties obtainable in one pass. The empirical strengths on a public benchmark, including aleatoric uncertainty as an error predictor and page-level aggregation benefits, are clear. The single-pass efficiency relative to ensembles is a practical advantage.

major comments (2)
  1. [§4 and §5] §4 (Experimental Setup) and §5 (Results): The headline metrics (MAE 5.4 years, PICP 92.6%) are obtained on patches drawn from the same three codices in DIVA-HisDB. With only three distinct dating targets and 151k patches, a standard patch- or page-level random split permits the model to exploit codex-specific visual traits (script style, layout, degradation) rather than learning a transferable continuous mapping from script features to year. No leave-one-codex-out results, page-level cross-validation across codices, or evaluation on external manuscripts are reported, directly undermining the central claim that the method dates manuscripts 'from visual features alone' in a generalizable manner.
  2. [§5.2] §5.2 (Uncertainty Calibration): The NIG evidential framework is claimed to produce well-calibrated uncertainties (PICP 92.6%) superior to MC Dropout and Deep Ensembles. However, without explicit verification that the evidence-regularization term prevents the model from fitting codex-specific noise in this closed three-codex setting, the superior calibration may be an artifact of the limited domain rather than a general property of the evidential loss. A concrete test (e.g., out-of-codex PICP) is needed to support the uncertainty decomposition claims.
minor comments (3)
  1. [Abstract] The abstract states that continuous labels are used 'instead of aggregating centuries into classes' but does not clarify the source or precision of the ground-truth years for the three codices (exact dates vs. approximate ranges).
  2. [§3.1] §3.1: The joint loss combining NIG negative log-likelihood and evidence regularization is described at a high level; an explicit equation showing the weighting hyperparameter and its effect on the predictive variance would improve reproducibility.
  3. [Figure 3] Figure 3 (spatial uncertainty maps): The caption and surrounding text should explicitly state the patch size and stride used when generating the maps so that readers can interpret the highlighted script regions.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive and detailed feedback. The concerns about generalizability given the benchmark's limited scope and the need for stronger validation of uncertainty calibration are well-taken. We respond point-by-point below and indicate the changes we will incorporate.

point-by-point responses
  1. Referee: [§4 and §5] §4 (Experimental Setup) and §5 (Results): The headline metrics (MAE 5.4 years, PICP 92.6%) are obtained on patches drawn from the same three codices in DIVA-HisDB. With only three distinct dating targets and 151k patches, a standard patch- or page-level random split permits the model to exploit codex-specific visual traits (script style, layout, degradation) rather than learning a transferable continuous mapping from script features to year. No leave-one-codex-out results, page-level cross-validation across codices, or evaluation on external manuscripts are reported, directly undermining the central claim that the method dates manuscripts 'from visual features alone' in a generalizable manner.

    Authors: We agree that the three-codex scope of DIVA-HisDB limits strong claims of broad transferability and that a random page-level split can still permit codex-specific cues to influence results. Our original split was performed at the page level precisely to avoid patch-level leakage from the same physical page, but this does not fully address cross-codex generalization. In the revised manuscript we will add leave-one-codex-out (LOCO) experiments: for each codex we train on the other two and evaluate on the held-out codex, reporting MAE, PICP, and uncertainty-error correlation under this protocol. This directly tests whether the continuous regression mapping transfers across distinct manuscripts. Evaluation on manuscripts completely external to DIVA-HisDB is not possible with currently available public data and is therefore listed as a standing limitation. revision: partial

  2. Referee: [§5.2] §5.2 (Uncertainty Calibration): The NIG evidential framework is claimed to produce well-calibrated uncertainties (PICP 92.6%) superior to MC Dropout and Deep Ensembles. However, without explicit verification that the evidence-regularization term prevents the model from fitting codex-specific noise in this closed three-codex setting, the superior calibration may be an artifact of the limited domain rather than a general property of the evidential loss. A concrete test (e.g., out-of-codex PICP) is needed to support the uncertainty decomposition claims.

    Authors: We concur that calibration must be verified outside the training codices to substantiate that the evidential loss and regularization produce meaningful uncertainty rather than codex-specific artifacts. We will therefore include LOCO PICP, expected calibration error, and the aleatoric-uncertainty vs. error Spearman correlation in the revised results section. These additional metrics will show whether the Normal-Inverse-Gamma head maintains its reported advantages when the test codex is unseen, thereby strengthening the uncertainty-decomposition claims. revision: yes
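
The leave-one-codex-out protocol described in the responses above amounts to a grouped split on codex identity. A minimal sketch, assuming each patch carries a codex label; the labels `codex_A` to `codex_C` and the function name are placeholders, not the benchmark's identifiers:

```python
def leave_one_codex_out(patches):
    """Yield (held_out_codex, train, test) once per codex.

    `patches` is a list of (codex_id, patch) pairs. All patches of the
    held-out codex go to test, so no codex contributes to both sides.
    """
    codices = sorted({codex for codex, _ in patches})
    for held_out in codices:
        train = [p for p in patches if p[0] != held_out]
        test = [p for p in patches if p[0] == held_out]
        yield held_out, train, test

if __name__ == "__main__":
    # Hypothetical patch list; real patches would be image arrays, not strings.
    patches = [("codex_A", "a1"), ("codex_A", "a2"),
               ("codex_B", "b1"),
               ("codex_C", "c1"), ("codex_C", "c2")]
    for held_out, train, test in leave_one_codex_out(patches):
        print(held_out, len(train), len(test))
```

With only three codices each fold trains on two dating targets, so metrics from this split are probably best read as a stress test of transfer rather than as a tight estimate of error on new manuscripts.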

standing simulated objections not resolved
  • Evaluation on manuscripts external to the DIVA-HisDB benchmark, as no such additional data were available for the original study.

Circularity Check

0 steps flagged

No circularity: empirical ML evaluation on public benchmark

full rationale

The paper trains an EfficientNet-B2 + NIG regression head on DIVA-HisDB patches (151k from 3 codices) and reports direct test metrics (MAE 5.4, PICP 92.6%). No derivation chain exists that reduces predictions to fitted inputs by construction, nor any self-definitional equations or load-bearing self-citations. The NIG evidential loss and uncertainty decomposition follow standard prior formulations from external literature; results are falsifiable on the held-out patches without tautological re-use of training targets. Generalization limits to three codices are a validity concern, not a circularity issue.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claim rests on standard deep learning assumptions for regression and uncertainty modeling rather than new postulates. No invented entities are introduced.

free parameters (1)
  • NIG distribution parameters
    The parameters of the Normal-Inverse-Gamma output head are learned from data during training; specific values and regularization strengths are not detailed in the abstract.
axioms (2)
  • domain assumption: Visual script features contain sufficient information for year-level dating precision beyond century granularity
    Invoked when posing dating as continuous regression on image patches.
  • domain assumption: The evidential deep learning objective with NIG head correctly decomposes aleatoric and epistemic uncertainty
    Central to the training objective and uncertainty analysis claims.

pith-pipeline@v0.9.0 · 5609 in / 1623 out tokens · 67695 ms · 2026-05-08T09:43:55.557373+00:00 · methodology

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages

  1. [1] A. Amini, W. Schwarting, A. Soleimany, and D. Rus. Deep evidential regression. Advances in neural information processing systems, 33:14927–14937, 2020

  2. [2] A. Ciula. Digital palaeography: using the digital representation of medieval script to support palaeographic analysis. 2005. URL https://api.semanticscholar.org/CorpusID:113619742

  3. [3] F. Cloppet, V. Eglin, M. Helias-Baron, C. Kieu, N. Vincent, and D. Stutzmann. Icdar2017 competition on the classification of medieval handwritings in latin script. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), volume 01, pages 1371–1376, 2017. doi: 10.1109/ICDAR.2017.224

  4. [4] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009. doi: 10.1109/CVPR.2009.5206848

  5. [5] Y. Gal and Z. Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In international conference on machine learning, pages 1050–1059. PMLR, 2016

  6. [6] S. He and L. Schomaker. Beyond ocr: Multi-faceted understanding of handwritten document characteristics. Pattern Recognition, 63:321–333, 2017. ISSN 0031-3203. doi: https://doi.org/10.1016/j.patcog.2016.09.017. URL https://www.sciencedirect.com/science/article/pii/S0031320316302783

  7. [7] B. Lakshminarayanan, A. Pritzel, and C. Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. Advances in neural information processing systems, 30, 2017

  8. [8] C. Leibig, V. Allken, P. Berens, and S. Wahl. Leveraging uncertainty information from deep neural networks for disease detection. Scientific Reports, 10 2017. doi: https://doi.org/10.1038/s41598-017-17876-z

  9. [9] J. Li, Y. Xu, T. Lv, L. Cui, C. Zhang, and F. Wei. Dit: Self-supervised pre-training for document image transformer. In Proceedings of the 30th ACM international conference on multimedia, pages 3530–3539, 2022

  10. [10] G. Louloudis, N. Stamatopoulos, and B. Gatos. Icdar 2011 writer identification contest. In Proceedings of the 2011 International Conference on Document Analysis and Recognition, ICDAR ’11, page 1475–1479, USA, 2011. IEEE Computer Society. ISBN 9780769545202. doi: 10.1109/ICDAR.2011.293. URL https://doi.org/10.1109/ICDAR.2011.293

  11. [11] A. G. Roy, S. Conjeti, N. Navab, and C. Wachinger. Inherent brain segmentation quality control from fully convnet monte carlo sampling. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 664–672. Springer, 2018

  12. [12] M. Seuret, A. Nicolaou, D. Rodríguez-Salas, N. Weichselbaumer, D. Stutzmann, M. Mayr, A. Maier, and V. Christlein. Icdar 2021 competition on historical document classification. In J. Lladós, D. Lopresti, and S. Uchida, editors, Document Analysis and Recognition – ICDAR 2021, pages 618–634, Cham, 2021. Springer International Publishing. ISBN 978-3-030-86337-1

  13. [13] F. Simistira, M. Seuret, N. Eichenberger, A. Garz, M. Liwicki, and R. Ingold. Diva-hisdb: A precisely annotated large dataset of challenging medieval manuscripts. In 2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR), pages 471–476, 2016. doi: 10.1109/ICFHR.2016.0093

  14. [14] P. A. Stokes. Digital approaches to paleography and book history: Some challenges, present and future. Frontiers in Digital Humanities, Volume 2 - 2015, 2015. ISSN 2297-2668. doi: 10.3389/fdigh.2015.00005. URL https://www.frontiersin.org/journals/digital-humanities/articles/10.3389/fdigh.2015.00005

  15. [15] M. Tan and Q. Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In International conference on machine learning, pages 6105–6114. PMLR, 2019

  16. [16] University of Fribourg. e-codices - virtual manuscript library of switzerland. https://www.e-codices.unifr.ch/en, 2023. Accessed: 2025.

A Supplementary Figures

This appendix provides supplementary experimental results that complement the main paper. Figure 6 plots the accuracy-coverage trade-off under uncertainty thresholding. Figure 7 plots the errors d...