
arxiv: 2605.25781 · v1 · pith:PKLGT6NEnew · submitted 2026-05-25 · 💻 cs.CL

Double Triangle Annotation: A Scalable Human-in-the-Loop Framework for High-Precision Historical Document Annotation

Pith reviewed 2026-06-29 21:33 UTC · model grok-4.3

classification 💻 cs.CL
keywords: human-in-the-loop annotation · historical documents · multimodal LLMs · structured extraction · word error rate · medical directories · consensus mechanism · benchmark release

The pith

Double Triangle Annotation uses model consensus to reach a word error rate of 0.003 on historical document extraction while auto-accepting over 85 percent of fields.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Double Triangle Annotation as a two-layer human-in-the-loop framework for high-precision annotation of structured data from historical documents. Two independent multimodal models label each item in parallel, with agreement leading to automatic acceptance and disagreement sent to a human jury. A second layer applies the same process across two such systems before escalating to an expert. This setup relies only on the independence of model errors and needs no task-specific adjustments. On the Rosenwald Guides corpus of French medical directories from 1887 to 1906, the method yields a final word error rate of 0.003 and auto-accepts over 85 percent of more than 13,000 fields, with the resulting benchmark released publicly.

Core claim

The Double Triangle Annotation framework consists of a first layer where two architecturally independent multimodal large language models annotate documents in parallel, auto-accepting on agreement and routing disagreements to a human jury, followed by a second layer that cross-checks two such consensus outputs against each other and escalates remaining conflicts to a domain expert. This process produces high-precision annotations for structured information extraction, demonstrated by a final word error rate of 0.003 on the Guides Rosenwald corpus spanning 1887-1906, while auto-accepting over 85% of 13,595 fields without requiring distributional priors or calibration.
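
The routing logic described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the names `Decision`, `triangle`, and `double_triangle` are hypothetical, the labelers are plain callables, and the whitespace normalization used to decide "agreement" is an assumption, since the summary does not say how labels are compared.

```python
from dataclasses import dataclass
from typing import Callable

# All names below are hypothetical and chosen for illustration only.


@dataclass(frozen=True)
class Decision:
    label: str
    source: str  # "consensus", "jury", or "expert"


def normalize(text: str) -> str:
    """Collapse whitespace so spacing differences alone are not a disagreement (assumed rule)."""
    return " ".join(text.split())


def triangle(item: str,
             model_a: Callable[[str], str],
             model_b: Callable[[str], str],
             jury: Callable[[str, str, str], str]) -> Decision:
    """Layer 1: auto-accept when the two models agree, otherwise ask the human jury."""
    a, b = normalize(model_a(item)), normalize(model_b(item))
    if a == b:
        return Decision(a, "consensus")
    return Decision(normalize(jury(item, a, b)), "jury")


def double_triangle(item: str,
                    system_1: Callable[[str], Decision],
                    system_2: Callable[[str], Decision],
                    expert: Callable[[str, str, str], str]) -> Decision:
    """Layer 2: cross-check two layer-1 outputs and escalate conflicts to a domain expert."""
    first, second = system_1(item), system_2(item)
    if first.label == second.label:
        return first
    return Decision(normalize(expert(item, first.label, second.label)), "expert")
```

Each `system_k` would typically be a `triangle` call bound to its own pair of models and jury, for example `lambda item: triangle(item, model_a, model_b, jury)`. How the two systems differ from each other is not specified in the summary, so that wiring is left open here.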

What carries the argument

Double Triangle Annotation, the two-layer consensus mechanism using parallel independent model annotations and cross-system verification to minimize human intervention.

If this is right

  • High-precision ground-truth datasets can be generated efficiently for large historical corpora.
  • Annotation autonomy increases automatically with improvements in underlying models.
  • The approach applies to other structured extraction tasks from documents without custom calibration.
  • Released benchmarks like the Rosenwald Guides ground truth enable standardized evaluation of future extraction methods.
  • The framework scales annotation efforts while maintaining low error rates through layered checks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar consensus methods could reduce annotation costs in other domains such as legal or archival records.
  • If model capabilities advance, the human jury layer might become unnecessary for many fields.
  • The released dataset opens opportunities for testing extraction models specifically on 19th-20th century French medical texts.
  • Extending the framework to more than two models per layer could further increase auto-acceptance rates.

Load-bearing premise

The errors produced by the two independent multimodal models are statistically independent.
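
A short worked bound shows why this premise carries so much weight. The symbols $p_A$, $p_B$, $\rho$ and the numeric values below are illustrative assumptions, not quantities reported in the paper. An error is auto-accepted only if both models are wrong and also produce the same wrong label, so its probability is at most the probability that both are wrong.

```latex
% E_A, E_B: events that model A / model B mislabels a given field.
% p_A = P(E_A), p_B = P(E_B); \rho is the correlation of the two error indicators.
\begin{align*}
P(\text{auto-accepted error}) &\le P(E_A \cap E_B) \\
P(E_A \cap E_B) &= p_A\,p_B + \rho\,\sqrt{p_A(1-p_A)\,p_B(1-p_B)} \\
\rho = 0,\; p_A = p_B = 0.05 &\;\Rightarrow\; P(E_A \cap E_B) = 0.0025 \\
\rho = 0.5,\; p_A = p_B = 0.05 &\;\Rightarrow\; P(E_A \cap E_B) = 0.0025 + 0.5 \times 0.0475 = 0.02625
\end{align*}
```

In this hypothetical setting, moderate correlation raises the ceiling on undetected errors roughly tenfold, which is why the independence assumption needs empirical support rather than assertion.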

What would settle it

Observing that the two models produce the same incorrect annotation on a substantial fraction of fields, resulting in auto-acceptance of errors and an elevated final word error rate above the reported 0.003.
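
One way such a check could be run, assuming a gold-labelled subset is available, is to count how often both models give the same wrong label and compare that with what independence would predict. The function name `error_overlap` and the sample strings in the usage note are hypothetical.

```python
from typing import Dict, Sequence


def error_overlap(pred_a: Sequence[str],
                  pred_b: Sequence[str],
                  gold: Sequence[str]) -> Dict[str, float]:
    """Compare observed joint errors with the rate expected if errors were independent."""
    n = len(gold)
    if n == 0 or len(pred_a) != n or len(pred_b) != n:
        raise ValueError("inputs must be non-empty and of equal length")

    err_a = sum(a != g for a, g in zip(pred_a, gold)) / n
    err_b = sum(b != g for b, g in zip(pred_b, gold)) / n
    # Both models wrong on the same field, whatever they output.
    both_wrong = sum(a != g and b != g
                     for a, b, g in zip(pred_a, pred_b, gold)) / n
    # Both models wrong AND identical: these are the errors consensus would auto-accept.
    same_wrong = sum(a == b and a != g
                     for a, b, g in zip(pred_a, pred_b, gold)) / n

    return {
        "err_a": err_a,
        "err_b": err_b,
        "both_wrong": both_wrong,
        "same_wrong": same_wrong,
        "expected_both_wrong_if_independent": err_a * err_b,
    }
```

A `both_wrong` value well above `expected_both_wrong_if_independent`, or a `same_wrong` value that is a large share of the target error budget, would be evidence against the premise.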

Figures

Figures reproduced from arXiv: 2605.25781 by Yi Ren.

Figure 1: Overview of the Double Triangle Annotation framework. Layer 1 (gray boxes) pairs two independent ...
Figure 2: Annotated entries from the Rosenwald Guide ...
Figure 3: The two residual errors both stem from gen...
Figure 4: Screenshot of the annotation platform. The ...
Abstract

Evaluating structured-information extraction from historical documents at scale requires high-precision ground-truth annotations, yet traditional manual labeling is expensive and fully automated pipelines built on large language models are prone to hallucination. We propose Double Triangle Annotation, a two-layer human-in-the-loop framework that leverages cross-model consensus to automate the majority of annotation work while ensuring high-precision outputs. In the first layer, two architecturally independent Multimodal Large Language Models annotate each document in parallel; when they agree, the label is auto-accepted, and disagreements are routed to a human jury. A second layer cross-checks two such systems against each other, escalating residual conflicts to a domain expert. The framework rests on a single assumption -- error independence between models -- requires no distributional priors or task-specific calibration, and becomes more autonomous as model capability improves. On the Guides Rosenwald, a corpus of French medical directories spanning 1887-1906, the framework achieves a final Word Error Rate of 0.003. Applied at scale, model consensus auto-accepts over 85% of 13,595 fields. We release the resulting benchmark -- the first structured-extraction ground truth for the Rosenwald Guides -- to support future work on historical document processing.
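
For readers unfamiliar with the metric, word error rate is conventionally the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The sketch below implements that standard definition; the abstract does not say how the paper tokenizes or aggregates fields, so whitespace tokenization and corpus-level pooling are assumptions, and the function names are hypothetical.

```python
from typing import Iterable, Sequence, Tuple


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Word-level Levenshtein distance computed with a rolling one-row table."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def corpus_wer(pairs: Iterable[Tuple[str, str]]) -> float:
    """Pooled WER over (reference, hypothesis) pairs: total edits / total reference words."""
    edits = 0
    ref_words = 0
    for reference, hypothesis in pairs:
        ref, hyp = reference.split(), hypothesis.split()
        edits += edit_distance(ref, hyp)
        ref_words += len(ref)
    if ref_words == 0:
        raise ValueError("reference text contains no words")
    return edits / ref_words
```

Under this definition, a WER of 0.003 corresponds to about three word-level edits per thousand reference words.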

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper proposes Double Triangle Annotation, a two-layer human-in-the-loop framework for high-precision structured information extraction from historical documents. Two architecturally independent MLLMs annotate in parallel in layer 1 (auto-accept on agreement, route disagreements to human jury); layer 2 cross-checks systems and escalates residuals to a domain expert. The framework assumes error independence between models, requires no priors or calibration, and improves with model capability. On the Guides Rosenwald corpus (French medical directories, 1887-1906), it reports a final WER of 0.003 and auto-accepts over 85% of 13,595 fields, releasing the resulting benchmark as the first structured-extraction ground truth for this corpus.

Significance. If the error-independence assumption holds and the reported WER is supported by rigorous validation, the framework provides a practical, scalable approach to generating high-precision annotations for historical document processing, substantially reducing manual effort while addressing hallucination risks in fully automated LLM pipelines. The public release of the Rosenwald Guides benchmark is a concrete, reusable contribution that can benchmark future work in the area.

major comments (2)
  1. [Abstract] The central claims of final WER=0.003 and 85% automation rest entirely on the unvalidated assumption of error independence between the two MLLMs (explicitly identified as the sole assumption). No ablation study, error-correlation matrix, per-field expert audit of the consensus-accepted subset, or analysis of potential correlated failures on historical features (ligatures, abbreviations, faded ink) is described. Without such evidence, the measured WER on corrected disagreements alone does not establish the precision of the auto-accepted labels.
  2. [Methods / Experiments] (Inferred from absence in the provided description.) The manuscript supplies no details on the specific MLLMs employed, jury composition and instructions, domain-expert escalation criteria, annotation guidelines, or dataset characteristics (e.g., field types, document image quality distribution, or train/test splits within the 13,595 fields). These omissions prevent assessment of whether the reported numbers are reproducible or generalizable.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We respond point-by-point to the major comments and indicate where revisions will be made to address valid concerns.

  1. Referee: [Abstract] The central claims of final WER=0.003 and 85% automation rest entirely on the unvalidated assumption of error independence between the two MLLMs (explicitly identified as the sole assumption). No ablation study, error-correlation matrix, per-field expert audit of the consensus-accepted subset, or analysis of potential correlated failures on historical features (ligatures, abbreviations, faded ink) is described. Without such evidence, the measured WER on corrected disagreements alone does not establish the precision of the auto-accepted labels.

    Authors: We agree that the framework's claims rest on the explicitly stated error-independence assumption and that the reported WER of 0.003 is measured after human correction of all disagreements. The auto-accepted subset is not directly audited in the current manuscript. To strengthen the evidence, the revised version will add (1) an analysis of error correlations across a sample of fields, focusing on historical features such as ligatures, abbreviations, and faded ink, and (2) results from a limited per-field expert audit of randomly sampled auto-accepted labels. These additions will provide empirical support for the precision of the consensus-accepted outputs. revision: yes

  2. Referee: [Methods / Experiments] (Inferred from absence in the provided description.) The manuscript supplies no details on the specific MLLMs employed, jury composition and instructions, domain-expert escalation criteria, annotation guidelines, or dataset characteristics (e.g., field types, document image quality distribution, or train/test splits within the 13,595 fields). These omissions prevent assessment of whether the reported numbers are reproducible or generalizable.

    Authors: The full manuscript contains these details, but we accept that they require greater prominence and expansion for clarity. In revision we will enlarge the Methods section to specify the exact MLLMs (names, versions, and prompting), jury composition and instructions, escalation criteria, annotation guidelines, field types, image-quality distribution across the corpus, and confirmation that no train/test split is applicable because the work concerns annotation rather than supervised model training. These changes will directly improve reproducibility and allow better assessment of generalizability. revision: yes
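
The per-field audit of auto-accepted labels promised in the first response could be set up along the following lines. This is a sketch under stated assumptions: the helper names are hypothetical, the sample is drawn uniformly at random, and the Wilson score interval is one common choice of bound, not one the authors have committed to.

```python
import math
import random
from typing import Iterable, List


def sample_for_audit(field_ids: Iterable[int], k: int, seed: int = 0) -> List[int]:
    """Draw k distinct auto-accepted fields uniformly at random (reproducible via seed)."""
    return random.Random(seed).sample(list(field_ids), k)


def wilson_upper(errors: int, n: int, z: float = 1.96) -> float:
    """Upper end of the Wilson score interval for an observed error rate errors/n."""
    if n <= 0 or not 0 <= errors <= n:
        raise ValueError("need n > 0 and 0 <= errors <= n")
    p = errors / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center + half
```

With `z = 1.96`, an audit of 300 fields that finds no errors still leaves an upper bound of roughly 1.3% on the per-field error rate, so the audit sample has to be sized against the precision being claimed.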

Circularity Check

0 steps flagged

No circularity; empirical framework with explicit assumption


The paper presents a two-layer human-in-the-loop annotation framework whose results (0.003 WER, 85% auto-accept rate) are reported as direct empirical measurements on the Rosenwald corpus. No equations, parameter fittings, predictions derived from inputs, or self-citations appear in the provided text. The sole load-bearing premise is the explicitly stated assumption of error independence between models, which is not derived from or equivalent to any fitted quantity or prior result within the paper. The derivation chain is therefore self-contained and contains no reductions by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on one domain assumption of error independence between the two MLLMs; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: error independence between models
    Explicitly stated as the single assumption on which the framework rests.

pith-pipeline@v0.9.1-grok · 5744 in / 1240 out tokens · 31051 ms · 2026-06-29T21:33:51.180223+00:00 · methodology

