
arXiv: 2603.28103 · v2 · pith:7576W6DKnew · submitted 2026-03-30 · 💻 cs.DL · cs.AI · cs.IR

Transcription and Recognition of Italian Parliamentary Speeches Using Vision-Language Models

Pith reviewed 2026-05-21 10:10 UTC · model grok-4.3

classification 💻 cs.DL · cs.AI · cs.IR
keywords vision-language models · optical character recognition · Italian parliamentary speeches · transcription · speaker identification · semantic segmentation · entity linking · historical documents

The pith

A vision-language model pipeline substantially improves transcription quality and speaker tagging for Italian parliamentary speeches.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes replacing traditional optical character recognition pipelines with a new approach that uses vision-language models on scanned historical Italian parliamentary documents. These models take both the visual layout of the page and the extracted text as input to refine the transcription, classify different elements in the proceedings, and identify who is speaking. The identified speakers are then matched to an existing knowledge base using database queries and fuzzy matching techniques. This matters because parliamentary records contain valuable historical and political information that becomes far more usable once transcription errors are reduced and semantic details like speaker identity are added automatically.

Core claim

The pipeline first applies a specialised OCR model to extract text while preserving reading order, then employs a large-scale vision-language model that jointly reasons over visual layout and textual content to refine the transcription, classify elements, and identify speakers. Extracted speakers are linked to the Chamber of Deputies knowledge base through SPARQL queries combined with a multi-strategy fuzzy matching procedure. Evaluation against an established benchmark shows substantial improvements in both transcription quality and speaker tagging accuracy compared with prior OCR-only methods.
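The fuzzy-matching part of the linking step can be illustrated with a weighted edit distance. The text extracted from the paper mentions a weighted Levenshtein distance that assigns lower substitution costs to vowel-vowel substitutions. The sketch below is a minimal version of that idea and not the authors' implementation: the cost value `0.5`, the function names, and the candidate names are assumptions made for illustration only.

```python
from typing import List, Tuple

VOWELS = set("aeiou")


def substitution_cost(a: str, b: str, vowel_cost: float = 0.5) -> float:
    """Cost of replacing character a with b (vowel_cost is an assumed value)."""
    if a == b:
        return 0.0
    if a in VOWELS and b in VOWELS:
        return vowel_cost  # cheaper: vowel swaps are treated as spelling variants
    return 1.0


def weighted_levenshtein(s: str, t: str, vowel_cost: float = 0.5) -> float:
    """Edit distance with unit insert/delete costs and weighted substitutions."""
    s, t = s.lower(), t.lower()
    prev = [float(j) for j in range(len(t) + 1)]
    for i, a in enumerate(s, start=1):
        curr = [float(i)]
        for j, b in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1.0,                                   # deletion
                curr[j - 1] + 1.0,                               # insertion
                prev[j - 1] + substitution_cost(a, b, vowel_cost),  # substitution
            ))
        prev = curr
    return prev[-1]


def rank_candidates(extracted: str, candidates: List[str]) -> List[Tuple[str, float]]:
    """Return candidates sorted by increasing distance to the extracted name."""
    scored = [(c, weighted_levenshtein(extracted, c)) for c in candidates]
    return sorted(scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    # Hypothetical OCR output and candidate surnames, for illustration only.
    print(rank_candidates("MINGHETI", ["Depretis", "Minghetti"]))
```

In a full pipeline the candidate list would come from the knowledge base query, and this score would be only one of several disambiguation signals.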

What carries the argument

The large-scale vision-language model that performs transcription refinement, element classification, and speaker identification by jointly reasoning over visual layout and textual content.

If this is right

  • Historical parliamentary proceedings become more searchable and analyzable once accurate transcriptions and speaker labels are available at scale.
  • Speaker identification enables tracking of individual contributions across debates and sessions in the Italian Chamber of Deputies.
  • Entity linking to the existing knowledge base supports integration with other political and biographical datasets.
  • Semantic segmentation of the documents allows higher-level tasks such as summarization or topic extraction to be performed more reliably.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same joint visual-textual reasoning approach could be tested on parliamentary records from other countries that exist only as scanned images.
  • The pipeline might generalize to other classes of historical documents where layout information helps disambiguate text, such as old newspapers or legal archives.
  • Fine-tuning the vision-language model on additional Italian political corpora could further reduce domain-specific errors in speaker names and terminology.

Load-bearing premise

A large-scale vision-language model can reliably refine transcriptions, classify elements, and identify speakers by jointly reasoning over visual layout and text without introducing major new errors or biases.

What would settle it

Running the pipeline on the same established benchmark and finding no measurable reduction in transcription error rates or no gain in speaker tagging accuracy would falsify the claim of substantial improvements.
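One way to make this test concrete is to compute a character error rate (CER) for each system against the same ground truth. The abstract does not say which metric the paper uses, so CER is an assumption here, and the function names and sample strings are hypothetical.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Plain Levenshtein distance between a reference and a hypothesis string."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (0 if characters match)
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Invented strings standing in for a ground-truth line and two system outputs.
    truth = "la seduta comincia"
    baseline_out = "la sedata comincla"
    pipeline_out = "la seduta comincia"
    print(cer(truth, baseline_out), cer(truth, pipeline_out))
```

Under this reading, the claim would be undermined if the pipeline's CER on the benchmark were not lower than the baseline's.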

Figures

Figures reproduced from arXiv: 2603.28103 by Alfio Ferrara, Giovanni Pagano, Luigi Curini, Sergio Picascia.

Figure 2: Excerpt from the stenographic report of the session held on November 27th 1874, Legislature 12 of the Kingdom of Italy. This page excerpt serves as the running example throughout this section. 3.1. Document Acquisition. For each legislature in the history of the Italian Parliament, we retrieved the complete list of sessions (sedute). Each session, which can be thought of as a parliamentary meeting, is un…

Figure 1: Pipeline diagram showing the six stages with data flow between components. This section presents the methodology developed for the automatic transcription and semantic labelling of Italian parliamentary session reports. The proposed pipeline transforms digitised parliamentary documents into structured, semantically annotated data, enabling downstream analyses on political discourse studies. The pipeline …

Figure 3: Illustrates the effect of post-processing on the running example. The speech fragment on page 26, originally marked with an "unknown" speaker, is merged with the incomplete element at the end of page 25: the hyphenated word is rejoined, the content is concatenated, and the speaker identity is propagated from the preceding context. Before Post-Processing. End of page 25: speaker: "MINGHETTI, PRESIDENTE DEL C…
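The cross-page merge described for Figure 3 can be sketched as follows. This is a simplified illustration, not the paper's code: the `Element` type, the rule that a fragment is merged whenever its speaker is `"unknown"`, and the sample text are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Element:
    speaker: str
    text: str


def merge_across_pages(last: Element, first: Element) -> Optional[Element]:
    """Merge the first element of a page into the last element of the previous page.

    Returns the merged element, or None when the fragment already has a speaker
    (assumed here to mean that no merge is needed).
    """
    if first.speaker != "unknown":
        return None
    head, tail = last.text.rstrip(), first.text.lstrip()
    if head.endswith("-"):
        text = head[:-1] + tail   # rejoin a word hyphenated across the page break
    else:
        text = head + " " + tail  # otherwise just concatenate the content
    # The speaker identity is propagated from the preceding context.
    return Element(speaker=last.speaker, text=text)


if __name__ == "__main__":
    # Invented text; only the speaker surname appears in the source excerpt.
    end_of_page = Element("MINGHETTI", "la discus-")
    start_of_next = Element("unknown", "sione continua")
    print(merge_across_pages(end_of_page, start_of_next))
```

A real implementation would need a more careful test for when the previous element is incomplete, since not every line-final hyphen marks a split word.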
Original abstract

Parliamentary proceedings represent a rich yet challenging resource for computational analysis, particularly when preserved only as scanned historical documents. Existing efforts to transcribe Italian parliamentary speeches have relied on traditional Optical Character Recognition pipelines, resulting in transcription errors and limited semantic annotation. In this paper, we propose a pipeline based on Vision-Language Models for the automatic transcription, semantic segmentation, and entity linking of Italian parliamentary speeches. The pipeline employs a specialised OCR model to extract text while preserving reading order, followed by a large-scale Vision-Language Model that performs transcription refinement, element classification, and speaker identification by jointly reasoning over visual layout and textual content. Extracted speakers are then linked to the Chamber of Deputies knowledge base through SPARQL queries and a multi-strategy fuzzy matching procedure. Evaluation against an established benchmark demonstrates substantial improvements both in transcription quality and speaker tagging.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript proposes a pipeline for transcribing, semantically segmenting, and entity-linking Italian parliamentary speeches from scanned historical documents. It begins with a specialized OCR model that extracts text while preserving reading order, followed by a large-scale Vision-Language Model that jointly reasons over visual layout and textual content to refine the transcription, classify elements, and identify speakers. Extracted speakers are then linked to the Chamber of Deputies knowledge base via SPARQL queries combined with multi-strategy fuzzy matching. The abstract states that evaluation on an established benchmark shows substantial improvements in both transcription quality and speaker tagging relative to prior OCR approaches.

Significance. If the reported gains are substantiated with concrete metrics, this approach could meaningfully advance automated processing of historical parliamentary records, particularly for non-English languages and complex layout-heavy documents. The integration of VLMs for joint visual-textual reasoning offers a plausible route beyond traditional OCR limitations, with potential applicability to other archival digitization tasks in digital humanities and political science.

major comments (1)
  1. Abstract: The central claim of 'substantial improvements' in transcription quality and speaker tagging is asserted without any quantitative metrics, error rates, dataset size, baseline comparisons, or statistical significance tests. This absence leaves the primary empirical contribution without verifiable support and makes it impossible to assess whether the VLM pipeline delivers load-bearing gains over prior methods.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for their careful reading of the manuscript and for identifying this important point about the abstract. We address the comment below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: Abstract: The central claim of 'substantial improvements' in transcription quality and speaker tagging is asserted without any quantitative metrics, error rates, dataset size, baseline comparisons, or statistical significance tests. This absence leaves the primary empirical contribution without verifiable support and makes it impossible to assess whether the VLM pipeline delivers load-bearing gains over prior methods.

    Authors: We agree that the abstract would be strengthened by including specific quantitative metrics to support the claim of substantial improvements. The full manuscript reports evaluation results on an established benchmark with direct comparisons to prior OCR methods. In the revised version we will update the abstract to incorporate key figures from the results section, including transcription error rates, speaker tagging accuracy, benchmark dataset size, and baseline comparisons. This change will make the primary empirical claims immediately verifiable while preserving the abstract's summary character.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity

Full rationale

The paper presents a standard applied pipeline: specialised OCR for order-preserving extraction, followed by an off-the-shelf large-scale VLM for joint visual-textual refinement/classification/ID, then external SPARQL + fuzzy matching against the Chamber of Deputies knowledge base. All components are described as pre-existing technologies or standard procedures; the central claims rest on empirical improvements measured against an established external benchmark. No equations, fitted parameters renamed as predictions, self-citation chains, or uniqueness theorems appear in the provided text. The derivation is therefore self-contained and externally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on assumptions about the capabilities of existing OCR and Vision-Language Model technologies rather than new parameters or entities introduced in the abstract.

axioms (2)
  • Domain assumption: Specialised OCR model extracts text while preserving reading order.
    Invoked as the first step of the proposed pipeline.
  • Domain assumption: Large-scale Vision-Language Model can jointly reason over visual layout and textual content for refinement, classification, and speaker identification.
    Central premise for the second stage of the pipeline.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · tag: unclear

    Relation between the paper passage and the cited Recognition theorem.

    The pipeline employs a specialised OCR model to extract text while preserving reading order, followed by a large-scale Vision-Language Model that performs transcription refinement, element classification, and speaker identification by jointly reasoning over visual layout and textual content.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages · 2 internal anchors

  1. [1]

    In the Italian context, these records chronicle nearly two centuries of transformative events

    Introduction. Parliamentary proceedings constitute one of the most valuable documentary sources for the study of political, linguistic, and social change. In the Italian context, these records chronicle nearly two centuries of transformative events. The stenographic reports produced by both chambers of the Italian Parliament, the Camera dei Deputati an...

  2. [2]

    Transcription and Recognition of Italian Parliamentary Speeches Using Vision-Language Models

    have assembled comparable parliamentary corpora across European countries, while Italy-specific efforts, including IPSA (Frasnelli and Palmero Aprosio, 2024) and ItaParlCorpus (Cova, 2025), have produced large-scale datasets spanning extensive historical periods. However, these resources predominantly rely on traditional Optical Character Recognition ...

  3. [3]

    Related Work. This section reviews prior work relevant to our contribution, organised into two main areas: parliamentary corpora with a focus on Italian resources, and vision-language models for document understanding and OCR. 2.1. Italian Parliamentary Resources. Parliamentary debates constitute a valuable resource for political science, linguistics, a...

  4. [4]

    Similarly, the ParlSpeech dataset (Rauh and Schwalbach, 2020) provides full-text corpora from various advanced democracies, though notably excluding Italy

    assembled comparable corpora from 29 European countries, containing over one billion words and covering at least the period 2015–2022, with linguistic annotations following the Universal Dependencies framework. Similarly, the ParlSpeech dataset (Rauh and Schwalbach, 2020) provides full-text corpora from various advanced democracies, though notably excludin...

  5. [5]

    speaker":

    Methodology. Figure 1: Pipeline diagram showing the six stages with data flow between components. This section presents the methodology developed for the automatic transcription and semantic labelling of Italian parliamentary session reports. The proposed pipeline transforms digitised parliamentary documents into structured, semantically annotated data...

  6. [6]

    score-based ranking: candidates are ranked by their fuzzy matching scores; if a single candidate achieves the highest score, it is selected

  7. [7]

    role matching: if the speaker’s role was extracted, candidates whose roles match the extracted role are prioritised

  8. [8]

    full name similarity: remaining ties are broken by computing similarity between the candidate’s full name and the extracted speaker name

  9. [9]

    G. ROSSI

    abbreviated name handling: for speaker names containing initials (e.g., "G. ROSSI"), the system generates abbreviated forms of candidate names and compares them

  10. [10]

    contextual mention: candidates whose full names appear elsewhere in the document text are favoured

  11. [11]

    speaker":

    weighted edit distance: a weighted Levenshtein distance is computed, assigning lower substitution costs to vowel-vowel substitutions to account for spelling variations common in historical documents. If disambiguation succeeds, the speaker is linked to the entity’s URI; otherwise, all high-scoring candidates are retained as potential matches. In a second pa...

  12. [12]

    We evaluate both the OCR transcription quality and the speaker tagging accuracy using the benchmark dataset released by the authors

    Evaluation. To assess the effectiveness of the proposed pipeline, we conduct a comparative evaluation against IPSA (Frasnelli and Palmero Aprosio, 2024), a previously published system for Italian parliamentary corpus construction. We evaluate both the OCR transcription quality and the speaker tagging accuracy using the benchmark dataset released by the a...

  13. [13]

    Conclusion. In this paper, we presented a pipeline for automatic transcription, semantic segmentation, and entity linking of Italian parliamentary speeches based on Vision-Language Models. The proposed approach combines a specialised OCR model (dots.ocr) with a large-scale VLM (Qwen2.5-VL-72B) to jointly perform text extraction, element classification, a...

  14. [14]

    A Survey on Optical Character Recognition System

    A survey on optical character recognition system. arXiv preprint arXiv:1710.05703. Yumeng Li, Guang Yang, Hao Liu, Bowen Wang, and Colin Zhang. 2025. dots.ocr: Multilingual document layout parsing in a single vision-language model. Linke Ouyang, Yuan Qu, Hongbin Zhou, Jiawei Zhu, Rui Zhang, Qunshu Lin, Bin Wang, Zhiyuan Zhao, Man Jiang, Xiaomeng Zhao, J...