Transcription and Recognition of Italian Parliamentary Speeches Using Vision-Language Models

Alfio Ferrara; Giovanni Pagano; Luigi Curini; Sergio Picascia

arXiv: 2603.28103 · v2 · pith:7576W6DKnew · submitted 2026-03-30 · cs.DL · cs.AI · cs.IR

Pith reviewed 2026-05-21 10:10 UTC · model grok-4.3

keywords: vision-language models, optical character recognition, Italian parliamentary speeches, transcription, speaker identification, semantic segmentation, entity linking, historical documents

The pith

A vision-language model pipeline substantially improves transcription quality and speaker tagging for Italian parliamentary speeches.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes replacing traditional optical character recognition pipelines with a new approach that uses vision-language models on scanned historical Italian parliamentary documents. These models take both the visual layout of the page and the extracted text as input to refine the transcription, classify different elements in the proceedings, and identify who is speaking. The identified speakers are then matched to an existing knowledge base using database queries and fuzzy matching techniques. This matters because parliamentary records contain valuable historical and political information that becomes far more usable once transcription errors are reduced and semantic details like speaker identity are added automatically.

Core claim

The pipeline first applies a specialised OCR model to extract text while preserving reading order, then employs a large-scale vision-language model that jointly reasons over visual layout and textual content to refine the transcription, classify elements, and identify speakers. Extracted speakers are linked to the Chamber of Deputies knowledge base through SPARQL queries combined with a multi-strategy fuzzy matching procedure. Evaluation against an established benchmark shows substantial improvements in both transcription quality and speaker tagging accuracy compared with prior OCR-only methods.
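As a rough illustration of the linking step, the sketch below scores an extracted speaker name against a handful of candidate entries using Python's standard `difflib`. It is not the authors' procedure: the normalisation, the two comparison strategies (full name and surname), the 0.8 threshold, and the `ex:` identifiers are assumptions made for this example, and the candidate dictionary stands in for whatever a SPARQL query against the knowledge base would return.

```python
import unicodedata
from difflib import SequenceMatcher


def normalise(name: str) -> str:
    """Lower-case, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    kept = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in no_accents)
    return " ".join(kept.lower().split())


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two names after normalisation."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


def best_matches(speaker: str, candidates: dict[str, str], threshold: float = 0.8) -> list[str]:
    """Return the identifiers of the top-scoring candidates.

    `candidates` maps a (hypothetical) identifier to a full name. Each
    candidate is scored with two strategies, full name and surname (taken
    naively as the last token), and the better score is kept. Ties are
    all returned, so a later step can disambiguate.
    """
    scored = []
    for uri, full_name in candidates.items():
        surname = normalise(full_name).split()[-1]
        score = max(similarity(speaker, full_name), similarity(speaker, surname))
        if score >= threshold:
            scored.append((score, uri))
    if not scored:
        return []
    scored.sort(reverse=True)
    top_score = scored[0][0]
    return [uri for score, uri in scored if score == top_score]


# Hypothetical candidate list; identifiers are placeholders, not real URIs.
candidates = {
    "ex:minghetti": "Marco Minghetti",
    "ex:depretis": "Agostino Depretis",
}
matches = best_matches("MINGHETTI", candidates)
```

Returning every tied candidate rather than picking one arbitrarily leaves room for further tie-breaking rules, which is one plausible reading of "multi-strategy".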

What carries the argument

The large-scale vision-language model that performs transcription refinement, element classification, and speaker identification by jointly reasoning over visual layout and textual content.

If this is right

Historical parliamentary proceedings become more searchable and analyzable once accurate transcriptions and speaker labels are available at scale.
Speaker identification enables tracking of individual contributions across debates and sessions in the Italian Chamber of Deputies.
Entity linking to the existing knowledge base supports integration with other political and biographical datasets.
Semantic segmentation of the documents allows higher-level tasks such as summarization or topic extraction to be performed more reliably.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same joint visual-textual reasoning approach could be tested on parliamentary records from other countries that exist only as scanned images.
The pipeline might generalize to other classes of historical documents where layout information helps disambiguate text, such as old newspapers or legal archives.
Fine-tuning the vision-language model on additional Italian political corpora could further reduce domain-specific errors in speaker names and terminology.

Load-bearing premise

A large-scale vision-language model can reliably refine transcriptions, classify elements, and identify speakers by jointly reasoning over visual layout and text without introducing major new errors or biases.

What would settle it

Running the pipeline on the same established benchmark and finding no measurable reduction in transcription error rates or no gain in speaker tagging accuracy would falsify the claim of substantial improvements.
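To make "transcription error rate" concrete, the sketch below computes a character error rate (CER) as the edit distance between a reference and a hypothesis transcription, divided by the reference length. This page does not state which metric the paper reports, so CER is used here only as a common choice for OCR evaluation; the function names are illustrative.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance with unit costs, computed row by row."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            current.append(
                min(
                    previous[j] + 1,                           # deletion
                    current[j - 1] + 1,                        # insertion
                    previous[j - 1] + (ref_char != hyp_char),  # substitution
                )
            )
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalised by the reference length."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Under this definition, a pipeline that lowers the CER against the same reference transcriptions has measurably improved, and one that leaves it unchanged has not.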

Figures

Figures reproduced from arXiv: 2603.28103 by Alfio Ferrara, Giovanni Pagano, Luigi Curini, Sergio Picascia.

**Figure 2.** Excerpt from the stenographic report of the session held on November 27th 1874, Legislature 12 of the Kingdom of Italy. This page excerpt serves as the running example throughout this section. 3.1. Document Acquisition For each legislature in the history of the Italian Parliament, we retrieved the complete list of sessions (sedute). Each session, which can be thought of as a parliamentary meeting, is un…

**Figure 1.** Pipeline diagram showing the six stages with data flow between components. This section presents the methodology developed for the automatic transcription and semantic labelling of Italian parliamentary session reports. The proposed pipeline transforms digitised parliamentary documents into structured, semantically annotated data, enabling downstream analyses on political discourse studies. The pipeline …

**Figure 3.** Illustrates the effect of post-processing on the running example. The speech fragment on page 26, originally marked with an "unknown" speaker, is merged with the incomplete element at the end of page 25: the hyphenated word is rejoined, the content is concatenated, and the speaker identity is propagated from the preceding context. Before Post-Processing End of page 25: speaker: "MINGHETTI, PRESIDENTE DEL C…
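The merge described in the Figure 3 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the `Element` class, its `complete` flag, and the sample text are invented for the example; only the "unknown" speaker label and the three steps (rejoin the hyphenated word, concatenate, propagate the speaker) come from the caption.

```python
from dataclasses import dataclass


@dataclass
class Element:
    speaker: str
    text: str
    complete: bool = True  # hypothetical flag: False if cut off by a page break


def merge_across_pages(previous: Element, following: Element) -> Element:
    """Merge a fragment with an 'unknown' speaker into the cut-off element before it."""
    if previous.complete or following.speaker != "unknown":
        raise ValueError("elements do not form a page-break continuation")
    head = previous.text.rstrip()
    tail = following.text.lstrip()
    if head.endswith("-"):
        # Rejoin a word hyphenated across the page break.
        text = head[:-1] + tail
    else:
        text = head + " " + tail
    # The speaker identity is propagated from the preceding element.
    return Element(speaker=previous.speaker, text=text, complete=following.complete)


# Invented sample text, used only to exercise the function.
end_of_page = Element(speaker="MINGHETTI", text="La que-", complete=False)
start_of_page = Element(speaker="unknown", text="stione rimane aperta.")
merged = merge_across_pages(end_of_page, start_of_page)
```

Note that a trailing hyphen is always treated as a line-break hyphen here; a real system would need to handle genuine hyphenated compounds separately.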

Original abstract

Parliamentary proceedings represent a rich yet challenging resource for computational analysis, particularly when preserved only as scanned historical documents. Existing efforts to transcribe Italian parliamentary speeches have relied on traditional Optical Character Recognition pipelines, resulting in transcription errors and limited semantic annotation. In this paper, we propose a pipeline based on Vision-Language Models for the automatic transcription, semantic segmentation, and entity linking of Italian parliamentary speeches. The pipeline employs a specialised OCR model to extract text while preserving reading order, followed by a large-scale Vision-Language Model that performs transcription refinement, element classification, and speaker identification by jointly reasoning over visual layout and textual content. Extracted speakers are then linked to the Chamber of Deputies knowledge base through SPARQL queries and a multi-strategy fuzzy matching procedure. Evaluation against an established benchmark demonstrates substantial improvements both in transcription quality and speaker tagging.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper describes a practical VLM pipeline for transcribing and annotating scanned Italian parliamentary speeches, but its claims of substantial gains rest on evaluation details that need checking in the full text.

This paper introduces a pipeline for transcribing and annotating Italian parliamentary speeches from scanned historical documents. It uses a specialized OCR to extract text while preserving order, then applies a Vision-Language Model to refine the transcription, classify elements, and identify speakers through joint visual and textual reasoning. Speakers are linked to the official knowledge base via SPARQL and fuzzy matching.

What is new is the use of VLMs in this domain to handle refinement and speaker tagging together, rather than relying solely on traditional OCR methods. The paper does well in presenting a coherent workflow that targets real challenges like transcription errors and limited semantic information in old records. This could be helpful for building better digital archives of political proceedings.

The approach appears sound on paper, with no signs of circular reasoning or invented steps. It builds logically on pre-existing technologies and an external database.

The soft spot is the evaluation. While it claims substantial improvements on a benchmark for transcription quality and speaker tagging, the abstract lacks specific numbers, baseline comparisons, or details on the test set. If the full paper has robust quantitative results and analysis, that would address this; otherwise, the central claims rest on unverified assertions from the summary. Minor issues might include limited discussion of potential VLM biases or scalability.

This is for researchers in digital humanities or those working on computational political analysis, particularly with Italian or similar language parliamentary data. A reader interested in practical applications of vision-language models for document understanding would find it relevant.

I recommend sending it for peer review. It has enough substance for referees to evaluate the implementation and results properly, and it could contribute to the field with some strengthening of the evidence section.

Referee Report

1 major / 0 minor

Summary. The manuscript proposes a pipeline for transcribing, semantically segmenting, and entity-linking Italian parliamentary speeches from scanned historical documents. It begins with a specialized OCR model that extracts text while preserving reading order, followed by a large-scale Vision-Language Model that jointly reasons over visual layout and textual content to refine the transcription, classify elements, and identify speakers. Extracted speakers are then linked to the Chamber of Deputies knowledge base via SPARQL queries combined with multi-strategy fuzzy matching. The abstract states that evaluation on an established benchmark shows substantial improvements in both transcription quality and speaker tagging relative to prior OCR approaches.

Significance. If the reported gains are substantiated with concrete metrics, this approach could meaningfully advance automated processing of historical parliamentary records, particularly for non-English languages and complex layout-heavy documents. The integration of VLMs for joint visual-textual reasoning offers a plausible route beyond traditional OCR limitations, with potential applicability to other archival digitization tasks in digital humanities and political science.

major comments (1)

Abstract: The central claim of 'substantial improvements' in transcription quality and speaker tagging is asserted without any quantitative metrics, error rates, dataset size, baseline comparisons, or statistical significance tests. This absence leaves the primary empirical contribution without verifiable support and makes it impossible to assess whether the VLM pipeline delivers load-bearing gains over prior methods.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful reading of the manuscript and for identifying this important point about the abstract. We address the comment below and will revise the manuscript accordingly.

Referee: Abstract: The central claim of 'substantial improvements' in transcription quality and speaker tagging is asserted without any quantitative metrics, error rates, dataset size, baseline comparisons, or statistical significance tests. This absence leaves the primary empirical contribution without verifiable support and makes it impossible to assess whether the VLM pipeline delivers load-bearing gains over prior methods.

Authors: We agree that the abstract would be strengthened by including specific quantitative metrics to support the claim of substantial improvements. The full manuscript reports evaluation results on an established benchmark with direct comparisons to prior OCR methods. In the revised version we will update the abstract to incorporate key figures from the results section, including transcription error rates, speaker tagging accuracy, benchmark dataset size, and baseline comparisons. This change will make the primary empirical claims immediately verifiable while preserving the abstract's summary character.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity

The paper presents a standard applied pipeline: specialised OCR for order-preserving extraction, followed by an off-the-shelf large-scale VLM for joint visual-textual refinement/classification/ID, then external SPARQL + fuzzy matching against the Chamber of Deputies knowledge base. All components are described as pre-existing technologies or standard procedures; the central claims rest on empirical improvements measured against an established external benchmark. No equations, fitted parameters renamed as predictions, self-citation chains, or uniqueness theorems appear in the provided text. The derivation is therefore self-contained and externally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on assumptions about the capabilities of existing OCR and Vision-Language Model technologies rather than new parameters or entities introduced in the abstract.

axioms (2)

domain assumption: Specialised OCR model extracts text while preserving reading order
Invoked as the first step of the proposed pipeline.
domain assumption: Large-scale Vision-Language Model can jointly reason over visual layout and textual content for refinement, classification, and speaker identification
Central premise for the second stage of the pipeline.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean reality_from_one_distinction unclear

unclear
Relation between the paper passage and the cited Recognition theorem.

The pipeline employs a specialised OCR model to extract text while preserving reading order, followed by a large-scale Vision-Language Model that performs transcription refinement, element classification, and speaker identification by jointly reasoning over visual layout and textual content.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages · 2 internal anchors

[1]

Introduction. Parliamentary proceedings constitute one of the most valuable documentary sources for the study of political, linguistic, and social change. In the Italian context, these records chronicle nearly two centuries of transformative events. The stenographic reports produced by both chambers of the Italian Parliament, the Camera dei Deputati an...

[2]

have assembled comparable parliamentary corpora across European countries, while Italy-specific efforts, including IPSA (Frasnelli and Palmero Aprosio, 2024) and ItaParlCorpus (Cova, 2025), have produced large-scale datasets spanning extensive historical periods. However, these resources predominantly rely on traditional Optical Character Recognition ...

[3]

Related Work. This section reviews prior work relevant to our contribution, organised into two main areas: parliamentary corpora with a focus on Italian resources, and vision-language models for document understanding and OCR. 2.1. Italian Parliamentary Resources. Parliamentary debates constitute a valuable resource for political science, linguistics, a...

[4]

assembled comparable corpora from 29 European countries, containing over one billion words and covering at least the period 2015–2022, with linguistic annotations following the Universal Dependencies framework. Similarly, the ParlSpeech dataset (Rauh and Schwalbach, 2020) provides full-text corpora from various advanced democracies, though notably excludin...

[5]

Methodology. Figure 1: Pipeline diagram showing the six stages with data flow between components. This section presents the methodology developed for the automatic transcription and semantic labelling of Italian parliamentary session reports. The proposed pipeline transforms digitised parliamentary documents into structured, semantically annotated data...

[6]

score-based ranking: candidates are ranked by their fuzzy matching scores; if a single candidate achieves the highest score, it is selected

[7]

role matching: if the speaker’s role was extracted, candidates whose roles match the extracted role are prioritised

[8]

full name similarity: remaining ties are broken by computing similarity between the candidate’s full name and the extracted speaker name

[9]

abbreviated name handling: for speaker names containing initials (e.g., "G. ROSSI"), the system generates abbreviated forms of candidate names and compares them

[10]

contextual mention: candidates whose full names appear elsewhere in the document text are favoured

[11]

weighted edit distance: a weighted Levenshtein distance is computed, assigning lower substitution costs to vowel-vowel substitutions to account for spelling variations common in historical documents. If disambiguation succeeds, the speaker is linked to the entity’s URI; otherwise, all high-scoring candidates are retained as potential matches. In a second pa...

[12]

Evaluation. To assess the effectiveness of the proposed pipeline, we conduct a comparative evaluation against IPSA (Frasnelli and Palmero Aprosio, 2024), a previously published system for Italian parliamentary corpus construction. We evaluate both the OCR transcription quality and the speaker tagging accuracy using the benchmark dataset released by the a...

[13]

Conclusion. In this paper, we presented a pipeline for automatic transcription, semantic segmentation, and entity linking of Italian parliamentary speeches based on Vision-Language Models. The proposed approach combines a specialised OCR model (dots.ocr) with a large-scale VLM (Qwen2.5-VL-72B) to jointly perform text extraction, element classification, a...

[14]

A survey on optical character recognition system. arXiv preprint arXiv:1710.05703. Yumeng Li, Guang Yang, Hao Liu, Bowen Wang, and Colin Zhang. 2025. dots.ocr: Multilingual document layout parsing in a single vision-language model. Linke Ouyang, Yuan Qu, Hongbin Zhou, Jiawei Zhu, Rui Zhang, Qunshu Lin, Bin Wang, Zhiyuan Zhao, Man Jiang, Xiaomeng Zhao, J...
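Entry [11] above mentions a weighted Levenshtein distance in which vowel-vowel substitutions cost less than other substitutions. The sketch below shows one way such a distance could be computed. The excerpt does not give the actual weights, so the 0.5 vowel cost, the unit cost for all other operations, and the restriction to the five unaccented vowels are assumptions for illustration only.

```python
VOWELS = set("aeiou")


def substitution_cost(a: str, b: str, vowel_cost: float = 0.5) -> float:
    """Zero for identical characters, reduced for vowel-vowel swaps, one otherwise."""
    if a == b:
        return 0.0
    if a in VOWELS and b in VOWELS:
        return vowel_cost
    return 1.0


def weighted_levenshtein(source: str, target: str, vowel_cost: float = 0.5) -> float:
    """Case-insensitive edit distance with cheaper vowel-vowel substitutions."""
    source, target = source.lower(), target.lower()
    previous = [float(j) for j in range(len(target) + 1)]
    for i, a in enumerate(source, start=1):
        current = [float(i)]
        for j, b in enumerate(target, start=1):
            current.append(
                min(
                    previous[j] + 1.0,       # deletion
                    current[j - 1] + 1.0,    # insertion
                    previous[j - 1] + substitution_cost(a, b, vowel_cost),
                )
            )
        previous = current
    return previous[-1]
```

With these assumed weights, two spellings that differ only in one vowel are at distance 0.5, while a consonant change costs a full unit, so vowel variants of a surname rank closer than unrelated names.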
