VIDA: A dataset for Visually Dependent Ambiguity in Multimodal Machine Translation

Chris Biemann; Jingheng Pan; Liang Ding; Longyue Wang; Weihua Luo; Xintong Wang

arxiv: 2605.02035 · v2 · pith:5WKWUVIInew · submitted 2026-05-03 · 💻 cs.CL · cs.AI

Jingheng Pan, Xintong Wang, Longyue Wang, Liang Ding, Weihua Luo, Chris Biemann

Pith reviewed 2026-05-08 19:29 UTC · model grok-4.3

keywords: multimodal machine translation, visual ambiguity, disambiguation dataset, chain-of-thought fine-tuning, large vision language models, supervised fine-tuning, evaluation metrics, out-of-distribution generalization

The pith

A dataset of 2,500 translation instances shows that chain-of-thought fine-tuning helps models use visual evidence to resolve ambiguities more consistently.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper creates a new dataset to test how well machine translation models use images to resolve ambiguous words or phrases in the input text. Prior benchmarks had quality problems and did not match real translation needs, so the authors curate 2,500 examples where vision is essential for disambiguation and design metrics to check resolution at the specific span. Experiments reveal that adding chain-of-thought reasoning to fine-tuning helps models disambiguate more reliably than regular fine-tuning, with particular benefits on examples unlike those seen during training. Readers should care because translation systems that properly ground meaning in visuals could reduce errors in contexts like describing images or videos.

Core claim

We introduce VIDA, a multimodal dataset consisting of 2,500 carefully curated instances in which resolving an annotated ambiguous source span requires visual evidence from the corresponding image. We propose Disambiguation-Centric Metrics that employ an LLM-as-a-judge classifier to determine whether the model has correctly resolved the ambiguous expressions at the span level. Through experiments with state-of-the-art large vision language models using vanilla inference, supervised fine-tuning, and chain-of-thought supervised fine-tuning, we find that chain-of-thought fine-tuning provides more consistent gains in disambiguation accuracy, especially on out-of-distribution subsets.

What carries the argument

The VIDA dataset of 2,500 translation instances requiring visual evidence for ambiguity resolution, evaluated with Disambiguation-Centric Metrics that use an LLM-as-a-judge to verify span-level correctness.
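As a rough illustration of what span-level scoring involves, the sketch below computes disambiguation accuracy per subset from a pluggable judge. It is not the paper's implementation: the `Instance` fields, the subset names, and `keyword_judge` (a toy substring check standing in for the LLM-as-a-judge classifier) are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, Iterable


@dataclass(frozen=True)
class Instance:
    """One evaluated example; every field name here is hypothetical."""
    source: str          # source sentence containing the ambiguity
    span: str            # annotated ambiguous span in the source
    intended_sense: str  # target-side gloss of the sense the image supports
    hypothesis: str      # model translation under evaluation
    subset: str          # e.g. "in_distribution" or "ood"


# A judge returns True when the annotated span is resolved correctly.
Judge = Callable[[Instance], bool]


def disambiguation_accuracy(
    instances: Iterable[Instance], judge: Judge
) -> Dict[str, float]:
    """Fraction of instances in each subset that the judge marks as resolved."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for inst in instances:
        totals[inst.subset] += 1
        hits[inst.subset] += int(judge(inst))
    return {name: hits[name] / totals[name] for name in totals}


def keyword_judge(inst: Instance) -> bool:
    """Toy stand-in for an LLM judge: the gloss must appear verbatim."""
    return inst.intended_sense.lower() in inst.hypothesis.lower()
```

Swapping `keyword_judge` for a function that prompts a real model leaves the accounting unchanged, which is why the judge is passed in as a callable. A substring check would penalize valid paraphrases, the very weakness of surface-overlap metrics that motivates a learned judge in the first place.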

If this is right

Standard supervised fine-tuning improves overall translation quality but provides less consistent gains on disambiguation tasks.
Chain-of-thought supervised fine-tuning produces stronger and more reliable improvements in disambiguation accuracy.
The gains from chain-of-thought fine-tuning are especially pronounced on out-of-distribution examples.
The new metrics enable precise checking of whether models resolve specific ambiguous spans correctly rather than relying on overall sentence quality.
The approach supports evaluation across a wider range of ambiguity types than previous benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Encouraging step-by-step reasoning during training appears to help multimodal models better integrate visual context with linguistic input.
The dataset and metrics could be extended to additional languages or domains to test whether the same training pattern holds.
If the LLM judge scales reliably, it could support faster iteration on new disambiguation methods without full human evaluation.
Similar chain-of-thought fine-tuning might improve performance on other multimodal tasks involving ambiguous descriptions.

Load-bearing premise

The 2,500 instances are accurately annotated such that visual evidence is genuinely required to resolve each ambiguous span, and the LLM-as-a-judge classifier reliably measures correct span-level disambiguation without its own biases or errors.

What would settle it

An independent human review that identifies many dataset examples resolvable from source text alone without the image, or that shows the LLM judge disagrees with human judgments on resolution correctness in a substantial portion of cases.
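One way to picture the first half of that test is a text-only screen: run a system without the image several times and flag any instance whose ambiguous span it still resolves. The sketch below assumes the per-run verdicts already exist (from human reviewers or a judge); the function names, the identifier scheme, and the threshold are hypothetical and not taken from the paper.

```python
from typing import Dict, List


def text_only_resolution_rate(verdicts: List[bool]) -> float:
    """Share of text-only runs in which the span was resolved correctly."""
    if not verdicts:
        raise ValueError("need at least one verdict per instance")
    return sum(verdicts) / len(verdicts)


def flag_text_resolvable(
    verdicts_by_id: Dict[str, List[bool]], threshold: float = 1.0
) -> List[str]:
    """Sorted ids whose text-only resolution rate reaches the threshold.

    A flagged instance is a candidate for NOT being visually dependent,
    because the correct sense was recovered without seeing the image.
    """
    return sorted(
        inst_id
        for inst_id, verdicts in verdicts_by_id.items()
        if text_only_resolution_rate(verdicts) >= threshold
    )
```

With the default threshold only instances resolved in every text-only run are flagged; lowering it trades precision for recall. For spans with just two candidate senses a lucky guess is common, so any threshold would need to be read against that chance rate.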

Figures

Figures reproduced from arXiv: 2605.02035 by Chris Biemann, Jingheng Pan, Liang Ding, Longyue Wang, Weihua Luo, Xintong Wang.

**Figure 1.** Three-stage VIDA curation pipeline.

**Figure 2.** Example of CoT six-step reasoning resolving the ambiguity.

**Figure 3.** Case study of CoT-SFT vs. SFT.

Original abstract

Ambiguity resolution is a key challenge in multimodal machine translation (MMT), where models must genuinely leverage visual input to map an ambiguous expression to its intended meaning. Although prior work has proposed disambiguation-oriented benchmarks probing the role of vision, we observe that existing benchmarks remain limited by task-format mismatch, narrow ambiguity coverage, or insufficient visual-dependency validation. Moreover, existing ambiguity evaluations are not well suited to diverse ambiguity types in open-ended translation. To address these limitations, we present VIDA (Visually-Dependent Ambiguity), a dataset of 2,500 carefully curated instances in which resolving an annotated source span requires visual evidence. We further propose Disambiguation-Centric Metrics that use an LLM-as-a-judge classifier to verify whether annotated ambiguous expressions are resolved correctly at the span level. Experiments with two state-of-the-art LVLMs show that supervised fine-tuning (SFT) improves overall translation quality, while chain-of-thought SFT (CoT-SFT) yields stronger out-of-distribution disambiguation, suggesting that explicit disambiguation guidance improves generalization to diverse ambiguity types.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

VIDA gives a targeted 2,500-instance dataset for cases where vision is needed to resolve MT ambiguity, but the LLM-judge metrics lack human validation.

The main point is a new dataset called VIDA, with 2,500 instances in which an annotated ambiguous source span actually requires the image to pick the right meaning in translation. They also introduce disambiguation-centric metrics that rely on an LLM judge to score whether the span is resolved correctly in the output. Experiments compare vanilla inference, standard SFT, and their chain-of-thought SFT on two large vision-language models, with the claim that CoT-SFT gives more consistent gains on out-of-distribution subsets.

Referee Report

1 major / 1 minor

Summary. The paper introduces the VIDA dataset of 2,500 instances in which resolving an annotated ambiguous source span in machine translation requires visual evidence. It proposes Disambiguation-Centric Metrics that employ an LLM-as-a-judge classifier to assess whether the ambiguous expression is correctly resolved at the span level. Experiments on two state-of-the-art large vision-language models compare vanilla inference, standard supervised fine-tuning (SFT), and chain-of-thought SFT (CoT-SFT), concluding that CoT-SFT produces more consistent gains in disambiguation accuracy, particularly on out-of-distribution subsets.

Significance. If the dataset curation ensures genuine visual dependence and the LLM judge is shown to be reliable, the work would supply a needed benchmark for evaluating visual grounding in multimodal MT and would demonstrate a practical benefit of explicit reasoning traces for handling diverse ambiguity types beyond existing datasets.

major comments (1)

[Disambiguation-Centric Metrics and Experiments] The Disambiguation-Centric Metrics section relies on an LLM-as-a-judge classifier to produce the primary outcome measure (span-level disambiguation accuracy), yet no validation against human judgments, no quantification of judge accuracy or bias, and no ablation on judge reliability are reported. This directly affects the central claim that CoT-SFT yields stronger generalization than SFT, especially on OOD subsets, because any systematic preference of the judge for chain-of-thought outputs could artifactually inflate the reported advantage.

minor comments (1)

[Abstract] The abstract states that prior ambiguity-oriented evaluations suffer from data-quality issues and mismatch with translation scenarios; a short concrete example of one such issue would help readers immediately grasp the motivation for VIDA.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We address the concern regarding the LLM-as-a-judge validation below and commit to strengthening the manuscript accordingly.

Referee: The Disambiguation-Centric Metrics section relies on an LLM-as-a-judge classifier to produce the primary outcome measure (span-level disambiguation accuracy), yet no validation against human judgments, no quantification of judge accuracy or bias, and no ablation on judge reliability are reported. This directly affects the central claim that CoT-SFT yields stronger generalization than SFT, especially on OOD subsets, because any systematic preference of the judge for chain-of-thought outputs could artifactually inflate the reported advantage.

Authors: We agree that the absence of explicit validation for the LLM judge represents a limitation that could affect confidence in the disambiguation accuracy results and the comparative claims for CoT-SFT. The manuscript describes the judge prompt design intended to focus strictly on span-level resolution of the annotated ambiguous expression, independent of overall translation quality or reasoning style. However, no human validation, accuracy metrics, bias quantification, or ablation was included. In the revised version, we will add a dedicated subsection reporting a human validation study: three annotators will evaluate a stratified sample of 400 outputs (100 per model/setting combination across vanilla, SFT, and CoT-SFT, including OOD cases). We will report agreement rates with the LLM judge, Cohen's kappa, and any systematic preferences (e.g., toward CoT outputs). If bias is detected, we will either correct the metric or qualify the claims. We will also include a prompt ablation using an alternative judge model. These additions will directly address the potential artifact concern and support the generalization findings.

revision: yes
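The statistics named in this response can be sketched in a few lines. The code below is an illustrative assumption rather than the authors' analysis: it takes a majority vote over an odd number of annotators, computes Cohen's kappa between the LLM judge and that vote for binary verdicts, and reports a per-setting "leniency" gap (judge pass rate minus human pass rate) as one simple way to surface a systematic preference for a training regime. All function names and setting labels are hypothetical.

```python
from typing import Dict, List, Sequence


def majority_vote(votes: Sequence[bool]) -> bool:
    """Majority label from an odd number of annotators."""
    if len(votes) % 2 == 0:
        raise ValueError("use an odd number of annotators to avoid ties")
    return sum(votes) > len(votes) / 2


def cohen_kappa(a: Sequence[bool], b: Sequence[bool]) -> float:
    """Cohen's kappa for two binary raters over the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("raters must label the same non-empty item list")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    p_a, p_b = sum(a) / n, sum(b) / n
    # Chance agreement: both say True, or both say False.
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1.0:
        # Both raters gave one identical constant label; kappa is formally
        # undefined, and this sketch reports 1.0 by convention.
        return 1.0
    return (observed - expected) / (1 - expected)


def judge_leniency(
    judge: Sequence[bool], human: Sequence[bool], settings: Sequence[str]
) -> Dict[str, float]:
    """Judge pass rate minus human pass rate, per setting.

    A positive value means the judge accepts outputs from that setting
    more often than the human reference does.
    """
    if not (len(judge) == len(human) == len(settings)):
        raise ValueError("inputs must have equal length")
    groups: Dict[str, List[int]] = {}
    for i, name in enumerate(settings):
        groups.setdefault(name, []).append(i)
    return {
        name: sum(judge[i] for i in idx) / len(idx)
        - sum(human[i] for i in idx) / len(idx)
        for name, idx in groups.items()
    }
```

A leniency gap that is clearly larger for CoT-SFT than for SFT would point to the artifact the referee describes, while comparable gaps across settings would support the reported comparison. Kappa on its own would not show this, because it pools all settings into a single agreement figure.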

Circularity Check

0 steps flagged

No significant circularity

The paper is a purely empirical contribution centered on dataset curation (VIDA with 2,500 instances) and experimental evaluation of LVLMs under different training regimes. No mathematical derivations, fitted parameters renamed as predictions, or self-referential chains appear in the abstract or described methodology. The Disambiguation-Centric Metrics and LLM-as-a-judge are presented as measurement tools defined from the new annotations rather than reducing to prior results by construction. All claims rest on direct experimental comparisons, rendering the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Central claims rest on two unverified premises: accurate manual identification of spans whose resolution requires vision, and reliable performance of the LLM judge for span-level correctness. No external benchmarks or formal validation of these premises are described.

axioms (2)

domain assumption: Annotated ambiguous spans require visual evidence for correct resolution
Stated as the defining property of the VIDA dataset instances

ad hoc to paper: LLM-as-a-judge classifier accurately verifies span-level disambiguation
Proposed metric depends on this without reported validation or error analysis

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

IndustryBench-MIPU: Benchmarking Multi-Image Attribute Value Extraction for Industrial Products
cs.CV · 2026-06 · unverdicted · novelty 5.0

The paper presents the first benchmark for multi-image industrial product attribute extraction, finding that MLLMs achieve high precision but only 49.9% recall at product level due to multi-image completeness gaps.