VIDA: A dataset for Visually Dependent Ambiguity in Multimodal Machine Translation
Pith reviewed 2026-05-08 19:29 UTC · model grok-4.3
The pith
A dataset of 2,500 translation instances shows that chain-of-thought fine-tuning helps models use visual evidence to resolve ambiguities more consistently.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce VIDA, a multimodal dataset consisting of 2,500 carefully curated instances in which resolving an annotated ambiguous source span requires visual evidence from the corresponding image. We propose Disambiguation-Centric Metrics that employ an LLM-as-a-judge classifier to determine whether the model has correctly resolved the ambiguous expressions at the span level. Through experiments with state-of-the-art large vision language models using vanilla inference, supervised fine-tuning, and chain-of-thought supervised fine-tuning, we find that chain-of-thought fine-tuning provides more consistent gains in disambiguation accuracy, especially on out-of-distribution subsets.
What carries the argument
The VIDA dataset of 2,500 translation instances requiring visual evidence for ambiguity resolution, evaluated with Disambiguation-Centric Metrics that use an LLM-as-a-judge to verify span-level correctness.
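As a rough illustration of how a span-level metric of this kind might be computed, the sketch below assumes each instance carries the annotated ambiguous span, the rendering supported by the image, and a subset label. The `Instance` fields, the `judge` callable, and the `toy_judge` stand-in are hypothetical and not taken from the paper. In practice the judge would be an LLM call, not a substring check.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, Iterable


@dataclass
class Instance:
    # All field names are illustrative assumptions, not VIDA's actual schema.
    source: str       # source sentence
    span: str         # annotated ambiguous span in the source
    gold_sense: str   # target-language rendering supported by the image
    hypothesis: str   # model translation to be judged
    subset: str       # e.g. "in_dist" or "ood"


# A judge maps an instance to True (span resolved correctly) or False.
Judge = Callable[[Instance], bool]


def disambiguation_accuracy(
    instances: Iterable[Instance], judge: Judge
) -> Dict[str, float]:
    """Return, per subset, the fraction of instances the judge marks as resolved."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for inst in instances:
        total[inst.subset] += 1
        if judge(inst):
            correct[inst.subset] += 1
    return {name: correct[name] / total[name] for name in total}


def toy_judge(inst: Instance) -> bool:
    # Stand-in for an LLM-as-a-judge call: a plain case-insensitive substring check.
    return inst.gold_sense.lower() in inst.hypothesis.lower()


if __name__ == "__main__":
    data = [
        Instance("He sat by the bank.", "bank", "Ufer", "Er saß am Ufer.", "in_dist"),
        Instance("He sat by the bank.", "bank", "Ufer", "Er saß bei der Bank.", "in_dist"),
        Instance("She picked up the bat.", "bat", "Schläger", "Sie hob den Schläger auf.", "ood"),
    ]
    print(disambiguation_accuracy(data, toy_judge))
```

Keeping the judge as a plain callable means the same accounting can be rerun with a different judge model or with human labels, which is what a reliability check on the metric would need.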
If this is right
- Standard supervised fine-tuning improves overall translation quality but provides less consistent gains on disambiguation tasks.
- Chain-of-thought supervised fine-tuning produces stronger and more reliable improvements in disambiguation accuracy.
- The gains from chain-of-thought fine-tuning are especially pronounced on out-of-distribution examples.
- The new metrics enable precise checking of whether models resolve specific ambiguous spans correctly rather than relying on overall sentence quality.
- The approach supports evaluation across a wider range of ambiguity types than previous benchmarks.
Where Pith is reading between the lines
- Encouraging step-by-step reasoning during training appears to help multimodal models better integrate visual context with linguistic input.
- The dataset and metrics could be extended to additional languages or domains to test whether the same training pattern holds.
- If the LLM judge scales reliably, it could support faster iteration on new disambiguation methods without full human evaluation.
- Similar chain-of-thought fine-tuning might improve performance on other multimodal tasks involving ambiguous descriptions.
Load-bearing premise
The 2,500 instances are accurately annotated such that visual evidence is genuinely required to resolve each ambiguous span, and the LLM-as-a-judge classifier reliably measures correct span-level disambiguation without its own biases or errors.
What would settle it
An independent human review that identifies many dataset examples resolvable from source text alone without the image, or that shows the LLM judge disagrees with human judgments on resolution correctness in a substantial portion of cases.
Original abstract
Ambiguity resolution is a key challenge in multimodal machine translation (MMT), where models must genuinely leverage visual input to map an ambiguous expression to its intended meaning. Although prior work has proposed disambiguation-oriented benchmarks probing the role of vision, we observe that existing benchmarks remain limited by task-format mismatch, narrow ambiguity coverage, or insufficient visual-dependency validation. Moreover, existing ambiguity evaluations are not well suited to diverse ambiguity types in open-ended translation. To address these limitations, we present VIDA (Visually-Dependent Ambiguity), a dataset of 2,500 carefully curated instances in which resolving an annotated source span requires visual evidence. We further propose Disambiguation-Centric Metrics that use an LLM-as-a-judge classifier to verify whether annotated ambiguous expressions are resolved correctly at the span level. Experiments with two state-of-the-art LVLMs show that supervised fine-tuning (SFT) improves overall translation quality, while chain-of-thought SFT (CoT-SFT) yields stronger out-of-distribution disambiguation, suggesting that explicit disambiguation guidance improves generalization to diverse ambiguity types.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the VIDA dataset of 2,500 instances in which resolving an annotated ambiguous source span in machine translation requires visual evidence. It proposes Disambiguation-Centric Metrics that employ an LLM-as-a-judge classifier to assess whether the ambiguous expression is correctly resolved at the span level. Experiments on two state-of-the-art large vision-language models compare vanilla inference, standard supervised fine-tuning (SFT), and chain-of-thought SFT (CoT-SFT), concluding that CoT-SFT produces more consistent gains in disambiguation accuracy, particularly on out-of-distribution subsets.
Significance. If the dataset curation ensures genuine visual dependence and the LLM judge is shown to be reliable, the work would supply a needed benchmark for evaluating visual grounding in multimodal MT and would demonstrate a practical benefit of explicit reasoning traces for handling diverse ambiguity types beyond existing datasets.
major comments (1)
- [Disambiguation-Centric Metrics and Experiments] The Disambiguation-Centric Metrics section relies on an LLM-as-a-judge classifier to produce the primary outcome measure (span-level disambiguation accuracy), yet no validation against human judgments, no quantification of judge accuracy or bias, and no ablation on judge reliability are reported. This directly affects the central claim that CoT-SFT yields stronger generalization than SFT, especially on OOD subsets, because any systematic preference of the judge for chain-of-thought outputs could artifactually inflate the reported advantage.
minor comments (1)
- [Abstract] The abstract states that existing benchmarks are limited by task-format mismatch, narrow ambiguity coverage, or insufficient visual-dependency validation; a short concrete example of one such limitation would help readers immediately grasp the motivation for VIDA.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the recommendation for major revision. We address the concern regarding the LLM-as-a-judge validation below and commit to strengthening the manuscript accordingly.
- Referee: The Disambiguation-Centric Metrics section relies on an LLM-as-a-judge classifier to produce the primary outcome measure (span-level disambiguation accuracy), yet no validation against human judgments, no quantification of judge accuracy or bias, and no ablation on judge reliability are reported. This directly affects the central claim that CoT-SFT yields stronger generalization than SFT, especially on OOD subsets, because any systematic preference of the judge for chain-of-thought outputs could artifactually inflate the reported advantage.
Authors: We agree that the absence of explicit validation for the LLM judge represents a limitation that could affect confidence in the disambiguation accuracy results and the comparative claims for CoT-SFT. The manuscript describes the judge prompt design intended to focus strictly on span-level resolution of the annotated ambiguous expression, independent of overall translation quality or reasoning style. However, no human validation, accuracy metrics, bias quantification, or ablation was included. In the revised version, we will add a dedicated subsection reporting a human validation study: three annotators will evaluate a stratified sample of 400 outputs (100 per model/setting combination across vanilla, SFT, and CoT-SFT, including OOD cases). We will report agreement rates with the LLM judge, Cohen's kappa, and any systematic preferences (e.g., toward CoT outputs). If bias is detected, we will either correct the metric or qualify the claims. We will also include a prompt ablation using an alternative judge model. These additions will directly address the potential artifact concern and support the generalization findings.
revision: yes
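The agreement statistics named in the rebuttal can be sketched as follows. This is a minimal illustration, not the authors' analysis: it assumes the three annotators' labels have already been collapsed to one binary human label per output (for example by majority vote), and the function names and toy labels are invented for the example.

```python
from collections import Counter
from typing import Sequence


def agreement_rate(human: Sequence[bool], judge: Sequence[bool]) -> float:
    """Fraction of outputs on which the human label and the judge label match."""
    if len(human) != len(judge) or len(human) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human: Sequence[bool], judge: Sequence[bool]) -> float:
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(human)
    p_o = agreement_rate(human, judge)
    human_counts = Counter(human)
    judge_counts = Counter(judge)
    labels = set(human_counts) | set(judge_counts)
    # Chance agreement from each rater's marginal label frequencies.
    p_e = sum((human_counts[l] / n) * (judge_counts[l] / n) for l in labels)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1.0 - p_e)


def positive_rate_gap(human: Sequence[bool], judge: Sequence[bool]) -> float:
    """Judge 'resolved' rate minus human 'resolved' rate (positive = judge more lenient)."""
    return sum(judge) / len(judge) - sum(human) / len(human)


if __name__ == "__main__":
    # Toy labels for one hypothetical model/setting combination.
    human = [True, True, True, False, False, False, True, False]
    judge = [True, True, False, False, False, True, True, False]
    print(agreement_rate(human, judge))     # 0.75
    print(cohens_kappa(human, judge))       # 0.5
    print(positive_rate_gap(human, judge))  # 0.0
```

Computing `positive_rate_gap` separately for vanilla, SFT, and CoT-SFT outputs would be one simple way to look for the systematic preference toward chain-of-thought outputs that the referee raises.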
Circularity Check
No significant circularity
The paper is a purely empirical contribution centered on dataset curation (VIDA with 2,500 instances) and experimental evaluation of LVLMs under different training regimes. No mathematical derivations, fitted parameters renamed as predictions, or self-referential chains appear in the abstract or described methodology. The Disambiguation-Centric Metrics and LLM-as-a-judge are presented as measurement tools defined from the new annotations rather than reducing to prior results by construction. All claims rest on direct experimental comparisons, rendering the work self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Annotated ambiguous spans require visual evidence for correct resolution
- ad hoc to paper: LLM-as-a-judge classifier accurately verifies span-level disambiguation
Forward citations
Cited by 1 Pith paper
- IndustryBench-MIPU: Benchmarking Multi-Image Attribute Value Extraction for Industrial Products
  The paper presents the first benchmark for multi-image industrial product attribute extraction, finding that MLLMs achieve high precision but only 49.9% recall at product level due to multi-image completeness gaps.