pith. machine review for the scientific record.

arxiv: 2604.20544 · v1 · submitted 2026-04-22 · 💻 cs.CV · cs.AI

Recognition: unknown

Evian: Towards Explainable Visual Instruction-tuning Data Auditing

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 00:46 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords visual instruction tuning · data auditing · large vision-language models · dataset curation · logical coherence · image-text consistency · factual accuracy
0 comments

The pith

A model fine-tuned on a small, high-quality visual instruction subset selected by EVIAN outperforms models trained on much larger datasets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large vision-language models require training data that maintains both visual fidelity and clear instruction following. Existing datasets often include subtle flaws such as logical fallacies and factual errors that current coarse scoring methods fail to catch. The paper introduces EVIAN, which decomposes model responses into visual description, subjective inference, and factual claim components, then evaluates them separately along axes of image-text consistency, logical coherence, and factual accuracy. This decomposition supports automated auditing and curation of data. Tests on a 300K benchmark with injected defects show that models fine-tuned on the resulting compact high-quality subsets exceed the performance of models trained on orders-of-magnitude larger collections.
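To make the data flow concrete, here is a minimal Python sketch of the Decomposition-then-Evaluation idea described above. The component and axis names follow the paper; the keyword-based `decompose` and rule-based `evaluate` below are hypothetical stand-ins for the model-based judges EVIAN actually uses, so this illustrates the shape of the pipeline rather than the paper's implementation.

```python
# Minimal sketch of Decomposition-then-Evaluation (assumed structure, not EVIAN's code).
from dataclasses import dataclass

COMPONENTS = ("visual_description", "subjective_inference", "factual_claim")
AXES = ("image_text_consistency", "logical_coherence", "factual_accuracy")

@dataclass
class AuditedSample:
    image_id: str
    instruction: str
    response: str
    components: dict  # component name -> list of extracted text spans
    scores: dict      # axis name -> score in [1, 5]

def decompose(response: str) -> dict:
    """Hypothetical splitter: assign each sentence to one of the three components.

    A real system would use a tagging model; crude keyword rules keep this self-contained.
    """
    buckets = {c: [] for c in COMPONENTS}
    for sentence in filter(None, (s.strip() for s in response.split("."))):
        lowered = sentence.lower()
        if any(w in lowered for w in ("suggests", "probably", "seems", "must be")):
            buckets["subjective_inference"].append(sentence)
        elif any(w in lowered for w in ("invented in", "is known", "famous for")):
            buckets["factual_claim"].append(sentence)
        else:
            buckets["visual_description"].append(sentence)
    return buckets

def evaluate(components: dict) -> dict:
    """Hypothetical per-axis scoring; EVIAN delegates this step to model judges."""
    return {
        "image_text_consistency": 5 if components["visual_description"] else 1,
        "logical_coherence": 5 if not components["subjective_inference"] else 3,
        "factual_accuracy": 5 if not components["factual_claim"] else 3,
    }

if __name__ == "__main__":
    response = ("A white cat sits on a windowsill. The design suggests it is "
                "from the Victorian era. This park is famous for its autumn foliage.")
    parts = decompose(response)
    sample = AuditedSample("img_001", "Describe the image.", response, parts, evaluate(parts))
    print(sample.scores)
```

The point of the decomposition is that each bucket is judged only on the axis it can violate, rather than collapsing everything into a single quality score.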

Core claim

We build a 300K benchmark by injecting diverse subtle defects into visual instructions. We define a Decomposition-then-Evaluation paradigm that splits responses into visual description, subjective inference, and factual claim. We implement this in the EVIAN framework, which scores the components on Image-Text Consistency, Logical Coherence, and Factual Accuracy. Fine-tuning on the compact high-quality subset identified by EVIAN yields models that surpass those trained on far larger datasets, with Logical Coherence proving the most decisive quality factor.
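A rough sketch of what controlled defect injection for such a benchmark could look like. The defect categories mirror the three evaluation axes, but the perturbation rules, field names, and `defect_rate` parameter are illustrative assumptions, not the paper's actual construction procedure.

```python
# Sketch of controlled defect injection for an auditing benchmark (assumed procedure).
import random

DEFECT_TYPES = {
    "visual_mismatch": "image_text_consistency",  # description contradicts the image
    "logical_fallacy": "logical_coherence",       # unsupported inferential leap
    "factual_error": "factual_accuracy",          # wrong external-world fact
}

def inject_defect(sample: dict, defect_type: str) -> dict:
    """Return a corrupted copy of a clean sample, labeled with the injected defect."""
    corrupted = dict(sample)
    if defect_type == "visual_mismatch":
        corrupted["response"] = sample["response"].replace("cat", "dog")  # toy attribute swap
    elif defect_type == "logical_fallacy":
        corrupted["response"] = sample["response"] + " Therefore it must be a prize-winning show cat."
    elif defect_type == "factual_error":
        corrupted["response"] = sample["response"] + " Cats were first domesticated in 1950."
    corrupted["defect_type"] = defect_type
    corrupted["violated_axis"] = DEFECT_TYPES[defect_type]
    return corrupted

def build_benchmark(clean_samples: list, defect_rate: float = 0.5, seed: int = 0) -> list:
    """Mix clean and corrupted samples so an auditor can be scored against known ground truth."""
    rng = random.Random(seed)
    benchmark = []
    for sample in clean_samples:
        if rng.random() < defect_rate:
            benchmark.append(inject_defect(sample, rng.choice(list(DEFECT_TYPES))))
        else:
            benchmark.append({**sample, "defect_type": None, "violated_axis": None})
    return benchmark

if __name__ == "__main__":
    clean = [{"image_id": "img_001", "response": "A white cat sits on a windowsill."}]
    print(build_benchmark(clean, defect_rate=1.0))
```

Because every injected flaw carries a known violated axis, an auditor's per-axis scores can be checked directly against this ground truth.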

What carries the argument

The Decomposition-then-Evaluation paradigm, which partitions model responses into visual description, subjective inference, and factual claim components and scores them independently on Image-Text Consistency, Logical Coherence, and Factual Accuracy.

If this is right

  • Dividing complex auditing into verifiable subtasks produces more reliable data curation than single-score filters.
  • Logical Coherence ranks as the dominant factor in determining the quality of visual instruction data.
  • Compact high-quality subsets can replace much larger noisy collections for effective fine-tuning.
  • Systematic defect injection supplies a controlled testbed for developing and validating auditing methods.
  • The three-axis evaluation isolates specific failure modes that scale-based approaches overlook.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The decomposition approach could extend to auditing other multimodal or text-only instruction datasets.
  • Linking model errors back to specific response components might enable targeted data fixes rather than wholesale retraining.
  • Automated auditing pipelines built on this method could reduce overall training compute by shrinking required dataset sizes.
  • Further tests on naturally occurring defects, rather than only synthetic ones, would strengthen the framework's real-world applicability.

Load-bearing premise

The synthetic defects injected into the 300K benchmark faithfully represent the nuanced semantic flaws that occur in naturally collected real-world visual instruction data, and the three-way decomposition isolates the components that determine data quality.

What would settle it

Train a vision-language model on the EVIAN-curated compact subset and compare its accuracy on standard LVLM benchmarks against models trained on the full uncurated large datasets; if the small-subset model does not exceed or match the larger ones, the central performance claim is false.
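A schematic of that settling experiment, assuming per-sample audit scores are already available: curate by per-axis thresholds, fine-tune on both the compact subset and the full pool, and compare. The threshold values and the `fine_tune` / `evaluate_on_benchmarks` callables are placeholders; only the selection-and-comparison logic is sketched.

```python
# Sketch of the deciding experiment (assumed thresholds and user-supplied training hooks).
def curate(samples, thresholds=None):
    """Keep only samples whose every audit score meets its axis threshold."""
    thresholds = thresholds or {
        "image_text_consistency": 4,
        "logical_coherence": 4,
        "factual_accuracy": 4,
    }
    return [s for s in samples
            if all(s["scores"].get(axis, 0) >= t for axis, t in thresholds.items())]

def settle(full_pool, fine_tune, evaluate_on_benchmarks):
    """fine_tune and evaluate_on_benchmarks are caller-supplied callables (not shown here)."""
    compact = curate(full_pool)
    curated_score = evaluate_on_benchmarks(fine_tune(compact))
    baseline_score = evaluate_on_benchmarks(fine_tune(full_pool))
    # The central claim survives only if the small curated subset matches or beats
    # the orders-of-magnitude larger uncurated pool.
    return curated_score >= baseline_score, curated_score, baseline_score

if __name__ == "__main__":
    pool = [
        {"id": 1, "scores": {"image_text_consistency": 5, "logical_coherence": 5, "factual_accuracy": 4}},
        {"id": 2, "scores": {"image_text_consistency": 5, "logical_coherence": 2, "factual_accuracy": 5}},
    ]
    print([s["id"] for s in curate(pool)])  # -> [1]; sample 2 fails the coherence threshold
```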

Figures

Figures reproduced from arXiv: 2604.20544 by Andrew Estornell, Jiaheng Wei, Mingjie Xu, Zimu Jia.

Figure 2: Three-stage Chain-of-Thought (CoT) process.
Figure 1: Overview of the two-phase EVIAN framework, which first decomposes a response into visual, inferential, …
Figure 3: Examples of our controlled defect injection. For each pair, the …
Figure 4: Score distribution comparing original and …
original abstract

The efficacy of Large Vision-Language Models (LVLMs) is critically dependent on the quality of their training data, requiring a precise balance between visual fidelity and instruction-following capability. Existing datasets, however, are plagued by inconsistent quality, and current data filtering methods rely on coarse-grained scores that lack the granularity to identify nuanced semantic flaws like logical fallacies or factual errors. This creates a fundamental bottleneck in developing more reliable models. To address this, we make three core contributions. First, we construct a large-scale, 300K-sample benchmark by systematically injecting diverse, subtle defects to provide a challenging testbed for data auditing. Second, we introduce a novel "Decomposition-then-Evaluation" paradigm that breaks model responses into constituent cognitive components: visual description, subjective inference, and factual claim, enabling targeted analysis. Third, we instantiate this paradigm via EVIAN (Explainable Visual Instruction-tuning Data AuditiNg), an automated framework that evaluates these components along the orthogonal axes of Image-Text Consistency, Logical Coherence, and Factual Accuracy. Our empirical findings challenge the prevailing scale-centric paradigm: a model fine-tuned on a compact, high-quality subset curated by EVIAN consistently surpassed models trained on orders-of-magnitude larger datasets. We also reveal that dividing complex auditing into verifiable subtasks enables robust curation, and that Logical Coherence is the most critical factor in data quality evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript introduces EVIAN, an automated framework for explainable auditing of visual instruction-tuning data for Large Vision-Language Models (LVLMs). It constructs a 300K-sample benchmark by systematically injecting diverse subtle defects into clean samples, proposes a 'Decomposition-then-Evaluation' paradigm that decomposes model responses into visual description, subjective inference, and factual claim components, and evaluates these along the axes of Image-Text Consistency, Logical Coherence, and Factual Accuracy. The central empirical claim is that fine-tuning an LVLM on a compact, high-quality subset curated via EVIAN consistently outperforms models trained on orders-of-magnitude larger datasets, while also identifying Logical Coherence as the most critical quality factor.

Significance. If the empirical results and benchmark validity hold, the work has substantial significance for multimodal AI research by shifting emphasis from data scale to targeted, explainable quality curation. It supplies a concrete benchmark and decomposition-based auditing tool that could improve LVLM reliability and efficiency. The framework's orthogonality of evaluation axes and the challenge to scale-centric paradigms are potentially impactful contributions, provided they are supported by reproducible experiments and generalization evidence beyond synthetic defects.

major comments (2)
  1. [Section 3] Section 3 (Benchmark Construction): The 300K benchmark is created by injecting synthetic defects into presumably clean samples, followed by decomposition and scoring. This construction is load-bearing for the transferability claim, yet the manuscript provides no validation that these artificial defects match the distribution or entanglement of natural semantic flaws (e.g., subtle hallucinations or instruction misalignment) in real-world visual instruction data; if the injected defects create detectable signatures absent from natural data, EVIAN may overfit to benchmark artifacts rather than learn a general quality signal.
  2. [Section 5] Experimental Results (Section 5): The abstract and introduction assert that the EVIAN-curated compact subset 'consistently surpassed' models trained on much larger datasets, but the provided text supplies no quantitative metrics (e.g., accuracy deltas, specific baselines such as random or heuristic filtering, statistical tests, or implementation details). This absence undermines assessment of the central claim's magnitude and robustness; the experiments section must include these to make the superiority verifiable.
minor comments (3)
  1. [Abstract] Abstract: The acronym expansion 'Explainable Visual Instruction-tuning Data AuditiNg' contains inconsistent capitalization ('AuditiNg'); standardize to 'EVIAN' or 'Evian' throughout the manuscript and ensure the full name is defined on first use.
  2. [Related Work] Related Work section: Additional citations are needed to prior data filtering and quality assessment methods for LVLMs (e.g., works on hallucination mitigation or instruction data pruning) to better position the novelty of the decomposition paradigm.
  3. [Figure 1] Figure 1 or equivalent (Decomposition diagram): The caption and visual should more explicitly label the three cognitive components and the three evaluation axes to allow readers to follow the paradigm without constant reference to the main text.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating where revisions will be made to improve the manuscript.

point-by-point responses
  1. Referee: [Section 3] Section 3 (Benchmark Construction): The 300K benchmark is created by injecting synthetic defects into presumably clean samples, followed by decomposition and scoring. This construction is load-bearing for the transferability claim, yet the manuscript provides no validation that these artificial defects match the distribution or entanglement of natural semantic flaws (e.g., subtle hallucinations or instruction misalignment) in real-world visual instruction data; if the injected defects create detectable signatures absent from natural data, EVIAN may overfit to benchmark artifacts rather than learn a general quality signal.

    Authors: We agree that demonstrating alignment between synthetic and natural defects is important for the transferability of EVIAN. The defect types were selected to reflect documented LVLM failure modes in the literature (e.g., visual hallucinations, logical inconsistencies, and instruction misalignment). However, the original manuscript does not contain a direct distributional comparison or validation against naturally occurring flaws. In the revision we will expand Section 3 with (1) explicit design rationale for each defect category and (2) a new preliminary study that applies EVIAN to a small curated set of real-world flawed samples drawn from public instruction-tuning datasets, reporting score distributions and qualitative examples to assess similarity. This will clarify the framework's scope while acknowledging remaining gaps. revision: partial

  2. Referee: [Section 5] Experimental Results (Section 5): The abstract and introduction assert that the EVIAN-curated compact subset 'consistently surpassed' models trained on much larger datasets, but the provided text supplies no quantitative metrics (e.g., accuracy deltas, specific baselines such as random or heuristic filtering, statistical tests, or implementation details). This absence undermines assessment of the central claim's magnitude and robustness; the experiments section must include these to make the superiority verifiable.

    Authors: We acknowledge that the quantitative details supporting the central claim were insufficiently elaborated in the submitted manuscript. Section 5 contains the relevant experiments, yet specific numerical results, baseline comparisons, and statistical information were not presented at the required level of detail. In the revised version we will expand the experimental section to report: concrete accuracy deltas on standard VQA and captioning benchmarks, explicit comparisons against random sampling and heuristic filtering baselines, results of statistical significance tests, and full hyperparameter and implementation details sufficient for reproducibility. These additions will make the superiority claim fully verifiable. revision: yes

standing simulated objections (unresolved)
  • Direct empirical validation that the distribution and entanglement of injected synthetic defects match those of natural semantic flaws in real-world visual instruction data

Circularity Check

0 steps flagged

No significant circularity; empirical claim is externally validated

full rationale

The paper's core contribution is an empirical framework: synthetic defect injection creates a 300K benchmark, a decomposition paradigm scores responses on three axes, and EVIAN applies this to curate subsets. The headline result (compact curated subset outperforms much larger datasets) is presented as an experimental outcome from fine-tuning and evaluation, not as a mathematical derivation or fitted quantity that reduces to its own inputs by construction. No equations, self-definitional loops, or load-bearing self-citations appear in the abstract or described chain. The method is checked against external benchmarks (standard LVLM training tasks), satisfying the criteria for a non-circular finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Abstract-only review prevents exhaustive enumeration; the central claim rests on the unverified assumption that the synthetic benchmark defects mirror real flaws and that the three-component decomposition is a valid proxy for data quality.

axioms (1)
  • domain assumption The decomposition of model responses into visual description, subjective inference, and factual claim accurately isolates the cognitive components that determine instruction data quality.
    This premise underpins the entire paradigm but receives no justification or validation in the abstract.
invented entities (1)
EVIAN framework · no independent evidence
    purpose: Automated evaluation of the three decomposed components along consistency, coherence, and accuracy axes
    A proposed system whose independent utility is asserted but not demonstrated in the abstract.

pith-pipeline@v0.9.0 · 5552 in / 1439 out tokens · 40502 ms · 2026-05-10T00:46:41.401393+00:00 · methodology

discussion (0)

