How Well Do Models Follow Visual Instructions? VIBE: A Systematic Benchmark for Visual Instruction-Driven Image Editing

Anna Korhonen; Chengzu Li; Chen Liang; Haochen Tian; Haodong Li; Huanyu Zhang; Liang Wang; Ruichuan An; Tieniu Tan; Xuehai Bai

arxiv: 2602.01851 · v2 · pith:OXFILZ3Xnew · submitted 2026-02-02 · 💻 cs.CV

Huanyu Zhang, Xuehai Bai, Chengzu Li, Chen Liang, Haochen Tian, Haodong Li, Ruichuan An, Yifan Zhang, Anna Korhonen, Zhang Zhang, Liang Wang, Tieniu Tan

Pith reviewed 2026-05-22 12:06 UTC · model grok-4.3

classification 💻 cs.CV

keywords visual instructions · image editing · benchmark · generative models · multimodal evaluation · instruction following · sketch-based editing · LMM judge

The pith

Proprietary image editing models follow visual instructions better than open-source ones but degrade sharply on complex tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces VIBE, a benchmark that tests image editing models using visual instructions such as sketches rather than text prompts alone. It structures evaluation around three levels of increasing complexity that move from basic spatial pointing to shape changes and then to cause-and-effect reasoning. Seventeen models are assessed with a scalable judge based on large multimodal models and task-specific metrics. The results indicate that closed-source systems show initial ability to interpret these instructions and lead overall, yet every model loses capability as the visual demands rise. Readers would care because visual instructions align with natural human communication, so stronger performance here could make generative editing tools more practical for everyday creative work.

Core claim

The paper establishes VIBE as a systematic benchmark for visual instruction-driven image editing featuring a three-level interaction hierarchy progressing from deictic grounding through morphological manipulation to causal reasoning, together with an LMM-as-a-judge framework using task-specific metrics. Comprehensive evaluation of seventeen representative open-source and proprietary models shows that proprietary systems exhibit early-stage visual instruction-following capabilities and consistently outperform open-source models, while performance degrades markedly with increasing task difficulty even for the strongest systems.

What carries the argument

The VIBE benchmark's three-level interaction hierarchy that captures deictic grounding, morphological manipulation, and causal reasoning, paired with an LMM-as-a-judge evaluation framework that applies task-specific metrics.

If this is right

Image editing systems will need targeted improvements in visual reasoning to maintain performance on harder instructions.
Proprietary models currently hold an edge in visual instruction following that may narrow with further open-source development.
Benchmarks for generative image models should incorporate visual instructions to better reflect real user intent.
Scalable LMM-based judging can support fine-grained assessment of spatial and causal editing tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Progress on visual instruction following may transfer to other multimodal tasks such as diagram-based design or interactive scene generation.
Curating larger and more diverse sets of real-user visual instructions could accelerate training of open-source models to close the observed gap.
Integrating explicit spatial grounding modules into editing pipelines might reduce the performance drop on higher-level causal tasks.

Load-bearing premise

The LMM-as-a-judge evaluation with task-specific metrics supplies reliable, unbiased, and fine-grained scores for how well models follow the visual instructions.

What would settle it

Human evaluators scoring the same model outputs on the VIBE test cases and showing low agreement with the LMM judge scores, or any single model maintaining high accuracy across all three difficulty levels without measurable decline.
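
One way to run the first check is to correlate human ratings with judge scores on the same outputs. The sketch below computes a Pearson coefficient in plain Python. The function name `pearson` and both score lists are hypothetical placeholders, not data or code from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances are left unnormalized; the factors cancel in the ratio.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-sample scores for the same model outputs (not from the paper).
human_scores = [1.0, 0.0, 0.5, 1.0, 0.0]
judge_scores = [1.0, 0.0, 0.5, 0.5, 0.0]
r = pearson(human_scores, judge_scores)
print(f"human vs. judge Pearson r = {r:.3f}")
```

A coefficient near 1 would suggest the judge orders outputs much as human raters do. What value counts as "low agreement" is a threshold the evaluator has to choose and justify.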

Figures

Figures reproduced from arXiv: 2602.01851 by Anna Korhonen, Chengzu Li, Chen Liang, Haochen Tian, Haodong Li, Huanyu Zhang, Liang Wang, Ruichuan An, Tieniu Tan, Xuehai Bai, Yifan Zhang, Zhang Zhang.

**Figure 1.** Motivation and scope of the VIBE benchmark. Traditional image editing is largely text-guided, where conveying spatial intent relies on verbose descriptions and incurs high cognitive load. In contrast, visual instructions enable precise and explicit grounding, providing a more human-aligned interaction paradigm. VIBE is designed to fill the evaluation gap by systematically benchmarking this visual instruc…

**Figure 2.** Composition of VIBE. VIBE comprises 1,034 samples across 10 tasks, organized into a three-level hierarchy that reflects increasing interaction and reasoning complexity, from deictic grounding and morphological manipulation to causal reasoning.

**Figure 3.** Overview of VIBE. VIBE organizes visual instruction-guided image editing into a three-level interaction hierarchy with increasing task complexity. The Deictic Level treats visual instructions as selectors that specify localized regions or objects for basic spatial operations. The Morphological Level interprets visual instructions as blueprints that define abstract structural constraints. The Causal Level v…

**Figure 4.** Performance across image styles on the Deictic Level. Left: Average Deictic Level scores across real-world, animation, and sketch images for four proprietary models. Right: Metric-level heatmaps for Seedream 4.5 and GPT-Image-1, illustrating style-dependent variations in Instruction Adherence, Contextual Preservation, and Visual Coherence.

**Figure 5.** (caption not extracted)

**Figure 6.** (caption not extracted)

**Figure 7.** Pearson correlation between human expert scores and LMM-based evaluation scores for Nano Banana Pro and GPT-Image-1, demonstrating a strong alignment between human judgments and the LMM-as-a-Judge evaluator.

**Figure 8.** Two representative cases illustrating how textual and visual instructions interact. The first case shows that visual instructions can resolve target ambiguity that detailed text alone fails to address. The second case demonstrates that complex semantic constraints require the joint use of detailed textual and visual instructions.

**Figure 9.** Examples of editing results from Seedream 4.5 across different image styles.

**Figure 10.** Qualitative incorrect examples on the Deictic and Morphological Levels.

**Figure 11.** Qualitative incorrect examples on the Causal Level.

**Figure 12.** Qualitative examples with visually embedded instructions. All examples use the same minimal textual prompt, “Edit this image following the instructions annotated on this picture.” Task specifications are conveyed through text and symbols embedded directly in the input image. Nano Banana Pro correctly executes single-task, multi-task, and causal editing operations based on these visually embedded instructi…

**Figure 13.** Screenshot of the developed data annotation system used in section 4.3.

Original abstract

Recent generative models have achieved remarkable progress in image editing. However, existing systems and benchmarks remain largely text-guided. In contrast, human communication is inherently multimodal, where visual instructions such as sketches efficiently convey spatial and structural intent. To address this gap, we introduce VIBE, the Visual Instruction Benchmark for Image Editing with a three-level interaction hierarchy that captures deictic grounding, morphological manipulation, and causal reasoning. Across these levels, we curate high-quality and diverse test cases that reflect progressively increasing complexity in visual instruction following. We further propose a robust LMM-as-a-judge evaluation framework with task-specific metrics to enable scalable and fine-grained assessment. Through a comprehensive evaluation of 17 representative open-source and proprietary image editing models, we find that proprietary models exhibit early-stage visual instruction-following capabilities and consistently outperform open-source models. However, performance degrades markedly with increasing task difficulty even for the strongest systems, highlighting promising directions for future research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

VIBE introduces a benchmark for visual instructions in image editing, built on a three-level hierarchy. Proprietary models come out ahead, but every model degrades on hard tasks, and the LMM judge lacks validation.

The main takeaway from this paper is the introduction of the VIBE benchmark, which targets how models handle visual instructions for image editing through a structured three-level hierarchy. It also reports that proprietary models show better early capabilities than open-source alternatives, though all models see marked drops in performance as task difficulty increases.

On the positive side, the work is new in shifting away from purely text-guided benchmarks to include visual elements such as sketches for conveying spatial intent. The hierarchy captures different aspects of instruction following, from basic pointing to more advanced causal reasoning, and the curation of high-quality test cases reflects real-world complexity. Using an LMM as judge with tailored metrics is a reasonable way to make evaluation scalable, and the comprehensive testing of 17 models gives a solid overview of where the field stands today.

Where it could be stronger is in validating the judge. The central claims depend on the LMM producing accurate scores for things like deictic grounding and causal edits. Without evidence of alignment with human judgments or controls for potential biases in the judge, such as preferring certain output qualities, the differences between model types might not be as clear-cut. The paper could also benefit from more explicit discussion of how the test cases were selected and verified to avoid any circularity in the evaluation.

This paper would appeal to researchers and practitioners in computer vision who are developing or assessing image editing systems that support multimodal inputs. It provides a framework that others could build on or compare against. Overall, it deserves to go through peer review so that the community can help refine the evaluation protocol and confirm the findings.

Referee Report

2 major / 1 minor

Summary. The paper introduces VIBE, a benchmark for visual instruction-driven image editing organized around a three-level hierarchy (deictic grounding, morphological manipulation, and causal reasoning). It curates progressively complex test cases and proposes an LMM-as-a-judge framework equipped with task-specific metrics. Evaluation of 17 open-source and proprietary image-editing models leads to the claim that proprietary systems exhibit early-stage visual instruction-following ability and outperform open-source counterparts, yet all models degrade sharply as task difficulty increases.

Significance. If the evaluation framework holds, the benchmark would usefully document the current gap between text-guided and visually instructed editing and the persistent difficulty of complex visual instructions, supplying a concrete reference point for future multimodal generative work.

major comments (2)

[Evaluation Framework] Abstract and Evaluation Framework section: the assertion of a 'robust' LMM-as-a-judge framework with task-specific metrics is not accompanied by reported inter-judge agreement, human correlation coefficients, or bias-control experiments. Because the central performance comparisons (proprietary vs. open-source, degradation across levels) rest entirely on the judge's outputs, this omission is load-bearing for the claims.
[Results] Results section: the statement that 'performance degrades markedly with increasing task difficulty' is presented without per-level quantitative breakdowns or tables that would allow readers to verify the magnitude of the drop for the strongest models on causal-reasoning cases.

minor comments (1)

[Abstract] The abstract states that 'high-quality and diverse test cases' were curated but does not report the total count or the distribution across the three hierarchy levels.
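
The inter-judge agreement requested in the first major comment could be reported with Cohen's kappa over binary scores from two judges. This is a minimal sketch under that assumption. The helper `cohen_kappa` and the two score lists are illustrative and do not come from the paper.

```python
def cohen_kappa(a, b):
    """Cohen's kappa for two raters who labeled the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty, equal-length label lists")
    n = len(a)
    labels = set(a) | set(b)
    # Fraction of items on which the two raters agree.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    expected = sum((a.count(label) / n) * (b.count(label) / n) for label in labels)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use one identical label")
    return (observed - expected) / (1 - expected)


# Hypothetical binary scores from two LMM judges on eight outputs.
judge_a = [1, 1, 0, 0, 1, 0, 1, 1]
judge_b = [1, 0, 0, 0, 1, 0, 1, 1]
kappa = cohen_kappa(judge_a, judge_b)
print(f"Cohen's kappa = {kappa:.2f}")
```

Kappa corrects raw agreement for chance, which matters when most scores are 0 or most are 1. The same function applies to a judge-versus-human comparison on binary metrics.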

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, indicating whether revisions have been made.

Referee: [Evaluation Framework] Abstract and Evaluation Framework section: the assertion of a 'robust' LMM-as-a-judge framework with task-specific metrics is not accompanied by reported inter-judge agreement, human correlation coefficients, or bias-control experiments. Because the central performance comparisons (proprietary vs. open-source, degradation across levels) rest entirely on the judge's outputs, this omission is load-bearing for the claims.

Authors: We agree that additional quantitative validation of the LMM-as-a-judge would strengthen the evaluation framework. While the task-specific metrics were chosen to align with the hierarchical structure of the benchmark, we acknowledge the value of reporting inter-judge agreement, human correlation coefficients, and bias-control results. In the revised manuscript, we have added these analyses in the Evaluation Framework section, including agreement statistics across multiple LMM judges and correlation with human raters on a subset of cases.

Revision: yes

Referee: [Results] Results section: the statement that 'performance degrades markedly with increasing task difficulty' is presented without per-level quantitative breakdowns or tables that would allow readers to verify the magnitude of the drop for the strongest models on causal-reasoning cases.

Authors: We concur that explicit per-level breakdowns would improve verifiability of the degradation claim. The original manuscript summarized trends across levels but did not include a dedicated table with exact scores. We have added a new table in the Results section that reports performance for each of the three levels separately, with particular emphasis on the strongest proprietary models on the causal-reasoning subset.

Revision: yes
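
A per-level breakdown of the kind described in this response can be produced by grouping per-sample scores by model and hierarchy level. The sketch assumes a flat list of `(model, level, score)` records. The record layout, the model name, and every number are made up for illustration.

```python
from collections import defaultdict

# Level names follow the three-level hierarchy described in the paper.
LEVELS = ["deictic", "morphological", "causal"]


def mean_score_by_level(records):
    """Average per-sample scores for each (model, level) pair."""
    totals = defaultdict(lambda: [0.0, 0])  # (model, level) -> [sum, count]
    for model, level, score in records:
        bucket = totals[(model, level)]
        bucket[0] += score
        bucket[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


# Hypothetical records; "model_a" is a placeholder, not an evaluated system.
records = [
    ("model_a", "deictic", 1.0), ("model_a", "deictic", 0.5),
    ("model_a", "morphological", 0.5),
    ("model_a", "causal", 0.25), ("model_a", "causal", 0.25),
]
table = mean_score_by_level(records)
for level in LEVELS:
    print(f"model_a  {level:<14} {table[('model_a', level)]:.2f}")
```

Reporting one row per model and one column per level lets a reader measure the size of the drop directly instead of relying on a prose summary.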

Circularity Check

0 steps flagged

No circularity: benchmark construction and model evaluations are independent of inputs

The paper introduces VIBE as a new benchmark with a three-level hierarchy and curated test cases, then applies an LMM-as-a-judge framework with task-specific metrics to evaluate 17 models. The performance claims (proprietary models outperforming open-source ones, with degradation on harder tasks) follow directly from these external evaluations rather than any self-definitional loop, fitted parameter renamed as prediction, or load-bearing self-citation. No equations, ansatzes, or uniqueness theorems are invoked that reduce the results to the paper's own inputs by construction. The framework is presented as a proposed tool for assessment, not derived from prior results by the same authors in a way that creates tautology. This is a standard benchmark paper whose claims rest on observable model outputs against independently curated cases.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on the domain assumption that visual instructions can be meaningfully stratified into deictic grounding, morphological manipulation, and causal reasoning, plus the choice of LMM judge as a scalable proxy for human evaluation.

axioms (1)

domain assumption: Visual instructions can be effectively categorized into a three-level hierarchy of deictic grounding, morphological manipulation, and causal reasoning.
This hierarchy structures the benchmark and test case curation as described in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "We introduce VIBE, the Visual Instruction Benchmark for Image Editing with a three-level interaction hierarchy that captures deictic grounding, morphological manipulation, and causal reasoning."

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat recovery
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "We further propose a robust LMM-as-a-judge evaluation framework with task-specific metrics"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Edit-Compass & EditReward-Compass: A Unified Benchmark for Image Editing and Reward Modeling
cs.CV 2026-05 unverdicted novelty 7.0

Edit-Compass and EditReward-Compass are new unified benchmarks for fine-grained image editing evaluation and realistic reward modeling in reinforcement learning optimization.

Reference graph

Works this paper leans on

41 extracted references · 41 canonical work pages · cited by 1 Pith paper

[1]

Edit this image following the instructions annotated on this picture

URL https://aclanthology.org/2024.emnlp-main.106/. Li, C., Wu, W., Zhang, H., Li, Q., Gao, Z., Xia, Y., Hernández-Orallo, J., Vulić, I., and Wei, F. 11plus-bench: Demystifying multimodal llm spatial reasoning with cognitive-inspired analysis. arXiv preprint arXiv:2508.20068, 2025a.

work page arXiv 2024
[2]

Visual Instruction Localization Correctness Did the main edit occur on the object or region explicitly indicated by the visual instruction on the Input Image (The second image)?

work page
[3]

Visual Operator Type Compliance Was the type of edit consistent with the operation implied by the visual instruction?

work page
[4]

Visual Instruction Localization Correctness

Textual Action Semantic Compliance Did the model execute the core action specified in the Text Prompt? Scoring rules: - Score 1 if the requirement is clearly satisfied. - Score 0 if the requirement is not satisfied or is ambiguous. - If unsure, assign 0. - Partial compliance must be scored as 0. You may reason freely to reach your decision. Then, for EACH...

work page
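
The scoring rules above are strict: only a clearly satisfied requirement earns 1, and ambiguous, unsure, or partial results earn 0. The sketch below encodes that rule and, as an assumption not stated in this extract, averages the binary checks into one adherence score. The check names and both helper functions are hypothetical.

```python
def strict_binary(raw):
    """Map a raw judge value to 1 only when it equals 1; anything else becomes 0."""
    return 1 if raw in (1, "1") else 0


def adherence_score(checks):
    """Unweighted mean of strict binary checks (the averaging is an assumption)."""
    if not checks:
        raise ValueError("at least one check is required")
    return sum(strict_binary(value) for value in checks.values()) / len(checks)


# Hypothetical judge output for one sample: one pass, one partial, one unsure.
sample_checks = {
    "localization_correctness": 1,
    "operator_type_compliance": 0.5,    # partial compliance is scored as 0
    "textual_action_compliance": None,  # an unsure verdict is scored as 0
}
score = adherence_score(sample_checks)
print(f"instruction adherence = {score:.2f}")
```

Collapsing partial credit to 0 makes scores conservative, so a model has to satisfy each requirement cleanly to earn credit for it.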
[5]

- Ignore content missing only due to cropping

Cropping rule - If the output is cropped, only compare the overlapping visible region. - Ignore content missing only due to cropping

work page
[6]

- Do NOT list differences caused by: • minor blur or softness, • small texture or color shifts, • pixel-level noise, • slight position or alignment offsets

Difference listing (what counts as a difference) - List ONLY meaningful differences at the level of objects or semantic entities. - Do NOT list differences caused by: • minor blur or softness, • small texture or color shifts, • pixel-level noise, • slight position or alignment offsets. - A difference should be listed ONLY if it: • adds or removes a comple...

work page
[7]

Target rule - Identify the intended edit target based ONLY on: (a) the visual instruction marks, and (b) the text prompt

work page
[8]

Classification rule - IN TARGET: • any change within the intended target, • OR any imperfect attempt to edit the target (including misplacement, offset, scale error, or incomplete coverage). Table 8. Evaluation Prompt for Contextual Preservation 2/2 Contextual Preserva...

work page
[9]

Contextual Preservation

Scoring - Score = 1 if NO OUT OF TARGET differences exist. - Score = 0 if ANY OUT OF TARG Output format: First provide a brief analysis with these sections: - ## Differences - ## Target - ## Classification - ## Decision Then output the final JSON as the last part of your response: {“Contextual Preservation”:{“reason”: “string”, “score”: 0} }

work page
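
The output format above asks the judge to place its JSON verdict at the end of a free-form analysis. A minimal sketch of how such a response might be parsed follows. `extract_final_json` is a hypothetical helper, and it assumes the judge emits standard JSON with straight quotes (the typographic quotes in the extracted prompt are likely a PDF artifact).

```python
import json


def extract_final_json(response):
    """Return the JSON object that ends a judge response, ignoring earlier text."""
    decoder = json.JSONDecoder()
    text = response.rstrip()
    # Try every '{' from left to right and accept the first object
    # whose decoded span runs to the very end of the response.
    for start, char in enumerate(text):
        if char != "{":
            continue
        try:
            obj, end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            continue
        if end == len(text) and isinstance(obj, dict):
            return obj
    raise ValueError("no trailing JSON object found")


# Hypothetical judge response that follows the section layout in the prompt.
response = (
    "## Differences\nNone outside the marked region.\n"
    "## Target\nThe circled cup.\n"
    "## Classification\nAll changes are IN TARGET.\n"
    "## Decision\nPreserved.\n"
    '{"Contextual Preservation": {"reason": "no out-of-target change", "score": 1}}'
)
verdict = extract_final_json(response)
print(verdict["Contextual Preservation"]["score"])
```

Requiring the object to end the response avoids picking up braces that appear inside the analysis. A response with no trailing object raises an error, so it is not silently scored.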
[10]

Scoring - Score = 1 if NO OUT OF TARGET differences exist. - Score = 0 if ANY OUT OF TARG… Table 9. Evaluation Prompt for Visual Coherence 1/2 Visual Coherence 1/2 You are given THREE images and ONE text prompt. The first image: - This is the original image. The second image: - Visu...

work page
[11]

- Score 0 if the edited region introduces a different artistic or rendering domain

Style Consistency Did the edited region adopt the same artistic or rendering domain as the Input Image (e.g., line-art, watercolor, oil painting, 3D render, photographic style, pixel art, animation)? - Score 1 if the edited region clearly belongs to the same visual domain as the source image. - Score 0 if the edited region introduces a different artistic ...

work page
[12]

- Score 1 if the edited region integrates seamlessly with its surroundings

Visual Seamlessness Is the edited region visually continuous with its surrounding area, without obvious signs of compositing? Focus on whether there are clear visual discontinuities such as: - unnatural seams or hard boundaries, - abrupt changes in texture, color, or resolution, - visible cut-and-paste artifacts. - Score 1 if the edited region integrates ...

work page
[13]

score”: an integer value of 0 or 1. - “reason

Artifact-Free Generation Does the Output Image avoid obvious, domain-independent generative artifacts? Table 10. Evaluation Prompt for Visual Coherence 2/2 Visual Coherence 2/2 Consider artifacts such as: - unintended blurring or pixelation, - geometric distortion or deformation, -...

work page
[14]

IMAGE 1 (Golden Label / Ground Truth): Depicts the correct initial state, the correct future trajectory (green dashed line), and the correct final target (red bounding box)

work page
[15]

YOUR TASK: Evaluate the Generated Image against the Golden Label based on three independent metrics

IMAGE 2 (Model Generation): Depicts a predicted scene, trajectory, and target. YOUR TASK: Evaluate the Generated Image against the Golden Label based on three independent metrics. You must analyze the images step-by-step and output a final JSON score

work page
[16]

Context Preservation (CP) Goal: Verify that the static environment remains unchanged. Check: - Are all billiard balls (balls with numbers and the white ball) present? - Are the ball numbers consistent? - Is the spatial layout (positions of non-moving balls) consistent with the ground truth? - Is the black arrow on the white ball preserved? Ignore: Slight ...

work page
[17]

Path Correctness (PC) Goal: Verify the topology and direction of the green dashed trajectory. Check: - Does the predicted path move in the same cardinal direction? - Does the trajectory bounce off the same specific walls or cushions in the same order? (e.g., if Truth hits Top-Wall then Left-Wall, Prediction must do the same). - Is the path free of halluci...

work page
[18]

The Golden path bounces off the top cushion. The Generated path bounces off the bottom cushion. These are different

Collision Correctness (CC) Goal: Verify the final target identity. Check: - Does the red bounding box surround the same specific ball number as in the Golden Label? Note: This metric is strictly about the identity of the target, regardless of whether the path (PC) looks perfect. Scoring: - 0: If the red box highlights a different ball or an empty space. -...

work page
[19]

Wind-Identity Preservation (W-IP)

work page
[20]

Differences related to arrows MUST be ignored, unless they severely obstruct the subject or degrade visual quality

Wind-Pose/Placement Preservation (W-PP) IMPORTANT GLOBAL RULES: - Visual instruction markings (e.g., arrows) are NOT scene content. Differences related to arrows MUST be ignored, unless they severely obstruct the subject or degrade visual quality. - Wind effects are assumed to be intentional and allowed. - Do NOT penalize changes directly caused by wind. ...

work page
[21]

First provide your reasoning under: - ## Identity Preservation Analysis - ## Pose / Placement Analysis

work page
[23]

Wind-Identity Preservation

The JSON must be the LAST part of your response. FINAL JSON FORMAT (EXACT): {“Wind-Identity Preservation”:{ “reason”: “string”, “score”: 0 }, “Wind-Other Preservation”:{ “reason”: “string”, “score”: 0 } } Table 15. Evaluation Prompt for Wind Direction Consisten...

work page
[24]

First provide your reasoning under: - ## Causal Wind Evidence - ## Direction Description - ## Direction Comparison

work page
[26]

Wind Direction Consistency

The JSON must be the LAST part of your response. FINAL JSON FORMAT (EXACT): {“Wind Direction Consistency”:{ “reason”: “string (one-sentence summary)”, “score”: 0.0 } } Table 17. Evaluation Prompt for Contextual Preservation in Light Control 1/2 Contextual Prese...

work page
[27]

List all observable differences between IMAGE 1 and IMAGE 2

work page
[28]

———————————————————— PART 3: Contextual Preservation decision Scoring (binary): - Score = 1 ONLY if ALL observed differences fall under allowed lighting-related changes

For each difference, determine whether it is: - a lighting-related manifestation (allowed), OR - an unrelated content modification (not allowed). ———————————————————— PART 3: Contextual Preservation decision Scoring (binary): - Score = 1 ONLY if ALL observed differences fall under allowed lighting-related changes. - Score = 0 If ANY difference corresponds...

work page
[29]

First provide your reasoning under: - ## Difference Analysis - ## CP Decision

work page
[31]

Contextual Preservation

The JSON must be the LAST part of your response. FINAL JSON FORMAT (EXACT): {“Contextual Preservation”:{ “reason”: “string (one-sentence summary)”, “score”: 0 } } Table 19. Evaluation Prompt for Lighting Direction Consistency 1/2 Lighting Direction Consistency ...

work page
[32]

Direction Matching Consistency (DMC)

work page
[33]

from upper-right toward lower-left

Physical Lighting Consistency (PLC) ———————————————————— PART 1: Direction description (no comparison) (1) Arrow direction in IMAGE 1: Describe the arrow direction as a continuous spatial direction. Use precise language (e.g., “from upper-right toward lower-left”, “slightly downward from right to left”). Do NOT reduce the direction to simple categories li...

work page
[34]

First provide your reasoning under: - ## Direction Description - ## Direction Comparison - ## Physical Consistency

work page
[36]

Direction Matching Consistency

The JSON must be the LAST part of your response. FINAL JSON FORMAT (EXACT): {“Direction Matching Consistency”:{ “reason”: “string”, “score”: 0.0 }, “Physical Lighting Consistency”:{ “reason”: “string”, “score”: 0 } } Table 21. Evaluation Prompt for BII, CIC, an...

work page
[37]

Body Instance Integrity (BII)

work page
[38]

Character Identity Consistency (CIC)

work page
[39]

- Do NOT compare limb positions or body posture

Context Preservation (CP) IMPORTANT: - Do NOT evaluate pose correctness. - Do NOT compare limb positions or body posture. - Each metric must be judged independently. ———————————————————— Metric definitions: (1) Body Instance Integrity (BII) Checks whether IMAGE 2 depicts exactly ONE coherent human body instance. Score 0 if ANY of the following occur: - Ex...

work page
[40]

First provide your reasoning under these headers: - ## Analysis - ## Decisions

work page
[41]

Then output the final JSON under: - ## JSON

work page
[42]

reason”: one short factual sentence - “score

Each metric must include: - “reason”: one short factual sentence - “score”: 0 or 1

work page
[43]

Body Instance Integrity

The JSON must be the LAST part of your response. ———————————————————— FINAL JSON FORMAT (EXACT): { “Body Instance Integrity”:{ “reason”: “string (one-sentence summary)”, “score”: 0/1 }, “Character Identity Consistency”:{ “reason”: “string (one-sentence summary)”, “score”: 0/1 }, “Context Preservation”:{ “reason”: “string (one-sentence summary)”, “score”: ...

work page
[44]

Identity Consistency (IC)

work page
[45]

Identity Consistency

Visual Integrity (VI) These two metrics MUST be evaluated separately and independently. ———————————————————— ## PART 1: Identity Consistency (IC) Identity Consistency evaluates whether the edited subject in IMAGE 2 remains the same semantic object/entity as in IMAGE 1, ignoring changes that are directly caused by orientation editing. ### STEP 1: Identify ...

work page

[1] [1]

Edit this image following the instructions annotated on this picture

URL https://aclanthology.org/2024. emnlp-main.106/. Li, C., Wu, W., Zhang, H., Li, Q., Gao, Z., Xia, Y ., Hern´andez-Orallo, J., Vuli ´c, I., and Wei, F. 11plus- bench: Demystifying multimodal llm spatial reason- ing with cognitive-inspired analysis.arXiv preprint arXiv:2508.20068, 2025a. 9 VIBE: A Systematic Benchmark for Visual Instruction-Driven Image ...

work page arXiv 2024

[2] [2]

Visual Instruction Localization Correctness Did the main edit occur on the object or region explicitly indicated by the visual instruction on the Input Image (The second image)?

[3] [3]

Visual Operator Type Compliance Was the type of edit consistent with the operation implied by the visual instruction?

[4] [4]

Visual Instruction Localization Correctness

Textual Action Semantic Compliance Did the model execute the core action specified in the Text Prompt? Scoring rules: - Score 1 if the requirement is clearly satisfied. - Score 0 if the requirement is not satisfied or is ambiguous. - If unsure, assign 0. - Partial compliance must be scored as 0. You may reason freely to reach your decision. Then, for EACH...

[5] [5]

Cropping rule - If the output is cropped, only compare the overlapping visible region. - Ignore content missing only due to cropping

[6] [6]

Difference listing (what counts as a difference) - List ONLY meaningful differences at the level of objects or semantic entities. - Do NOT list differences caused by: • minor blur or softness, • small texture or color shifts, • pixel-level noise, • slight position or alignment offsets. - A difference should be listed ONLY if it: • adds or removes a comple...

[7] [7]

Target rule - Identify the intended edit target based ONLY on: (a) the visual instruction marks, and (b) the text prompt

[8] [8]

Classification rule - IN TARGET: • any change within the intended target, • OR any imperfect attempt to edit the target (including misplacement, offset, scale error, or incomplete coverage). Table 8. Evaluation Prompt for Contextual Preservation 2/2 Contextual Preserva...

[9] [9]

Contextual Preservation

Scoring - Score = 1 if NO OUT OF TARGET differences exist. - Score = 0 if ANY OUT OF TARGET differences exist. Output format: First provide a brief analysis with these sections: - ## Differences - ## Target - ## Classification - ## Decision Then output the final JSON as the last part of your response: {“Contextual Preservation”:{“reason”: “string”, “score”: 0} }
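
The output contract in this prompt (free-form analysis first, JSON last, binary score) can be checked mechanically before a score is counted. The following is a minimal Python sketch under stated assumptions, not the authors' evaluation code: the function names `extract_final_json` and `contextual_preservation_score` are hypothetical, and it assumes the judge may emit typographic quotes or wrap the JSON in a code fence.

```python
import json


def extract_final_json(response: str) -> dict:
    """Return the JSON object that forms the last part of a judge response."""
    # Assumption: normalize typographic quotes and drop a trailing code fence.
    text = response.replace("\u201c", '"').replace("\u201d", '"')
    text = text.strip().rstrip("`").strip()
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            obj, end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            obj, end = None, -1
        # Accept only an object that runs to the very end of the response.
        if isinstance(obj, dict) and end == len(text):
            return obj
        start = text.find("{", start + 1)
    raise ValueError("response does not end with a JSON object")


def contextual_preservation_score(response: str) -> int:
    """Read the binary Contextual Preservation score from a judge response."""
    entry = extract_final_json(response)["Contextual Preservation"]
    score = entry["score"]
    # Hypothetical validation step: reject anything other than 0 or 1.
    if isinstance(score, bool) or score not in (0, 1):
        raise ValueError(f"score must be 0 or 1, got {score!r}")
    return int(score)
```

Requiring the object to end the response mirrors the "last part of your response" rule, so braces that appear inside the analysis sections are skipped rather than mistaken for the verdict. A reason string that itself contains typographic quotes would break this simple normalization and would need separate handling.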

[10] [10]

Scoring - Score = 1 if NO OUT OF TARGET differences exist. - Score = 0 if ANY OUT OF TARGET differences exist. Table 9. Evaluation Prompt for Visual Coherence 1/2 Visual Coherence 1/2 You are given THREE images and ONE text prompt. The first image: - This is the original image. The second image: - Visu...

[11] [11]

- Score 0 if the edited region introduces a different artistic or rendering domain

Style Consistency Did the edited region adopt the same artistic or rendering domain as the Input Image (e.g., line-art, watercolor, oil painting, 3D render, photographic style, pixel art, animation)? - Score 1 if the edited region clearly belongs to the same visual domain as the source image. - Score 0 if the edited region introduces a different artistic ...

[12] [12]

- Score 1 if the edited region integrates seamlessly with its surroundings

Visual Seamlessness Is the edited region visually continuous with its surrounding area, without obvious signs of compositing? Focus on whether there are clear visual discontinuities such as: - unnatural seams or hard boundaries, - abrupt changes in texture, color, or resolution, - visible cut-and-paste artifacts. - Score 1 if the edited region integrates ...

[13] [13]

“score”: an integer value of 0 or 1. - “reason”

Artifact-Free Generation Does the Output Image avoid obvious, domain-independent generative artifacts? Table 10. Evaluation Prompt for Visual Coherence 2/2 Visual Coherence 2/2 Consider artifacts such as: - unintended blurring or pixelation, - geometric distortion or deformation, -...

[14] [14]

IMAGE 1 (Golden Label / Ground Truth): Depicts the correct initial state, the correct future trajectory (green dashed line), and the correct final target (red bounding box)

[15] [15]

IMAGE 2 (Model Generation): Depicts a predicted scene, trajectory, and target. YOUR TASK: Evaluate the Generated Image against the Golden Label based on three independent metrics. You must analyze the images step-by-step and output a final JSON score

[16] [16]

Context Preservation (CP) Goal: Verify that the static environment remains unchanged. Check: - Are all billiard balls (balls with numbers and the white ball) present? - Are the ball numbers consistent? - Is the spatial layout (positions of non-moving balls) consistent with the ground truth? - Is the black arrow on the white ball preserved? Ignore: Slight ...

[17] [17]

Path Correctness (PC) Goal: Verify the topology and direction of the green dashed trajectory. Check: - Does the predicted path move in the same cardinal direction? - Does the trajectory bounce off the same specific walls or cushions in the same order? (e.g., if Truth hits Top-Wall then Left-Wall, Prediction must do the same). - Is the path free of halluci...
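
The bounce-order check in Path Correctness amounts to an ordered sequence comparison. As an illustration only (in the prompt this judgment is made by the LMM from the two images, not by code), the sketch below assumes the cushion contacts of each trajectory have already been transcribed into lists. The function `path_topology_matches` and the cushion labels are hypothetical, and the sketch covers only the bounce-order condition, not the direction or hallucination checks.

```python
def path_topology_matches(golden_bounces, predicted_bounces):
    """Return 1 if both paths hit the same cushions in the same order, else 0."""

    def normalize(sequence):
        # Assumption: labels are free-text cushion names such as "top" or "left".
        return [label.strip().lower() for label in sequence]

    return int(normalize(golden_bounces) == normalize(predicted_bounces))
```

Because Python list equality is order-sensitive, a prediction that hits the right cushions in the wrong order is scored 0, which matches the "same order" requirement in the prompt.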

[18] [18]

The Golden path bounces off the top cushion. The Generated path bounces off the bottom cushion. These are different

Collision Correctness (CC) Goal: Verify the final target identity. Check: - Does the red bounding box surround the same specific ball number as in the Golden Label? Note: This metric is strictly about the identity of the target, regardless of whether the path (PC) looks perfect. Scoring: - 0: If the red box highlights a different ball or an empty space. -...

[19] [19]

Wind-Identity Preservation (W-IP)

[20] [20]

Wind-Pose/Placement Preservation (W-PP) IMPORTANT GLOBAL RULES: - Visual instruction markings (e.g., arrows) are NOT scene content. Differences related to arrows MUST be ignored, unless they severely obstruct the subject or degrade visual quality. - Wind effects are assumed to be intentional and allowed. - Do NOT penalize changes directly caused by wind. ...

[21] [21]

First provide your reasoning under: - ## Identity Preservation Analysis - ## Pose / Placement Analysis

[22] [23]

Wind-Identity Preservation

The JSON must be the LAST part of your response. ———————————————————— FINAL JSON FORMAT (EXACT): {“Wind-Identity Preservation”:{ “reason”: “string”, “score”: 0 }, “Wind-Other Preservation”:{ “reason”: “string”, “score”: 0 } } Table 15. Evaluation Prompt for Wind Direction Consisten...

[23] [24]

First provide your reasoning under: - ## Causal Wind Evidence - ## Direction Description - ## Direction Comparison

[24] [26]

Wind Direction Consistency

The JSON must be the LAST part of your response. ———————————————————— FINAL JSON FORMAT (EXACT): {“Wind Direction Consistency”:{ “reason”: “string (one-sentence summary)”, “score”: 0.0 } } Table 17. Evaluation Prompt for Contextual Preservation in Light Control 1/2 Contextual Prese...

[25] [27]

List all observable differences between IMAGE 1 and IMAGE 2

[26] [28]

For each difference, determine whether it is: - a lighting-related manifestation (allowed), OR - an unrelated content modification (not allowed). ———————————————————— PART 3: Contextual Preservation decision Scoring (binary): - Score = 1 ONLY if ALL observed differences fall under allowed lighting-related changes. - Score = 0 If ANY difference corresponds...

[27] [29]

First provide your reasoning under: - ## Difference Analysis - ## CP Decision

[28] [31]

Contextual Preservation

The JSON must be the LAST part of your response. ———————————————————— FINAL JSON FORMAT (EXACT): {“Contextual Preservation”:{ “reason”: “string (one-sentence summary)”, “score”: 0 } } Table 19. Evaluation Prompt for Lighting Direction Consistency 1/2 Lighting Direction Consistency ...

[29] [32]

Direction Matching Consistency (DMC)

[30] [33]

Physical Lighting Consistency (PLC) ———————————————————— PART 1: Direction description (no comparison) (1) Arrow direction in IMAGE 1: Describe the arrow direction as a continuous spatial direction. Use precise language (e.g., “from upper-right toward lower-left”, “slightly downward from right to left”). Do NOT reduce the direction to simple categories li...

[31] [34]

First provide your reasoning under: - ## Direction Description - ## Direction Comparison - ## Physical Consistency

[32] [36]

Direction Matching Consistency

The JSON must be the LAST part of your response. ———————————————————— FINAL JSON FORMAT (EXACT): {“Direction Matching Consistency”:{ “reason”: “string”, “score”: 0.0 }, “Physical Lighting Consistency”:{ “reason”: “string”, “score”: 0 } } Table 21. Evaluation Prompt for BII, CIC, an...

[33] [37]

Body Instance Integrity (BII)

[34] [38]

Character Identity Consistency (CIC)
