How Well Do Models Follow Visual Instructions? VIBE: A Systematic Benchmark for Visual Instruction-Driven Image Editing
Pith reviewed 2026-05-22 12:06 UTC · model grok-4.3
The pith
Proprietary image editing models follow visual instructions better than open-source ones but degrade sharply on complex tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes VIBE as a systematic benchmark for visual instruction-driven image editing featuring a three-level interaction hierarchy progressing from deictic grounding through morphological manipulation to causal reasoning, together with an LMM-as-a-judge framework using task-specific metrics. Comprehensive evaluation of seventeen representative open-source and proprietary models shows that proprietary systems exhibit early-stage visual instruction-following capabilities and consistently outperform open-source models, while performance degrades markedly with increasing task difficulty even for the strongest systems.
What carries the argument
The VIBE benchmark's three-level interaction hierarchy that captures deictic grounding, morphological manipulation, and causal reasoning, paired with an LMM-as-a-judge evaluation framework that applies task-specific metrics.
If this is right
- Image editing systems will need targeted improvements in visual reasoning to maintain performance on harder instructions.
- Proprietary models currently hold an edge in visual instruction following that may narrow with further open-source development.
- Benchmarks for generative image models should incorporate visual instructions to better reflect real user intent.
- Scalable LMM-based judging can support fine-grained assessment of spatial and causal editing tasks.
Where Pith is reading between the lines
- Progress on visual instruction following may transfer to other multimodal tasks such as diagram-based design or interactive scene generation.
- Curating larger and more diverse sets of real-user visual instructions could accelerate training of open-source models to close the observed gap.
- Integrating explicit spatial grounding modules into editing pipelines might reduce the performance drop on higher-level causal tasks.
Load-bearing premise
The LMM-as-a-judge evaluation with task-specific metrics supplies reliable, unbiased, and fine-grained scores for how well models follow the visual instructions.
What would settle it
Human evaluators scoring the same model outputs on the VIBE test cases and showing low agreement with the LMM judge scores, or any single model maintaining high accuracy across all three difficulty levels without measurable decline.
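The first test above, agreement between human evaluators and the LMM judge, can be sketched in a few lines. This is a minimal illustration, not the paper's procedure: it assumes each test case receives one categorical score (for example 0 or 1) from a human rater and one from the judge, and it uses Cohen's kappa as the agreement statistic, which the paper does not name. The function name `cohen_kappa` and the sample scores are hypothetical.

```python
from collections import Counter


def cohen_kappa(human, judge):
    """Chance-corrected agreement between two equal-length label sequences."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty, equal-length score lists")
    n = len(human)

    # Fraction of test cases where the two raters gave the same score.
    observed = sum(h == j for h, j in zip(human, judge)) / n

    # Agreement expected by chance, from each rater's label frequencies.
    human_freq = Counter(human)
    judge_freq = Counter(judge)
    expected = sum(human_freq[k] * judge_freq[k] for k in human_freq) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is
        # undefined (0/0), so report full agreement by convention.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical scores for eight test cases (illustrative only, not paper data).
human_scores = [1, 1, 0, 0, 1, 0, 1, 0]
judge_scores = [1, 1, 0, 0, 1, 0, 0, 1]
kappa = cohen_kappa(human_scores, judge_scores)
```

A kappa near 0 would indicate agreement no better than chance, which is the outcome that would undermine the judge-based comparisons; values close to 1 would support them.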
Original abstract
Recent generative models have achieved remarkable progress in image editing. However, existing systems and benchmarks remain largely text-guided. In contrast, human communication is inherently multimodal, where visual instructions such as sketches efficiently convey spatial and structural intent. To address this gap, we introduce VIBE, the Visual Instruction Benchmark for Image Editing with a three-level interaction hierarchy that captures deictic grounding, morphological manipulation, and causal reasoning. Across these levels, we curate high-quality and diverse test cases that reflect progressively increasing complexity in visual instruction following. We further propose a robust LMM-as-a-judge evaluation framework with task-specific metrics to enable scalable and fine-grained assessment. Through a comprehensive evaluation of 17 representative open-source and proprietary image editing models, we find that proprietary models exhibit early-stage visual instruction-following capabilities and consistently outperform open-source models. However, performance degrades markedly with increasing task difficulty even for the strongest systems, highlighting promising directions for future research.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces VIBE, a benchmark for visual instruction-driven image editing organized around a three-level hierarchy (deictic grounding, morphological manipulation, and causal reasoning). It curates progressively complex test cases and proposes an LMM-as-a-judge framework equipped with task-specific metrics. Evaluation of 17 open-source and proprietary image-editing models leads to the claim that proprietary systems exhibit early-stage visual instruction-following ability and outperform open-source counterparts, yet all models degrade sharply as task difficulty increases.
Significance. If the evaluation framework holds, the benchmark would usefully document the current gap between text-guided and visually instructed editing and the persistent difficulty of complex visual instructions, supplying a concrete reference point for future multimodal generative work.
major comments (2)
- [Evaluation Framework] Abstract and Evaluation Framework section: the assertion of a 'robust' LMM-as-a-judge framework with task-specific metrics is not accompanied by reported inter-judge agreement, human correlation coefficients, or bias-control experiments. Because the central performance comparisons (proprietary vs. open-source, degradation across levels) rest entirely on the judge's outputs, this omission is load-bearing for the claims.
- [Results] Results section: the statement that 'performance degrades markedly with increasing task difficulty' is presented without per-level quantitative breakdowns or tables that would allow readers to verify the magnitude of the drop for the strongest models on causal-reasoning cases.
minor comments (1)
- [Abstract] The abstract states that 'high-quality and diverse test cases' were curated but does not report the total count or the distribution across the three hierarchy levels.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, indicating whether revisions have been made.
-
Referee: [Evaluation Framework] Abstract and Evaluation Framework section: the assertion of a 'robust' LMM-as-a-judge framework with task-specific metrics is not accompanied by reported inter-judge agreement, human correlation coefficients, or bias-control experiments. Because the central performance comparisons (proprietary vs. open-source, degradation across levels) rest entirely on the judge's outputs, this omission is load-bearing for the claims.
Authors: We agree that additional quantitative validation of the LMM-as-a-judge would strengthen the evaluation framework. While the task-specific metrics were chosen to align with the hierarchical structure of the benchmark, we acknowledge the value of reporting inter-judge agreement, human correlation coefficients, and bias-control results. In the revised manuscript, we have added these analyses in the Evaluation Framework section, including agreement statistics across multiple LMM judges and correlation with human raters on a subset of cases. revision: yes
-
Referee: [Results] Results section: the statement that 'performance degrades markedly with increasing task difficulty' is presented without per-level quantitative breakdowns or tables that would allow readers to verify the magnitude of the drop for the strongest models on causal-reasoning cases.
Authors: We concur that explicit per-level breakdowns would improve verifiability of the degradation claim. The original manuscript summarized trends across levels but did not include a dedicated table with exact scores. We have added a new table in the Results section that reports performance for each of the three levels separately, with particular emphasis on the strongest proprietary models on the causal-reasoning subset. revision: yes
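The per-level breakdown discussed in this exchange amounts to a simple aggregation. The sketch below shows one way to compute it, under assumptions: each judged output is a `(model, level, score)` record, the level labels are shorthand for the three hierarchy levels, and the drop is taken as the difference between mean scores on the easiest and hardest levels. The function names and the sample records are hypothetical and do not come from the paper.

```python
from collections import defaultdict

# Shorthand labels for the three hierarchy levels (assumed naming).
LEVELS = ("deictic", "morphological", "causal")


def per_level_means(records):
    """Mean score per model and level from (model, level, score) records."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(lambda: defaultdict(int))
    for model, level, score in records:
        if level not in LEVELS:
            raise ValueError(f"unknown level: {level!r}")
        sums[model][level] += score
        counts[model][level] += 1
    # Levels with no records for a model are omitted from its row.
    return {
        model: {
            level: sums[model][level] / counts[model][level]
            for level in LEVELS
            if counts[model][level]
        }
        for model in sums
    }


def level_drop(means, model, easy="deictic", hard="causal"):
    """Difference between a model's mean score on the easy and hard levels."""
    return means[model][easy] - means[model][hard]


# Hypothetical records (illustrative only, not paper data).
records = [
    ("model_a", "deictic", 1),
    ("model_a", "deictic", 1),
    ("model_a", "causal", 0),
    ("model_a", "causal", 1),
]
means = per_level_means(records)
drop = level_drop(means, "model_a")
```

Reporting the rows of `means` as a table, one column per level, would let readers check the size of the decline for each model directly.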
Circularity Check
No circularity: benchmark construction and model evaluations are independent of inputs
The paper introduces VIBE as a new benchmark with a three-level hierarchy and curated test cases, then applies an LMM-as-a-judge framework with task-specific metrics to evaluate 17 models. The performance claims (proprietary models outperforming open-source ones, with degradation on harder tasks) follow directly from these external evaluations rather than any self-definitional loop, fitted parameter renamed as prediction, or load-bearing self-citation. No equations, ansatzes, or uniqueness theorems are invoked that reduce the results to the paper's own inputs by construction. The framework is presented as a proposed tool for assessment, not derived from prior results by the same authors in a way that creates tautology. This is a standard benchmark paper whose claims rest on observable model outputs against independently curated cases.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Visual instructions can be effectively categorized into a three-level hierarchy of deictic grounding, morphological manipulation, and causal reasoning.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
We introduce VIBE, the Visual Instruction Benchmark for Image Editing with a three-level interaction hierarchy that captures deictic grounding, morphological manipulation, and causal reasoning.
-
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat recovery
Relation between the paper passage and the cited Recognition theorem: unclear
We further propose a robust LMM-as-a-judge evaluation framework with task-specific metrics
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
-
Edit-Compass & EditReward-Compass: A Unified Benchmark for Image Editing and Reward Modeling
Edit-Compass and EditReward-Compass are new unified benchmarks for fine-grained image editing evaluation and realistic reward modeling in reinforcement learning optimization.