arxiv: 2602.01851 · v2 · pith:OXFILZ3Xnew · submitted 2026-02-02 · 💻 cs.CV

How Well Do Models Follow Visual Instructions? VIBE: A Systematic Benchmark for Visual Instruction-Driven Image Editing

Pith reviewed 2026-05-22 12:06 UTC · model grok-4.3

classification 💻 cs.CV
keywords visual instructions · image editing · benchmark · generative models · multimodal evaluation · instruction following · sketch-based editing · LMM judge

The pith

Proprietary image editing models follow visual instructions better than open-source ones but degrade sharply on complex tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces VIBE, a benchmark that tests image editing models using visual instructions such as sketches rather than text prompts alone. It structures evaluation around three levels of increasing complexity that move from basic spatial pointing to shape changes and then to cause-and-effect reasoning. Seventeen models are assessed with a scalable judge based on large multimodal models and task-specific metrics. The results indicate that closed-source systems show initial ability to interpret these instructions and lead overall, yet every model loses capability as the visual demands rise. Readers would care because visual instructions align with natural human communication, so stronger performance here could make generative editing tools more practical for everyday creative work.

Core claim

The paper establishes VIBE as a systematic benchmark for visual instruction-driven image editing featuring a three-level interaction hierarchy progressing from deictic grounding through morphological manipulation to causal reasoning, together with an LMM-as-a-judge framework using task-specific metrics. Comprehensive evaluation of seventeen representative open-source and proprietary models shows that proprietary systems exhibit early-stage visual instruction-following capabilities and consistently outperform open-source models, while performance degrades markedly with increasing task difficulty even for the strongest systems.

What carries the argument

The VIBE benchmark's three-level interaction hierarchy that captures deictic grounding, morphological manipulation, and causal reasoning, paired with an LMM-as-a-judge evaluation framework that applies task-specific metrics.

If this is right

  • Image editing systems will need targeted improvements in visual reasoning to maintain performance on harder instructions.
  • Proprietary models currently hold an edge in visual instruction following that may narrow with further open-source development.
  • Benchmarks for generative image models should incorporate visual instructions to better reflect real user intent.
  • Scalable LMM-based judging can support fine-grained assessment of spatial and causal editing tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Progress on visual instruction following may transfer to other multimodal tasks such as diagram-based design or interactive scene generation.
  • Curating larger and more diverse sets of real-user visual instructions could accelerate training of open-source models to close the observed gap.
  • Integrating explicit spatial grounding modules into editing pipelines might reduce the performance drop on higher-level causal tasks.

Load-bearing premise

The LMM-as-a-judge evaluation with task-specific metrics supplies reliable, unbiased, and fine-grained scores for how well models follow the visual instructions.

What would settle it

Human evaluators scoring the same model outputs on the VIBE test cases and showing low agreement with the LMM judge scores, or any single model maintaining high accuracy across all three difficulty levels without measurable decline.
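The agreement check described above can be sketched as a Pearson correlation between human scores and LMM judge scores on the same outputs. This is a minimal illustration, not the paper's procedure: the function name `pearson` and the score lists are hypothetical placeholders, and no values here come from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and of squared deviations from each mean.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-sample scores (placeholders, not results from the paper).
human_scores = [0.9, 0.7, 0.4, 0.2, 0.8]
judge_scores = [0.85, 0.75, 0.5, 0.1, 0.7]

r = pearson(human_scores, judge_scores)
print(f"Pearson r between human and judge scores: {r:.3f}")
```

A value of r near 1 on a representative sample would support the judge; a low value would undercut the comparisons that depend on it.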

Figures

Figures reproduced from arXiv: 2602.01851 by Anna Korhonen, Chengzu Li, Chen Liang, Haochen Tian, Haodong Li, Huanyu Zhang, Liang Wang, Ruichuan An, Tieniu Tan, Xuehai Bai, Yifan Zhang, Zhang Zhang.

Figure 1
Motivation and scope of the VIBE benchmark. Traditional image editing is largely text-guided, where conveying spatial intent relies on verbose descriptions and incurs high cognitive load. In contrast, visual instructions enable precise and explicit grounding, providing a more human-aligned interaction paradigm. VIBE is designed to fill the evaluation gap by systematically benchmarking this visual intruc…
Figure 2
Composition of VIBE. VIBE comprises 1,034 samples across 10 tasks, organized into a three-level hierarchy that reflects increasing interaction and reasoning complexity, from deictic grounding and morphological manipulation to causal reasoning.
Figure 3
Overview of VIBE. VIBE organizes visual instruction-guided image editing into a three-level interaction hierarchy with increasing task complexity. The Deictic Level treats visual instructions as selectors that specify localized regions or objects for basic spatial operations. The Morphological Level interprets visual instructions as blueprints that define abstract structural constraints. The Causal Level v…
Figure 4
Performance across image styles on the Deictic Level. Left: Average Deictic Level scores across real-world, animation, and sketch images for four proprietary models. Right: Metric-level heatmaps for Seedream 4.5 and GPT-Image-1, illustrating style-dependent variations in Instruction Adherence, Contextual Preservation, and Visual Coherence.
Figure 5
Figure 6
Figure 7
Pearson correlation between human expert scores and LMM-based evaluation scores for Nano Banana Pro and GPT-Image-1, demonstrating a strong alignment between human judgments and the LMM-as-a-Judge evaluator.
Figure 8
Two representative cases illustrating how textual and visual instructions interact. The first case shows that visual instructions can resolve target ambiguity that detailed text alone fails to address. The second case demonstrates that complex semantic constraints require the joint use of detailed textual and visual instructions.
Figure 9
Examples of editing results from Seedream 4.5 across different image styles.
Figure 10
Qualitative incorrect examples on the Deictic and Morphological Levels.
Figure 11
Qualitative incorrect examples on the Causal Level.
Figure 12
Qualitative examples with visually embedded instructions. All examples use the same minimal textual prompt, “Edit this image following the instructions annotated on this picture.” Task specifications are conveyed through text and symbols embedded directly in the input image. Nano Banana Pro correctly executes single-task, multi-task, and causal editing operations based on these visually embedded instructi…
Figure 13
Screenshot of the developed data annotation system used in Section 4.3.
Original abstract

Recent generative models have achieved remarkable progress in image editing. However, existing systems and benchmarks remain largely text-guided. In contrast, human communication is inherently multimodal, where visual instructions such as sketches efficiently convey spatial and structural intent. To address this gap, we introduce VIBE, the Visual Instruction Benchmark for Image Editing with a three-level interaction hierarchy that captures deictic grounding, morphological manipulation, and causal reasoning. Across these levels, we curate high-quality and diverse test cases that reflect progressively increasing complexity in visual instruction following. We further propose a robust LMM-as-a-judge evaluation framework with task-specific metrics to enable scalable and fine-grained assessment. Through a comprehensive evaluation of 17 representative open-source and proprietary image editing models, we find that proprietary models exhibit early-stage visual instruction-following capabilities and consistently outperform open-source models. However, performance degrades markedly with increasing task difficulty even for the strongest systems, highlighting promising directions for future research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces VIBE, a benchmark for visual instruction-driven image editing organized around a three-level hierarchy (deictic grounding, morphological manipulation, and causal reasoning). It curates progressively complex test cases and proposes an LMM-as-a-judge framework equipped with task-specific metrics. Evaluation of 17 open-source and proprietary image-editing models leads to the claim that proprietary systems exhibit early-stage visual instruction-following ability and outperform open-source counterparts, yet all models degrade sharply as task difficulty increases.

Significance. If the evaluation framework holds, the benchmark would usefully document the current gap between text-guided and visually instructed editing and the persistent difficulty of complex visual instructions, supplying a concrete reference point for future multimodal generative work.

major comments (2)
  1. [Evaluation Framework] Abstract and Evaluation Framework section: the assertion of a 'robust' LMM-as-a-judge framework with task-specific metrics is not accompanied by reported inter-judge agreement, human correlation coefficients, or bias-control experiments. Because the central performance comparisons (proprietary vs. open-source, degradation across levels) rest entirely on the judge's outputs, this omission is load-bearing for the claims.
  2. [Results] Results section: the statement that 'performance degrades markedly with increasing task difficulty' is presented without per-level quantitative breakdowns or tables that would allow readers to verify the magnitude of the drop for the strongest models on causal-reasoning cases.
minor comments (1)
  1. [Abstract] The abstract states that 'high-quality and diverse test cases' were curated but does not report the total count or the distribution across the three hierarchy levels.
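The per-level breakdown requested in major comment 2 could be tabulated along the following lines. This is a sketch under stated assumptions: the record layout `(model, level, score)`, the level names, the model name `model_a`, and all scores are hypothetical placeholders rather than data or formats from the paper.

```python
from collections import defaultdict

# Level names follow the three-level hierarchy; the exact strings are assumed.
LEVELS = ("deictic", "morphological", "causal")


def per_level_means(records):
    """Average score per (model, level) from (model, level, score) records."""
    sums = defaultdict(lambda: [0.0, 0])  # (model, level) -> [total, count]
    for model, level, score in records:
        if level not in LEVELS:
            raise ValueError(f"unknown level: {level}")
        acc = sums[(model, level)]
        acc[0] += score
        acc[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


def drop_across_levels(means, model):
    """Difference between a model's deictic and causal mean scores."""
    return means[(model, "deictic")] - means[(model, "causal")]


# Hypothetical records (placeholders, not results from the paper).
records = [
    ("model_a", "deictic", 1.0),
    ("model_a", "deictic", 0.5),
    ("model_a", "morphological", 0.5),
    ("model_a", "causal", 0.25),
    ("model_a", "causal", 0.25),
]

means = per_level_means(records)
drop = drop_across_levels(means, "model_a")
for level in LEVELS:
    print(f"model_a {level}: {means[('model_a', level)]:.2f}")
print(f"deictic-to-causal drop: {drop:.2f}")
```

Reporting the drop per model, rather than only an overall average, is what would let a reader check the size of the degradation for the strongest systems.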

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, indicating whether revisions have been made.

  1. Referee: [Evaluation Framework] Abstract and Evaluation Framework section: the assertion of a 'robust' LMM-as-a-judge framework with task-specific metrics is not accompanied by reported inter-judge agreement, human correlation coefficients, or bias-control experiments. Because the central performance comparisons (proprietary vs. open-source, degradation across levels) rest entirely on the judge's outputs, this omission is load-bearing for the claims.

    Authors: We agree that additional quantitative validation of the LMM-as-a-judge would strengthen the evaluation framework. While the task-specific metrics were chosen to align with the hierarchical structure of the benchmark, we acknowledge the value of reporting inter-judge agreement, human correlation coefficients, and bias-control results. In the revised manuscript, we have added these analyses in the Evaluation Framework section, including agreement statistics across multiple LMM judges and correlation with human raters on a subset of cases. revision: yes

  2. Referee: [Results] Results section: the statement that 'performance degrades markedly with increasing task difficulty' is presented without per-level quantitative breakdowns or tables that would allow readers to verify the magnitude of the drop for the strongest models on causal-reasoning cases.

    Authors: We concur that explicit per-level breakdowns would improve verifiability of the degradation claim. The original manuscript summarized trends across levels but did not include a dedicated table with exact scores. We have added a new table in the Results section that reports performance for each of the three levels separately, with particular emphasis on the strongest proprietary models on the causal-reasoning subset. revision: yes
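The agreement statistics across multiple LMM judges mentioned in the first response could, for example, be computed as Cohen's kappa. This is a minimal sketch that assumes two judges each assign a binary 0/1 score to the same samples; the choice of kappa, the function name `cohens_kappa`, and the score lists are illustrative assumptions, not the authors' reported method or data.

```python
def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty, equal-length rating lists")
    n = len(ratings_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: from each rater's own label frequencies.
    labels = set(ratings_a) | set(ratings_b)
    expected = sum(
        (ratings_a.count(label) / n) * (ratings_b.count(label) / n)
        for label in labels
    )
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use one label")
    return (observed - expected) / (1.0 - expected)


# Hypothetical binary scores from two judges (placeholders, not paper data).
judge_a = [1, 1, 0, 1, 0, 0, 1, 1]
judge_b = [1, 0, 0, 1, 0, 1, 1, 1]

kappa = cohens_kappa(judge_a, judge_b)
print(f"Cohen's kappa between the two judges: {kappa:.3f}")
```

Kappa corrects raw agreement for agreement expected by chance, which matters when most samples receive the same score.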

Circularity Check

0 steps flagged

No circularity: benchmark construction and model evaluations are independent of inputs

The paper introduces VIBE as a new benchmark with a three-level hierarchy and curated test cases, then applies an LMM-as-a-judge framework with task-specific metrics to evaluate 17 models. The performance claims (proprietary models outperforming open-source ones, with degradation on harder tasks) follow directly from these external evaluations rather than any self-definitional loop, fitted parameter renamed as prediction, or load-bearing self-citation. No equations, ansatzes, or uniqueness theorems are invoked that reduce the results to the paper's own inputs by construction. The framework is presented as a proposed tool for assessment, not derived from prior results by the same authors in a way that creates tautology. This is a standard benchmark paper whose claims rest on observable model outputs against independently curated cases.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the domain assumption that visual instructions can be meaningfully stratified into deictic grounding, morphological manipulation, and causal reasoning, plus the choice of LMM judge as a scalable proxy for human evaluation.

axioms (1)
  • domain assumption: Visual instructions can be effectively categorized into a three-level hierarchy of deictic grounding, morphological manipulation, and causal reasoning.
    This hierarchy structures the benchmark and test case curation as described in the abstract.

pith-pipeline@v0.9.0 · 5733 in / 1171 out tokens · 51866 ms · 2026-05-22T12:06:08.619806+00:00 · methodology



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Edit-Compass & EditReward-Compass: A Unified Benchmark for Image Editing and Reward Modeling

    cs.CV 2026-05 unverdicted novelty 7.0

    Edit-Compass and EditReward-Compass are new unified benchmarks for fine-grained image editing evaluation and realistic reward modeling in reinforcement learning optimization.
