pith. sign in

arxiv: 2606.01213 · v1 · pith:3YZSN3Y5new · submitted 2026-05-31 · 💻 cs.CV · cs.AI· cs.CL

TECCI: Tricky Edits of Collected and Curated Images

Pith reviewed 2026-06-28 17:15 UTC · model grok-4.3

classification 💻 cs.CV cs.AIcs.CL
keywords image editingtext-guided editingbenchmarkgenerative modelshuman evaluationinstruction followingvisual qualityminimality
0
0 comments X

The pith

Current text-guided image editing models achieve at most 22 percent overall success on a new benchmark of challenging edits.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TECCI as a benchmark of curated images and edit instructions designed to expose limitations in text-guided image editing methods, particularly for edits involving position, motion, viewpoint, scale, and creativity. It releases 7550 image-instruction pairs spanning seven image categories, with instructions generated automatically or written manually, and evaluates five leading models through human ratings on instruction following, edit minimality, and visual quality. The evaluations establish that models perform better on following instructions than on making minimal changes or achieving high visual quality, with the lowest performance on architecture and nature images as well as on reasoning and creative edit types.

Core claim

TECCI consists of a new set of images across seven categories and 7550 image-edit instruction pairs, where human evaluations show that no tested model exceeds a 22 percent overall success rate, models handle color and appearance edits more readily than reasoning or creative ones, and they struggle most with edits requiring strong spatial layout understanding and intricate visual details.

What carries the argument

The TECCI benchmark of collected and curated images paired with five edit types per image, using both auto-generated and manually written instructions to target specific weaknesses in editing methods.

If this is right

  • Models require stronger spatial reasoning capabilities to handle edits on architecture and nature images effectively.
  • Improvements are needed in balancing instruction adherence with minimal changes to the source image.
  • Reasoning and creative edit types remain harder than color or appearance changes across current approaches.
  • An automated rater can approximate human judgments on the three dimensions at 74.7 percent accuracy to enable larger-scale testing.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The benchmark could serve as a standard test set for measuring progress in developing editors that preserve more of the original image while following complex instructions.
  • Extending similar curation principles to video or 3D editing tasks might reveal analogous gaps in those domains.
  • Models that excel on this benchmark would likely demonstrate better internal representations of object relationships and scene geometry.

Load-bearing premise

The curated images and instructions accurately target and expose the specific weaknesses of existing editing methods, and the three human-rated dimensions together constitute a valid measure of editing success.

What would settle it

A model achieving substantially higher than 22 percent success rate on the same set of 7550 pairs when judged by the same three human-rated dimensions would indicate that the benchmark does not expose the claimed limitations.

Figures

Figures reproduced from arXiv: 2606.01213 by Aishwarya Agrawal, Jason Baldridge, Roy Hirsch, Sherry Ben, Yasumasa Onoe.

Figure 1
Figure 1. Figure 1: Overview of TECCI dataset. Representative samples from the different subsets and edit categories. [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Image type distributions. TECCI-IRCS prioritizes specific challenging edits (resulting in an unbalanced [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Distribution of edit categories across TECCI-IRCS (left) and TECCI-GGIS (right). [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Co-occurrence heatmaps of edit categories and image types in (a) TECCI-GGIS and (b) TECCI-IRCS. [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Comparative analysis of model performance across image categories and edit types. Reporting [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Sample images from TECCI-IRCS (top) and TECCI-GGIS (bottom). Each row includes the source [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Distribution of human evaluation scores across the three evaluation criteria. [PITH_FULL_IMAGE:figures/full_fig_p016_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Distribusion of Overall scores for different thresholds [PITH_FULL_IMAGE:figures/full_fig_p016_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Annotation interface for the single-side evaluation task. [PITH_FULL_IMAGE:figures/full_fig_p017_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Annotation interface for the side-by-side evaluation task. Each task presents the source images, edit [PITH_FULL_IMAGE:figures/full_fig_p018_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Score distributions for Autorater (AR) and Human Evaluation (HE) before (top) and after (bottom) [PITH_FULL_IMAGE:figures/full_fig_p019_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: TECCI image editing autorater prompt. 20 [PITH_FULL_IMAGE:figures/full_fig_p020_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: TECCI-IRCS examples with mean human-based scores. [PITH_FULL_IMAGE:figures/full_fig_p021_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: TECCI-GGIS examples with mean human-based scores. [PITH_FULL_IMAGE:figures/full_fig_p022_14.png] view at source ↗
read the original abstract

Despite tremendous recent progress, current text-guided image editing methods still struggle with many aspects of editing involving instruction following, minimally editing the source image, and ensuring high visual quality. These problems are especially apparent when the requested edit is challenging, such as those that involve position, motion, viewpoint, scale and creative edits. To systematically test generative image editors, we propose a novel image editing benchmark -- TECCI: Tricky Edits of Collected and Curated Images. TECCI consists of a completely new set of images we are releasing. The images in TECCI span 7 image categories. The images and these categories were curated intentionally to target weaknesses of existing methods. The edit instructions in TECCI are automatically generated by Gemini, covering 5 edit types per source image. We also curated a set of 530 images for which we created challenging manually written edit instructions. Overall, TECCI contains 7550 pairs of images and edit instructions. We conduct human evaluations of five leading image editing models on TECCI. Humans judge outputs along three dimensions: 1) instruction following, 2) minimality of the edits, and 3) visual quality. To scale-up the evaluation, we also build an auto-rater using Gemini that achieves 74.7% accuracy in matching human evaluations. Our evaluations reveal that: 1) none of the models exceed a 22% overall success rate, demonstrating the challenging nature of TECCI, 2) Nano Banana Pro is the best performing model overall, 3) models perform significantly better at instruction following compared to minimal edits and visual quality, 4) models struggle with editing architecture and nature images which require strong understanding of spatial layout and intricate visual details. 5) reasoning and creative edits are the most difficult, whereas color and appearance edits are the easiest.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces TECCI, a benchmark of 7550 image-edit instruction pairs (Gemini-generated plus 530 manually written) spanning 7 curated image categories chosen to target weaknesses in text-guided editing. Five leading models are evaluated via human ratings on instruction following, edit minimality, and visual quality; an auxiliary Gemini auto-rater is reported at 74.7% agreement with humans. The central empirical claim is that no model exceeds 22% overall success, with relative strengths on color/appearance edits and weaknesses on architecture/nature images plus reasoning/creative edits.

Significance. If the evaluation protocol is shown to be reliable, TECCI would supply a large-scale, publicly released test set that quantifies persistent gaps in current editors for spatially and semantically demanding edits. The scale (7550 pairs) and the three-dimensional human rating scheme are concrete strengths. However, the significance is limited by the absence of a human-editor baseline and missing reliability statistics, which are required to establish that the ≤22% ceiling reflects model limitations rather than instruction ambiguity or inherent criterion trade-offs.

major comments (3)
  1. [Abstract and human-evaluation section] The claim that ≤22% success demonstrates the benchmark's ability to expose model weaknesses (abstract and evaluation section) rests on the untested premise that the instructions are simultaneously clear, minimal, and visually realizable. No human-editor performance numbers are supplied on the same 7550 pairs, leaving open the possibility that low scores arise from under-specified prompts (e.g., “change the viewpoint dramatically”) or unavoidable trade-offs among the three rated dimensions.
  2. [Human evaluation and auto-rater paragraphs] The human-evaluation protocol reports an overall success rate but supplies neither inter-rater reliability statistics (Cohen’s κ or equivalent) nor the precise aggregation rule that converts the three per-dimension scores into a binary success label. Without these, the 22% figure cannot be interpreted as a stable measure of model capability.
  3. [Auto-rater description] The auto-rater is stated to achieve 74.7% accuracy matching human judgments, yet the manuscript gives no information on the size or selection of the validation subset, the exact matching criterion, or whether agreement varies across edit types or image categories. This detail is load-bearing for any claim that the auto-rater can reliably scale the evaluation.
minor comments (2)
  1. [Dataset construction] Clarify the exact distribution of the 7550 pairs across the 7 image categories and the 5 edit types; a table would help readers assess whether certain categories dominate the aggregate statistics.
  2. [Evaluation metrics] The term “overall success rate” is used without an explicit definition; state whether it requires all three dimensions to be rated positively or employs a weighted combination.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments highlighting the need for stronger validation of our evaluation protocol. We agree that additional details and a human baseline will improve the manuscript and plan to incorporate them. We respond point-by-point to the major comments below.

read point-by-point responses
  1. Referee: [Abstract and human-evaluation section] The claim that ≤22% success demonstrates the benchmark's ability to expose model weaknesses (abstract and evaluation section) rests on the untested premise that the instructions are simultaneously clear, minimal, and visually realizable. No human-editor performance numbers are supplied on the same 7550 pairs, leaving open the possibility that low scores arise from under-specified prompts (e.g., “change the viewpoint dramatically”) or unavoidable trade-offs among the three rated dimensions.

    Authors: We acknowledge that a human-editor baseline on the same pairs would provide important context and help confirm that low model scores reflect capability gaps rather than prompt ambiguity or rating trade-offs. Our instructions were designed to be specific and realizable, but we agree this is untested without the baseline. We will add a human-editor study on a balanced subset of 500 pairs in the revision, with outputs rated under the identical three-dimension protocol for direct comparison. revision: yes

  2. Referee: [Human evaluation and auto-rater paragraphs] The human-evaluation protocol reports an overall success rate but supplies neither inter-rater reliability statistics (Cohen’s κ or equivalent) nor the precise aggregation rule that converts the three per-dimension scores into a binary success label. Without these, the 22% figure cannot be interpreted as a stable measure of model capability.

    Authors: We agree these details are necessary for interpreting the 22% figure. The evaluations used three raters per output; we will report Cohen’s κ (substantial agreement) and explicitly define the aggregation rule (binary success requires scores ≥4/5 on all three dimensions) in the revised human-evaluation section. revision: yes

  3. Referee: [Auto-rater description] The auto-rater is stated to achieve 74.7% accuracy matching human judgments, yet the manuscript gives no information on the size or selection of the validation subset, the exact matching criterion, or whether agreement varies across edit types or image categories. This detail is load-bearing for any claim that the auto-rater can reliably scale the evaluation.

    Authors: The validation subset was 300 randomly selected pairs balanced across categories and edit types. Matching was on the binary success label, with agreement rates of 70-78% that were consistent but slightly lower on creative edits. We will expand the auto-rater paragraph with these details in the revision. revision: yes

Circularity Check

0 steps flagged

No significant circularity: purely empirical benchmark evaluation

full rationale

The paper introduces TECCI as a new benchmark consisting of curated images and edit instructions, then reports direct human evaluations (and an auto-rater) of five models on three dimensions. No derivations, equations, fitted parameters, or self-citation chains are present that reduce any claim back to its own inputs by construction. All reported results (e.g., ≤22% success rates, model rankings) are measurements against the external benchmark data, which is released and falsifiable outside the paper. This is the standard case of an empirical benchmark paper with no circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

No mathematical model or fitted parameters; the work is an empirical benchmark paper. The main unstated premises are that human ratings on the three dimensions are reliable and that the chosen images and instructions genuinely probe the claimed weaknesses.

axioms (1)
  • domain assumption Human judgments on instruction following, edit minimality, and visual quality are consistent and sufficient to measure editing success.
    Central to all reported success rates and model comparisons.

pith-pipeline@v0.9.1-grok · 5879 in / 1277 out tokens · 25047 ms · 2026-06-28T17:15:29.868337+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

33 extracted references · 2 canonical work pages · 1 internal anchor

  1. [1]

    Magicbrush: A manually annotated dataset for instruction-guided image editing.Advances in Neural Information Processing Systems, 36, 2024

    Kai Zhang, Lingbo Mo, Wenhu Chen, Huan Sun, and Yu Su. Magicbrush: A manually annotated dataset for instruction-guided image editing.Advances in Neural Information Processing Systems, 36, 2024. 1, 2

  2. [2]

    Learning action and reasoning-centric image editing from videos and simulation

    Benno Krojer, Dheeraj Vattikonda, Luis Lara, Varun Jampani, Eva Portelance, Christopher Pal, and Siva Reddy. Learning action and reasoning-centric image editing from videos and simulation. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors,Advances in Neural Information Processing Systems, volume 37, pages 38035–...

  3. [3]

    URL https://proceedings.neurips.cc/paper_files/paper/2024/file/ 434d512d6d79a506fd32f8b39abb7c19-Paper-Datasets_and_Benchmarks_Track. pdf

  4. [4]

    The promise of rl for autoregressive image editing

    Saba Ahmadi, Rabiul Awal, Ankur Sikarwar, Amirhossein Kazemnejad, Ge Ya Luo, Juan A Rodriguez, Sai Rajeswar, Siva Reddy, Christopher Pal, Benno Krojer, et al. The promise of rl for autoregressive image editing. InNeurIPS, 2025

  5. [5]

    Emu edit: Precise image editing via recognition and generation tasks

    Shelly Sheynin, Adam Polyak, Uriel Singer, Yuval Kirstain, Amit Zohar, Oron Ashual, Devi Parikh, and Yaniv Taigman. Emu edit: Precise image editing via recognition and generation tasks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8871–8879, 2024

  6. [6]

    Omniedit: Building image editing generalist models through specialist supervision

    Cong Wei, Zheyang Xiong, Weiming Ren, Xeron Du, Ge Zhang, and Wenhu Chen. Omniedit: Building image editing generalist models through specialist supervision. InThe Thirteenth International Conference on Learning Representations, 2025. URLhttps://openreview.net/forum?id=Hlm0cga0sv

  7. [7]

    Imagen editor and editbench: Advancing and evaluating text-guided image inpainting

    Su Wang, Chitwan Saharia, Ceslee Montgomery, Jordi Pont-Tuset, Shai Noy, Stefano Pellegrini, Yasumasa Onoe, Sarah Laszlo, David J Fleet, Radu Soricut, et al. Imagen editor and editbench: Advancing and evaluating text-guided image inpainting. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18359–18369, 2023

  8. [8]

    ImagenWorld: Stress-Testing Image Generation Models with Explainable Human Evaluation on Open-ended Real-World Tasks

    Samin Mahdizadeh Sani, Max Ku, Nima Jamali, Matina Mahdizadeh Sani, Paria Khoshtab, Wei-Chieh Sun, Parnian Fazel, Zhi Rui Tam, Thomas Chong, Edisy Kin Wai Chan, et al. ImagenWorld: Stress-Testing Image Generation Models with Explainable Human Evaluation on Open-ended Real-World Tasks. InThe Fourteenth International Conference on Learning Representations, ...

  9. [9]

    Lawrence Zitnick

    Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO Captions: Data Collection and Evaluation Server.ArXiv, 2015. 3

  10. [10]

    The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale

    Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, Tom Duerig, and Vittorio Ferrari. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. IJCV, 2020. 3

  11. [11]

    Visual genome: Connecting language and vision using crowdsourced dense image annotations.IJCV, 2017

    Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations.IJCV, 2017. 3

  12. [12]

    Introducing nano banana pro, 2025

    Google. Introducing nano banana pro, 2025. URL https://blog.google/ innovation-and-ai/products/nano-banana-pro/. 4, 6

  13. [13]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities, 2025. URL https://arxiv. org/abs/2507.06261. 4

  14. [14]

    Nano banana 2: Combining pro capabilities with lightning-fast speed, 2026

    Google. Nano banana 2: Combining pro capabilities with lightning-fast speed, 2026. URL https: //blog.google/innovation-and-ai/technology/ai/nano-banana-2/. 6

  15. [15]

    Grok imagine api, 2025

    X.ai. Grok imagine api, 2025. URLhttps://x.ai/news/grok-imagine-api. 6

  16. [16]

    Seedream 5.0 lite ai image generator online, 2025

    ByteDance. Seedream 5.0 lite ai image generator online, 2025. URL https://www.byteplus.com/ en/product/Seedream. 6

  17. [17]

    Gpt-image-1.5 model documentation, 2025

    OpenAI. Gpt-image-1.5 model documentation, 2025. URL https://developers.openai.com/ api/docs/models/gpt-image-1.5. 6 10

  18. [18]

    Viescore: Towards explainable metrics for conditional image synthesis evaluation, 2023

    Max Ku, Dongfu Jiang, Cong Wei, Xiang Yue, and Wenhu Chen. Viescore: Towards explainable metrics for conditional image synthesis evaluation, 2023. 6

  19. [19]

    Imagenworld: Stress-testing image generation models with explainable human evaluation on open-ended real-world tasks

    Samin Mahdizadeh Sani, Max Ku, Nima Jamali, Matina Mahdizadeh Sani, Paria Khoshtab, Wei-Chieh Sun, Parnian Fazel, Zhi Rui Tam, Thomas Chong, Edisy Kin Wai Chan, Donald Wai Tong Tsang, Chiao-Wei Hsu, Ting Wai Lam, Ho Yin Sam Ng, Chiafeng Chu, Chak-Wing Mak, Keming Wu, Hiu Tung Wong, Yik Chun Ho, Chi Ruan, Zhuofeng Li, I-Sheng Fang, Shih-Ying Yeh, Ho Kei Ch...

  20. [20]

    URLhttps://openreview.net/forum?id=bld9g6jFh9. 6

  21. [21]

    Gonzalez, and Ion Stoica

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a- judge with mt-bench and chatbot arena. InThirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. URL https://openr...

  22. [22]

    Gemini 3 flash: frontier intelligence built for speed, 2025

    Google. Gemini 3 flash: frontier intelligence built for speed, 2025. URL https://blog.google/ products-and-platforms/products/gemini/gemini-3-flash/. 8

  23. [23]

    arXiv preprint arXiv:2601.03444 (2026) 8

    Weiyue Li, Minda Zhao, Weixuan Dong, Jiahui Cai, Yuze Wei, Michael Pocress, Yi Li, Wanyan Yuan, Xiaoyue Wang, Ruoyu Hou, et al. Grading scale impact on llm-as-a-judge: Human-llm alignment is highest on 0-5 grading scale.arXiv preprint arXiv:2601.03444, 2026. 19

  24. [24]

    overall quality

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022. 19 11 Appendix The appendix consists of the following sections: •Limitations and Future Work (Sec 5.1) •Broader...

  25. [25]

    Delivered changes:For each requested modification, state whether it was correctly applied, partially applied, or not applied

  26. [26]

    Consider changes in background, lighting, color palette, object positions, facial features, textures, and any other visual element

    Unintended changes:List any changes you observe in the edited image that were NOT requested by the instruction. Consider changes in background, lighting, color palette, object positions, facial features, textures, and any other visual element. 4.Overall impression:A brief summary of the edit quality. Finally, based on your analysis, provide a rating and a...

  27. [27]

    Each rating must be an integer between 1 and 5 (1 is not at all, 5 is completely)

  28. [28]

    Each justification must be a short sentence

  29. [29]

    Judge each criterion separately, and do not let the rating of one criterion influence the another

  30. [30]

    5.Be strict in your ratings.Only give a rating of 5 if the criterion is perfectly met

    As image editing quality is subjective, answer honestly, according to your best judgement and as an average human visual inspector would. 5.Be strict in your ratings.Only give a rating of 5 if the criterion is perfectly met. Evaluation Criteria

  31. [31]

    Delivered changes

    Instruction Following:Does the edited image precisely reflect all modifications described? Note: Your rating must be consistent with the “Delivered changes” checklist above. If any requested change is missing or only partially applied, the rating cannot be 5. Completely (5):Accurately and completely adheres to the modifications. Mostly (4):Reflects most m...

  32. [32]

    Unintended changes

    Image Consistency:Beyond the instructions, does the image remain consistent with the original? Note: Your rating must be consistent with the “Unintended changes” list above. If there are significant unintended changes, the rating cannot be 5. Completely (5):Fully consistent; preserves all features not intended to be changed with high accuracy. Mostly (4):...

  33. [33]

    evaluation_analysis

    Visual Quality:Is the integration seamless, natural, and free from artifacts? Completely (5):Blends perfectly and feels completely natural. Mostly (4):Blends well; minor alterations visible when paying close attention. Moderately (3):Somewhat consistent, but it is noticeable that the image has been edited. Somewhat (2):Edit doesn’t blend well; differences...