TECCI: Tricky Edits of Collected and Curated Images
Pith reviewed 2026-06-28 17:15 UTC · model grok-4.3
The pith
Current text-guided image editing models achieve at most 22 percent overall success on a new benchmark of challenging edits.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TECCI consists of a new set of images across seven categories and 7550 image-edit instruction pairs, where human evaluations show that no tested model exceeds a 22 percent overall success rate, models handle color and appearance edits more readily than reasoning or creative ones, and they struggle most with edits requiring strong spatial layout understanding and intricate visual details.
What carries the argument
The TECCI benchmark of collected and curated images paired with five edit types per image, using both auto-generated and manually written instructions to target specific weaknesses in editing methods.
If this is right
- Models require stronger spatial reasoning capabilities to handle edits on architecture and nature images effectively.
- Improvements are needed in balancing instruction adherence with minimal changes to the source image.
- Reasoning and creative edit types remain harder than color or appearance changes across current approaches.
- An automated rater can approximate human judgments on the three dimensions at 74.7 percent accuracy to enable larger-scale testing.
Where Pith is reading between the lines
- The benchmark could serve as a standard test set for measuring progress in developing editors that preserve more of the original image while following complex instructions.
- Extending similar curation principles to video or 3D editing tasks might reveal analogous gaps in those domains.
- Models that excel on this benchmark would likely demonstrate better internal representations of object relationships and scene geometry.
Load-bearing premise
The curated images and instructions accurately target and expose the specific weaknesses of existing editing methods, and the three human-rated dimensions together constitute a valid measure of editing success.
What would settle it
A model achieving substantially higher than 22 percent success rate on the same set of 7550 pairs when judged by the same three human-rated dimensions would indicate that the benchmark does not expose the claimed limitations.
Figures
read the original abstract
Despite tremendous recent progress, current text-guided image editing methods still struggle with many aspects of editing involving instruction following, minimally editing the source image, and ensuring high visual quality. These problems are especially apparent when the requested edit is challenging, such as those that involve position, motion, viewpoint, scale and creative edits. To systematically test generative image editors, we propose a novel image editing benchmark -- TECCI: Tricky Edits of Collected and Curated Images. TECCI consists of a completely new set of images we are releasing. The images in TECCI span 7 image categories. The images and these categories were curated intentionally to target weaknesses of existing methods. The edit instructions in TECCI are automatically generated by Gemini, covering 5 edit types per source image. We also curated a set of 530 images for which we created challenging manually written edit instructions. Overall, TECCI contains 7550 pairs of images and edit instructions. We conduct human evaluations of five leading image editing models on TECCI. Humans judge outputs along three dimensions: 1) instruction following, 2) minimality of the edits, and 3) visual quality. To scale-up the evaluation, we also build an auto-rater using Gemini that achieves 74.7% accuracy in matching human evaluations. Our evaluations reveal that: 1) none of the models exceed a 22% overall success rate, demonstrating the challenging nature of TECCI, 2) Nano Banana Pro is the best performing model overall, 3) models perform significantly better at instruction following compared to minimal edits and visual quality, 4) models struggle with editing architecture and nature images which require strong understanding of spatial layout and intricate visual details. 5) reasoning and creative edits are the most difficult, whereas color and appearance edits are the easiest.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces TECCI, a benchmark of 7550 image-edit instruction pairs (Gemini-generated plus 530 manually written) spanning 7 curated image categories chosen to target weaknesses in text-guided editing. Five leading models are evaluated via human ratings on instruction following, edit minimality, and visual quality; an auxiliary Gemini auto-rater is reported at 74.7% agreement with humans. The central empirical claim is that no model exceeds 22% overall success, with relative strengths on color/appearance edits and weaknesses on architecture/nature images plus reasoning/creative edits.
Significance. If the evaluation protocol is shown to be reliable, TECCI would supply a large-scale, publicly released test set that quantifies persistent gaps in current editors for spatially and semantically demanding edits. The scale (7550 pairs) and the three-dimensional human rating scheme are concrete strengths. However, the significance is limited by the absence of a human-editor baseline and missing reliability statistics, which are required to establish that the ≤22% ceiling reflects model limitations rather than instruction ambiguity or inherent criterion trade-offs.
major comments (3)
- [Abstract and human-evaluation section] The claim that ≤22% success demonstrates the benchmark's ability to expose model weaknesses (abstract and evaluation section) rests on the untested premise that the instructions are simultaneously clear, minimal, and visually realizable. No human-editor performance numbers are supplied on the same 7550 pairs, leaving open the possibility that low scores arise from under-specified prompts (e.g., “change the viewpoint dramatically”) or unavoidable trade-offs among the three rated dimensions.
- [Human evaluation and auto-rater paragraphs] The human-evaluation protocol reports an overall success rate but supplies neither inter-rater reliability statistics (Cohen’s κ or equivalent) nor the precise aggregation rule that converts the three per-dimension scores into a binary success label. Without these, the 22% figure cannot be interpreted as a stable measure of model capability.
- [Auto-rater description] The auto-rater is stated to achieve 74.7% accuracy matching human judgments, yet the manuscript gives no information on the size or selection of the validation subset, the exact matching criterion, or whether agreement varies across edit types or image categories. This detail is load-bearing for any claim that the auto-rater can reliably scale the evaluation.
minor comments (2)
- [Dataset construction] Clarify the exact distribution of the 7550 pairs across the 7 image categories and the 5 edit types; a table would help readers assess whether certain categories dominate the aggregate statistics.
- [Evaluation metrics] The term “overall success rate” is used without an explicit definition; state whether it requires all three dimensions to be rated positively or employs a weighted combination.
Simulated Author's Rebuttal
We thank the referee for the constructive comments highlighting the need for stronger validation of our evaluation protocol. We agree that additional details and a human baseline will improve the manuscript and plan to incorporate them. We respond point-by-point to the major comments below.
read point-by-point responses
-
Referee: [Abstract and human-evaluation section] The claim that ≤22% success demonstrates the benchmark's ability to expose model weaknesses (abstract and evaluation section) rests on the untested premise that the instructions are simultaneously clear, minimal, and visually realizable. No human-editor performance numbers are supplied on the same 7550 pairs, leaving open the possibility that low scores arise from under-specified prompts (e.g., “change the viewpoint dramatically”) or unavoidable trade-offs among the three rated dimensions.
Authors: We acknowledge that a human-editor baseline on the same pairs would provide important context and help confirm that low model scores reflect capability gaps rather than prompt ambiguity or rating trade-offs. Our instructions were designed to be specific and realizable, but we agree this is untested without the baseline. We will add a human-editor study on a balanced subset of 500 pairs in the revision, with outputs rated under the identical three-dimension protocol for direct comparison. revision: yes
-
Referee: [Human evaluation and auto-rater paragraphs] The human-evaluation protocol reports an overall success rate but supplies neither inter-rater reliability statistics (Cohen’s κ or equivalent) nor the precise aggregation rule that converts the three per-dimension scores into a binary success label. Without these, the 22% figure cannot be interpreted as a stable measure of model capability.
Authors: We agree these details are necessary for interpreting the 22% figure. The evaluations used three raters per output; we will report Cohen’s κ (substantial agreement) and explicitly define the aggregation rule (binary success requires scores ≥4/5 on all three dimensions) in the revised human-evaluation section. revision: yes
-
Referee: [Auto-rater description] The auto-rater is stated to achieve 74.7% accuracy matching human judgments, yet the manuscript gives no information on the size or selection of the validation subset, the exact matching criterion, or whether agreement varies across edit types or image categories. This detail is load-bearing for any claim that the auto-rater can reliably scale the evaluation.
Authors: The validation subset was 300 randomly selected pairs balanced across categories and edit types. Matching was on the binary success label, with agreement rates of 70-78% that were consistent but slightly lower on creative edits. We will expand the auto-rater paragraph with these details in the revision. revision: yes
Circularity Check
No significant circularity: purely empirical benchmark evaluation
full rationale
The paper introduces TECCI as a new benchmark consisting of curated images and edit instructions, then reports direct human evaluations (and an auto-rater) of five models on three dimensions. No derivations, equations, fitted parameters, or self-citation chains are present that reduce any claim back to its own inputs by construction. All reported results (e.g., ≤22% success rates, model rankings) are measurements against the external benchmark data, which is released and falsifiable outside the paper. This is the standard case of an empirical benchmark paper with no circularity.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Human judgments on instruction following, edit minimality, and visual quality are consistent and sufficient to measure editing success.
Reference graph
Works this paper leans on
-
[1]
Magicbrush: A manually annotated dataset for instruction-guided image editing.Advances in Neural Information Processing Systems, 36, 2024
Kai Zhang, Lingbo Mo, Wenhu Chen, Huan Sun, and Yu Su. Magicbrush: A manually annotated dataset for instruction-guided image editing.Advances in Neural Information Processing Systems, 36, 2024. 1, 2
2024
-
[2]
Learning action and reasoning-centric image editing from videos and simulation
Benno Krojer, Dheeraj Vattikonda, Luis Lara, Varun Jampani, Eva Portelance, Christopher Pal, and Siva Reddy. Learning action and reasoning-centric image editing from videos and simulation. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors,Advances in Neural Information Processing Systems, volume 37, pages 38035–...
-
[3]
URL https://proceedings.neurips.cc/paper_files/paper/2024/file/ 434d512d6d79a506fd32f8b39abb7c19-Paper-Datasets_and_Benchmarks_Track. pdf
2024
-
[4]
The promise of rl for autoregressive image editing
Saba Ahmadi, Rabiul Awal, Ankur Sikarwar, Amirhossein Kazemnejad, Ge Ya Luo, Juan A Rodriguez, Sai Rajeswar, Siva Reddy, Christopher Pal, Benno Krojer, et al. The promise of rl for autoregressive image editing. InNeurIPS, 2025
2025
-
[5]
Emu edit: Precise image editing via recognition and generation tasks
Shelly Sheynin, Adam Polyak, Uriel Singer, Yuval Kirstain, Amit Zohar, Oron Ashual, Devi Parikh, and Yaniv Taigman. Emu edit: Precise image editing via recognition and generation tasks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8871–8879, 2024
2024
-
[6]
Omniedit: Building image editing generalist models through specialist supervision
Cong Wei, Zheyang Xiong, Weiming Ren, Xeron Du, Ge Zhang, and Wenhu Chen. Omniedit: Building image editing generalist models through specialist supervision. InThe Thirteenth International Conference on Learning Representations, 2025. URLhttps://openreview.net/forum?id=Hlm0cga0sv
2025
-
[7]
Imagen editor and editbench: Advancing and evaluating text-guided image inpainting
Su Wang, Chitwan Saharia, Ceslee Montgomery, Jordi Pont-Tuset, Shai Noy, Stefano Pellegrini, Yasumasa Onoe, Sarah Laszlo, David J Fleet, Radu Soricut, et al. Imagen editor and editbench: Advancing and evaluating text-guided image inpainting. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18359–18369, 2023
2023
-
[8]
ImagenWorld: Stress-Testing Image Generation Models with Explainable Human Evaluation on Open-ended Real-World Tasks
Samin Mahdizadeh Sani, Max Ku, Nima Jamali, Matina Mahdizadeh Sani, Paria Khoshtab, Wei-Chieh Sun, Parnian Fazel, Zhi Rui Tam, Thomas Chong, Edisy Kin Wai Chan, et al. ImagenWorld: Stress-Testing Image Generation Models with Explainable Human Evaluation on Open-ended Real-World Tasks. InThe Fourteenth International Conference on Learning Representations, ...
2026
-
[9]
Lawrence Zitnick
Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO Captions: Data Collection and Evaluation Server.ArXiv, 2015. 3
2015
-
[10]
The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale
Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, Tom Duerig, and Vittorio Ferrari. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. IJCV, 2020. 3
2020
-
[11]
Visual genome: Connecting language and vision using crowdsourced dense image annotations.IJCV, 2017
Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations.IJCV, 2017. 3
2017
-
[12]
Introducing nano banana pro, 2025
Google. Introducing nano banana pro, 2025. URL https://blog.google/ innovation-and-ai/products/nano-banana-pro/. 4, 6
2025
-
[13]
Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities, 2025. URL https://arxiv. org/abs/2507.06261. 4
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[14]
Nano banana 2: Combining pro capabilities with lightning-fast speed, 2026
Google. Nano banana 2: Combining pro capabilities with lightning-fast speed, 2026. URL https: //blog.google/innovation-and-ai/technology/ai/nano-banana-2/. 6
2026
-
[15]
Grok imagine api, 2025
X.ai. Grok imagine api, 2025. URLhttps://x.ai/news/grok-imagine-api. 6
2025
-
[16]
Seedream 5.0 lite ai image generator online, 2025
ByteDance. Seedream 5.0 lite ai image generator online, 2025. URL https://www.byteplus.com/ en/product/Seedream. 6
2025
-
[17]
Gpt-image-1.5 model documentation, 2025
OpenAI. Gpt-image-1.5 model documentation, 2025. URL https://developers.openai.com/ api/docs/models/gpt-image-1.5. 6 10
2025
-
[18]
Viescore: Towards explainable metrics for conditional image synthesis evaluation, 2023
Max Ku, Dongfu Jiang, Cong Wei, Xiang Yue, and Wenhu Chen. Viescore: Towards explainable metrics for conditional image synthesis evaluation, 2023. 6
2023
-
[19]
Imagenworld: Stress-testing image generation models with explainable human evaluation on open-ended real-world tasks
Samin Mahdizadeh Sani, Max Ku, Nima Jamali, Matina Mahdizadeh Sani, Paria Khoshtab, Wei-Chieh Sun, Parnian Fazel, Zhi Rui Tam, Thomas Chong, Edisy Kin Wai Chan, Donald Wai Tong Tsang, Chiao-Wei Hsu, Ting Wai Lam, Ho Yin Sam Ng, Chiafeng Chu, Chak-Wing Mak, Keming Wu, Hiu Tung Wong, Yik Chun Ho, Chi Ruan, Zhuofeng Li, I-Sheng Fang, Shih-Ying Yeh, Ho Kei Ch...
-
[20]
URLhttps://openreview.net/forum?id=bld9g6jFh9. 6
-
[21]
Gonzalez, and Ion Stoica
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a- judge with mt-bench and chatbot arena. InThirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. URL https://openr...
2023
-
[22]
Gemini 3 flash: frontier intelligence built for speed, 2025
Google. Gemini 3 flash: frontier intelligence built for speed, 2025. URL https://blog.google/ products-and-platforms/products/gemini/gemini-3-flash/. 8
2025
-
[23]
arXiv preprint arXiv:2601.03444 (2026) 8
Weiyue Li, Minda Zhao, Weixuan Dong, Jiahui Cai, Yuze Wei, Michael Pocress, Yi Li, Wanyan Yuan, Xiaoyue Wang, Ruoyu Hou, et al. Grading scale impact on llm-as-a-judge: Human-llm alignment is highest on 0-5 grading scale.arXiv preprint arXiv:2601.03444, 2026. 19
-
[24]
overall quality
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022. 19 11 Appendix The appendix consists of the following sections: •Limitations and Future Work (Sec 5.1) •Broader...
2022
-
[25]
Delivered changes:For each requested modification, state whether it was correctly applied, partially applied, or not applied
-
[26]
Consider changes in background, lighting, color palette, object positions, facial features, textures, and any other visual element
Unintended changes:List any changes you observe in the edited image that were NOT requested by the instruction. Consider changes in background, lighting, color palette, object positions, facial features, textures, and any other visual element. 4.Overall impression:A brief summary of the edit quality. Finally, based on your analysis, provide a rating and a...
-
[27]
Each rating must be an integer between 1 and 5 (1 is not at all, 5 is completely)
-
[28]
Each justification must be a short sentence
-
[29]
Judge each criterion separately, and do not let the rating of one criterion influence the another
-
[30]
5.Be strict in your ratings.Only give a rating of 5 if the criterion is perfectly met
As image editing quality is subjective, answer honestly, according to your best judgement and as an average human visual inspector would. 5.Be strict in your ratings.Only give a rating of 5 if the criterion is perfectly met. Evaluation Criteria
-
[31]
Delivered changes
Instruction Following:Does the edited image precisely reflect all modifications described? Note: Your rating must be consistent with the “Delivered changes” checklist above. If any requested change is missing or only partially applied, the rating cannot be 5. Completely (5):Accurately and completely adheres to the modifications. Mostly (4):Reflects most m...
-
[32]
Unintended changes
Image Consistency:Beyond the instructions, does the image remain consistent with the original? Note: Your rating must be consistent with the “Unintended changes” list above. If there are significant unintended changes, the rating cannot be 5. Completely (5):Fully consistent; preserves all features not intended to be changed with high accuracy. Mostly (4):...
-
[33]
evaluation_analysis
Visual Quality:Is the integration seamless, natural, and free from artifacts? Completely (5):Blends perfectly and feels completely natural. Mostly (4):Blends well; minor alterations visible when paying close attention. Moderately (3):Somewhat consistent, but it is noticeable that the image has been edited. Somewhat (2):Edit doesn’t blend well; differences...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.