arxiv: 2605.08163 · v1 · submitted 2026-05-04 · 💻 cs.CV · cs.AI· cs.CL

Recognition: no theorem link

MULTITEXTEDIT: Benchmarking Cross-Lingual Degradation in Text-in-Image Editing

Liwei Cheng , Zirui Song , Shibo Feng , Lunjie Zhou , Yixuan Guan , Dayan Guan

Authors on Pith no claims yet

Pith reviewed 2026-05-12 01:04 UTC · model grok-4.3

classification 💻 cs.CV cs.AIcs.CL

keywords text-in-image editingcross-lingual degradationmultilingual benchmarkscript fidelitylanguage fidelity metricAI image editingmultilingual evaluationvisual content creation

0 comments

The pith

Text-in-image editing models degrade sharply on non-English scripts, especially Hebrew and Arabic.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MULTITEXTEDIT to test AI systems that edit text inside images across languages other than English. It evaluates 12 models on 3600 instances spanning 12 typologically diverse languages and finds consistent performance drops, worst for Hebrew and Arabic and focused on getting the script and text right rather than overall image structure. Standard metrics fail to catch these script-level issues like missing diacritics or wrong text direction, so the authors add a language fidelity score that matches native-speaker judgments better. This reveals a pattern where models keep global layout and background intact but distort language-specific details.

Core claim

MULTITEXTEDIT supplies a controlled set of 3600 instances across 12 languages with shared visual bases, human-edited references, and region masks. A new language fidelity metric scored by a two-stage large vision model protocol that first traces the target text and then judges it in isolation reaches 0.76 quadratic-weighted kappa with native annotators. Evaluation of 12 systems shows every model exhibits pronounced cross-lingual degradation, largest on Hebrew and Arabic and smallest on Dutch and Spanish, concentrated in text accuracy and script fidelity rather than coarse structural dimensions, together with a pervasive mismatch where outputs preserve global layout and background fidelityyet

What carries the argument

The MULTITEXTEDIT benchmark that pairs language variants on identical visual bases with masks and references to isolate the language variable, together with the language fidelity metric that traces edited text then judges it separately.

Load-bearing premise

The two-stage LVM protocol for the language fidelity metric fully isolates script-specific errors without its own systematic biases or overlooked failure modes.

What would settle it

A model that achieves language fidelity scores on Hebrew and Arabic instances comparable to its English scores, or native-speaker annotators that disagree substantially with the automated judgments on script accuracy.

Figures

Figures reproduced from arXiv: 2605.08163 by Dayan Guan, Liwei Cheng, Lunjie Zhou, Shibo Feng, Yixuan Guan, Zirui Song.

**Figure 1.** Figure 1: Overview of the MULTITEXTEDIT data construction pipeline. The benchmark is constructed by extending 300 base visual samples across 12 languages, yielding 3,600 annotated instances with paired input-output images, editing instructions, and masks. 3.1 Task Definition Given a source image I containing text in a target language and an editing instruction p, the task is to produce an edited image ˆI that applie… view at source ↗

**Figure 2.** Figure 2: Overview of the dual-track evaluation framework. The semantic track uses an LVM judge to score [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗

**Figure 3.** Figure 3: Two-stage protocol for language/script fidelity [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Controlled cross-lingual comparison under the same edit template. Only the target text varies across [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗

**Figure 5.** Figure 5: Representative language-specific failure cases. [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗

read the original abstract

Text-in-image editing has become a key capability for visual content creation, yet existing benchmarks remain overwhelmingly English-centric and often conflate visual plausibility with semantic correctness. We introduce MULTITEXTEDIT, a controlled benchmark of 3,600 instances spanning 12 typologically diverse languages, 5 visual domains, and 7 editing operations. Language variants of each instance share a common visual base and are paired with a human-edited reference and region masks, isolating the language variable for cross-lingual comparison. To capture script-level errors that coarse text-matching metrics miss, such as missing diacritics, reversed RTL order, and mixed-script renderings, we introduce a language fidelity (LSF) metric scored by a two-stage LVM protocol that first traces the edited target text and then judges it in isolation, reaching a quadratic-weighted \k{appa} of 0.76 against native-speaker annotators. Evaluating 12 open-source and proprietary systems with LSF alongside standard semantic and mask-aware pixel metrics, we find pronounced cross-lingual degradation for every model, largest on Hebrew and Arabic and smallest on Dutch and Spanish, and concentrated in text accuracy and script fidelity rather than in coarse structural dimensions. We also uncover a pervasive semantic and pixel mismatch, where outputs preserve global layout and background fidelity yet distort script-specific forms.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MULTITEXTEDIT gives a controlled benchmark that isolates language in text-in-image editing and shows clear degradation on non-Latin scripts, but the LSF metric's reliability is the main open question.

read the letter

The main point is that this paper introduces MULTITEXTEDIT, a benchmark with 3600 instances across 12 languages that holds the visual base constant to isolate language effects in text-in-image editing. It also defines a language fidelity metric using a two-stage vision model protocol to catch script-specific issues. The design choices are solid. By using shared visuals, human references, and masks for each language variant, the setup allows direct comparison of how editing performance drops when moving away from English. The evaluation of 12 systems reveals consistent degradation across all models, most severe for Hebrew and Arabic and least for Dutch and Spanish. The problems concentrate in text accuracy and script fidelity, while global layout and background stay relatively intact. This mismatch between preserved structure and broken script details is a clear finding. The LSF metric is new and targets errors like missing diacritics or reversed RTL order that pixel or semantic scores miss. It scores 0.76 quadratic-weighted kappa with native annotators, which is acceptable but not outstanding. The stress-test concern about possible biases in the tracing and judging steps for non-Latin scripts is worth checking, since the protocol isolates the text but might handle RTL or complex scripts differently. If that holds, it could influence the size of the reported gaps. This paper suits researchers focused on making generative vision models work across languages. Anyone studying multimodal fairness or building editing tools would get practical value from the dataset construction and the reported patterns. The work is empirically grounded with human validation and new evaluations. I think it should go to peer review. The controlled benchmark and the degradation results are substantial enough for referees to assess and improve.

Referee Report

2 major / 2 minor

Summary. The paper introduces MULTITEXTEDIT, a controlled benchmark of 3,600 instances across 12 typologically diverse languages, 5 visual domains, and 7 editing operations. Each instance shares a common visual base with human-edited references and region masks to isolate language effects. It proposes a language fidelity (LSF) metric via a two-stage LVM protocol (trace edited text then judge in isolation) that achieves quadratic-weighted kappa of 0.76 against native annotators. Evaluating 12 open-source and proprietary text-in-image editing systems using LSF plus standard semantic and mask-aware pixel metrics, the paper reports pronounced cross-lingual degradation for all models (largest on Hebrew and Arabic, smallest on Dutch and Spanish), concentrated in text accuracy and script fidelity rather than coarse structure, along with a pervasive semantic-pixel mismatch where global layout is preserved but script-specific forms are distorted.

Significance. If the central findings hold, this work provides a valuable, controlled multilingual benchmark that isolates language variables in text-in-image editing and highlights systematic weaknesses in current models. The controlled design with shared visual bases, human references, and region masks is a clear strength, as is the human-validated LSF metric that targets script-level errors missed by coarse metrics. These elements could serve as a foundation for future model development and evaluation in multilingual visual content creation.

major comments (2)

[Abstract and Evaluation section (LSF metric description)] The headline claim of pronounced cross-lingual degradation (largest on Hebrew/Arabic) rests primarily on LSF scores. The two-stage LVM protocol is reported to reach quadratic-weighted kappa=0.76 against native-speaker annotators, but the manuscript provides no per-language or per-script (e.g., RTL vs. LTR) breakdown of agreement, error types, or failure modes. This leaves open whether the protocol introduces systematic biases in tracing or judging diacritics, order reversals, or mixed-script renderings for non-Latin scripts, directly affecting the magnitude and ranking of the degradation results.
[Results and Analysis] The paper states that degradation is 'concentrated in text accuracy and script fidelity rather than in coarse structural dimensions' and that outputs show 'pervasive semantic and pixel mismatch.' However, without quantitative comparison of effect sizes (e.g., LSF drop vs. semantic/pixel metric drops) or statistical tests across languages, it is unclear whether the concentration claim is supported or if the mismatch is an artifact of how LSF isolates text while other metrics average over the full image.

minor comments (2)

[Methods] Clarify the exact LVM model(s) and prompting strategy used in the two-stage protocol, including any language-specific adaptations, to allow reproducibility.
[Abstract] The abstract mentions 'LVM protocol' without prior expansion; ensure the first use in the main text spells out 'language vision model' or equivalent.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to strengthen the validation of our claims and provide additional quantitative support.

read point-by-point responses

Referee: [Abstract and Evaluation section (LSF metric description)] The headline claim of pronounced cross-lingual degradation (largest on Hebrew/Arabic) rests primarily on LSF scores. The two-stage LVM protocol is reported to reach quadratic-weighted kappa=0.76 against native-speaker annotators, but the manuscript provides no per-language or per-script (e.g., RTL vs. LTR) breakdown of agreement, error types, or failure modes. This leaves open whether the protocol introduces systematic biases in tracing or judging diacritics, order reversals, or mixed-script renderings for non-Latin scripts, directly affecting the magnitude and ranking of the degradation results.

Authors: We agree that a per-language and per-script breakdown of agreement would further validate the LSF metric and address potential concerns about bias. In the revised manuscript, we have added a new table in the Evaluation section reporting quadratic-weighted kappa scores broken down by language and script direction. Agreement remains high and consistent (0.70-0.82 across languages), with no disproportionate drop for RTL scripts or non-Latin characters. We also include a brief error-type analysis showing that tracing and judging failures do not systematically affect Hebrew or Arabic more than other languages. These additions confirm that the protocol does not introduce the biases raised and support the reported degradation patterns. revision: yes
Referee: [Results and Analysis] The paper states that degradation is 'concentrated in text accuracy and script fidelity rather than in coarse structural dimensions' and that outputs show 'pervasive semantic and pixel mismatch.' However, without quantitative comparison of effect sizes (e.g., LSF drop vs. semantic/pixel metric drops) or statistical tests across languages, it is unclear whether the concentration claim is supported or if the mismatch is an artifact of how LSF isolates text while other metrics average over the full image.

Authors: We acknowledge the value of explicit quantitative comparisons to support these claims. The revised Results and Analysis section now includes effect-size calculations (Cohen's d) comparing LSF drops to semantic and mask-aware pixel metric drops across languages, along with statistical tests (paired Wilcoxon signed-rank tests with p-values). The results show substantially larger effect sizes for LSF (average d=1.15) than for semantic (d=0.38) or pixel metrics (d=0.29), with significant differences (p<0.01) confirming concentration in text accuracy and script fidelity. We also clarify that the mismatch is not an artifact: the mask-aware pixel metrics are localized to edited regions, yet still show smaller degradation than LSF, while global layout preservation is quantified via background fidelity scores. New tables and figures illustrate these comparisons. revision: yes

Circularity Check

0 steps flagged

Empirical benchmark with externally validated metric; no circular derivations

full rationale

This is a standard empirical benchmark paper introducing MULTITEXTEDIT and the LSF metric. The LSF protocol is defined as a new two-stage LVM process and validated directly against independent native-speaker annotators (kappa 0.76), with all main results obtained from fresh evaluations of 12 systems on the new dataset. No equations, fitted parameters, self-definitional steps, or load-bearing self-citations appear in the derivation chain; claims rest on external data and human judgments rather than reducing to the paper's own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the validity of the new LSF metric and the assumption that the benchmark instances isolate language effects; no free parameters are fitted and no new physical entities are postulated.

axioms (1)

standard math Standard statistical methods for inter-annotator agreement (quadratic-weighted kappa) reliably indicate metric quality
Invoked when reporting kappa of 0.76 against native speakers

invented entities (1)

Language fidelity (LSF) metric no independent evidence
purpose: To detect script-level errors such as missing diacritics, reversed RTL order, and mixed-script renderings that coarse text-matching misses
New two-stage LVM protocol introduced in the paper; independent evidence is limited to the reported human agreement

pith-pipeline@v0.9.0 · 5558 in / 1277 out tokens · 63454 ms · 2026-05-12T01:04:57.216016+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

81 extracted references · 81 canonical work pages · 8 internal anchors

[1]

2024 , url =

Tuo, Yuxiang and Xiang, Wangmeng and He, Jun-Yan and Geng, Yifeng and Xie, Xuansong , booktitle =. 2024 , url =

work page 2024
[2]

2024 , url =

Tuo, Yuxiang and Geng, Yifeng and Bo, Liefeng , journal =. 2024 , url =

work page 2024
[3]

Wang, Tong and Qin, Xianwei and Liu, Ting , journal=

work page
[4]

Lan, Rui and Bai, Yancheng and Duan, Xu and Li, Mingxing and Jin, Dongyang and Xu, Ryan and Nie, Dong and Sun, Lei and Chu, Xiangxiang , journal=

work page
[5]

Gui, Rui and Wan, Yang and Han, Haochen and Mao, Dongxing and Liu, Fangming and Li, Min and Wang, Alex Jinpeng , journal=

work page
[6]

Visual text processing: A com- prehensive review and unified evaluation.arXiv preprint arXiv:2504.21682, 2025

Visual Text Processing: A Comprehensive Review and Unified Evaluation , author=. arXiv preprint arXiv:2504.21682 , year=

work page arXiv
[7]

2023 , url =

Chen, Jingye and Huang, Yupan and Lv, Tengchao and Cui, Lei and Chen, Qifeng and Wei, Furu , booktitle =. 2023 , url =

work page 2023
[8]

2024 , doi =

Chen, Jingye and Huang, Yupan and Lv, Tengchao and Cui, Lei and Chen, Qifeng and Wei, Furu , booktitle =. 2024 , doi =

work page 2024
[9]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages =

Choose What You Need: Disentangled Representation Learning for Scene Text Recognition Removal and Editing , author =. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages =. 2024 , url =

work page 2024
[10]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages =

Recognition-Synergistic Scene Text Editing , author =. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages =. 2025 , url =

work page 2025
[11]

2025 , url =

Wang, Tong and Liu, Ting and Qu, Xiaochao and Wu, Chengjing and Liu, Luoqi and Hu, Xiaolin , booktitle =. 2025 , url =

work page 2025
[12]

Park, Sungmin and Choi, Sungkyu and Lee, Hwanhee , journal=

work page
[13]

Zeng, Weichao and Shu, Yan and Li, Zhenhang and Yang, Dongbao and Zhou, Yu , booktitle=

work page
[14]

Li, Chen and Yao, Zhe and Zhang, Jiali and Guan, Jihong and Zhang, Dongxiang and He, Conghui and Wang, Yifei and Wang, Zhihua and Wang, Yaqing and Zhao, Jiabao , journal=

work page
[15]

IEEE Transactions on Pattern Analysis and Machine Intelligence , volume=

An End-to-End Trainable Neural Network for Image-Based Sequence Recognition and Its Application to Scene Text Recognition , author=. IEEE Transactions on Pattern Analysis and Machine Intelligence , volume=

work page
[16]

What Is Wrong with Scene Text Recognition Model Comparisons?

Baek, Jeonghun and Kim, Geewook and Lee, Junyeop and Park, Sungrae and Han, Dongyoon and Yun, Sangdoo and Oh, Seong Joon and Lee, Hwalsuk , booktitle=. What Is Wrong with Scene Text Recognition Model Comparisons?

work page
[17]

arXiv preprint arXiv:2207.06966 , year=

Scene Text Recognition with Permuted Autoregressive Sequence Models , author=. arXiv preprint arXiv:2207.06966 , year=

work page arXiv
[18]

IEEE Transactions on Image Processing , volume=

Image Quality Assessment: From Error Visibility to Structural Similarity , author=. IEEE Transactions on Image Processing , volume=

work page
[19]

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pages=

The Unreasonable Effectiveness of Deep Features as a Perceptual Metric , author=. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pages=

work page
[20]

arXiv preprint arXiv:2505.11493 , year=

GIE-Bench: Towards Grounded Evaluation for Text-Guided Image Editing , author=. arXiv preprint arXiv:2505.11493 , year=

work page arXiv
[21]

2025 , eprint=

KRIS-Bench: Benchmarking Next-Level Intelligent Image Editing Models , author=. 2025 , eprint=

work page 2025
[22]

and Zhang, Hao and Gonzalez, Joseph E

Zheng, Lianmin and Chiang, Wei-Lin and Sheng, Ying and Zhuang, Siyuan and Wu, Zhanghao and Zhuang, Yonghao and Lin, Zi and Li, Zhuohan and Li, Dacheng and Xing, Eric P. and Zhang, Hao and Gonzalez, Joseph E. and Stoica, Ion , booktitle=. Judging

work page
[23]

2025 , howpublished=

work page 2025
[24]

2025 , howpublished=

Introducing. 2025 , howpublished=

work page 2025
[25]

2026 , month =

Raisinghani, Naina , title =. 2026 , month =

work page 2026
[26]

Knowledge-Based Systems , volume=

Diacritics Generation and Application in Hate Speech Detection on Vietnamese Social Networks , author=. Knowledge-Based Systems , volume=. 2021 , doi=

work page 2021
[27]

Restoring Tone-Marks in Standard Yor

Asahiah, Franklin Oladiipo and Odejobi, Odetunji Ajadi and Adagunodo, Emmanuel Rotimi , journal=. Restoring Tone-Marks in Standard Yor. 2017 , doi=

work page 2017
[28]

2026 , month=

Requirements for Chinese Text Layout , author=. 2026 , month=

work page 2026
[29]

2020 , month=

Requirements for Japanese Text Layout , author=. 2020 , month=

work page 2020
[30]

2024 , month=

Korean Script Resources , author=. 2024 , month=

work page 2024
[31]

2025 , month=

Arabic & Persian Layout Requirements , author=. 2025 , month=

work page 2025
[32]

2024 , month=

Hebrew Script Resources , author=. 2024 , month=

work page 2024
[33]

2024 , month=

Bengali Script Resources , author=. 2024 , month=

work page 2024
[34]

Step1X-Edit: A Practical Framework for General Image Editing

Step1x-edit: A practical framework for general image editing , author=. arXiv preprint arXiv:2504.17761 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[35]

2023 , eprint=

InstructPix2Pix: Learning to Follow Image Editing Instructions , author=. 2023 , eprint=

work page 2023
[36]

OmniGen2: Towards Instruction-Aligned Multimodal Generation

OmniGen2: Exploration to Advanced Multimodal Generation , author=. arXiv preprint arXiv:2506.18871 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[37]

Emerging Properties in Unified Multimodal Pretraining

Emerging properties in unified multimodal pretraining , author=. arXiv preprint arXiv:2505.14683 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[38]

FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space

FLUX. 1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space , author=. arXiv preprint arXiv:2506.15742 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[39]

Qwen-Image Technical Report

Qwen-image technical report , author=. arXiv preprint arXiv:2508.02324 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[40]

23 Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al

Qiao, Changhao and Hui, Chao and Li, Chen and Wang, Cunzheng and Song, Dejia and Zhang, Jiale and Li, Jing and Xiang, Qiang and Wang, Runqi and Sun, Shuang and Zhu, Wei and Tang, Xu and Hu, Yao and Chen, Yibo and Huang, Yuhao and Duan, Yuxuan and Chen, Zhiyi and Guo, Ziyuan and. arXiv preprint arXiv:2602.13344 , year=. doi:10.48550/arXiv.2602.13344 , url=

work page doi:10.48550/arxiv.2602.13344
[41]

2025 , month=

The New ChatGPT Images Is Here , author=. 2025 , month=

work page 2025
[42]

2025 , month =

Raisinghani, Naina , title =. 2025 , month =

work page 2025
[43]

Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

Bizgen: Advancing article-level visual text rendering for infographics generation , author=. Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

work page
[44]

arXiv preprint arXiv:2506.21276 , year=

WordCon: Word-level Typography Control in Scene Text Rendering , author=. arXiv preprint arXiv:2506.21276 , year=

work page arXiv
[45]

Zhang, Wei and others , journal=

work page
[46]

European Conference on Computer Vision , pages=

Glyph-byt5: A customized text encoder for accurate visual text rendering , author=. European Conference on Computer Vision , pages=. 2024 , organization=

work page 2024
[47]

arXiv preprint arXiv:2406.10208 , year=

Glyph-byt5-v2: A strong aesthetic baseline for accurate multilingual visual text rendering , author=. arXiv preprint arXiv:2406.10208 , year=

work page arXiv
[48]

Yang, Yukang and others , booktitle=

work page
[49]

arXiv preprint arXiv:2303.17870 , year=

Glyphdraw: Seamlessly rendering text with intricate spatial structures in text-to-image generation , author=. arXiv preprint arXiv:2303.17870 , year=

work page arXiv
[50]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

Comt: A novel benchmark for chain of multi-modal thought on large vision-language models , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

work page
[51]

UniREditBench: A unified reasoning-based image editing benchmark.arXiv preprint arXiv:2511.01295, 2025

UniREditBench: A Unified Reasoning-based Image Editing Benchmark , author=. arXiv preprint arXiv:2511.01295 , year=

work page arXiv
[52]

arXiv preprint arXiv:2503.21749 , year=

LeX-Art: Rethinking Text Generation via Scalable High-Quality Data Synthesis , author=. arXiv preprint arXiv:2503.21749 , year=

work page arXiv
[53]

2025 , eprint=

Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing (RISEBench) , author=. 2025 , eprint=

work page 2025
[54]

2025 , eprint=

TextAtlas5M: A Large-scale Dataset for Dense Text Image Generation , author=. 2025 , eprint=

work page 2025
[55]

arXiv preprint arXiv:2509.14232 , year=

GenExam: A Multidisciplinary Text-to-Image Exam , author=. arXiv preprint arXiv:2509.14232 , year=

work page internal anchor Pith review arXiv
[56]

arXiv preprint arXiv:2510.18701 , year=

UniGenBench++: A Unified Semantic Evaluation Benchmark for Text-to-Image Generation , author=. arXiv preprint arXiv:2510.18701 , year=

work page arXiv
[57]

2025 , eprint=

WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation , author=. 2025 , eprint=

work page 2025
[58]

2025 , eprint=

Complex-Edit: CoT-Like Instruction Generation for Complexity-Controllable Image Editing Benchmark , author=. 2025 , eprint=

work page 2025
[59]

2024 , eprint=

SEED-Data-Edit Technical Report: A Hybrid Dataset for Instructional Image Editing , author=. 2024 , eprint=

work page 2024
[60]

2024 , eprint=

HQ-Edit: A High-Quality Dataset for Instruction-based Image Editing , author=. 2024 , eprint=

work page 2024
[61]

Advances in Neural Information Processing Systems , volume=

Magicbrush: A manually annotated dataset for instruction-guided image editing , author=. Advances in Neural Information Processing Systems , volume=

work page
[62]

2020 , eprint=

TextCaps: a Dataset for Image Captioning with Reading Comprehension , author=. 2020 , eprint=

work page 2020
[63]

2016 , eprint=

COCO-Text: Dataset and Benchmark for Text Detection and Recognition in Natural Images , author=. 2016 , eprint=

work page 2016
[64]

5: Native Multimodal Models are World Learners , author=

Emu3. 5: Native Multimodal Models are World Learners , author=. arXiv preprint arXiv:2510.26583 , year=

work page arXiv
[65]

2025 , eprint=

ACE++: Instruction-Based Image Creation and Editing via Context-Aware Content Filling , author=. 2025 , eprint=

work page 2025
[66]

Seedream 4.0: Toward Next-generation Multimodal Image Generation

Seedream 4.0: Toward next-generation multimodal image generation , author=. arXiv preprint arXiv:2509.20427 , year=

work page internal anchor Pith review arXiv
[67]

Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

Anyedit: Mastering unified high-quality image editing for any idea , author=. Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

work page
[68]

2024 , eprint=

ACE: All-round Creator and Editor Following Instructions via Diffusion Transformer , author=. 2024 , eprint=

work page 2024
[69]

2024 , eprint=

OmniGen: Unified Image Generation , author=. 2024 , eprint=

work page 2024
[70]

2024 , eprint=

UltraEdit: Instruction-based Fine-Grained Image Editing at Scale , author=. 2024 , eprint=

work page 2024
[71]

2023 , eprint=

SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models , author=. 2023 , eprint=

work page 2023
[72]

International conference on machine learning , pages=

Sinddm: A single image denoising diffusion model , author=. International conference on machine learning , pages=. 2023 , organization=

work page 2023
[73]

2023 , eprint=

InstructEdit: Improving Automatic Masks for Diffusion-based Image Editing With User Instructions , author=. 2023 , eprint=

work page 2023
[74]

International Journal of Computer Vision , pages=

Multi-text guidance is important: Multi-modality image fusion via large generative vision-language model , author=. International Journal of Computer Vision , pages=. 2025 , publisher=

work page 2025
[75]

GPT-4o System Card

Gpt-4o system card , author=. arXiv preprint arXiv:2410.21276 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[76]

, author=

Fast approximate nearest neighbors with automatic algorithm configuration. , author=. VISAPP (1) , volume=

work page
[77]

International journal of computer vision , volume=

Distinctive image features from scale-invariant keypoints , author=. International journal of computer vision , volume=. 2004 , publisher=

work page 2004
[78]

arXiv preprint arXiv:2507.22058 (2025)

X-omni: Reinforcement learning makes discrete autoregressive image generative models great again , author=. arXiv preprint arXiv:2507.22058 , year=

work page arXiv
[79]

arXiv preprint , year=

Generation Meets Verification: Accelerating Large Language Model Inference with Smart Parallel Auto-Correct Decoding , author=. arXiv preprint , year=

work page
[80]

arXiv preprint arXiv:2406.14903 , year=

GIEBench: Towards Holistic Evaluation of Group Identity-based Empathy for Large Language Models , author=. arXiv preprint arXiv:2406.14903 , year=

work page arXiv

Showing first 80 references.