VDE Bench: Evaluating The Capability of Image Editing Models to Modify Visual Documents

Chenxi Bao; Haijin Liang; Haopeng Jin; Hongzhu Yi; Jiahuan Chen; Jin Ma; Jinwen Luo; Jungang Xu; Ruilin Gao; Ruiwen Tao

arxiv: 2602.00122 · v2 · pith:6EQ7FFAGnew · submitted 2026-01-27 · cs.CV · cs.AI · cs.MM

Authors: Hongzhu Yi, Yujia Yang, Yuanxiang Wang, Tong Li, Zhenyu Guan, Tianyu Zong, Jiahuan Chen, Chenxi Bao, Tiankun Yang, Haopeng Jin, Yixuan Yuan, Xinming Wang, Tao Yu, Ruilin Gao, Ruiwen Tao, Haijin Liang, Jin Ma, Jinwen Luo, Yeshani, Xinyu Zuo, Jungang Xu

Pith reviewed 2026-05-22 11:16 UTC · model grok-4.3

classification cs.CV · cs.AI · cs.MM

keywords image editing · visual document editing · benchmark · bilingual documents · OCR evaluation · Chinese-English text · dense text images

The pith

VDE Bench is the first benchmark for testing image editing models on dense bilingual Chinese-English visual documents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates VDE Bench to evaluate how image editing models perform when asked to change text inside complex visual documents that mix dense Chinese and English. The benchmark draws seed images from academic papers, posters, slides, exams, and newspapers, and pairs them with human-annotated editing instructions, for 942 samples in total. It adds an OCR-parsing evaluation that scores text accuracy at a fine-grained level while checking that style and background stay intact. Tests on existing models produce results that align closely with separate human judgments.

Core claim

VDE Bench is a rigorously human-annotated benchmark of 942 instruction-based image editing samples for bilingual Chinese-English dense visual documents, together with a novel evaluation framework that quantifies editing performance at the OCR parsing level and enables fine-grained assessment of text modification accuracy.

What carries the argument

VDE Bench, a dataset of 942 human-annotated editing samples drawn from dense bilingual document images and scored by OCR parsing metrics that measure text change accuracy.
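
The idea of scoring an edit at the OCR parsing level can be pictured with a minimal sketch. This is not the paper's metric: it assumes an OCR engine has already returned a list of text lines for the source image and for the edited image, and the function name `score_replacement` and its three output fields are hypothetical. It checks whether the requested replacement landed and how well the lines outside the edit survived.

```python
from difflib import SequenceMatcher


def score_replacement(source_lines, edited_lines, old_text, new_text):
    """Score a text-replacement edit from OCR output (illustrative only).

    source_lines / edited_lines are lists of strings, assumed to come from
    an OCR engine run on the source image and the edited image.
    """
    edited_joined = "\n".join(edited_lines)
    # Did the requested change land, and did the old text disappear?
    target_present = new_text in edited_joined
    old_removed = old_text not in edited_joined

    # Source lines that do not contain old_text should be left unchanged.
    untouched = [line for line in source_lines if old_text not in line]
    if untouched:
        # For each such line, take its best similarity to any edited line.
        preservation = sum(
            max(
                (SequenceMatcher(None, line, e).ratio() for e in edited_lines),
                default=0.0,
            )
            for line in untouched
        ) / len(untouched)
    else:
        preservation = 1.0
    return {
        "target_present": target_present,
        "old_removed": old_removed,
        "preservation": preservation,
    }


# Made-up OCR output for a clean edit and for a failed one.
source = ["VDE Bench 2025", "密集文本文档", "Page 1"]
good_edit = ["VDE Bench 2026", "密集文本文档", "Page 1"]
bad_edit = ["VDE Bench 2025", "密集文本", "Page 1"]
good = score_replacement(source, good_edit, "2025", "2026")
bad = score_replacement(source, bad_edit, "2025", "2026")
```

Separating "did the edit land" from "did everything else survive" mirrors the two goals the benchmark describes: changing the target text while leaving the rest of the document intact. A real pipeline would also need to compare bounding boxes and visual style, which plain text matching cannot see.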

If this is right

Existing image editing models can be ranked by their success at preserving text style while changing content in dense bilingual documents.
The OCR-level metrics supply detailed error breakdowns that reveal specific failure modes in text editing.
Human-verified consistency supports using the automated scores for large-scale model comparisons.
The benchmark highlights the need for better handling of non-Latin scripts and complex layouts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Tool builders could use the benchmark to prioritize fixes for layout preservation in document editors.
Adding more document categories or languages to the same evaluation structure would test broader generalization.
Training loops that include VDE Bench examples might improve model robustness on real-world dense text edits.

Load-bearing premise

The 942 annotated samples together with the OCR parsing metrics accurately reflect the main difficulties of editing dense bilingual documents.

What would settle it

A larger human study that finds low correlation between the OCR metrics and human judgments on editing quality would show the automated scores do not reliably capture performance.
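
The test described above comes down to a rank correlation between human and automated scores. The sketch below computes Spearman's rho in plain Python, as the Pearson correlation of the ranks with ties given their average rank. The choice of Spearman's rho is an assumption made for illustration, and the score values are invented, not results from the paper.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with >= 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation undefined when all ranks are equal")
    return cov / (var_x * var_y) ** 0.5


# Invented scores for five hypothetical models (not results from the paper).
human_scores = [4.5, 3.0, 2.0, 4.0, 1.0]
ocr_scores = [0.91, 0.62, 0.70, 0.85, 0.30]
rho = spearman(human_scores, ocr_scores)
```

In this invented example two adjacent models swap places, which gives a rho of 0.9. A value near zero on a larger sample would be the kind of evidence that undermines the automated scores.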

Figures

Figures reproduced from arXiv: 2602.00122 by Chenxi Bao, Haijin Liang, Haopeng Jin, Hongzhu Yi, Jiahuan Chen, Jin Ma, Jinwen Luo, Jungang Xu, Ruilin Gao, Ruiwen Tao, Tao Yu, Tiankun Yang, Tianyu Zong, Tong Li, Xinming Wang, Xinyu Zuo, Yeshani, Yixuan Yuan, Yuanxiang Wang, Yujia Yang, Zhenyu Guan.

**Figure 2.** An example data sample from VDE Bench.

**Figure 3.** With respect to document type, the samples are categorized into nine distinct classes, which are approximately evenly distributed across the dataset. In terms of language, the samples encompass three categories: Chinese, English, and mixed Chinese-English. The distribution of these language categories is also largely balanced, indicating that the dataset provides good representativeness in both language…

**Figure 4.** Overview of the evaluation pipeline. The…

**Figure 5.** Correlation between human rankings and automated rankings. The horizontal axis represents the human ranking results, and the vertical axis represents the automated ranking results. Specifically, we randomly sampled 20 instances from VDE Bench and collected the corresponding outputs generated by each image editing model. Human annotators then ranked the models according to two criteria: detection box alignm…

Original abstract

In recent years, image editing models have made significant progress, enabling users to manipulate visual content in a flexible and interactive manner through natural language instructions. However, an important yet underexplored research direction remains dense visual document image editing, which involves modifying textual content within images while faithfully preserving the original text style and background context. Existing methods primarily focus on English scenarios and images with relatively sparse text, and thus cannot adequately address dense, structurally complex documents or non-Latin scripts such as Chinese. To bridge this gap, we propose VDE Bench (Visual Doc Edit Bench), a rigorously human annotated and evaluated benchmark specifically designed to assess the performance of image editing models on bilingual Chinese-English and complex visual document editing tasks. The benchmark comprises a high quality dataset of 942 instruction based image editing samples, whose seed images encompass dense Chinese and English text documents including academic papers, posters, presentation slides, examination materials, and newspapers. Furthermore, we introduce a novel evaluation framework that systematically quantifies editing performance at the OCR parsing level, thereby enabling fine grained assessment of text modification accuracy. Based on this benchmark, we conduct a comprehensive evaluation of representative image editing models. Human verification demonstrates a high degree of consistency between human judgments and automated evaluation metrics. VDE Bench constitutes the first systematic benchmark for evaluating the performance of image editing models on bilingual dense text visual documents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims to introduce VDE Bench as the first systematic benchmark for evaluating image editing models on bilingual dense-text visual documents. It presents a dataset of 942 human-annotated instruction-based editing samples covering academic papers, posters, presentation slides, examination materials, and newspapers with Chinese and English text. The work also proposes an OCR parsing level evaluation framework for fine-grained assessment of text modification accuracy and reports comprehensive evaluations of representative models, with human verification confirming consistency with the automated metrics.

Significance. If validated, VDE Bench would represent a valuable contribution to the field of image editing and document understanding by addressing the underexplored area of dense, bilingual document editing. Existing benchmarks focus on English and sparse text, so this resource could drive progress in models that handle complex layouts and non-Latin scripts while preserving style and context. The human annotation and OCR-based metrics offer a reproducible way to measure performance, potentially leading to better tools for real-world applications like editing scanned documents or multilingual posters.

major comments (2)

§3 (Dataset Construction): The manuscript describes the seed image categories and the total number of samples but does not detail the human annotation protocol, including guidelines provided to annotators, quality control measures, or how instructions were generated. This omission makes it difficult to assess the rigor of the 'rigorously human annotated' claim and whether the samples truly represent the challenges of dense bilingual editing.
§4 (Evaluation Framework): The assertion that human verification shows high consistency with automated OCR metrics is central to the benchmark's utility, yet no specific quantitative results (e.g., agreement percentages or correlation values) are provided in the main text or tables. This weakens the support for using these metrics as reliable proxies for editing quality.

minor comments (1)

Abstract: The abstract mentions 'fine grained assessment of text modification accuracy' but does not name the specific metrics (e.g., CER, WER) used in the OCR parsing level evaluation; adding this would improve clarity.
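
For readers unfamiliar with the two metrics the referee names, here is a minimal sketch of character error rate (CER) and word error rate (WER), both built on Levenshtein edit distance. CER and WER are the referee's examples; nothing in the text confirms that the paper uses them. The function names are illustrative.

```python
def levenshtein(ref, hyp):
    """Edit distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete r from the reference
                curr[j - 1] + 1,     # insert h into the reference
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate over whitespace-separated tokens."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return levenshtein(ref_words, hyp_words) / len(ref_words)


english_cer = cer("dense text", "dense test")            # one substitution
chinese_cer = cer("视觉文档编辑", "视觉文挡编辑")          # one wrong character
english_wer = wer("dense bilingual text", "dense text")  # one dropped word
```

CER is the more natural fit for Chinese, where words are not separated by whitespace, so a whitespace-based WER would treat a whole sentence as a single token.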

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We appreciate the opportunity to clarify aspects of the dataset construction and evaluation framework. Below we respond point-by-point to the major comments and indicate planned revisions to strengthen the paper.

Referee: §3 (Dataset Construction): The manuscript describes the seed image categories and the total number of samples but does not detail the human annotation protocol, including guidelines provided to annotators, quality control measures, or how instructions were generated. This omission makes it difficult to assess the rigor of the 'rigorously human annotated' claim and whether the samples truly represent the challenges of dense bilingual editing.

Authors: We acknowledge that the current manuscript provides insufficient detail on the annotation process. In the revised version, we will expand Section 3 with a dedicated subsection describing the full human annotation protocol. This will include the specific guidelines given to annotators (e.g., requirements for preserving layout, style, and bilingual text fidelity), the multi-stage quality control process (initial annotation followed by independent review by two additional annotators with disagreement resolution), and the procedure for generating editing instructions (seed instructions derived from common real-world editing scenarios and refined through pilot testing). These additions will provide transparent evidence supporting the rigor of the human-annotated dataset. revision: yes
Referee: §4 (Evaluation Framework): The assertion that human verification shows high consistency with automated OCR metrics is central to the benchmark's utility, yet no specific quantitative results (e.g., agreement percentages or correlation values) are provided in the main text or tables. This weakens the support for using these metrics as reliable proxies for editing quality.

Authors: We agree that quantitative validation is essential for establishing the reliability of the OCR-based metrics. Although the manuscript states that human verification demonstrates high consistency, we did not include the supporting statistics in the main text. In the revision, we will add a new paragraph and accompanying table in Section 4 reporting the specific agreement metrics, including inter-annotator agreement rates (e.g., percentage agreement on text modification accuracy) and correlation coefficients between human judgments and automated OCR scores across the evaluated models. This will directly substantiate the claim and strengthen the justification for the evaluation framework. revision: yes
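
The agreement statistics promised in this response could take several forms. As a minimal sketch, the code below computes raw percentage agreement, which the response names, and Cohen's kappa, which is added here as a common chance-corrected companion and is not mentioned in the source. The labels are invented for illustration.

```python
from collections import Counter


def percent_agreement(labels_a, labels_b):
    """Fraction of items on which two annotators gave the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(labels_a)
    p_observed = percent_agreement(labels_a, labels_b)
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: both annotators pick the same label independently.
    p_expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_expected == 1.0:
        raise ValueError("kappa undefined when chance agreement is 1")
    return (p_observed - p_expected) / (1 - p_expected)


# Invented pass/fail judgments from two hypothetical annotators.
annotator_a = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass"]
annotator_b = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass"]
agreement = percent_agreement(annotator_a, annotator_b)
kappa = cohens_kappa(annotator_a, annotator_b)
```

Here the annotators agree on 6 of 8 items (0.75), but kappa is only about 0.47 because both label most items "pass" and would often agree by chance. Reporting both numbers avoids overstating consistency.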

Circularity Check

0 steps flagged

No significant circularity

The paper introduces VDE Bench as a new benchmark resource for image editing models on bilingual dense-text visual documents. It directly describes a 942-sample human-annotated dataset spanning papers, posters, slides, exams and newspapers, plus an OCR-parsing-level evaluation framework whose reliability is asserted via human verification. No equations, fitted parameters, predictions, or derivation steps appear in the provided text. The central claim of constituting the 'first systematic benchmark' is framed as a gap identification in prior English/sparse-text work rather than a result derived from self-citations, uniqueness theorems, or ansatzes. The argument is therefore self-contained as a dataset and protocol proposal with no reduction to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper contributes a new benchmark and evaluation protocol rather than relying on mathematical derivations; its main premises are domain assumptions about annotation quality and metric validity.

axioms (1)

Domain assumption: Human annotations and OCR-based metrics provide a reliable and consistent measure of text editing accuracy in visual documents.
The evaluation framework and human verification claims rest on this premise.
