VDE Bench: Evaluating The Capability of Image Editing Models to Modify Visual Documents
Pith reviewed 2026-05-22 11:16 UTC · model grok-4.3
The pith
VDE Bench is the first benchmark for testing image editing models on dense bilingual Chinese-English visual documents.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VDE Bench is a rigorously human-annotated benchmark of 942 instruction-based image editing samples for bilingual Chinese-English dense visual documents, together with a novel evaluation framework that quantifies editing performance at the OCR parsing level and enables fine-grained assessment of text modification accuracy.
What carries the argument
VDE Bench, a dataset of 942 human-annotated editing samples drawn from dense bilingual document images and scored by OCR parsing metrics that measure text change accuracy.
If this is right
- Existing image editing models can be ranked by their success at preserving text style while changing content in dense bilingual documents.
- The OCR-level metrics supply detailed error breakdowns that reveal specific failure modes in text editing.
- Human-verified consistency supports using the automated scores for large-scale model comparisons.
- The benchmark highlights the need for better handling of non-Latin scripts and complex layouts.
Where Pith is reading between the lines
- Tool builders could use the benchmark to prioritize fixes for layout preservation in document editors.
- Adding more document categories or languages to the same evaluation structure would test broader generalization.
- Training loops that include VDE Bench examples might improve model robustness on real-world dense text edits.
Load-bearing premise
The 942 annotated samples together with the OCR parsing metrics accurately reflect the main difficulties of editing dense bilingual documents.
What would settle it
A larger human study that finds low correlation between the OCR metrics and human judgments on editing quality would show the automated scores do not reliably capture performance.
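One way to picture such a check is a rank correlation between per-sample human ratings and automated scores. The sketch below computes Spearman's rho in plain Python. The function names and the sample values are illustrative assumptions, not data or code from the paper, and the paper does not state which correlation statistic (if any) it uses.

```python
from statistics import mean

def ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed inputs."""
    return pearson(ranks(x), ranks(y))

# Hypothetical values for illustration only (not results from the paper).
human_ratings = [4.5, 3.0, 2.0, 4.0]
ocr_scores = [0.91, 0.62, 0.70, 0.85]
print(spearman(human_ratings, ocr_scores))
```

A value near 1 would support using the automated scores as a proxy for human judgment; a value near 0 on a large sample would be the kind of evidence that undermines them.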
Original abstract
In recent years, image editing models have made significant progress, enabling users to manipulate visual content in a flexible and interactive manner through natural language instructions. However, an important yet underexplored research direction remains dense visual document image editing, which involves modifying textual content within images while faithfully preserving the original text style and background context. Existing methods primarily focus on English scenarios and images with relatively sparse text, and thus cannot adequately address dense, structurally complex documents or non-Latin scripts such as Chinese. To bridge this gap, we propose VDE Bench (Visual Doc Edit Bench), a rigorously human annotated and evaluated benchmark specifically designed to assess the performance of image editing models on bilingual Chinese-English and complex visual document editing tasks. The benchmark comprises a high quality dataset of 942 instruction based image editing samples, whose seed images encompass dense Chinese and English text documents including academic papers, posters, presentation slides, examination materials, and newspapers. Furthermore, we introduce a novel evaluation framework that systematically quantifies editing performance at the OCR parsing level, thereby enabling fine grained assessment of text modification accuracy. Based on this benchmark, we conduct a comprehensive evaluation of representative image editing models. Human verification demonstrates a high degree of consistency between human judgments and automated evaluation metrics. VDE Bench constitutes the first systematic benchmark for evaluating the performance of image editing models on bilingual dense text visual documents.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to introduce VDE Bench as the first systematic benchmark for evaluating image editing models on bilingual dense-text visual documents. It presents a dataset of 942 human-annotated instruction-based editing samples covering academic papers, posters, presentation slides, examination materials, and newspapers with Chinese and English text. The work also proposes an OCR parsing level evaluation framework for fine-grained assessment of text modification accuracy and reports comprehensive evaluations of representative models, with human verification confirming consistency with the automated metrics.
Significance. If validated, VDE Bench would represent a valuable contribution to the field of image editing and document understanding by addressing the underexplored area of dense, bilingual document editing. Existing benchmarks focus on English and sparse text, so this resource could drive progress in models that handle complex layouts and non-Latin scripts while preserving style and context. The human annotation and OCR-based metrics offer a reproducible way to measure performance, potentially leading to better tools for real-world applications like editing scanned documents or multilingual posters.
major comments (2)
- [§3 (Dataset Construction)] The manuscript describes the seed image categories and the total number of samples but does not detail the human annotation protocol, including guidelines provided to annotators, quality control measures, or how instructions were generated. This omission makes it difficult to assess the rigor of the 'rigorously human annotated' claim and whether the samples truly represent the challenges of dense bilingual editing.
- [§4 (Evaluation Framework)] The assertion that human verification shows high consistency with automated OCR metrics is central to the benchmark's utility, yet no specific quantitative results (e.g., agreement percentages or correlation values) are provided in the main text or tables. This weakens the support for using these metrics as reliable proxies for editing quality.
minor comments (1)
- [Abstract] The abstract mentions 'fine grained assessment of text modification accuracy' but does not name the specific metrics (e.g., CER, WER) used in the OCR-parsing-level evaluation; adding this would improve clarity.
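For readers unfamiliar with the metrics the referee names as examples, the following is a minimal sketch of character error rate (CER) computed from Levenshtein edit distance. It is an illustration under assumptions: the paper's actual OCR-level metrics are not specified in the text above, and the function names here are hypothetical. Because it operates on characters rather than whitespace-separated words, the same code applies to both Chinese and English strings.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, using two rolling rows."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of r
                curr[j - 1] + 1,     # insertion of h
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]

def cer(ref, hyp):
    """Character error rate: edit distance divided by reference length.

    For an empty reference this sketch returns 0.0 if the hypothesis is also
    empty and 1.0 otherwise; that convention is a choice made here.
    """
    if not ref:
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)

# Compare the intended text with OCR output from an edited image (made-up strings).
print(cer("视觉文档", "视觉文件"))
print(cer("Visual Doc Edit", "Visual Doc Edlt"))
```

Word error rate (WER) follows the same recipe with lists of tokens in place of strings, although word segmentation for Chinese would need a separate tokenizer.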
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We appreciate the opportunity to clarify aspects of the dataset construction and evaluation framework. Below we respond point-by-point to the major comments and indicate planned revisions to strengthen the paper.
- Referee: [§3 (Dataset Construction)] The manuscript describes the seed image categories and the total number of samples but does not detail the human annotation protocol, including guidelines provided to annotators, quality control measures, or how instructions were generated. This omission makes it difficult to assess the rigor of the 'rigorously human annotated' claim and whether the samples truly represent the challenges of dense bilingual editing.
  Authors: We acknowledge that the current manuscript provides insufficient detail on the annotation process. In the revised version, we will expand Section 3 with a dedicated subsection describing the full human annotation protocol. This will include the specific guidelines given to annotators (e.g., requirements for preserving layout, style, and bilingual text fidelity), the multi-stage quality control process (initial annotation followed by independent review by two additional annotators with disagreement resolution), and the procedure for generating editing instructions (seed instructions derived from common real-world editing scenarios and refined through pilot testing). These additions will provide transparent evidence supporting the rigor of the human-annotated dataset.
  Revision: yes
- Referee: [§4 (Evaluation Framework)] The assertion that human verification shows high consistency with automated OCR metrics is central to the benchmark's utility, yet no specific quantitative results (e.g., agreement percentages or correlation values) are provided in the main text or tables. This weakens the support for using these metrics as reliable proxies for editing quality.
  Authors: We agree that quantitative validation is essential for establishing the reliability of the OCR-based metrics. Although the manuscript states that human verification demonstrates high consistency, we did not include the supporting statistics in the main text. In the revision, we will add a new paragraph and accompanying table in Section 4 reporting the specific agreement metrics, including inter-annotator agreement rates (e.g., percentage agreement on text modification accuracy) and correlation coefficients between human judgments and automated OCR scores across the evaluated models. This will directly substantiate the claim and strengthen the justification for the evaluation framework.
  Revision: yes
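The agreement rates the authors promise could take several forms. As a hedged sketch, the code below computes raw percentage agreement and Cohen's kappa (agreement corrected for chance) between two annotators who each label an edit as correct (1) or incorrect (0). Cohen's kappa is a suggestion of this note, not a statistic the authors name, and the labels are made up for illustration.

```python
from collections import Counter

def percent_agreement(a, b):
    """Fraction of items on which two annotators gave the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e).

    p_o is observed agreement; p_e is the agreement expected by chance from
    each annotator's label frequencies. Undefined (division by zero) when
    p_e equals 1, i.e. both annotators always use the same single label.
    """
    n = len(a)
    p_o = percent_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    labels = set(a) | set(b)
    p_e = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical labels: 1 = edit judged correct, 0 = edit judged incorrect.
annotator_1 = [1, 1, 0, 0]
annotator_2 = [1, 1, 0, 1]
print(percent_agreement(annotator_1, annotator_2))
print(cohens_kappa(annotator_1, annotator_2))
```

Reporting kappa alongside raw agreement matters when one label dominates, since two annotators can agree often by chance alone.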
Circularity Check
No significant circularity
The paper introduces VDE Bench as a new benchmark resource for image editing models on bilingual dense-text visual documents. It directly describes a 942-sample human-annotated dataset spanning papers, posters, slides, exams and newspapers, plus an OCR-parsing-level evaluation framework whose reliability is asserted via human verification. No equations, fitted parameters, predictions, or derivation steps appear in the provided text. The central claim of constituting the 'first systematic benchmark' is framed as a gap identification in prior English/sparse-text work rather than a result derived from self-citations, uniqueness theorems, or ansatzes. The argument is therefore self-contained as a dataset and protocol proposal with no reduction to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Human annotations and OCR-based metrics provide a reliable and consistent measure of text editing accuracy in visual documents.