VCG-Bench: Towards A Unified Visual-Centric Benchmark for Structured Generation and Editing

Gai Yuhang; Kaitao Lin; Liang Chen; Peijie Dong; Qiang Wang; Song Tang; Xiaowen Chu; Xiaoyan Su; Yuyao Zhai; Yuyu Luo

arxiv: 2605.15677 · v1 · pith:3YVZB2R4new · submitted 2026-05-15 · 💻 cs.CL · cs.CV

Xiaoyan Su, Peijie Dong, Zhenheng Tang, Song Tang, Yuyao Zhai, Kaitao Lin, Liang Chen, Gai Yuhang, Yuyu Luo, Qiang Wang, Xiaowen Chu

Pith reviewed 2026-05-20 19:30 UTC · model grok-4.3

keywords VCG-Bench · Diagram-as-Code · mxGraph XML · Vision-Language Models · diagram generation · structured editing · benchmark evaluation

The pith

VCG-Bench introduces a Diagram-as-Code paradigm using mxGraph XML to test vision-language models on precise diagram generation and editing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies limitations in current vision-language models for structured diagrammatic tasks required in professional settings. Pixel-based synthesis lacks editability and fidelity, leading the authors to advocate a symbolic Diagram-as-Code approach based on mxGraph XML. VCG-Bench supplies a dataset of 1,449 diagrams across six domains, defines paired tasks for vision-to-code generation and code-to-code editing, and applies metrics including execution success rate and style consistency. Experiments demonstrate that state-of-the-art models still fall short on structured fidelity and instruction compliance. A sympathetic reader would see this as evidence that vision and reasoning capacities in VLMs remain insufficient for controllable, high-precision visual outputs.

Core claim

We propose a new Diagram-as-Code paradigm with symbolic logic that leverages mxGraph XML for precise diagram generation and editing instead of probabilistic pixel spaces. We present VCG-Bench, a unified benchmark comprising a taxonomized dataset of 1,449 diverse diagrams spanning 6 domains and 15 sub-domains, a paradigm definition integrating Generation (Vision-to-Code) and Editability (Code-to-Code), and a tailored evaluation protocol with multi-dimensional metrics such as mxGraph Execution Success Rate and Style Consistency Score. Experimental results highlight the challenges faced by current SOTA VLMs in structured fidelity and instruction compliance, reflecting their vision and reasoning capabilities.

What carries the argument

The Diagram-as-Code paradigm, which substitutes symbolic mxGraph XML logic for pixel-based synthesis to enable exact, editable control over diagram structure and style.

If this is right

State-of-the-art VLMs exhibit measurable shortfalls in structured fidelity and instruction compliance on diagrammatic tasks.
The benchmark supplies a reproducible way to track progress on vision-to-code and code-to-code diagram operations.
Code-based representations can improve editability and structural accuracy over pixel synthesis for professional workflows.
Performance gaps on the benchmark point to broader limitations in current models' visual reasoning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Training regimes that explicitly reward symbolic output formats might close some of the observed gaps.
The same evaluation structure could be adapted to other structured visuals such as circuit schematics or UI wireframes.
Direct comparisons between mxGraph and alternative diagram languages would test whether results generalize beyond the chosen format.

Load-bearing premise

That the mxGraph XML format and the chosen metrics such as Execution Success Rate and Style Consistency Score adequately capture real-world professional requirements for diagram generation and editing.

What would settle it

A study in which domain experts rate VLM-generated diagrams for usability in actual tools like draw.io or Lucidchart and find that high benchmark scores do not predict practical success.

Figures

Figures reproduced from arXiv: 2605.15677 by Gai Yuhang, Kaitao Lin, Liang Chen, Peijie Dong, Qiang Wang, Song Tang, Xiaowen Chu, Xiaoyan Su, Yuyao Zhai, Yuyu Luo, Zhenheng Tang.

**Figure 1.** Comparison of diagrammatic tasks. VCG utilizes symbolic mxGraph XML for precise generation and editing, overcoming structural drift in pixel-based models.

**Figure 2.** Overview of the VCG-Bench framework. Unlike fragmented approaches (top), VCG-Bench (bottom) unifies Vision-to-Code Generation (Task 1) and Instruction-to-Patch Editing (Task 2). Utilizing symbolic mxGraph XML enables precise, low-cost modifications for professional workflows.

… and editing mxGraph XML diagrams. VCG-Bench follows a “Data-Task-Evaluation” framework. (1) We construct a diverse dataset categori…

**Figure 3.** Overview of the VCG-Bench data generation framework. Subfigure (a) illustrates the end-to-end pipeline for Task 1, from raw web scraping to structured XML-based rendering. Subfigure (b) details the Task 2 pipeline, which focuses on generating editing-based reasoning tasks derived from Task 1.

2. Related Work

Our work advances the domain of multimodal code generation by focusing on the mxGraph format. We s…

**Figure 4.** Left: Distribution across 15 sub-domains. Stratified by difficulty, reflecting structural complexity and element density. Right: Dataset composition.

4. Task Definition

We define two core tasks under a unified framework formalized by executability constraints: both require model outputs to be valid mxGraph XML that can be parsed and rendered. The following subsections formalize Generation (Vision-to-Code)…

**Figure 5.** Performance scaling and robustness across diagrammatic complexity tiers. Each panel illustrates the CodeXQA accuracy of a specific model family as task difficulty increases from Easy to Hard.

**Figure 6.** Task 1 qualitative examples. This figure showcases generated diagrams across multiple domains with their corresponding Style Consistency Scores (SCS), SCS rankings, and expert rankings. The examples demonstrate that the SCS metric rankings and human expert rankings are highly consistent, proving that the SCS metric accurately reflects human aesthetic standards.

6. Experiments

6.1. Experimental Setup

We eva…

**Figure 7.** Task 2 instruction editing precision demonstration. Rows show distinct editing instructions; columns show the input diagram, instruction, and outputs from five representative models.

**Figure 8.** Figure 8 visualizes the performance trade-offs between visual consistency (measured by SCS on Task 1) and editing fidelity (measured by XDRFR on Task 2) across frontier model families (GPT, Claude, Gemini). The scatter plot reveals several key insights: (1) Performance correlation: The positive correlation between SCS and XDRFR suggests that the underlying capabilities for structured understanding and code manipula…

**Figure 9.** Model robustness across task difficulty levels. The heatmap shows CodeXQA accuracy scores of evaluated models on Easy, Medium, and Hard samples. Closed-source models demonstrate superior stability, while open-source and smaller-scale models show sharp performance degradation.

A.1.4. XDRFR Human Correction Audit

To ensure the accuracy and reliability of the XDRFR evaluation metric, we conducted a manual aud…

**Figure 10.** VCG-Bench Pipeline as a crucial Data Flywheel, delivering high-quality training data for VLM, LLM, and Diffusion models with enhanced controllability and explainability.

D. Technical Specifications

D.1. mxGraph XML Schema Overview

Ground-truth mxGraph XML follows the mxGraph structure and loads without repair in standard editors. A minimal illustrative snippet:

<mxGraph>
<root>
<mxCell id="0"/>
<m…

Original abstract

Despite the rapid advancements in Vision-Language Models (VLMs), a critical gap remains in their ability to handle structured, controllable diagrammatic tasks essential for professional workflows. Existing methods predominantly rely on pixel-based synthesis, which operates in probabilistic pixel spaces and is inherently limited in editability and fidelity. Instead, we propose a new Diagram-as-Code paradigm with symbolic logic that leverages mxGraph Extensible Markup Language (XML) for precise diagram generation and editing. We present VCG-Bench, a unified benchmark for visual-centric mxGraph tasks. VCG-Bench comprises: (1) a taxonomized dataset of 1,449 diverse diagrams spanning 6 domains and 15 sub-domains, (2) a paradigm definition that integrates Generation (Vision-to-Code) and Editability (Code-to-Code), (3) a Tailored Evaluation Protocol employing multi-dimensional metrics such as mxGraph Execution Success Rate, Style Consistency Score (SCS), etc. Experimental results highlight the challenges faced by current State-of-the-Art (SOTA) VLMs in structured fidelity and instruction compliance, reflecting their vision and reasoning capabilities.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces VCG-Bench, a unified benchmark for visual-centric structured generation and editing of diagrams under a Diagram-as-Code paradigm that uses mxGraph XML for symbolic, editable representations. It contributes a taxonomized dataset of 1,449 diagrams spanning 6 domains and 15 sub-domains, defines Generation (Vision-to-Code) and Editability (Code-to-Code) tasks, and proposes a tailored evaluation protocol with multi-dimensional metrics including mxGraph Execution Success Rate and Style Consistency Score. Experiments on current SOTA VLMs are reported to demonstrate limitations in structured fidelity and instruction compliance.

Significance. If the central claims hold after addressing format-specific confounds, the benchmark could usefully shift evaluation of VLMs away from pixel-based synthesis toward controllable, professional-grade diagrammatic workflows. The dataset taxonomy and dual-task paradigm definition are concrete contributions that could support reproducible progress in structured output generation.

major comments (2)

[Abstract] The claim that results 'reflect their vision and reasoning capabilities' is load-bearing for the paper's interpretation of VLM limitations, yet the evaluation protocol does not include cross-format controls or ablations on representation choice. Low Execution Success Rate and SCS scores could therefore arise from limited pretraining exposure to mxGraph XML syntax rather than deficits in diagram understanding or instruction following.
[Evaluation Protocol, implied in abstract] The weakest assumption, that mxGraph XML and the chosen metrics adequately capture real-world professional requirements, is not tested. Without evidence that the format is representative or that results generalize beyond this schema, the benchmark's ability to diagnose general structured-generation failures remains open.

minor comments (2)

[Abstract] Dataset construction details, exact metric definitions, and baseline comparisons are referenced but not summarized; adding one sentence on each would improve immediate verifiability.
[Notation] The term 'Diagram-as-Code paradigm' is introduced without a concise formal definition or contrast to prior code-based diagram work; a short paragraph or table would clarify the novelty.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive feedback on our manuscript. The comments highlight important considerations regarding the interpretation of our results and the scope of the benchmark. We address each major comment point-by-point below, providing clarifications and making targeted revisions to the manuscript where appropriate.

Referee: [Abstract] The claim that results 'reflect their vision and reasoning capabilities' is load-bearing for the paper's interpretation of VLM limitations, yet the evaluation protocol does not include cross-format controls or ablations on representation choice. Low Execution Success Rate and SCS scores could therefore arise from limited pretraining exposure to mxGraph XML syntax rather than deficits in diagram understanding or instruction following.

Authors: We acknowledge the validity of this concern regarding potential format-specific confounds. Our benchmark is intentionally scoped to the Diagram-as-Code paradigm using mxGraph XML, which enables precise symbolic representations and editability not afforded by pixel-based methods. The observed low Execution Success Rates and Style Consistency Scores primarily indicate challenges in producing syntactically valid and instruction-compliant structured output. To strengthen the manuscript, we have revised the abstract and added a dedicated paragraph in the Discussion section that explicitly discusses the possibility of pretraining exposure effects, justifies the choice of mxGraph based on its widespread use in professional tools (e.g., diagrams.net), and notes that cross-format ablations are a valuable direction for future work. This revision clarifies the interpretive scope without overstating generality. revision: partial
Referee: [Evaluation Protocol, implied in abstract] The weakest assumption, that mxGraph XML and the chosen metrics adequately capture real-world professional requirements, is not tested. Without evidence that the format is representative or that results generalize beyond this schema, the benchmark's ability to diagnose general structured-generation failures remains open.

Authors: We agree that empirical validation of representativeness would further bolster the benchmark's utility. mxGraph was selected for its symbolic editability and adoption in real diagramming workflows, as supported by references in the Related Work section. In the revised manuscript, we have expanded the Introduction to include additional context on mxGraph's professional relevance, added a Limitations subsection that transparently discusses schema-specific aspects and the absence of direct user studies or cross-schema comparisons, and clarified that the multi-metric protocol (Execution Success Rate, Style Consistency Score, etc.) is tailored to evaluate structured fidelity within this paradigm. These changes address the concern by improving transparency while preserving the benchmark's focus on controllable, editable diagram generation. revision: yes

Circularity Check

0 steps flagged

No circularity: benchmark proposal with no derivations or self-referential predictions

The paper proposes VCG-Bench as a new benchmark and Diagram-as-Code paradigm using mxGraph XML for diagram generation and editing tasks. It defines a dataset, evaluation metrics (Execution Success Rate, Style Consistency Score), and reports experimental results on SOTA VLMs. No equations, fitted parameters, predictions derived from prior fits, or load-bearing self-citations appear in the provided text. The central claims about VLM limitations rest on the benchmark's external evaluation protocol rather than reducing to quantities defined by the authors' own prior results or by construction. This is a standard benchmark contribution whose statements are self-contained against the introduced tasks and metrics.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim depends on the premise that symbolic mxGraph representation overcomes pixel-based limitations and that the collected diagrams form a representative test of professional diagrammatic tasks.

axioms (1)

Domain assumption: mxGraph XML supplies precise, editable symbolic logic for diagrams.
Invoked in the abstract as the foundation for the Generation and Editability paradigm.

invented entities (1)

Diagram-as-Code paradigm (no independent evidence)
Purpose: to replace probabilistic pixel synthesis with symbolic code for higher fidelity and editability in diagrammatic tasks.
Introduced as the core alternative to existing pixel-based methods.

pith-pipeline@v0.9.0 · 5772 in / 1142 out tokens · 48999 ms · 2026-05-20T19:30:14.347159+00:00 · methodology

Reference graph

Works this paper leans on

72 extracted references · 72 canonical work pages

[1]

Open-Source Template Repositories (40%): Sourced from public GitHub repositories hosting `.drawio` or `.xml` templates (e.g., `jgraph/drawio-diagrams`, architecture-templates).

[2]

Open Access Academic Papers (30%): Extracted from arXiv sources under CC-BY licenses, focusing on CS/AI system architectures.

[3]

Anonymized Corporate Diagrams (20%): Internal datasets from verified industry partners. All text entities were anonymized (e.g., "Feature A" instead of specific product names) and PII was scrubbed using regex and NER pipelines.

[4]

Permissively Licensed Web Diagrams (10%): Crawled from technical blogs and documentation sites explicitly marked as CC-BY or public domain.

B.3. Data Synthesis Pipeline

The VCG-Bench dataset is constructed using a two-stage synthesis pipeline. Gemini-3-Pro is used to generate intermediate structured captions, while candidate mxGraph XML files are synthesi...

[5]

Code Generation (Task 1): Converting the description and the original image into mxGraphModel mxGraph XML.
5. Verification: Rendering the mxGraph XML and filtering based on visual similarity.

Stage 2: Instruction Synthesis (Task 2)
1. Base Selection: High-quality mxGraph XMLs from Stage 1 are selected as ground truth.
2. Instruction Generation: Creating instructi...

[6]

Atomic Operations: Edits are composed of 14 atomic operations (e.g., add_node, change_color, reroute_edge).

B.4. Dataset Statistics and Difficulty

Task 1 Difficulty (mxGraph XML Token Count):
• Easy (< 8,645 tokens): 33.0% (478 images)
• Medium (8,645 - 14,000 tokens): 49.8% (721 images)
• Hard (> 14,000 tokens): 17.3% (250 images)

Statistics: Mean 11,024, Media...
[7]

Output ONLY XML content (no Markdown fences, no explanations)

[8]

Start with `<mxfile>` and end with `</mxfile>`

[9]

`<mxGraphModel>` must include `grid="1" gridSize="10" guides="1" page="1"`

[10]

All elements must be `<mxCell>` with:
- Unique `id` (globally unique, add suffixes if needed)
- `value`, `style`, `parent` attributes
- `<mxGeometry ... as="geometry"/>` child

[11]

Edge cells (`edge="1"`) must NOT be self-closing; must contain `<mxGeometry relative="1" as="geometry" />`

[12]

XML escaping: Use `&lt;`, `&gt;`, `&amp;` in attribute values

[13]

No external images (no `shape=image` with URLs)

**Generation Rules**:

[14]

Use [Original Image] as ground truth; [JSON Description Draft] as reference

[15]

Layout: All coordinates must be multiples of 10. Center the diagram on canvas (don't stack in top-left corner)

[16]

Components: Create nodes from `components` and `component_styles`. Use styles from JSON when available

[17]

Arrows: Create exactly one `<mxCell edge="1">` for each item in `arrows.details`:
- Use `style` field (solid/dashed)
- Set `startArrow`/`endArrow` based on `heads` field (double arrow if `is_bidirectional: true`)
- Use `routing` field: add `edgeStyle=orthogonalEdgeStyle;` for orthogonal, use `<Array as="points">` for curved
...

[18]

Z-Order: Create background elements (with `is_background: true`) first, then foreground elements

[19]

Complex structures: For `relationships_extended` with `type: "describes_subgraph"` or `"describes_structure"`, rebuild the structure using `<mxCell>` primitives based on the `note` description

**Validation**: Ensure all `id`s are unique, all edges have proper source/target references, and XML is well-formed.

Listing 3. Prompt for XML Generation

E.2...
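Several of the structural rules in the generation prompt can be checked mechanically. The following is a minimal Python sketch using only the standard-library `xml.etree.ElementTree`; the function name `check_constraints`, the selection of rules, and the error messages are illustrative assumptions, not part of the benchmark's tooling.

```python
import xml.etree.ElementTree as ET

# Attribute values the prompt requires on <mxGraphModel>.
REQUIRED_MODEL_ATTRS = {"grid": "1", "gridSize": "10", "guides": "1", "page": "1"}

def check_constraints(xml_text: str) -> list[str]:
    """Return the list of violated rules; an empty list means every check passed."""
    try:
        root = ET.fromstring(xml_text.strip())
    except ET.ParseError as exc:
        return [f"not well-formed XML: {exc}"]

    problems = []
    if root.tag != "mxfile":
        problems.append("document must start with <mxfile>")

    model = root.find(".//mxGraphModel")
    if model is None:
        problems.append("missing <mxGraphModel>")
    else:
        for name, expected in REQUIRED_MODEL_ATTRS.items():
            if model.get(name) != expected:
                problems.append(f'<mxGraphModel> needs {name}="{expected}"')

    cells = list(root.iter("mxCell"))
    ids = [cell.get("id") for cell in cells]
    if None in ids or len(ids) != len(set(ids)):
        problems.append("mxCell ids must be present and unique")

    # Edge cells must carry a relative geometry child.
    for cell in cells:
        if cell.get("edge") == "1":
            geometry = cell.find("mxGeometry")
            if geometry is None or geometry.get("relative") != "1":
                problems.append(f'edge {cell.get("id")} lacks <mxGeometry relative="1">')
    return problems
```

A check like this covers only syntax-level rules; layout rules such as centering the diagram would still need a renderer.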
[20]

**Modify Node Color**: Change the fill color or border color of a node

[21]

**Modify Node Shape**: Change the shape type of a node (rectangle, circle, diamond, etc.)

[22]

**Modify Node Size**: Change the width or height of a node

[23]

**Modify Node Text**: Change the text content displayed inside a node

### Category 2: Node Structure Operations (3 operations)

[24]

**Delete Node**: Delete a node (without handling connections)

[25]

**Add Node**: Add a new node at a specified location

[26]

**Move Node**: Change the position of a node

### Category 3: Connection Line Attribute Modification (3 operations)

[27]

**Modify Connection Color**: Change the color of a connection line

[28]

**Modify Connection Style**: Change the style of a connection line (solid, dashed, thickness, etc.)

[29]

**Modify Connection Arrow**: Change the arrow style of a connection line

### Category 4: Connection Line Structure Operations (4 operations)

[30]

**Delete Connection**: Delete a connection line

[31]

**Add Connection**: Add a connection line between two nodes

[32]

**Redirect Connection**: Change the start or end point of a connection line

[33]

**Update Connection Path**: Update the connection line path when nodes move

---

## Difficulty Requirements (Strictly Follow)

You need to generate 3 instructions of different difficulty levels, where difficulty is **completely determined by the number of atomic operations**:

### Easy Difficulty (1-2 atomic operations)
- **Requiremen...

### [RECOMMENDED] Recommended Description Methods (Prioritized):

**Highest Priority**: Directly use the text content displayed on nodes

[34]

**Node Text Content Description (Most natural and accurate)**:
- "Change the 'Start' node to red"
- "Delete the 'Data Processing' node"

[35]

**Semantic Description** (when nodes don't have clear text but semantics are clear):
- "Start node", "End node"

[36]

**Position Description** (when nodes have no text or text is unclear):
- "The topmost node", "The leftmost rectangle", "The middle circle"

[37]

**Visual Feature Description** (as supplementary positioning):
- "The largest rectangle", "The red node"

[38]

**Relative Position Description**:
- "The arrow between 'Start' and 'End'"

### [PROHIBITED] Strictly Prohibited Description Methods:

[39]

**Technical Identifiers** (Prohibited):
- [X] "Node with id 'node_1'"
- [X] "edge_5"

[40]

**Coordinate Description** (Prohibited):
- [X] "Node at coordinates (200, 300)"

[41]

**XML Attribute Reference** (Prohibited):
- [X] "Node with shape='rectangle'"

---

## Output Format (Strict JSON)

Please output a JSON object containing 3 instructions in the following format:

    {
    "instructions": [
    {
    "difficulty": "Easy",
    "instruction": "Change the 'Start' node to red",
    "atomic_operations": [ ...
[42]

**Only output modified parts**: To save tokens, you only need to output the modified mxGraph XML fragments, not the complete mxGraph XML

[43]

**Incremental format**: Output JSON format containing all places that need modification (because there may be multiple places to modify)

[44]

**Format requirements**:
- Each modification contains two fields:
- `original_fragment`: Original mxGraph XML fragment (the part to be replaced)
- `modified_fragment`: Modified mxGraph XML fragment (the replacement content)
- If there ...

[45]

**Fragment requirements**:
- `original_fragment` must be a complete fragment from the original mxGraph XML (can be a complete element, such as `<mxCell>...</mxCell>`)
- `modified_fragment` is the corresponding modified fragment
- Fragments must be precise enough to uniquely match the position in the original mxGraph XML

[46]

**Maintain correct mxGraph XML format**: Modified fragments must maintain correct and parseable mxGraph XML format

## Output Format (Strict JSON)

Please output a JSON object in the following format:

    {
    "changes": [
    {
    "original_fragment": "<mxCell id='node_1' value='Old Text' ...>...</mxCell>",
    "modified_...
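The incremental patch format described in the editing prompt can be applied to the original XML with plain string replacement. The sketch below is a hedged illustration: the function name `apply_changes` and the use of exact string matching are assumptions, since the source does not say how the benchmark applies patches. It enforces the prompt's requirement that each `original_fragment` matches a unique position.

```python
import json

def apply_changes(original_xml: str, patch_json: str) -> str:
    """Apply a {"changes": [...]} patch to an mxGraph XML string."""
    patched = original_xml
    for change in json.loads(patch_json)["changes"]:
        old = change["original_fragment"]
        new = change["modified_fragment"]
        # The prompt requires fragments to match a unique position.
        hits = patched.count(old)
        if hits != 1:
            raise ValueError(f"original_fragment must match exactly once, found {hits}")
        patched = patched.replace(old, new)
    return patched
```

Rejecting ambiguous or missing fragments keeps a malformed patch from silently editing the wrong cell.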
[47]

Visual style consistency (color, line thickness, node style): ___

[48]

Layout structure consistency (element positions, spacing, spatial relationships): ___

[49]

Aesthetic quality (alignment, visual balance, overall beauty): ___

**Step 4: Final Score Calculation**
- Calculate the average of the three dimension scores
- Divide the average by 10 to normalize to 0-1
- Example: Dimension scores [8.5, 7.0, 9.0], average = 8.17, final score = 0.817

**Output Format (JSON):**

    {
    "analysis": {
    "ori...

[50]

**Style Consistency** (color style, visual element style, overall style characteristics): ___
- Evaluate whether unmodified parts maintain the original style characteristics
- Evaluate whether the overall style is coordinated and unified

[51]

**Aesthetic Quality** (visual balance, alignment, overall beauty, presence of obvious visual errors): ___
- Evaluate whether the modified diagram is still beautiful and professional
- If there are obvious visual errors (element overlaps, text misalignment, layout chaos), deduct points

**Step 4: Final Score Calculation**
- Calculate the avera...

[52]

Counting: Count the number of elements in the diagram

[53]

Identification: Identify attributes or labels of specific elements

[54]

Relationship: Identify relationships or connections between elements

Image: [Image Placeholder]

**Requirements (Enhanced Version):**

**1. Depth Requirements (Improve Discrimination)**
- **Counting questions**: Should not only ask "how many nodes" which is too simple. Should include counting of specific attributes, for example:
- "How m...

[57]

Verify completeness and correctness

**Verification Rules:**
- **Text changes**: Check `value` attribute in `modified_fragment`
- **Color changes**: Check `fillColor`/`strokeColor` in `style` attribute (hex codes like #0000FF or color names)
- **Position/Size**: Check `mxGeometry` coordinates/dimensions
- **Add/Remove**: Check for new elem...

[58]

Analyze each modification by comparing `original_fragment` and `modified_fragment`

[59]

Map instruction requirements to modifications

[60]

Verify completeness and correctness

**Verification Rules:**
- **Text changes**: Check `value` attribute in `modified_fragment`
- **Color changes**: Check `fillColor`/`strokeColor` in `style` attribute (hex codes like #0000FF or color names)
- **Position/Size**: Check `mxGeometry` coordinates/dimensions
- **Add/Remove**: Check for new elem...
[61]

Execution Success Rate (ESR): ESR measures the proportion of generated mxGraph XML codes that are syntactically valid and renderable.

$$\mathrm{ESR} = \begin{cases} 1.0 & \text{if } \mathrm{mxGraphXML\_valid} \land \mathrm{Render\_success} \\ 0.0 & \text{otherwise} \end{cases}$$
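The per-sample ESR indicator and its average over a set of outputs can be sketched in Python as follows. This is an illustration under stated assumptions: validity is approximated by XML well-formedness via the standard-library parser, and the `renders` callback is a hypothetical hook standing in for an actual mxGraph renderer, which the source does not specify.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Iterable

def is_well_formed(xml_text: str) -> bool:
    """Approximate 'mxGraphXML_valid' by XML well-formedness only."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

def esr(outputs: Iterable[str],
        renders: Callable[[str], bool] = lambda _xml: True) -> float:
    """Mean of the per-sample indicator: 1.0 if valid and renderable, else 0.0."""
    scores = [1.0 if is_well_formed(x) and renders(x) else 0.0 for x in outputs]
    return sum(scores) / len(scores) if scores else 0.0

print(esr(["<a/>", "<a>", "<b><c/></b>", "x"]))  # two of four parse, so 0.5
```

In practice `renders` would call a headless diagram renderer; the default here treats every well-formed output as renderable.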
[62]

mxGraph XML Token Count (XTC): XTC is a proxy for structural complexity, calculated using the `cl100k_base` tokenizer. It is primarily used for image difficulty classification, not as a model performance evaluation metric.

$$\mathrm{XTC} = \mathrm{Tokenize}(\mathrm{mxGraphXML\_code}, \mathtt{cl100k\_base}) \tag{4}$$
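Given a token count, the difficulty tier follows from the thresholds reported in the dataset statistics (Easy below 8,645 tokens, Medium from 8,645 to 14,000, Hard above 14,000). A minimal Python sketch; the function name `difficulty_tier` is hypothetical, and the token count is taken as an input because the tokenization tooling the authors used is not described beyond the `cl100k_base` encoding name.

```python
def difficulty_tier(token_count: int) -> str:
    """Map an XTC value to the Task 1 difficulty tier reported for the dataset."""
    if token_count < 8645:
        return "Easy"
    if token_count <= 14000:
        return "Medium"
    return "Hard"

print(difficulty_tier(11024))  # the reported mean token count falls in "Medium"
```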
[63]

Style Consistency Score (SCS): SCS is a VLM-based (gemini-3-pro-preview) perceptual metric evaluating three dimensions on a 0-10 scale.

$$\mathrm{SCS} = \frac{1}{30}\left(S_{\text{visual}} + S_{\text{layout}} + S_{\text{aesthetic}}\right) \tag{5}$$

Where:
• $S_{\text{visual}}$: Color palette, line styles, node shapes.
• $S_{\text{layout}}$: Spatial arrangement, alignment, spacing.
• $S_{\text{aesthetic}}$: Overall visual harmony and professional look.
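The normalization in the SCS formula is simple arithmetic: three 0-10 scores are summed and divided by 30. A short Python sketch, with the function name `scs` and the range check added as assumptions for illustration; the judge model that produces the three scores is outside its scope.

```python
def scs(s_visual: float, s_layout: float, s_aesthetic: float) -> float:
    """Normalize three 0-10 dimension scores to a single 0-1 score."""
    for score in (s_visual, s_layout, s_aesthetic):
        if not 0 <= score <= 10:
            raise ValueError("each dimension score must lie in [0, 10]")
    return (s_visual + s_layout + s_aesthetic) / 30.0

# Dimension scores from the scoring prompt's example: 8.5, 7.0, 9.0.
print(round(scs(8.5, 7.0, 9.0), 3))  # 0.817
```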
[64]

CodeXQA Accuracy: CodeXQA is the average accuracy across the three question types (Counting, Identification, Relationship).

$$\mathrm{CodeXQA} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{I}\left(\mathrm{Match}\left(\mathrm{Answer}^{\text{pred}}_i, \mathrm{Answer}^{\text{gt}}_i\right)\right)$$

Matching strategies include Exact Match, Inclusion, and Semantic Similarity.
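The accuracy formula can be illustrated with two of the three matching strategies. This Python sketch implements Exact Match and Inclusion after simple whitespace and case normalization; the normalization, the function names, and the omission of Semantic Similarity (which would need an embedding or judge model) are all assumptions made for the example.

```python
def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace before comparison."""
    return " ".join(text.lower().split())

def matches(pred: str, gt: str) -> bool:
    """Exact Match, falling back to Inclusion (ground truth contained in prediction)."""
    p, g = _normalize(pred), _normalize(gt)
    return p == g or (g != "" and g in p)

def codexqa_accuracy(preds: list[str], gts: list[str]) -> float:
    """Fraction of questions whose predicted answer matches the ground truth."""
    if len(preds) != len(gts):
        raise ValueError("predictions and ground truths must align")
    if not preds:
        return 0.0
    return sum(matches(p, g) for p, g in zip(preds, gts)) / len(preds)
```

Plain substring inclusion is permissive: a ground truth of "3" would also match a prediction of "13", so a real scorer would likely need stricter handling for counting questions.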
[65]

SigLIP2 Score (SigLIP2 Semantic Similarity Score): This metric uses the SigLIP2 model to calculate semantic similarity between the original and generated images.

$$\mathrm{SigLIP2} = \mathrm{CosineSimilarity}\left(\mathrm{SigLIP2}(\mathrm{Original}), \mathrm{SigLIP2}(\mathrm{Generated})\right)$$

The model used is google/siglip2-so400m-patch16-512, with cosine similarity in the range 0–1.

F.2.2. Task 2: Editing Metrics
[66]

XDRFR (XML Decomposed Requirements Following Ratio): XDRFR is the primary metric for instruction following; it calculates the pass rate of decomposed Yes/No questions.

$$\mathrm{XDRFR} = \frac{1}{M}\sum_{i=1}^{M} \mathbb{I}\left(\mathrm{Answer}_i = \text{"Yes"}\right)$$

Where $M$ is the number of decomposed questions for the instruction. The evaluation is performed purely on the XML text, avoiding visual rendering artifacts.
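Once an instruction has been decomposed into Yes/No checks and each check answered, XDRFR is the fraction of "Yes" answers. A minimal Python sketch; the function name `xdrfr` and the case-insensitive comparison are assumptions, and the decomposition and judging steps themselves are not modeled.

```python
def xdrfr(answers: list[str]) -> float:
    """Pass rate over the M decomposed Yes/No questions of one instruction."""
    if not answers:
        raise ValueError("at least one decomposed question is required")
    passed = sum(answer.strip().lower() == "yes" for answer in answers)
    return passed / len(answers)

print(xdrfr(["Yes", "No", "Yes", "Yes"]))  # 0.75
```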
[67]

SCS for Task 2: This metric is adapted to focus on style preservation during edits.

$$\mathrm{SCS}_{\text{Task2}} = \frac{1}{20}\left(S_{\text{style}} + S_{\text{aesthetic}}\right)$$

Crucially, this metric does not penalize content changes (which are intended) but ensures the style remains consistent with the original diagram.
[68]

mxGraph XML Edit Distance (XED). XED calculates the edit distance between the original mxGraph XML and the modified mxGraph XML, quantifying the magnitude of code-level modifications.

\[ \mathrm{XED} = \mathrm{LevenshteinDistance}\left(\text{Original mxGraph XML}, \text{Modified mxGraph XML}\right) \]

It uses the standard Levenshtein distance algorithm, based on character-level comparison of mxGraph XML strings. 3...
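The character-level Levenshtein distance behind XED can be sketched with the standard two-row dynamic program. The function name `xed` is an assumption for illustration; any standard Levenshtein implementation yields the same value.

```python
def xed(original_xml: str, modified_xml: str) -> int:
    """Character-level Levenshtein distance between two XML strings."""
    # Keep the shorter string on the inner loop to bound memory use.
    if len(original_xml) < len(modified_xml):
        original_xml, modified_xml = modified_xml, original_xml
    previous = list(range(len(modified_xml) + 1))
    for i, ch_o in enumerate(original_xml, start=1):
        current = [i]
        for j, ch_m in enumerate(modified_xml, start=1):
            cost = 0 if ch_o == ch_m else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]
```

Changing a label from `Old` to `New` in an otherwise identical `mxCell` costs three substitutions, so a small XED indicates a localized edit.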
[69]

Parse and validate schema (C); report well-formedness and loadability
[70]

Render C and R under identical settings for spatial comparison
[71]

Detect elements and align via category + proximity; perform bipartite matching

[72]

Compute IoU, completeness, style consistency

[73]

Run editability checks: move/resize nodes, re-route connectors

[74]

Evaluate directive compliance from D (presence/absence, counts, layout constraints)

8. Output: per-metric scores and weighted aggregate
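The "Compute IoU" step in the procedure above can be illustrated for one pair of matched elements. This is a sketch under assumptions: boxes are axis-aligned and given as `(x, y, width, height)`, mirroring `mxGeometry` attributes. The source does not specify the box format or how matched pairs are aggregated.

```python
def iou(box_a, box_b) -> float:
    """Intersection over Union of two axis-aligned (x, y, w, h) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Overlap extent along each axis, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0
```

Two 10x10 boxes offset horizontally by 5 overlap in a 5x10 region, giving 50 / 150 = 1/3.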

[1] [1]

Open-Source Template Repositories (40%): Sourced from public GitHub repositories hosting `.drawio` or `.xml` templates (e.g., `jgraph/drawio-diagrams`, architecture-templates)

[2] [2]

Open Access Academic Papers (30%): Extracted from arXiv sources under CC-BY licenses, focusing on CS/AI system architectures


[3] [3]

Anonymized Corporate Diagrams (20%): Internal datasets from verified industry partners. All text entities were anonymized (e.g., "Feature A" instead of specific product names), and PII was scrubbed using regex and NER pipelines.

[4] [4]

Permissively Licensed Web Diagrams (10%): Crawled from technical blogs and documentation sites explicitly marked as CC-BY or public domain.

B.3. Data Synthesis Pipeline

The VCG-Bench dataset is constructed using a two-stage synthesis pipeline. Gemini-3-Pro is used to generate intermediate structured captions, while candidate mxGraph XML files are synthesi...

[5] [5]

Code Generation (Task 1): Converting the description and the original image into mxGraphModel mxGraph XML.
5. Verification: Rendering the mxGraph XML and filtering based on visual similarity.

Stage 2: Instruction Synthesis (Task 2)
1. Base Selection: High-quality mxGraph XMLs from Stage 1 are selected as ground truth.
2. Instruction Generation: Creating instructi...

[6] [6]

Atomic Operations: Edits are composed of 14 atomic operations (e.g., add_node, change_color, reroute_edge).

B.4. Dataset Statistics and Difficulty

Task 1 Difficulty (mxGraph XML Token Count):
• Easy (< 8,645 tokens): 33.0% (478 images)
• Medium (8,645 - 14,000 tokens): 49.8% (721 images)
• Hard (> 14,000 tokens): 17.3% (250 images)

Statistics: Mean 11,024, Media...
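The Task 1 difficulty split can be expressed as a simple bucketing rule over the token count. This is a sketch only: the function name is hypothetical, and the token count is assumed to be computed beforehand (the paper uses the `cl100k_base` tokenizer). Both Medium boundaries are treated as inclusive, following the ranges listed above.

```python
def task1_difficulty(token_count: int) -> str:
    """Map an mxGraph XML token count to the Task 1 difficulty label."""
    if token_count < 0:
        raise ValueError("token_count must be non-negative")
    if token_count < 8645:
        return "Easy"
    if token_count <= 14000:
        return "Medium"
    return "Hard"
```

The three bucket sizes (478 + 721 + 250) sum to the 1,449 diagrams in the dataset.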

[7] [7]

Output ONLY XML content (no Markdown fences, no explanations)


[8] [8]

Start with `<mxfile>` and end with `</mxfile>`

[9] [9]

`<mxGraphModel>` must include `grid="1" gridSize="10" guides="1" page="1"`

[10] [10]

All elements must be `<mxCell>` with:
- Unique `id` (globally unique, add suffixes if needed)
- `value`, `style`, `parent` attributes
- `<mxGeometry ... as="geometry"/>` child

[11] [11]

Edge cells (`edge="1"`) must NOT be self-closing; must contain `<mxGeometry relative="1" as="geometry" />`

[12] [12]

XML escaping: Use `&lt;`, `&gt;`, `&amp;` in attribute values

[13] [13]

No external images (no `shape=image` with URLs)

**Generation Rules**:

[14] [14]

Use [Original Image] as ground truth; [JSON Description Draft] as reference


[15] [15]

Layout: All coordinates must be multiples of 10. Center the diagram on canvas (don’t stack in top-left corner)

[16] [16]

Components: Create nodes from `components` and `component_styles`. Use styles from JSON when available

[17] [17]

Arrows: Create exactly one `<mxCell edge="1">` for each item in `arrows.details`:
- Use `style` field (solid/dashed)
- Set `startArrow`/`endArrow` based on `heads` field (double arrow if `is_bidirectional: true`)
- Use `routing` field: add `edgeStyle=orthogonalEdgeStyle;` for orthogonal, use `<Array as="points">` for curved
...

[18] [18]

Z-Order: Create background elements (with `is_background: true`) first, then foreground elements

[19] [19]

Complex structures: For `relationships_extended` with `type: "describes_subgraph"` or `"describes_structure"`, rebuild the structure using `<mxCell>` primitives based on the `note` description

**Validation**: Ensure all `id`s are unique, all edges have proper source/target references, and XML is well-formed.

Listing 3. Prompt for XML Generation

E.2...
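Several rules from the XML generation prompt (Listing 3) are mechanically checkable. The sketch below covers a subset: well-formedness, an `<mxfile>` root, unique `id`s, edge source/target references, the relative edge geometry, and coordinates in multiples of 10. It is an illustration under assumptions, not the benchmark's validator; the function name and error messages are hypothetical, and only Python's standard `xml.etree.ElementTree` is used.

```python
import xml.etree.ElementTree as ET


def validate_mxgraph(xml_text: str) -> list[str]:
    """Return a list of rule violations; an empty list means all checks pass."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return [f"not well-formed: {exc}"]

    errors = []
    if root.tag != "mxfile":
        errors.append("root element is not <mxfile>")

    cells = list(root.iter("mxCell"))
    seen = set()
    for cell in cells:
        cid = cell.get("id")
        if cid is None:
            errors.append("mxCell without id")
        elif cid in seen:
            errors.append(f"duplicate id: {cid}")
        else:
            seen.add(cid)

    for cell in cells:
        if cell.get("edge") != "1":
            continue
        for end in ("source", "target"):
            ref = cell.get(end)
            if ref is None or ref not in seen:
                errors.append(
                    f"edge {cell.get('id')}: {end} does not reference an existing cell"
                )
        geo = cell.find("mxGeometry")
        if geo is None or geo.get("relative") != "1":
            errors.append(f"edge {cell.get('id')}: missing relative mxGeometry")

    # Layout rule: x/y coordinates must be multiples of 10.
    for geo in root.iter("mxGeometry"):
        for attr in ("x", "y"):
            raw = geo.get(attr)
            if raw is not None and float(raw) % 10 != 0:
                errors.append(f"{attr}={raw} is not a multiple of 10")
    return errors
```

Returning a list of messages rather than a boolean makes it possible to report every violation in one pass.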

[20] [20]

**Modify Node Color**: Change the fill color or border color of a node

[21] [21]

**Modify Node Shape**: Change the shape type of a node (rectangle, circle, diamond, etc.)

[22] [22]

**Modify Node Size**: Change the width or height of a node

[23] [23]

**Modify Node Text**: Change the text content displayed inside a node

### Category 2: Node Structure Operations (3 operations)

[24] [24]

**Delete Node**: Delete a node (without handling connections)

[25] [25]

**Add Node**: Add a new node at a specified location

[26] [26]

**Move Node**: Change the position of a node

### Category 3: Connection Line Attribute Modification (3 operations)

[27] [27]

**Modify Connection Color**: Change the color of a connection line

[28] [28]

**Modify Connection Style**: Change the style of a connection line (solid, dashed, thickness, etc.)

[29] [29]

**Modify Connection Arrow**: Change the arrow style of a connection line

### Category 4: Connection Line Structure Operations (4 operations)

[30] [30]

**Delete Connection**: Delete a connection line

[31] [31]

**Add Connection**: Add a connection line between two nodes

[32] [32]

**Redirect Connection**: Change the start or end point of a connection line

[33] [33]

**Update Connection Path**: Update the connection line path when nodes move

---

## Difficulty Requirements (Strictly Follow)

You need to generate 3 instructions of different difficulty levels, where difficulty is **completely determined by the number of atomic operations**:

### Easy Difficulty (1-2 atomic operations)
- **Requiremen...

### [RECOMMENDED] Recommended Description Methods (Prioritized):

**Highest Priority**: Directly use the text content displayed on nodes

[34] [34]

**Node Text Content Description (Most natural and accurate)**:
- "Change the 'Start' node to red"
- "Delete the 'Data Processing' node"

[35] [35]

**Semantic Description** (when nodes don’t have clear text but semantics are clear):
- "Start node", "End node"

[36] [36]

**Position Description** (when nodes have no text or text is unclear):
- "The topmost node", "The leftmost rectangle", "The middle circle"

[37] [37]

**Visual Feature Description** (as supplementary positioning):
- "The largest rectangle", "The red node"

[38] [38]

**Relative Position Description**:
- "The arrow between 'Start' and 'End'"

### [PROHIBITED] Strictly Prohibited Description Methods:

[39] [39]

**Technical Identifiers** (Prohibited):
- [X] "Node with id 'node_1'"
- [X] "edge_5"

[40] [40]

**Coordinate Description** (Prohibited):
- [X] "Node at coordinates (200, 300)"

[41] [41]

**XML Attribute Reference** (Prohibited):
- [X] "Node with shape='rectangle'"

---

## Output Format (Strict JSON)

Please output a JSON object containing 3 instructions in the following format:

```json
{
  "instructions": [
    {
      "difficulty": "Easy",
      "instruction": "Change the 'Start' node to red",
      "atomic_operations": [ ...
```

[42] [42]

**Only output modified parts**: To save tokens, you only need to output the modified mxGraph XML fragments, not the complete mxGraph XML

[43] [43]

**Incremental format**: Output JSON format containing all places that need modification (because there may be multiple places to modify)

[44] [44]

**Format requirements**:
- Each modification contains two fields:
- `original_fragment`: Original mxGraph XML fragment (the part to be replaced)
- `modified_fragment`: Modified mxGraph XML fragment (the replacement content)
- If there ...

[45] [45]

**Fragment requirements**:
- `original_fragment` must be a complete fragment from the original mxGraph XML (can be a complete element, such as `<mxCell>...</mxCell>`)
- `modified_fragment` is the corresponding modified fragment
- Fragments must be precise enough to uniquely match the position in the original mxGraph XML

[46] [46]

**Maintain correct mxGraph XML format**: Modified fragments must maintain correct and parseable mxGraph XML format

## Output Format (Strict JSON)

Please output a JSON object in the following format:

```json
{
  "changes": [
    {
      "original_fragment": "<mxCell id='node_1' value='Old Text' ...>...</mxCell>",
      "modified_...
```
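The incremental edit format above implies a simple patching step: each `original_fragment` is located in the XML and replaced by its `modified_fragment`. The sketch below is one possible reading, not the authors' tooling. The function name is hypothetical, and it assumes that changes are applied sequentially and that each fragment must match exactly once in the current text (the prompt's uniqueness requirement).

```python
import json


def apply_changes(original_xml: str, changes_json: str) -> str:
    """Apply a list of fragment replacements to an mxGraph XML string."""
    changes = json.loads(changes_json)["changes"]
    result = original_xml
    for change in changes:
        old = change["original_fragment"]
        new = change["modified_fragment"]
        # Enforce the uniqueness requirement before patching.
        occurrences = result.count(old)
        if occurrences != 1:
            raise ValueError(
                f"original_fragment must match exactly once, found {occurrences}"
            )
        result = result.replace(old, new, 1)
    return result
```

Rejecting ambiguous or missing fragments up front avoids silently editing the wrong cell when a model returns a fragment that is too short to be unique.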

[47] [47]

Visual style consistency (color, line thickness, node style): ___

[48] [48]

Layout structure consistency (element positions, spacing, spatial relationships): ___

[49] [49]

Aesthetic quality (alignment, visual balance, overall beauty): ___

**Step 4: Final Score Calculation**
- Calculate the average of the three dimension scores
- Divide the average by 10 to normalize to 0-1
- Example: Dimension scores [8.5, 7.0, 9.0], average = 8.17, final score = 0.817

**Output Format (JSON):**
{
"analysis": {
"ori...

[50] [50]

**Style Consistency** (color style, visual element style, overall style characteristics): ___
- Evaluate whether unmodified parts maintain the original style characteristics
- Evaluate whether the overall style is coordinated and unified

[51] [51]

**Aesthetic Quality** (visual balance, alignment, overall beauty, presence of obvious visual errors): ___
- Evaluate whether the modified diagram is still beautiful and professional
- If there are obvious visual errors (element overlaps, text misalignment, layout chaos), deduct points

**Step 4: Final Score Calculation**
- Calculate the avera...

[52] [52]

Counting: Count the number of elements in the diagram

[53] [53]

Identification: Identify attributes or labels of specific elements

[54] [54]

Relationship: Identify relationships or connections between elements

Image: [Image Placeholder]

**Requirements (Enhanced Version):**

**1. Depth Requirements (Improve Discrimination)**
- **Counting questions**: Should not only ask "how many nodes" which is too simple. Should include counting of specific attributes, for example:
- "How m...

[55] [57]

Verify completeness and correctness

**Verification Rules:**
- **Text changes**: Check `value` attribute in `modified_fragment`
- **Color changes**: Check `fillColor`/`strokeColor` in `style` attribute (hex codes like #0000FF or color names)
- **Position/Size**: Check `mxGeometry` coordinates/dimensions
- **Add/Remove**: Check for new elem...

[56] [58]

Analyze each modification by comparing `original_fragment` and `modified_fragment`

[57] [59]

Map instruction requirements to modifications

[58] [60]

How many blue circular nodes are in the diagram?

Verify completeness and correctness

**Verification Rules:**
- **Text changes**: Check `value` attribute in `modified_fragment`
- **Color changes**: Check `fillColor`/`strokeColor` in `style` attribute (hex codes like #0000FF or color names)
- **Position/Size**: Check `mxGeometry` coordinates/dimensions
- **Add/Remove**: Check for new elem...
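The color rule above relies on reading keys out of an mxGraph `style` attribute, which is a semicolon-separated list of `key=value` pairs and bare tokens. The sketch below shows one way to check a fill-color change on XML text alone. It is a rule-based illustration with hypothetical function names; the source does not state whether the benchmark's judge checks this programmatically.

```python
def parse_style(style: str) -> dict[str, str]:
    """Split an mxGraph style string into a key -> value mapping."""
    parsed = {}
    for part in style.split(";"):
        if not part:
            continue  # skip the empty piece after a trailing ';'
        key, sep, value = part.partition("=")
        # Bare tokens such as "ellipse" carry no value.
        parsed[key] = value if sep else ""
    return parsed


def fill_color_is(style: str, expected_hex: str) -> bool:
    """True if the style's fillColor equals the expected hex code."""
    actual = parse_style(style).get("fillColor", "")
    return actual.lower() == expected_hex.lower()
```

Comparing hex codes case-insensitively is an assumption made here so that `#0000ff` and `#0000FF` count as the same color.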

[59] [61]

Execution Success Rate (ESR). ESR measures the proportion of generated mxGraph XML codes that are syntactically valid and renderable.

\[ \mathrm{ESR} = \begin{cases} 1.0 & \text{if } \mathrm{mxGraphXML\_valid} \land \mathrm{Render\_success} \\ 0.0 & \text{otherwise} \end{cases} \]
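A minimal sketch of ESR follows, assuming the per-sample indicator above is averaged over all samples to obtain the reported proportion. Validity is approximated here by XML well-formedness via the standard library. The `render` argument is a hypothetical callable standing in for an actual mxGraph renderer, which the source does not name.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Iterable


def is_well_formed(xml_text: str) -> bool:
    """Approximate mxGraphXML_valid by XML well-formedness."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


def esr_sample(xml_text: str, render: Callable[[str], bool]) -> float:
    """Per-sample indicator: 1.0 only if the XML parses and renders."""
    return 1.0 if is_well_formed(xml_text) and render(xml_text) else 0.0


def esr(samples: Iterable[str], render: Callable[[str], bool]) -> float:
    """Mean of the per-sample indicator over all generated outputs."""
    scores = [esr_sample(s, render) for s in samples]
    if not scores:
        raise ValueError("at least one sample is required")
    return sum(scores) / len(scores)
```

In practice `render` would wrap a headless renderer and return `False` on any load or export failure. A stub such as `lambda xml: True` is enough to exercise the parsing half.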

[60] [62]

mxGraph XML Token Count (XTC). XTC is a proxy for structural complexity, calculated using the `cl100k_base` tokenizer. It is primarily used for image difficulty classification, not as a model performance evaluation metric.

\[ \mathrm{XTC} = \mathrm{Tokenize}(\mathrm{mxGraphXML\_code}, \mathrm{cl100k\_base}) \tag{4} \]

[61] [63]

Style Consistency Score (SCS). SCS is a VLM-based (gemini-3-pro-preview) perceptual metric evaluating three dimensions on a 0-10 scale.

\[ \mathrm{SCS} = \frac{1}{30}\left(S_{\mathrm{visual}} + S_{\mathrm{layout}} + S_{\mathrm{aesthetic}}\right) \tag{5} \]

Where:
• $S_{\mathrm{visual}}$: Color palette, line styles, node shapes.
• $S_{\mathrm{layout}}$: Spatial arrangement, alignment, spacing.
• $S_{\mathrm{aesthetic}}$: Overall visual harmony and professional look.
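The normalization in Eq. (5) is simple arithmetic and can be checked against the worked example from the scoring prompt (dimension scores 8.5, 7.0, 9.0). The function name and the range check are assumptions added for illustration.

```python
def scs(visual: float, layout: float, aesthetic: float) -> float:
    """Normalize three 0-10 dimension scores to a single 0-1 score."""
    for score in (visual, layout, aesthetic):
        if not 0 <= score <= 10:
            raise ValueError("dimension scores must lie in [0, 10]")
    return (visual + layout + aesthetic) / 30
```

Dividing the sum by 30 is equivalent to averaging the three scores and then dividing by 10, so scores of 8.5, 7.0, and 9.0 give 24.5 / 30, or about 0.817.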
