arxiv: 2605.15677 · v1 · pith:3YVZB2R4new · submitted 2026-05-15 · 💻 cs.CL · cs.CV

VCG-Bench: Towards A Unified Visual-Centric Benchmark for Structured Generation and Editing

Pith reviewed 2026-05-20 19:30 UTC · model grok-4.3

classification 💻 cs.CL cs.CV
keywords VCG-Bench · Diagram-as-Code · mxGraph XML · Vision-Language Models · diagram generation · structured editing · benchmark evaluation

The pith

VCG-Bench introduces a Diagram-as-Code paradigm using mxGraph XML to test vision-language models on precise diagram generation and editing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies limitations in current vision-language models for structured diagrammatic tasks required in professional settings. Pixel-based synthesis lacks editability and fidelity, leading the authors to advocate a symbolic Diagram-as-Code approach based on mxGraph XML. VCG-Bench supplies a dataset of 1,449 diagrams across six domains, defines paired tasks for vision-to-code generation and code-to-code editing, and applies metrics including execution success rate and style consistency. Experiments demonstrate that state-of-the-art models still fall short on structured fidelity and instruction compliance. A sympathetic reader would see this as evidence that vision and reasoning capacities in VLMs remain insufficient for controllable, high-precision visual outputs.

Core claim

We propose a new Diagram-as-Code paradigm with symbolic logic that leverages mxGraph XML for precise diagram generation and editing instead of probabilistic pixel spaces. We present VCG-Bench, a unified benchmark comprising a taxonomized dataset of 1,449 diverse diagrams spanning 6 domains and 15 sub-domains, a paradigm definition integrating Generation (Vision-to-Code) and Editability (Code-to-Code), and a tailored evaluation protocol with multi-dimensional metrics such as mxGraph Execution Success Rate and Style Consistency Score. Experimental results highlight the challenges faced by current SOTA VLMs in structured fidelity and instruction compliance, reflecting their vision and reasoning capabilities.

What carries the argument

The Diagram-as-Code paradigm, which substitutes symbolic mxGraph XML logic for pixel-based synthesis to enable exact, editable control over diagram structure and style.

If this is right

  • State-of-the-art VLMs exhibit measurable shortfalls in structured fidelity and instruction compliance on diagrammatic tasks.
  • The benchmark supplies a reproducible way to track progress on vision-to-code and code-to-code diagram operations.
  • Code-based representations can improve editability and structural accuracy over pixel synthesis for professional workflows.
  • Performance gaps on the benchmark point to broader limitations in current models' visual reasoning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training regimes that explicitly reward symbolic output formats might close some of the observed gaps.
  • The same evaluation structure could be adapted to other structured visuals such as circuit schematics or UI wireframes.
  • Direct comparisons between mxGraph and alternative diagram languages would test whether results generalize beyond the chosen format.

Load-bearing premise

That the mxGraph XML format and the chosen metrics such as Execution Success Rate and Style Consistency Score adequately capture real-world professional requirements for diagram generation and editing.

What would settle it

A study in which domain experts rate VLM-generated diagrams for usability in actual tools like draw.io or Lucidchart and find that high benchmark scores do not predict practical success.

Figures

Figures reproduced from arXiv: 2605.15677 by Gai Yuhang, Kaitao Lin, Liang Chen, Peijie Dong, Qiang Wang, Song Tang, Xiaowen Chu, Xiaoyan Su, Yuyao Zhai, Yuyu Luo, Zhenheng Tang.

Figure 1: Comparison of diagrammatic tasks. VCG utilizes symbolic mxGraph XML for precise generation and editing, overcoming structural drift in pixel-based models.
Figure 2: Overview of the VCG-Bench framework. Unlike fragmented approaches (top), VCG-Bench (bottom) unifies Vision-to-Code Generation (Task 1) and Instruction-to-Patch Editing (Task 2). Utilizing symbolic mxGraph XML enables precise, low-cost modifications for professional workflows. and editing mxGraph XML diagrams. VCG-Bench follows a “Data-Task-Evaluation” framework. (1) We construct a diverse dataset categori…
Figure 3: Overview of the VCG-Bench data generation framework. Subfigure (a) illustrates the end-to-end pipeline for Task 1, from raw web scraping to structured XML-based rendering. Subfigure (b) details the Task 2 pipeline, which focuses on generating editing-based reasoning tasks derived from Task 1. 2. Related Work Our work advances the domain of multimodal code generation by focusing on the mxGraph format. We s…
Figure 4: Left: Distribution across 15 sub-domains. Stratified by difficulty, reflecting structural complexity and element density. Right: Dataset composition. 4. Task Definition We define two core tasks under a unified framework formalized by executability constraints: both require model outputs to be valid mxGraph XML that can be parsed and rendered. The following subsections formalize Generation (Vision-to-Code)…
Figure 5: Performance scaling and robustness across diagrammatic complexity tiers. Each panel illustrates the CodeXQA accuracy of a specific model family as task difficulty increases from Easy to Hard.
Figure 6: Task 1 qualitative examples. This figure showcases generated diagrams across multiple domains with their corresponding Style Consistency Scores (SCS), SCS rankings, and expert rankings. The examples demonstrate that the SCS metric rankings and human expert rankings are highly consistent, proving that the SCS metric accurately reflects human aesthetic standards. 6. Experiments 6.1. Experimental Setup We eva…
Figure 7: Task 2 instruction editing precision demonstration. Rows show distinct editing instructions; columns show the input diagram, instruction, and outputs from five representative models.
Figure 8 visualizes the performance trade-offs between visual consistency (measured by SCS on Task 1) and editing fidelity (measured by XDRFR on Task 2) across frontier model families (GPT, Claude, Gemini). The scatter plot reveals several key insights: (1) Performance correlation: The positive correlation between SCS and XDRFR suggests that the underlying capabilities for structured understanding and code manipula…
Figure 9: Model robustness across task difficulty levels. The heatmap shows CodeXQA accuracy scores of evaluated models on Easy, Medium, and Hard samples. Closed-source models demonstrate superior stability, while open-source and smaller-scale models show sharp performance degradation. A.1.4. XDRFR Human Correction Audit: To ensure the accuracy and reliability of the XDRFR evaluation metric, we conducted a manual aud…
Figure 10: VCG-Bench Pipeline as a crucial Data Flywheel, delivering high-quality training data for VLM, LLM, and Diffusion models with enhanced controllability and explainability. D. Technical Specifications D.1. mxGraph XML Schema Overview: Ground-truth mxGraph XML follows the mxGraph structure and loads without repair in standard editors. A minimal illustrative snippet: <mxGraph> <root> <mxCell id="0"/> <m…
Original abstract

Despite the rapid advancements in Vision-Language Models (VLMs), a critical gap remains in their ability to handle structured, controllable diagrammatic tasks essential for professional workflows. Existing methods predominantly rely on pixel-based synthesis, which operates in probabilistic pixel spaces and is inherently limited in editability and fidelity. Instead, we propose a new Diagram-as-Code paradigm with symbolic logic that leverages mxGraph Extensible Markup Language (XML) for precise diagram generation and editing. We present VCG-Bench, a unified benchmark for visual-centric mxGraph tasks. VCG-Bench comprises: (1) a taxonomized dataset of 1,449 diverse diagrams spanning 6 domains and 15 sub-domains, (2) a paradigm definition that integrates Generation (Vision-to-Code) and Editability (Code-to-Code), (3) a Tailored Evaluation Protocol employing multi-dimensional metrics such as mxGraph Execution Success Rate, Style Consistency Score (SCS), etc. Experimental results highlight the challenges faced by current State-of-the-Art (SOTA) VLMs in structured fidelity and instruction compliance, reflecting their vision and reasoning capabilities.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces VCG-Bench, a unified benchmark for visual-centric structured generation and editing of diagrams under a Diagram-as-Code paradigm that uses mxGraph XML for symbolic, editable representations. It contributes a taxonomized dataset of 1,449 diagrams spanning 6 domains and 15 sub-domains, defines Generation (Vision-to-Code) and Editability (Code-to-Code) tasks, and proposes a tailored evaluation protocol with multi-dimensional metrics including mxGraph Execution Success Rate and Style Consistency Score. Experiments on current SOTA VLMs are reported to demonstrate limitations in structured fidelity and instruction compliance.

Significance. If the central claims hold after addressing format-specific confounds, the benchmark could usefully shift evaluation of VLMs away from pixel-based synthesis toward controllable, professional-grade diagrammatic workflows. The dataset taxonomy and dual-task paradigm definition are concrete contributions that could support reproducible progress in structured output generation.

major comments (2)
  1. [Abstract] Abstract: The claim that results 'reflect their vision and reasoning capabilities' is load-bearing for the paper's interpretation of VLM limitations, yet the evaluation protocol does not include cross-format controls or ablations on representation choice. Low Execution Success Rate and SCS scores could therefore arise from limited pretraining exposure to mxGraph XML syntax rather than deficits in diagram understanding or instruction following.
  2. [Evaluation Protocol] Evaluation Protocol (implied in abstract): The weakest assumption—that mxGraph XML and the chosen metrics adequately capture real-world professional requirements—is not tested. Without evidence that the format is representative or that results generalize beyond this schema, the benchmark's ability to diagnose general structured-generation failures remains open.
minor comments (2)
  1. [Abstract] Abstract: Dataset construction details, exact metric definitions, and baseline comparisons are referenced but not summarized; adding one sentence on each would improve immediate verifiability.
  2. Notation: The term 'Diagram-as-Code paradigm' is introduced without a concise formal definition or contrast to prior code-based diagram work; a short paragraph or table would clarify the novelty.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive feedback on our manuscript. The comments highlight important considerations regarding the interpretation of our results and the scope of the benchmark. We address each major comment point-by-point below, providing clarifications and making targeted revisions to the manuscript where appropriate.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The claim that results 'reflect their vision and reasoning capabilities' is load-bearing for the paper's interpretation of VLM limitations, yet the evaluation protocol does not include cross-format controls or ablations on representation choice. Low Execution Success Rate and SCS scores could therefore arise from limited pretraining exposure to mxGraph XML syntax rather than deficits in diagram understanding or instruction following.

    Authors: We acknowledge the validity of this concern regarding potential format-specific confounds. Our benchmark is intentionally scoped to the Diagram-as-Code paradigm using mxGraph XML, which enables precise symbolic representations and editability not afforded by pixel-based methods. The observed low Execution Success Rates and Style Consistency Scores primarily indicate challenges in producing syntactically valid and instruction-compliant structured output. To strengthen the manuscript, we have revised the abstract and added a dedicated paragraph in the Discussion section that explicitly discusses the possibility of pretraining exposure effects, justifies the choice of mxGraph based on its widespread use in professional tools (e.g., diagrams.net), and notes that cross-format ablations are a valuable direction for future work. This revision clarifies the interpretive scope without overstating generality. revision: partial

  2. Referee: [Evaluation Protocol] Evaluation Protocol (implied in abstract): The weakest assumption—that mxGraph XML and the chosen metrics adequately capture real-world professional requirements—is not tested. Without evidence that the format is representative or that results generalize beyond this schema, the benchmark's ability to diagnose general structured-generation failures remains open.

    Authors: We agree that empirical validation of representativeness would further bolster the benchmark's utility. mxGraph was selected for its symbolic editability and adoption in real diagramming workflows, as supported by references in the Related Work section. In the revised manuscript, we have expanded the Introduction to include additional context on mxGraph's professional relevance, added a Limitations subsection that transparently discusses schema-specific aspects and the absence of direct user studies or cross-schema comparisons, and clarified that the multi-metric protocol (Execution Success Rate, Style Consistency Score, etc.) is tailored to evaluate structured fidelity within this paradigm. These changes address the concern by improving transparency while preserving the benchmark's focus on controllable, editable diagram generation. revision: yes

Circularity Check

0 steps flagged

No circularity: benchmark proposal with no derivations or self-referential predictions

full rationale

The paper proposes VCG-Bench as a new benchmark and Diagram-as-Code paradigm using mxGraph XML for diagram generation and editing tasks. It defines a dataset, evaluation metrics (Execution Success Rate, Style Consistency Score), and reports experimental results on SOTA VLMs. No equations, fitted parameters, predictions derived from prior fits, or load-bearing self-citations appear in the provided text. The central claims about VLM limitations rest on the benchmark's external evaluation protocol rather than reducing to quantities defined by the authors' own prior results or by construction. This is a standard benchmark contribution whose statements are self-contained against the introduced tasks and metrics.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim depends on the premise that symbolic mxGraph representation overcomes pixel-based limitations and that the collected diagrams form a representative test of professional diagrammatic tasks.

axioms (1)
  • domain assumption: mxGraph XML supplies precise, editable symbolic logic for diagrams
    Invoked in the abstract as the foundation for the Generation and Editability paradigm.
invented entities (1)
  • Diagram-as-Code paradigm (no independent evidence)
    purpose: To replace probabilistic pixel synthesis with symbolic code for higher fidelity and editability in diagrammatic tasks
    Introduced as the core alternative to existing pixel-based methods.

pith-pipeline@v0.9.0 · 5772 in / 1142 out tokens · 48999 ms · 2026-05-20T19:30:14.347159+00:00 · methodology

Reference graph

Works this paper leans on

72 extracted references · 72 canonical work pages

  1. [1]

    Open-Source Template Repositories (40%): Sourced from public GitHub repositories hosting ‘.drawio‘ or ‘.xml‘ templates (e.g., ‘jgraph/drawio-diagrams‘, architecture-templates)

  2. [2]

    Open Access Academic Papers (30%): Extracted from arXiv sources under CC-BY licenses, focusing on CS/AI system architectures

  3. [3]

    Anonymized Corporate Diagrams (20%): Internal datasets from verified industry partners. All text entities were anonymized (e.g., "Feature A" instead of specific product names) and PII was scrubbed using regex and NER pipelines

  4. [4]

    Permissively Licensed Web Diagrams (10%): Crawled from technical blogs and documentation sites explicitly marked as CC-BY or public domain. B.3. Data Synthesis Pipeline The VCG-Bench dataset is constructed using a two-stage synthesis pipeline. Gemini-3-Pro is used to generate intermediate structured captions, while candidate mxGraph XML files are synthesi...

  5. [5]

    Code Generation (Task 1): Converting the description and the original image into mxGraphModel mxGraph XML.
    5. Verification: Rendering the mxGraph XML and filtering based on visual similarity.
    Stage 2: Instruction Synthesis (Task 2)
    1. Base Selection: High-quality mxGraph XMLs from Stage 1 are selected as ground truth.
    2. Instruction Generation: Creating instructi...

  6. [6]

    Atomic Operations: Edits are composed of 14 atomic operations (e.g., add_node, change_color, reroute_edge).
    B.4. Dataset Statistics and Difficulty
    Task 1 Difficulty (mxGraph XML Token Count):
      • Easy (< 8,645 tokens): 33.0% (478 images)
      • Medium (8,645 - 14,000 tokens): 49.8% (721 images)
      • Hard (> 14,000 tokens): 17.3% (250 images)
    Statistics: Mean 11,024, Media...
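The tier boundaries quoted above can be written as a small lookup. This is a minimal sketch, assuming the token count has already been computed elsewhere; the function name `difficulty_tier` is hypothetical and not taken from the paper, and the treatment of exactly 14,000 tokens as Medium follows the "8,645 - 14,000" range as printed.

```python
def difficulty_tier(token_count: int) -> str:
    """Map an mxGraph XML token count to the Task 1 tiers quoted above."""
    if token_count < 8645:
        return "Easy"
    if token_count <= 14000:
        return "Medium"
    return "Hard"


if __name__ == "__main__":
    # The reported mean token count falls in the Medium tier.
    print(difficulty_tier(11024))
```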

  7. [7]

    Output ONLY XML content (no Markdown fences, no explanations)

  8. [8]

    Start with ‘<mxfile>‘ and end with ‘</mxfile>‘

  9. [9]

    ‘<mxGraphModel>‘ must include ‘grid="1" gridSize="10" guides="1" page="1"‘

  10. [10]

    All elements must be ‘<mxCell>‘ with:
    - Unique ‘id‘ (globally unique, add suffixes if needed)
    - ‘value‘, ‘style‘, ‘parent‘ attributes
    - ‘<mxGeometry ... as="geometry"/>‘ child

  11. [11]

    Edge cells (‘edge="1"‘) must NOT be self-closing; must contain ‘<mxGeometry relative="1" as="geometry" />‘

  12. [12]

    XML escaping: Use ‘&lt;‘, ‘&gt;‘, ‘&amp;‘ in attribute values

  13. [13]

    No external images (no ‘shape=image‘ with URLs)
    **Generation Rules **:

  14. [14]

    Use [Original Image] as ground truth; [JSON Description Draft] as reference

  15. [15]

    Layout: All coordinates must be multiples of 10. Center the diagram on canvas (don’t stack in top-left corner)

  16. [16]

    Components: Create nodes from ‘components‘ and ‘component_styles‘. Use styles from JSON when available

  17. [17]

    Arrows: Create exactly one ‘<mxCell edge="1">‘ for each item in ‘arrows.details‘:
    - Use ‘style‘ field (solid/dashed)
    - Set ‘startArrow‘/‘endArrow‘ based on ‘heads‘ field (double arrow if ‘is_bidirectional: true‘)
    - Use ‘routing‘ field: add ‘edgeStyle=orthogonalEdgeStyle;‘ for orthogonal, use ‘<Array as="points">‘ for curved
    ...

  18. [18]

    Z-Order: Create background elements (with ‘is_background: true‘) first, then foreground elements

  19. [19]

    Complex structures: For ‘relationships_extended‘ with ‘type: "describes_subgraph"‘ or ‘"describes_structure"‘, rebuild the structure using ‘<mxCell>‘ primitives based on the ‘note‘ description
    **Validation**: Ensure all ‘id‘s are unique, all edges have proper source/target references, and XML is well-formed.
    Listing 3. Prompt for XML Generation
    E.2...
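The validation step named in the prompt (unique ids, proper edge references, well-formed XML, and the edge geometry rule) can be approximated with the standard library. This is a minimal sketch under stated assumptions, not the authors' validator: the function `check_mxgraph` and the `SAMPLE` document are hypothetical, only a subset of the prompt's rules is checked, and an edge with no `source` or `target` attribute is tolerated rather than flagged.

```python
import xml.etree.ElementTree as ET


def check_mxgraph(xml_text: str) -> list[str]:
    """Return rule violations found in xml_text; an empty list means none."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return [f"not well-formed: {exc}"]

    problems: list[str] = []
    cells = list(root.iter("mxCell"))

    # Rule: every mxCell carries a unique id.
    seen: set[str] = set()
    for cell in cells:
        cell_id = cell.get("id")
        if cell_id is None:
            problems.append("mxCell without id")
        elif cell_id in seen:
            problems.append(f"duplicate id: {cell_id}")
        else:
            seen.add(cell_id)

    for cell in cells:
        if cell.get("edge") != "1":
            continue
        # Rule: source/target, when present, must point at an existing cell.
        for end in ("source", "target"):
            ref = cell.get(end)
            if ref is not None and ref not in seen:
                problems.append(f"edge {cell.get('id')}: dangling {end} {ref}")
        # Rule: edge cells contain <mxGeometry relative="1" as="geometry"/>.
        geometry = cell.find("mxGeometry")
        if geometry is None or geometry.get("relative") != "1":
            problems.append(f"edge {cell.get('id')}: missing relative mxGeometry")
    return problems


# Hypothetical two-node diagram used only to exercise the checks.
SAMPLE = (
    '<mxfile><diagram><mxGraphModel grid="1" gridSize="10" guides="1" page="1"><root>'
    '<mxCell id="0"/><mxCell id="1" parent="0"/>'
    '<mxCell id="a" value="Start" style="rounded=1;" vertex="1" parent="1">'
    '<mxGeometry x="40" y="40" width="120" height="60" as="geometry"/></mxCell>'
    '<mxCell id="b" value="End" style="rounded=1;" vertex="1" parent="1">'
    '<mxGeometry x="240" y="40" width="120" height="60" as="geometry"/></mxCell>'
    '<mxCell id="e1" value="" style="endArrow=classic;" edge="1" parent="1" source="a" target="b">'
    '<mxGeometry relative="1" as="geometry"/></mxCell>'
    '</root></mxGraphModel></diagram></mxfile>'
)

if __name__ == "__main__":
    print(check_mxgraph(SAMPLE))
```

Returning a list of messages rather than a boolean keeps the sketch useful for filtering, since a pipeline could log which rule a candidate file broke.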

  20. [20]

    **Modify Node Color **: Change the fill color or border color of a node

  21. [21]

    **Modify Node Shape **: Change the shape type of a node (rectangle, circle, diamond, etc.)

  22. [22]

    **Modify Node Size **: Change the width or height of a node

  23. [23]

    **Modify Node Text **: Change the text content displayed inside a node
    ### Category 2: Node Structure Operations (3 operations)

  24. [24]

    **Delete Node **: Delete a node (without handling connections)

  25. [25]

    **Add Node **: Add a new node at a specified location

  26. [26]

    **Move Node **: Change the position of a node
    ### Category 3: Connection Line Attribute Modification (3 operations)

  27. [27]

    **Modify Connection Color **: Change the color of a connection line

  28. [28]

    **Modify Connection Style **: Change the style of a connection line (solid, dashed, thickness, etc.)

  29. [29]

    **Modify Connection Arrow **: Change the arrow style of a connection line
    ### Category 4: Connection Line Structure Operations (4 operations)

  30. [30]

    **Delete Connection **: Delete a connection line

  31. [31]

    **Add Connection **: Add a connection line between two nodes

  32. [32]

    **Redirect Connection **: Change the start or end point of a connection line

  33. [33]

    **Update Connection Path **: Update the connection line path when nodes move
    ---
    ## Difficulty Requirements (Strictly Follow)
    You need to generate 3 instructions of different difficulty levels, where difficulty is **completely determined by the number of atomic operations **:
    ### Easy Difficulty (1-2 atomic operations)
    - **Requiremen...
    ### [RECOMMENDED] Recommended Description Methods (Prioritized):
    **Highest Priority **: Directly use the text content displayed on nodes

  34. [34]

    **Node Text Content Description (Most natural and accurate) **:
    - "Change the ’Start’ node to red"
    - "Delete the ’Data Processing’ node"

  35. [35]

    **Semantic Description ** (when nodes don’t have clear text but semantics are clear):
    - "Start node", "End node"

  36. [36]

    **Position Description ** (when nodes have no text or text is unclear):
    - "The topmost node", "The leftmost rectangle", "The middle circle"

  37. [37]

    **Visual Feature Description ** (as supplementary positioning):
    - "The largest rectangle", "The red node"

  38. [38]

    **Relative Position Description **:
    - "The arrow between ’Start’ and ’End’"

    ### [PROHIBITED] Strictly Prohibited Description Methods:

  39. [39]

    **Technical Identifiers ** (Prohibited):
    - [X] "Node with id ’node_1’"
    - [X] "edge_5"

  40. [40]

    **Coordinate Description ** (Prohibited):
    - [X] "Node at coordinates (200, 300)"

  41. [41]

    **XML Attribute Reference ** (Prohibited):
    - [X] "Node with shape=’rectangle’"
    ---
    ## Output Format (Strict JSON)
    Please output a JSON object containing 3 instructions in the following format:
    ‘‘‘json { "instructions": [ { "difficulty": "Easy", "instruction": "Change the ’Start’ node to red", "atomic_operations": [ ...

  42. [42]

    **Only output modified parts **: To save tokens, you only need to output the modified mxGraph XML fragments, not the complete mxGraph XML

  43. [43]

    **Incremental format **: Output JSON format containing all places that need modification (because there may be multiple places to modify)

  44. [44]

    **Format requirements **:
    - Each modification contains two fields:
    - ‘original_fragment‘: Original mxGraph XML fragment (the part to be replaced)
    - ‘modified_fragment‘: Modified mxGraph XML fragment (the replacement content)
    - If there ...

  45. [45]

    **Fragment requirements **:
    - ‘original_fragment‘ must be a complete fragment from the original mxGraph XML (can be a complete element, such as ‘<mxCell>...</mxCell>‘)
    - ‘modified_fragment‘ is the corresponding modified fragment
    - Fragments must be precise enough to uniquely match the position in the original mxGraph XML

  46. [46]

    **Maintain correct mxGraph XML format **: Modified fragments must maintain correct and parseable mxGraph XML format
    ## Output Format (Strict JSON)
    Please output a JSON object in the following format:
    ‘‘‘json { "changes": [ { "original_fragment": "<mxCell id=’node_1’ value=’Old Text’ ...>...</mxCell>", "modified_...
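The incremental format described above, a list of `original_fragment` / `modified_fragment` pairs, can be applied to the source XML with plain string replacement. This is a minimal sketch, assuming each fragment must occur exactly once as the fragment requirements state; the function name `apply_changes` is hypothetical and how the paper's own harness applies patches is not shown in the excerpt.

```python
def apply_changes(xml_text: str, changes: list[dict]) -> str:
    """Apply original_fragment -> modified_fragment patches in order."""
    for change in changes:
        original = change["original_fragment"]
        modified = change["modified_fragment"]
        hits = xml_text.count(original)
        # The prompt requires fragments to match a unique position.
        if hits != 1:
            raise ValueError(f"fragment must match exactly once, found {hits}")
        xml_text = xml_text.replace(original, modified, 1)
    return xml_text


if __name__ == "__main__":
    source = '<root><mxCell id="n1" value="Old Text"/><mxCell id="n2" value="Keep"/></root>'
    patch = [{
        "original_fragment": '<mxCell id="n1" value="Old Text"/>',
        "modified_fragment": '<mxCell id="n1" value="New Text"/>',
    }]
    print(apply_changes(source, patch))
```

Rejecting ambiguous or missing fragments up front avoids silently editing the wrong cell, which matters when several cells share similar attributes.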

  47. [47]

    Visual style consistency (color, line thickness, node style): ___

  48. [48]

    Layout structure consistency (element positions, spacing, spatial relationships): ___

  49. [49]

    Aesthetic quality (alignment, visual balance, overall beauty): ___
    **Step 4: Final Score Calculation **
    - Calculate the average of the three dimension scores
    - Divide the average by 10 to normalize to 0-1
    - Example: Dimension scores [8.5, 7.0, 9.0], average = 8.17, final score = 0.817
    **Output Format (JSON): **
    { "analysis": { "ori...

  50. [50]

    **Style Consistency ** (color style, visual element style, overall style characteristics): ___
    - Evaluate whether unmodified parts maintain the original style characteristics
    - Evaluate whether the overall style is coordinated and unified

  51. [51]

    **Aesthetic Quality ** (visual balance, alignment, overall beauty, presence of obvious visual errors): ___
    - Evaluate whether the modified diagram is still beautiful and professional
    - If there are obvious visual errors (element overlaps, text misalignment, layout chaos), deduct points
    **Step 4: Final Score Calculation **
    - Calculate the avera...

  52. [52]

    Counting: Count the number of elements in the diagram

  53. [53]

    Identification: Identify attributes or labels of specific elements

  54. [54]

    Relationship: Identify relationships or connections between elements
    Image: [Image Placeholder]
    **Requirements (Enhanced Version): **
    **1. Depth Requirements (Improve Discrimination) **
    - **Counting questions **: Should not only ask "how many nodes" which is too simple. Should include counting of specific attributes, for example:
    - "How m...

  55. [57]

    Verify completeness and correctness
    **Verification Rules: **
    - **Text changes **: Check ‘value‘ attribute in ‘modified_fragment‘
    - **Color changes **: Check ‘fillColor‘/‘strokeColor‘ in ‘style‘ attribute (hex codes like #0000FF or color names)
    - **Position/Size**: Check ‘mxGeometry‘ coordinates/dimensions
    - **Add/Remove**: Check for new elem...

  56. [58]

    Analyze each modification by comparing ‘original_fragment‘ and ‘modified_fragment‘

  57. [59]

    Map instruction requirements to modifications

  58. [60]

    How many blue circular nodes are in the diagram?

    Verify completeness and correctness
    **Verification Rules: **
    - **Text changes **: Check ‘value‘ attribute in ‘modified_fragment‘
    - **Color changes **: Check ‘fillColor‘/‘strokeColor‘ in ‘style‘ attribute (hex codes like #0000FF or color names)
    - **Position/Size**: Check ‘mxGeometry‘ coordinates/dimensions
    - **Add/Remove**: Check for new elem...
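The color rule above inspects `fillColor` and `strokeColor` inside the `style` attribute, which mxGraph stores as a semicolon-separated list of `key=value` pairs. The following is a minimal sketch of that check, assuming plain string comparison of color values; the helper names `parse_style` and `color_changed` are hypothetical and do not come from the paper.

```python
def parse_style(style: str) -> dict[str, str]:
    """Split an mxGraph style string such as 'rounded=1;fillColor=#0000FF;'."""
    pairs: dict[str, str] = {}
    for part in style.split(";"):
        if not part:
            continue
        key, sep, value = part.partition("=")
        # Bare tokens without '=' (for example a shape name) map to "".
        pairs[key] = value if sep else ""
    return pairs


def color_changed(original_style: str, modified_style: str, key: str = "fillColor") -> bool:
    """Report whether the given color key differs between two style strings."""
    return parse_style(original_style).get(key) != parse_style(modified_style).get(key)


if __name__ == "__main__":
    before = "rounded=1;fillColor=#FFFFFF;strokeColor=#000000;"
    after = "rounded=1;fillColor=#0000FF;strokeColor=#000000;"
    print(color_changed(before, after), color_changed(before, after, "strokeColor"))
```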

  59. [61]

    Execution Success Rate (ESR): ESR measures the proportion of generated mxGraph XML codes that are syntactically valid and renderable.
    $$\mathrm{ESR} = \begin{cases} 1.0 & \text{if } \mathrm{mxGraphXML\_valid} \land \mathrm{Render\_success} \\ 0.0 & \text{otherwise} \end{cases}$$
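The per-sample ESR indicator and its average over a set of outputs can be sketched as follows. This is an illustration under stated assumptions: well-formedness is checked with Python's standard XML parser, and the `render` argument is a hypothetical callback standing in for whatever renderer the benchmark uses, which the excerpt does not specify.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Sequence


def is_well_formed(xml_text: str) -> bool:
    """True when xml_text parses as XML."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


def execution_success_rate(outputs: Sequence[str], render: Callable[[str], bool]) -> float:
    """Fraction of outputs that are both well-formed and rendered successfully."""
    if not outputs:
        return 0.0
    passed = sum(1 for xml_text in outputs if is_well_formed(xml_text) and render(xml_text))
    return passed / len(outputs)


if __name__ == "__main__":
    candidates = [
        '<mxGraphModel><root><mxCell id="0"/></root></mxGraphModel>',
        "<mxGraphModel><root>",  # truncated output, not well-formed
    ]
    # Placeholder renderer that accepts everything (assumption for the demo).
    print(execution_success_rate(candidates, render=lambda _xml: True))
```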

  60. [62]

    mxGraph XML Token Count (XTC): XTC is a proxy for structural complexity, calculated using the ‘cl100k_base‘ tokenizer. Primarily used for image difficulty classification, not as a model performance evaluation metric.
    $$\mathrm{XTC} = \mathrm{Tokenize}(\mathrm{mxGraphXML\_code}, \mathrm{cl100k\_base}) \tag{4}$$

  61. [63]

    Style Consistency Score (SCS): SCS is a VLM-based (gemini-3-pro-preview) perceptual metric evaluating three dimensions on a 0-10 scale.
    $$\mathrm{SCS} = \frac{1}{30}\left(S_{\mathrm{visual}} + S_{\mathrm{layout}} + S_{\mathrm{aesthetic}}\right) \tag{5}$$
    Where:
      • $S_{\mathrm{visual}}$: Color palette, line styles, node shapes.
      • $S_{\mathrm{layout}}$: Spatial arrangement, alignment, spacing.
      • $S_{\mathrm{aesthetic}}$: Overall visual harmony and professional look.
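The SCS normalization is simple arithmetic: three 0-10 dimension scores are summed and divided by 30. A minimal sketch follows, assuming the judge's scores are already available as numbers; the function name `style_consistency_score` and the range check are illustrative additions, not part of the paper.

```python
def style_consistency_score(visual: float, layout: float, aesthetic: float) -> float:
    """Combine three 0-10 dimension scores into a 0-1 SCS value."""
    for score in (visual, layout, aesthetic):
        if not 0 <= score <= 10:
            raise ValueError(f"dimension score out of range: {score}")
    return (visual + layout + aesthetic) / 30


if __name__ == "__main__":
    # Dimension scores from the scoring prompt's worked example.
    print(round(style_consistency_score(8.5, 7.0, 9.0), 3))  # 0.817
```

Dividing the sum by 30 is the same as averaging the three scores and dividing by 10, which is how the scoring prompt phrases it.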

  62. [64]

    CodeXQA Accuracy: CodeXQA is the average accuracy across the three question types (Counting, Identification, Relationship).
    $$\mathrm{CodeXQA} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{I}\left(\mathrm{Match}\left(\mathrm{Answer}^{\mathrm{pred}}_i, \mathrm{Answer}^{\mathrm{gt}}_i\right)\right)$$
    Matching strategies include Exact Match, Inclusion, and Semantic Similarity.
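The accuracy formula above can be sketched with two of the three named matching strategies. This is a hedged illustration: it implements Exact Match and Inclusion after simple normalization and deliberately omits Semantic Similarity, which would need an embedding or judge model the excerpt does not describe. The names `normalize`, `is_match`, and `codexqa_accuracy` are hypothetical.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace before comparison."""
    return " ".join(text.lower().split())


def is_match(predicted: str, ground_truth: str) -> bool:
    """Exact Match, falling back to Inclusion of the ground truth in the prediction."""
    pred, truth = normalize(predicted), normalize(ground_truth)
    if pred == truth:
        return True
    # Inclusion is permissive: a short truth such as "3" also matches "13".
    return truth != "" and truth in pred


def codexqa_accuracy(predictions: list[str], ground_truths: list[str]) -> float:
    """Mean of the match indicator over N question/answer pairs."""
    if len(predictions) != len(ground_truths):
        raise ValueError("predictions and ground truths must have equal length")
    if not predictions:
        return 0.0
    hits = sum(is_match(p, g) for p, g in zip(predictions, ground_truths))
    return hits / len(predictions)


if __name__ == "__main__":
    print(codexqa_accuracy(["3", "The blue circle", "no"], ["3", "blue circle", "yes"]))
```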

  63. [65]

    SigLIP2 Score (SigLIP2 Semantic Similarity Score): This metric uses the SigLIP2 model to calculate semantic similarity between the original and generated images.
    $$\mathrm{SigLIP2} = \mathrm{CosineSimilarity}\left(\mathrm{SigLIP2}(\mathrm{Original}), \mathrm{SigLIP2}(\mathrm{Generated})\right)$$
    The model used is google/siglip2-so400m-patch16-512, with cosine similarity in the range 0–1.
    F.2.2. Task 2: Editing Metrics

  64. [66]

    XDRFR (XML Decomposed Requirements Following Ratio): XDRFR is the primary metric for instruction following; it calculates the pass rate of decomposed Yes/No questions.
    $$\mathrm{XDRFR} = \frac{1}{M}\sum_{i=1}^{M} \mathbb{I}\left(\mathrm{Answer}_i = \text{"Yes"}\right)$$
    Where $M$ is the number of decomposed questions for the instruction. The evaluation is performed purely on the XML text, avoiding visual rendering artifacts.
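XDRFR reduces to the fraction of decomposed questions answered "Yes". A minimal sketch, assuming the judge's answers arrive as strings and that comparison should ignore case and surrounding whitespace (an assumption, since the excerpt does not say how answers are normalized); the function name `xdrfr` is hypothetical.

```python
def xdrfr(answers: list[str]) -> float:
    """Pass rate of decomposed Yes/No checks for one editing instruction."""
    if not answers:
        raise ValueError("at least one decomposed question is required")
    passed = sum(1 for answer in answers if answer.strip().lower() == "yes")
    return passed / len(answers)


if __name__ == "__main__":
    # Four decomposed requirements, three satisfied.
    print(xdrfr(["Yes", "No", "Yes", "Yes"]))  # 0.75
```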

  65. [67]

    SCS for Task 2: This metric is adapted to focus on style preservation during edits.
    $$\mathrm{SCS}_{\mathrm{Task2}} = \frac{1}{20}\left(S_{\mathrm{style}} + S_{\mathrm{aesthetic}}\right)$$
    Crucially, this metric does not penalize content changes (which are intended) but ensures the style remains consistent with the original diagram.

  66. [68]

    mxGraph XML Edit Distance (XED): XED calculates the edit distance between the original mxGraph XML and the modified mxGraph XML, quantifying the magnitude of code-level modifications.
    $$\mathrm{XED} = \mathrm{LevenshteinDistance}(\mathrm{Original\ mxGraph\ XML}, \mathrm{Modified\ mxGraph\ XML})$$
    It uses the standard Levenshtein distance algorithm, based on character-level comparison of mxGraph XML strings. ...
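Character-level Levenshtein distance, as named in the XED definition, can be computed with the usual dynamic-programming recurrence. The sketch below is a standard two-row implementation written for illustration; whether the authors use a library or their own code is not stated in the excerpt, and the function name `levenshtein` is hypothetical.

```python
def levenshtein(source: str, target: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(source) < len(target):
        source, target = target, source  # keep the inner row short
    previous = list(range(len(target) + 1))
    for i, src_char in enumerate(source, start=1):
        current = [i]
        for j, tgt_char in enumerate(target, start=1):
            cost = 0 if src_char == tgt_char else 1
            current.append(min(
                previous[j] + 1,         # delete src_char
                current[j - 1] + 1,      # insert tgt_char
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    before = '<mxCell id="n1" value="Start"/>'
    after = '<mxCell id="n1" value="Begin"/>'
    print(levenshtein(before, after))
```

The two-row form uses memory proportional to the shorter string, which helps when the XML inputs run to thousands of characters.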

  67. [69]

    Parse and validate schema(C); report well-formedness and loadability

  68. [70]

    Render C and R under identical settings for spatial comparison

  69. [71]

    Detect elements and align via category + proximity; perform bipartite matching

  70. [72]

    Compute IoU, completeness, style consistency

  71. [73]

    Run editability checks: move/resize nodes, re-route connectors

  72. [74]

    Evaluate directive compliance from D (presence/absence, counts, layout constraints)
    8. Output: per-metric scores and weighted aggregate
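The IoU step in the pipeline above compares element boxes after rendering. A minimal sketch of intersection-over-union for axis-aligned boxes follows, assuming each box is given as `(x, y, width, height)` in the way `mxGeometry` records positions; the function name `iou` is hypothetical, and the element detection and bipartite matching steps are not covered here.

```python
Box = tuple[float, float, float, float]  # (x, y, width, height)


def iou(box_a: Box, box_b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Overlap extent along each axis, clamped at zero for disjoint boxes.
    overlap_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    overlap_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    intersection = overlap_w * overlap_h
    union = aw * ah + bw * bh - intersection
    return intersection / union if union > 0 else 0.0


if __name__ == "__main__":
    reference = (40, 40, 120, 60)
    candidate = (50, 40, 120, 60)  # same node shifted 10 units right
    print(round(iou(reference, candidate), 3))
```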