VCG-Bench: Towards A Unified Visual-Centric Benchmark for Structured Generation and Editing
Pith reviewed 2026-05-20 19:30 UTC · model grok-4.3
The pith
VCG-Bench introduces a Diagram-as-Code paradigm using mxGraph XML to test vision-language models on precise diagram generation and editing.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We propose a new Diagram-as-Code paradigm with symbolic logic that leverages mxGraph XML for precise diagram generation and editing instead of probabilistic pixel spaces. We present VCG-Bench, a unified benchmark comprising a taxonomized dataset of 1,449 diverse diagrams spanning 6 domains and 15 sub-domains, a paradigm definition integrating Generation (Vision-to-Code) and Editability (Code-to-Code), and a tailored evaluation protocol with multi-dimensional metrics such as mxGraph Execution Success Rate and Style Consistency Score. Experimental results highlight the challenges faced by current SOTA VLMs in structured fidelity and instruction compliance, reflecting their vision and reasoning capabilities.
What carries the argument
The Diagram-as-Code paradigm, which substitutes symbolic mxGraph XML logic for pixel-based synthesis to enable exact, editable control over diagram structure and style.
If this is right
- State-of-the-art VLMs exhibit measurable shortfalls in structured fidelity and instruction compliance on diagrammatic tasks.
- The benchmark supplies a reproducible way to track progress on vision-to-code and code-to-code diagram operations.
- Code-based representations can improve editability and structural accuracy over pixel synthesis for professional workflows.
- Performance gaps on the benchmark point to broader limitations in current models' visual reasoning.
Where Pith is reading between the lines
- Training regimes that explicitly reward symbolic output formats might close some of the observed gaps.
- The same evaluation structure could be adapted to other structured visuals such as circuit schematics or UI wireframes.
- Direct comparisons between mxGraph and alternative diagram languages would test whether results generalize beyond the chosen format.
Load-bearing premise
That the mxGraph XML format and the chosen metrics such as Execution Success Rate and Style Consistency Score adequately capture real-world professional requirements for diagram generation and editing.
What would settle it
A study in which domain experts rate VLM-generated diagrams for usability in actual tools like draw.io or Lucidchart and find that high benchmark scores do not predict practical success.
Original abstract
Despite the rapid advancements in Vision-Language Models (VLMs), a critical gap remains in their ability to handle structured, controllable diagrammatic tasks essential for professional workflows. Existing methods predominantly rely on pixel-based synthesis, which operates in probabilistic pixel spaces and is inherently limited in editability and fidelity. Instead, we propose a new Diagram-as-Code paradigm with symbolic logic that leverages mxGraph Extensible Markup Language (XML) for precise diagram generation and editing. We present VCG-Bench, a unified benchmark for visual-centric mxGraph tasks. VCG-Bench comprises: (1) a taxonomized dataset of 1,449 diverse diagrams spanning 6 domains and 15 sub-domains, (2) a paradigm definition that integrates Generation (Vision-to-Code) and Editability (Code-to-Code), (3) a Tailored Evaluation Protocol employing multi-dimensional metrics such as mxGraph Execution Success Rate, Style Consistency Score (SCS), etc. Experimental results highlight the challenges faced by current State-of-the-Art (SOTA) VLMs in structured fidelity and instruction compliance, reflecting their vision and reasoning capabilities.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces VCG-Bench, a unified benchmark for visual-centric structured generation and editing of diagrams under a Diagram-as-Code paradigm that uses mxGraph XML for symbolic, editable representations. It contributes a taxonomized dataset of 1,449 diagrams spanning 6 domains and 15 sub-domains, defines Generation (Vision-to-Code) and Editability (Code-to-Code) tasks, and proposes a tailored evaluation protocol with multi-dimensional metrics including mxGraph Execution Success Rate and Style Consistency Score. Experiments on current SOTA VLMs are reported to demonstrate limitations in structured fidelity and instruction compliance.
Significance. If the central claims hold after addressing format-specific confounds, the benchmark could usefully shift evaluation of VLMs away from pixel-based synthesis toward controllable, professional-grade diagrammatic workflows. The dataset taxonomy and dual-task paradigm definition are concrete contributions that could support reproducible progress in structured output generation.
major comments (2)
- [Abstract] Abstract: The claim that results 'reflect their vision and reasoning capabilities' is load-bearing for the paper's interpretation of VLM limitations, yet the evaluation protocol does not include cross-format controls or ablations on representation choice. Low Execution Success Rate and SCS scores could therefore arise from limited pretraining exposure to mxGraph XML syntax rather than deficits in diagram understanding or instruction following.
- [Evaluation Protocol] Evaluation Protocol (implied in abstract): The weakest assumption—that mxGraph XML and the chosen metrics adequately capture real-world professional requirements—is not tested. Without evidence that the format is representative or that results generalize beyond this schema, the benchmark's ability to diagnose general structured-generation failures remains open.
minor comments (2)
- [Abstract] Abstract: Dataset construction details, exact metric definitions, and baseline comparisons are referenced but not summarized; adding one sentence on each would improve immediate verifiability.
- Notation: The term 'Diagram-as-Code paradigm' is introduced without a concise formal definition or contrast to prior code-based diagram work; a short paragraph or table would clarify the novelty.
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive feedback on our manuscript. The comments highlight important considerations regarding the interpretation of our results and the scope of the benchmark. We address each major comment point-by-point below, providing clarifications and making targeted revisions to the manuscript where appropriate.
-
Referee: [Abstract] Abstract: The claim that results 'reflect their vision and reasoning capabilities' is load-bearing for the paper's interpretation of VLM limitations, yet the evaluation protocol does not include cross-format controls or ablations on representation choice. Low Execution Success Rate and SCS scores could therefore arise from limited pretraining exposure to mxGraph XML syntax rather than deficits in diagram understanding or instruction following.
Authors: We acknowledge the validity of this concern regarding potential format-specific confounds. Our benchmark is intentionally scoped to the Diagram-as-Code paradigm using mxGraph XML, which enables precise symbolic representations and editability not afforded by pixel-based methods. The observed low Execution Success Rates and Style Consistency Scores primarily indicate challenges in producing syntactically valid and instruction-compliant structured output. To strengthen the manuscript, we have revised the abstract and added a dedicated paragraph in the Discussion section that explicitly discusses the possibility of pretraining exposure effects, justifies the choice of mxGraph based on its widespread use in professional tools (e.g., diagrams.net), and notes that cross-format ablations are a valuable direction for future work. This revision clarifies the interpretive scope without overstating generality.
revision: partial
-
Referee: [Evaluation Protocol] Evaluation Protocol (implied in abstract): The weakest assumption—that mxGraph XML and the chosen metrics adequately capture real-world professional requirements—is not tested. Without evidence that the format is representative or that results generalize beyond this schema, the benchmark's ability to diagnose general structured-generation failures remains open.
Authors: We agree that empirical validation of representativeness would further bolster the benchmark's utility. mxGraph was selected for its symbolic editability and adoption in real diagramming workflows, as supported by references in the Related Work section. In the revised manuscript, we have expanded the Introduction to include additional context on mxGraph's professional relevance, added a Limitations subsection that transparently discusses schema-specific aspects and the absence of direct user studies or cross-schema comparisons, and clarified that the multi-metric protocol (Execution Success Rate, Style Consistency Score, etc.) is tailored to evaluate structured fidelity within this paradigm. These changes address the concern by improving transparency while preserving the benchmark's focus on controllable, editable diagram generation.
revision: yes
Circularity Check
No circularity: benchmark proposal with no derivations or self-referential predictions
The paper proposes VCG-Bench as a new benchmark and Diagram-as-Code paradigm using mxGraph XML for diagram generation and editing tasks. It defines a dataset, evaluation metrics (Execution Success Rate, Style Consistency Score), and reports experimental results on SOTA VLMs. No equations, fitted parameters, predictions derived from prior fits, or load-bearing self-citations appear in the provided text. The central claims about VLM limitations rest on the benchmark's external evaluation protocol rather than reducing to quantities defined by the authors' own prior results or by construction. This is a standard benchmark contribution whose statements are self-contained against the introduced tasks and metrics.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: mxGraph XML supplies precise, editable symbolic logic for diagrams
invented entities (1)
- Diagram-as-Code paradigm: no independent evidence
Reference graph
Works this paper leans on
-
[1]
Open-Source Template Repositories (40%): Sourced from public GitHub repositories hosting ‘.drawio‘ or ‘.xml‘ templates (e.g., ‘jgraph/drawio-diagrams‘, architecture-templates)
-
[2]
Open Access Academic Papers (30%): Extracted from arXiv sources under CC-BY licenses, focusing on CS/AI system architectures
- [3]
-
[4]
Permissively Licensed Web Diagrams (10%): Crawled from technical blogs and documentation sites explicitly marked as CC-BY or public domain.
B.3. Data Synthesis Pipeline
The VCG-Bench dataset is constructed using a two-stage synthesis pipeline. Gemini-3-Pro is used to generate intermediate structured captions, while candidate mxGraph XML files are synthesi...
-
[5]
Code Generation (Task 1): Converting the description and the original image into mxGraphModel mxGraph XML.
5. Verification: Rendering the mxGraph XML and filtering based on visual similarity.
Stage 2: Instruction Synthesis (Task 2)
1. Base Selection: High-quality mxGraph XMLs from Stage 1 are selected as ground truth.
2. Instruction Generation: Creating instructi...
-
[6]
Atomic Operations: Edits are composed of 14 atomic operations (e.g., add_node, change_color, reroute_edge).
B.4. Dataset Statistics and Difficulty
Task 1 Difficulty (mxGraph XML Token Count):
• Easy (< 8,645 tokens): 33.0% (478 images)
• Medium (8,645 - 14,000 tokens): 49.8% (721 images)
• Hard (> 14,000 tokens): 17.3% (250 images)
Statistics: Mean 11,024, Media...
-
[7]
Output ONLY XML content (no Markdown fences, no explanations)
-
[8]
Start with ‘<mxfile>‘ and end with ‘</mxfile>‘
- [9]
- [10]
-
[11]
Edge cells (‘edge="1"‘) must NOT be self-closing; must contain ‘<mxGeometry relative="1" as="geometry" />‘
-
[12]
XML escaping: Use ‘&lt;‘, ‘&gt;‘, ‘&amp;‘ in attribute values
-
[13]
No external images (no ‘shape=image‘ with URLs)
**Generation Rules**:
-
[14]
Use [Original Image] as ground truth; [JSON Description Draft] as reference
-
[15]
Center the diagram on canvas (don’t stack in top-left corner)
Layout: All coordinates must be multiples of 10. Center the diagram on canvas (don’t stack in top-left corner)
-
[16]
Use styles from JSON when available
Components: Create nodes from ‘components‘ and ‘component_styles‘. Use styles from JSON when available
-
[17]
Arrows: Create exactly one ‘<mxCell edge="1">‘ for each item in ‘arrows.details‘:
- Use ‘style‘ field (solid/dashed)
- Set ‘startArrow‘/‘endArrow‘ based on ‘heads‘ field (double arrow if ‘is_bidirectional: true‘)
- Use ‘routing‘ field: add ‘edgeStyle=orthogonalEdgeStyle;‘ for orthogonal, use ‘<Array as="points">‘ for curved ...
-
[18]
Z-Order: Create background elements (with ‘is_background: true‘) first, then foreground elements
-
[19]
Complex structures: For ‘relationships_extended‘ with ‘type: "describes_subgraph"‘ or ‘"describes_structure"‘, rebuild the structure using ‘<mxCell>‘ primitives based on the ‘note‘ description
**Validation**: Ensure all ‘id‘s are unique, all edges have proper source/target references, and XML is well-formed.
Listing 3. Prompt for XML Generation
E.2...
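The format and validation rules in the prompt above (an `<mxfile>` root, edge cells that carry a child `<mxGeometry relative="1" as="geometry" />`, escaped attribute values, unique ids, resolvable edge endpoints) can be sketched in a few lines. This is a minimal illustration rather than the authors' tooling: the ids, labels, styles, and the `mxfile > diagram > mxGraphModel > root` nesting with layer cells `0` and `1` are assumptions based on the usual uncompressed draw.io layout.

```python
import xml.etree.ElementTree as ET


def build_diagram() -> str:
    """Assemble a two-node, one-edge diagram (ids, labels, styles are illustrative)."""
    mxfile = ET.Element("mxfile")
    diagram = ET.SubElement(mxfile, "diagram", {"name": "Page-1"})
    model = ET.SubElement(diagram, "mxGraphModel")
    root = ET.SubElement(model, "root")
    # Assumed draw.io convention: cell 0 is the root, cell 1 the default layer.
    ET.SubElement(root, "mxCell", {"id": "0"})
    ET.SubElement(root, "mxCell", {"id": "1", "parent": "0"})
    for cell_id, label, x in (("n1", "A < B & C", 100), ("n2", "End", 300)):
        node = ET.SubElement(root, "mxCell", {
            "id": cell_id, "value": label, "style": "rounded=1;",
            "vertex": "1", "parent": "1",
        })
        # Coordinates are multiples of 10, as the layout rule requests.
        ET.SubElement(node, "mxGeometry", {
            "x": str(x), "y": "100", "width": "120", "height": "60", "as": "geometry",
        })
    edge = ET.SubElement(root, "mxCell", {
        "id": "e1", "style": "edgeStyle=orthogonalEdgeStyle;", "edge": "1",
        "parent": "1", "source": "n1", "target": "n2",
    })
    # The edge is not self-closing: it contains its geometry child.
    ET.SubElement(edge, "mxGeometry", {"relative": "1", "as": "geometry"})
    # ElementTree escapes <, >, and & in attribute values on serialization.
    return ET.tostring(mxfile, encoding="unicode")


def validate(xml_text: str) -> list[str]:
    """Return a list of problems; an empty list means the checks passed."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return [f"not well-formed: {exc}"]
    problems = []
    cells = list(root.iter("mxCell"))
    seen = set()
    for cell in cells:
        cell_id = cell.get("id")
        if cell_id in seen:
            problems.append(f"duplicate id: {cell_id}")
        seen.add(cell_id)
    for cell in cells:
        if cell.get("edge") == "1":
            for end in ("source", "target"):
                ref = cell.get(end)
                if ref is None or ref not in seen:
                    problems.append(f"edge {cell.get('id')}: {end} {ref!r} does not reference a cell id")
    return problems


xml_text = build_diagram()
problems = validate(xml_text)
```

Building the tree with a serializer instead of string concatenation makes the escaping and well-formedness rules hold by construction, which is why the sketch leaves only the id and endpoint checks to `validate`.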
-
[20]
**Modify Node Color**: Change the fill color or border color of a node
-
[21]
**Modify Node Shape**: Change the shape type of a node (rectangle, circle, diamond, etc.)
-
[22]
**Modify Node Size**: Change the width or height of a node
-
[23]
**Modify Node Text**: Change the text content displayed inside a node
### Category 2: Node Structure Operations (3 operations)
-
[24]
**Delete Node**: Delete a node (without handling connections)
-
[25]
**Add Node**: Add a new node at a specified location
-
[26]
**Move Node**: Change the position of a node
### Category 3: Connection Line Attribute Modification (3 operations)
-
[27]
**Modify Connection Color**: Change the color of a connection line
-
[28]
**Modify Connection Style**: Change the style of a connection line (solid, dashed, thickness, etc.)
-
[29]
**Modify Connection Arrow**: Change the arrow style of a connection line
### Category 4: Connection Line Structure Operations (4 operations)
-
[30]
**Delete Connection**: Delete a connection line
-
[31]
**Add Connection**: Add a connection line between two nodes
-
[32]
**Redirect Connection**: Change the start or end point of a connection line
-
[33]
**Update Connection Path**: Update the connection line path when nodes move
---
## Difficulty Requirements (Strictly Follow)
You need to generate 3 instructions of different difficulty levels, where difficulty is **completely determined by the number of atomic operations**:
### Easy Difficulty (1-2 atomic operations)
- **Requiremen...
-
[34]
**Node Text Content Description (Most natural and accurate)**:
- "Change the ’Start’ node to red"
- "Delete the ’Data Processing’ node"
-
[35]
**Semantic Description** (when nodes don’t have clear text but semantics are clear):
- "Start node", "End node"
-
[36]
**Position Description** (when nodes have no text or text is unclear):
- "The topmost node", "The leftmost rectangle", "The middle circle"
-
[37]
**Visual Feature Description** (as supplementary positioning):
- "The largest rectangle", "The red node"
-
[38]
**Relative Position Description**:
- "The arrow between ’Start’ and ’End’"
### [PROHIBITED] Strictly Prohibited Description Methods:
-
[39]
**Technical Identifiers** (Prohibited):
- [X] "Node with id ’node_1’"
- [X] "edge_5"
-
[40]
**Coordinate Description** (Prohibited):
- [X] "Node at coordinates (200, 300)"
-
[41]
**XML Attribute Reference** (Prohibited):
- [X] "Node with shape=’rectangle’"
---
## Output Format (Strict JSON)
Please output a JSON object containing 3 instructions in the following format:
‘‘‘json
{
"instructions": [
{
"difficulty": "Easy",
"instruction": "Change the ’Start’ node to red",
"atomic_operations": [ ...
-
[42]
**Only output modified parts**: To save tokens, you only need to output the modified mxGraph XML fragments, not the complete mxGraph XML
-
[43]
**Incremental format**: Output JSON format containing all places that need modification (because there may be multiple places to modify)
-
[44]
**Format requirements**:
- Each modification contains two fields:
  - ‘original_fragment‘: Original mxGraph XML fragment (the part to be replaced)
  - ‘modified_fragment‘: Modified mxGraph XML fragment (the replacement content)
- If there ...
-
[45]
**Fragment requirements**:
- ‘original_fragment‘ must be a complete fragment from the original mxGraph XML (can be a complete element, such as ‘<mxCell>...</mxCell>‘)
- ‘modified_fragment‘ is the corresponding modified fragment
- Fragments must be precise enough to uniquely match the position in the original mxGraph XML
-
[46]
**Maintain correct mxGraph XML format**: Modified fragments must maintain correct and parseable mxGraph XML format
## Output Format (Strict JSON)
Please output a JSON object in the following format:
‘‘‘json
{
"changes": [
{
"original_fragment": "<mxCell id=’node_1’ value=’Old Text’ ...>...</mxCell>",
"modified_...
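The incremental edit format described in this prompt (a list of `original_fragment` / `modified_fragment` pairs, each of which must match the original XML uniquely) suggests a simple way to apply a model's output. The sketch below is one possible reading, not the benchmark's actual harness; the function name `apply_changes` and the choice to raise on a non-unique match are assumptions.

```python
def apply_changes(original_xml: str, changes: list[dict]) -> str:
    """Apply fragment replacements in order, enforcing the uniqueness rule."""
    result = original_xml
    for index, change in enumerate(changes):
        old = change["original_fragment"]
        new = change["modified_fragment"]
        occurrences = result.count(old)
        if occurrences != 1:
            # Zero matches means the fragment was not copied verbatim;
            # more than one means it is too short to locate the edit.
            raise ValueError(
                f"change {index}: original_fragment matched {occurrences} times, expected 1"
            )
        result = result.replace(old, new, 1)
    return result


original = '<root><mxCell id="n1" value="Old" /><mxCell id="n2" value="Keep" /></root>'
changes = [{
    "original_fragment": '<mxCell id="n1" value="Old" />',
    "modified_fragment": '<mxCell id="n1" value="New" />',
}]
edited = apply_changes(original, changes)
```

Rejecting ambiguous fragments instead of replacing the first hit keeps an under-specified edit from silently modifying the wrong cell.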
-
[47]
Visual style consistency (color, line thickness, node style): ___
-
[48]
Layout structure consistency (element positions, spacing, spatial relationships): ___
-
[49]
Aesthetic quality (alignment, visual balance, overall beauty): ___
**Step 4: Final Score Calculation**
- Calculate the average of the three dimension scores
- Divide the average by 10 to normalize to 0-1
- Example: Dimension scores [8.5, 7.0, 9.0], average = 8.17, final score = 0.817
**Output Format (JSON):**
{
"analysis": {
"ori...
-
[50]
**Style Consistency** (color style, visual element style, overall style characteristics): ___
- Evaluate whether unmodified parts maintain the original style characteristics
- Evaluate whether the overall style is coordinated and unified
-
[51]
**Aesthetic Quality** (visual balance, alignment, overall beauty, presence of obvious visual errors): ___
- Evaluate whether the modified diagram is still beautiful and professional
- If there are obvious visual errors (element overlaps, text misalignment, layout chaos), deduct points
**Step 4: Final Score Calculation**
- Calculate the avera...
-
[52]
Counting: Count the number of elements in the diagram
-
[53]
Identification: Identify attributes or labels of specific elements
-
[54]
Relationship: Identify relationships or connections between elements
Image: [Image Placeholder]
**Requirements (Enhanced Version):**
**1. Depth Requirements (Improve Discrimination)**
- **Counting questions**: Should not only ask "how many nodes" which is too simple. Should include counting of specific attributes, for example:
- "How m...
-
[57]
Verify completeness and correctness
**Verification Rules:**
- **Text changes**: Check ‘value‘ attribute in ‘modified_fragment‘
- **Color changes**: Check ‘fillColor‘/‘strokeColor‘ in ‘style‘ attribute (hex codes like #0000FF or color names)
- **Position/Size**: Check ‘mxGeometry‘ coordinates/dimensions
- **Add/Remove**: Check for new elem...
-
[58]
Analyze each modification by comparing ‘original_fragment‘ and ‘modified_fragment‘
-
[59]
Map instruction requirements to modifications
-
[60]
How many blue circular nodes are in the diagram?
Verify completeness and correctness
**Verification Rules:**
- **Text changes**: Check ‘value‘ attribute in ‘modified_fragment‘
- **Color changes**: Check ‘fillColor‘/‘strokeColor‘ in ‘style‘ attribute (hex codes like #0000FF or color names)
- **Position/Size**: Check ‘mxGeometry‘ coordinates/dimensions
- **Add/Remove**: Check for new elem...
-
[61]
Execution Success Rate (ESR)
ESR measures the proportion of generated mxGraph XML codes that are syntactically valid and renderable.
$$\mathrm{ESR} = \begin{cases} 1.0 & \text{if } \mathrm{mxGraphXML\_valid} \land \mathrm{Render\_success} \\ 0.0 & \text{otherwise} \end{cases}$$
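As a rough illustration of how ESR could be computed over a batch, the sketch below scores each sample 1.0 or 0.0 and averages. Validity is approximated here by XML well-formedness only, and the `renders` callback is a hypothetical stand-in for whatever renderer the benchmark uses, which the excerpt does not specify.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Sequence


def is_well_formed(xml_text: str) -> bool:
    """Approximate 'mxGraphXML_valid' by checking that the text parses as XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


def esr(samples: Sequence[str], renders: Callable[[str], bool]) -> float:
    """Mean of per-sample indicators: 1.0 if valid and rendered, else 0.0."""
    if not samples:
        raise ValueError("no samples to score")
    scores = [1.0 if is_well_formed(s) and renders(s) else 0.0 for s in samples]
    return sum(scores) / len(scores)


# Hypothetical renderer that always succeeds, to exercise the validity check.
batch = ["<mxfile></mxfile>", "<mxfile>"]
batch_esr = esr(batch, renders=lambda xml_text: True)
```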
-
[62]
mxGraph XML Token Count (XTC)
XTC is a proxy for structural complexity, calculated using the ‘cl100k_base‘ tokenizer. Primarily used for image difficulty classification, not as a model performance evaluation metric.
$$\mathrm{XTC} = \mathrm{Tokenize}(\mathrm{mxGraphXML\_code}, \texttt{cl100k\_base}) \tag{4}$$
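Since XTC is used for difficulty classification, a small helper can map a token count to the buckets quoted in the dataset statistics (Easy below 8,645 tokens, Medium from 8,645 to 14,000, Hard above 14,000). The sketch takes the count as an input and leaves tokenization out; treating both Medium boundaries as inclusive is an assumption, since the excerpt does not say how ties are handled.

```python
def difficulty(token_count: int) -> str:
    """Bucket a Task 1 sample by its mxGraph XML token count."""
    if token_count < 0:
        raise ValueError("token count cannot be negative")
    if token_count < 8645:
        return "Easy"
    if token_count <= 14000:  # assumed inclusive upper bound for Medium
        return "Medium"
    return "Hard"


labels = [difficulty(n) for n in (5000, 8645, 14000, 14001)]
```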
-
[63]
Style Consistency Score (SCS)
SCS is a VLM-based (gemini-3-pro-preview) perceptual metric evaluating three dimensions on a 0-10 scale.
$$\mathrm{SCS} = \frac{1}{30}\left(S_{\text{visual}} + S_{\text{layout}} + S_{\text{aesthetic}}\right) \tag{5}$$
Where:
• $S_{\text{visual}}$: Color palette, line styles, node shapes.
• $S_{\text{layout}}$: Spatial arrangement, alignment, spacing.
• $S_{\text{aesthetic}}$: Overall visual harmony and professional look
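The normalization is simple enough to check by hand: three scores on a 0-10 scale are summed and divided by 30. A minimal sketch, assuming the three dimension scores have already been obtained from the judge model (the range check is an added safeguard, not something the excerpt specifies):

```python
def scs(visual: float, layout: float, aesthetic: float) -> float:
    """Average three 0-10 dimension scores and normalize to the 0-1 range."""
    for name, score in (("visual", visual), ("layout", layout), ("aesthetic", aesthetic)):
        if not 0 <= score <= 10:
            raise ValueError(f"{name} score {score} is outside the 0-10 scale")
    return (visual + layout + aesthetic) / 30.0


# Dimension scores from the scoring prompt's own example: [8.5, 7.0, 9.0].
example = round(scs(8.5, 7.0, 9.0), 3)
```

With those inputs the sum is 24.5, so the normalized score rounds to 0.817, matching the worked example in the scoring prompt.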
-
[64]
CodeXQA Accuracy
CodeXQA is the average accuracy across the three question types (Counting, Identification, Relationship).
$$\mathrm{CodeXQA} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{I}\left(\mathrm{Match}\left(\mathrm{Answer}^{\text{pred}}_{i}, \mathrm{Answer}^{\text{gt}}_{i}\right)\right)$$
Matching strategies include Exact Match, Inclusion, and Semantic Similarity
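A minimal sketch of the accuracy computation, covering only the Exact Match and Inclusion strategies. The normalization (lowercasing, whitespace collapsing) and the order in which strategies are tried are assumptions, and Semantic Similarity is omitted because the excerpt does not describe how it is computed.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting does not matter."""
    return " ".join(text.lower().split())


def match(pred: str, gt: str) -> bool:
    """Exact match first, then inclusion of the ground truth in the prediction."""
    p, g = normalize(pred), normalize(gt)
    if p == g:
        return True
    return g != "" and g in p


def codexqa_accuracy(preds: list[str], gts: list[str]) -> float:
    if len(preds) != len(gts) or not gts:
        raise ValueError("predictions and ground truths must be non-empty and aligned")
    return sum(match(p, g) for p, g in zip(preds, gts)) / len(gts)


accuracy = codexqa_accuracy(
    ["3", "The label is Start", "no"],
    ["3", "start", "yes"],
)
```

Plain substring inclusion is permissive: a ground truth of "3" would also match a prediction of "13", so a real implementation would likely need token-level matching for counting questions.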
-
[65]
SigLIP2 Score (SigLIP2 Semantic Similarity Score)
This metric uses the SigLIP2 model to calculate semantic similarity between the original and generated images.
$$\mathrm{SigLIP2} = \mathrm{CosineSimilarity}\left(\mathrm{SigLIP2}(\text{Original}), \mathrm{SigLIP2}(\text{Generated})\right)$$
The model used is google/siglip2-so400m-patch16-512, with cosine similarity in the range 0–1.
F.2.2. Task 2: Editing Metrics
-
[66]
XDRFR (XML Decomposed Requirements Following Ratio)
XDRFR is the primary metric for instruction following; it calculates the pass rate of decomposed Yes/No questions.
$$\mathrm{XDRFR} = \frac{1}{M}\sum_{i=1}^{M} \mathbb{I}\left(\mathrm{Answer}_{i} = \text{"Yes"}\right)$$
Where $M$ is the number of decomposed questions for the instruction. The evaluation is performed purely on the XML text, avoiding visual rendering artifacts
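Given the judge's answers to the decomposed questions, the ratio is a direct count. A minimal sketch, assuming answers arrive as strings and that comparison should ignore case and surrounding whitespace (the excerpt does not say how answers are parsed):

```python
def xdrfr(answers: list[str]) -> float:
    """Fraction of decomposed Yes/No questions answered "Yes"."""
    if not answers:
        raise ValueError("an instruction needs at least one decomposed question")
    passed = sum(1 for answer in answers if answer.strip().lower() == "yes")
    return passed / len(answers)


ratio = xdrfr(["Yes", "No", "Yes", "Yes"])
```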
-
[67]
SCS for Task 2
This metric is adapted to focus on style preservation during edits.
$$\mathrm{SCS}_{\text{Task2}} = \frac{1}{20}\left(S_{\text{style}} + S_{\text{aesthetic}}\right)$$
Crucially, this metric does not penalize content changes (which are intended) but ensures the style remains consistent with the original diagram
-
[68]
mxGraph XML Edit Distance (XED)
XED calculates the edit distance between the original mxGraph XML and the modified mxGraph XML, quantifying the magnitude of code-level modifications.
$$\mathrm{XED} = \mathrm{LevenshteinDistance}(\text{Original mxGraph XML}, \text{Modified mxGraph XML})$$
It uses the standard Levenshtein distance algorithm, based on character-level comparison of mxGraph XML strings. ...
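For reference, character-level Levenshtein distance can be computed with the classic dynamic program, keeping only two rows of the table. This is a generic textbook sketch, not the benchmark's code; for XML strings of ten thousand tokens or more, the quadratic cost means an optimized library would likely be preferable in practice.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute or keep
            ))
        previous = current
    return previous[-1]


distance = levenshtein("kitten", "sitting")
```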
-
[69]
Parse and validate schema(C); report well-formedness and loadability
-
[70]
Render C and R under identical settings for spatial comparison
-
[71]
Detect elements and align via category + proximity; perform bipartite matching
-
[72]
Compute IoU, completeness, style consistency
-
[73]
Run editability checks: move/resize nodes, re-route connectors
-
[74]
Evaluate directive compliance from D (presence/absence, counts, layout constraints)
8. Output: per-metric scores and weighted aggregate
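The IoU term in the pipeline steps above can be illustrated for a pair of matched elements. The sketch assumes axis-aligned boxes given as `(x, y, width, height)`, mirroring `mxGeometry` attributes; how the benchmark actually extracts and matches boxes is not described in the excerpt.

```python
Box = tuple[float, float, float, float]  # (x, y, width, height)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    overlap_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    overlap_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    intersection = overlap_w * overlap_h
    union = aw * ah + bw * bh - intersection
    return intersection / union if union > 0 else 0.0


# Two 10x10 boxes offset by 5 horizontally: intersection 50, union 150.
half_overlap = iou((0, 0, 10, 10), (5, 0, 10, 10))
```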