arxiv: 2604.05514 · v1 · submitted 2026-04-07 · 💻 cs.AI

OmniDiagram: Advancing Unified Diagram Code Generation via Visual Interrogation Reward

Pith reviewed 2026-05-10 18:54 UTC · model grok-4.3

classification 💻 cs.AI
keywords diagram code generation · visual feedback · reinforcement learning · unified framework · diagram dataset · visual interrogation · code generation · state-of-the-art results

The pith

OmniDiagram trains code generators for many diagram languages by using self-generated visual questions to score rendered outputs in reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents OmniDiagram as a single framework that supports multiple diagram code languages and task types instead of restricting to narrow cases. It introduces Viva, a process in which the model creates its own visual inquiries to examine how faithfully a rendered diagram matches the intended structure and then uses that feedback to guide reinforcement learning updates. This approach removes the need for manually written ground-truth code during training. The authors also release M3²Diagram, a dataset of more than 196,000 examples. When supervised fine-tuning is followed by Viva-based reinforcement learning, the system records new state-of-the-art results on standard diagram code generation benchmarks.

Core claim

OmniDiagram is a unified framework incorporating diverse diagram code languages and task definitions. To align code logic with visual fidelity in reinforcement learning, it employs Visual Interrogation Verifies All (Viva), a generative strategy that actively produces targeted visual inquiries to scrutinize diagram visual fidelity and supplies fine-grained feedback for optimization. This enables a self-evolving training process that does not require manually annotated ground-truth code. Paired with supervised fine-tuning on the newly constructed M3²Diagram dataset of over 196k high-quality instances, the combination reaches new state-of-the-art performance across diagram code generation tasks.

What carries the argument

Visual Interrogation Verifies All (Viva), the mechanism that generates targeted visual inquiries about rendered diagrams to produce fine-grained rewards for reinforcement learning without ground-truth annotations.
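
To make the mechanism concrete, here is a minimal Python sketch of an interrogation-style reward. It is not the paper's implementation: the `VisualQuestion` structure, the `answer_fn` callback (a stand-in for a vision-language model that inspects the rendered image), and the choice to average yes/no answers are all assumptions made for illustration, since this summary does not say how Viva aggregates its answers.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class VisualQuestion:
    """A yes/no inquiry about a rendered diagram (hypothetical structure)."""
    text: str
    expected: bool = True  # the answer that indicates a faithful rendering


def interrogation_reward(
    image: object,
    questions: Sequence[VisualQuestion],
    answer_fn: Callable[[object, str], bool],
) -> float:
    """Return the fraction of questions answered as expected.

    `answer_fn` is an assumed interface for a model that looks at the
    rendered image and answers one question with True or False.
    """
    if image is None or not questions:
        # Assumption: a failed render (or an empty question set) earns nothing.
        return 0.0
    hits = sum(answer_fn(image, q.text) == q.expected for q in questions)
    return hits / len(questions)


# Toy stand-in: the "image" is the set of questions that are true of the render.
def toy_answer(image, question):
    return question in image


questions = [
    VisualQuestion("Is there a diamond-shaped node?"),
    VisualQuestion("Is there an arrow labeled 'Yes'?"),
    VisualQuestion("Do any labels overlap?", expected=False),
    VisualQuestion("Is the layout top-to-bottom?"),
]
rendered = {"Is there a diamond-shaped node?", "Is there an arrow labeled 'Yes'?"}
reward = interrogation_reward(rendered, questions, toy_answer)
print(reward)
```

In this toy run three of the four questions come back as expected, so the reward is 0.75. A real setup would replace `toy_answer` with a call to whatever model answers the inquiries, and the reward would then be fed to the RL update.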

If this is right

  • A single model can now handle a broader set of diagram languages and task formulations than earlier specialized systems.
  • Training proceeds without paired ground-truth code annotations for every example.
  • The self-evolving loop allows performance to improve iteratively from visual structure feedback alone.
  • The released M3²Diagram dataset supplies scale for future training of diagram-related models.
  • SOTA numbers are established on existing diagram code benchmarks when SFT is followed by Viva-based RL.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same interrogation style of feedback could be adapted to other generation tasks where output is judged by rendered appearance, such as plot or UI code.
  • If Viva-style questions prove reliable across diagram types, the method might reduce dependence on large labeled datasets in related multimodal code tasks.
  • The unified framework opens the possibility of extending support to additional languages or tasks not covered in current benchmarks.
  • Measuring how well Viva inquiries align with human judgments on diagram correctness would provide an independent check on the reward quality.

Load-bearing premise

The visual inquiries that Viva generates give accurate and unbiased feedback on diagram visual fidelity that reliably guides code improvements.

What would settle it

Training the same base model with Viva rewards replaced by syntax-only or random rewards and measuring whether performance on the benchmark suite stays at or above the reported SOTA level would directly test whether the visual interrogation step is necessary for the gains.
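
One way to set up that comparison is to keep the training loop fixed and swap only the reward function. The sketch below is an assumption-laden illustration: the bracket-balance check is a crude proxy for a syntax-only reward (it is not a real Mermaid or LaTeX parser), and the names `syntax_only_reward`, `make_random_reward`, and `build_ablation_arms` are hypothetical.

```python
import random
from typing import Callable, Dict

RewardFn = Callable[[str], float]

# Closing bracket -> matching opening bracket.
PAIRS = {")": "(", "]": "[", "}": "{"}


def syntax_only_reward(code: str) -> float:
    """Return 1.0 when brackets balance, else 0.0 (crude syntax proxy)."""
    stack = []
    for ch in code:
        if ch in PAIRS.values():
            stack.append(ch)
        elif ch in PAIRS:
            if not stack or stack.pop() != PAIRS[ch]:
                return 0.0
    return 1.0 if not stack else 0.0


def make_random_reward(seed: int) -> RewardFn:
    """Return a reward that ignores the code and draws uniformly from [0, 1)."""
    rng = random.Random(seed)
    return lambda code: rng.random()


def build_ablation_arms(visual_reward: RewardFn, seed: int = 0) -> Dict[str, RewardFn]:
    """Collect the reward variants; the intent is one RL run per arm."""
    return {
        "visual": visual_reward,
        "syntax_only": syntax_only_reward,
        "random": make_random_reward(seed),
    }


# Placeholder visual reward; a real run would plug in the interrogation reward.
arms = build_ablation_arms(visual_reward=lambda code: 0.0)
good = "graph TD; A[Start] --> B{Check}"
bad = "graph TD; A[Start --> B{Check}"
print(syntax_only_reward(good), syntax_only_reward(bad))
```

If the syntax-only and random arms land near the visual arm on the benchmark suite, the interrogation step is not what drives the gains; a clear gap would support the paper's claim.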

Figures

Figures reproduced from arXiv: 2604.05514 by Feibang Jiang, Haoyue Yang, Xuanle Zhao, Xuexin Liu, Yao Zhu.

Figure 1: Overcoming the barriers of single-modality: …
Figure 2: Overview of the OmniDiagram methodology. The framework illustrates the end-to-end flow from scalable …
Figure 3: Breakdown of the 196k-sample M3 …
Figure 4: Progression of the overall reward during the …
Figure 5: Statistical breakdown of tasks and diagram …
Figure 6: Qualitative showcase of our model across three modalities (LA …
Figure 7: Qualitative example of visual verification questions for the Text-to-Code task.
Figure 8: Qualitative example of visual verification questions for the Diagram-to-Code task.
Figure 9: Qualitative example of visual verification questions for the Diagram Editing task.
Figure 10: The prompt template used for generating different topics to set scene limitations.
Figure 11: The prompt template used for generating diverse diagram scenario based on user topics and specific …
Figure 12: The prompt template used for generating structured JSON data elements tailored to specific Mermaid …
Figure 13: The prompt template used and applying structured JSON data into executable Mermaid Mindmap code.
Figure 14: The prompt used for the Diagram-to-Code task evaluation.
Figure 15: The prompt used for the Diagram Editing task evaluation.
Figure 16: The prompt used for Text-to-Code task evaluation.
Original abstract

The paradigm of programmable diagram generation is evolving rapidly, playing a crucial role in structured visualization. However, most existing studies are confined to a narrow range of task formulations and language support, constraining their applicability to diverse diagram types. In this work, we propose OmniDiagram, a unified framework that incorporates diverse diagram code languages and task definitions. To address the challenge of aligning code logic with visual fidelity in Reinforcement Learning (RL), we introduce a novel visual feedback strategy named Visual Interrogation Verifies All (Viva). Unlike brittle syntax-based rules or pixel-level matching, Viva rewards the visual structure of rendered diagrams through a generative approach. Specifically, Viva actively generates targeted visual inquiries to scrutinize diagram visual fidelity and provides fine-grained feedback for optimization. This mechanism facilitates a self-evolving training process, effectively obviating the need for manually annotated ground truth code. Furthermore, we construct M3²Diagram, the first large-scale diagram code generation dataset, containing over 196k high-quality instances. Experimental results confirm that the combination of SFT and our Viva-based RL allows OmniDiagram to establish a new state-of-the-art (SOTA) across diagram code generation benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces OmniDiagram, a unified framework supporting diverse diagram code languages and task formulations. It proposes Viva (Visual Interrogation Verifies All), a generative visual feedback strategy for RL that generates targeted inquiries to assess rendered diagram fidelity and supply rewards, enabling training without ground-truth annotations. The authors construct the M3²Diagram dataset containing over 196k instances and claim that SFT combined with Viva-based RL achieves new state-of-the-art results on diagram code generation benchmarks.

Significance. If the Viva feedback mechanism can be empirically validated as reliable and unbiased, the work could meaningfully advance annotation-efficient RL for structured visualization tasks by addressing the code-to-visual alignment problem in a scalable way. The release of the large-scale M3²Diagram dataset is a clear positive contribution that may serve as a foundation for future benchmark studies in programmable diagram generation.

major comments (2)
  1. Abstract: The central claim that SFT plus Viva-based RL establishes new SOTA performance is asserted without any reported metrics, baseline comparisons, ablation studies, or implementation details, leaving the experimental support for the primary result unverifiable from the manuscript text.
  2. Viva description (Abstract and method sections): The approach depends on the unvalidated assumption that generative visual inquiries produce accurate, unbiased, and fine-grained signals of diagram visual fidelity. No checks—such as inter-rater agreement with humans, correlation with pixel-level metrics, or ablation on interrogator quality—are provided, which is load-bearing because unreliable rewards could reinforce artifacts rather than genuine improvements in the RL stage.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and indicate the revisions we will make to strengthen the manuscript.

Point-by-point responses
  1. Referee: Abstract: The central claim that SFT plus Viva-based RL establishes new SOTA performance is asserted without any reported metrics, baseline comparisons, ablation studies, or implementation details, leaving the experimental support for the primary result unverifiable from the manuscript text.

    Authors: We agree that the abstract's brevity makes the SOTA claim difficult to verify at a glance. The full manuscript reports these details in Section 4 (Experiments), including quantitative tables comparing against baselines, ablation studies on the RL component, and implementation specifics. To address the concern directly, we will revise the abstract to include key performance metrics (e.g., relative improvements on the primary benchmarks) and a brief mention of the evaluation setup. revision: yes

  2. Referee: Viva description (Abstract and method sections): The approach depends on the unvalidated assumption that generative visual inquiries produce accurate, unbiased, and fine-grained signals of diagram visual fidelity. No checks—such as inter-rater agreement with humans, correlation with pixel-level metrics, or ablation on interrogator quality—are provided, which is load-bearing because unreliable rewards could reinforce artifacts rather than genuine improvements in the RL stage.

    Authors: We acknowledge that direct validation of the Viva reward signals is important for establishing reliability. The current manuscript supports Viva's utility through end-to-end performance gains over SFT-only training and qualitative examples of the generated inquiries. However, we agree that additional checks would strengthen the claims. We will add a dedicated validation subsection that reports correlation between Viva rewards and both human ratings and pixel-level structural metrics, plus an ablation varying the interrogator model. revision: yes
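
The reward-versus-human check promised in this response could be computed as a rank correlation. The following is a small sketch, assuming one automated reward and one human rating per diagram and using Spearman's rho; the choice of statistic and the function names are illustrative assumptions, and the numbers are toy values, not results from the paper.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Toy values only: automated rewards and human ratings for four diagrams.
auto_rewards = [0.2, 0.5, 0.9, 0.7]
human_ratings = [1, 2, 5, 4]
rho = spearman(auto_rewards, human_ratings)
print(round(rho, 3))
```

A rank correlation is used here because the two scales differ (a reward in [0, 1] versus an ordinal rating); a low rho on real data would indicate that the reward orders diagrams differently from human judges.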

Circularity Check

0 steps flagged

No circularity; derivation relies on novel Viva mechanism, new dataset, and external benchmarks

full rationale

The paper introduces OmniDiagram as a unified framework, proposes Viva as a generative visual interrogation reward for RL without ground-truth annotations, and constructs the M3²Diagram dataset. The SOTA claim is supported by experimental results on benchmarks rather than any self-definitional reduction, fitted parameter renamed as prediction, or load-bearing self-citation chain. The derivation chain remains self-contained with independent components and external evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim depends on the effectiveness of the Viva visual feedback loop and the quality of the newly constructed dataset; no free parameters are explicitly named in the abstract.

axioms (1)
  • domain assumption: Generative visual inquiries can reliably assess and improve the structural fidelity of rendered diagrams in an RL loop.
    This is the core premise of the Viva strategy as described.
invented entities (1)
  • Viva (Visual Interrogation Verifies All): no independent evidence
    purpose: Provide fine-grained visual feedback for RL optimization of diagram code without ground-truth annotations
    Newly introduced reward mechanism whose independent validation is not described in the abstract.


