arxiv: 2604.14941 · v1 · submitted 2026-04-16 · 💻 cs.CL

Text2Arch: A Dataset for Generating Scientific Architecture Diagrams from Natural Language Descriptions

Pith reviewed 2026-05-10 11:47 UTC · model grok-4.3

classification 💻 cs.CL
keywords text-to-diagram · architecture diagrams · DOT code · language models · dataset · semantic fidelity · diagram generation · fine-tuning

The pith

A new dataset of text-to-DOT pairs lets small language models generate scientific architecture diagrams at GPT-4o level.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates Text2Arch, a large open dataset that pairs natural language descriptions of scientific architectures with their corresponding diagrams and DOT code representations. This resource supports fine-tuning of smaller language models to translate text descriptions into code that renders accurate diagrams. Experiments demonstrate that the resulting models exceed the performance of prior baselines such as DiagramAgent while matching the output quality obtained from GPT-4o through in-context learning. The contribution fills a gap where no suitable large-scale public data previously existed for this text-to-diagram task.

Core claim

By releasing Text2Arch, which supplies aligned textual descriptions, visual architecture images, and DOT code for a wide range of scientific systems, the authors show that fine-tuned small language models can produce high-fidelity diagrams from text input, outperforming existing specialized baselines and reaching parity with in-context learning from GPT-4o.

What carries the argument

The Text2Arch dataset, consisting of scientific architecture images, their natural language descriptions, and associated DOT code representations, which supplies supervised training pairs for mapping semantics to diagram code.
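
To make the shape of such a training pair concrete, here is a minimal Python sketch. The field names (`description`, `image_path`, `dot_code`), the prompt wording, and the perceptron-style example are illustrative assumptions, not the dataset's actual schema or contents.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArchExample:
    """One hypothetical Text2Arch-style record (field names are assumed)."""
    description: str  # natural language description of the architecture
    image_path: str   # path to the source diagram image
    dot_code: str     # Graphviz DOT code intended to render the diagram


def to_supervised_pair(example):
    """Turn a record into a (prompt, target) pair for supervised fine-tuning."""
    prompt = (
        "Convert the following description into DOT code.\n"
        f"Description: {example.description}"
    )
    return prompt, example.dot_code


# Made-up record, for illustration only.
example = ArchExample(
    description=(
        "Inputs are multiplied by weights, summed at a summing junction, "
        "and passed through an activation function to produce the output."
    ),
    image_path="figures/perceptron.png",
    dot_code=(
        "digraph {\n"
        '  0 [label="Weighted inputs"];\n'
        '  1 [label="Summing junction"];\n'
        '  2 [label="Activation function"];\n'
        '  3 [label="Output"];\n'
        "  0 -> 1;\n"
        "  1 -> 2;\n"
        "  2 -> 3;\n"
        "}"
    ),
)

prompt, target = to_supervised_pair(example)
```

The image is carried along in the record but plays no part in the text-to-DOT pair itself, which matches the summary's framing of the task as mapping text to diagram code.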

If this is right

  • Automated generation of clear visual aids for complex system designs in enterprise and research settings.
  • Reduced ambiguity when conveying scientific processes through combined text and diagram outputs.
  • Support for educational tools that convert descriptions into ready-to-use architecture visuals.
  • Public models and data that enable further development of text-to-diagram systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same paired text-and-code approach could be adapted to generate other diagram types such as flowcharts or network topologies.
  • Integration with documentation pipelines might allow automatic visual updates whenever system descriptions change.
  • The dataset could serve as a benchmark for testing how well future models preserve technical relationships in generated visuals.

Load-bearing premise

The collected dataset and chosen metrics accurately reflect real-world scientific descriptions and true improvements in diagram semantic fidelity without selection biases.

What would settle it

Evaluating the fine-tuned models on a fresh collection of independently sourced real scientific text descriptions and finding that generated diagrams show no measurable gain in accuracy or fidelity over the baselines.
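
One way to make "no measurable gain" operational is a paired bootstrap over per-example scores on the fresh collection. The sketch below is an assumption about how such a check could be run, not something the paper describes; `paired_bootstrap` is a hypothetical helper and the scores are made-up placeholders.

```python
import random


def paired_bootstrap(model_scores, baseline_scores, n_resamples=2000, seed=0):
    """Return (mean difference, lower, upper) with a 95% percentile interval."""
    if len(model_scores) != len(baseline_scores):
        raise ValueError("scores must be paired per example")
    # Work on per-example differences so each resample keeps the pairing.
    diffs = [m - b for m, b in zip(model_scores, baseline_scores)]
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        sample = [rng.choice(diffs) for _ in diffs]
        means.append(sum(sample) / len(sample))
    means.sort()
    lower = means[int(0.025 * n_resamples)]
    upper = means[int(0.975 * n_resamples) - 1]
    return sum(diffs) / len(diffs), lower, upper


# Placeholder scores on a 0-5 scale, one pair per test description.
model_scores = [4, 5, 3, 4, 5, 4]
baseline_scores = [3, 4, 3, 2, 4, 4]
mean_diff, lower, upper = paired_bootstrap(model_scores, baseline_scores)
```

If the interval comfortably contains zero, the fresh data would give no evidence of a gain over the baseline; an interval entirely above zero would point the other way.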

Figures

Figures reproduced from arXiv: 2604.14941 by Manish Gupta, Sankalp Mittal, Shivank Garg.

Figure 2: TEXT2ARCH Dataset Curation. ... images are stratified split into train and validation so as to maintain the same ratio of arch vs no-arch images in train and test. We train multiple models like CLIP (Radford et al., 2021), ViT (Dosovitskiy et al., 2020), BEiT (Bao et al., 2021), and ResNet (He et al., 2016), and report results in ...
Figure 4: Case Study 1: Comparison showing DeepSeek-7B inference significantly outperforming ...
Figure 5: Illustrations for generated dots using DiagramAgent (left), GPT (right top) and fewShot ...
Figure 6: Case Study 2: Comparison showing DeepSeek-7B inference significantly outperforming ...
Figure 7: Illustrations for generated dots using DiagramAgent (left), GPT (middle) and fewShot ...
Figure 8: Case Study 3: Comparison showing DeepSeek-7B inference significantly outperforming ...
Figure 9: Illustrations for generated dots using DiagramAgent (left), GPT (middle) and fewShot ...

Original abstract

Communicating complex system designs or scientific processes through text alone is inefficient and prone to ambiguity. A system that automatically generates scientific architecture diagrams from text with high semantic fidelity can be useful in multiple applications like enterprise architecture visualization, AI-driven software design, and educational content creation. Hence, in this paper, we focus on leveraging language models to perform semantic understanding of the input text description to generate intermediate code that can be processed to generate high-fidelity architecture diagrams. Unfortunately, no clean large-scale open-access dataset exists, implying lack of any effective open models for this task. Hence, we contribute a comprehensive dataset, Text2Arch, comprising scientific architecture images, their corresponding textual descriptions, and associated DOT code representations. Leveraging this resource, we fine-tune a suite of small language models, and also perform in-context learning using GPT-4o. Through extensive experimentation, we show that Text2Arch models significantly outperform existing baseline models like DiagramAgent and perform at par with in-context learning-based generations from GPT-4o. We make the code, data and models publicly available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces the Text2Arch dataset, which pairs natural language descriptions of scientific architectures with corresponding diagram images and DOT code representations. It fine-tunes small language models on this resource and evaluates in-context learning with GPT-4o, claiming that the resulting models significantly outperform baselines such as DiagramAgent while matching the performance of GPT-4o ICL generations.

Significance. If the dataset construction and evaluations prove robust, the work supplies a much-needed open resource for text-to-diagram generation in scientific and architectural domains, where prior datasets were absent. The public release of data, code, and models strengthens reproducibility and enables follow-on research.

major comments (2)
  1. [Abstract and experimental sections] The abstract asserts outperformance 'through extensive experimentation' yet provides no details on dataset size, construction process, train/test splits, chosen evaluation metrics, or statistical tests. These omissions are load-bearing for the central empirical claims and prevent verification that the reported gains are supported by the data.
  2. [Evaluation methodology, likely Section 4 or 5] Generating and scoring DOT code raises a direct concern for semantic fidelity. Multiple syntactically distinct DOT strings can render identical diagrams (different edge orders, attribute placements, or subgraph groupings). If metrics are string-based (e.g., BLEU/ROUGE on raw DOT), they risk rewarding surface-form matches rather than visual or semantic equivalence; the same limitation applies to the DiagramAgent baseline, so relative gains may be metric artifacts.
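
The second comment can be illustrated with a small Python sketch. It parses a deliberately narrow subset of DOT (node statements of the form `id [label="..."]` and single `a -> b` edges, with no subgraphs, ports, edge attributes, or edge chains) and compares two graphs by their label-level node and edge sets. The helper names `parse_simple_dot` and `same_structure` are hypothetical, and this is not the metric used in the paper.

```python
import re

# Narrow DOT subset only: `id [label="..."]` nodes and single `a -> b` edges.
NODE_RE = re.compile(r'(\w+)\s*\[\s*label\s*=\s*"([^"]*)"')
EDGE_RE = re.compile(r"(\w+)\s*->\s*(\w+)")


def parse_simple_dot(dot):
    """Return (set of node labels, set of label-level directed edges)."""
    labels = {}
    raw_edges = []
    for statement in re.split(r"[;\n]", dot):
        edge = EDGE_RE.search(statement)
        if edge:
            raw_edges.append(edge.groups())
            continue
        node = NODE_RE.search(statement)
        if node:
            labels[node.group(1)] = node.group(2)
    # Resolve IDs to labels only after all statements are read, so that
    # edges declared before their nodes are handled too.
    edges = frozenset(
        (labels.get(src, src), labels.get(dst, dst)) for src, dst in raw_edges
    )
    return frozenset(labels.values()), edges


def same_structure(dot_a, dot_b):
    """True when both strings describe the same labelled directed graph."""
    return parse_simple_dot(dot_a) == parse_simple_dot(dot_b)


DOT_A = """digraph {
  0 [label="Encoder"];
  1 [label="Decoder"];
  2 [label="Output"];
  0 -> 1;
  1 -> 2;
}"""

# Same graph: different node IDs and statement order.
DOT_B = """digraph {
  b -> c;
  a -> b;
  c [label="Output"];
  a [label="Encoder"];
  b [label="Decoder"];
}"""

# Different graph: the second edge is reversed.
DOT_C = DOT_A.replace("1 -> 2", "2 -> 1")

print(DOT_A == DOT_B, same_structure(DOT_A, DOT_B), same_structure(DOT_A, DOT_C))
```

`DOT_A` and `DOT_B` differ as strings yet describe the same graph, which is the case a raw string metric would penalize. Note that comparing by label collapses nodes that share a label, so a fuller check would need graph matching rather than set equality.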

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We appreciate the emphasis on transparency in the abstract and experimental design as well as the important methodological point about evaluating DOT code. We address each major comment below and outline the revisions we will make.

Point-by-point responses
  1. Referee: [Abstract and experimental sections] The abstract asserts outperformance 'through extensive experimentation' yet provides no details on dataset size, construction process, train/test splits, chosen evaluation metrics, or statistical tests. These omissions are load-bearing for the central empirical claims and prevent verification that the reported gains are supported by the data.

    Authors: We agree that the abstract would be strengthened by including these key details to allow immediate verification of the claims. In the revised manuscript we will expand the abstract to report the total dataset size, a concise summary of the construction process, the train/test split ratios, the primary evaluation metrics, and any statistical significance tests performed. We will also review the experimental sections to ensure all of these elements are explicitly stated with clear cross-references and will add statistical tests where they are currently absent. These changes will make the empirical support for our results fully transparent.

    Revision: yes

  2. Referee: [Evaluation methodology, likely Section 4 or 5] Generating and scoring DOT code raises a direct concern for semantic fidelity. Multiple syntactically distinct DOT strings can render identical diagrams (different edge orders, attribute placements, or subgraph groupings). If metrics are string-based (e.g., BLEU/ROUGE on raw DOT), they risk rewarding surface-form matches rather than visual or semantic equivalence; the same limitation applies to the DiagramAgent baseline, so relative gains may be metric artifacts.

    Authors: This concern is valid: string-based metrics on raw DOT code can indeed be sensitive to superficial syntactic differences that do not affect the rendered diagram. Because the DiagramAgent baseline is scored with the same metrics, the relative gains we report remain meaningful as a comparative measure, yet we acknowledge the limitation. In the revision we will add an explicit discussion of this issue in the evaluation section, describe any steps taken to mitigate surface-form variance (such as canonicalization where applied), and include qualitative examples of equivalent diagrams produced by different DOT strings. If space permits we will also report supplementary visual similarity results on rendered diagrams to further support semantic fidelity.

    Revision: partial
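
A structure-aware complement to string metrics, of the kind this exchange points toward, is to score the overlap between edge sets once both graphs have been reduced to (source label, target label) pairs. The sketch below assumes that reduction has already been done; the function name `edge_f1` and the example edges are hypothetical, and the paper's own metrics may differ.

```python
def edge_f1(predicted, reference):
    """Precision, recall and F1 over directed, label-level edges."""
    if not predicted and not reference:
        # Two empty graphs agree trivially.
        return {"precision": 1.0, "recall": 1.0, "f1": 1.0}
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up edges: the generated graph reverses one of two reference edges.
reference_edges = {("Encoder", "Decoder"), ("Decoder", "Output")}
generated_edges = {("Encoder", "Decoder"), ("Output", "Decoder")}

scores = edge_f1(generated_edges, reference_edges)
```

Because edges are compared as unordered sets of directed pairs, statement order and node IDs have no effect on the score, while a reversed arrow counts as an error. Exact label matching is a strong assumption; paraphrased labels would need a softer matching step.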

Circularity Check

0 steps flagged

Empirical dataset release and benchmarking with no derivation chain

The paper contributes a new dataset (Text2Arch) of scientific architecture images, text descriptions, and DOT code, then fine-tunes small LMs and evaluates GPT-4o ICL against external baselines such as DiagramAgent. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided abstract or described content. All claims rest on empirical comparisons to independent external models and metrics; the work is self-contained against external benchmarks with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review reveals no mathematical axioms, free parameters, or invented entities; the work relies on standard language-model fine-tuning practices whose hyperparameters are not specified.


