Text2Arch: A Dataset for Generating Scientific Architecture Diagrams from Natural Language Descriptions
Pith reviewed 2026-05-10 11:47 UTC · model grok-4.3
The pith
A new dataset of text-to-DOT pairs lets small language models generate scientific architecture diagrams at GPT-4o level.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By releasing Text2Arch, which supplies aligned textual descriptions, visual architecture images, and DOT code for a wide range of scientific systems, the authors show that fine-tuned small language models can produce high-fidelity diagrams from text input, outperforming existing specialized baselines and reaching parity with in-context learning from GPT-4o.
What carries the argument
The Text2Arch dataset, consisting of scientific architecture images, their natural language descriptions, and associated DOT code representations, which supplies supervised training pairs for mapping semantics to diagram code.
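As a rough illustration of what one such supervised pair might look like in code, the sketch below models a single record and turns it into a (prompt, target) pair. The class name `ArchExample`, its field names, the helper `to_training_pair`, and the instruction wording are all hypothetical; the actual schema of the released dataset is not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArchExample:
    """One hypothetical record: the field names are assumptions, not the dataset's schema."""
    description: str  # natural language description of the architecture
    image_path: str   # path to the architecture image
    dot_code: str     # DOT code representing the same diagram


def to_training_pair(
    example: ArchExample,
    instruction: str = "Convert the following description into DOT language code:",
) -> tuple[str, str]:
    """Build a (prompt, target) pair for supervised fine-tuning (illustrative only)."""
    prompt = f"{instruction}\n{example.description}"
    return prompt, example.dot_code


record = ArchExample(
    description="An encoder feeds a decoder.",
    image_path="figures/example.png",  # placeholder path
    dot_code='digraph { 0 [label="Encoder"]; 1 [label="Decoder"]; 0 -> 1; }',
)
prompt, target = to_training_pair(record)
```

The image is carried along in the record but not used in the pair, since the task as described maps text to diagram code.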
If this is right
- Automated generation of clear visual aids for complex system designs in enterprise and research settings.
- Reduced ambiguity when conveying scientific processes through combined text and diagram outputs.
- Support for educational tools that convert descriptions into ready-to-use architecture visuals.
- Public models and data that enable further development of text-to-diagram systems.
Where Pith is reading between the lines
- The same paired text-and-code approach could be adapted to generate other diagram types such as flowcharts or network topologies.
- Integration with documentation pipelines might allow automatic visual updates whenever system descriptions change.
- The dataset could serve as a benchmark for testing how well future models preserve technical relationships in generated visuals.
Load-bearing premise
The collected dataset and chosen metrics accurately reflect real-world scientific descriptions and true improvements in diagram semantic fidelity without selection biases.
What would settle it
Evaluating the fine-tuned models on a fresh collection of independently sourced real scientific text descriptions and finding that generated diagrams show no measurable gain in accuracy or fidelity over the baselines.
Original abstract
Communicating complex system designs or scientific processes through text alone is inefficient and prone to ambiguity. A system that automatically generates scientific architecture diagrams from text with high semantic fidelity can be useful in multiple applications like enterprise architecture visualization, AI-driven software design, and educational content creation. Hence, in this paper, we focus on leveraging language models to perform semantic understanding of the input text description to generate intermediate code that can be processed to generate high-fidelity architecture diagrams. Unfortunately, no clean large-scale open-access dataset exists, implying lack of any effective open models for this task. Hence, we contribute a comprehensive dataset, Text2Arch, comprising scientific architecture images, their corresponding textual descriptions, and associated DOT code representations. Leveraging this resource, we fine-tune a suite of small language models, and also perform in-context learning using GPT-4o. Through extensive experimentation, we show that Text2Arch models significantly outperform existing baseline models like DiagramAgent and perform at par with in-context learning-based generations from GPT-4o. We make the code, data and models publicly available.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the Text2Arch dataset, which pairs natural language descriptions of scientific architectures with corresponding diagram images and DOT code representations. It fine-tunes small language models on this resource and evaluates in-context learning with GPT-4o, claiming that the resulting models significantly outperform baselines such as DiagramAgent while matching the performance of GPT-4o ICL generations.
Significance. If the dataset construction and evaluations prove robust, the work supplies a much-needed open resource for text-to-diagram generation in scientific and architectural domains, where prior datasets were absent. The public release of data, code, and models strengthens reproducibility and enables follow-on research.
major comments (2)
- [Abstract and experimental sections] The abstract asserts outperformance 'through extensive experimentation' yet provides no details on dataset size, construction process, train/test splits, chosen evaluation metrics, or statistical tests. These omissions are load-bearing for the central empirical claims and prevent verification that the reported gains are supported by the data.
- [Evaluation methodology, likely Section 4 or 5] Generating and scoring DOT code raises a direct concern for semantic fidelity. Multiple syntactically distinct DOT strings can render identical diagrams (different edge orders, attribute placements, or subgraph groupings). If metrics are string-based (e.g., BLEU/ROUGE on raw DOT), they risk rewarding surface-form matches rather than visual or semantic equivalence; the same limitation applies to the DiagramAgent baseline, so relative gains may be metric artifacts.
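The second comment can be made concrete with a small sketch: two DOT strings that differ in node IDs, statement order, and separators, yet describe the same labelled graph. The parser below handles only a restricted DOT subset (labelled nodes and plain `->` edges, no subgraphs or extra attributes), and the names `parse_simple_dot` and `same_graph` are hypothetical helpers written for this illustration, not part of the paper's evaluation.

```python
import re

NODE_RE = re.compile(r'(\w+)\s*\[\s*label\s*=\s*"([^"]*)"\s*\]')
EDGE_RE = re.compile(r"(\w+)\s*->\s*(\w+)")


def parse_simple_dot(dot: str) -> tuple[dict[str, str], set[tuple[str, str]]]:
    """Parse a restricted DOT subset into (node labels, directed edges)."""
    body = dot[dot.index("{") + 1 : dot.rindex("}")]
    labels: dict[str, str] = {}
    edges: set[tuple[str, str]] = set()
    # Assumption: labels contain no ';' or newline, so splitting on them is safe.
    for stmt in re.split(r"[;\n]", body):
        stmt = stmt.strip()
        if not stmt:
            continue
        node = NODE_RE.fullmatch(stmt)
        edge = EDGE_RE.fullmatch(stmt)
        if node:
            labels[node.group(1)] = node.group(2)
        elif edge:
            edges.add((edge.group(1), edge.group(2)))
        else:
            raise ValueError(f"unsupported DOT statement: {stmt!r}")
    return labels, edges


def same_graph(dot_a: str, dot_b: str) -> bool:
    """Compare two graphs by node labels and label-to-label edges.

    Limitation: nodes that share a label are treated as one node.
    """
    def by_label(labels, edges):
        return {(labels.get(s, s), labels.get(t, t)) for s, t in edges}

    labels_a, edges_a = parse_simple_dot(dot_a)
    labels_b, edges_b = parse_simple_dot(dot_b)
    return (
        set(labels_a.values()) == set(labels_b.values())
        and by_label(labels_a, edges_a) == by_label(labels_b, edges_b)
    )


DOT_A = 'digraph { 0 [label="Input"]; 1 [label="Encoder"]; 2 [label="Output"]; 0 -> 1; 1 -> 2; }'
DOT_B = 'digraph {\n  a [label="Output"]\n  b [label="Input"]\n  c [label="Encoder"]\n  c -> a\n  b -> c\n}'

print(DOT_A == DOT_B, same_graph(DOT_A, DOT_B))  # False True
```

A string-overlap metric would penalize `DOT_B` against `DOT_A` even though a structural comparison treats them as equivalent, which is the gap the referee points to.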
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We appreciate the emphasis on transparency in the abstract and experimental design as well as the important methodological point about evaluating DOT code. We address each major comment below and outline the revisions we will make.
Point-by-point responses
- Referee: [Abstract and experimental sections] The abstract asserts outperformance 'through extensive experimentation' yet provides no details on dataset size, construction process, train/test splits, chosen evaluation metrics, or statistical tests. These omissions are load-bearing for the central empirical claims and prevent verification that the reported gains are supported by the data.
  Authors: We agree that the abstract would be strengthened by including these key details to allow immediate verification of the claims. In the revised manuscript we will expand the abstract to report the total dataset size, a concise summary of the construction process, the train/test split ratios, the primary evaluation metrics, and any statistical significance tests performed. We will also review the experimental sections to ensure all of these elements are explicitly stated with clear cross-references and will add statistical tests where they are currently absent. These changes will make the empirical support for our results fully transparent.
  Revision: yes
- Referee: [Evaluation methodology, likely Section 4 or 5] Generating and scoring DOT code raises a direct concern for semantic fidelity. Multiple syntactically distinct DOT strings can render identical diagrams (different edge orders, attribute placements, or subgraph groupings). If metrics are string-based (e.g., BLEU/ROUGE on raw DOT), they risk rewarding surface-form matches rather than visual or semantic equivalence; the same limitation applies to the DiagramAgent baseline, so relative gains may be metric artifacts.
  Authors: This concern is valid: string-based metrics on raw DOT code can indeed be sensitive to superficial syntactic differences that do not affect the rendered diagram. Because the DiagramAgent baseline is scored with the same metrics, the relative gains we report remain meaningful as a comparative measure, yet we acknowledge the limitation. In the revision we will add an explicit discussion of this issue in the evaluation section, describe any steps taken to mitigate surface-form variance (such as canonicalization where applied), and include qualitative examples of equivalent diagrams produced by different DOT strings. If space permits we will also report supplementary visual similarity results on rendered diagrams to further support semantic fidelity.
  Revision: partial
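The rebuttal mentions canonicalization without saying how it would be done. One possible form, sketched here under stated assumptions, is to renumber nodes by sorted label and emit node and edge statements in sorted order before any string-based scoring. The function `canonicalize_dot` is a hypothetical helper for a restricted DOT subset (labelled nodes and plain `->` edges only); it is not the authors' procedure, and `difflib.SequenceMatcher` stands in for a string metric purely for illustration.

```python
import re
from difflib import SequenceMatcher


def canonicalize_dot(dot: str) -> str:
    """Rewrite a restricted DOT graph into a canonical textual form.

    Nodes are renumbered by sorted label and statements are sorted, so that
    ID choice and statement order no longer affect the string.
    Limitation: nodes sharing a label are ordered by their original IDs,
    so the result is not fully canonical in that case.
    """
    body = dot[dot.index("{") + 1 : dot.rindex("}")]
    labels: dict[str, str] = {}
    edges: list[tuple[str, str]] = []
    # Assumption: labels contain no ';' or newline.
    for stmt in re.split(r"[;\n]", body):
        stmt = stmt.strip()
        if not stmt:
            continue
        node = re.fullmatch(r'(\w+)\s*\[\s*label\s*=\s*"([^"]*)"\s*\]', stmt)
        edge = re.fullmatch(r"(\w+)\s*->\s*(\w+)", stmt)
        if node:
            labels[node.group(1)] = node.group(2)
        elif edge:
            edges.append((edge.group(1), edge.group(2)))
        else:
            raise ValueError(f"unsupported DOT statement: {stmt!r}")
    # Nodes that appear only in edges use their ID as the label.
    for src, dst in edges:
        labels.setdefault(src, src)
        labels.setdefault(dst, dst)
    order = sorted(labels, key=lambda n: (labels[n], n))
    new_id = {old: i for i, old in enumerate(order)}
    lines = ["digraph {"]
    lines += [f'  {new_id[n]} [label="{labels[n]}"];' for n in order]
    lines += [f"  {s} -> {t};" for s, t in sorted({(new_id[s], new_id[t]) for s, t in edges})]
    lines.append("}")
    return "\n".join(lines)


DOT_A = 'digraph { 0 [label="Input"]; 1 [label="Encoder"]; 2 [label="Output"]; 0 -> 1; 1 -> 2; }'
DOT_B = 'digraph {\n  a [label="Output"]\n  b [label="Input"]\n  c [label="Encoder"]\n  c -> a\n  b -> c\n}'

raw_similarity = SequenceMatcher(None, DOT_A, DOT_B).ratio()
canonical_similarity = SequenceMatcher(
    None, canonicalize_dot(DOT_A), canonicalize_dot(DOT_B)
).ratio()
```

Here the raw strings score below 1.0 while the canonical forms are identical. Canonicalization of this kind removes ordering and naming variance but says nothing about layout or rendered appearance, so it complements rather than replaces the visual similarity results the authors propose.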
Circularity Check
Empirical dataset release and benchmarking with no derivation chain
The paper contributes a new dataset (Text2Arch) of scientific architecture images, text descriptions, and DOT code, then fine-tunes small LMs and evaluates GPT-4o ICL against external baselines such as DiagramAgent. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided abstract or described content. All claims rest on empirical comparisons to independent external models and metrics; the work is self-contained against external benchmarks with no load-bearing self-referential steps.