pith. machine review for the scientific record.

arXiv: 2511.14967 · v2 · submitted 2025-11-18 · 💻 cs.SE · cs.AI · cs.LG

MermaidSeqBench: An Evaluation Benchmark for NL-to-Mermaid Sequence Diagram Generation

Pith reviewed 2026-05-17 20:02 UTC · model grok-4.3

classification: 💻 cs.SE · cs.AI · cs.LG
keywords: Mermaid sequence diagrams · LLM evaluation benchmark · natural language to diagram · software engineering diagrams · LLM-as-a-judge · syntax correctness · diagram generation

The pith

MermaidSeqBench introduces a 132-sample benchmark to measure how well LLMs generate Mermaid sequence diagrams from natural language prompts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses the absence of reliable tests for LLMs turning text descriptions into Mermaid sequence diagrams used in software engineering. It builds MermaidSeqBench through human-verified flows, LLM augmentation, and rule-based expansion to create a dataset of 132 examples. An LLM-as-a-judge then scores outputs on syntax correctness, activation and error handling, and practical usability. Tests across multiple state-of-the-art models and judges show clear performance differences. The benchmark aims to set standards for deploying these models in real production settings.

Core claim

MermaidSeqBench is a human-verified and LLM-synthetically-extended benchmark of 132 samples that evaluates LLM generation of Mermaid sequence diagrams using fine-grained metrics such as syntax correctness, activation handling, error handling, and practical usability via an LLM-as-a-judge, with initial evaluations revealing significant capability gaps across models.

What carries the argument

The hybrid creation process of human-verified flows, LLM-based augmentation, and rule-based expansion, together with LLM-as-a-judge scoring on syntax, activation, error handling, and usability metrics.
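
The scoring half of this machinery can be pictured as a table of judge outputs that is averaged per model and per metric. The sketch below is only an illustration under stated assumptions: the record layout, the 0 to 1 score scale, and the model and judge names are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from statistics import mean

# Metric names follow the review text; the 0-1 scale is an assumption.
METRICS = ("syntax", "activation_handling", "error_handling", "practical_usability")


def aggregate_scores(records):
    """Average each metric per model over all samples and judges.

    Each record is a dict with keys: model, judge, sample_id, scores
    (scores maps a metric name to a float).
    """
    buckets = defaultdict(lambda: defaultdict(list))
    for rec in records:
        for metric in METRICS:
            buckets[rec["model"]][metric].append(rec["scores"][metric])
    return {
        model: {metric: mean(values) for metric, values in per_metric.items()}
        for model, per_metric in buckets.items()
    }


# Hypothetical judge outputs: two judges scoring one model on one sample.
records = [
    {"model": "model-a", "judge": "judge-1", "sample_id": 0,
     "scores": {"syntax": 1.0, "activation_handling": 0.5,
                "error_handling": 0.5, "practical_usability": 1.0}},
    {"model": "model-a", "judge": "judge-2", "sample_id": 0,
     "scores": {"syntax": 1.0, "activation_handling": 1.0,
                "error_handling": 0.5, "practical_usability": 0.5}},
]
summary = aggregate_scores(records)
```

Keeping the judge name on every record makes it possible to report per-judge averages as well, which matters when the judges disagree.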

If this is right

  • Supplies a concrete dataset and scoring method to compare LLMs on structured diagram tasks.
  • Identifies which models meet the correctness thresholds required for software engineering use.
  • Supports repeated testing as new models appear or prompts change.
  • Creates measurable standards for diagram generation that can guide model improvement.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same hybrid construction approach could extend to other Mermaid diagram types such as flowcharts or class diagrams.
  • Benchmarks like this may highlight specific prompt patterns where current models consistently fail, guiding targeted training data collection.
  • Over time the dataset could serve as a fixed test set to track progress in diagram generation accuracy.

Load-bearing premise

An LLM-as-a-judge can reliably score fine-grained metrics such as syntax correctness and practical usability without complete human verification of every judgment.

What would settle it

A human re-scoring of a representative subset of model outputs that shows low agreement with the LLM judge on activation handling or usability scores.
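
A minimal sketch of what such a re-scoring check could compute, assuming human and judge scores are integers on a shared 1 to 5 scale (the scale and the example numbers are hypothetical, not from the paper): the exact agreement rate and the Pearson correlation between the two score lists.

```python
from math import sqrt


def exact_agreement(human, judge):
    """Fraction of items where the human and the LLM judge gave the same score."""
    if len(human) != len(judge) or not human:
        raise ValueError("score lists must be non-empty and of equal length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


# Hypothetical scores for six generated diagrams on one metric.
human_scores = [1, 2, 3, 4, 5, 3]
judge_scores = [1, 2, 3, 5, 5, 2]

agreement = exact_agreement(human_scores, judge_scores)
correlation = pearson(human_scores, judge_scores)
```

Exact agreement is strict on an ordinal scale, so a chance-corrected statistic such as Cohen's kappa would be a reasonable complement; it is left out here to keep the sketch short.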

Figures

Figures reproduced from arXiv: 2511.14967 by Basel Shbita, Chad DeLuca, Farhan Ahmed.

Figure 1: A UML sequence diagram from our benchmark, illustrating the “Uploading Documents ...
Figure 2: A UML sequence diagram from our benchmark, illustrating the “Chatbot Interaction for ...

Original abstract

Large language models (LLMs) have shown great promise in generating structured diagrams from natural language descriptions, particularly Mermaid sequence diagrams for software engineering. However, the lack of existing benchmarks to assess the LLM's correctness on this task hinders the reliable deployment of these models in production environments. To address this shortcoming, we introduce MermaidSeqBench, a human-verified and LLM-synthetically-extended benchmark for assessing LLM capabilities in generating Mermaid sequence diagrams from natural language prompts. The benchmark consists of 132 samples developed via a hybrid methodology of human-verified flows, LLM-based augmentation, and rule-based expansion. The evaluation uses an LLM-as-a-judge model to assess generation across various fine-grained metrics such as syntax correctness, activation handling, error handling, and practical usability. To demonstrate the effectiveness and flexibility of our benchmark, we perform initial evaluations on numerous state-of-the-art LLMs with multiple LLM judges which reveal significant capability gaps across models and evaluation modes. MermaidSeqBench provides a foundation for evaluating structured diagram generation and establishes the correctness standards needed for real-world software engineering deployment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces MermaidSeqBench, a benchmark of 132 hybrid human-verified and LLM-synthetically-extended samples for evaluating LLMs on generating Mermaid sequence diagrams from natural language prompts. It employs an LLM-as-a-judge protocol to score outputs on fine-grained metrics including syntax correctness, activation handling, error handling, and practical usability, and reports initial evaluations on multiple state-of-the-art LLMs that reveal significant capability gaps.

Significance. If the evaluation protocol is shown to be reliable, MermaidSeqBench would address a clear gap in benchmarks for structured diagram generation in software engineering, offering a foundation for assessing correctness standards needed for real-world deployment of LLMs in producing sequence diagrams.

major comments (2)
  1. [Evaluation / Abstract] The evaluation protocol (described in the abstract and evaluation sections) relies on LLM-as-a-judge scoring for nuanced metrics such as activation handling and practical usability, yet reports no human agreement validation, inter-annotator agreement, or spot-check correlation results. These metrics depend on semantic understanding of diagram flows, where LLM judges can mis-evaluate; without such validation, the claimed capability gaps and the benchmark's stated role as a 'foundation' for evaluating diagram generation cannot be treated as reliable.
  2. [Benchmark Construction] The central claim that the 132-sample benchmark 'provides a foundation for evaluating structured diagram generation' depends on the hybrid construction (human-verified flows + LLM augmentation + rule-based expansion) being representative and unbiased, yet no analysis of selection criteria for the human-verified subset or potential artifacts from synthetic extension is provided.
minor comments (2)
  1. [Abstract] The abstract refers to evaluations on 'numerous state-of-the-art LLMs with multiple LLM judges' but does not name the specific models or judges; a summary table of results would improve clarity.
  2. [Evaluation] Notation for the fine-grained metrics (e.g., how 'activation handling' is operationalized) could be defined more explicitly to aid reproducibility.
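
One way to make a metric like 'activation handling' concrete is a rule-based balance check over the generated Mermaid source. The sketch below is an assumption about how such a check might look and is not the paper's metric: it tracks the `activate` / `deactivate` keywords and the `+` / `-` arrow suffixes of Mermaid sequence diagrams, and it covers only the common arrow forms.

```python
import re
from collections import Counter

# Matches messages such as "Alice->>+John: Hello"; common arrow forms only.
MESSAGE = re.compile(r"^([\w ]+?)\s*(-{1,2}(?:>>|>|x|\)))\s*([+-]?)\s*([\w ]+?)\s*:")
EXPLICIT = re.compile(r"^(activate|deactivate)\s+(.+)$")


def unbalanced_activations(mermaid_source):
    """Return a list of human-readable activation problems (empty if balanced)."""
    depth = Counter()
    problems = []
    for lineno, line in enumerate(mermaid_source.splitlines(), start=1):
        stripped = line.strip()
        events = []
        explicit = EXPLICIT.match(stripped)
        if explicit:
            events.append((explicit.group(1), explicit.group(2).strip()))
        else:
            msg = MESSAGE.match(stripped)
            if msg:
                sender, _arrow, flag, receiver = msg.groups()
                if flag == "+":    # "+" activates the receiver
                    events.append(("activate", receiver.strip()))
                elif flag == "-":  # "-" deactivates the sender
                    events.append(("deactivate", sender.strip()))
        for kind, who in events:
            if kind == "activate":
                depth[who] += 1
            elif depth[who] == 0:
                problems.append(f"line {lineno}: deactivate {who} without activation")
            else:
                depth[who] -= 1
    problems.extend(f"{who} left active at end" for who, d in depth.items() if d > 0)
    return problems


balanced = """sequenceDiagram
    Alice->>+John: Hello
    John-->>-Alice: Hi
"""
broken = """sequenceDiagram
    activate BFF
    App->>BFF: upload
    deactivate Storage
"""
balanced_report = unbalanced_activations(balanced)
broken_report = unbalanced_activations(broken)
```

A check like this only covers the structural side of activation handling; whether the activated span matches the described flow still needs a semantic judgment.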

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for their insightful comments, which have helped us identify areas for improvement in the manuscript. We respond to each major comment in turn and outline the revisions we intend to make.

Point-by-point responses
  1. Referee: The evaluation protocol (described in the abstract and evaluation sections) relies on LLM-as-a-judge scoring for nuanced metrics such as activation handling and practical usability, yet reports no human agreement validation, inter-annotator agreement, or spot-check correlation results. These metrics depend on semantic understanding of diagram flows, where LLM judges can mis-evaluate; without such validation, the claimed capability gaps and the benchmark's stated role as a 'foundation' for evaluating diagram generation cannot be treated as reliable.

    Authors: We appreciate the referee's concern regarding the validation of our LLM-as-a-judge protocol. Our manuscript already employs multiple distinct LLM judges to cross-validate scores and reduce the risk of mis-evaluation by any single model. However, to further bolster confidence in the results, we will add a human validation study in the revised manuscript. This will involve selecting a representative subset of generated diagrams, having them scored by human experts, and reporting agreement metrics such as percentage agreement and correlation coefficients between human and LLM judgments. We believe this addition will directly address the potential for LLM mis-evaluation on semantic aspects like activation handling.

    Revision: yes

  2. Referee: The central claim that the 132-sample benchmark 'provides a foundation for evaluating structured diagram generation' depends on the hybrid construction (human-verified flows + LLM augmentation + rule-based expansion) being representative and unbiased, yet no analysis of selection criteria for the human-verified subset or potential artifacts from synthetic extension is provided.

    Authors: We concur that a more detailed justification of the benchmark's construction is necessary to support its role as a foundation. In the updated manuscript, we will revise the benchmark construction section to explicitly describe the selection criteria used for the human-verified subset, such as ensuring coverage across various software engineering scenarios, different levels of interaction complexity, and inclusion of both standard and edge-case flows. Furthermore, we will provide an analysis of the LLM-based augmentation and rule-based expansion steps, including the safeguards implemented to detect and mitigate any introduced artifacts, such as post-generation human review for a portion of the synthetic samples. These enhancements will better demonstrate the benchmark's representativeness and lack of bias.

    Revision: yes
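
The cross-judge comparison mentioned in the first response could also feed the promised human study. As a hedged sketch, assuming each judge returns one score per sample on a shared 1 to 5 scale (judge names, scale, and threshold are all hypothetical), samples where the judges disagree most could be prioritised for human review:

```python
def judge_spread(scores_by_judge):
    """Per-sample gap between the highest and lowest judge score.

    scores_by_judge maps a judge name to a list of scores aligned by sample index.
    """
    lengths = {len(scores) for scores in scores_by_judge.values()}
    if len(lengths) != 1:
        raise ValueError("every judge must score the same number of samples")
    per_sample = zip(*scores_by_judge.values())
    return [max(scores) - min(scores) for scores in per_sample]


def flag_for_human_review(scores_by_judge, threshold):
    """Indices of samples whose judge spread reaches the threshold."""
    return [
        index
        for index, spread in enumerate(judge_spread(scores_by_judge))
        if spread >= threshold
    ]


# Hypothetical scores from three judges on four generated diagrams.
scores_by_judge = {
    "judge-1": [5, 4, 2, 5],
    "judge-2": [5, 2, 2, 4],
    "judge-3": [4, 5, 3, 5],
}
spreads = judge_spread(scores_by_judge)
flagged = flag_for_human_review(scores_by_judge, threshold=2)
```

Agreement among LLM judges does not by itself show agreement with humans, so a random subset would still be needed alongside the flagged samples.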

Circularity Check

0 steps flagged

No circularity: benchmark creation and evaluation are self-contained

The paper introduces MermaidSeqBench via a hybrid human/LLM/rule-based construction process and applies an LLM-as-a-judge protocol to score generated diagrams on syntax, activation, error handling, and usability. It contains no equations, fitted parameters, predictions, or derivations that could reduce to their inputs by construction, and no self-citation is invoked as a load-bearing uniqueness theorem or ansatz. The central contribution is a new evaluation artifact rather than a self-referential derivation, so the work is self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are described in the abstract; the core contribution is the construction of the benchmark itself via hybrid human-LLM-rule methods.

pith-pipeline@v0.9.0 · 5492 in / 966 out tokens · 51415 ms · 2026-05-17T20:02:38.390560+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Benchmarking Requirement-to-Architecture Generation with Hybrid Evaluation

    cs.SE 2026-04 unverdicted novelty 7.0

    R2ABench benchmark shows LLMs generate syntactically valid software architectures from requirements but produce structurally fragmented results due to weak relational reasoning.

Appendix A: Sequence Diagrams

The UML sequence diagram in Figure 1 and the accompanying Listing 1 illustrate one of the test cases included in our benchmark dataset. The figure presents the rendered Mermaid sequence diagram describing the flow in uploading documents with se...

  1. User Action: User uploads a document (e.g., an ID or contract) through the mobile app.
  2. Mobile App: Sends the document along with the session token to the BFF.
  3. BFF Validation: Validates the session token with Azure AD. Checks user permissions to ensure they are authorized to upload documents.
  4. Storage Process: The BFF saves metadata about the document (e.g., file name, size, upload timestamp) in the database. The actual document is securely stored in cloud storage (e.g., Azure Blob Storage).
  5. Response: On successful upload, the BFF returns a confirmation to the app. If the user is unauthorized or the file exceeds size limits, an appropriate error is returned.

Listing 3: Natural language specification for the “Uploading Documents with Secure Storage” flow, corresponding to the syntax in Listi...

  1. User Query: The user types a question or query into the mobile app.
  2. Chatbot Engagement: The mobile app sends the query to the Chatbot via the BFF.
  3. Initial Response: The Chatbot processes the query and sends an initial response to the mobile app.
  4. Follow-up Questions: The Chatbot may ask follow-up questions to better understand the user’s issue.
  5. Escalation: If the Chatbot cannot resolve the issue, it escalates the query to a Customer Support Agent.
  6. Agent Interaction: The Customer Support Agent receives the escalated query and interacts with the user through the mobile app.
  7. Resolution: The Customer Support Agent provides a solution or resolution to the user’s issue.
  8. Feedback: The mobile app prompts the user to provide feedback on the support experience.

Listing 4: Natural language specification for the “Chatbot Interaction for Customer Support” flow, corresponding to the syntax in Listing 2 and the rendered diagram in Figure 2.
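
To show what a target output for the document-upload specification might look like, the sketch below assembles Mermaid sequence-diagram text from a list of messages. It is one plausible rendering of the success path only and is not the paper's Listing 1: the participant aliases, message wording, and the helper `build_sequence_diagram` are hypothetical, and the error branch is left out.

```python
def build_sequence_diagram(participants, messages):
    """Assemble Mermaid sequence-diagram source text.

    participants: list of (alias, label) pairs.
    messages: list of (sender, receiver, text, is_reply) tuples;
              replies use the dashed "-->>" arrow, requests use "->>".
    """
    lines = ["sequenceDiagram"]
    for alias, label in participants:
        lines.append(f"    participant {alias} as {label}")
    for sender, receiver, text, is_reply in messages:
        arrow = "-->>" if is_reply else "->>"
        lines.append(f"    {sender}{arrow}{receiver}: {text}")
    return "\n".join(lines)


# Hypothetical aliases for the actors named in the upload specification.
participants = [
    ("U", "User"),
    ("App", "Mobile App"),
    ("BFF", "BFF"),
    ("AD", "Azure AD"),
    ("DB", "Database"),
    ("Blob", "Cloud Storage"),
]
messages = [
    ("U", "App", "Upload document", False),
    ("App", "BFF", "Send document and session token", False),
    ("BFF", "AD", "Validate session token", False),
    ("AD", "BFF", "Token valid", True),
    ("BFF", "DB", "Save document metadata", False),
    ("BFF", "Blob", "Store document", False),
    ("BFF", "App", "Upload confirmation", True),
]
diagram = build_sequence_diagram(participants, messages)
print(diagram)
```

A fuller rendering would wrap the unauthorized and oversized-file cases in an `alt` / `else` block, which is the kind of error handling the benchmark's metrics are described as scoring.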