MermaidSeqBench: An Evaluation Benchmark for NL-to-Mermaid Sequence Diagram Generation
Pith reviewed 2026-05-17 20:02 UTC · model grok-4.3
The pith
MermaidSeqBench introduces a 132-sample benchmark to measure how well LLMs generate Mermaid sequence diagrams from natural language prompts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MermaidSeqBench is a human-verified and LLM-synthetically-extended benchmark of 132 samples for evaluating how well LLMs generate Mermaid sequence diagrams. An LLM-as-a-judge scores each output on fine-grained metrics such as syntax correctness, activation handling, error handling, and practical usability, and initial evaluations reveal significant capability gaps across models.
What carries the argument
The hybrid creation process of human-verified flows, LLM-based augmentation, and rule-based expansion, together with LLM-as-a-judge scoring on syntax, activation, error handling, and usability metrics.
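To make the scoring side concrete, here is a minimal Python sketch of how per-metric judge scores might be pooled across several LLM judges. The four metric names come from the abstract; the `JudgeScore` record, the 0-to-1 score scale, and the model and judge names are illustrative assumptions, not the paper's actual schema or aggregation rule.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean

# Metric names taken from the abstract; everything else here is hypothetical.
METRICS = ("syntax", "activation", "error_handling", "usability")


@dataclass(frozen=True)
class JudgeScore:
    """One judge's score for one generated diagram on one metric (assumed 0..1 scale)."""
    model: str    # model that generated the diagram
    judge: str    # LLM judge that produced the score
    metric: str   # one of METRICS
    score: float


def mean_scores(scores):
    """Average scores per (model, metric), pooling all judges and samples."""
    buckets = defaultdict(list)
    for s in scores:
        if s.metric not in METRICS:
            raise ValueError(f"unknown metric: {s.metric}")
        buckets[(s.model, s.metric)].append(s.score)
    return {key: mean(values) for key, values in buckets.items()}


if __name__ == "__main__":
    # Made-up scores, for illustration only.
    demo = [
        JudgeScore("model-a", "judge-1", "syntax", 1.0),
        JudgeScore("model-a", "judge-2", "syntax", 0.5),
        JudgeScore("model-a", "judge-1", "usability", 0.25),
    ]
    for (model, metric), value in sorted(mean_scores(demo).items()):
        print(f"{model:8s} {metric:14s} {value:.2f}")
```

Pooling judges by a plain mean is only one option; keeping per-judge averages side by side would make disagreement between judges visible instead of hiding it.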
If this is right
- Supplies a concrete dataset and scoring method to compare LLMs on structured diagram tasks.
- Identifies which models meet the correctness thresholds required for software engineering use.
- Supports repeated testing as new models appear or prompts change.
- Creates measurable standards for diagram generation that can guide model improvement.
Where Pith is reading between the lines
- The same hybrid construction approach could extend to other Mermaid diagram types such as flowcharts or class diagrams.
- Benchmarks like this may highlight specific prompt patterns where current models consistently fail, guiding targeted training data collection.
- Over time the dataset could serve as a fixed test set to track progress in diagram generation accuracy.
Load-bearing premise
An LLM-as-a-judge can reliably score the fine-grained metrics like syntax correctness and practical usability without complete human verification of every judgment.
What would settle it
A human re-scoring of a representative subset of model outputs that shows low agreement with the LLM judge on activation handling or usability scores.
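A check of this kind could be computed along the following lines. This is a minimal sketch assuming human and LLM-judge scores are paired per output and sit on the same numeric scale; the 1-to-5 scale and the sample values are made up for illustration, and the paper does not specify which agreement statistic it would use.

```python
from math import sqrt


def exact_agreement(human, judge):
    """Fraction of items where the two raters gave the identical score."""
    if len(human) != len(judge) or not human:
        raise ValueError("score lists must be non-empty and of equal length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def pearson(human, judge):
    """Pearson correlation between two equally long score lists."""
    if len(human) != len(judge) or len(human) < 2:
        raise ValueError("need two equally long lists with at least two items")
    mean_h = sum(human) / len(human)
    mean_j = sum(judge) / len(judge)
    cov = sum((h - mean_h) * (j - mean_j) for h, j in zip(human, judge))
    var_h = sum((h - mean_h) ** 2 for h in human)
    var_j = sum((j - mean_j) ** 2 for j in judge)
    if var_h == 0 or var_j == 0:
        raise ValueError("correlation is undefined when one rater is constant")
    return cov / sqrt(var_h * var_j)


if __name__ == "__main__":
    # Made-up activation-handling scores on an assumed 1-5 scale.
    human_scores = [5, 4, 2, 3, 1, 4]
    judge_scores = [5, 3, 2, 4, 1, 4]
    print(f"exact agreement: {exact_agreement(human_scores, judge_scores):.2f}")
    print(f"pearson r:       {pearson(human_scores, judge_scores):.2f}")
```

Exact agreement is strict on fine-grained scales, so a chance-corrected statistic or a rank correlation may be a better fit depending on how the scores are distributed.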
Original abstract
Large language models (LLMs) have shown great promise in generating structured diagrams from natural language descriptions, particularly Mermaid sequence diagrams for software engineering. However, the lack of existing benchmarks to assess the LLM's correctness on this task hinders the reliable deployment of these models in production environments. To address this shortcoming, we introduce MermaidSeqBench, a human-verified and LLM-synthetically-extended benchmark for assessing LLM capabilities in generating Mermaid sequence diagrams from natural language prompts. The benchmark consists of 132 samples developed via a hybrid methodology of human-verified flows, LLM-based augmentation, and rule-based expansion. The evaluation uses an LLM-as-a-judge model to assess generation across various fine-grained metrics such as syntax correctness, activation handling, error handling, and practical usability. To demonstrate the effectiveness and flexibility of our benchmark, we perform initial evaluations on numerous state-of-the-art LLMs with multiple LLM judges which reveal significant capability gaps across models and evaluation modes. MermaidSeqBench provides a foundation for evaluating structured diagram generation and establishes the correctness standards needed for real-world software engineering deployment.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MermaidSeqBench, a benchmark of 132 hybrid human-verified and LLM-synthetically-extended samples for evaluating LLMs on generating Mermaid sequence diagrams from natural language prompts. It employs an LLM-as-a-judge protocol to score outputs on fine-grained metrics including syntax correctness, activation handling, error handling, and practical usability, and reports initial evaluations on multiple state-of-the-art LLMs that reveal significant capability gaps.
Significance. If the evaluation protocol is shown to be reliable, MermaidSeqBench would address a clear gap in benchmarks for structured diagram generation in software engineering, offering a foundation for assessing correctness standards needed for real-world deployment of LLMs in producing sequence diagrams.
major comments (2)
- [Evaluation / Abstract] The evaluation protocol (described in the abstract and evaluation sections) relies on LLM-as-a-judge scoring for nuanced metrics such as activation handling and practical usability without any reported human agreement validation, inter-annotator agreement, or spot-check correlation results. These metrics depend on semantic understanding of diagram flows, where LLM judges can mis-evaluate; this absence directly undermines the reliability of the claimed capability gaps and the benchmark's utility as a 'reliable foundation'.
- [Benchmark Construction] The central claim that the 132-sample benchmark 'provides a foundation for evaluating structured diagram generation' depends on the hybrid construction (human-verified flows + LLM augmentation + rule-based expansion) being representative and unbiased, yet no analysis of selection criteria for the human-verified subset or potential artifacts from synthetic extension is provided.
minor comments (2)
- [Abstract] The abstract refers to evaluations on 'numerous state-of-the-art LLMs with multiple LLM judges' but does not name the specific models or judges; a summary table of results would improve clarity.
- [Evaluation] Notation for the fine-grained metrics (e.g., how 'activation handling' is operationalized) could be defined more explicitly to aid reproducibility.
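As an illustration of what an explicit operationalization could look like, the sketch below checks one narrow, rule-based aspect of activation handling: whether every activation in a Mermaid sequence diagram is matched by a deactivation. It relies on Mermaid's `activate`/`deactivate` statements and the `+`/`-` arrow suffixes. The function name, the simplified regular expressions (participant names limited to word characters, spaces, and dots), and the idea of using balance as the criterion are assumptions made here, not the paper's definition of the metric.

```python
import re
from collections import Counter

# Explicit statements: "activate Name" / "deactivate Name".
EXPLICIT = re.compile(r"^\s*(activate|deactivate)\s+(.+?)\s*$")
# Messages such as "A->>+B: text" (activates B) or "B-->>-A: text" (deactivates B).
# Simplifying assumption: names contain only word characters, spaces, and dots.
MESSAGE = re.compile(
    r"^\s*([\w .]+?)\s*(-->>|--x|--\)|-->|->>|-x|-\)|->)([+-]?)\s*([\w .]+?)\s*:"
)


def activation_problems(diagram: str) -> list[str]:
    """Return human-readable problems with activation balance (empty list if none)."""
    open_count = Counter()
    problems = []

    def close(name, line_no):
        if open_count[name] == 0:
            problems.append(f"line {line_no}: '{name}' deactivated while inactive")
        else:
            open_count[name] -= 1

    for line_no, line in enumerate(diagram.splitlines(), start=1):
        explicit = EXPLICIT.match(line)
        if explicit:
            keyword, name = explicit.groups()
            if keyword == "activate":
                open_count[name] += 1
            else:
                close(name, line_no)
            continue
        message = MESSAGE.match(line)
        if message:
            sender, _arrow, suffix, receiver = message.groups()
            if suffix == "+":      # "+" activates the receiver
                open_count[receiver] += 1
            elif suffix == "-":    # "-" deactivates the sender
                close(sender, line_no)

    # Anything still open at the end was never deactivated.
    problems += [
        f"'{name}' is never deactivated"
        for name, count in sorted(open_count.items())
        if count > 0
    ]
    return problems


if __name__ == "__main__":
    sample = "\n".join([
        "sequenceDiagram",
        "    Client->>+Server: request",
        "    deactivate Cache",
    ])
    for problem in activation_problems(sample):
        print(problem)
```

A deterministic check like this cannot judge whether activations are semantically appropriate, but it gives a reproducible baseline that an LLM judge's activation score could be compared against.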
Simulated Author's Rebuttal
We are grateful to the referee for their insightful comments, which have helped us identify areas for improvement in the manuscript. We respond to each major comment in turn and outline the revisions we intend to make.
- Referee: The evaluation protocol (described in the abstract and evaluation sections) relies on LLM-as-a-judge scoring for nuanced metrics such as activation handling and practical usability without any reported human agreement validation, inter-annotator agreement, or spot-check correlation results. These metrics depend on semantic understanding of diagram flows, where LLM judges can mis-evaluate; this absence directly undermines the reliability of the claimed capability gaps and the benchmark's utility as a 'reliable foundation'.
  Authors: We appreciate the referee's concern regarding the validation of our LLM-as-a-judge protocol. Our manuscript already employs multiple distinct LLM judges to cross-validate scores and reduce the risk of mis-evaluation by any single model. However, to further bolster confidence in the results, we will add a human validation study in the revised manuscript. This will involve selecting a representative subset of generated diagrams, having them scored by human experts, and reporting agreement metrics such as percentage agreement and correlation coefficients between human and LLM judgments. We believe this addition will directly address the potential for LLM mis-evaluation on semantic aspects like activation handling.
  Revision: yes
- Referee: The central claim that the 132-sample benchmark 'provides a foundation for evaluating structured diagram generation' depends on the hybrid construction (human-verified flows + LLM augmentation + rule-based expansion) being representative and unbiased, yet no analysis of selection criteria for the human-verified subset or potential artifacts from synthetic extension is provided.
  Authors: We concur that a more detailed justification of the benchmark's construction is necessary to support its role as a foundation. In the updated manuscript, we will revise the benchmark construction section to explicitly describe the selection criteria used for the human-verified subset, such as ensuring coverage across various software engineering scenarios, different levels of interaction complexity, and inclusion of both standard and edge-case flows. Furthermore, we will provide an analysis of the LLM-based augmentation and rule-based expansion steps, including the safeguards implemented to detect and mitigate any introduced artifacts, such as post-generation human review for a portion of the synthetic samples. These enhancements will better demonstrate the benchmark's representativeness and lack of bias.
  Revision: yes
Circularity Check
No circularity: benchmark creation and evaluation are self-contained
The paper introduces MermaidSeqBench via a hybrid human/LLM/rule-based construction process and applies an LLM-as-a-judge protocol to score generated diagrams on syntax, activation, error handling, and usability. No equations, fitted parameters, predictions, or derivations are present that could reduce to the inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The central contribution is the creation of a new evaluation artifact rather than any self-referential derivation, making the work self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, theorem reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "We introduce MermaidSeqBench, a human-verified and LLM-synthetically-extended benchmark for assessing LLM capabilities in generating Mermaid sequence diagrams from natural language prompts."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- Benchmarking Requirement-to-Architecture Generation with Hybrid Evaluation
  The R2ABench benchmark shows that LLMs generate syntactically valid software architectures from requirements but produce structurally fragmented results due to weak relational reasoning.
Appendix A. Sequence Diagrams
The UML sequence diagram in Figure 1 and the accompanying Listing 1 illustrate one of the test cases included in our benchmark dataset. The figure presents the rendered Mermaid sequence diagram describing the flow in uploading documents with se...
1. User Action: User uploads a document (e.g., an ID or contract) through the mobile app.
2. Mobile App: Sends the document along with the session token to the BFF.
3. BFF Validation: Validates the session token with Azure AD. Checks user permissions to ensure they are authorized to upload documents.
4. Storage Process: The BFF saves metadata about the document (e.g., file name, size, upload timestamp) in the database. The actual document is securely stored in cloud storage (e.g., Azure Blob Storage).
5. Response: On successful upload, the BFF returns a confirmation to the app. If the user is unauthorized or the file exceeds size limits, an appropriate error is returned.
Listing 3: Natural language specification for the “Uploading Documents with Secure Storage” flow, corresponding to the syntax in Listi...
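To show the kind of output a model is expected to produce for the upload specification, the sketch below assembles one plausible Mermaid sequence diagram in Python. The participant identifiers, message wording, and the use of an `alt`/`else` block for the error path are one reading of the specification made for illustration; this is not the benchmark's reference diagram, which is not reproduced on this page.

```python
def message(sender, receiver, text, reply=False):
    """Format one Mermaid message; replies use the dashed arrow (-->>)."""
    arrow = "-->>" if reply else "->>"
    return f"{sender}{arrow}{receiver}: {text}"


def render_sequence(participants, body):
    """Assemble Mermaid sequence-diagram source from participants and body lines."""
    lines = ["sequenceDiagram"]
    lines += [f"    participant {p}" for p in participants]
    lines += [f"    {b}" for b in body]
    return "\n".join(lines)


# Hypothetical participant identifiers derived from the specification text.
PARTICIPANTS = ["User", "App", "BFF", "AzureAD", "DB", "Storage"]

# Body lines; alt/else/end models the success and error branches of the Response step.
UPLOAD_FLOW = [
    message("User", "App", "Upload document"),
    message("App", "BFF", "Document and session token"),
    message("BFF", "AzureAD", "Validate session token"),
    message("AzureAD", "BFF", "Validation result", reply=True),
    message("BFF", "BFF", "Check upload permissions"),
    "alt authorized and within size limit",
    "    " + message("BFF", "DB", "Save document metadata"),
    "    " + message("BFF", "Storage", "Store document securely"),
    "    " + message("BFF", "App", "Upload confirmation", reply=True),
    "else unauthorized or file too large",
    "    " + message("BFF", "App", "Error response", reply=True),
    "end",
]

if __name__ == "__main__":
    print(render_sequence(PARTICIPANTS, UPLOAD_FLOW))
```

The printed text can be pasted into a Mermaid renderer to inspect the diagram. Several equally valid diagrams exist for the same specification, which is part of why scoring relies on a judge and not on exact string matching.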
1. User Query: The user types a question or query into the mobile app.
2. Chatbot Engagement: The mobile app sends the query to the Chatbot via the BFF.
3. Initial Response: The Chatbot processes the query and sends an initial response to the mobile app.
4. Follow-up Questions: The Chatbot may ask follow-up questions to better understand the user’s issue.
5. Escalation: If the Chatbot cannot resolve the issue, it escalates the query to a Customer Support Agent.
6. Agent Interaction: The Customer Support Agent receives the escalated query and interacts with the user through the mobile app.
7. Resolution: The Customer Support Agent provides a solution or resolution to the user’s issue.
8. Feedback: The mobile app prompts the user to provide feedback on the support experience.
Listing 4: Natural language specification for the “Chatbot Interaction for Customer Support” flow, corresponding to the syntax in Listing 2 and the rendered diagram in Figure 2.