pith. machine review for the scientific record.

arxiv: 2604.06683 · v1 · submitted 2026-04-08 · 💻 cs.SE

Recognition: 1 theorem link

· Lean Theorem

Benchmarking Requirement-to-Architecture Generation with Hybrid Evaluation

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 18:24 UTC · model grok-4.3

classification 💻 cs.SE
keywords R2ABench · LLM benchmarking · software architecture generation · requirement to architecture · relational reasoning · PlantUML · hybrid evaluation · LLM limitations

The pith

Large language models extract entities from requirements well but fail to reason about their relations, producing fragmented architectures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates R2ABench to test how well large language models can turn product requirement documents into software architecture diagrams using PlantUML. It evaluates models with a new hybrid method that checks structural graphs, scores multiple quality aspects, and detects anti-patterns. Results indicate models reliably pull out entities and follow syntax rules but cannot properly link those entities, yielding disconnected designs. Code-focused models help somewhat with this linking problem, yet agent setups increase inconsistency instead of fixing it. This benchmark gives a clear way to track progress toward reliable automated architecture design.
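
To make the entity-versus-relation split concrete, here is a minimal illustrative sketch (not the paper's evaluation code): it parses component names and arrows out of toy PlantUML text and compares a generated diagram against a reference. The regexes, component names, and the recall measure are assumptions for demonstration only.

```python
# Illustrative only: toy parser and recall measure, not the R2ABench pipeline.
import re

COMPONENT = re.compile(r"\[(\w+)\]")                    # e.g. [OrderService]
ARROW = re.compile(r"\[(\w+)\]\s*-+>\s*\[(\w+)\]")      # e.g. [A] --> [B]

def parse(plantuml: str):
    """Return the sets of entities and directed relations found in the text."""
    return set(COMPONENT.findall(plantuml)), set(ARROW.findall(plantuml))

def recall(predicted: set, reference: set) -> float:
    return len(predicted & reference) / len(reference) if reference else 1.0

reference = """
[Gateway] --> [OrderService]
[OrderService] --> [PaymentService]
[OrderService] --> [InventoryService]
"""
# The failure mode the review describes: all entities present, most links missing.
generated = """
[Gateway]
[OrderService]
[PaymentService]
[InventoryService]
[Gateway] --> [OrderService]
"""

ref_entities, ref_relations = parse(reference)
gen_entities, gen_relations = parse(generated)
print(f"entity recall:   {recall(gen_entities, ref_entities):.2f}")   # high
print(f"relation recall: {recall(gen_relations, ref_relations):.2f}") # low
```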

Core claim

Our study shows that LLMs show strong syntactic validity and robust entity extraction but fundamentally struggle with relational reasoning, leading to structurally fragmented architectures. Code-specialized models partially alleviate this limitation, while agent frameworks introduce significant instability rather than consistent improvements. R2ABench provides a robust and standardized foundation for advancing LLM-driven software architecture generation.

What carries the argument

R2ABench benchmark paired with a hybrid evaluation framework that layers structural graph metrics, multi-dimensional scoring, and architecture anti-pattern detection on PlantUML diagrams.
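
A hedged sketch of what the structural layer could look like in code, assuming (our assumption, not the paper's definition) that fragmentation is counted via weakly connected components and that one anti-pattern is a "god component" whose degree exceeds a threshold:

```python
# Sketch of structural graph metrics and a simple anti-pattern check.
# Metric definitions here are assumed for illustration; the paper's may differ.
import networkx as nx

def diagram_to_graph(relations):
    """relations: iterable of (source, target) pairs parsed from a PlantUML diagram."""
    graph = nx.DiGraph()
    graph.add_edges_from(relations)
    return graph

def structural_metrics(graph: nx.DiGraph) -> dict:
    return {
        "weakly_connected_components": nx.number_weakly_connected_components(graph),
        "edges_per_node": graph.number_of_edges() / max(graph.number_of_nodes(), 1),
    }

def god_components(graph: nx.DiGraph, degree_threshold: int = 5) -> list:
    """Flag nodes with an unusually high number of connections (hypothetical threshold)."""
    return [node for node, degree in graph.degree() if degree > degree_threshold]

# Two disconnected islands -> the fragmentation signal the review describes.
graph = diagram_to_graph([("Gateway", "OrderService"), ("PaymentService", "Ledger")])
print(structural_metrics(graph))   # {'weakly_connected_components': 2, 'edges_per_node': 0.5}
print(god_components(graph))       # [] at this threshold
```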

Load-bearing premise

Expert-curated PlantUML reference diagrams constitute valid ground-truth architectures and the hybrid evaluation layers sufficiently measure architectural quality without further human validation.

What would settle it

Independent software architects scoring generated diagrams for coherence and relational completeness, then checking whether those human scores align with the automated hybrid metrics.
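
That validation could be reported as a rank correlation between architect ratings and the automated hybrid score. The sketch below shows the shape of the analysis with invented placeholder numbers; none of the values come from the paper.

```python
# Placeholder numbers only: a rank-correlation check of automated vs. human scores.
from scipy.stats import spearmanr

human_scores     = [4.5, 2.0, 3.5, 1.5, 4.0, 3.0]        # hypothetical architect ratings (1-5)
automated_scores = [0.82, 0.35, 0.60, 0.30, 0.55, 0.75]   # hypothetical hybrid metric (0-1)

rho, p_value = spearmanr(human_scores, automated_scores)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")
# High rho would support the automated layers; low rho would suggest the reported
# fragmentation is partly a metric artifact rather than a model limitation.
```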

Figures

Figures reproduced from arXiv: 2604.06683 by Fang Liu, Li Zhang, Minxiao Li, Shuying Yan, Yang Liu.

Figure 1. Comparison of Traditional Manual vs. AI-Assisted … view at source ↗
Figure 2. An overview of the R2ABench methodology, encompassing dataset construction, context gradation testing, and the multi-dimensional evaluation framework. view at source ↗
Figure 3. Domain distribution of datasets in R2ABench. view at source ↗
Figure 4. Distribution of error counts across different models. view at source ↗
read the original abstract

Recently, Large Language Models (LLMs) have demonstrated significant potential in automating software engineering tasks. Generating software architecture designs from requirement documents is a crucial step in software development. However, there is currently a lack of functional datasets tailored for this task. To bridge this gap, we introduce R2ABench (Requirement-To-Architecture Benchmark), a novel benchmark comprising diverse real-world software projects paired with comprehensive Product Requirements Documents (PRDs) and expert-curated PlantUML reference diagrams. Furthermore, we propose a multi-dimensional, hybrid evaluation framework that assesses generated diagrams across three complementary layers: Structural Graph Metrics, Multi-dimensional Scoring, and Architecture Anti-pattern Detection. Using this framework, we conducted a comprehensive empirical study evaluating state-of-the-art models and agentic workflows. Our study shows that LLMs show strong syntactic validity and robust entity extraction but fundamentally struggle with relational reasoning, leading to structurally fragmented architectures. Code-specialized models partially alleviate this limitation, while agent frameworks introduce significant instability rather than consistent improvements. R2ABench provides a robust and standardized foundation for advancing LLM-driven software architecture generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces R2ABench, a benchmark for requirement-to-architecture generation consisting of real-world software projects paired with Product Requirements Documents (PRDs) and expert-curated PlantUML reference diagrams. It proposes a hybrid evaluation framework with three layers—Structural Graph Metrics, Multi-dimensional Scoring, and Architecture Anti-pattern Detection—and reports an empirical study on state-of-the-art LLMs and agentic workflows. The central finding is that LLMs exhibit strong syntactic validity and entity extraction but struggle with relational reasoning, producing structurally fragmented architectures; code-specialized models partially mitigate this, while agent frameworks add instability rather than consistent gains.

Significance. If the hybrid metrics are shown to align with human architectural judgments, this work would provide a valuable standardized benchmark and diagnostic framework for LLM-driven architecture generation, an important but under-benchmarked software engineering task. The creation of R2ABench as a new artifact with external models evaluated against it is a clear strength that supports reproducibility and future comparisons.

major comments (3)
  1. R2ABench construction section: the expert-curated PlantUML diagrams are presented as ground truth without inter-rater reliability statistics, details on curation protocol, or validation against independent architects, which directly affects the validity of claims about relational reasoning deficits.
  2. Hybrid Evaluation Framework section: no ablation of the three layers (structural graph metrics, multi-dimensional scoring, anti-pattern detection) and no correlation analysis between automated scores and independent human ratings of architectural quality are reported, leaving open the possibility that reported fragmentation is an artifact of uncalibrated metrics rather than a model limitation.
  3. Empirical Study / Results section: the manuscript supplies no dataset size (number of PRDs or diagrams), model count, or statistical tests for the comparative claims, making the conclusions about code-specialized models versus agents difficult to assess for robustness.
minor comments (2)
  1. Abstract: adding the number of projects in R2ABench and one or two key quantitative results would improve informativeness.
  2. Notation and terminology: ensure consistent capitalization and definition of 'hybrid evaluation framework' and its sub-layers across sections.
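
The first major comment asks for inter-rater reliability on the expert-curated references. For concreteness, a minimal sketch of one such statistic, with hypothetical agreement labels from two annotators (the labels and the choice of Cohen's kappa are our assumptions, not the paper's protocol):

```python
# Hypothetical inter-rater check: two annotators judge each reference relation as
# correct (1) or incorrect (0); Cohen's kappa summarizes chance-corrected agreement.
from sklearn.metrics import cohen_kappa_score

rater_a = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]   # invented labels
rater_b = [1, 0, 0, 1, 0, 1, 1, 1, 1, 1]   # invented labels

print(f"Cohen's kappa: {cohen_kappa_score(rater_a, rater_b):.2f}")
```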

Simulated Author's Rebuttal

3 responses · 0 unresolved

We sincerely thank the referee for their constructive and detailed feedback, which has identified key areas for improving the clarity, rigor, and transparency of our manuscript. We address each major comment below with specific plans for revision.

read point-by-point responses
  1. Referee: R2ABench construction section: the expert-curated PlantUML diagrams are presented as ground truth without inter-rater reliability statistics, details on curation protocol, or validation against independent architects, which directly affects the validity of claims about relational reasoning deficits.

    Authors: We thank the referee for this important observation on benchmark validity. The original manuscript describes the diagrams as expert-curated but provides insufficient protocol details. In the revised version, we will expand the R2ABench construction section to include a step-by-step account of the curation protocol (how PRDs were mapped to PlantUML elements by the expert architect), any internal consistency checks performed, and an explicit limitations subsection noting the single-expert process and absence of inter-rater reliability statistics. We will also highlight the expert's qualifications to support ground-truth quality. While we cannot retroactively compute inter-rater metrics, this expanded discussion will better ground our claims regarding relational reasoning deficits. revision: yes

  2. Referee: Hybrid Evaluation Framework section: no ablation of the three layers (structural graph metrics, multi-dimensional scoring, anti-pattern detection) and no correlation analysis between automated scores and independent human ratings of architectural quality are reported, leaving open the possibility that reported fragmentation is an artifact of uncalibrated metrics rather than a model limitation.

    Authors: We agree that ablations and human correlation would strengthen confidence in the hybrid framework. The manuscript presents the three layers as complementary without explicit ablations or human validation. In revision, we will add an analysis subsection examining each layer's incremental contribution based on our existing empirical observations (e.g., cases where anti-pattern detection flags issues missed by graph metrics). We will also report any feasible correlation with human ratings using available resources or, if new data collection is required, explicitly acknowledge this as a limitation while outlining a plan for future validation. This will help confirm that observed structural fragmentation reflects model behavior rather than uncalibrated metrics. revision: partial

  3. Referee: Empirical Study / Results section: the manuscript supplies no dataset size (number of PRDs or diagrams), model count, or statistical tests for the comparative claims, making the conclusions about code-specialized models versus agents difficult to assess for robustness.

    Authors: We acknowledge that explicit reporting of dataset scale, model counts, and statistical tests was insufficiently prominent, which may have obscured these details. In the revised manuscript, we will prominently state the exact size of R2ABench (number of PRDs and reference diagrams), enumerate all evaluated LLMs and agentic workflows, and incorporate appropriate statistical tests (e.g., significance testing for differences between code-specialized models and agent frameworks) to support the comparative claims. These changes will allow readers to more readily assess the robustness of our findings on relational reasoning limitations. revision: yes
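
One way the promised statistical testing could look, sketched with invented per-project scores (the numbers and the choice of a Mann-Whitney U test are our assumptions, not the authors' analysis):

```python
# Placeholder scores only: non-parametric comparison of a code-specialized model
# against an agentic workflow across the same set of projects.
from scipy.stats import mannwhitneyu

code_model_scores = [0.71, 0.64, 0.69, 0.73, 0.60, 0.68]
agent_scores      = [0.52, 0.75, 0.40, 0.66, 0.31, 0.58]   # wider spread: the instability claim

stat, p = mannwhitneyu(code_model_scores, agent_scores, alternative="two-sided")
print(f"U = {stat}, p = {p:.3f}")
# A variance test (e.g. Levene's) could separately probe whether agent frameworks
# add instability rather than consistent gains.
```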

Circularity Check

0 steps flagged

Low circularity: new benchmark and hybrid evaluation framework applied to external models

full rationale

The paper introduces R2ABench as a new artifact with real-world PRDs and expert-curated PlantUML references, plus a hybrid evaluation framework of structural graph metrics, multi-dimensional scoring, and anti-pattern detection. It then applies this framework to evaluate external state-of-the-art LLMs and agent workflows, reporting findings on syntactic validity, entity extraction, and relational reasoning deficits. No equations or derivations reduce claims to self-defined fitted quantities, no load-bearing self-citations are invoked for uniqueness or ansatz, and no predictions are statistically forced from subsets of the same data. The central empirical claims rest on independent application of the proposed metrics to model-generated outputs rather than any circular reduction to the paper's own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

Claims rest on the new benchmark and the assumption that expert diagrams plus the three evaluation layers capture architectural quality. No explicit free parameters are described.

axioms (2)
  • domain assumption Expert-curated PlantUML diagrams serve as reliable ground truth for architecture quality.
    Invoked when using them as reference for all evaluations.
  • domain assumption The three-layer hybrid evaluation comprehensively assesses generated architecture quality.
    Central premise of the proposed framework.
invented entities (1)
  • R2ABench · no independent evidence
    purpose: Benchmark dataset and evaluation framework for requirement-to-architecture generation
    Newly created in this paper with no external independent validation described.

pith-pipeline@v0.9.0 · 5490 in / 1307 out tokens · 52223 ms · 2026-05-10T18:24:30.288208+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

35 extracted references · 10 canonical work pages · 4 internal anchors

  1. [1]

    Mathqa: Towards interpretable math word problem solving with operation-based formalisms

    Aida Amini et al. Mathqa: Towards interpretable math word problem solving with operation-based formalisms. In Proceedings of NAACL, 2019

  2. [2]

    Program Synthesis with Large Language Models

    Jacob Austin et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021

  3. [3]

    Software architecture documentation in practice: Documenting architectural layers

    Felix Bachmann, Len Bass, Jeromy Carriere, Paul Clements, David Garlan, James Ivers, Robert Nord, and Reed Little. Software architecture documentation in practice: Documenting architectural layers. Technical report, 2000

  4. [4]

    Assessing the suitability of large language models in generating uml class diagrams as conceptual models

    Marco Calamo, Massimo Mecella, and Monique Snoeck. Assessing the suitability of large language models in generating uml class diagrams as conceptual models. In International Conference on Business Process Modeling, Development and Support, pages 211–226. Springer, 2025

  5. [5]

    On the assessment of generative ai in modeling tasks: an experience report with chatgpt and uml

    Javier Cámara-Moreno, Javier Troya-Castilla, Lola Burgueño-Caballero, and Antonio Jesús Vallecillo-Moreno. On the assessment of generative ai in modeling tasks: an experience report with chatgpt and uml. 2023

  6. [6]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021

  7. [7]

    Can llms generate architectural design decisions?-an exploratory empirical study

    Rudra Dhar, Karthik Vaidhyanathan, and Vasudeva Varma. Can llms generate architectural design decisions?-an exploratory empirical study. In 2024 IEEE 21st International Conference on Software Architecture (ICSA), pages 79–89. IEEE, 2024

  8. [8]

    Crosscodeeval: A diverse and multilingual benchmark for cross-file code completion

    Yangruibo Ding et al. Crosscodeeval: A diverse and multilingual benchmark for cross-file code completion. arXiv preprint arXiv:2310.11248, 2023

  9. [9]

    Classeval: A manually-crafted benchmark for evaluating llms on class-level code generation

    Xueying Du et al. Classeval: A manually-crafted benchmark for evaluating llms on class-level code generation. arXiv preprint arXiv:2308.01861, 2023

  10. [10]

    Applying design-metrics to object-oriented frameworks

    K. Erni and C. Lewerentz. Applying design-metrics to object-oriented frameworks. In Proceedings of the 3rd International Software Metrics Symposium, pages 64–74, 1996

  11. [11]

    doi: 10.1109/METRIC.1996.492444

  12. [12]

    A survey on llm-as-a-judge

    Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, et al. A survey on llm-as-a-judge. The Innovation, 2024

  13. [13]

    MetaGPT: Meta programming for a multi-agent collaborative framework

    Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber. MetaGPT: Meta programming for a multi-agent collaborative framework. In The Twelfth International Conference on Learning Representations, 20...

  14. [14]

    Large language models for software engineering: A systematic literature review

    Xinyi Hou, Yanjie Zhao, Yue Liu, Zhou Yang, Kailong Wang, Li Li, Xiapu Luo, David Lo, John Grundy, and Haoyu Wang. Large language models for software engineering: A systematic literature review. ACM Transactions on Software Engineering and Methodology, 33(8):1–79, 2024

  15. [15]

    Role of ai in requirements engineering, 2023

    Ivan Filippov. Role of ai in requirements engineering, 2023. URL https://www.getxray.app/blog/ai-in-requirements-engineering. Accessed: 2026-03-18

  16. [16]

    The unified modeling language reference manual

    Ivar Jacobson, James Rumbaugh, and Grady Booch. The unified modeling language reference manual. 2021

  17. [17]

    Testgeneval: A real world unit test generation and test completion benchmark

    Kush Jain, Gabriel Synnaeve, and Baptiste Rozière. Testgeneval: A real world unit test generation and test completion benchmark. In ICLR, 2025

  18. [18]

    Survey of hallucination in natural language generation

    Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38, 2023

  19. [19]

    Prompting large language models to tackle the full software development lifecycle: A case study (devbench)

    Bowen Li, Wenhan Wu, et al. Prompting large language models to tackle the full software development lifecycle: A case study (devbench). In Proceedings of the 31st International Conference on Computational Linguistics, pages 7511–7531, 2025

  20. [20]

    Rouge: A package for automatic evaluation of summaries

    Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81, 2004

  21. [21]

    C4 model: a research guide for designing software architectures

    Argyro Mavrogiorgou, Athanasios Kiourtis, Dimosthenis Kyriazis, Martin Serrano, Mauro Isaja, Raquel Lazcano, John Soldatos, and Ernesto Troiano. C4 model: a research guide for designing software architectures. In 2025 8th International Conference on Software and System Engineering (ICoSSE), pages 1–9. IEEE, 2025

  22. [22]

    Recommended practice for architectural description of software intensive systems

    Architecture Working Group of the Software Engineering Committee et al. Recommended practice for architectural description of software intensive systems. IEEE Standards Department, 2000

  23. [23]

    Bleu: a method for automatic evaluation of machine translation

    Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318, 2002

  24. [24]

    Systematic literature reviews in software engineering—enhancement of the study selection process using Cohen's kappa statistic

    Jorge Pérez, Jessica Díaz, Javier Garcia-Martin, and Bernardo Tabuenca. Systematic literature reviews in software engineering—enhancement of the study selection process using Cohen's kappa statistic. Journal of Systems and Software, 168:110657, 2020

  25. [25]

    Software architecture meets LLMs: A systematic literature review

    Larissa Schmid, Tobias Hey, Martin Armbruster, Sophie Corallo, Dominik Fuchß, Jan Keim, Haoyu Liu, and Anne Koziolek. Software architecture meets LLMs: A systematic literature review. arXiv preprint arXiv:2505.16697, 2025

  26. [26]

    MermaidSeqBench: An Evaluation Benchmark for NL-to-Mermaid Sequence Diagram Generation

    Basel Shbita, Farhan Ahmed, and Chad DeLuca. Mermaidseqbench: An evaluation benchmark for llm-to-mermaid sequence diagram generation, 2025. URL https://arxiv.org/abs/2511.14967

  27. [27]

    Application of the tree-of-thoughts framework to llm-enabled domain modeling

    Jonathan Silva, Qin Ma, Jordi Cabot, Pierre Kelsen, and Henderik A Proper. Application of the tree-of-thoughts framework to llm-enabled domain modeling. In International Conference on Conceptual Modeling, pages 94–111. Springer, 2024

  28. [28]

    Collaborative llm agents for c4 software architecture design automation

    Kamil Szczepanik, Jarosław Chudziak, et al. Collaborative llm agents for c4 software architecture design automation. arXiv preprint arXiv:2510.22787, 2025

  29. [29]

    Contest: A unit test completion benchmark featuring context

    Johannes Villmow, Jonas Depoix, and Adrian Ulges. Contest: A unit test completion benchmark featuring context. In Proceedings of the 1st Workshop on Natural Language Processing for Programming, 2021

  30. [30]

    Testeval: Benchmarking large language models for test case generation

    Wenhan Wang et al. Testeval: Benchmarking large language models for test case generation. In Findings of NAACL, 2025

  31. [31]

    OpenHands: An Open Platform for AI Software Developers as Generalist Agents

    Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, Robert Brennan, Hao Peng, Heng Ji, and Graham Neubig. OpenHands: An Open Platform for AI Soft...

  32. [32]

    Repocoder: Repository-level code completion through iterative retrieval and generation

    Fengji Zhang et al. Repocoder: Repository-level code completion through iterative retrieval and generation. In Proceedings of EMNLP, 2023

  33. [33]

    Towards realistic project-level code generation via multi-agent collaboration and semantic architecture modeling

    Qianhui Zhao, Li Zhang, Fang Liu, Junhang Cheng, Chengru Wu, Junchen Ai, Qiaoyuanhe Meng, Lichen Zhang, Xiaoli Lian, Shubin Song, et al. Towards realistic project-level code generation via multi-agent collaboration and semantic architecture modeling. arXiv preprint arXiv:2511.03404, 2025

  34. [34]

    Judging llm-as-a-judge with mt-bench and chatbot arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in neural information processing systems, 36:46595–46623, 2023

  35. [35]

    Codegeex: A pre-trained model for code generation with multilingual benchmarking on humaneval-x

    Qinkai Zheng et al. Codegeex: A pre-trained model for code generation with multilingual benchmarking on humaneval-x. In Proceedings of the 29th ACM SIGKDD Conference, 2023