pith. machine review for the scientific record.

arxiv: 2604.06683 · v1 · submitted 2026-04-08 · 💻 cs.SE

Recognition: 1 theorem link

· Lean Theorem

Benchmarking Requirement-to-Architecture Generation with Hybrid Evaluation

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 18:24 UTC · model grok-4.3

classification 💻 cs.SE
keywords R2ABench · LLM benchmarking · software architecture generation · requirement to architecture · relational reasoning · PlantUML · hybrid evaluation · LLM limitations

The pith

Large language models extract entities from requirements well but fail to reason about their relations, producing fragmented architectures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates R2ABench to test how well large language models can turn product requirement documents into software architecture diagrams using PlantUML. It evaluates models with a new hybrid method that checks structural graphs, scores multiple quality aspects, and detects anti-patterns. Results indicate models reliably pull out entities and follow syntax rules but cannot properly link those entities, yielding disconnected designs. Code-focused models help somewhat with this linking problem, yet agent setups increase inconsistency instead of fixing it. This benchmark gives a clear way to track progress toward reliable automated architecture design.
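
To make the entity-versus-relation split concrete, here is a minimal illustrative sketch (not the paper's evaluation code): it parses component names and arrows out of toy PlantUML text and compares a generated diagram against a reference. The regexes, component names, and the recall measure are assumptions for demonstration only.

```python
# Illustrative only: toy parser and recall measure, not the R2ABench pipeline.
import re

COMPONENT = re.compile(r"\[(\w+)\]")                    # e.g. [OrderService]
ARROW = re.compile(r"\[(\w+)\]\s*-+>\s*\[(\w+)\]")      # e.g. [A] --> [B]

def parse(plantuml: str):
    """Return the sets of entities and directed relations found in the text."""
    return set(COMPONENT.findall(plantuml)), set(ARROW.findall(plantuml))

def recall(predicted: set, reference: set) -> float:
    return len(predicted & reference) / len(reference) if reference else 1.0

reference = """
[Gateway] --> [OrderService]
[OrderService] --> [PaymentService]
[OrderService] --> [InventoryService]
"""
# The failure mode the review describes: all entities present, most links missing.
generated = """
[Gateway]
[OrderService]
[PaymentService]
[InventoryService]
[Gateway] --> [OrderService]
"""

ref_entities, ref_relations = parse(reference)
gen_entities, gen_relations = parse(generated)
print(f"entity recall:   {recall(gen_entities, ref_entities):.2f}")   # high
print(f"relation recall: {recall(gen_relations, ref_relations):.2f}") # low
```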

Core claim

Our study shows that LLMs show strong syntactic validity and robust entity extraction but fundamentally struggle with relational reasoning, leading to structurally fragmented architectures. Code-specialized models partially alleviate this limitation, while agent frameworks introduce significant instability rather than consistent improvements. R2ABench provides a robust and standardized foundation for advancing LLM-driven software architecture generation.

What carries the argument

R2ABench benchmark paired with a hybrid evaluation framework that layers structural graph metrics, multi-dimensional scoring, and architecture anti-pattern detection on PlantUML diagrams.
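
A hedged sketch of what the structural layer could look like in code, assuming (our assumption, not the paper's definition) that fragmentation is counted via weakly connected components and that one anti-pattern is a "god component" whose degree exceeds a threshold:

```python
# Sketch of structural graph metrics and a simple anti-pattern check.
# Metric definitions here are assumed for illustration; the paper's may differ.
import networkx as nx

def diagram_to_graph(relations):
    """relations: iterable of (source, target) pairs parsed from a PlantUML diagram."""
    graph = nx.DiGraph()
    graph.add_edges_from(relations)
    return graph

def structural_metrics(graph: nx.DiGraph) -> dict:
    return {
        "weakly_connected_components": nx.number_weakly_connected_components(graph),
        "edges_per_node": graph.number_of_edges() / max(graph.number_of_nodes(), 1),
    }

def god_components(graph: nx.DiGraph, degree_threshold: int = 5) -> list:
    """Flag nodes with an unusually high number of connections (hypothetical threshold)."""
    return [node for node, degree in graph.degree() if degree > degree_threshold]

# Two disconnected islands -> the fragmentation signal the review describes.
graph = diagram_to_graph([("Gateway", "OrderService"), ("PaymentService", "Ledger")])
print(structural_metrics(graph))   # {'weakly_connected_components': 2, 'edges_per_node': 0.5}
print(god_components(graph))       # [] at this threshold
```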

Load-bearing premise

Expert-curated PlantUML reference diagrams constitute valid ground-truth architectures and the hybrid evaluation layers sufficiently measure architectural quality without further human validation.

What would settle it

Independent software architects scoring generated diagrams for coherence and relational completeness, then checking whether those human scores align with the automated hybrid metrics.
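
That validation could be reported as a rank correlation between architect ratings and the automated hybrid score. The sketch below shows the shape of the analysis with invented placeholder numbers; none of the values come from the paper.

```python
# Placeholder numbers only: a rank-correlation check of automated vs. human scores.
from scipy.stats import spearmanr

human_scores     = [4.5, 2.0, 3.5, 1.5, 4.0, 3.0]        # hypothetical architect ratings (1-5)
automated_scores = [0.82, 0.35, 0.60, 0.30, 0.55, 0.75]   # hypothetical hybrid metric (0-1)

rho, p_value = spearmanr(human_scores, automated_scores)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")
# High rho would support the automated layers; low rho would suggest the reported
# fragmentation is partly a metric artifact rather than a model limitation.
```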

Figures

Figures reproduced from arXiv: 2604.06683 by Fang Liu, Li Zhang, Minxiao Li, Shuying Yan, Yang Liu.

Figure 1. Comparison of Traditional Manual vs. AI-Assisted … view at source ↗
Figure 2. An overview of the R2ABench methodology, encompassing dataset construction, context gradation testing, and the multi-dimensional evaluation framework. view at source ↗
Figure 3. Domain distribution of datasets in R2ABench. view at source ↗
Figure 4. Distribution of error counts across different models. view at source ↗
read the original abstract

Recently, Large Language Models (LLMs) have demonstrated significant potential in automating software engineering tasks. Generating software architecture designs from requirement documents is a crucial step in software development. However, there is currently a lack of functional datasets tailored for this task. To bridge this gap, we introduce R2ABench (Requirement-To-Architecture Benchmark), a novel benchmark comprising diverse real-world software projects paired with comprehensive Product Requirements Documents (PRDs) and expert-curated PlantUML reference diagrams. Furthermore, we propose a multi-dimensional, hybrid evaluation framework that assesses generated diagrams across three complementary layers: Structural Graph Metrics, Multi-dimensional Scoring, and Architecture Anti-pattern Detection. Using this framework, we conducted a comprehensive empirical study evaluating state-of-the-art models and agentic workflows. Our study shows that LLMs show strong syntactic validity and robust entity extraction but fundamentally struggle with relational reasoning, leading to structurally fragmented architectures. Code-specialized models partially alleviate this limitation, while agent frameworks introduce significant instability rather than consistent improvements. R2ABench provides a robust and standardized foundation for advancing LLM-driven software architecture generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces R2ABench, a benchmark for requirement-to-architecture generation consisting of real-world software projects paired with Product Requirements Documents (PRDs) and expert-curated PlantUML reference diagrams. It proposes a hybrid evaluation framework with three layers—Structural Graph Metrics, Multi-dimensional Scoring, and Architecture Anti-pattern Detection—and reports an empirical study on state-of-the-art LLMs and agentic workflows. The central finding is that LLMs exhibit strong syntactic validity and entity extraction but struggle with relational reasoning, producing structurally fragmented architectures; code-specialized models partially mitigate this, while agent frameworks add instability rather than consistent gains.

Significance. If the hybrid metrics are shown to align with human architectural judgments, this work would provide a valuable standardized benchmark and diagnostic framework for LLM-driven architecture generation, an important but under-benchmarked software engineering task. The creation of R2ABench as a new artifact with external models evaluated against it is a clear strength that supports reproducibility and future comparisons.

major comments (3)
  1. R2ABench construction section: the expert-curated PlantUML diagrams are presented as ground truth without inter-rater reliability statistics, details on curation protocol, or validation against independent architects, which directly affects the validity of claims about relational reasoning deficits.
  2. Hybrid Evaluation Framework section: no ablation of the three layers (structural graph metrics, multi-dimensional scoring, anti-pattern detection) and no correlation analysis between automated scores and independent human ratings of architectural quality are reported, leaving open the possibility that reported fragmentation is an artifact of uncalibrated metrics rather than a model limitation.
  3. Empirical Study / Results section: the manuscript supplies no dataset size (number of PRDs or diagrams), model count, or statistical tests for the comparative claims, making the conclusions about code-specialized models versus agents difficult to assess for robustness.
minor comments (2)
  1. Abstract: adding the number of projects in R2ABench and one or two key quantitative results would improve informativeness.
  2. Notation and terminology: ensure consistent capitalization and definition of 'hybrid evaluation framework' and its sub-layers across sections.
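
The first major comment asks for inter-rater reliability on the expert-curated references. For concreteness, a minimal sketch of one such statistic, with hypothetical agreement labels from two annotators (the labels and the choice of Cohen's kappa are our assumptions, not the paper's protocol):

```python
# Hypothetical inter-rater check: two annotators judge each reference relation as
# correct (1) or incorrect (0); Cohen's kappa summarizes chance-corrected agreement.
from sklearn.metrics import cohen_kappa_score

rater_a = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]   # invented labels
rater_b = [1, 0, 0, 1, 0, 1, 1, 1, 1, 1]   # invented labels

print(f"Cohen's kappa: {cohen_kappa_score(rater_a, rater_b):.2f}")
```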

Simulated Author's Rebuttal

3 responses · 0 unresolved

We sincerely thank the referee for their constructive and detailed feedback, which has identified key areas for improving the clarity, rigor, and transparency of our manuscript. We address each major comment below with specific plans for revision.

read point-by-point responses
  1. Referee: R2ABench construction section: the expert-curated PlantUML diagrams are presented as ground truth without inter-rater reliability statistics, details on curation protocol, or validation against independent architects, which directly affects the validity of claims about relational reasoning deficits.

    Authors: We thank the referee for this important observation on benchmark validity. The original manuscript describes the diagrams as expert-curated but provides insufficient protocol details. In the revised version, we will expand the R2ABench construction section to include a step-by-step account of the curation protocol (how PRDs were mapped to PlantUML elements by the expert architect), any internal consistency checks performed, and an explicit limitations subsection noting the single-expert process and absence of inter-rater reliability statistics. We will also highlight the expert's qualifications to support ground-truth quality. While we cannot retroactively compute inter-rater metrics, this expanded discussion will better ground our claims regarding relational reasoning deficits. revision: yes

  2. Referee: Hybrid Evaluation Framework section: no ablation of the three layers (structural graph metrics, multi-dimensional scoring, anti-pattern detection) and no correlation analysis between automated scores and independent human ratings of architectural quality are reported, leaving open the possibility that reported fragmentation is an artifact of uncalibrated metrics rather than a model limitation.

    Authors: We agree that ablations and human correlation would strengthen confidence in the hybrid framework. The manuscript presents the three layers as complementary without explicit ablations or human validation. In revision, we will add an analysis subsection examining each layer's incremental contribution based on our existing empirical observations (e.g., cases where anti-pattern detection flags issues missed by graph metrics). We will also report any feasible correlation with human ratings using available resources or, if new data collection is required, explicitly acknowledge this as a limitation while outlining a plan for future validation. This will help confirm that observed structural fragmentation reflects model behavior rather than uncalibrated metrics. revision: partial

  3. Referee: Empirical Study / Results section: the manuscript supplies no dataset size (number of PRDs or diagrams), model count, or statistical tests for the comparative claims, making the conclusions about code-specialized models versus agents difficult to assess for robustness.

    Authors: We acknowledge that explicit reporting of dataset scale, model counts, and statistical tests was insufficiently prominent, which may have obscured these details. In the revised manuscript, we will prominently state the exact size of R2ABench (number of PRDs and reference diagrams), enumerate all evaluated LLMs and agentic workflows, and incorporate appropriate statistical tests (e.g., significance testing for differences between code-specialized models and agent frameworks) to support the comparative claims. These changes will allow readers to more readily assess the robustness of our findings on relational reasoning limitations. revision: yes
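
One way the promised statistical testing could look, sketched with invented per-project scores (the numbers and the choice of a Mann-Whitney U test are our assumptions, not the authors' analysis):

```python
# Placeholder scores only: non-parametric comparison of a code-specialized model
# against an agentic workflow across the same set of projects.
from scipy.stats import mannwhitneyu

code_model_scores = [0.71, 0.64, 0.69, 0.73, 0.60, 0.68]
agent_scores      = [0.52, 0.75, 0.40, 0.66, 0.31, 0.58]   # wider spread: the instability claim

stat, p = mannwhitneyu(code_model_scores, agent_scores, alternative="two-sided")
print(f"U = {stat}, p = {p:.3f}")
# A variance test (e.g. Levene's) could separately probe whether agent frameworks
# add instability rather than consistent gains.
```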

Circularity Check

0 steps flagged

Low circularity: new benchmark and hybrid evaluation framework applied to external models

full rationale

The paper introduces R2ABench as a new artifact with real-world PRDs and expert-curated PlantUML references, plus a hybrid evaluation framework of structural graph metrics, multi-dimensional scoring, and anti-pattern detection. It then applies this framework to evaluate external state-of-the-art LLMs and agent workflows, reporting findings on syntactic validity, entity extraction, and relational reasoning deficits. No equations or derivations reduce claims to self-defined fitted quantities, no load-bearing self-citations are invoked for uniqueness or ansatz, and no predictions are statistically forced from subsets of the same data. The central empirical claims rest on independent application of the proposed metrics to model-generated outputs rather than any circular reduction to the paper's own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

Claims rest on the new benchmark and the assumption that expert diagrams plus the three evaluation layers capture architectural quality. No explicit free parameters are described.

axioms (2)
  • domain assumption Expert-curated PlantUML diagrams serve as reliable ground truth for architecture quality.
    Invoked when using them as reference for all evaluations.
  • domain assumption The three-layer hybrid evaluation comprehensively assesses generated architecture quality.
    Central premise of the proposed framework.
invented entities (1)
  • R2ABench · no independent evidence
    purpose: Benchmark dataset and evaluation framework for requirement-to-architecture generation
    Newly created in this paper with no external independent validation described.

pith-pipeline@v0.9.0 · 5490 in / 1307 out tokens · 52223 ms · 2026-05-10T18:24:30.288208+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

35 extracted references · 10 canonical work pages · 4 internal anchors

  1. [1]

    Mathqa: Towards interpretable math word problem solving with operation-based formalisms

    Aida Amini et al. Mathqa: Towards interpretable math word problem solving with operation-based formalisms. In Proceedings of NAACL, 2019

  2. [2]

    Program Synthesis with Large Language Models

    Jacob Austin et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021

  3. [3]

    Software architecture documentation in practice: Documenting architectural layers

    Felix Bachmann, Len Bass, Jeromy Carriere, Paul Clements, David Garlan, James Ivers, Robert Nord, and Reed Little. Software architecture documentation in practice: Documenting architectural layers. Technical report, 2000

  4. [4]

    Assessing the suitability of large language models in generating uml class diagrams as conceptual models

    Marco Calamo, Massimo Mecella, and Monique Snoeck. Assessing the suitability of large language models in generating uml class diagrams as conceptual models. In International Conference on Business Process Modeling, Development and Support, pages 211–226. Springer, 2025

  5. [5]

    On the assessment of generative ai in modeling tasks: an experience report with chatgpt and uml

    Javier Cámara-Moreno, Javier Troya-Castilla, Lola Burgueño-Caballero, and Antonio Jesús Vallecillo-Moreno. On the assessment of generative ai in modeling tasks: an experience report with chatgpt and uml. 2023

  6. [6]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021

  7. [7]

    Can llms generate architectural design decisions?-an exploratory empirical study

    Rudra Dhar, Karthik Vaidhyanathan, and Vasudeva Varma. Can llms generate architectural design decisions?-an exploratory empirical study. In 2024 IEEE 21st International Conference on Software Architecture (ICSA), pages 79–89. IEEE, 2024

  8. [8]

    Crosscodeeval: A diverse and multilingual benchmark for cross-file code completion

    Yangruibo Ding et al. Crosscodeeval: A diverse and multilingual benchmark for cross-file code completion. arXiv preprint arXiv:2310.11248, 2023

  9. [9]

    Classeval: A manually-crafted benchmark for evaluating llms on class-level code generation

    Xueying Du et al. Classeval: A manually-crafted benchmark for evaluating llms on class-level code generation. arXiv preprint arXiv:2308.01861, 2023

  10. [10]

    Applying design-metrics to object-oriented frameworks

    K. Erni and C. Lewerentz. Applying design-metrics to object-oriented frameworks. In Proceedings of the 3rd International Software Metrics Symposium, pages 64–74, 1996

  11. [11]

    doi: 10.1109/METRIC.1996.492444

  12. [12]

    A survey on llm-as-a-judge

    Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, et al. A survey on llm-as-a-judge. The Innovation, 2024

  13. [13]

    MetaGPT: Meta programming for a multi-agent collaborative framework

    Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber. MetaGPT: Meta programming for a multi-agent collaborative framework. In The Twelfth International Conference on Learning Representations, 20...

  14. [14]

    Large language models for software engineering: A systematic literature review

    Xinyi Hou, Yanjie Zhao, Yue Liu, Zhou Yang, Kailong Wang, Li Li, Xiapu Luo, David Lo, John Grundy, and Haoyu Wang. Large language models for software engineering: A systematic literature review. ACM Transactions on Software Engineering and Methodology, 33(8):1–79, 2024

  15. [15]

    Role of ai in requirements engineering, 2023

    Ivan Filippov. Role of ai in requirements engineering, 2023. URL https://www.getxray.app/blog/ai-in-requirements-engineering. Accessed: 2026-03-18

  16. [16]

    The unified modeling language reference manual

    Ivar Jacobson, James Rumbaugh, and Grady Booch. The unified modeling language reference manual. 2021

  17. [17]

    Testgeneval: A real world unit test generation and test completion benchmark

    Kush Jain, Gabriel Synnaeve, and Baptiste Rozière. Testgeneval: A real world unit test generation and test completion benchmark. In ICLR, 2025

  18. [18]

    Survey of hallucination in natural language generation

    Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38, 2023

  19. [19]

    Prompting large language models to tackle the full software development lifecycle: A case study (devbench)

    Bowen Li, Wenhan Wu, et al. Prompting large language models to tackle the full software development lifecycle: A case study (devbench). In Proceedings of the 31st International Conference on Computational Linguistics, pages 7511–7531, 2025

  20. [20]

    Rouge: A package for automatic evaluation of summaries

    Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81, 2004

  21. [21]

    C4 model: a research guide for designing software architectures

    Argyro Mavrogiorgou, Athanasios Kiourtis, Dimosthenis Kyriazis, Martin Serrano, Mauro Isaja, Raquel Lazcano, John Soldatos, and Ernesto Troiano. C4 model: a research guide for designing software architectures. In 2025 8th International Conference on Software and System Engineering (ICoSSE), pages 1–9. IEEE, 2025

  22. [22]

    Recommended practice for architectural description of software intensive systems

    Architecture Working Group of the Software Engineering Committee et al. Recommended practice for architectural description of software intensive systems. IEEE Standards Department, 2000

  23. [23]

    Bleu: a method for automatic evaluation of machine translation

    Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318, 2002

  24. [24]

    Systematic literature reviews in software engineering—enhancement of the study selection process using Cohen's kappa statistic

    Jorge Pérez, Jessica Díaz, Javier Garcia-Martin, and Bernardo Tabuenca. Systematic literature reviews in software engineering—enhancement of the study selection process using Cohen's kappa statistic. Journal of Systems and Software, 168:110657, 2020

  25. [25]

    Software architecture meets LLMs: A systematic literature review

    Larissa Schmid, Tobias Hey, Martin Armbruster, Sophie Corallo, Dominik Fuchß, Jan Keim, Haoyu Liu, and Anne Koziolek. Software architecture meets LLMs: A systematic literature review. arXiv preprint arXiv:2505.16697, 2025

  26. [26]

    MermaidSeqBench: An Evaluation Benchmark for NL-to-Mermaid Sequence Diagram Generation

    Basel Shbita, Farhan Ahmed, and Chad DeLuca. Mermaidseqbench: An evaluation benchmark for llm-to-mermaid sequence diagram generation, 2025. URL https://arxiv.org/abs/2511.14967

  27. [27]

    Application of the tree-of-thoughts framework to llm-enabled domain modeling

    Jonathan Silva, Qin Ma, Jordi Cabot, Pierre Kelsen, and Henderik A Proper. Application of the tree-of-thoughts framework to llm-enabled domain modeling. In International Conference on Conceptual Modeling, pages 94–111. Springer, 2024

  28. [28]

    Collaborative llm agents for c4 software architecture design automation

    Kamil Szczepanik, Jarosław Chudziak, et al. Collaborative llm agents for c4 software architecture design automation. arXiv preprint arXiv:2510.22787, 2025

  29. [29]

    Contest: A unit test completion benchmark featuring context

    Johannes Villmow, Jonas Depoix, and Adrian Ulges. Contest: A unit test completion benchmark featuring context. In Proceedings of the 1st Workshop on Natural Language Processing for Programming, 2021

  30. [30]

    Testeval: Benchmarking large language models for test case generation

    Wenhan Wang et al. Testeval: Benchmarking large language models for test case generation. In Findings of NAACL, 2025

  31. [31]

    OpenHands: An Open Platform for AI Software Developers as Generalist Agents

    Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, Robert Brennan, Hao Peng, Heng Ji, and Graham Neubig. OpenHands: An Open Platform for AI Soft...

  32. [32]

    Repocoder: Repository-level code completion through iterative retrieval and generation

    Fengji Zhang et al. Repocoder: Repository-level code completion through iterative retrieval and generation. In Proceedings of EMNLP, 2023

  33. [33]

    Towards realistic project-level code generation via multi-agent collaboration and semantic architecture modeling

    Qianhui Zhao, Li Zhang, Fang Liu, Junhang Cheng, Chengru Wu, Junchen Ai, Qiaoyuanhe Meng, Lichen Zhang, Xiaoli Lian, Shubin Song, et al. Towards realistic project-level code generation via multi-agent collaboration and semantic architecture modeling. arXiv preprint arXiv:2511.03404, 2025

  34. [34]

    Judging llm-as-a-judge with mt-bench and chatbot arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in neural information processing systems, 36:46595–46623, 2023

  35. [35]

    Codegeex: A pre-trained model for code generation with multilingual benchmarking on humaneval-x

    Qinkai Zheng et al. Codegeex: A pre-trained model for code generation with multilingual benchmarking on humaneval-x. In Proceedings of the 29th ACM SIGKDD Conference, 2023