
arxiv: 2604.05755 · v1 · submitted 2026-04-07 · 💻 cs.SE · cs.AI

CAKE: Cloud Architecture Knowledge Evaluation of Large Language Models

Pith reviewed 2026-05-10 19:12 UTC · model grok-4.3

keywords large language models · cloud-native architecture · benchmark · Bloom's taxonomy · software architecture evaluation · multiple choice questions · free response · model scaling

The pith

A new benchmark of 188 questions shows multiple-choice and free-response formats measure distinct aspects of how well large language models understand cloud-native architecture.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces CAKE, a benchmark consisting of 188 expert-validated questions spanning four cognitive levels from Bloom's revised taxonomy and five cloud-native topics. When applied to 22 model configurations ranging from 0.5B to 70B parameters, multiple-choice accuracy reaches a ceiling above 3 billion parameters while free-response scores continue to improve steadily with scale. The two question formats therefore appear to probe different facets of architectural knowledge. Reasoning augmentation improves free-response quality, but tool augmentation reduces performance in smaller models.

Core claim

CAKE is a benchmark of 188 expert-validated questions covering the recall, analyze, design, and implement levels of Bloom's revised taxonomy across five cloud-native topics. The evaluation covers 22 model configurations from four LLM families, using three-run majority voting for multiple-choice items and LLM-as-a-judge scoring for free responses. It identifies four patterns: multiple-choice accuracy plateaus above 3B parameters, with the best model reaching 99.2 percent; free-response scores scale across all cognitive levels; the two formats capture separate facets of knowledge; and reasoning augmentation improves free-response quality while tool augmentation degrades results for small models.

What carries the argument

The CAKE benchmark, which uses majority-voted multiple-choice questions and LLM-as-a-judge scored free responses to test LLM performance on cloud architecture at distinct cognitive levels.
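
The three-run majority voting used for the multiple-choice items can be sketched roughly as follows. This is a minimal illustration, not the paper's code: the function names `majority_vote` and `mcq_accuracy` are hypothetical, and the tie rule (no strict majority counts as incorrect) is an assumption, since the abstract does not say how three-way disagreements are handled.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(answers: Sequence[str]) -> Optional[str]:
    """Return the option picked by a strict majority of runs, else None.

    Assumption: a question with no strict majority (e.g. A, B, C) has no
    voted answer. The paper's actual tie rule is not stated in the abstract.
    """
    if not answers:
        return None
    option, count = Counter(answers).most_common(1)[0]
    return option if count > len(answers) / 2 else None


def mcq_accuracy(runs_per_question: Sequence[Sequence[str]],
                 gold: Sequence[str]) -> float:
    """Fraction of questions whose majority-voted answer equals the gold label."""
    if len(runs_per_question) != len(gold):
        raise ValueError("runs_per_question and gold must have the same length")
    if not gold:
        raise ValueError("gold must not be empty")
    # A None vote never equals a gold label, so unresolved items count as wrong.
    correct = sum(majority_vote(runs) == label
                  for runs, label in zip(runs_per_question, gold))
    return correct / len(gold)
```

Under these assumptions, `majority_vote(["A", "B", "A"])` returns `"A"`, while three different answers yield `None` and the item is scored as incorrect.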

If this is right

  • Multiple-choice questions lose power to distinguish architectural knowledge once models exceed roughly 3 billion parameters.
  • Free-response questions continue to reveal differences in model capability at larger scales across all cognitive levels.
  • Reasoning augmentation (+think) improves the quality of free-response answers about architecture.
  • Tool augmentation (+tool) tends to lower performance when applied to smaller models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Evaluations of LLMs for software architecture tasks should combine closed and open question formats to avoid early performance ceilings.
  • Prompt engineering choices such as reasoning augmentation may be more reliably helpful than tool use when deploying LLMs as architecture assistants.
  • Scaling behavior observed in LLM benchmarks can depend on the specific response format chosen for measurement.

Load-bearing premise

The 188 questions accurately and comprehensively measure actual understanding of cloud-native software architecture without gaps or biases in topic coverage or question design.

What would settle it

Finding a model that achieves high CAKE scores yet produces incorrect cloud architecture designs in independent real-world tasks, or a model with low CAKE scores that succeeds at those tasks, would challenge the benchmark's validity.
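
One way to make this check concrete is a rank correlation between per-model CAKE scores and success rates on an independent set of architecture tasks. The sketch below is an assumption-laden illustration: the paper reports no such external task set, the helper names `average_ranks` and `spearman_rho` are hypothetical, and Spearman's rho is swapped in here as one reasonable statistic rather than anything the authors use.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5
```

A rho near zero or negative across models, with `x` holding benchmark scores and `y` holding external task success rates, would be the kind of evidence that challenges the benchmark's validity.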

Figures

Figures reproduced from arXiv: 2604.05755 by Florian Girardo Lukas, Krzysztof Sierszecki, Phongsakon Mark Konrad, Rahime Yilmaz, Riccardo Terrenzi, Serkan Ayvaz, Tim Lukas Adam.

Figure 1: Question distribution across five cloud-native topics and four cognitive levels. (A) Full Bench (188 evaluated questions; 12 implement-level MCQs …
Figure 2: Distribution of expert ratings across clarity, correctness, and difficulty
Figure 3: Free-response judge scores (0–5) for all 22 configurations, ranked per …
Figure 4: Augmentation effects on MCQ and Free-response performance.
Figure 5: Full Bench vs. CAKE-Core MCQ accuracy for all 22 configurations. Darker bars show Full Bench; lighter bars show CAKE-Core. Delta values at …
Figure 6: Free-response judge scores (0–5) vs. model parameters across …
Figure 7: Per-topic performance across all 22 configurations. (A) MCQ accuracy (%) aggregated across recall, analyze, and design levels. (B) free-response …
Figure 8: Expert–model alignment analysis. (A) Expert difficulty ratings show …
Original abstract

In today's software architecture, large language models (LLMs) serve as software architecture co-pilots. However, no benchmark currently exists to evaluate large language models' actual understanding of cloud-native software architecture. For this reason we present a benchmark called CAKE, which consists of 188 expert-validated questions covering four cognitive levels of Bloom's revised taxonomy (recall, analyze, design, and implement) and five cloud-native topics. Evaluation is conducted on 22 model configurations (0.5B–70B parameters) across four LLM families, using three-run majority voting for multiple-choice questions (MCQs) and LLM-as-a-judge scoring for free-responses (FR). Based on this evaluation, four notable findings were identified. First, MCQ accuracy plateaus above 3B parameters, with the best model reaching 99.2%. Second, free-response scores scale steadily across all cognitive levels. Third, the two formats capture different facets of knowledge, as the MCQ accuracy approaches a ceiling while free-responses continue to differentiate models. Finally, reasoning augmentation (+think) improves free-response quality, while tool augmentation (+tool) degrades performance for small models. These results suggest that the evaluation format fundamentally shapes how we measure architectural knowledge in LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces the CAKE benchmark, consisting of 188 expert-validated questions on cloud-native software architecture spanning four levels of Bloom's revised taxonomy (recall, analyze, design, implement) and five topics. It evaluates 22 model configurations (0.5B–70B parameters across four families) using three-run majority voting for MCQs and LLM-as-a-judge scoring for free responses (FR). The four main findings are: MCQ accuracy plateaus above 3B parameters (best model at 99.2%); FR scores scale steadily across cognitive levels; the two formats capture different knowledge facets (MCQ accuracy approaches a ceiling while FR continues to differentiate models); and reasoning augmentation (+think) improves FR quality while tool augmentation (+tool) degrades small-model performance.

Significance. If the findings hold after addressing scoring validation, this provides a useful domain-specific benchmark for LLM architectural knowledge beyond generic tests, highlighting how evaluation format influences observed scaling and augmentation effects. The multi-level cognitive design and broad model coverage are strengths that could inform co-pilot tool development in software engineering.

major comments (2)
  1. Free-response scoring subsection (Evaluation section): The manuscript relies on LLM-as-a-judge for FR scores without reported human validation, inter-rater agreement metrics, or controls for judge bias (e.g., favoritism toward larger/similar models or format penalties on tool-augmented outputs). This is load-bearing for the third finding (formats capture distinct facets) and fourth finding (augmentation effects), as systematic judge artifacts could produce the observed MCQ plateau vs. FR differentiation without reflecting true knowledge differences.
  2. Benchmark construction section: While questions are stated as expert-validated, insufficient detail is provided on the validation process (expert count, agreement statistics, topic/cognitive-level coverage checks, or bias mitigation in question design). This weakens support for the claim that the benchmark measures 'actual understanding' at the four levels, which underpins all four findings.
minor comments (1)
  1. Abstract and results tables: Clarify whether the three-run majority voting applies exclusively to MCQs or influences any FR preprocessing; the current wording leaves this ambiguous.
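
The inter-rater agreement requested in major comment 1 could be estimated along these lines, comparing the LLM judge's 0–5 scores with a human rater's scores on the same answers. This is a minimal sketch under stated assumptions: the paper reports no human scores, the function name `cohens_kappa` is hypothetical, and unweighted Cohen's kappa is chosen here only for brevity, not because the authors use it.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(rater_a: Sequence[int], rater_b: Sequence[int]) -> float:
    """Unweighted Cohen's kappa between two raters scoring the same items.

    Assumption: scores are treated as nominal categories, so a 4-vs-5
    disagreement counts the same as a 0-vs-5 disagreement.
    """
    if len(rater_a) != len(rater_b):
        raise ValueError("both raters must score the same number of items")
    n = len(rater_a)
    if n == 0:
        raise ValueError("at least one scored item is required")
    # Observed agreement: share of items with identical scores.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal score distribution.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in set(freq_a) | set(freq_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1 - expected)
```

Because judge scores are ordinal, a weighted kappa or an ordinal agreement coefficient would likely be a better fit in practice; the unweighted form is shown only because it is the shortest to state correctly.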

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for their insightful comments, which have helped us identify areas for improvement in our manuscript. Below, we provide point-by-point responses to the major comments. We have prepared revisions to address the concerns raised regarding the evaluation methodology and benchmark construction details.

  1. Referee: Free-response scoring subsection (Evaluation section): The manuscript relies on LLM-as-a-judge for FR scores without reported human validation, inter-rater agreement metrics, or controls for judge bias (e.g., favoritism toward larger/similar models or format penalties on tool-augmented outputs). This is load-bearing for the third finding (formats capture distinct facets) and fourth finding (augmentation effects), as systematic judge artifacts could produce the observed MCQ plateau vs. FR differentiation without reflecting true knowledge differences.

    Authors: We agree that the manuscript would benefit from human validation of the LLM-as-a-judge scoring to bolster confidence in the findings. The original submission does not report such validation or inter-rater metrics. In the revised manuscript, we will expand the Evaluation section to include a discussion of this limitation, potential sources of judge bias, and their possible impact on the observed differences between MCQ and free-response formats as well as augmentation effects. We will also make the full set of free-response answers and corresponding judge scores available in a public repository to enable independent assessment. revision: yes

  2. Referee: Benchmark construction section: While questions are stated as expert-validated, insufficient detail is provided on the validation process (expert count, agreement statistics, topic/cognitive-level coverage checks, or bias mitigation in question design). This weakens support for the claim that the benchmark measures 'actual understanding' at the four levels, which underpins all four findings.

    Authors: We acknowledge that the Benchmark Construction section gives only a high-level statement of expert validation, without specific details on the process. We will revise this section to report the validation procedure: the number of experts involved, any agreement statistics collected during validation, verification of coverage across topics and cognitive levels, and steps taken to reduce bias in question design. This will provide stronger support for the benchmark's ability to assess understanding at the specified levels. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark with observational results


The paper creates a new benchmark (188 expert-validated questions across Bloom's levels and cloud topics) and reports direct empirical comparisons of 22 LLM configurations on MCQ accuracy and LLM-as-a-judge free-response scores. No equations, fitted parameters, derivations, or predictions are present; the four findings are observational statements about scaling behavior and format differences. No self-citations are load-bearing for any claim, and the evaluation pipeline does not reduce any result to its own inputs by construction. The study is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that expert-validated questions validly capture architectural understanding; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: Bloom's revised taxonomy provides a valid structure for measuring cognitive levels relevant to cloud-native software architecture knowledge.
    Questions are explicitly organized around recall, analyze, design, and implement levels from this taxonomy.

pith-pipeline@v0.9.0 · 5548 in / 1429 out tokens · 78505 ms · 2026-05-10T19:12:51.121024+00:00 · methodology

