CAKE: Cloud Architecture Knowledge Evaluation of Large Language Models
Pith reviewed 2026-05-10 19:12 UTC · model grok-4.3
The pith
A new benchmark of 188 questions shows multiple-choice and free-response formats measure distinct aspects of how well large language models understand cloud-native architecture.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CAKE is a benchmark of 188 expert-validated questions covering the recall, analyze, design, and implement levels of Bloom's revised taxonomy across five cloud-native topics. The evaluation spans four LLM families, using majority voting for multiple-choice items and LLM-as-a-judge scoring for free responses. It identifies four patterns: multiple-choice accuracy plateaus above 3B parameters, with the best model reaching 99.2 percent; free-response scores scale across all cognitive levels; the two formats capture separate facets of knowledge; and reasoning augmentation improves free-response quality while tool augmentation degrades results for small models.
What carries the argument
The CAKE benchmark, which uses majority-voted multiple-choice questions and LLM-as-a-judge scored free responses to test LLM performance on cloud architecture at distinct cognitive levels.
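The abstract describes the multiple-choice protocol as three-run majority voting. A minimal sketch of that aggregation follows. The function names, the sample data, and the rule that a question with no strict majority counts as incorrect are all assumptions made for illustration; the paper's own tie handling is not stated here.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(answers: Sequence[str]) -> Optional[str]:
    """Return the option picked by a strict majority of runs, else None."""
    if not answers:
        return None
    label, count = Counter(answers).most_common(1)[0]
    # Assumption: three different answers across three runs yields no vote.
    return label if count > len(answers) / 2 else None


def mcq_accuracy(runs_per_question: Sequence[Sequence[str]],
                 answer_key: Sequence[str]) -> float:
    """Fraction of questions whose majority-voted answer matches the key."""
    correct = sum(
        majority_vote(runs) == key
        for runs, key in zip(runs_per_question, answer_key)
    )
    return correct / len(answer_key)


# Hypothetical outputs of one model: three runs for each of four questions.
runs = [["A", "A", "B"], ["C", "D", "B"], ["B", "B", "B"], ["D", "A", "A"]]
key = ["A", "C", "B", "D"]
accuracy = mcq_accuracy(runs, key)
print(f"majority-vote accuracy: {accuracy:.2f}")
```

Voting over repeated runs dampens sampling noise on individual items, which matters most for small models whose answers vary from run to run.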
If this is right
- Multiple-choice questions lose power to distinguish architectural knowledge once models exceed roughly 3 billion parameters.
- Free-response questions continue to reveal differences in model capability at larger scales across all cognitive levels.
- Explicit reasoning steps in prompts improve the quality of free-response answers about architecture.
- Tool access tends to lower performance on free-response architecture questions when applied to smaller models.
Where Pith is reading between the lines
- Evaluations of LLMs for software architecture tasks should combine closed and open question formats to avoid early performance ceilings.
- Prompt engineering choices such as reasoning augmentation may be more reliably helpful than tool use when deploying LLMs as architecture assistants.
- Scaling behavior observed in LLM benchmarks can depend on the specific response format chosen for measurement.
Load-bearing premise
The 188 questions accurately and comprehensively measure actual understanding of cloud-native software architecture without gaps or biases in topic coverage or question design.
What would settle it
Finding a model that achieves high CAKE scores yet produces incorrect cloud architecture designs in independent real-world tasks, or a model with low CAKE scores that succeeds at those tasks, would challenge the benchmark's validity.
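One way to run such a check is to rank models by CAKE score and by success on independent architecture tasks, then measure how well the two rankings agree. The sketch below computes Spearman's rank correlation in plain Python. The scores are hypothetical placeholders, and the choice of Spearman's coefficient is an assumption of this sketch, not something the paper prescribes.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-model scores; neither list comes from the paper.
cake_scores = [0.42, 0.55, 0.61, 0.78, 0.90]
field_task_success = [0.30, 0.35, 0.50, 0.48, 0.70]
rho = spearman(cake_scores, field_task_success)
print(f"Spearman rho: {rho:.2f}")
```

A coefficient near 1 would support the benchmark's validity, while a value near zero or below would point to the kind of mismatch described above.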
Original abstract
In today's software architecture, large language models (LLMs) serve as software architecture co-pilots. However, no benchmark currently exists to evaluate large language models' actual understanding of cloud-native software architecture. For this reason we present a benchmark called CAKE, which consists of 188 expert-validated questions covering four cognitive levels of Bloom's revised taxonomy (recall, analyze, design, and implement) and five cloud-native topics. Evaluation is conducted on 22 model configurations (0.5B–70B parameters) across four LLM families, using three-run majority voting for multiple-choice questions (MCQs) and LLM-as-a-judge scoring for free-responses (FR). Based on this evaluation, four notable findings were identified. First, MCQ accuracy plateaus above 3B parameters, with the best model reaching 99.2%. Second, free-response scores scale steadily across all cognitive levels. Third, the two formats capture different facets of knowledge, as the MCQ accuracy approaches a ceiling while free-responses continue to differentiate models. Finally, reasoning augmentation (+think) improves free-response quality, while tool augmentation (+tool) degrades performance for small models. These results suggest that the evaluation format fundamentally shapes how we measure architectural knowledge in LLMs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the CAKE benchmark, consisting of 188 expert-validated questions on cloud-native software architecture spanning four Bloom's revised taxonomy levels (recall, analyze, design, implement) and five topics. It evaluates 22 model configurations (0.5B–70B parameters across four families) using three-run majority voting for MCQs and LLM-as-a-judge scoring for free responses. The four main findings are: MCQ accuracy plateaus above 3B parameters (best model at 99.2%), free-response scores scale steadily across cognitive levels, the two formats capture different knowledge facets (MCQ ceilings while FR differentiates), and reasoning augmentation (+think) improves FR quality while tool augmentation (+tool) degrades small-model performance.
Significance. If the findings hold after addressing scoring validation, this provides a useful domain-specific benchmark for LLM architectural knowledge beyond generic tests, highlighting how evaluation format influences observed scaling and augmentation effects. The multi-level cognitive design and broad model coverage are strengths that could inform co-pilot tool development in software engineering.
major comments (2)
- Free-response scoring subsection (Evaluation section): The manuscript relies on LLM-as-a-judge for FR scores without reported human validation, inter-rater agreement metrics, or controls for judge bias (e.g., favoritism toward larger/similar models or format penalties on tool-augmented outputs). This is load-bearing for the third finding (formats capture distinct facets) and fourth finding (augmentation effects), as systematic judge artifacts could produce the observed MCQ plateau vs. FR differentiation without reflecting true knowledge differences.
- Benchmark construction section: While questions are stated as expert-validated, insufficient detail is provided on the validation process (expert count, agreement statistics, topic/cognitive-level coverage checks, or bias mitigation in question design). This weakens support for the claim that the benchmark measures 'actual understanding' at the four levels, which underpins all four findings.
minor comments (1)
- Abstract and results tables: Clarify whether the three-run majority voting applies exclusively to MCQs or influences any FR preprocessing; the current wording leaves this ambiguous.
Simulated Author's Rebuttal
We are grateful to the referee for their insightful comments, which have helped us identify areas for improvement in our manuscript. Below, we provide point-by-point responses to the major comments. We have prepared revisions to address the concerns raised regarding the evaluation methodology and benchmark construction details.
- Referee: Free-response scoring subsection (Evaluation section): The manuscript relies on LLM-as-a-judge for FR scores without reported human validation, inter-rater agreement metrics, or controls for judge bias (e.g., favoritism toward larger/similar models or format penalties on tool-augmented outputs). This is load-bearing for the third finding (formats capture distinct facets) and fourth finding (augmentation effects), as systematic judge artifacts could produce the observed MCQ plateau vs. FR differentiation without reflecting true knowledge differences.
  Authors: We agree that the manuscript would benefit from human validation of the LLM-as-a-judge scoring to bolster confidence in the findings. The original submission does not report such validation or inter-rater metrics. In the revised manuscript, we will expand the Evaluation section to include a discussion of this limitation, potential sources of judge bias, and their possible impact on the observed differences between MCQ and free-response formats as well as augmentation effects. We will also make the full set of free-response answers and corresponding judge scores available in a public repository to enable independent assessment.
  Revision: yes
- Referee: Benchmark construction section: While questions are stated as expert-validated, insufficient detail is provided on the validation process (expert count, agreement statistics, topic/cognitive-level coverage checks, or bias mitigation in question design). This weakens support for the claim that the benchmark measures 'actual understanding' at the four levels, which underpins all four findings.
  Authors: We acknowledge that the Benchmark Construction section provides only a high-level statement of expert validation without the specific details on the process. We will revise this section to include additional information on the validation procedure, such as the number of experts involved, any agreement statistics collected during validation, verification of coverage across topics and cognitive levels, and steps taken to reduce bias in question design. This will provide stronger support for the benchmark's ability to assess understanding at the specified levels.
  Revision: yes
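The inter-rater agreement that the referee asks for could be checked on a human-scored subset of the released free-response answers. The sketch below computes unweighted Cohen's kappa between a human rater and the LLM judge. The 0 to 2 score scale and the sample scores are hypothetical, and Cohen's kappa is only one possible choice of agreement statistic; it is not taken from the paper.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(rater_a: Sequence[int], rater_b: Sequence[int]) -> float:
    """Unweighted Cohen's kappa for two raters scoring the same items."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("raters must score the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given identical scores.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal score frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    categories = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # Both raters used one identical category throughout.
    return (observed - expected) / (1 - expected)


# Hypothetical scores on an assumed 0-2 scale for eight free responses.
human_scores = [0, 1, 2, 2, 1, 0, 2, 1]
judge_scores = [0, 1, 2, 1, 1, 0, 2, 2]
kappa = cohens_kappa(human_scores, judge_scores)
print(f"Cohen's kappa: {kappa:.3f}")
```

Because free-response scores are ordinal, a weighted kappa or Krippendorff's alpha may be a better fit in practice; the unweighted form is used here only to keep the sketch short.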
Circularity Check
No circularity: purely empirical benchmark with observational results
The paper creates a new benchmark (188 expert-validated questions across Bloom's levels and cloud topics) and reports direct empirical comparisons of 22 LLM configurations on MCQ accuracy and LLM-as-a-judge free-response scores. No equations, fitted parameters, derivations, or predictions are present; the four findings are observational statements about scaling behavior and format differences. No self-citations are load-bearing for any claim, and the evaluation pipeline does not reduce any result to its own inputs by construction. The study is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Bloom's revised taxonomy provides a valid structure for measuring cognitive levels relevant to cloud-native software architecture knowledge.