CAKE: Cloud Architecture Knowledge Evaluation of Large Language Models

Florian Girardo Lukas; Krzysztof Sierszecki; Phongsakon Mark Konrad; Rahime Yilmaz; Riccardo Terrenzi; Serkan Ayvaz; Tim Lukas Adam

arxiv: 2604.05755 · v1 · submitted 2026-04-07 · 💻 cs.SE · cs.AI

Tim Lukas Adam, Phongsakon Mark Konrad, Riccardo Terrenzi, Florian Girardo Lukas, Rahime Yilmaz, Krzysztof Sierszecki, Serkan Ayvaz

Pith reviewed 2026-05-10 19:12 UTC · model grok-4.3

classification 💻 cs.SE cs.AI

keywords: large language models · cloud-native architecture · benchmark · Bloom's taxonomy · software architecture evaluation · multiple choice questions · free response · model scaling

The pith

A new benchmark of 188 questions shows multiple-choice and free-response formats measure distinct aspects of how well large language models understand cloud-native architecture.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces CAKE, a benchmark consisting of 188 expert-validated questions spanning four cognitive levels from Bloom's revised taxonomy and five cloud-native topics. When applied to 22 model configurations ranging from 0.5B to 70B parameters, multiple-choice accuracy reaches a ceiling above 3 billion parameters while free-response scores continue to improve steadily with scale. The two question formats therefore appear to probe different facets of architectural knowledge. Reasoning augmentation improves free-response quality, but tool augmentation reduces performance in smaller models.

Core claim

CAKE is a benchmark of 188 expert-validated questions covering recall, analyze, design, and implement levels of Bloom's revised taxonomy across five cloud-native topics. Evaluation across four LLM families with majority voting for multiple-choice items and LLM-as-a-judge scoring for free responses identifies four patterns: multiple-choice accuracy plateaus above 3B parameters at up to 99.2 percent, free-response scores scale across all cognitive levels, the two formats capture separate facets of knowledge, and reasoning augmentation improves free-response quality while tool augmentation degrades results for small models.

What carries the argument

The CAKE benchmark, which uses majority-voted multiple-choice questions and LLM-as-a-judge scored free responses to test LLM performance on cloud architecture at distinct cognitive levels.

If this is right

Multiple-choice questions lose power to distinguish architectural knowledge once models exceed roughly 3 billion parameters.
Free-response questions continue to reveal differences in model capability at larger scales across all cognitive levels.
Explicit reasoning steps in prompts improve the quality of free-response answers about architecture.
Tool access tends to lower performance on free-response architecture questions when applied to smaller models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Evaluations of LLMs for software architecture tasks should combine closed and open question formats to avoid early performance ceilings.
Prompt engineering choices such as reasoning augmentation may be more reliably helpful than tool use when deploying LLMs as architecture assistants.
Scaling behavior observed in LLM benchmarks can depend on the specific response format chosen for measurement.

Load-bearing premise

The 188 questions accurately and comprehensively measure actual understanding of cloud-native software architecture without gaps or biases in topic coverage or question design.

What would settle it

Finding a model that achieves high CAKE scores yet produces incorrect cloud architecture designs in independent real-world tasks, or a model with low CAKE scores that succeeds at those tasks, would challenge the benchmark's validity.

Figures

Figures reproduced from arXiv: 2604.05755 by Florian Girardo Lukas, Krzysztof Sierszecki, Phongsakon Mark Konrad, Rahime Yilmaz, Riccardo Terrenzi, Serkan Ayvaz, Tim Lukas Adam.

**Figure 1.** Question distribution across five cloud-native topics and four cognitive levels. (A) Full Bench (188 evaluated questions; 12 implement-level MCQs …

**Figure 2.** Distribution of expert ratings across clarity, correctness, and difficulty

**Figure 3.** Free-response judge scores (0–5) for all 22 configurations, ranked per …

**Figure 4.** Augmentation effects on MCQ and free-response performance.

**Figure 5.** Full Bench vs. CAKE-Core MCQ accuracy for all 22 configurations. Darker bars show Full Bench; lighter bars show CAKE-Core. Delta values at …

**Figure 6.** Free-response judge scores (0–5) vs. model parameters across …

**Figure 7.** Per-topic performance across all 22 configurations. (A) MCQ accuracy (%) aggregated across recall, analyze, and design levels. (B) free-response …

**Figure 8.** Expert–model alignment analysis. (A) Expert difficulty ratings show …

Original abstract

In today's software architecture, large language models (LLMs) serve as software architecture co-pilots. However, no benchmark currently exists to evaluate large language models' actual understanding of cloud-native software architecture. For this reason we present a benchmark called CAKE, which consists of 188 expert-validated questions covering four cognitive levels of Bloom's revised taxonomy (recall, analyze, design, and implement) and five cloud-native topics. Evaluation is conducted on 22 model configurations (0.5B–70B parameters) across four LLM families, using three-run majority voting for multiple-choice questions (MCQs) and LLM-as-a-judge scoring for free responses (FR). Based on this evaluation, four notable findings were identified. First, MCQ accuracy plateaus above 3B parameters, with the best model reaching 99.2%. Second, free-response scores scale steadily across all cognitive levels. Third, the two formats capture different facets of knowledge, as the MCQ accuracy approaches a ceiling while free responses continue to differentiate models. Finally, reasoning augmentation (+think) improves free-response quality, while tool augmentation (+tool) degrades performance for small models. These results suggest that the evaluation format fundamentally shapes how we measure architectural knowledge in LLMs.
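The three-run majority vote described for MCQs can be sketched as follows. This is a minimal illustration and not the authors' code: the function names `majority_vote` and `mcq_accuracy`, the sample answers, and the rule that a three-way split counts as incorrect are all assumptions, since the abstract does not say how ties are handled.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(answers: Sequence[str]) -> Optional[str]:
    """Return the option chosen by a strict majority of runs, or None if there is none."""
    if not answers:
        return None
    option, count = Counter(answers).most_common(1)[0]
    # Assumption: without a strict majority (e.g. A/B/C over three runs) no answer is selected.
    return option if count > len(answers) / 2 else None


def mcq_accuracy(runs_per_question: Sequence[Sequence[str]], gold: Sequence[str]) -> float:
    """Fraction of questions whose majority-voted answer matches the gold label."""
    if len(runs_per_question) != len(gold) or not gold:
        raise ValueError("need one non-empty gold label list matching the questions")
    correct = sum(majority_vote(runs) == key for runs, key in zip(runs_per_question, gold))
    return correct / len(gold)


# Hypothetical data: three runs for each of three questions.
runs = [["A", "A", "C"], ["B", "D", "D"], ["A", "B", "C"]]
gold = ["A", "D", "B"]
accuracy = mcq_accuracy(runs, gold)
print(f"majority-vote accuracy: {accuracy:.3f}")
```

Treating a no-majority question as wrong is the conservative choice; a different tie rule would shift accuracy only on questions where the model is unstable across runs.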

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

CAKE gives a usable new benchmark for cloud architecture knowledge in LLMs, but the free-response results rest on unvalidated LLM judging that weakens the main claims.

The paper's real contribution is the CAKE benchmark itself: 188 expert-validated questions spread across four Bloom levels and five cloud-native topics, run on 22 model configurations from 0.5B to 70B parameters. The authors report clear patterns: MCQ accuracy plateaus above 3B parameters while free-response scores keep rising, the two formats diverge, and +think helps free responses while +tool hurts small models. That setup is new for this domain and gives practitioners a concrete way to compare models on architecture tasks rather than generic coding benchmarks.

The dual-format design and the scale of the model sweep are the parts that hold up best. The questions appear to have gone through expert review, and the majority-vote MCQ protocol is straightforward. Those elements make the work worth looking at for anyone building or selecting LLMs for cloud work.

The soft spot is the free-response scoring. The paper relies on an LLM-as-judge without any reported human validation, agreement metrics, or bias checks. If the judge model systematically favors longer answers, certain phrasing, or outputs from larger models, the observed differentiation between MCQ and free-response could be an artifact rather than evidence that the formats measure distinct knowledge. The augmentation results would also need re-checking under that condition. The abstract and methods description do not address this directly, so the central claim that the formats capture different facets rests on weaker ground than the MCQ numbers.

This paper is aimed at software engineering researchers who evaluate LLMs on domain tasks and at teams choosing models for architecture co-pilot work. A reader who needs a ready benchmark or wants to see how cognitive level and response format interact will find usable data here.

The work is coherent enough and the benchmark is novel enough that it deserves a serious referee rather than a desk reject; the scoring method is the main thing that would need tightening in revision.
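One way to probe the length concern raised in this letter is a rank correlation between answer length and judge score. The sketch below computes Spearman's rho in plain Python. The word counts and scores are invented for illustration, the helper names are hypothetical, and nothing here comes from the paper's data or code.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 to n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: the Pearson correlation of the tie-averaged ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical free-response answers: word counts and 0-5 judge scores.
answer_lengths = [120, 340, 95, 410, 220]
judge_scores = [2, 4, 2, 5, 3]
rho = spearman(answer_lengths, judge_scores)
print(f"Spearman rho (length vs. judge score): {rho:.3f}")
```

A high rho on its own would not show bias, because better answers may simply be longer. It would only flag where a human-scored subset is most needed.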

Referee Report

2 major / 1 minor

Summary. The paper introduces the CAKE benchmark, consisting of 188 expert-validated questions on cloud-native software architecture spanning four Bloom's revised taxonomy levels (recall, analyze, design, implement) and five topics. It evaluates 22 model configurations (0.5B–70B parameters across four families) using three-run majority voting for MCQs and LLM-as-a-judge scoring for free responses. The four main findings are: MCQ accuracy plateaus above 3B parameters (best model at 99.2%), free-response scores scale steadily across cognitive levels, the two formats capture different knowledge facets (MCQ ceilings while FR differentiates), and reasoning augmentation (+think) improves FR quality while tool augmentation (+tool) degrades small-model performance.

Significance. If the findings hold after addressing scoring validation, this provides a useful domain-specific benchmark for LLM architectural knowledge beyond generic tests, highlighting how evaluation format influences observed scaling and augmentation effects. The multi-level cognitive design and broad model coverage are strengths that could inform co-pilot tool development in software engineering.

major comments (2)

Free-response scoring subsection (Evaluation section): The manuscript relies on LLM-as-a-judge for FR scores without reported human validation, inter-rater agreement metrics, or controls for judge bias (e.g., favoritism toward larger/similar models or format penalties on tool-augmented outputs). This is load-bearing for the third finding (formats capture distinct facets) and fourth finding (augmentation effects), as systematic judge artifacts could produce the observed MCQ plateau vs. FR differentiation without reflecting true knowledge differences.
Benchmark construction section: While questions are stated as expert-validated, insufficient detail is provided on the validation process (expert count, agreement statistics, topic/cognitive-level coverage checks, or bias mitigation in question design). This weakens support for the claim that the benchmark measures 'actual understanding' at the four levels, which underpins all four findings.
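The agreement check requested in the first major comment could take the form of a quadratic weighted Cohen's kappa between human and judge scores on the 0–5 scale. A minimal sketch follows, assuming integer scores. The function name and the sample ratings are hypothetical, and the paper reports no such analysis.

```python
from typing import Sequence


def quadratic_weighted_kappa(
    rater_a: Sequence[int],
    rater_b: Sequence[int],
    min_score: int = 0,
    max_score: int = 5,
) -> float:
    """Cohen's kappa with quadratic disagreement weights for ordinal integer scores."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty sequences of equal length")
    if max_score <= min_score:
        raise ValueError("max_score must exceed min_score")
    for score in list(rater_a) + list(rater_b):
        if not isinstance(score, int) or not min_score <= score <= max_score:
            raise ValueError(f"score {score!r} is not an integer in the allowed range")

    k = max_score - min_score + 1
    n = len(rater_a)

    # Observed confusion matrix: rows index rater A's score, columns rater B's.
    observed = [[0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_score][b - min_score] += 1

    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2
            # Expected count if the two raters scored independently.
            expected = hist_a[i] * hist_b[j] / n
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected

    if weighted_expected == 0:
        raise ValueError("kappa is undefined when both raters use one identical score")
    return 1.0 - weighted_observed / weighted_expected


# Hypothetical scores for six free-response answers.
human_scores = [0, 2, 3, 5, 4, 1]
judge_scores = [1, 2, 3, 5, 3, 1]
kappa = quadratic_weighted_kappa(human_scores, judge_scores)
print(f"quadratic weighted kappa: {kappa:.3f}")
```

Quadratic weights penalize a two-point disagreement four times as much as a one-point one, which suits an ordinal rubric. An unweighted kappa or Krippendorff's alpha would be reasonable alternatives.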

minor comments (1)

Abstract and results tables: Clarify whether the three-run majority voting applies exclusively to MCQs or influences any FR preprocessing; the current wording leaves this ambiguous.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for their insightful comments, which have helped us identify areas for improvement in our manuscript. Below, we provide point-by-point responses to the major comments. We have prepared revisions to address the concerns raised regarding the evaluation methodology and benchmark construction details.

Referee: Free-response scoring subsection (Evaluation section): The manuscript relies on LLM-as-a-judge for FR scores without reported human validation, inter-rater agreement metrics, or controls for judge bias (e.g., favoritism toward larger/similar models or format penalties on tool-augmented outputs). This is load-bearing for the third finding (formats capture distinct facets) and fourth finding (augmentation effects), as systematic judge artifacts could produce the observed MCQ plateau vs. FR differentiation without reflecting true knowledge differences.

Authors: We agree that the manuscript would benefit from human validation of the LLM-as-a-judge scoring to bolster confidence in the findings. The original submission does not report such validation or inter-rater metrics. In the revised manuscript, we will expand the Evaluation section to include a discussion of this limitation, potential sources of judge bias, and their possible impact on the observed differences between MCQ and free-response formats as well as augmentation effects. We will also make the full set of free-response answers and corresponding judge scores available in a public repository to enable independent assessment.
revision: yes

Referee: Benchmark construction section: While questions are stated as expert-validated, insufficient detail is provided on the validation process (expert count, agreement statistics, topic/cognitive-level coverage checks, or bias mitigation in question design). This weakens support for the claim that the benchmark measures 'actual understanding' at the four levels, which underpins all four findings.

Authors: We acknowledge that the Benchmark Construction section provides only a high-level statement of expert validation without specific details on the process. We will revise this section to include additional information on the validation procedure, such as the number of experts involved, any agreement statistics collected during validation, verification of coverage across topics and cognitive levels, and steps taken to reduce bias in question design. This will provide stronger support for the benchmark's ability to assess understanding at the specified levels.
revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark with observational results

The paper creates a new benchmark (188 expert-validated questions across Bloom's levels and cloud topics) and reports direct empirical comparisons of 22 LLM configurations on MCQ accuracy and LLM-as-a-judge free-response scores. No equations, fitted parameters, derivations, or predictions are present; the four findings are observational statements about scaling behavior and format differences. No self-citations are load-bearing for any claim, and the evaluation pipeline does not reduce any result to its own inputs by construction. The study is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that expert-validated questions validly capture architectural understanding; no free parameters or invented entities are introduced.

axioms (1)

Domain assumption: Bloom's revised taxonomy provides a valid structure for measuring cognitive levels relevant to cloud-native software architecture knowledge.
Questions are explicitly organized around the recall, analyze, design, and implement levels of this taxonomy.

