Judging the judges: Evaluating alignment and vulnerabilities in LLMs-as-judges
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.SE (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers:
CAKE: Cloud Architecture Knowledge Evaluation of Large Language Models
The CAKE benchmark shows that multiple-choice (MCQ) accuracy on cloud architecture questions plateaus near 99% for models above 3B parameters, while free-response scores improve steadily with model size. Reasoning steps help, but tool use hurts small models.