Evaluating Large Language Models on Computer Science University Exams in Data Structures
Pith reviewed 2026-05-08 08:13 UTC · model grok-4.3
The pith
A new benchmark of Tel Aviv University data structures exams tests how well current LLMs handle closed-ended university questions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce a new benchmark dataset comprising exam questions from Tel Aviv University, curated to assess LLMs' abilities in handling closed and multiple-choice questions, and evaluate GPT-4o, Claude 3.5, Mathstral 7B, and LLaMA 3 8B on it to provide insight into the current capabilities of LLMs in CS education.
What carries the argument
The TAU exams benchmark dataset of real university closed and multiple-choice questions, which serves as the test bed for measuring LLM performance on data structures topics.
If this is right
- The benchmark supplies a public, repeatable way to compare future LLMs on genuine university-level CS questions rather than synthetic ones.
- Performance differences between large and small models on the dataset highlight where scale still matters for educational tasks.
- The evaluation results establish a baseline that later models can be measured against as they improve.
- The approach demonstrates how real exam questions can be used to assess LLM readiness for CS tutoring or grading support.
Where Pith is reading between the lines
- The same curation method could be repeated for other CS courses or other universities to build a broader collection of benchmarks.
- Direct comparison of LLM answers against actual student exam responses on the identical questions would test whether high model scores translate to useful educational help.
- Breaking down errors by specific data structures topics could identify where models need targeted improvement before they are used in classrooms.
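The per-topic error breakdown suggested in the last point can be sketched in a few lines. This is only an illustration: the topic labels, the `accuracy_by_topic` helper, and the sample records are all hypothetical, and nothing here is taken from the paper's data or code.

```python
from collections import defaultdict

def accuracy_by_topic(results):
    """Map each topic to the fraction of its questions answered correctly.

    `results` is an iterable of (topic, is_correct) pairs.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for topic, is_correct in results:
        totals[topic] += 1
        correct[topic] += int(is_correct)
    return {topic: correct[topic] / totals[topic] for topic in totals}

# Hypothetical graded answers for one model (not results from the paper).
sample = [
    ("heaps", True),
    ("heaps", False),
    ("hashing", True),
    ("AVL trees", False),
]
breakdown = accuracy_by_topic(sample)
```

Sorting `breakdown` by value would put the weakest topics first, which is where targeted improvement would be needed.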
Load-bearing premise
That the selected Tel Aviv University exam questions are representative of typical university data structures courses, and that model performance on them reflects meaningful capability in computer science education.
What would settle it
A study that applies the same models to data structures exams from several other universities and finds substantially different accuracy patterns, or that shows LLM scores do not predict whether students using the models actually learn the material better.
Original abstract
We present a comprehensive evaluation of Large Language Models (LLMs) on Computer Science (CS) Data Structure examination questions. Our work introduces a new benchmark dataset comprising exam questions from Tel Aviv University (TAU), curated to assess LLMs' abilities in handling closed and multiple-choice questions. We evaluated the performance of OpenAI's GPT 4o and Anthropic's Claude 3.5, popular LLMs, alongside two smaller LLMs, Mathstral 7B and LLaMA 3 8B, across the TAU exams benchmark. Our findings provide insight into the current capabilities of LLMs in CS education.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a new benchmark dataset comprising closed and multiple-choice data structures exam questions from Tel Aviv University (TAU). It evaluates four LLMs—GPT-4o, Claude 3.5, Mathstral-7B, and LLaMA-3-8B—on this benchmark and reports their performance to assess current LLM capabilities in CS education.
Significance. If the curation and evaluation protocol are sound, the work supplies a concrete, university-sourced benchmark for data-structures questions that enables direct comparison of frontier and smaller models. This empirical contribution is useful for the growing literature on LLM use in CS education, particularly because it moves beyond synthetic or textbook problems to real exam items.
minor comments (1)
- [Abstract] The abstract states that a benchmark was created and models were evaluated, but it gives no question counts, prompting details, scoring method, or performance numbers. Adding one or two key quantitative results would make it a more informative summary of the central claim.
Simulated Author's Rebuttal
We thank the referee for their positive assessment of our work, including the recognition that the TAU data-structures benchmark supplies a useful, university-sourced resource for comparing frontier and smaller LLMs in CS education. We are pleased with the recommendation for minor revision.
Circularity Check
No circularity: a purely empirical benchmark evaluation on external data, with no derivations or self-referential fits.
full rationale
The paper introduces a TAU exam dataset and reports LLM accuracies on closed/multiple-choice data-structures questions. No equations, fitted parameters, predictions, or derivation chain exist. The central claim reduces only to data collection plus consistent prompting and scoring, which are independent of any internal model or self-citation. No load-bearing self-citations, ansatzes, or renamings are present. This is the expected 0-score outcome for a straightforward empirical study.
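To make "consistent prompting and scoring" concrete, here is a minimal sketch of how multiple-choice responses could be scored. It assumes, purely for illustration, that each model is prompted to end its reply with a line such as `Answer: B`; this page does not describe the paper's actual prompt format or scoring rule, and `extract_choice` and `score` are hypothetical helpers.

```python
import re

# Assumed reply convention (not from the paper): "Answer: <letter>".
ANSWER_PATTERN = re.compile(r"answer\s*:\s*\(?([A-Da-d])\b", re.IGNORECASE)

def extract_choice(response):
    """Return the last declared option letter (uppercased), or None."""
    matches = ANSWER_PATTERN.findall(response)
    return matches[-1].upper() if matches else None

def score(responses, answer_key):
    """Fraction of responses whose extracted letter matches the key.

    Responses with no parseable answer count as incorrect.
    """
    if not answer_key or len(responses) != len(answer_key):
        raise ValueError("responses and answer_key must be non-empty and equal in length")
    hits = sum(extract_choice(r) == k for r, k in zip(responses, answer_key))
    return hits / len(answer_key)

# Hypothetical model outputs and answer key.
responses = ["A heap fits here. Answer: B", "answer: (c)", "I am not sure."]
accuracy = score(responses, ["B", "C", "A"])
```

Applying the same extraction rule to every model is what keeps the comparison fair. Counting unparseable replies as wrong is one possible design choice and would need to be reported alongside the accuracies.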