Evaluating Large Language Models on Computer Science University Exams in Data Structures

Abdo Amer; Adi Haviv; Amir Rubinstein; Edan Gabay; Hanoch Levy; Jonathan Stahl; Michal Kleinbort; Naama Maoz; Orr Eilat; Yael Maoz

arxiv: 2604.23347 · v1 · submitted 2026-04-25 · 💻 cs.CL

Pith reviewed 2026-05-08 08:13 UTC · model grok-4.3

keywords: large language models, data structures, benchmark dataset, university exams, computer science education, model evaluation, GPT-4o, Claude 3.5

The pith

A new benchmark of Tel Aviv University data structures exams tests how well current LLMs handle closed-ended university questions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates a dataset drawn directly from Tel Aviv University data structures exams, focusing on closed and multiple-choice items. It runs GPT-4o, Claude 3.5, Mathstral 7B, and LLaMA 3 8B on this collection to measure accuracy. The work aims to show what these models can and cannot do on authentic CS course material. A reader would care because the benchmark supplies a concrete, repeatable yardstick for tracking AI progress in educational settings instead of relying on artificial test items.

Core claim

We introduce a new benchmark dataset comprising exam questions from Tel Aviv University, curated to assess LLMs' abilities in handling closed and multiple-choice questions, and evaluate GPT-4o, Claude 3.5, Mathstral 7B, and LLaMA 3 8B on it to provide insight into the current capabilities of LLMs in CS education.

What carries the argument

The TAU exams benchmark dataset of real university closed and multiple-choice questions, which serves as the test bed for measuring LLM performance on data structures topics.
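As a rough illustration of how accuracy on such a test bed might be scored, the sketch below compares a model's accuracy with the expected accuracy of uniform random guessing, which depends on how many options each question offers. The `MCQuestion` schema, the question ids, and all answers are hypothetical placeholders, not the paper's data or scoring code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MCQuestion:
    """One multiple-choice item (hypothetical schema, not the paper's)."""
    qid: str
    n_options: int  # number of answer choices offered
    correct: str    # label of the correct choice, e.g. "B"


def accuracy(questions, predictions):
    """Fraction of questions whose predicted label matches the key."""
    hits = sum(1 for q in questions if predictions.get(q.qid) == q.correct)
    return hits / len(questions)


def chance_baseline(questions):
    """Expected accuracy of uniform random guessing: mean of 1/k per item."""
    return sum(1 / q.n_options for q in questions) / len(questions)


# Made-up items and model answers, for illustration only.
questions = [
    MCQuestion("q1", 4, "A"),
    MCQuestion("q2", 4, "C"),
    MCQuestion("q3", 2, "B"),
    MCQuestion("q4", 5, "D"),
]
predictions = {"q1": "A", "q2": "B", "q3": "B", "q4": "D"}

model_acc = accuracy(questions, predictions)
baseline = chance_baseline(questions)
print(f"accuracy={model_acc:.2f} chance={baseline:.2f}")
```

Reporting the chance baseline next to raw accuracy matters when questions differ in their number of options, since a score of 0.5 means little on two-option items and a good deal more on five-option ones.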

If this is right

The benchmark supplies a public, repeatable way to compare future LLMs on genuine university-level CS questions rather than synthetic ones.
Performance differences between large and small models on the dataset highlight where scale still matters for educational tasks.
The evaluation results establish a baseline that later models can be measured against as they improve.
The approach demonstrates how real exam questions can be used to assess LLM readiness for CS tutoring or grading support.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same curation method could be repeated for other CS courses or other universities to build a broader collection of benchmarks.
Direct comparison of LLM answers against actual student exam responses on the identical questions would test whether high model scores translate to useful educational help.
Breaking down errors by specific data structures topics could identify where models need targeted improvement before they are used in classrooms.
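The per-topic breakdown suggested in the last point could be sketched as follows. This assumes each graded answer has been tagged with a topic label; the labels and outcomes here are invented for illustration, and the paper is not known to ship such annotations.

```python
from collections import defaultdict


def accuracy_by_topic(results):
    """Map each topic to (correct, total, accuracy).

    `results` is an iterable of (topic, is_correct) pairs; the topic
    labels are hypothetical and would have to be annotated by hand.
    """
    counts = defaultdict(lambda: [0, 0])  # topic -> [correct, total]
    for topic, is_correct in results:
        counts[topic][0] += int(is_correct)
        counts[topic][1] += 1
    return {t: (c, n, c / n) for t, (c, n) in counts.items()}


# Made-up graded outcomes, for illustration only.
results = [
    ("heaps", True), ("heaps", False),
    ("hash tables", True), ("hash tables", True),
    ("AVL trees", False), ("AVL trees", False), ("AVL trees", True),
]

by_topic = accuracy_by_topic(results)
# Weakest topics first, so they stand out.
for topic, (c, n, acc) in sorted(by_topic.items(), key=lambda kv: kv[1][2]):
    print(f"{topic:12s} {c}/{n} = {acc:.2f}")
```

With only a handful of questions per topic the per-topic figures will be noisy, so the raw counts are kept alongside the ratios.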

Load-bearing premise

That the selected Tel Aviv University exam questions are representative of typical university data structures courses, and that model performance on them reflects meaningful capability in computer science education.

What would settle it

A study that applies the same models to data structures exams from several other universities and finds substantially different accuracy patterns, or that shows LLM scores do not predict whether students using the models actually learn the material better.

Figures

Figures reproduced from arXiv: 2604.23347 by Abdo Amer, Adi Haviv, Amir Rubinstein, Edan Gabay, Hanoch Levy, Jonathan Stahl, Michal Kleinbort, Naama Maoz, Orr Eilat, Yael Maoz.

**Figure 1.** A histogram of the number of possible answers for each multiple choice…

**Figure 2.** Accuracy by question type. Success was measured in two granulari…

**Figure 3.** A histogram of the number of correct repetitions (out of 5), with CoT.
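A histogram like the one described in the Figure 3 caption can be built by counting, for each question, how many of the five repetitions were answered correctly. The sketch below is only an assumption about the data layout (a mapping from question id to a list of booleans) and uses made-up outcomes; it does not reproduce the paper's figure.

```python
from collections import Counter

N_REPS = 5  # repetitions per question, as stated in the Figure 3 caption


def correct_count_histogram(runs, n_reps=N_REPS):
    """Count how many questions were answered correctly 0..n_reps times.

    `runs` maps a question id to a list of n_reps booleans, one per
    repetition (hypothetical layout).
    """
    for qid, outcomes in runs.items():
        if len(outcomes) != n_reps:
            raise ValueError(f"{qid}: expected {n_reps} repetitions")
    counts = Counter(sum(outcomes) for outcomes in runs.values())
    return [counts.get(k, 0) for k in range(n_reps + 1)]


# Made-up repetition outcomes, for illustration only.
runs = {
    "q1": [True, True, True, True, True],
    "q2": [True, True, False, True, True],
    "q3": [False, False, False, False, False],
    "q4": [True, False, False, True, False],
    "q5": [True, True, True, True, True],
}

hist = correct_count_histogram(runs)
for k, n_questions in enumerate(hist):
    print(f"{k}/{N_REPS} correct: {'#' * n_questions} ({n_questions})")
```

Mass concentrated at 0 and 5 would indicate a model that is consistent across repetitions, whether right or wrong, while mass in the middle bins would indicate answers that flip between runs.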

Original abstract

We present a comprehensive evaluation of Large Language Models (LLMs) on Computer Science (CS) Data Structure examination questions. Our work introduces a new benchmark dataset comprising exam questions from Tel Aviv University (TAU), curated to assess LLMs' abilities in handling closed and multiple-choice questions. We evaluated the performance of OpenAI's GPT 4o and Anthropic's Claude 3.5, popular LLMs, alongside two smaller LLMs, Mathstral 7B and LLaMA 3 8B, across the TAU exams benchmark. Our findings provide insight into the current capabilities of LLMs in CS education.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

A new dataset of real TAU data structures exam questions plus straightforward LLM scores, but narrow in scope and analysis.

This paper's main offering is a benchmark dataset of actual Tel Aviv University data structures exam questions, along with accuracy numbers for GPT-4o, Claude 3.5, Mathstral-7B, and LLaMA-3-8B on the closed and multiple-choice sections. The full text supplies question counts, prompting details, and per-model results, so the central claim is supported by visible evidence rather than just the abstract. The use of real university exams is the concrete new element here, and it avoids the circularity problems that plague some synthetic benchmarks. The evaluation protocol is standard and reproducible, which makes the numbers easy to check or extend. The stress-test note is right that no load-bearing inconsistency appears in the pipeline.

The soft spots are mostly about limited reach. All questions come from one institution, so it is unclear how representative they are of data structures courses at other schools. The work stays with closed and multiple-choice formats, which means it does not speak to open-ended code writing or explanation tasks. There is also no reported statistical testing or error analysis, so differences between models are harder to interpret with certainty.

This is useful for researchers who need a ready set of real exam items for LLM testing in CS education. A reader building benchmarks or studying model performance on technical material could pull the dataset and run their own checks. It is incremental rather than transformative, but the empirical setup is honest and falsifiable.

I would send it for peer review. The new data and reported scores are concrete enough that referees can evaluate the curation and the numbers directly.
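On the missing statistical testing: because every model answers the same questions, one common option for comparing two models is an exact McNemar test on the paired per-question outcomes. The sketch below is one possible check, not something the paper reports; the outcome lists are made up and the function name is hypothetical.

```python
from math import comb


def mcnemar_exact(a_correct, b_correct):
    """Two-sided exact McNemar test on paired per-question outcomes.

    Returns (b, c, p_value), where b counts questions only model A got
    right and c counts questions only model B got right.
    """
    if len(a_correct) != len(b_correct):
        raise ValueError("both models must be scored on the same questions")
    b = sum(1 for x, y in zip(a_correct, b_correct) if x and not y)
    c = sum(1 for x, y in zip(a_correct, b_correct) if y and not x)
    n = b + c  # only discordant pairs carry information
    if n == 0:
        return b, c, 1.0  # the models never disagree
    # Binomial(n, 0.5) tail at the smaller discordant count, doubled.
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Made-up per-question correctness for two models, for illustration only.
model_a = [True, True, True, True, True, True, True, True, False, False]
model_b = [True, True, False, False, False, False, False, False, True, False]

only_a, only_b, p_value = mcnemar_exact(model_a, model_b)
print(f"A-only={only_a} B-only={only_b} p={p_value:.3f}")
```

In this toy example model A scores 8/10 against 3/10 for model B, yet the p-value is 0.125, which shows how a small question set can leave even a large accuracy gap statistically inconclusive.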

Referee Report

0 major / 1 minor

Summary. The paper introduces a new benchmark dataset comprising closed and multiple-choice data structures exam questions from Tel Aviv University (TAU). It evaluates four LLMs—GPT-4o, Claude 3.5, Mathstral-7B, and LLaMA-3-8B—on this benchmark and reports their performance to assess current LLM capabilities in CS education.

Significance. If the curation and evaluation protocol are sound, the work supplies a concrete, university-sourced benchmark for data-structures questions that enables direct comparison of frontier and smaller models. This empirical contribution is useful for the growing literature on LLM use in CS education, particularly because it moves beyond synthetic or textbook problems to real exam items.

minor comments (1)

[Abstract] The abstract states that a benchmark was created and models were evaluated but supplies no question counts, prompting details, scoring method, or performance numbers. Adding one or two key quantitative results would make the abstract a more informative summary of the central claim.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of our work, including the recognition that the TAU data-structures benchmark supplies a useful, university-sourced resource for comparing frontier and smaller LLMs in CS education. We are pleased with the recommendation for minor revision.

Circularity Check

0 steps flagged

No circularity: this is a purely empirical benchmark evaluation with external data and no derivations or self-referential fits.

The paper introduces a TAU exam dataset and reports LLM accuracies on closed/multiple-choice data-structures questions. No equations, fitted parameters, predictions, or derivation chain exist. The central claim reduces only to data collection plus consistent prompting and scoring, which are independent of any internal model or self-citation. No load-bearing self-citations, ansatzes, or renamings are present. This is the expected 0-score outcome for a straightforward empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is an empirical benchmarking study with no mathematical derivations, fitted parameters, or postulated entities; it relies only on the assumption that the selected exam questions form a valid test of LLM capability.

