arxiv: 2604.23347 · v1 · submitted 2026-04-25 · 💻 cs.CL

Evaluating Large Language Models on Computer Science University Exams in Data Structures

Pith reviewed 2026-05-08 08:13 UTC · model grok-4.3

classification 💻 cs.CL
keywords large language models · data structures · benchmark dataset · university exams · computer science education · model evaluation · GPT-4o · Claude 3.5

The pith

A new benchmark of Tel Aviv University data structures exams tests how well current LLMs handle closed-ended university questions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates a dataset drawn directly from Tel Aviv University data structures exams, focusing on closed and multiple-choice items. It runs GPT-4o, Claude 3.5, Mathstral 7B, and LLaMA 3 8B on this collection to measure accuracy. The work aims to show what these models can and cannot do on authentic CS course material. A reader would care because the benchmark supplies a concrete, repeatable yardstick for tracking AI progress in educational settings instead of relying on artificial test items.

Core claim

We introduce a new benchmark dataset comprising exam questions from Tel Aviv University, curated to assess LLMs' abilities in handling closed and multiple-choice questions, and evaluate GPT-4o, Claude 3.5, Mathstral 7B, and LLaMA 3 8B on it to provide insight into the current capabilities of LLMs in CS education.

What carries the argument

The TAU exams benchmark dataset of real university closed and multiple-choice questions, which serves as the test bed for measuring LLM performance on data structures topics.

If this is right

  • The benchmark supplies a public, repeatable way to compare future LLMs on genuine university-level CS questions rather than synthetic ones.
  • Performance differences between large and small models on the dataset highlight where scale still matters for educational tasks.
  • The evaluation results establish a baseline that later models can be measured against as they improve.
  • The approach demonstrates how real exam questions can be used to assess LLM readiness for CS tutoring or grading support.
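When reading a baseline on multiple-choice items, it can help to compare it with what uniform random guessing would score. The following is a minimal sketch of that arithmetic, assuming each question has exactly one correct option; the function name `chance_accuracy` and the option counts are hypothetical and are not taken from the paper.

```python
from fractions import Fraction


def chance_accuracy(option_counts):
    """Expected accuracy of uniform random guessing.

    option_counts: number of answer options for each question.
    Assumption (not from the paper): exactly one option is correct per question.
    """
    if not option_counts:
        raise ValueError("need at least one question")
    # A question with k options is guessed correctly with probability 1/k.
    return sum(Fraction(1, k) for k in option_counts) / len(option_counts)


# Hypothetical option counts for four questions, for illustration only.
toy_counts = [4, 4, 5, 2]
baseline = chance_accuracy(toy_counts)
print(float(baseline))
```

A reported accuracy is only informative to the extent that it exceeds this floor, which depends on how many options the questions offer.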

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same curation method could be repeated for other CS courses or other universities to build a broader collection of benchmarks.
  • Direct comparison of LLM answers against actual student exam responses on the identical questions would test whether high model scores translate to useful educational help.
  • Breaking down errors by specific data structures topics could identify where models need targeted improvement before they are used in classrooms.
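The per-topic breakdown suggested in the last bullet could be computed along the following lines. This is a sketch under stated assumptions: the record fields `topic` and `correct`, and the topic labels themselves, are hypothetical, since the dataset's schema is not described on this page.

```python
from collections import defaultdict


def accuracy_by_topic(records):
    """Return {topic: fraction of answers graded correct}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for rec in records:
        totals[rec["topic"]] += 1
        hits[rec["topic"]] += int(rec["correct"])
    return {topic: hits[topic] / totals[topic] for topic in totals}


# Hypothetical graded answers, for illustration only (not data from the paper).
sample = [
    {"topic": "heaps", "correct": True},
    {"topic": "heaps", "correct": False},
    {"topic": "hash tables", "correct": True},
    {"topic": "AVL trees", "correct": False},
]
by_topic = accuracy_by_topic(sample)
# Sort so the weakest topics come first.
weakest_first = sorted(by_topic.items(), key=lambda kv: kv[1])
```

Sorting by accuracy puts the topics where a model struggles most at the top of the list.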

Load-bearing premise

That the selected Tel Aviv University exam questions are representative of typical university data structures courses, and that model performance on them reflects meaningful capability in computer science education.

What would settle it

A study that applies the same models to data structures exams from several other universities and finds substantially different accuracy patterns, or that shows LLM scores do not predict whether students using the models actually learn the material better.

Figures

Figures reproduced from arXiv: 2604.23347 by Abdo Amer, Adi Haviv, Amir Rubinstein, Edan Gabay, Hanoch Levy, Jonathan Stahl, Michal Kleinbort, Naama Maoz, Orr Eilat, Yael Maoz.

Figure 1: A histogram of the number of possible answers for each multiple choice…
Figure 2: Accuracy by question type. Success was measured in two granulari…
Figure 3: A histogram of the number of correct repetitions (out of 5), with CoT.
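A histogram like the one in Figure 3 could be tallied as in the sketch below: for each question, count how many of the 5 repeated runs were correct, then count questions per score. The function name, the input layout, and the toy results are assumptions made for illustration; they are not the paper's code or data.

```python
from collections import Counter

N_REPETITIONS = 5  # Figure 3 reports correct repetitions out of 5


def repetition_histogram(results):
    """Count questions by number of correct runs.

    results: maps a question id to a list of N_REPETITIONS booleans,
    one per repeated run (True means the run was graded correct).
    Returns a list where index k holds the number of questions with k correct runs.
    """
    counts = Counter()
    for qid, runs in results.items():
        if len(runs) != N_REPETITIONS:
            raise ValueError(f"{qid}: expected {N_REPETITIONS} runs, got {len(runs)}")
        counts[sum(runs)] += 1
    return [counts[k] for k in range(N_REPETITIONS + 1)]


# Hypothetical run outcomes, for illustration only.
toy = {
    "q1": [True] * 5,
    "q2": [True, False, True, True, False],
    "q3": [False] * 5,
    "q4": [True] * 5,
}
hist = repetition_histogram(toy)
```

Mass concentrated at 0 and 5 would indicate stable answers across repetitions, while mass in the middle bins would indicate run-to-run inconsistency.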
Original abstract

We present a comprehensive evaluation of Large Language Models (LLMs) on Computer Science (CS) Data Structure examination questions. Our work introduces a new benchmark dataset comprising exam questions from Tel Aviv University (TAU), curated to assess LLMs' abilities in handling closed and multiple-choice questions. We evaluated the performance of OpenAI's GPT 4o and Anthropic's Claude 3.5, popular LLMs, alongside two smaller LLMs, Mathstral 7B and LLaMA 3 8B, across the TAU exams benchmark. Our findings provide insight into the current capabilities of LLMs in CS education.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 1 minor

Summary. The paper introduces a new benchmark dataset comprising closed and multiple-choice data structures exam questions from Tel Aviv University (TAU). It evaluates four LLMs (GPT-4o, Claude 3.5, Mathstral 7B, and LLaMA 3 8B) on this benchmark and reports their performance to assess current LLM capabilities in CS education.

Significance. If the curation and evaluation protocol are sound, the work supplies a concrete, university-sourced benchmark for data-structures questions that enables direct comparison of frontier and smaller models. This empirical contribution is useful for the growing literature on LLM use in CS education, particularly because it moves beyond synthetic or textbook problems to real exam items.

minor comments (1)
  1. [Abstract] The abstract states that a benchmark was created and models were evaluated but supplies no question counts, prompting details, scoring method, or performance numbers. Adding one or two key quantitative results would make the abstract a more informative summary of the central claim.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of our work, including the recognition that the TAU data-structures benchmark supplies a useful, university-sourced resource for comparing frontier and smaller LLMs in CS education. We are pleased with the recommendation for minor revision.

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark evaluation with external data and no derivations or self-referential fits.

full rationale

The paper introduces a TAU exam dataset and reports LLM accuracies on closed/multiple-choice data-structures questions. No equations, fitted parameters, predictions, or derivation chain exist. The central claim reduces only to data collection plus consistent prompting and scoring, which are independent of any internal model or self-citation. No load-bearing self-citations, ansatzes, or renamings are present. This is the expected 0-score outcome for a straightforward empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is an empirical benchmarking study with no mathematical derivations, fitted parameters, or postulated entities; it relies only on the assumption that the selected exam questions form a valid test of LLM capability.


