pith. machine review for the scientific record.

arxiv: 2604.12881 · v1 · submitted 2026-04-14 · 💻 cs.SE

Recognition: unknown

Evaluating LLMs Code Reasoning Under Real-World Context

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 14:38 UTC · model grok-4.3

classification 💻 cs.SE
keywords code reasoning · LLM evaluation · benchmarks · Python projects · data serialization · compound types · real-world context

The pith

R2Eval tests LLMs on code reasoning by serializing compound types from real Python projects.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing code reasoning benchmarks for large language models rely on simplistic snippets or solutions limited to primitive types such as integers and strings. These choices omit the nested structures, custom classes, and project dependencies that appear in actual software. The paper introduces R2Eval as a set of 135 problems extracted from ten widely used Python projects. By serializing compound and custom types, the benchmark keeps the data complexity intact. A sympathetic reader would care because this setup offers a clearer test of whether LLMs can handle the kinds of code they would actually encounter in practice.
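
To make the serialization step concrete, the sketch below shows one way compound and custom values could be encoded so that nested structure and class identity survive into a benchmark problem. The `Span` and `Token` classes and the tagged-dict encoding are editorial assumptions for illustration, not R2Eval's actual format.

```python
# Minimal sketch: recursively encode compound and custom values so nested
# structure and class identity are preserved. Classes and encoding scheme
# are illustrative assumptions, not the paper's format.
from dataclasses import dataclass, field, is_dataclass


@dataclass
class Span:
    start: int
    end: int


@dataclass
class Token:
    text: str
    span: Span
    tags: list[str] = field(default_factory=list)


def serialize(value):
    """Encode primitives as-is, containers element-wise, and custom objects
    as a tagged dict recording their class name and fields."""
    if is_dataclass(value) and not isinstance(value, type):
        return {"__class__": type(value).__name__,
                "fields": {k: serialize(v) for k, v in vars(value).items()}}
    if isinstance(value, dict):
        return {k: serialize(v) for k, v in value.items()}
    if isinstance(value, (list, tuple, set)):
        return [serialize(v) for v in value]
    return value  # primitives pass through unchanged


if __name__ == "__main__":
    token = Token(text="eval", span=Span(3, 7), tags=["identifier"])
    print(serialize(token))
    # {'__class__': 'Token', 'fields': {'text': 'eval',
    #  'span': {'__class__': 'Span', 'fields': {'start': 3, 'end': 7}},
    #  'tags': ['identifier']}}
```

A primitive-only benchmark would flatten `Token` into a few integers and strings; keeping the tagged structure is what allows a problem to exercise reasoning over real project types.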

Core claim

We present R2Eval, a benchmark of 135 code reasoning problems drawn from ten widely used Python projects. Unlike prior work, R2Eval serializes compound and custom types, preserving real-world data complexity and enabling a more realistic assessment of LLMs.

What carries the argument

The R2Eval benchmark, which extracts problems from real Python projects and serializes compound and custom types to retain data complexity.

If this is right

  • LLM evaluations would better capture practical generalizability to code with real dependencies.
  • Models that succeed on primitive-only tests may show new failure modes when facing serialized custom objects.
  • Benchmark creators could adopt similar serialization methods to avoid oversimplification.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training data for code LLMs might benefit from greater emphasis on handling custom class instances.
  • The method could extend to other languages by applying analogous serialization for their complex types.
  • Expanding the set of source projects would allow checks on whether the current selection covers typical industry patterns.

Load-bearing premise

The 135 problems from ten widely used Python projects adequately represent the structure, dependencies, and challenges of real-world code reasoning tasks.

What would settle it

An experiment that finds no meaningful difference in LLM accuracy between R2Eval and prior benchmarks restricted to primitive types would undermine the claim that serialization of complex types is necessary for realistic evaluation.
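
A minimal sketch of that settling experiment, assuming each benchmark reduces to (prompt, expected output) pairs and that `query_model` wraps whichever LLM is under test; both names are hypothetical, not drawn from the paper.

```python
# Sketch of the settling experiment: if the accuracy gap between a
# primitive-only benchmark and R2Eval-style problems is near zero, the case
# for serializing complex types weakens. The data structures are assumed.
from typing import Callable, Sequence, Tuple

Problem = Tuple[str, str]  # (prompt asking for the program's output, expected output)


def accuracy(problems: Sequence[Problem], query_model: Callable[[str], str]) -> float:
    """Fraction of problems where the model's prediction matches exactly."""
    hits = sum(query_model(prompt).strip() == expected.strip()
               for prompt, expected in problems)
    return hits / len(problems)


def settling_gap(primitive_bench: Sequence[Problem], r2eval_bench: Sequence[Problem],
                 query_model: Callable[[str], str]) -> float:
    """Accuracy on primitive-only problems minus accuracy on serialized
    compound/custom-type problems; a gap near zero undermines the claim."""
    return accuracy(primitive_bench, query_model) - accuracy(r2eval_bench, query_model)
```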

Figures

Figures reproduced from arXiv: 2604.12881 by Changshu Liu.

Figure 1. Example of custom type variable serialization.
Figure 2. Unique and common problems each LLM succeeds in predicting their inputs and outputs.
original abstract

Code reasoning tasks are increasingly crucial to evaluating large language models (LLMs). Yet most existing benchmarks rely on simplistic, LLM-generated snippets or human-written solutions to code challenges and often restrict inputs and outputs to primitive types, failing to reflect the structure and dependencies of real-world projects. These simplifications limit their ability to measure practical generalizability. We present R2Eval1, a benchmark of 135 code reasoning problems drawn from ten widely used Python projects. Unlike prior work, R2Eval serializes compound and custom types, preserving real-world data complexity and enabling a more realistic assessment of LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces R2Eval, a benchmark of 135 code reasoning problems extracted from ten widely used Python projects. It claims to improve on prior work by serializing compound and custom types (rather than restricting to primitives), thereby preserving real-world data complexity and inter-module dependencies and enabling a more realistic evaluation of LLMs' code reasoning capabilities.

Significance. If the problem selection proves representative and the serialization step faithfully retains necessary complexities without artifacts, the benchmark could meaningfully advance evaluation standards for practical LLM code reasoning. The work correctly identifies a gap in existing benchmarks that rely on simplified or LLM-generated snippets.

major comments (2)
  1. [Abstract] Abstract: The central claim that R2Eval enables a 'more realistic assessment' because it serializes compound/custom types rests on the assumption that the 135 problems exercise non-trivial type complexity and dependencies at scale. No selection protocol, statistics on type usage, dependency depth, or coverage argument for the ten projects is supplied, leaving the representativeness claim unsupported.
  2. [Abstract] Abstract and title: The title promises an evaluation of LLMs, yet the abstract and available description contain no empirical results, baseline comparisons, or validation of the benchmark instances. Without these, the practical utility of the serialization approach cannot be assessed.
minor comments (1)
  1. [Abstract] Abstract: 'R2Eval1' appears to be a typographical error and should read 'R2Eval'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful review and constructive comments on our manuscript. We address each major comment below and will incorporate revisions to strengthen the paper.

point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that R2Eval enables a 'more realistic assessment' because it serializes compound/custom types rests on the assumption that the 135 problems exercise non-trivial type complexity and dependencies at scale. No selection protocol, statistics on type usage, dependency depth, or coverage argument for the ten projects is supplied, leaving the representativeness claim unsupported.

    Authors: We agree that the abstract would benefit from explicit support for the representativeness claim. In the revised manuscript we will add a dedicated subsection detailing the selection protocol (project popularity metrics, diversity criteria, and problem extraction process), along with quantitative statistics on type usage (proportion of compound and custom types), average and maximum dependency depths, and coverage across modules and projects. revision: yes

  2. Referee: [Abstract] Abstract and title: The title promises an evaluation of LLMs, yet the abstract and available description contain no empirical results, baseline comparisons, or validation of the benchmark instances. Without these, the practical utility of the serialization approach cannot be assessed.

    Authors: The current manuscript centers on benchmark construction, yet the title and framing present it as an evaluation of LLMs. To address this, the revision will update the abstract with a concise summary of evaluation results (including LLM performance on the 135 problems), baseline comparisons against existing benchmarks, and validation steps for the serialized instances. A results section will be added to present these findings. revision: yes
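
The type-usage statistics promised in the first response could be computed roughly as follows; the three categories and the flat list-of-values view of each problem are editorial assumptions rather than the authors' protocol.

```python
# Sketch of type-usage statistics: classify every input/output value of every
# problem as primitive, compound, or custom, then report proportions.
# The problem representation is assumed for illustration.
from collections import Counter
from typing import Any, Dict, Iterable, Sequence

PRIMITIVE = (int, float, str, bool, bytes, type(None))
COMPOUND = (list, tuple, dict, set, frozenset)


def categorize(value: Any) -> str:
    if isinstance(value, PRIMITIVE):
        return "primitive"
    if isinstance(value, COMPOUND):
        return "compound"
    return "custom"  # instances of project-defined classes


def type_usage(problem_values: Iterable[Sequence[Any]]) -> Dict[str, float]:
    """Proportion of each category over all values of all problems."""
    counts = Counter(categorize(v) for values in problem_values for v in values)
    total = sum(counts.values()) or 1
    return {kind: counts[kind] / total
            for kind in ("primitive", "compound", "custom")}
```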

Circularity Check

0 steps flagged

No circularity: benchmark construction with no derivations or self-referential reductions.

full rationale

The paper introduces R2Eval as a new benchmark of 135 problems from ten Python projects, emphasizing serialization of compound/custom types to better reflect real-world complexity. No equations, parameter fitting, predictions, or derivation chains appear in the provided text. The central claim rests on the benchmark's explicit construction choices rather than reducing to prior self-citations, fitted inputs, or renamed results. This is a standard benchmark presentation paper whose validity hinges on external representativeness and empirical evaluation, not internal circular logic.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the creation of this benchmark; it relies on one domain assumption about project representativeness but introduces no free parameters or invented entities.

axioms (1)
  • domain assumption: Problems from ten widely used Python projects represent real-world code complexity and dependencies.
    Invoked to support the claim of realistic assessment.

pith-pipeline@v0.9.0 · 5379 in / 982 out tokens · 23898 ms · 2026-05-10T14:38:02.395075+00:00 · methodology

