Assessing Coherency and Consistency of Code Execution Reasoning by Large Language Models

2025 · cs.SE · arXiv 2510.15079

3 Pith papers cite this work. Polarity classification is still indexing.

abstract

This paper proposes CES, a task to evaluate the abilities of LLMs in simulating program execution and using that reasoning in programming tasks. Besides measuring the correctness of variable predictions during execution simulation, CES introduces the notion of coherence to determine whether the simulation complies with commonsense execution logic, even if the values predicted along the simulation are incorrect. This enables CES to rule out suspiciously correct output predictions due to reasoning shortcuts, hallucinations, or potential data leakage. CES also introduces a novel metric to measure reasoning consistency across tests with the same or different prime path coverage, on a spectrum of strong, weak, and random. Evaluating 16 LLMs (including three reasoning LLMs) using CES indicates 81.42% coherent execution simulation on HumanEval, 46.92% and 53.08% of which result in correct and incorrect output predictions, respectively. Frontier LLMs such as GPT-4 and DeepSeek-R1 have the most incoherent execution reasoning, mostly due to natural language shortcuts. Despite relatively coherent execution simulation, LLMs' reasoning performance across different tests is inconsistent, mostly random (48.87%) or weak (45.37%), potentially explaining their weakness in programming tasks that require path-sensitive program analysis to succeed. We also compare CES with bug prediction, localization, and repair, which intuitively require control- and data-flow awareness. We observe that LLMs barely incorporate execution reasoning into their analysis for bug-related tasks, and their success is primarily due to inherent abilities in pattern matching or natural language shortcuts, if not data leakage. Without such reasoning, the ability of LLMs to generalize to unseen bugs or patterns in different contexts is at risk. CES can be used to systematically vet the suspicious success of LLMs in these tasks.
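The headline percentages in the abstract are nested, which can be easy to misread. The short sketch below works through the arithmetic under one assumed reading: that 46.92% and 53.08% are shares of the coherent simulations, not of all simulations. The variable and function names are illustrative and do not come from the paper.

```python
# Percentages quoted in the abstract.
coherent = 81.42                  # coherent execution simulations on HumanEval
correct_given_coherent = 46.92    # ASSUMPTION: share of the coherent simulations
incorrect_given_coherent = 53.08  # ASSUMPTION: share of the coherent simulations


def share_of_all(outer_pct: float, inner_pct: float) -> float:
    """Convert a nested percentage into a percentage of all simulations."""
    return round(outer_pct * inner_pct / 100, 2)


coherent_correct = share_of_all(coherent, correct_given_coherent)
coherent_incorrect = share_of_all(coherent, incorrect_given_coherent)
incoherent = round(100 - coherent, 2)

print(f"coherent and correct:   {coherent_correct:.2f}% of all simulations")
print(f"coherent but incorrect: {coherent_incorrect:.2f}% of all simulations")
print(f"incoherent:             {incoherent:.2f}% of all simulations")
```

Under that reading, roughly 38.20% of all simulations are both coherent and correct, about 43.22% are coherent but end in an incorrect output prediction, and the remaining 18.58% are incoherent.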

representative citing papers

The Path Not Taken: Duality in Reasoning about Program Execution

cs.LG · 2026-04-22 · unverdicted · novelty 7.0

DexBench introduces paired forward and backward reasoning tasks to measure LLMs' dynamic understanding of program execution more discriminatively than prior benchmarks.

Evaluating LLMs Code Reasoning Under Real-World Context

cs.SE · 2026-04-14 · unverdicted · novelty 7.0

R2Eval is a new benchmark with 135 real-world code reasoning problems from Python projects that preserves complex data structures for more realistic LLM evaluation.

Evaluating Code Reasoning Abilities of Large Language Models Under Real-World Settings

cs.SE · 2025-12-16 · unverdicted · novelty 7.0

A new dataset and nine-metric majority-vote procedure show that existing code-reasoning benchmarks are dominated by lower-complexity problems that do not reflect real-world code.
