Self-Execution Simulation Improves Coding Models

2026 · cs.CL · arXiv 2604.03253

1 Pith paper cites this work. Polarity classification is still indexing.

abstract

A promising research direction in enabling LLMs to generate consistently correct code involves addressing their inability to properly estimate program execution, particularly for code they generate. In this work, we demonstrate that Code LLMs can be trained to simulate program execution in a step-by-step manner and that this capability can be leveraged to improve competitive programming performance. Our approach combines supervised fine-tuning on natural language execution traces, textual explanations grounded in true execution, with reinforcement learning using verifiable rewards. We introduce two complementary objectives: output prediction given code and inputs, and solving competitive programming tasks with either ground-truth or self-predicted execution feedback. These objectives enable models to perform self-verification over multiple candidate solutions, and iterative self-fixing by simulating test execution. Across multiple competitive programming benchmarks, our method yields consistent improvements over standard reasoning approaches. We further present ablations and analysis to elucidate the role of execution simulation and its limitations.
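The self-verification step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `score_candidate`, `select_candidate`, and `toy_predict` are hypothetical, and the toy predictor actually runs the code, whereas in the paper's setting a trained model would simulate execution instead. The sketch only shows the selection logic, assuming each candidate is scored by how many test outputs the predictor expects it to match.

```python
from typing import Callable, List, Tuple

# Hypothetical types: a test is an (input, expected_output) pair of strings.
Test = Tuple[str, str]
# Hypothetical stand-in for the model's execution simulator:
# (code, input) -> predicted output.
Predictor = Callable[[str, str], str]


def score_candidate(code: str, tests: List[Test], predict: Predictor) -> int:
    """Count the tests whose predicted output matches the expected output."""
    return sum(
        predict(code, inp).strip() == expected.strip() for inp, expected in tests
    )


def select_candidate(
    candidates: List[str], tests: List[Test], predict: Predictor
) -> str:
    """Return the candidate with the most predicted passes (first wins ties)."""
    return max(candidates, key=lambda code: score_candidate(code, tests, predict))


def toy_predict(code: str, inp: str) -> str:
    """Illustration only: really executes the code rather than simulating it."""
    scope: dict = {}
    exec(code, scope)  # defines a `solve` function in `scope`
    return str(scope["solve"](inp))


# Two hypothetical candidate solutions for "double the input number".
candidates = [
    "def solve(s):\n    return int(s) + 1",
    "def solve(s):\n    return int(s) * 2",
]
tests = [("3", "6"), ("5", "10")]
best = select_candidate(candidates, tests, toy_predict)
```

Swapping `toy_predict` for a call to a model that predicts outputs step by step would turn this into selection by self-predicted execution feedback; the same scoring function could also drive an iterative fixing loop by reporting which tests are predicted to fail.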

representative citing papers

Extrapolative Weight Averaging Reveals Correctness-Efficiency Frontiers in Code RL

cs.LG · 2026-05-27 · conditional · novelty 7.0

Extrapolative weight averaging of RL checkpoints trained under nested unit-test coverage extends a correctness-efficiency frontier and boosts ensemble pass rates in code generation across model scales and inference modes.

