An empirical study of 57 ML evaluation harnesses shows 41.4% of operational issues occur in the specification stage, driven mainly by unimplemented features, documentation gaps, and missing input validation.
Why Code, Why Now: An Information-Theoretic Perspective on the Limits of Machine Learning
1 Pith paper cite this work. Polarity classification is still indexing.
abstract
This paper offers a new perspective on the limits of machine learning: the ceiling on progress is set not by model size or algorithm choice but by the information structure of the task itself. Code generation has progressed more reliably than reinforcement learning, largely because code provides dense, local, verifiable feedback at every token, whereas most reinforcement learning problems do not. This difference in feedback quality is not binary but graded. We propose a five-level hierarchy of learnability based on information structure and argue that diagnosing a task's position in this hierarchy is more predictive of scaling outcomes than any property of the model. The hierarchy rests on a formal distinction among three properties of computational problems (expressibility, computability, and learnability). We establish their pairwise relationships, including where implications hold and where they fail, and present a unified template that makes the structural differences explicit. The analysis suggests why supervised learning on code scales predictably while reinforcement learning does not, and why the common assumption that scaling alone will solve remaining ML challenges warrants scrutiny.
fields
cs.SE 1years
2026 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
Towards Evaluation Engineering: An Empirical Study of ML Evaluation Harnesses in the Wild
An empirical study of 57 ML evaluation harnesses shows 41.4% of operational issues occur in the specification stage, driven mainly by unimplemented features, documentation gaps, and missing input validation.