A Grammar of Machine Learning Workflows: Rejecting Data Leakage at Call Time

Simon Roth

arxiv: 2603.10742 · v4 · pith:QPRGIAAYnew · submitted 2026-03-11 · 💻 cs.LG

Pith reviewed 2026-05-15 12:33 UTC · model grok-4.3

classification 💻 cs.LG

keywords: data leakage, machine learning workflows, grammar, typed primitives, directed acyclic graph, assessment gate, workflow constraints, train-test separation

The pith

A grammar of eight typed primitives, a directed acyclic graph, and four hard constraints makes the most damaging forms of data leakage structurally unrepresentable in machine learning workflows.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a formal grammar for describing machine learning workflows that prevents data leakage by making it impossible to express the dangerous patterns. It combines eight typed primitives into a directed acyclic graph, applies four hard constraints, and adds a terminal assessment gate that enforces a strict training-to-evaluation boundary at call time. This addresses the fact that leakage has appeared in 648 published papers across thirty fields even though the underlying knowledge for prevention is already known. The grammar comes with a precise specification and two reference implementations, plus supporting measurements from a landscape study of 2,047 datasets.

Core claim

The central claim is that a grammar consisting of eight typed primitives arranged in a directed acyclic graph and subject to four hard constraints, together with a terminal assessment gate, renders the most damaging leakage types unrepresentable by construction. The terminal gate is the first call-time-enforced evaluate/assess boundary in an ML framework, and the specification is written precisely enough to support independent reimplementation. A companion empirical study across 2,047 datasets grounds the choice of constraints in observed effect sizes.
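
As a rough illustration of what a call-time-enforced evaluate/assess boundary could look like, here is a minimal Python sketch of a gate that lets the held-out set be scored exactly once. The names `HoldoutGate`, `assess`, and `LeakageError` are hypothetical and are not taken from the paper's reference implementations; the accuracy metric is also only a stand-in.

```python
class LeakageError(RuntimeError):
    """Raised when a call would cross the evaluate/assess boundary a second time."""


class HoldoutGate:
    """Hypothetical terminal gate: the held-out labels can be scored exactly once."""

    def __init__(self, holdout_labels):
        self._labels = list(holdout_labels)
        self._assessed = False

    def assess(self, predictions):
        # The check runs when assess() is called, so a repeat call fails immediately.
        if self._assessed:
            raise LeakageError("assess() was already called on this holdout set")
        predictions = list(predictions)
        if len(predictions) != len(self._labels):
            raise ValueError("predictions and labels differ in length")
        self._assessed = True
        hits = sum(p == y for p, y in zip(predictions, self._labels))
        return hits / len(self._labels)


gate = HoldoutGate([1, 0, 1, 1])
accuracy = gate.assess([1, 0, 0, 1])  # first call succeeds

try:
    gate.assess([1, 0, 1, 1])  # second call is rejected at call time
    second_call_rejected = False
except LeakageError:
    second_call_rejected = True
```

The point of the sketch is only that the rejection happens when the offending call is made, rather than in documentation or in a later audit. How the paper's gate tracks state and what it treats as a repeat assessment are not specified on this page.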

What carries the argument

The terminal assessment gate carries the argument. It enforces a call-time boundary between the training and assessment phases, and it is supported by the eight typed primitives, the directed acyclic graph structure, and the four hard constraints, which together make the targeted leakage patterns impossible to write down.

If this is right

Any workflow expressed in the grammar cannot mix training and test information in the ways that produce leakage.
The terminal assessment gate separates training from final evaluation at the moment of execution, not merely in documentation.
The precise specification allows independent reimplementation in new languages or frameworks.
Reference Python and R implementations demonstrate that the constraints can be checked automatically.
The constraints are calibrated against measured effect sizes from more than two thousand real datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Tool builders could embed the grammar as a native workflow language to catch leakage before experiments run.
The same structural approach might be adapted to prevent analogous integrity failures in data pipelines outside machine learning.
Adoption would let reviewers verify leakage safety from the workflow description alone rather than inspecting raw code.
The grammar could serve as a teaching device that makes leakage prevention a matter of syntax rather than vigilance.

Load-bearing premise

The four hard constraints and the terminal assessment gate are sufficient to block all damaging leakage types while remaining practical for real workflows and enforceable at call time.

What would settle it

A concrete workflow specification that satisfies the eight primitives, the directed acyclic graph, and all four constraints yet still permits one of the known damaging leakage types, such as using future information to predict past outcomes in a time-series task.
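
To probe a candidate counter-example of the time-series kind, one would need a way to detect look-ahead in a split. The following Python sketch assumes rows are `(timestamp, value)` pairs; `temporal_split` and `has_lookahead` are hypothetical helper names for illustration, not part of the paper's grammar.

```python
def temporal_split(rows, cutoff):
    """Split (timestamp, value) rows so that training strictly precedes the cutoff."""
    train = [r for r in rows if r[0] < cutoff]
    test = [r for r in rows if r[0] >= cutoff]
    return train, test


def has_lookahead(train, test):
    """Return True when any training row is at or after the earliest test row."""
    if not train or not test:
        return False
    return max(t for t, _ in train) >= min(t for t, _ in test)


rows = [(1, 0.3), (2, 0.5), (3, 0.4), (4, 0.9), (5, 0.7), (6, 0.8)]

# An interleaved split mixes future rows into training.
interleaved_leaks = has_lookahead(rows[::2], rows[1::2])

# A cutoff-based split keeps all training rows before all test rows.
train, test = temporal_split(rows, cutoff=5)
temporal_leaks = has_lookahead(train, test)
```

A workflow that passes every constraint of the grammar while a check like `has_lookahead` still returns `True` would be the kind of evidence described above.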

Original abstract

Data leakage has been identified in 648 published papers across 30 scientific fields. The knowledge to prevent it has existed for over a decade; the problem persists because the tools do not enforce what the textbooks teach. This paper presents a grammar (eight typed primitives connected by a directed acyclic graph with four hard constraints) that makes the most damaging leakage types structurally unrepresentable within the grammar's scope. The core mechanism is a terminal assessment gate: the first call-time-enforced evaluate/assess boundary documented in the peer-reviewed ML methodology literature (to my knowledge, as of May 2026), backed by a specification precise enough for independent reimplementation. A companion landscape study across 2,047 datasets grounds the constraints in measured effect sizes. Two reference implementations (Python, R) are available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

A grammar that blocks leakage by design through typed primitives, a DAG, and hard constraints is worth referee time, but the four constraints need explicit formalization to confirm they cover all cases.

This paper puts forward a grammar for ML workflows built from eight typed primitives in a DAG, four hard constraints, and a terminal assessment gate that is meant to make damaging data leakage impossible to express at all. The claim is backed by a landscape study of 2,047 datasets and two reference implementations in Python and R, which is the part that feels most concrete and useful right now. Those implementations and the measured effect sizes give the idea something to stand on beyond pure description.

The central mechanism is the call-time gate that enforces the boundary between training and assessment, which directly targets the leakage patterns catalogued in the 648 papers. That is a practical framing even if the underlying leakage problem is already well known. The soft spot is that the abstract and available description do not spell out the exact semantics or verification method for the four constraints, so it is not yet clear whether they truly close every vector before the gate or whether some pre-gate transformations could still slip through. The stress-test note on scoping and global statistics is a reasonable question that the full text would need to address with examples or proofs.

This is for people who build or maintain ML pipelines and want structural guardrails rather than relying on documentation alone. A reader working on reproducibility tools or framework design would find the implementations worth examining. I would send it to peer review because the constructive approach and the reference code give referees something specific to evaluate and improve.

Referee Report

3 major / 1 minor

Summary. The paper proposes a grammar for machine learning workflows consisting of eight typed primitives arranged in a directed acyclic graph subject to four hard constraints, enforced by a terminal assessment gate that creates the first call-time evaluate/assess boundary. This structure is claimed to render the most damaging data-leakage types structurally unrepresentable. The design is supported by a landscape study across 2,047 datasets that grounds the constraints in measured effect sizes, together with reference implementations in Python and R.

Significance. If the four constraints and terminal gate provably block all leakage vectors identified in the 648-paper corpus while remaining enforceable at call time, the work would supply a practical, reimplementable specification that directly addresses a documented reproducibility failure mode across scientific fields. The provision of two reference implementations and the empirical grounding on a large dataset collection are concrete strengths that would increase adoption potential.

major comments (3)

[Grammar and Constraints] The central claim that the four hard constraints together with the DAG and primitives make all damaging leakage types unrepresentable is load-bearing; however, the manuscript does not provide an exhaustive mapping from the leakage types catalogued in the 648-paper study to the specific paths blocked by each constraint (see the grammar definition and constraint sections). Without this mapping or a formal argument that no pre-gate transformation can encode test information, the exhaustiveness of the constraint set remains unverified.
[Terminal Assessment Gate] The terminal assessment gate is presented as the enforcement point, yet the precise placement and semantics of the gate relative to the eight primitives are not shown to block leakage vectors that operate via global statistics or ordering computed before the gate (see the section describing the terminal assessment gate and its interaction with the DAG). A concrete counter-example workflow that satisfies the grammar yet permits leakage would falsify the claim.
[Landscape Study] The landscape study across 2,047 datasets is used to ground effect sizes, but the manuscript does not demonstrate how the measured effect sizes directly validate the completeness of the four-constraint set; the study appears to quantify prevalence rather than test whether the grammar blocks every observed leakage pattern (see the empirical grounding section).
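
The global-statistics vector raised in the gate comment can be made concrete with a small Python sketch. It uses toy numbers and hypothetical helpers (`fit_scaler`, `transform`) to show how a normalization fitted before the split absorbs test information, compared with one fitted on training data only. It illustrates the leakage pattern itself, not the paper's mechanism for blocking it.

```python
from statistics import mean, pstdev

train = [1.0, 2.0, 3.0, 4.0]
test = [10.0, 12.0]


def fit_scaler(values):
    """Return the (mean, population std) used to standardize values."""
    return mean(values), pstdev(values)


def transform(values, params):
    """Standardize values with previously fitted parameters."""
    mu, sigma = params
    return [(v - mu) / sigma for v in values]


# Leaky: statistics are computed over train and test together.
leaky_params = fit_scaler(train + test)

# Clean: statistics come from the training rows only.
clean_params = fit_scaler(train)

leaky_train = transform(train, leaky_params)
clean_train = transform(train, clean_params)
```

Here `leaky_params` has a mean of about 5.33 against 2.5 for `clean_params`, so every training feature the model sees has been shifted by the test rows. Whether the grammar's typing of pre-gate primitives rules this out is exactly what the comment asks the manuscript to show.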

minor comments (1)

[Abstract and Implementation] The abstract states that the specification is 'precise enough for independent reimplementation,' yet the main text should include an explicit, self-contained listing of the eight primitives, their type signatures, and the four constraint predicates to facilitate such reimplementation.
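
A self-contained listing of the kind requested might start from something like the Python sketch below. The eight operation names are those quoted from the paper on this page (split, cv, prepare, fit, predict, evaluate, explain, assess). Everything else is an assumption made for illustration: the `Op` enum, the `validate` function, and the two rules it checks (acyclicity, and a single terminal `assess` node) are not the paper's four constraints.

```python
from enum import Enum
from graphlib import CycleError, TopologicalSorter


class Op(Enum):
    SPLIT = "split"
    CV = "cv"
    PREPARE = "prepare"
    FIT = "fit"
    PREDICT = "predict"
    EVALUATE = "evaluate"
    EXPLAIN = "explain"
    ASSESS = "assess"


def validate(nodes, edges):
    """Return a list of violations for a workflow graph.

    nodes maps a node id to its Op; edges is a list of (source, target) pairs.
    The rules checked here are illustrative assumptions only.
    """
    problems = []
    predecessors = {n: set() for n in nodes}
    for src, dst in edges:
        predecessors[dst].add(src)
    try:
        # static_order() raises CycleError when the graph is not acyclic.
        tuple(TopologicalSorter(predecessors).static_order())
    except CycleError:
        problems.append("graph has a cycle")
    assess_nodes = [n for n, op in nodes.items() if op is Op.ASSESS]
    if len(assess_nodes) != 1:
        problems.append("workflow must contain exactly one assess node")
    elif any(src == assess_nodes[0] for src, _ in edges):
        problems.append("assess must be terminal")
    return problems


nodes = {"s": Op.SPLIT, "p": Op.PREPARE, "f": Op.FIT, "a": Op.ASSESS}
edges = [("s", "p"), ("p", "f"), ("f", "a")]

ok = validate(nodes, edges)
bad = validate(nodes, edges + [("a", "f")])  # feeds assess output back into fit
```

A real listing would also need the input and output types of each primitive, which this page does not reproduce.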

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important areas where the manuscript's claims on exhaustiveness require stronger explicit support. We address each point below and will incorporate revisions to clarify the mappings, semantics, and empirical links.

Referee: [Grammar and Constraints] The central claim that the four hard constraints together with the DAG and primitives make all damaging leakage types unrepresentable is load-bearing; however, the manuscript does not provide an exhaustive mapping from the leakage types catalogued in the 648-paper study to the specific paths blocked by each constraint (see the grammar definition and constraint sections). Without this mapping or a formal argument that no pre-gate transformation can encode test information, the exhaustiveness of the constraint set remains unverified.

Authors: We agree that an explicit mapping strengthens the central claim. The four constraints were derived directly from analyzing the dominant leakage patterns across the 648-paper corpus. In the revised manuscript we will add a table that maps each catalogued leakage type to the precise constraint(s) and primitive typing rules that render it unrepresentable. We will also include a short formal argument, based on the type system, showing that no pre-gate transformation can encode test information because assessment labels are inaccessible until the terminal gate. revision: yes
Referee: [Terminal Assessment Gate] The terminal assessment gate is presented as the enforcement point, yet the precise placement and semantics of the gate relative to the eight primitives are not shown to block leakage vectors that operate via global statistics or ordering computed before the gate (see the section describing the terminal assessment gate and its interaction with the DAG). A concrete counter-example workflow that satisfies the grammar yet permits leakage would falsify the claim.

Authors: The gate is the final node in every valid DAG; all eight primitives are typed so that assessment data remains opaque until this point. Global statistics or ordering computed earlier cannot incorporate test labels because the primitive signatures forbid it. We will revise the gate section to give a precise operational semantics and show, via the reference implementations, that any attempt to compute such statistics fails the type checker. Because the grammar is designed to make such workflows invalid, no satisfying counter-example exists; we will add a short proof sketch demonstrating this. revision: yes
Referee: [Landscape Study] The landscape study across 2,047 datasets is used to ground effect sizes, but the manuscript does not demonstrate how the measured effect sizes directly validate the completeness of the four-constraint set; the study appears to quantify prevalence rather than test whether the grammar blocks every observed leakage pattern (see the empirical grounding section).

Authors: The study measures effect sizes of leakage patterns that actually occur in unconstrained workflows. The four constraints were chosen precisely to block the high-effect-size patterns identified there. We will revise the empirical grounding section to add an explicit cross-reference showing that every high-impact leakage pattern quantified in the 2,047-dataset collection is rendered unrepresentable by at least one constraint, thereby linking the measured effect sizes directly to the completeness argument. revision: partial
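
The rebuttal's claim that assessment data stays opaque until the gate, and that pre-gate operations reject it, can be pictured with the following Python sketch. `Train`, `Holdout`, and `prepare` are hypothetical stand-ins written for this illustration; the check is a runtime `isinstance` test, and Python's underscore convention gives only nominal privacy, so this is weaker than a real type system.

```python
class Train:
    """Training partition: values are readable by pre-gate operations."""

    def __init__(self, values):
        self.values = list(values)


class Holdout:
    """Opaque handle: no public attribute exposes the held-out values."""

    __slots__ = ("_values",)

    def __init__(self, values):
        self._values = tuple(values)


def prepare(data):
    """Compute a centering statistic; accepts training data only."""
    if not isinstance(data, Train):
        raise TypeError(f"prepare() accepts Train, got {type(data).__name__}")
    return sum(data.values) / len(data.values)


center = prepare(Train([1.0, 2.0, 3.0]))

try:
    prepare(Holdout([10.0, 12.0]))  # rejected when the call is made
    rejected = False
except TypeError:
    rejected = True
```

A sketch like this shows the intended behavior for a single primitive. It does not establish that no combination of primitives can route held-out information around the check, which is what the promised proof sketch would have to cover.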

Circularity Check

0 steps flagged

Constructive grammar definition exhibits no circularity

The paper proposes a grammar of eight typed primitives, a DAG, four hard constraints, and a terminal assessment gate that renders damaging leakage structurally unrepresentable by definition. No equations, fitted parameters, or derivations are present. The landscape study across 2,047 datasets is cited only for grounding effect sizes and is external to the grammar construction itself. No self-citation chains, ansatzes, or reductions of predictions to inputs occur. The central claim is therefore self-contained as a design specification rather than a derivation that collapses to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Abstract-only view limits visibility; the grammar itself is the main addition, resting on the assumption that structural constraints can be enforced in practice.

axioms (1)

domain assumption: The specification is precise enough for independent reimplementation.
Stated directly in the abstract as a property of the grammar.

invented entities (1)

terminal assessment gate (no independent evidence)
purpose: Enforce the first call-time evaluate/assess boundary
New construct introduced to make leakage unrepresentable at runtime.

pith-pipeline@v0.9.0 · 5386 in / 1171 out tokens · 50990 ms · 2026-05-15T12:33:42.415116+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/DimensionForcing.lean (and AlexanderDuality.lean) · reality_from_one_distinction; 8-tick period emergence
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: Eight operations. That is the entire grammar... split, cv, prepare, fit, predict, evaluate, explain, assess

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: terminal assessment gate... assess-once constraint... four hard constraints

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.