Copy-as-Decode: Grammar-Constrained Parallel Prefill for LLM Editing
Pith reviewed 2026-05-10 04:39 UTC · model grok-4.3
The pith
Copy-as-Decode recasts LLM editing as grammar-constrained decoding that copies input spans via single-step parallel prefill instead of full autoregressive regeneration.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Editing generation reduces to structured decoding over the grammar with primitives <copy lines=i-j/> and <gen>...</gen>; a deterministic resolver executes copy spans by one parallel-prefill KV-cache update rather than N autoregressive steps, yielding kernel speedups of 6.8x-303x on Qwen2.5 models and closed-form wall-clock bounds of 29.0x/3.4x/4.2x (13.0x pooled) when composed with each corpus's span histogram, while guaranteeing 100 percent exact-match round-trip on 482 oracle programs.
What carries the argument
Grammar-constrained FSM decoder paired with parallel-prefill copy primitive that updates the KV cache for input spans in a single forward pass.
If this is right
- Kernel-level copying of N tokens via parallel prefill is 6.8x-303x faster than autoregressive decoding for N between 8 and 512 tokens on Qwen2.5-1.5B and 7B models.
- Line-level grammar reaches 74-98 percent of gold tokens on the evaluated code-edit corpora, composing with span histograms to the stated wall-clock bounds.
- Token-level extension of the primitive lifts coverage to 91-99 percent with speed floors of 4.5x-6.5x.
- Oracle programs achieve exact round-trip on all 482 cases, confining downstream failure exclusively to span selection.
- A fine-tuning pilot on Qwen2.5-Coder-1.5B raises exact match from 0/33 to 12-17 percent on one fix dataset, indicating span selection is learnable.
Where Pith is reading between the lines
- The single-step copy primitive could be combined with existing speculative-decoding infrastructure to allow hybrid draft-and-accept pipelines that mix copied spans with generated continuations.
- Extending the grammar to cross-file references would enable multi-file editing without separate context windows.
- Attention patterns or retrieval modules could be trained to propose the copy ranges directly, turning the lossless mechanism into a practical end-to-end system.
- The same parallel-prefill copy operator might apply to retrieval-augmented generation where long passages are inserted verbatim rather than regenerated.
Load-bearing premise
That accurate copy spans can be selected at inference time, since the mechanism is lossless only with perfect spans but exact-match performance falls from 100 percent to 15.48 percent under off-by-one noise.
What would settle it
End-to-end exact-match rate on the ProbeEdit and HumanEvalPack-Fix suites when a learned span selector replaces the oracle program, compared against the reported 100 percent oracle round-trip fidelity.
Figures
read the original abstract
LLMs edit text and code by autoregressively regenerating the full output, even when most tokens appear verbatim in the input. We study Copy-as-Decode, a decoding-layer mechanism that recasts edit generation as structured decoding over a two-primitive grammar: <copy lines="i-j"/> references an input line range, <gen>...</gen> emits new content. A token-level FSM guarantees syntactic validity, and a serving-layer primitive updates the KV cache for each copy span via a single parallel-prefill forward rather than $N$ autoregressive steps -- sharing the parallel-forward kernel of speculative decoding but with input tokens as the draft and program-enforced acceptance replacing probabilistic verification. We report an upper-bound analysis that requires no end-to-end training. (i) Kernel speedup: on Qwen2.5-{1.5B, 7B}, copying $N$ tokens via parallel prefill is $6.8\times$--$303\times$ faster than autoregressive ($N \in [8, 512]$, A100 80GB bf16). (ii) Copy ceiling: on ProbeEdit and HumanEvalPack-Fix (Py/JS), $74$--$98\%$ of gold tokens are reachable under the line-level primitive; composed with the empirical kernel over each corpus's span histogram this yields a closed-form wall-clock bound of $29.0\times / 3.4\times / 4.2\times$ ($13.0\times$ pooled). A token-level extension reaches $91$--$99\%$ coverage with $4.5\times$--$6.5\times$ floors. (iii) Pipeline losslessness: oracle programs round-trip through the deterministic resolver on all $482$ cases, localizing any downstream failure to span selection rather than the mechanism. A perturbation study shows pooled EM drops from $100\%$ to $15.48\%$ under off-by-one noise. A fine-tuning pilot on Qwen2.5-Coder-1.5B lifts HEvalFix-Py EM from $0/33$ (untrained) to $12$--$17\%$, a learnability signal, not a production selector. Batched-serving integration and multi-file coverage are scoped as follow-up.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that by recasting LLM editing as grammar-constrained decoding with copy and generate primitives, and implementing copies via parallel prefill, significant speedups are possible. It supports this with direct kernel timing measurements on hardware, coverage statistics from editing benchmarks, a closed-form upper-bound calculation yielding up to 29× speedup, exhaustive proof of losslessness on 482 cases under perfect spans, and a small fine-tuning pilot showing that span selection is learnable to some degree.
Significance. If the assumption of accurate span selection holds, this technique could substantially accelerate code and text editing applications by avoiding unnecessary autoregressive steps for copied content. The use of direct measurements rather than simulations, the exhaustive oracle verification, and the parameter-free nature of the bound are notable strengths that increase confidence in the reported numbers. The work also re-purposes speculative decoding infrastructure, enhancing its practicality.
major comments (2)
- [Abstract] The wall-clock bounds of 29.0× / 3.4× / 4.2× (13.0× pooled) are presented prominently, yet they rely on the unproven ability to select accurate copy spans at inference time. The perturbation study indicates that even minor inaccuracies collapse performance to 15.48% EM, while the fine-tuning pilot only achieves 12–17% EM. This makes the bounds an idealized upper limit whose relevance depends on future progress in span selection, which should be stated more clearly to avoid overstatement.
- [Fine-tuning pilot description] The pilot experiment on Qwen2.5-Coder-1.5B is too small in scope (only 12-17% EM on one benchmark) to serve as convincing evidence that span selection is feasible at the accuracy levels required (74-98%). More details on the training procedure, data, and potential improvements would help assess whether this direction can close the gap to the oracle performance.
minor comments (2)
- The two-primitive grammar is introduced without a formal syntax definition or illustrative example early in the paper; adding this would improve accessibility.
- Ensure that all reported speedups specify the exact model variant, batch size, and precision used in the kernel measurements for full reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which helps clarify the scope of our upper-bound results. We address each major comment below and will revise the manuscript accordingly to avoid any potential overstatement while preserving the technical contributions.
read point-by-point responses
-
Referee: [Abstract] The wall-clock bounds of 29.0× / 3.4× / 4.2× (13.0× pooled) are presented prominently, yet they rely on the unproven ability to select accurate copy spans at inference time. The perturbation study indicates that even minor inaccuracies collapse performance to 15.48% EM, while the fine-tuning pilot only achieves 12–17% EM. This makes the bounds an idealized upper limit whose relevance depends on future progress in span selection, which should be stated more clearly to avoid overstatement.
Authors: We agree that the reported wall-clock bounds constitute an idealized upper limit under oracle span selection. The manuscript already frames the analysis as an 'upper-bound analysis that requires no end-to-end training,' reports the perturbation study showing the EM drop to 15.48%, and explicitly labels the fine-tuning result as 'a learnability signal, not a production selector.' To address the concern about prominence in the abstract, we will revise the abstract and introduction to state more explicitly that the speedups assume perfect spans and that realized gains will depend on advances in span selection. This revision will be made without altering the kernel measurements, coverage statistics, or closed-form bound derivation. revision: yes
-
Referee: [Fine-tuning pilot description] The pilot experiment on Qwen2.5-Coder-1.5B is too small in scope (only 12-17% EM on one benchmark) to serve as convincing evidence that span selection is feasible at the accuracy levels required (74-98%). More details on the training procedure, data, and potential improvements would help assess whether this direction can close the gap to the oracle performance.
Authors: We acknowledge that the pilot is limited in scope and does not reach the oracle coverage levels; its intent was solely to demonstrate that span selection is learnable from the untrained baseline of 0/33 EM rather than to provide a production system. In the revised manuscript we will expand the pilot section with additional details on the training procedure (including data construction, loss formulation, and hyperparameters), the exact benchmark split used, and a brief discussion of potential improvements such as scaling to larger models or incorporating the grammar constraints during fine-tuning. We agree that substantial further research is required to approach oracle accuracy. revision: partial
Circularity Check
Derivation chain is self-contained with no circular reductions
full rationale
The paper's central upper-bound analysis composes two independently obtained quantities: (1) direct wall-clock measurements of parallel-prefill kernel latency versus autoregressive decoding for varying N on A100 hardware, and (2) coverage fractions computed from span-length histograms on the ProbeEdit and HumanEvalPack-Fix corpora. These are multiplied to produce the closed-form speedups (29.0×/3.4×/4.2× pooled). No parameter is fitted to the final bound, no self-citation supplies a load-bearing uniqueness theorem or ansatz, and the oracle round-trip plus perturbation study are separate empirical checks of mechanism losslessness. The derivation therefore remains external to its own outputs.
Axiom & Free-Parameter Ledger
free parameters (1)
- corpus span-length histogram
axioms (1)
- domain assumption A single parallel prefill forward pass can correctly update the KV cache for an arbitrary input-derived copy span
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.