pith. sign in

arxiv: 2604.18170 · v2 · pith:HZLVOKH7new · submitted 2026-04-20 · 💻 cs.CL · cs.AI

Copy-as-Decode: Grammar-Constrained Parallel Prefill for LLM Editing

Pith reviewed 2026-05-10 04:39 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords LLM editingcopy mechanismparallel prefillgrammar-constrained decodingcode editingKV cache optimizationspeculative decodingstructured decoding
0
0 comments X

The pith

Copy-as-Decode recasts LLM editing as grammar-constrained decoding that copies input spans via single-step parallel prefill instead of full autoregressive regeneration.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Copy-as-Decode to handle text and code edits without regenerating verbatim tokens. It defines a two-primitive grammar where copy operations reference input line ranges and generate operations produce new content, enforced by a token-level finite-state machine. Copies execute through a serving-layer primitive that performs one parallel prefill forward pass to update the KV cache, sharing the kernel with speculative decoding but replacing probabilistic checks with program-enforced acceptance. Upper-bound analysis without training shows 6.8x to 303x kernel speedups on spans of 8 to 512 tokens, and corpus-level coverage of 74 to 98 percent of gold tokens composes to closed-form wall-clock bounds of 29.0x, 3.4x, and 4.2x across datasets with a 13.0x pooled figure. The mechanism is lossless under perfect spans, with all 482 oracle programs round-tripping exactly, while a perturbation study isolates errors to span selection.

Core claim

Editing generation reduces to structured decoding over the grammar with primitives <copy lines=i-j/> and <gen>...</gen>; a deterministic resolver executes copy spans by one parallel-prefill KV-cache update rather than N autoregressive steps, yielding kernel speedups of 6.8x-303x on Qwen2.5 models and closed-form wall-clock bounds of 29.0x/3.4x/4.2x (13.0x pooled) when composed with each corpus's span histogram, while guaranteeing 100 percent exact-match round-trip on 482 oracle programs.

What carries the argument

Grammar-constrained FSM decoder paired with parallel-prefill copy primitive that updates the KV cache for input spans in a single forward pass.

If this is right

  • Kernel-level copying of N tokens via parallel prefill is 6.8x-303x faster than autoregressive decoding for N between 8 and 512 tokens on Qwen2.5-1.5B and 7B models.
  • Line-level grammar reaches 74-98 percent of gold tokens on the evaluated code-edit corpora, composing with span histograms to the stated wall-clock bounds.
  • Token-level extension of the primitive lifts coverage to 91-99 percent with speed floors of 4.5x-6.5x.
  • Oracle programs achieve exact round-trip on all 482 cases, confining downstream failure exclusively to span selection.
  • A fine-tuning pilot on Qwen2.5-Coder-1.5B raises exact match from 0/33 to 12-17 percent on one fix dataset, indicating span selection is learnable.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The single-step copy primitive could be combined with existing speculative-decoding infrastructure to allow hybrid draft-and-accept pipelines that mix copied spans with generated continuations.
  • Extending the grammar to cross-file references would enable multi-file editing without separate context windows.
  • Attention patterns or retrieval modules could be trained to propose the copy ranges directly, turning the lossless mechanism into a practical end-to-end system.
  • The same parallel-prefill copy operator might apply to retrieval-augmented generation where long passages are inserted verbatim rather than regenerated.

Load-bearing premise

That accurate copy spans can be selected at inference time, since the mechanism is lossless only with perfect spans but exact-match performance falls from 100 percent to 15.48 percent under off-by-one noise.

What would settle it

End-to-end exact-match rate on the ProbeEdit and HumanEvalPack-Fix suites when a learned span selector replaces the oracle program, compared against the reported 100 percent oracle round-trip fidelity.

Figures

Figures reproduced from arXiv: 2604.18170 by Ziyang Liu.

Figure 1
Figure 1. Figure 1: Mechanism comparison. (a) Standard autoregressive decoding spends one forward pass per output token. (b) Speculative decoding uses a small draft model to propose k tokens that the target model verifies in one parallel forward; tokens are accepted probabilistically. (c) Copy-as-Decode degenerates the draft to the input document: a grammar-constrained decoder commits to a specific <copy lines="i-j"/> span, a… view at source ↗
Figure 2
Figure 2. Figure 2: A Copy-as-Decode program on a concrete case. The input document D (left) carries line numbers 1–n. The constrained decoder (§2.3) emits a program P (middle) composed of <copy> ops (green) that reference line ranges in D and <gen> ops (pink) that carry new free text. The deterministic resolver R (§2.2) expands P against D to yield D′ (right); the color bar on each output line indicates its origin. Copy ops … view at source ↗
Figure 3
Figure 3. Figure 3: Per-span kernel speedup of one parallel-prefill forward vs. N autoregressive decode steps on identical hardware (A100 80GB, bf16, 1024-token KV prefix). Median of 5–7 trials per configuration. (a) Absolute wall-clock per span. (b) Speedup s(N); the dashed line is the theoretical overhead-dominated linear regime. Full wall-clock table and per-trial dispersion in Appendix C. 2 0 2 1 2 2 2 3 2 4 2 5 Minimum c… view at source ↗
Figure 4
Figure 4. Figure 4: Copy ceiling. (a) Fraction of gold-output tokens realizable as copy spans, aggregated per corpus (Qwen2.5 tokenizer). Solid markers: token-level greedy cover at minimum span length m (aspirational upper bound, not current primitive). Horizontal dashed: line-level oracle cover (the primitive as defined). (b) Theoretical end-to-end speedup ceiling 1/((1 − f) + f /s(m)) using the measured 7B kernel speedups s… view at source ↗
Figure 5
Figure 5. Figure 5: Format-only vLLM latency ratio across batch sizes. Mean-latency ratio LFULL/LCAD on vLLM 0.19 continuous batching, Qwen2.5-Coder-1.5B-Instruct, RTX 3090 ( [PITH_FULL_IMAGE:figures/full_fig_p011_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Oracle EM collapses sharply under small index perturbation. Trimmed EM across 5 trials per case, line-level CopyLines. The ϵ=0 → ϵ=1 drop quantifies the span-endpoint precision a trained selector must clear. HEvalFix-Py plateaus higher because many Python fixes are single-line and the document-edge clamp absorbs noise. Span-selection brittleness. Perfect line-level span selection round-trips to gold on all… view at source ↗
Figure 7
Figure 7. Figure 7: Grammar-enforcing FSM. Five states control the decoder: AWAIT_OP (op boundary), AFTER_LT (op-type choice), COPY_LINES (line-index digits), GEN_BODY (free generation), and CLOSE (program terminator). Each state carries an allowed-token mask (italic annotations); the decoder’s logits are zeroed outside the mask. Structural literals (/>, </gen>, </program>) are force-prefilled in a single parallel forward rat… view at source ↗
Figure 8
Figure 8. Figure 8: Selector pilot. Mean over 3 seeds with min–max whiskers; 7B QLoRA single-seed. EM climbs monotonically (0 → 14.1 → 17.2 → 18.2%); the dominant driver at 7B is span-endpoint accuracy (end acc. +15.3 pp over 1.5B Combined), consistent with the perturbation diagnosis [PITH_FULL_IMAGE:figures/full_fig_p027_8.png] view at source ↗
read the original abstract

LLMs edit text and code by autoregressively regenerating the full output, even when most tokens appear verbatim in the input. We study Copy-as-Decode, a decoding-layer mechanism that recasts edit generation as structured decoding over a two-primitive grammar: <copy lines="i-j"/> references an input line range, <gen>...</gen> emits new content. A token-level FSM guarantees syntactic validity, and a serving-layer primitive updates the KV cache for each copy span via a single parallel-prefill forward rather than $N$ autoregressive steps -- sharing the parallel-forward kernel of speculative decoding but with input tokens as the draft and program-enforced acceptance replacing probabilistic verification. We report an upper-bound analysis that requires no end-to-end training. (i) Kernel speedup: on Qwen2.5-{1.5B, 7B}, copying $N$ tokens via parallel prefill is $6.8\times$--$303\times$ faster than autoregressive ($N \in [8, 512]$, A100 80GB bf16). (ii) Copy ceiling: on ProbeEdit and HumanEvalPack-Fix (Py/JS), $74$--$98\%$ of gold tokens are reachable under the line-level primitive; composed with the empirical kernel over each corpus's span histogram this yields a closed-form wall-clock bound of $29.0\times / 3.4\times / 4.2\times$ ($13.0\times$ pooled). A token-level extension reaches $91$--$99\%$ coverage with $4.5\times$--$6.5\times$ floors. (iii) Pipeline losslessness: oracle programs round-trip through the deterministic resolver on all $482$ cases, localizing any downstream failure to span selection rather than the mechanism. A perturbation study shows pooled EM drops from $100\%$ to $15.48\%$ under off-by-one noise. A fine-tuning pilot on Qwen2.5-Coder-1.5B lifts HEvalFix-Py EM from $0/33$ (untrained) to $12$--$17\%$, a learnability signal, not a production selector. Batched-serving integration and multi-file coverage are scoped as follow-up.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that by recasting LLM editing as grammar-constrained decoding with copy and generate primitives, and implementing copies via parallel prefill, significant speedups are possible. It supports this with direct kernel timing measurements on hardware, coverage statistics from editing benchmarks, a closed-form upper-bound calculation yielding up to 29× speedup, exhaustive proof of losslessness on 482 cases under perfect spans, and a small fine-tuning pilot showing that span selection is learnable to some degree.

Significance. If the assumption of accurate span selection holds, this technique could substantially accelerate code and text editing applications by avoiding unnecessary autoregressive steps for copied content. The use of direct measurements rather than simulations, the exhaustive oracle verification, and the parameter-free nature of the bound are notable strengths that increase confidence in the reported numbers. The work also re-purposes speculative decoding infrastructure, enhancing its practicality.

major comments (2)
  1. [Abstract] The wall-clock bounds of 29.0× / 3.4× / 4.2× (13.0× pooled) are presented prominently, yet they rely on the unproven ability to select accurate copy spans at inference time. The perturbation study indicates that even minor inaccuracies collapse performance to 15.48% EM, while the fine-tuning pilot only achieves 12–17% EM. This makes the bounds an idealized upper limit whose relevance depends on future progress in span selection, which should be stated more clearly to avoid overstatement.
  2. [Fine-tuning pilot description] The pilot experiment on Qwen2.5-Coder-1.5B is too small in scope (only 12-17% EM on one benchmark) to serve as convincing evidence that span selection is feasible at the accuracy levels required (74-98%). More details on the training procedure, data, and potential improvements would help assess whether this direction can close the gap to the oracle performance.
minor comments (2)
  1. The two-primitive grammar is introduced without a formal syntax definition or illustrative example early in the paper; adding this would improve accessibility.
  2. Ensure that all reported speedups specify the exact model variant, batch size, and precision used in the kernel measurements for full reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which helps clarify the scope of our upper-bound results. We address each major comment below and will revise the manuscript accordingly to avoid any potential overstatement while preserving the technical contributions.

read point-by-point responses
  1. Referee: [Abstract] The wall-clock bounds of 29.0× / 3.4× / 4.2× (13.0× pooled) are presented prominently, yet they rely on the unproven ability to select accurate copy spans at inference time. The perturbation study indicates that even minor inaccuracies collapse performance to 15.48% EM, while the fine-tuning pilot only achieves 12–17% EM. This makes the bounds an idealized upper limit whose relevance depends on future progress in span selection, which should be stated more clearly to avoid overstatement.

    Authors: We agree that the reported wall-clock bounds constitute an idealized upper limit under oracle span selection. The manuscript already frames the analysis as an 'upper-bound analysis that requires no end-to-end training,' reports the perturbation study showing the EM drop to 15.48%, and explicitly labels the fine-tuning result as 'a learnability signal, not a production selector.' To address the concern about prominence in the abstract, we will revise the abstract and introduction to state more explicitly that the speedups assume perfect spans and that realized gains will depend on advances in span selection. This revision will be made without altering the kernel measurements, coverage statistics, or closed-form bound derivation. revision: yes

  2. Referee: [Fine-tuning pilot description] The pilot experiment on Qwen2.5-Coder-1.5B is too small in scope (only 12-17% EM on one benchmark) to serve as convincing evidence that span selection is feasible at the accuracy levels required (74-98%). More details on the training procedure, data, and potential improvements would help assess whether this direction can close the gap to the oracle performance.

    Authors: We acknowledge that the pilot is limited in scope and does not reach the oracle coverage levels; its intent was solely to demonstrate that span selection is learnable from the untrained baseline of 0/33 EM rather than to provide a production system. In the revised manuscript we will expand the pilot section with additional details on the training procedure (including data construction, loss formulation, and hyperparameters), the exact benchmark split used, and a brief discussion of potential improvements such as scaling to larger models or incorporating the grammar constraints during fine-tuning. We agree that substantial further research is required to approach oracle accuracy. revision: partial

Circularity Check

0 steps flagged

Derivation chain is self-contained with no circular reductions

full rationale

The paper's central upper-bound analysis composes two independently obtained quantities: (1) direct wall-clock measurements of parallel-prefill kernel latency versus autoregressive decoding for varying N on A100 hardware, and (2) coverage fractions computed from span-length histograms on the ProbeEdit and HumanEvalPack-Fix corpora. These are multiplied to produce the closed-form speedups (29.0×/3.4×/4.2× pooled). No parameter is fitted to the final bound, no self-citation supplies a load-bearing uniqueness theorem or ansatz, and the oracle round-trip plus perturbation study are separate empirical checks of mechanism losslessness. The derivation therefore remains external to its own outputs.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The upper-bound claims rest on two measured quantities (kernel latency for parallel prefill of copy spans, and empirical span-length histograms from the test corpora) plus the assumption that a serving-layer primitive can perform the KV-cache update in one forward pass.

free parameters (1)
  • corpus span-length histogram
    Empirical distribution of copy-span sizes taken directly from the evaluation datasets and used to weight the kernel speedups.
axioms (1)
  • domain assumption A single parallel prefill forward pass can correctly update the KV cache for an arbitrary input-derived copy span
    Invoked when the serving-layer primitive is introduced; treated as an engineering primitive rather than derived.

pith-pipeline@v0.9.0 · 5721 in / 1472 out tokens · 41729 ms · 2026-05-10T04:39:45.471296+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.