Synthetic Data for any Differentiable Target

Christopher Potts; Herman Brunborg; Luke Bailey; Marcel Roed; Neil Band; Sung Min Park; Tatsunori Hashimoto; Tristan Thrush

arXiv: 2604.08423 · v1 · submitted 2026-04-09 · cs.CL · cs.AI · cs.LG · stat.ML

Pith reviewed 2026-05-10 18:03 UTC · model grok-4.3

keywords: synthetic data, policy gradient, data attribution, language model control, supervised fine-tuning, higher-order gradients, model behavior shaping

The pith

A reinforcement learning method optimizes synthetic data generators so fine-tuning on their outputs steers language models toward any differentiable target.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops Dataset Policy Gradient (DPG) to train generators that produce synthetic examples for language models. It assigns each example a reward equal to its influence on a chosen metric, computed exactly via higher-order gradients, then updates the generator with a policy gradient step. When the resulting data is used for ordinary supervised fine-tuning, the target model improves on the metric. Experiments demonstrate this by embedding QR codes or digit patterns in model weights, lowering weight norms, rephrasing text in a new language, and emitting specific UUID strings, even when those goals are absent from the generator prompts. A proof establishes that the procedure approximates the true but intractable gradient for the generator.

Core claim

Dataset Policy Gradient is a reinforcement learning primitive that optimizes a synthetic data generator by using higher-order gradients to compute exact data attributions and feeding those scores as policy rewards. The resulting dataset, when used for supervised fine-tuning, causes the target model to optimize any chosen differentiable metric. The authors prove the update closely approximates the ideal gradient for the generator and illustrate the approach by making the target model's language-modeling head embed a QR code, embed the pattern 67, reduce its L2 norm, rephrase inputs in a new language, and produce a chosen UUID.
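
One way to write the claim down, as a hedged sketch rather than the paper's own statement: the symbols π_θ (generator), ϕ (target-model parameters), A (learning algorithm), ℓ (training loss), and Φ (target metric) follow the notation visible in the paper's appendix fragments, while the per-example data weights w_i come from our reading of the Figure 6 caption. The sign convention (Φ is treated as a quantity to minimize) and the omission of scaling constants are assumptions made for illustration.

```latex
% Assumed notation: \pi_\theta generator, \phi target-model parameters,
% A learning algorithm (e.g. SFT), \ell per-example training loss, \Phi target metric.
% The data weights w_i and the sign convention (\Phi is minimized) are assumptions.

% Intractable objective: the metric after training on generated data.
J(\theta) = \mathbb{E}_{x_1,\dots,x_n \sim \pi_\theta}
            \left[ \Phi\big( A(x_1,\dots,x_n) \big) \right]

% Weighted training run; w = (1,\dots,1) recovers ordinary training.
\phi^{\star}(w) = A\Big( \textstyle\sum_{i=1}^{n} w_i \, \ell(\,\cdot\,, x_i) \Big)

% Per-example reward: negative metagradient of the metric w.r.t. the data weight.
r_i = - \left. \frac{\partial \Phi\big(\phi^{\star}(w)\big)}{\partial w_i} \right|_{w = (1,\dots,1)}

% Policy-gradient surrogate for the generator (descent direction on J, constants omitted).
\nabla_\theta J(\theta) \;\approx\; - \, \mathbb{E}_{x_1,\dots,x_n \sim \pi_\theta}
            \left[ \sum_{i=1}^{n} r_i \, \nabla_\theta \log \pi_\theta(x_i) \right]
```

Read this way, the paper's proof concerns how closely the last line tracks the true gradient of the first.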

What carries the argument

Dataset Policy Gradient (DPG), which scores each generated example by its higher-order-gradient contribution to the target metric and treats those scores as rewards in a policy-gradient update for the generator.
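
To make the scoring step concrete, here is a minimal pure-Python sketch on a toy problem. It is not the paper's code: the linear-regression target model, the squared-distance metric, the single SGD step, and every name in it (`train_step`, `metric`, `data_weight_metagrads`) are assumptions chosen so the metagradient can be written by hand. The reward sign assumes Φ is a loss to be lowered.

```python
# Toy sketch (hypothetical, not the paper's implementation):
# per-example rewards from data-weight metagradients through one SGD step.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def train_step(phi, examples, weights, lr):
    """One SGD step on the weighted loss sum_i w_i * 0.5 * (phi.a_i - b_i)^2."""
    new_phi = list(phi)
    for (a, b), w in zip(examples, weights):
        residual = dot(phi, a) - b
        for j, aj in enumerate(a):
            new_phi[j] -= lr * w * residual * aj
    return new_phi

def metric(phi, target):
    """Differentiable target Phi: squared distance of the weights to a goal vector."""
    return 0.5 * sum((p - t) ** 2 for p, t in zip(phi, target))

def data_weight_metagrads(phi, examples, weights, lr, target):
    """Exact dPhi/dw_i through the single training step, via the chain rule."""
    new_phi = train_step(phi, examples, weights, lr)
    grad_metric = [p - t for p, t in zip(new_phi, target)]
    grads = []
    for a, b in examples:
        residual = dot(phi, a) - b
        grad_loss = [residual * aj for aj in a]
        grads.append(-lr * dot(grad_metric, grad_loss))
    return grads

phi0 = [0.0, 0.0]          # initial target-model parameters
target = [1.0, -1.0]       # the weights we would like training to move toward
examples = [([1.0, 0.0], 1.0), ([0.0, 1.0], 1.0), ([1.0, 1.0], 0.0)]
weights = [1.0, 1.0, 1.0]  # w_i = 1 is ordinary unweighted training
lr = 0.1

metagrads = data_weight_metagrads(phi0, examples, weights, lr, target)
# Assumption: Phi is minimized, so an example that lowers Phi gets a positive reward.
rewards = [-g for g in metagrads]
```

In this toy, the first example pulls the first weight toward its goal and earns a positive reward, the second pushes the second weight away from its goal and earns a negative one, and the third has zero residual and therefore zero influence. With multiple training steps or an adaptive optimizer the chain rule would run through every update, which is where higher-order gradients enter.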

If this is right

Target model weights can embed arbitrary patterns such as QR codes or the sequence 67 solely through supervised fine-tuning on the generated examples.
The generator can learn to rephrase inputs in a language never mentioned in its prompts.
Model components can be driven toward lower L2 norms or other weight-space properties via the generated data alone.
Specific output strings such as chosen UUIDs can be produced consistently after fine-tuning, even without explicit conditioning in the generator.
The same procedure applies to any differentiable metric, not just the five demonstrated cases.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The technique may let practitioners edit deployed models by generating fresh data rather than retraining from scratch or editing weights directly.
If the approximation remains accurate at larger scales, DPG could become a general interface for specifying model goals through data rather than loss functions.
Combining DPG with existing data-augmentation pipelines might reduce reliance on human-labeled examples for narrow behavioral targets.
Computational cost of higher-order gradients will determine whether the method stays practical beyond small models or toy metrics.

Load-bearing premise

Higher-order gradients supply a stable, low-bias signal that can be turned into policy rewards without causing the generator to diverge or produce misleading attributions.

What would settle it

After running DPG, fine-tuning the target model on the generated data fails to improve the chosen metric beyond what random or unoptimized data achieves.
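
The shape of that falsification test can be sketched on a toy problem. Everything here is an assumption made for illustration: reward-based data selection stands in for a DPG-trained generator, the target model is a two-parameter linear regressor trained for one SGD step, and the control is the average over all equally sized subsets of the candidate pool.

```python
# Toy sketch (hypothetical): does reward-selected data beat a control dataset?
from itertools import combinations

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def train_step(phi, examples, lr):
    """One full-batch SGD step on the squared loss over `examples`."""
    grad = [0.0] * len(phi)
    for a, b in examples:
        residual = dot(phi, a) - b
        for j, aj in enumerate(a):
            grad[j] += residual * aj
    return [p - lr * g for p, g in zip(phi, grad)]

def metric(phi, target):
    """Target metric Phi (lower is better): squared distance to a goal vector."""
    return 0.5 * sum((p - t) ** 2 for p, t in zip(phi, target))

def rewards_at_uniform_weights(phi, pool, lr, target):
    """Reward_i = -dPhi/dw_i for one step on the whole pool with all w_i = 1."""
    phi1 = train_step(phi, pool, lr)
    grad_metric = [p - t for p, t in zip(phi1, target)]
    return [lr * (dot(phi, a) - b) * dot(grad_metric, a) for a, b in pool]

phi0, target, lr, k = [0.0, 0.0], [1.0, -1.0], 0.01, 2
pool = [([1.0, 0.0], 1.0), ([0.0, 1.0], 1.0), ([1.0, 1.0], 0.0), ([1.0, -1.0], 2.0)]

rewards = rewards_at_uniform_weights(phi0, pool, lr, target)
top = sorted(range(len(pool)), key=lambda i: rewards[i], reverse=True)[:k]

# Metric after training on the reward-selected subset.
optimized_metric = metric(train_step(phi0, [pool[i] for i in top], lr), target)

# Control: metric after training on every possible subset of the same size.
control_metrics = [
    metric(train_step(phi0, [pool[i] for i in subset], lr), target)
    for subset in combinations(range(len(pool)), k)
]
control_mean = sum(control_metrics) / len(control_metrics)

selection_helps = optimized_metric < control_mean
```

If `selection_helps` came out false in the real setting, with the trained generator in place of `top` and random or unoptimized generations in place of the control subsets, that would count against the method.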

Figures

Figures reproduced from arXiv: 2604.08423 by Christopher Potts, Herman Brunborg, Luke Bailey, Marcel Roed, Neil Band, Sung Min Park, Tatsunori Hashimoto, Tristan Thrush.

**Figure 2.** Here, we initialize the target model in A to be GPT-2, and explore exotic target metrics: the goal of the first metric is to encode the greyscale image 67 in the upper 6x7 patch of the sign’d LM head weight updates to the target model. This number was chosen arbitrarily. The goal of the second metric is to lower the ℓ2 norm of the target model’s LM head. The plots show validation performance as the GRPO p…

**Figure 3.** Final validation results for the 6x7 pixel images in the target models’ sign’d …

**Figure 4.** Generator results when setting Φ to be post-training loss on four multilingual LAMBADA (Paperno et al., 2016) translations from Black (2023): DE, ES, FR, and IT. We initialized the generator from Llama 3.2 Instruct. We initialized the target model in A also from Llama 3.2 Instruct. In each GRPO step, we conduct a single step of target model continued pretraining on the synthetic data before computing metag…

**Figure 5.** We keep the same setup as the LAMBADA cases, with the exception of changing …

**Figure 6.** DPG RL, using GRPO. The target model in A is trained on generator rollouts. A’s training loss incorporates weights for each training example. We compute gradients of the data weights with respect to some differentiable training or post-training target. We use these gradients as the rewards. [Diagram labels: DPG GRPO with cross group batching; Prompts; Rollouts; Generate; Rewards; Train A and compute metagrads wrt data weights; …]

**Figure 7.** DPG RL, using GRPO. Same as Figure …
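
The Figure 6 caption says the metagradients serve as GRPO rewards. As a hedged illustration of what usually happens to rewards next in GRPO, the sketch below normalizes them within each group of rollouts that share a prompt. Whether the paper uses exactly this normalization, and how its cross group batching changes it, is not stated in the text available here; the function name and the reward values are hypothetical.

```python
# Sketch of standard GRPO-style group-relative advantages (assumed, not confirmed
# to match the paper's variant). Rewards would be the per-example metagradient scores.
from statistics import mean, pstdev

def group_relative_advantages(rewards_by_group, eps=1e-8):
    """Center and scale rewards within each group of rollouts for one prompt."""
    advantages = []
    for group in rewards_by_group:
        mu = mean(group)
        sigma = pstdev(group)
        # eps keeps the division finite when every rollout in a group ties.
        advantages.append([(r - mu) / (sigma + eps) for r in group])
    return advantages

# Hypothetical rewards for 2 prompts with 4 rollouts each.
rewards_by_group = [
    [0.09, -0.11, 0.0, 0.02],
    [0.5, 0.5, 0.5, 0.5],
]
advantages = group_relative_advantages(rewards_by_group)
```

A group whose rollouts all receive the same reward contributes zero advantage, so it gives the generator no learning signal for that prompt.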

Original abstract

What are the limits of controlling language models via synthetic training data? We develop a reinforcement learning (RL) primitive, the Dataset Policy Gradient (DPG), which can precisely optimize synthetic data generators to produce a dataset of targeted examples. When used for supervised fine-tuning (SFT) of a target model, these examples cause the target model to do well on a differentiable metric of our choice. Our approach achieves this by taking exact data attribution via higher-order gradients and using those scores as policy gradient rewards. We prove that this procedure closely approximates the true, intractable gradient for the synthetic data generator. To illustrate the potential of DPG, we show that, using only SFT on generated examples, we can cause the target model's LM head weights to (1) embed a QR code, (2) embed the pattern $\texttt{67}$, and (3) have lower $\ell^2$ norm. We additionally show that we can cause the generator to (4) rephrase inputs in a new language and (5) produce a specific UUID, even though neither of these objectives is conveyed in the generator's input prompts. These findings suggest that DPG is a powerful and flexible technique for shaping model properties using only synthetic training examples.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DPG gives a workable RL primitive for tuning synthetic data generators via higher-order gradient rewards, but the approximation claim needs tighter validation for real multi-step SFT.

This paper defines Dataset Policy Gradient (DPG) as a way to optimize a generator so that its synthetic examples, once used for SFT, improve a chosen differentiable metric on the target model. The authors convert higher-order gradient attributions into policy rewards and offer a proof that the resulting update approximates the true (intractable) gradient of the post-SFT objective with respect to the generator parameters. Five demonstrations then show that the generator can be steered to produce data that embeds a QR code or the digit pattern 67 into the target LM head, reduces its L2 norm, triggers rephrasing in a new language, or forces output of a specific UUID.

Referee Report

2 major / 2 minor

Summary. The paper introduces Dataset Policy Gradient (DPG), an RL primitive for optimizing synthetic data generators. DPG uses exact data attribution scores derived from higher-order gradients of a differentiable target metric (computed after SFT on the generated data) as rewards in a policy gradient update for the generator. The authors prove that this yields a close approximation to the true (intractable) gradient of the post-SFT metric w.r.t. generator parameters. They demonstrate the approach on five tasks: causing a target LM's weights to embed a QR code, embed the pattern '67', or have reduced ℓ² norm; causing the generator to rephrase inputs in a new language; and causing it to output a specific UUID, all via SFT on the generated examples alone.

Significance. If the gradient approximation holds with bounded error across finite SFT trajectories and the method proves stable, DPG would offer a general mechanism for steering LM properties and behaviors using only synthetic data for arbitrary differentiable objectives. The demonstrations suggest potential for non-prompt-based control (e.g., weight-level pattern embedding or output constraints), which could complement existing fine-tuning and editing techniques. Credit is due for the explicit proof attempt and the concrete, falsifiable examples; however, significance is tempered by the need for rigorous validation of the approximation under realistic multi-step optimization.

major comments (2)

[Proof of gradient approximation (Section 3 or equivalent)] The proof that DPG 'closely approximates the true, intractable gradient' (abstract) relies on higher-order gradients providing reliable data attribution. Standard derivations of such attributions (influence functions or Hessian-vector products) are first-order linearizations around the initial parameters. The demonstrations involve multiple SFT steps with non-negligible learning rates, during which parameters move and the attribution becomes path-dependent; the manuscript must supply either an explicit error bound on the linearization or empirical measurements of approximation quality (e.g., correlation between DPG rewards and true finite-difference gradients) along the actual optimization trajectory.
[Experimental results (Section 4)] Table or figure reporting the five demonstrations: success is claimed for ambitious targets (QR-code embedding in LM head, exact UUID production). The manuscript should report quantitative success metrics, variance across runs, and controls (e.g., SFT on random or non-DPG-generated data) to show that the observed effects are attributable to the DPG-optimized dataset rather than incidental properties of the generator or base model.
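
The empirical check requested in the first major comment can be sketched on a toy problem: compare a single-step metagradient estimate against finite-difference derivatives of the metric after k training steps, and report their correlation. This is an illustration under stated assumptions, not the paper's evaluation: the target model is a two-parameter linear regressor, the same weighted dataset is reused at every step, and all function names are hypothetical.

```python
# Toy sketch (hypothetical): how well does a one-step attribution track the
# finite-difference data-weight gradient after k steps of training?
from math import sqrt

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sgd(phi, examples, weights, lr, steps):
    """Full-batch SGD on the weighted squared loss, reusing the data every step."""
    phi = list(phi)
    for _ in range(steps):
        grad = [0.0] * len(phi)
        for (a, b), w in zip(examples, weights):
            residual = dot(phi, a) - b
            for j, aj in enumerate(a):
                grad[j] += w * residual * aj
        phi = [p - lr * g for p, g in zip(phi, grad)]
    return phi

def metric(phi, target):
    return 0.5 * sum((p - t) ** 2 for p, t in zip(phi, target))

def one_step_metagrads(phi0, examples, weights, lr, target):
    """Exact dPhi/dw_i when training stops after a single step."""
    phi1 = sgd(phi0, examples, weights, lr, 1)
    grad_metric = [p - t for p, t in zip(phi1, target)]
    return [-lr * (dot(phi0, a) - b) * dot(grad_metric, a) for a, b in examples]

def finite_difference_metagrads(phi0, examples, weights, lr, steps, target, eps=1e-5):
    """Central-difference dPhi/dw_i after `steps` steps of training."""
    grads = []
    for i in range(len(weights)):
        up, down = list(weights), list(weights)
        up[i] += eps
        down[i] -= eps
        f_up = metric(sgd(phi0, examples, up, lr, steps), target)
        f_down = metric(sgd(phi0, examples, down, lr, steps), target)
        grads.append((f_up - f_down) / (2 * eps))
    return grads

def pearson(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

phi0, target, lr = [0.0, 0.0], [1.0, -1.0], 0.1
examples = [([1.0, 0.0], 1.0), ([0.0, 1.0], 1.0), ([1.0, 1.0], 0.0), ([1.0, -1.0], 2.0)]
weights = [1.0] * len(examples)

estimate = one_step_metagrads(phi0, examples, weights, lr, target)
corr_by_steps = {
    k: pearson(estimate, finite_difference_metagrads(phi0, examples, weights, lr, k, target))
    for k in (1, 5, 25)
}
```

At k = 1 the two quantities coincide up to floating-point error, so the correlation is essentially 1; how fast it degrades as k grows is the number the referee is asking the authors to report for their actual trajectories.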

minor comments (2)

[Method description] Clarify the precise definition and computational implementation of the 'higher-order gradients' used for attribution (e.g., which Hessian terms are retained, how they are estimated without full materialization).
[Introduction or Related Work] Add discussion of related work on influence functions, data attribution, and synthetic-data-based model editing to better position the contribution relative to prior approximations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the opportunity to respond to the referee's report. We have carefully considered the major comments and will make revisions to address them as detailed below.

Referee: [Proof of gradient approximation (Section 3 or equivalent)] The proof that DPG 'closely approximates the true, intractable gradient' (abstract) relies on higher-order gradients providing reliable data attribution. Standard derivations of such attributions (influence functions or Hessian-vector products) are first-order linearizations around the initial parameters. The demonstrations involve multiple SFT steps with non-negligible learning rates, during which parameters move and the attribution becomes path-dependent; the manuscript must supply either an explicit error bound on the linearization or empirical measurements of approximation quality (e.g., correlation between DPG rewards and true finite-difference gradients) along the actual optimization trajectory.

Authors: We thank the referee for this insightful observation. While our proof establishes that DPG approximates the gradient using higher-order terms for data attribution, we recognize that the multi-step nature of SFT introduces path-dependence not fully bounded in the current analysis. In the revision, we will include empirical evaluations measuring the correlation between the DPG-computed rewards and finite-difference approximations of the true gradient at various points along the training trajectory. This will quantify the approximation error in practice.
Revision: yes

Referee: [Experimental results (Section 4)] Table or figure reporting the five demonstrations: success is claimed for ambitious targets (QR-code embedding in LM head, exact UUID production). The manuscript should report quantitative success metrics, variance across runs, and controls (e.g., SFT on random or non-DPG-generated data) to show that the observed effects are attributable to the DPG-optimized dataset rather than incidental properties of the generator or base model.

Authors: We agree that additional quantitative details would strengthen the experimental section. The demonstrations in the manuscript are primarily qualitative to illustrate the novel capabilities, but we will add a table in the revised version reporting success metrics (such as embedding accuracy or output match rates), standard deviations from multiple independent runs, and control experiments using randomly generated data or data from non-optimized generators. These additions will confirm the attribution of the observed effects to the DPG method.
Revision: yes

Circularity Check

0 steps flagged

No circularity in the core derivation chain

full rationale

The paper defines DPG as using exact higher-order gradient data attribution scores directly as policy rewards for optimizing a synthetic data generator, then claims a proof that this approximates the intractable true gradient of the post-SFT metric. No equations or steps in the abstract reduce the claimed approximation to a fitted parameter, self-definition, or self-citation chain by construction. The targets (QR-code embedding, UUID production) are presented as empirical illustrations rather than inputs to the derivation. The method is grounded in external gradient computations from the target model, satisfying the criteria for an independent, non-circular derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The approach rests on standard differentiability assumptions in machine learning plus the new DPG construction; no free parameters or invented physical entities are described.

axioms (2)

domain assumption: The target metric is differentiable with respect to model parameters.
Required to compute the higher-order gradients used as rewards.
domain assumption: Higher-order gradients can be obtained exactly for data attribution.
Central mechanism for turning attribution into policy rewards.

invented entities (1)

Dataset Policy Gradient (DPG) · no independent evidence
purpose: RL primitive that optimizes synthetic data generators using gradient attribution rewards
New method introduced to solve the stated problem.

pith-pipeline@v0.9.0 · 5538 in / 1395 out tokens · 59388 ms · 2026-05-10T18:03:57.264726+00:00 · methodology
