
arxiv: 2605.12906 · v2 · pith:OKRWYSAYnew · submitted 2026-05-13 · 💻 cs.LG · cs.AI

Data Difficulty and the Generalization-Extrapolation Tradeoff in LLM Fine-Tuning

Pith reviewed 2026-05-14 19:56 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords data difficulty · supervised fine-tuning · LLM fine-tuning · generalization gap · extrapolation gap · PAC-Bayesian bounds · data selection · SFT

The pith

For any fixed data budget in LLM fine-tuning, an optimal difficulty level exists and moves toward harder examples as the budget grows.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how the difficulty of selected training examples affects supervised fine-tuning of large language models. It finds that no single difficulty works best for every dataset size. Instead, for a given number of examples there is a sweet spot in difficulty, and this sweet spot moves toward harder examples as more data becomes available. The pattern is traced to a tradeoff: easy data keeps the in-distribution generalization gap small, while hard data narrows the extrapolation gap to cases outside the training distribution. Controlled synthetic experiments and PAC-Bayesian bounds are used to isolate and quantify the two gaps.

Core claim

Data difficulty and dataset size interact through a generalization-extrapolation tradeoff. For small budgets, easier examples minimize the in-distribution generalization gap and raise performance. Larger budgets favor harder examples because they shrink the extrapolation gap to unseen cases. The location of the optimum is predicted by PAC-Bayesian bounds that depend on model capacity and data volume.

What carries the argument

The interplay between the in-distribution generalization gap and the extrapolation gap, formalized through PAC-Bayesian bounds.
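
As a hedged illustration of what this decomposition might look like on paper, the block below writes the target loss as training loss plus two gaps, followed by a generic PAC-Bayesian bound of the McAllester/Maurer type on the first gap. The notation ($h$, $d$, $n$, $\mathcal{D}_d$, $L_{\mathrm{tgt}}$, $P$, $Q$) is assumed here for exposition and is not taken from the paper, whose specific bound may differ.

```latex
% Assumed notation: h is the fine-tuned model, \hat{L}_n its training loss on
% n examples of difficulty d, L_{\mathcal{D}_d} its population loss at that
% difficulty, and L_{\mathrm{tgt}} its loss on the target distribution.
\[
L_{\mathrm{tgt}}(h)
  = \hat{L}_n(h)
  + \underbrace{\bigl(L_{\mathcal{D}_d}(h) - \hat{L}_n(h)\bigr)}_{\text{generalization gap}}
  + \underbrace{\bigl(L_{\mathrm{tgt}}(h) - L_{\mathcal{D}_d}(h)\bigr)}_{\text{extrapolation gap}}
\]
% Generic PAC-Bayesian bound (losses in [0,1], prior P fixed before seeing the
% data): with probability at least 1-\delta, simultaneously for all posteriors Q,
\[
\mathbb{E}_{h \sim Q}\, L_{\mathcal{D}_d}(h)
  \le \mathbb{E}_{h \sim Q}\, \hat{L}_n(h)
  + \sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \ln\frac{2\sqrt{n}}{\delta}}{2n}}
\]
```

In a bound of this shape the KL term is where model capacity would enter and the $1/n$ factor is where data volume would enter. The extrapolation gap is not controlled by the bound and has to be handled separately.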

If this is right

  • For small fine-tuning sets, selecting easier data reduces the generalization gap and improves accuracy.
  • Once the data budget exceeds a threshold, selecting harder data improves extrapolation to out-of-distribution cases.
  • PAC-Bayesian bounds can be used to estimate the optimal difficulty level for given model size and data volume.
  • Difficulty-based data selection must be adjusted according to total budget rather than applied with a fixed threshold.
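
The budget dependence in the points above can be illustrated with a toy model. The sketch below is not the paper's method: the functional forms, the constants `gen_scale` and `extrap_scale`, and the 1 to 10 difficulty scale are all assumptions chosen so that the generalization gap grows with difficulty and shrinks with data, while the extrapolation gap shrinks with difficulty.

```python
import math

MAX_DIFFICULTY = 10  # hypothetical difficulty scale 1..10


def toy_total_gap(difficulty, n, gen_scale=1.0, extrap_scale=0.05):
    """Sum of two assumed gap terms; the functional forms are illustrative only."""
    # Assumed: generalization gap grows with difficulty and shrinks like 1/sqrt(n).
    gen_gap = gen_scale * difficulty ** 2 / math.sqrt(n)
    # Assumed: extrapolation gap shrinks as training difficulty nears the hardest level.
    extrap_gap = extrap_scale * (MAX_DIFFICULTY - difficulty) ** 2
    return gen_gap + extrap_gap


def best_difficulty(n):
    """Grid-search the difficulty level that minimizes the toy total gap."""
    levels = range(1, MAX_DIFFICULTY + 1)
    return min(levels, key=lambda d: toy_total_gap(d, n))


# Print the toy optimum for a few budgets; it should not decrease as n grows.
for budget in (100, 1_000, 10_000, 100_000):
    print(budget, best_difficulty(budget))
```

Under these assumed forms the minimizer moves toward `MAX_DIFFICULTY` as `n` grows, which mirrors the qualitative claim. It says nothing about where the optimum sits for a real model.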

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Curators of SFT datasets may need to measure difficulty distributions at different scales to locate the operating point.
  • The same tradeoff could inform data mixing strategies when difficulty is combined with other filters such as length or quality.
  • The mechanism suggests a way to decide when to add harder synthetic or augmented examples during scaling of fine-tuning runs.

Load-bearing premise

The controlled synthetic experiments and PAC-Bayesian analysis capture the dominant mechanism in real LLM fine-tuning on natural language data.

What would settle it

An experiment on real LLMs in which optimal difficulty does not shift toward harder data as the fine-tuning budget increases, or in which measured generalization and extrapolation gaps fail to track the observed performance changes, would falsify the account.
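
One hedged way to operationalize the first half of this test: given a grid of measured scores indexed by budget and difficulty, find the best difficulty per budget and check that it never moves toward easier data as the budget grows. The function names and the example numbers below are hypothetical placeholders, not results from the paper.

```python
def optimal_difficulty_by_budget(results):
    """Map each budget to its best-scoring difficulty.

    `results` is {budget: {difficulty: score}}, with higher scores better.
    """
    return {
        budget: max(scores, key=scores.get)
        for budget, scores in sorted(results.items())
    }


def shifts_toward_harder(results):
    """True if the optimal difficulty is non-decreasing as the budget increases."""
    best = list(optimal_difficulty_by_budget(results).values())
    return all(a <= b for a, b in zip(best, best[1:]))


# Made-up placeholder scores, for illustration only.
example_grid = {
    1_000: {1: 0.52, 2: 0.50, 3: 0.47},
    10_000: {1: 0.55, 2: 0.58, 3: 0.54},
    100_000: {1: 0.56, 2: 0.60, 3: 0.63},
}
print(optimal_difficulty_by_budget(example_grid))
print(shifts_toward_harder(example_grid))
```

A grid for which `shifts_toward_harder` returns `False` across seeds and models would count against the account. The second half of the test, whether the measured gaps track performance, needs separate measurements.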

Figures

Figures reproduced from arXiv: 2605.12906 by Jingzhao Zhang (IIIS), Siyuan Liu, Tinghong Chen, Xinghan Li (IIIS), Yifei Wang (Amazon AGI SF Lab).

Figure 1: Relationship between data difficulty mea…
Figure 3: Performance gains over different base models as a function of data size and difficulty, trained on…
Figure 4: One-dimensional slices of the 2D data size–difficulty experiment on Qwen-2.5-Math-7B from…
Figure 5: Performance gains over different base models on synthetic iGSM data as a function of data difficulty…
Figure 6: Decomposed test results for SFT experiments on the base model Ops[2–8]2k under data sizes of 5k…
Figure 7: Illustration of the two-gap decomposition in SFT. The generalization gap rises with difficulty, while…
Figure 8: DFT performance on synthetic iGSM data (base model Ops[2–8]2k) across various data difficulty…
Figure 9: An example from the iGSM dataset. In our work we fix the number of edges according to #edges = op · 4/3 + 1, so that difficulty is effectively controlled by op. Notice that in the iGSM setup, the problem length grows linearly with the number of operations, which is consistent with our length-based difficulty control discussed in previous sections. In the iGSM experiments, all models are trained with a b…
Figure 10: Performance gain over base model as a function of data size and difficulty, trained on the OpenMath…
Figure 11: Extension experiments on Llama models and science reasoning tasks. Data difficulty is measured…
Original abstract

Data selection during supervised fine-tuning (SFT) can critically change the behavior of large language models (LLMs). Although existing work has studied the effect of selecting data based on heuristics such as perplexity, difficulty, or length, the reported findings are often inconsistent or context-dependent. In this work, we systematically study the role of data difficulty in fine-tuning from both empirical and theoretical perspectives, and find that there is no universally optimal difficulty level; rather, its effectiveness depends on the dataset size. We show that for a fixed data budget, there exists an optimal data difficulty for SFT, and that this optimal difficulty shifts toward harder data as the data budget increases. To explain this phenomenon, we conduct controlled synthetic experiments that reveal a simple underlying mechanism: the interplay between the (in-distribution) generalization gap and the extrapolation gap. We further support this mechanism through a theoretical analysis using PAC-Bayesian generalization bounds. Overall, our results clarify how data size and difficulty jointly affect the trade-off between generalization and extrapolation in SFT, providing guidance for difficulty-based data selection under certain model and data conditions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript claims that there is no universally optimal data difficulty for supervised fine-tuning (SFT) of LLMs. For a fixed data budget, an optimal difficulty exists and shifts toward harder data as the budget increases. This is demonstrated empirically, explained mechanistically via controlled synthetic experiments that isolate the interplay between the in-distribution generalization gap and the extrapolation gap, and supported by PAC-Bayesian generalization bounds.

Significance. If the results hold, the work clarifies how data size and difficulty jointly determine the generalization-extrapolation tradeoff in SFT, offering concrete guidance for difficulty-based data selection under the studied model and data conditions. The combination of synthetic experiments and PAC-Bayesian analysis provides a mechanistic account that strengthens the empirical findings and distinguishes this contribution from heuristic-based prior work.

major comments (1)
  1. [Synthetic experiments] Synthetic experiments section: the difficulty binning threshold is identified as a free parameter; the central claim that an optimal difficulty exists and shifts with budget size would be strengthened by an explicit robustness check showing that the location of the optimum is insensitive to reasonable variations in this threshold.
minor comments (1)
  1. [Abstract] Abstract: the phrase 'under certain model and data conditions' is appropriately cautious but could be expanded by one sentence to indicate the scope (e.g., synthetic tasks or specific model scales) without lengthening the abstract excessively.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and positive recommendation for minor revision. We address the single major comment below.

Point-by-point responses
  1. Referee: [Synthetic experiments] Synthetic experiments section: the difficulty binning threshold is identified as a free parameter; the central claim that an optimal difficulty exists and shifts with budget size would be strengthened by an explicit robustness check showing that the location of the optimum is insensitive to reasonable variations in this threshold.

    Authors: We agree that an explicit robustness check would strengthen the central claim. In the revised manuscript we will add a dedicated subsection in the synthetic experiments that varies the binning threshold over a range of reasonable values (e.g., the original threshold together with shifts of ±10% and ±20%). For each budget size we will report the location of the optimal difficulty bin and show that it remains stable across these threshold choices, thereby confirming that the observed shift toward harder data is not an artifact of the particular binning parameter.

    Revision: yes
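
A minimal sketch of how such a robustness check might be organized, assuming a user-supplied callable `evaluate(threshold, budget)` that returns a score per difficulty bin. That callable, the bin labels, and `fake_evaluate` are hypothetical stand-ins for a full train-and-evaluate run and are not part of the paper.

```python
def optimum_stability(evaluate, base_threshold, budgets,
                      rel_shifts=(-0.2, -0.1, 0.0, 0.1, 0.2)):
    """For each budget, collect the winning bin under each shifted threshold.

    `evaluate(threshold, budget)` is a hypothetical callable returning
    {bin_label: score}, with higher scores better.
    """
    report = {}
    for budget in budgets:
        winners = set()
        for shift in rel_shifts:
            scores = evaluate(base_threshold * (1.0 + shift), budget)
            winners.add(max(scores, key=scores.get))
        report[budget] = winners
    return report


def is_stable(report):
    """True if every budget has exactly one winning bin across all thresholds."""
    return all(len(winners) == 1 for winners in report.values())


def fake_evaluate(threshold, budget):
    # Placeholder scores: "hard" wins only for large budgets, whatever the threshold.
    if budget >= 10_000:
        return {"easy": 0.6, "hard": 0.7}
    return {"easy": 0.7, "hard": 0.6}


report = optimum_stability(fake_evaluate, base_threshold=5.0, budgets=[1_000, 100_000])
print(report, is_stable(report))
```

The default `rel_shifts` mirror the ±10% and ±20% shifts mentioned in the response. A budget whose set of winners has more than one element would flag sensitivity to the threshold.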

Circularity Check

0 steps flagged

Minor self-citation risk but central claim remains independent

full rationale

The paper grounds its main result in new controlled synthetic experiments isolating the generalization-extrapolation tradeoff plus standard PAC-Bayesian bounds. No equation or claim reduces by construction to a fitted parameter defined from the target quantity, nor does any load-bearing step rely on a self-citation chain that itself assumes the result. The derivation introduces an explanatory mechanism via fresh experiments rather than renaming known patterns or smuggling an ansatz through prior work. A low-level self-citation risk is noted but does not force the central claim, keeping the overall circularity low.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that the synthetic data distribution isolates generalization and extrapolation gaps in a manner representative of natural language, plus standard PAC-Bayesian assumptions on model priors and loss functions. No new entities are postulated. One free parameter is the precise definition of 'difficulty' used to bin examples, which is fitted or chosen per experiment.

free parameters (1)
  • difficulty binning threshold
    The cutoff used to label examples as easy or hard is chosen or fitted to produce the observed shift; its value is not derived from first principles.
axioms (2)
  • [standard math] PAC-Bayesian generalization bounds apply to the fine-tuned LLM under the chosen prior and loss
    Invoked to support the theoretical analysis of the generalization-extrapolation tradeoff.
  • [domain assumption] Synthetic task distributions faithfully reproduce the relevant generalization and extrapolation behavior of natural language data
    Required for the controlled experiments to explain real LLM fine-tuning.

