
arxiv: 2510.00526 · v3 · pith:DGK3WXKMnew · submitted 2025-10-01 · 💻 cs.CL · cs.LG

Beyond Log Likelihood: Probability-Based Objectives for Supervised Fine-Tuning across the Model Capability Continuum

Pith reviewed 2026-05-25 07:39 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords supervised fine-tuning · negative log likelihood · probability-based objectives · model capability continuum · large language models · post-training

The pith

The best probability-based objective for supervised fine-tuning shifts with a model's place on the capability continuum.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Supervised fine-tuning of language models usually relies on negative log likelihood, yet this objective can limit generalization once models already hold task-relevant knowledge. The paper tests multiple probability-based alternatives and identifies the model-capability continuum as the factor that decides which objective wins. At the strong end, objectives that downweight low-probability tokens outperform NLL; at the weak end NLL is better; in between results vary. A theoretical account explains why the ranking changes across capability levels. The result matters because it gives a rule for matching the training objective to the starting model rather than using one objective for all cases.

Core claim

Near the model-strong end of the capability continuum, prior-leaning objectives that downweight low-probability tokens (such as -p, -p^10, and thresholded variants) consistently outperform NLL; toward the model-weak end NLL dominates; in the middle region no single objective prevails. Theoretical analysis shows how the objectives trade places as capability changes.
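To make the objectives named above concrete, here is a minimal sketch of per-token losses written as functions of p, the probability the model assigns to the target token. The function names, the `thresholded` helper, and its `lo`/`hi` bounds are illustrative assumptions, not the authors' implementation (their code is in the linked repository).

```python
import math

def nll(p):
    """Negative log likelihood of the target token: -log p."""
    return -math.log(p)

def neg_p(p):
    """Prior-leaning objective -p."""
    return -p

def neg_p_pow(p, k=10):
    """Prior-leaning objective -p^k (k = 10 matches the -p^10 example)."""
    return -(p ** k)

def thresholded(loss_fn, p, lo=0.0, hi=1.0):
    """Hypothetical thresholded variant: the token contributes only if lo <= p <= hi."""
    return loss_fn(p) if lo <= p <= hi else 0.0

def mean_loss(target_probs, loss_fn):
    """Average a per-token loss over a sequence of target-token probabilities."""
    return sum(loss_fn(p) for p in target_probs) / len(target_probs)

if __name__ == "__main__":
    # Two tokens the model already predicts well, one it considers unlikely.
    probs = [0.9, 0.5, 0.01]
    for name, fn in [("nll", nll), ("-p", neg_p), ("-p^10", neg_p_pow)]:
        print(name, mean_loss(probs, fn))
```

Under NLL the single low-probability token accounts for most of the summed loss in this toy sequence, whereas under -p and -p^10 its contribution is close to zero. That is the sense in which these objectives "downweight low-probability tokens".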

What carries the argument

The model-capability continuum, an ordering of models by strength that governs which probability-based objective is optimal for supervised fine-tuning.

If this is right

  • Strong models reach higher benchmark scores when fine-tuned with prior-leaning objectives instead of NLL.
  • Weak models obtain better results when fine-tuned with NLL than with alternatives that downweight low-probability tokens.
  • Intermediate models show no consistent winner, so objective choice must be tested per case.
  • Training pipelines can adapt the objective to the measured capability of the base model.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Capability estimates could be used to switch or blend objectives automatically during a single training run.
  • The same continuum pattern may appear in other post-training stages such as preference tuning.
  • Domain-specific capability measures might replace a single general continuum for more precise objective selection.

Load-bearing premise

Model capability forms a single ordered continuum that reliably predicts which probability-based objective will perform best.

What would settle it

An experiment that places models on an independent capability scale and finds that the observed best objective does not follow the predicted ordering across multiple tasks.

Figures

Figures reproduced from arXiv: 2510.00526 by Gaotang Li, Hanghang Tong, Heng Ji, Ruizhong Qiu, Xiusi Chen.

Figure 1: The model capability continuum of SFT objectives in post-training. At the model-strong (MS) end, where base models already encode extensive priors (e.g., Llama 3 reports 25% math pretraining tokens (Grattafiori et al., 2024)), prior-leaning objectives that downweight low-probability tokens (e.g., −p, −p^10, or thresholded variants) consistently outperform NLL by up to 16%. At the model-weak (MW) end, where…
Figure 2: The logit gradients W_f(p) of different functions. The weighting term W_f(p) determines how much learning signal each token contributes relative to the model's prior belief. For the parametric family in Eq. 3, we have W_f(p) = p^α(1 − p). As α → 0 (NLL), this reduces to W_f(p) → (1 − p), which strongly emphasizes low-probability tokens. When α ≥ 1 (f(p) = 1 − p), the gradient signal from low-probability t…
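The weighting term in this caption can be recovered with one application of the chain rule. The sketch below assumes that Eq. 3, which is not reproduced on this page, has the form f_α(p) = (1 − p^α)/α, and that p is the softmax probability of the target token with logit z_y. Both are assumptions chosen to agree with the two limits the caption states.

```latex
\begin{align*}
p &= \operatorname{softmax}(z)_y, \qquad \frac{\partial p}{\partial z_y} = p(1-p), \\
f_\alpha(p) &= \frac{1 - p^{\alpha}}{\alpha}, \qquad f_\alpha'(p) = -p^{\alpha-1}, \\
\frac{\partial f_\alpha}{\partial z_y} &= f_\alpha'(p)\,\frac{\partial p}{\partial z_y} = -p^{\alpha}(1-p)
\;\Longrightarrow\; W_f(p) = p^{\alpha}(1-p), \\
\alpha \to 0:&\quad f_\alpha(p) \to -\log p, \qquad W_f(p) \to 1-p, \\
\alpha = 1:&\quad f_1(p) = 1-p, \qquad W_f(p) = p(1-p) \to 0 \text{ as } p \to 0.
\end{align*}
```

Here W_f(p) is read as the magnitude of the gradient on the target logit. Under this reading, NLL gives its largest weight to tokens with p near 0, while any α ≥ 1 drives that weight toward zero.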
Figure 3: Performance under quantile thresholding for −log(p), −p, and log(1 − p). Let Q_percentile denote the predicted probability at the specified percentile of the training set. (≥ Percentile) corresponds to I = [Q_percentile, 1] in Eq. 4, while (≤ Percentile) corresponds to I = [0, Q_percentile]. Key findings: (1) low-probability tokens consistently harm performance across all objectives; (2) when training on al…
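The thresholding described in this caption can be sketched as a mask over tokens. The example assumes Q_percentile is a nearest-rank percentile of the target-token probabilities and that tokens outside the interval I are simply dropped from the average. The helper names and the `keep` flag are hypothetical, and Eq. 4 itself is not shown on this page.

```python
import math

def neg_p(p):
    """Objective -p, one of the three compared in the figure."""
    return -p

def quantile(values, pct):
    """Nearest-rank percentile of a list of probabilities (pct in [0, 100])."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def masked_mean_loss(probs, loss_fn, q, keep="ge"):
    """Average loss over tokens inside the interval I.

    keep="ge" uses I = [q, 1]  (the ">= Percentile" setting);
    keep="le" uses I = [0, q]  (the "<= Percentile" setting).
    """
    if keep == "ge":
        kept = [p for p in probs if p >= q]
    else:
        kept = [p for p in probs if p <= q]
    return sum(loss_fn(p) for p in kept) / len(kept) if kept else 0.0

if __name__ == "__main__":
    token_probs = [0.05, 0.2, 0.6, 0.9]   # toy target-token probabilities
    q50 = quantile(token_probs, 50)       # threshold at the 50th percentile
    print(masked_mean_loss(token_probs, neg_p, q50, keep="ge"))
    print(masked_mean_loss(token_probs, neg_p, q50, keep="le"))
```

In practice the threshold would be computed once over the training set, as the caption specifies, rather than per batch.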
Figure 4: Analysis of MS and MW ends in terms of objective convexity (with Eq. 3) and likelihood…
Figure 1: Training mirrors the MI setup, with a maximum sequence length of 800 and a micro-batch…
Original abstract

Supervised fine-tuning (SFT) is the standard approach for post-training large language models (LLMs), yet it often shows limited generalization. We trace this limitation to its default training objective: negative log likelihood (NLL). While NLL is classically optimal when training from scratch, post-training operates in a different paradigm and could violate its optimality assumptions, where models already encode task-relevant priors and supervision can be long and noisy. In this work, we systematically study various probability-based objectives and characterize when and why different objectives succeed or fail under varying conditions. Through comprehensive experiments and extensive ablation studies across 8 model backbones, 27 benchmarks, and 7 domains, we uncover a critical dimension that governs objective behavior: the model-capability continuum. Near the model-strong end, prior-leaning objectives that downweight low-probability tokens (e.g., $-p$, $-p^{10}$, thresholded variants) consistently outperform NLL; toward the model-weak end, NLL dominates; in between, no single objective prevails. Our theoretical analysis further elucidates how objectives trade places across the continuum, providing a principled foundation for adapting objectives to model capability. The code is available at https://github.com/GaotangLi/Beyond-Log-Likelihood.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that negative log-likelihood (NLL) is not always optimal for supervised fine-tuning (SFT) of LLMs because post-training operates with pre-existing priors and noisy supervision. Through experiments on 8 backbones, 27 benchmarks, and 7 domains, it identifies a model-capability continuum as the governing factor: prior-leaning objectives (e.g., -p, -p^10, thresholded variants) that downweight low-probability tokens outperform NLL near the strong-model end, NLL dominates near the weak-model end, and results are mixed in between. A theoretical analysis is provided to explain the trade-offs, with code released for reproducibility.

Significance. If the central empirical pattern holds under a validated capability measure, the work offers a practical rule for objective selection during SFT and a theoretical lens on why NLL can underperform in post-training. The scale of the experiments (multiple backbones and domains) and public code are strengths that would support adoption if the continuum claim is shown to be robust rather than benchmark-specific.

major comments (3)
  1. [§4 and §5] §4 (Experiments) and §5 (Theoretical Analysis): The capability continuum is constructed by ordering the 8 backbones according to their post-SFT performance; no pre-specified, independent capability metric (e.g., zero-shot accuracy on a held-out suite orthogonal to the 27 evaluation benchmarks) is reported or validated. This leaves open the possibility that the observed switches are driven by task-specific factors (sequence length, noise level, domain) rather than a single ordered dimension, directly undermining the claim that the continuum 'governs objective behavior.'
  2. [Table 2 and Figure 3] Table 2 and Figure 3 (main results): The reported outperformance of -p and thresholded objectives for the strongest models is shown only as average deltas; per-benchmark variance and statistical significance (e.g., paired t-tests or bootstrap intervals) are not provided, making it impossible to assess whether the 'consistent' superiority holds after multiple-comparison correction across 27 benchmarks.
  3. [§3.2] §3.2 (Objective definitions): The family of prior-leaning objectives is introduced without a derivation showing how their gradients differ from NLL in the presence of model priors; the theoretical section later invokes this difference but does not connect it back to the specific functional forms (e.g., the exponent 10 in -p^10) via an explicit inequality or limiting case.
minor comments (2)
  1. [Abstract and §1] The abstract states 'comprehensive experiments and extensive ablation studies' but the main text does not include an explicit ablation on data noise level or sequence length, which are hypothesized in the introduction as reasons NLL optimality fails.
  2. [§3.2] Notation for the thresholded variants is introduced in §3.2, but the exact threshold value and how it is chosen (fixed or tuned) are not stated until the experimental setup; this should be moved earlier for clarity.
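Major comment 2 asks for per-benchmark significance with a multiple-comparison correction. One way such a check could look is sketched below: a paired bootstrap test on per-example score differences within each benchmark, followed by a Holm-Bonferroni step across benchmarks. The choice of Holm, the function names, and the default settings are assumptions made for illustration; neither the referee report nor the paper prescribes a specific procedure.

```python
import random

def paired_bootstrap_pvalue(deltas, n_resamples=2000, seed=0):
    """Two-sided bootstrap p-value for the null hypothesis mean(deltas) == 0.

    deltas: paired differences (alternative objective minus NLL) on one benchmark.
    The differences are centered so that resampling happens under the null.
    """
    rng = random.Random(seed)
    n = len(deltas)
    observed = sum(deltas) / n
    centered = [d - observed for d in deltas]
    hits = 0
    for _ in range(n_resamples):
        sample_mean = sum(rng.choice(centered) for _ in range(n)) / n
        if abs(sample_mean) >= abs(observed):
            hits += 1
    # Add-one smoothing keeps the estimate strictly positive.
    return (hits + 1) / (n_resamples + 1)

def holm_reject(pvalues, alpha=0.05):
    """Holm-Bonferroni step-down: returns one reject/keep flag per p-value."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if pvalues[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values are kept
    return reject

if __name__ == "__main__":
    # Toy per-benchmark differences; real inputs would come from evaluation logs.
    benchmarks = {
        "bench_a": [1.0, 1.2, 0.8, 1.1, 0.9],
        "bench_b": [-1.0, 1.0, -1.0, 1.0],
    }
    pvals = [paired_bootstrap_pvalue(d) for d in benchmarks.values()]
    print(dict(zip(benchmarks, holm_reject(pvals))))
```

With 27 benchmarks the same two functions apply unchanged. Only benchmarks whose flag is true would count toward a claim of consistent superiority.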

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We provide point-by-point responses below and indicate where revisions will be made to address the concerns.

point-by-point responses
  1. Referee: [§4 and §5] §4 (Experiments) and §5 (Theoretical Analysis): The capability continuum is constructed by ordering the 8 backbones according to their post-SFT performance; no pre-specified, independent capability metric (e.g., zero-shot accuracy on a held-out suite orthogonal to the 27 evaluation benchmarks) is reported or validated. This leaves open the possibility that the observed switches are driven by task-specific factors (sequence length, noise level, domain) rather than a single ordered dimension, directly undermining the claim that the continuum 'governs objective behavior.'

    Authors: The continuum is defined via average post-SFT performance to reflect the relevant capability for SFT tasks. This ordering is consistent across the 7 domains, suggesting it captures a general dimension. We will revise to include zero-shot results on a held-out benchmark suite orthogonal to the 27 benchmarks to provide an independent validation of the ordering. revision: partial

  2. Referee: [Table 2 and Figure 3] Table 2 and Figure 3 (main results): The reported outperformance of -p and thresholded objectives for the strongest models is shown only as average deltas; per-benchmark variance and statistical significance (e.g., paired t-tests or bootstrap intervals) are not provided, making it impossible to assess whether the 'consistent' superiority holds after multiple-comparison correction across 27 benchmarks.

    Authors: We will update the manuscript to report per-benchmark performance with variance measures and conduct paired statistical tests with appropriate multiple-comparison corrections to substantiate the claims of consistent superiority. revision: yes

  3. Referee: [§3.2] §3.2 (Objective definitions): The family of prior-leaning objectives is introduced without a derivation showing how their gradients differ from NLL in the presence of model priors; the theoretical section later invokes this difference but does not connect it back to the specific functional forms (e.g., the exponent 10 in -p^10) via an explicit inequality or limiting case.

    Authors: We will enhance §3.2 and §5 with a derivation of the gradient differences for the prior-leaning objectives relative to NLL, including an analysis of the exponent α in -p^α and its limiting behavior to better connect the specific forms to the theoretical trade-offs. revision: partial

Circularity Check

0 steps flagged

Primarily empirical; capability continuum observed from experiments, no reduction to fitted parameters or self-citations

full rationale

The paper's core contribution consists of experimental results across 8 backbones, 27 benchmarks and 7 domains, with performance differences reported directly from SFT runs using different objectives. The model-capability continuum is characterized as an observed pattern that partitions which objective wins, rather than a quantity fitted from the same outcomes and then renamed as a prediction. No equations, self-citations, or uniqueness theorems are shown in the abstract or described claims that would make the reported ordering or objective rankings tautological by construction. The accompanying theoretical sketch is presented as explanatory rather than load-bearing for the empirical findings. This matches the default case of a self-contained empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the unverified empirical pattern and the assumption that capability forms an ordered continuum; no free parameters, axioms, or invented entities are stated in the abstract.



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Large Language Model Post-Training: A Unified View of Off-Policy and On-Policy Learning

    cs.CL 2026-04 accept novelty 5.0

    LLM post-training is unified as off-policy or on-policy interventions that expand support for useful behaviors, reshape policies within reachable states, or consolidate behavior across training stages.
