Beyond Log Likelihood: Probability-Based Objectives for Supervised Fine-Tuning across the Model Capability Continuum
Pith reviewed 2026-05-25 07:39 UTC · model grok-4.3
The pith
The best probability-based objective for supervised fine-tuning shifts with a model's place on the capability continuum.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Near the model-strong end of the capability continuum, prior-leaning objectives that downweight low-probability tokens (such as -p, -p^10, and thresholded variants) consistently outperform NLL; toward the model-weak end NLL dominates; in the middle region no single objective prevails. Theoretical analysis shows how the objectives trade places as capability changes.
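A minimal sketch of the per-token objectives named in the claim, assuming `p` is the model's probability of the target token. The forms -p and -p^10 come from the claim itself. The thresholded variant shown here (NLL applied only when `p` is at least `tau`) and the value `tau=0.1` are assumptions for illustration, since this page does not define them. All function names are hypothetical.

```python
import math

def nll(p):
    """Negative log likelihood of the target token: -log p."""
    return -math.log(p)

def neg_p_pow(p, alpha=1.0):
    """Prior-leaning loss -p^alpha (alpha=1 gives -p, alpha=10 gives -p^10)."""
    return -(p ** alpha)

def thresholded_nll(p, tau=0.1):
    """ASSUMED thresholded form: NLL when p >= tau, zero loss otherwise."""
    return -math.log(p) if p >= tau else 0.0

def sequence_loss(token_probs, token_loss):
    """Average a per-token loss over the target tokens of one sequence."""
    return sum(token_loss(p) for p in token_probs) / len(token_probs)

# Toy probabilities (illustrative only, not taken from the paper):
# one confident token, one uncertain token, one token the model finds unlikely.
toy_probs = [0.9, 0.5, 0.01]
objectives = [
    ("NLL", nll),
    ("-p", neg_p_pow),
    ("-p^10", lambda p: neg_p_pow(p, 10.0)),
    ("thresholded NLL", thresholded_nll),
]
for name, fn in objectives:
    print(name, [round(fn(p), 4) for p in toy_probs])
```

On the toy list, NLL assigns its largest loss to the p = 0.01 token, whereas -p^10 and the assumed thresholded form contribute almost nothing there. That is the sense in which such objectives downweight low-probability tokens.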
What carries the argument
The model-capability continuum, an ordering of models by strength that governs which probability-based objective is optimal for supervised fine-tuning.
If this is right
- Strong models reach higher benchmark scores when fine-tuned with prior-leaning objectives instead of NLL.
- Weak models obtain better results when fine-tuned with NLL than with alternatives that downweight low-probability tokens.
- Intermediate models show no consistent winner, so objective choice must be tested per case.
- Training pipelines can adapt the objective to the measured capability of the base model.
Where Pith is reading between the lines
- Capability estimates could be used to switch or blend objectives automatically during a single training run.
- The same continuum pattern may appear in other post-training stages such as preference tuning.
- Domain-specific capability measures might replace a single general continuum for more precise objective selection.
Load-bearing premise
Model capability forms a single ordered continuum that reliably predicts which probability-based objective will perform best.
What would settle it
An experiment that places models on an independent capability scale and finds that the observed best objective does not follow the predicted ordering across multiple tasks.
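One hypothetical way to score such an experiment is a rank correlation between an independent capability score per model and the measured advantage of a prior-leaning objective over NLL. The page does not specify a statistic, so the choice of Spearman correlation, the function names, and the input layout are all assumptions. This is a sketch, not the paper's protocol.

```python
import math

def ranks(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if an input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman(capability, advantage):
    """Spearman rank correlation between capability and objective advantage.

    capability: one independent capability score per model (assumed input).
    advantage:  per-model score with a prior-leaning objective minus the
                score with NLL, on one task (assumed input).
    """
    return pearson(ranks(capability), ranks(advantage))
```

Under the continuum claim, the advantage should rise with capability, so the correlation should be clearly positive on most tasks. Values near zero or negative across multiple tasks would be the kind of evidence described above.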
Original abstract
Supervised fine-tuning (SFT) is the standard approach for post-training large language models (LLMs), yet it often shows limited generalization. We trace this limitation to its default training objective: negative log likelihood (NLL). While NLL is classically optimal when training from scratch, post-training operates in a different paradigm and could violate its optimality assumptions, where models already encode task-relevant priors and supervision can be long and noisy. In this work, we systematically study various probability-based objectives and characterize when and why different objectives succeed or fail under varying conditions. Through comprehensive experiments and extensive ablation studies across 8 model backbones, 27 benchmarks, and 7 domains, we uncover a critical dimension that governs objective behavior: the model-capability continuum. Near the model-strong end, prior-leaning objectives that downweight low-probability tokens (e.g., $-p$, $-p^{10}$, thresholded variants) consistently outperform NLL; toward the model-weak end, NLL dominates; in between, no single objective prevails. Our theoretical analysis further elucidates how objectives trade places across the continuum, providing a principled foundation for adapting objectives to model capability. The code is available at https://github.com/GaotangLi/Beyond-Log-Likelihood.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that negative log-likelihood (NLL) is not always optimal for supervised fine-tuning (SFT) of LLMs because post-training operates with pre-existing priors and noisy supervision. Through experiments on 8 backbones, 27 benchmarks, and 7 domains, it identifies a model-capability continuum as the governing factor: prior-leaning objectives (e.g., -p, -p^10, thresholded variants) that downweight low-probability tokens outperform NLL near the strong-model end, NLL dominates near the weak-model end, and results are mixed in between. A theoretical analysis is provided to explain the trade-offs, with code released for reproducibility.
Significance. If the central empirical pattern holds under a validated capability measure, the work offers a practical rule for objective selection during SFT and a theoretical lens on why NLL can underperform in post-training. The scale of the experiments (multiple backbones and domains) and public code are strengths that would support adoption if the continuum claim is shown to be robust rather than benchmark-specific.
major comments (3)
- [§4 and §5] §4 (Experiments) and §5 (Theoretical Analysis): The capability continuum is constructed by ordering the 8 backbones according to their post-SFT performance; no pre-specified, independent capability metric (e.g., zero-shot accuracy on a held-out suite orthogonal to the 27 evaluation benchmarks) is reported or validated. This leaves open the possibility that the observed switches are driven by task-specific factors (sequence length, noise level, domain) rather than a single ordered dimension, directly undermining the claim that the continuum 'governs objective behavior.'
- [Table 2 and Figure 3] Table 2 and Figure 3 (main results): The reported outperformance of -p and thresholded objectives for the strongest models is shown only as average deltas; per-benchmark variance and statistical significance (e.g., paired t-tests or bootstrap intervals) are not provided, making it impossible to assess whether the 'consistent' superiority holds after multiple-comparison correction across 27 benchmarks.
- [§3.2] §3.2 (Objective definitions): The family of prior-leaning objectives is introduced without a derivation showing how their gradients differ from NLL in the presence of model priors; the theoretical section later invokes this difference but does not connect it back to the specific functional forms (e.g., the exponent 10 in -p^10) via an explicit inequality or limiting case.
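The gradient comparison requested in the §3.2 comment can be sketched with standard softmax calculus. The notation below is an assumption of this sketch rather than the paper's: $q = \mathrm{softmax}(z)$ over logits $z$, $y$ is the target token, $p = q_y$, and $e_y$ is the one-hot vector for $y$.

```latex
\begin{align*}
\nabla_z\, p &= p\,(e_y - q) \\
\nabla_z\, f(p) &= f'(p)\, p\,(e_y - q) \\
f(p) = -\log p:\quad \nabla_z f &= -(e_y - q) \\
f(p) = -p^{\alpha}:\quad \nabla_z f &= -\alpha\, p^{\alpha}\,(e_y - q) \\
\lim_{\alpha \to 0} \frac{1 - p^{\alpha}}{\alpha} &= -\log p
\end{align*}
```

In this form, $-p^{\alpha}$ rescales the NLL logit gradient by the factor $\alpha p^{\alpha}$. For $\alpha = 10$ the factor is about 0.0098 at $p = 0.5$ and about 3.49 at $p = 0.9$, so low-probability tokens are strongly downweighted. The last line shows that NLL is recovered as the $\alpha \to 0$ limit of the shifted and rescaled loss $(1 - p^{\alpha})/\alpha$.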
minor comments (2)
- [Abstract and §1] The abstract states 'comprehensive experiments and extensive ablation studies' but the main text does not include an explicit ablation on data noise level or sequence length, which are hypothesized in the introduction as reasons NLL optimality fails.
- [§3.2] Notation for the thresholded variants is introduced in §3.2, but the exact threshold value and how it is chosen (fixed or tuned) are not stated until the experimental setup; these details should be moved earlier for clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We provide point-by-point responses below and indicate where revisions will be made to address the concerns.
- Referee: [§4 and §5] §4 (Experiments) and §5 (Theoretical Analysis): The capability continuum is constructed by ordering the 8 backbones according to their post-SFT performance; no pre-specified, independent capability metric (e.g., zero-shot accuracy on a held-out suite orthogonal to the 27 evaluation benchmarks) is reported or validated. This leaves open the possibility that the observed switches are driven by task-specific factors (sequence length, noise level, domain) rather than a single ordered dimension, directly undermining the claim that the continuum 'governs objective behavior.'
Authors: The continuum is defined via average post-SFT performance to reflect the relevant capability for SFT tasks. This ordering is consistent across the 7 domains, suggesting it captures a general dimension. We will revise to include zero-shot results on a held-out benchmark suite orthogonal to the 27 benchmarks to provide an independent validation of the ordering. revision: partial
- Referee: [Table 2 and Figure 3] Table 2 and Figure 3 (main results): The reported outperformance of -p and thresholded objectives for the strongest models is shown only as average deltas; per-benchmark variance and statistical significance (e.g., paired t-tests or bootstrap intervals) are not provided, making it impossible to assess whether the 'consistent' superiority holds after multiple-comparison correction across 27 benchmarks.
Authors: We will update the manuscript to report per-benchmark performance with variance measures and conduct paired statistical tests with appropriate multiple-comparison corrections to substantiate the claims of consistent superiority. revision: yes
- Referee: [§3.2] §3.2 (Objective definitions): The family of prior-leaning objectives is introduced without a derivation showing how their gradients differ from NLL in the presence of model priors; the theoretical section later invokes this difference but does not connect it back to the specific functional forms (e.g., the exponent 10 in -p^10) via an explicit inequality or limiting case.
Authors: We will enhance §3.2 and §5 with a derivation of the gradient differences for the prior-leaning objectives relative to NLL, including an analysis of the exponent α in -p^α and its limiting behavior to better connect the specific forms to the theoretical trade-offs. revision: partial
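The paired tests with multiple-comparison correction promised in the second response could look roughly like the sketch below. It assumes that, for each benchmark, paired score differences between two objectives are available (for example per seed or per example). The sign-flip permutation test and the Holm-Bonferroni correction are choices made for this sketch. The rebuttal does not say which test or correction the authors will use, and the function names are hypothetical.

```python
import random

def paired_signflip_pvalue(diffs, n_resamples=10000, seed=0):
    """Two-sided Monte Carlo sign-flip test of the null 'mean difference is 0'.

    diffs: paired differences on one benchmark, e.g. score with a
           prior-leaning objective minus score with NLL (assumed input).
    """
    rng = random.Random(seed)
    observed = abs(sum(diffs))
    hits = 0
    for _ in range(n_resamples):
        # Under the null, each paired difference is equally likely to be +d or -d.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped) >= observed:
            hits += 1
    # Add-one smoothing keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_resamples + 1)

def holm_adjust(pvalues):
    """Holm-Bonferroni adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        value = min(1.0, (m - rank) * pvalues[idx])
        # Enforce monotonicity: an adjusted p-value never falls below an earlier one.
        running_max = max(running_max, value)
        adjusted[idx] = running_max
    return adjusted
```

With 27 benchmarks, one p-value per benchmark would be passed to `holm_adjust`, and a claim of "consistent" superiority could then be checked against the adjusted values rather than against average deltas alone.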
Circularity Check
Primarily empirical; capability continuum observed from experiments, no reduction to fitted parameters or self-citations
The paper's core contribution consists of experimental results across 8 backbones, 27 benchmarks and 7 domains, with performance differences reported directly from SFT runs using different objectives. The model-capability continuum is characterized as an observed pattern that partitions which objective wins, rather than a quantity fitted from the same outcomes and then renamed as a prediction. No equations, self-citations, or uniqueness theorems are shown in the abstract or described claims that would make the reported ordering or objective rankings tautological by construction. The accompanying theoretical sketch is presented as explanatory rather than load-bearing for the empirical findings. This matches the default case of a self-contained empirical study.
Forward citations
Cited by 1 Pith paper
- Large Language Model Post-Training: A Unified View of Off-Policy and On-Policy Learning
  LLM post-training is unified as off-policy or on-policy interventions that expand support for useful behaviors, reshape policies within reachable states, or consolidate behavior across training stages.