The Differences Between Direct Alignment Algorithms are a Blur

Alexey Gorbatovski; Alexey Malakhov; Boris Shaposhnikov; Daniil Gavrilov; Viacheslav Sinii

arxiv: 2502.01237 · v3 · submitted 2025-02-03 · 💻 cs.LG

Pith reviewed 2026-05-23 03:46 UTC · model grok-4.3

classification 💻 cs.LG

keywords direct alignment algorithms · LLM alignment · ranking objective · pairwise vs pointwise · ORPO · ASFT · instruction following · math reasoning

The pith

When placed in a common two-stage framework with a beta parameter, the ranking objective determines direct alignment quality more than the scalar score used.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Direct alignment algorithms optimize LLM policies without separate reward models or RL steps. The paper isolates the ranking objective as the main performance driver by converting one-stage methods into explicit two-stage pipelines and aligning all methods in the same hyperparameter space. Under this controlled setup, pairwise ranking outperforms pointwise ranking while the choice between policy-reference ratios and odds ratios becomes secondary. The pattern appears in both instruction-following and math-reasoning tasks and is linked to how each objective handles prompt-specific biases. The result implies that earlier claims of superiority among DAAs often reflected inconsistent training setups rather than fundamental differences in the scalar scores.

Core claim

Under the unified training framework, the ranking objective emerges as the primary determinant of alignment quality, whereas the particular scalar score (policy-reference ratio versus odds ratio) is secondary. This holds after converting one-stage methods such as ORPO and ASFT into two-stage pipelines with an explicit SFT phase and after introducing a beta parameter that places all methods in comparable hyperparameter regimes. Evidence from strictly controlled experiments and real data indicates the difference stems from interactions with prompt-specific biases.

What carries the argument

Unified training framework that converts one-stage DAAs to explicit two-stage pipelines and introduces a shared beta parameter to enable direct comparison of ranking objectives.
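To make the two axes concrete, the sketch below writes both scalar scores and both ranking objectives as plain functions. It is a minimal illustration that assumes sequence-level log-probabilities strictly below zero are already available. The function names (`ratio_score`, `odds_score`, `pairwise_loss`, `pointwise_loss`) and the exact placement of β are assumptions made for this example, not the paper's implementation.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def ratio_score(logp_policy: float, logp_ref: float) -> float:
    """Policy-reference log-ratio: log pi(y|x) - log pi_ref(y|x)."""
    return logp_policy - logp_ref


def odds_score(logp_policy: float) -> float:
    """Log-odds of the policy: log(p / (1 - p)), where p = exp(logp_policy) < 1."""
    return logp_policy - math.log1p(-math.exp(logp_policy))


def pairwise_loss(score_w: float, score_l: float, beta: float) -> float:
    """Pairwise ranking: only the margin between chosen and rejected matters."""
    return -log_sigmoid(beta * (score_w - score_l))


def pointwise_loss(score_w: float, score_l: float, beta: float) -> float:
    """Pointwise ranking: each response is pushed up or down on its own."""
    return -log_sigmoid(beta * score_w) - log_sigmoid(-beta * score_l)
```

Either score can be plugged into either loss, which gives the two-by-two comparison described above. With β shared by all four combinations, they can be tuned over the same hyperparameter grid.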

If this is right

Pairwise ranking objectives produce higher alignment quality than pointwise objectives once training pipelines are standardized.
Introducing the beta parameter improves performance for odds-ratio methods such as ORPO and ASFT.
The primary role of the ranking objective appears consistently across instruction-following and math-reasoning benchmarks and across model scales.
Observed quality gaps trace to how ranking objectives interact with prompt-specific biases rather than to the scalar score itself.
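One way to picture the last point: if a prompt adds the same offset to the scores of both responses, a pairwise objective sees only the score difference, in which the offset cancels, while a pointwise objective sees each score separately and its loss changes with the offset. The sketch below is an illustrative toy, not the paper's experiment; the function names and the numbers are made up for the example.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def pairwise_loss(s_w: float, s_l: float) -> float:
    """Depends only on the margin s_w - s_l."""
    return -log_sigmoid(s_w - s_l)


def pointwise_loss(s_w: float, s_l: float) -> float:
    """Depends on each score individually."""
    return -log_sigmoid(s_w) - log_sigmoid(-s_l)


def shifted(s_w: float, s_l: float, bias: float) -> tuple[float, float]:
    """Add a hypothetical prompt-specific offset to both scores."""
    return s_w + bias, s_l + bias


# Same underlying preference (margin of 2), three different prompt offsets.
for bias in (0.0, 2.0, -2.0):
    s_w, s_l = shifted(1.0, -1.0, bias)
    print(f"bias={bias:+.1f}  pairwise={pairwise_loss(s_w, s_l):.4f}  "
          f"pointwise={pointwise_loss(s_w, s_l):.4f}")
```

The pairwise column stays constant across offsets and the pointwise column does not. This shows only the invariance property; whether it explains the gaps on real data is the empirical question the paper addresses.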

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

New direct alignment methods should be evaluated first by their ranking formulation before claims are made about scalar-score innovations.
Prompt-specific bias mitigation techniques could be developed by studying the interaction between ranking objectives and individual prompts.
The unification approach could be applied to other families of alignment losses to test whether ranking type remains the dominant factor outside the current set of DAAs.

Load-bearing premise

Converting one-stage methods into an explicit two-stage pipeline with added SFT and beta preserves the original algorithmic intent and does not itself create the performance gaps attributed to ranking.

What would settle it

A controlled run in which the converted one-stage methods still underperform their original versions after matching the ranking objective exactly would falsify the claim that ranking type alone drives the observed differences.

Figures

Figures reproduced from arXiv: 2502.01237 by Alexey Gorbatovski, Alexey Malakhov, Boris Shaposhnikov, Daniil Gavrilov, Viacheslav Sinii.

**Figure 1.** Overview of our work and main finding. Left: Existing DAA methods differ in use of SFT and β parameter. Center: We unify methods by making SFT and β explicit for each, showing that ORPO and ASFT can be brought into the same framework as other DAAs. Right: We compare DAAs along two axes (scalar score type and ranking type) and find that ranking type (pairwise, green vs. pointwise, red) is the main driver of…

**Figure 2.** Impact of the β Parameter on ASFT and ORPO Alignment Quality. The plot shows how tuning β (Section 3.1.2) affects both ASFT and ORPO performance. Results are reported for GPT-4 Win Rate in the Llama 3.2 3B TL;DR setup and for AlpacaEval 2 LC Win Rate in the Llama 3.1 8B UF scenario. All other hyperparameters (e.g., learning rates) are selected via grid search, using each method’s best configuration at β = …

**Figure 3.** GPT-4 Evaluation of Llama 3.2 3B TL;DR setup. The comparison shows multiple alignment methods (rows) using their best hyperparameters, where each approach aims to generate concise and accurate summaries. Most methods exceed 90% Win Rate; ASFT achieves 87.2%, maintaining robust summarization performance. See Section 5.3 for more details.

**Figure 4.** Impact of SFT Dataset Size on Alignment Quality. Performance of the pairwise (a) and pointwise (b) alignment methods on AlpacaEval 2 (LC WR metric) when the SFT policy is trained on different fractions of the UltraChat dataset. Even a small fraction of SFT data (e.g., 5-10%) yields substantial gains over starting from the raw base model. See Section 5.4 for more details.

**Figure 5.** Toy experiment: effect of model capacity (h = 1, 2, 3, 4) on accuracy and prompt bias (ICC1). Pairwise (solid) and pointwise (dashed) objectives compared under unbiased (bias_strength = 0.0, left) and biased (bias_strength = 0.9, right) conditions. Results averaged over 1000 seeds; 95% CI shown. See Section 6 for details.

**Figure 6.** Toy experiment: effect of model capacity (h = 5, 6, 8) on accuracy and prompt bias (ICC1). Pairwise (solid) and pointwise (dashed) objectives compared under unbiased (bias_strength = 0.0, left) and biased (bias_strength = 0.9, right) conditions. Results averaged over 1000 seeds; 95% CI shown. See Section 6 for details.

**Figure 7.** ICC1 on real data. ICC1 computed on the training and validation splits for the best model from each method, across Llama 3.1 8B UF, Llama 3.2 3B UF, and Llama 3.2 3B TL;DR setups. Error bars show 95% confidence intervals. See Section 6 for details.

**Figure 8.** Pareto front for alignment quality and KL divergence. Results for Llama 3.2 3B TL;DR and UF setups on GPT-4 Win Rate vs. "golden" validation subset and AlpacaEval 2 LC respectively with different β values. Methods are grouped into pairwise and pointwise categories. For the summarization task (Llama 3.2 3B TL;DR), both pointwise and pairwise methods achieve strong overall results. For the UF setup, methods …
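Figures 5 to 7 report prompt bias as ICC1. As a reference point, the sketch below computes the standard one-way random-effects intraclass correlation, ICC(1), for scores grouped by prompt. It assumes a balanced layout (the same number of scored responses per prompt) and a hypothetical function name `icc1`; how the paper groups and preprocesses its scores is not specified in the captions, so treat this as a generic formula rather than the authors' exact procedure.

```python
from statistics import fmean


def icc1(scores: list[list[float]]) -> float:
    """One-way random-effects ICC(1) for a balanced design.

    scores[i] holds the k scores observed for prompt i.
    Values near 1 mean most variance lies between prompts (strong prompt
    effect); values near or below 0 mean it lies within prompts.
    """
    n = len(scores)
    k = len(scores[0]) if n else 0
    if n < 2 or k < 2 or any(len(row) != k for row in scores):
        raise ValueError("need at least 2 prompts with the same k >= 2 scores each")

    grand_mean = fmean([x for row in scores for x in row])
    group_means = [fmean(row) for row in scores]

    # Between-prompt and within-prompt mean squares from one-way ANOVA.
    ms_between = k * sum((m - grand_mean) ** 2 for m in group_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(scores, group_means) for x in row
    ) / (n * (k - 1))

    denom = ms_between + (k - 1) * ms_within
    if denom == 0:
        raise ValueError("all scores are identical; ICC(1) is undefined")
    return (ms_between - ms_within) / denom
```

For example, `icc1([[1, 1], [2, 2], [3, 3]])` describes scores that are fully determined by the prompt, while `icc1([[1, -1], [1, -1], [1, -1]])` describes scores with no prompt-level offset at all.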

read the original abstract

Direct Alignment Algorithms (DAAs) simplify LLM alignment by directly optimizing policies, bypassing reward modeling and RL. While DAAs differ in their use of SFT (one-stage vs. two-stage) and the scalar score they optimize (likelihood vs. odds ratios), the key performance drivers remain underexplored. We present a systematic comparison and analyze a previously overlooked axis - the ranking objective (pairwise vs. pointwise). To isolate this factor, we propose a unified training framework across DAAs by (i) converting one-stage methods (ORPO, ASFT) into a two-stage pipeline with an explicit SFT phase and (ii) introducing a $\beta$ parameter that places all methods in the same hyperparameter space and improves the quality of odds-ratio DAAs (ORPO, ASFT). Under this setup, the ranking objective emerges as the primary determinant of alignment quality, whereas the particular scalar score (policy-reference ratio vs. odds ratio) is secondary. We corroborate this on instruction-following tasks and further confirm it on math-reasoning benchmarks across model scales. Evidence suggests that this stems from how these objectives interact with prompt-specific biases, supported both by strictly controlled experiments and by observations on real data. Our findings underscore the need for nuanced evaluations in DAA research to avoid oversimplified claims of superiority.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

After unifying DAAs with an added SFT stage and beta, ranking objective looks more important than scalar score choice, though the conversion step itself may be doing some of the work.

The main thing to know is that the authors put several direct alignment methods into one training setup by converting one-stage approaches like ORPO and ASFT into explicit two-stage pipelines plus a shared beta parameter, then ran controlled comparisons showing the pairwise ranking objective drives alignment quality more than whether the scalar score uses policy-reference ratios or odds ratios. They back this on instruction-following and math-reasoning tasks across model scales and tie it to prompt biases. That ordering of importance is the new empirical claim not in the earlier DAA papers they cite.

The unified framework with beta is a clean way to put everything in the same hyperparameter space, and the beta addition visibly improves the odds-ratio methods. The experiments try to hold other factors fixed while varying only the ranking axis versus the scalar one.

The soft spot is the conversion itself. Turning originally one-stage losses into two-stage with an extra SFT phase and beta could shift gradient scales or prompt interactions in ways that are not purely about ranking, yet the paper treats the modified versions as faithful enough to attribute gaps to the ranking objective. Without explicit checks on how much the reformulated objectives deviate from the published one-stage versions, the central claim rests on an assumption that needs more direct testing.

This is for people actively designing or choosing DAAs who want a clearer map of which design axis moves the needle. A reader focused on alignment method internals will get a useful breakdown even if they end up disagreeing with the ranking-over-scalar conclusion. It deserves peer review because the question is practical and the controlled setup is a step forward, though referees will likely press on the faithfulness of the unification.

Referee Report

1 major / 1 minor

Summary. The paper claims that differences among Direct Alignment Algorithms (DAAs) are driven primarily by the ranking objective (pairwise vs. pointwise) rather than the scalar score (policy-reference ratio vs. odds ratio). To isolate this axis, the authors introduce a unified framework that converts one-stage methods (ORPO, ASFT) into an explicit two-stage pipeline with SFT plus a β hyperparameter, placing all methods in the same space. Experiments on instruction-following and math-reasoning tasks across scales are said to show that ranking dominates, with supporting observations on prompt-specific biases.

Significance. If the unification faithfully isolates the ranking objective, the result would clarify which design choice most affects DAA quality and reduce oversimplified superiority claims. The systematic cross-task, cross-scale comparison is a constructive contribution; however, the central attribution rests on an empirical argument whose soundness depends on unverified equivalence of the modified losses.

major comments (1)

[§3] §3 and experimental setup: The claim that the unified framework isolates the ranking objective requires that converting one-stage DAAs (ORPO, ASFT) to an explicit two-stage pipeline plus β does not itself alter optimization dynamics or prompt-bias interactions. The manuscript reports improvements from β on odds-ratio methods but provides no gradient analysis, loss-equivalence check, or direct comparison to the published one-stage objectives, leaving open the possibility that performance gaps trace to the reformulation rather than the ranking axis.
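The gradient-scale concern can be made concrete with a generic sigmoid ranking loss. This is a standard calculation, not taken from the paper; $\Delta$ stands for whatever score margin a given method uses, and the symbols are chosen here for illustration.

```latex
\ell_\beta(\Delta) = -\log \sigma(\beta \Delta),
\qquad
\frac{\partial \ell_\beta}{\partial \Delta}
  = -\beta \,\bigl(1 - \sigma(\beta \Delta)\bigr)
  = -\beta \,\sigma(-\beta \Delta).
```

At $\Delta = 0$ the gradient magnitude is $\beta / 2$, so introducing or retuning β rescales the update near initialization independently of the ranking type. A loss-equivalence check would need to separate that effect from the pairwise-versus-pointwise effect.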

minor comments (1)

The abstract states that results are 'corroborated' and 'confirmed' on multiple tasks but supplies no quantitative values, error bars, or statistical tests; the main text should include these to allow readers to assess effect sizes.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their thoughtful comments. We address the major comment on the unified framework below.

Referee: [§3] §3 and experimental setup: The claim that the unified framework isolates the ranking objective requires that converting one-stage DAAs (ORPO, ASFT) to an explicit two-stage pipeline plus β does not itself alter optimization dynamics or prompt-bias interactions. The manuscript reports improvements from β on odds-ratio methods but provides no gradient analysis, loss-equivalence check, or direct comparison to the published one-stage objectives, leaving open the possibility that performance gaps trace to the reformulation rather than the ranking axis.

Authors: We appreciate this observation. The unified framework converts one-stage methods to an explicit two-stage pipeline with a tunable β precisely to standardize the procedure and isolate the ranking objective by controlling the SFT component and hyperparameter space. The improvements from β on ORPO/ASFT indicate that the original one-stage losses may not have been optimally balanced, but the controlled experiments demonstrate that pairwise ranking outperforms pointwise ranking regardless of the scalar score (policy-reference vs. odds ratio). While we do not provide gradient analysis or formal loss-equivalence checks, the consistent cross-task and cross-scale results, along with prompt-bias observations, support that the ranking axis is the primary driver rather than the reformulation itself. We will add a clarification section relating the unified losses to the original one-stage objectives.

revision: partial

Circularity Check

0 steps flagged

No circularity: purely empirical comparison with no derivations or fitted predictions

The paper contains no equations, derivations, or first-principles results. Its central claim rests on experimental observations from a proposed unified training setup (explicit two-stage pipeline plus β for one-stage methods). This unification is a methodological choice whose validity is tested empirically rather than assumed by construction. No self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations appear. The analysis is self-contained against external benchmarks via controlled experiments on instruction-following and math tasks.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

The central claim depends on the assumption that the introduced beta and two-stage conversion create a fair comparison space; no free parameters are fitted to the final result, no new entities are postulated, and no domain axioms beyond standard supervised training are invoked.

free parameters (1)

beta parameter
Introduced to place all DAAs in the same hyperparameter space; its specific value is not reported in the abstract.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Reinforcement Learning from Human Feedback
cs.LG 2025-04 unverdicted novelty 2.0

The book introduces the origins, mathematical setup, and optimization stages of RLHF including reward modeling, reinforcement learning, rejection sampling, and direct alignment algorithms.

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · cited by 1 Pith paper · 7 internal anchors

[3] AI@Meta (2024). Llama 3 model card.

[4] Azar, M. G., Rowland, M., Piot, B., Guo, D., Calandriello, D., Valko, M., and Munos, R. (2023). A general theoretical paradigm to understand learning from human preferences.

[5] Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., Drain, D., Fort, S., Ganguli, D., Henighan, T. J., Joseph, N., Kadavath, S., Kernion, J., Conerly, T., El-Showk, S., Elhage, N., Hatfield-Dodds, Z., Hernandez, D., Hume, T., Johnston, S., Kravec, S., Lovitt, L., Nanda, N., Olsson, C., Amodei, D., Brown, T. B., Clark, J., McCandlish, S., ...

[6] Bartko, J. J. (1966). The intraclass correlation coefficient as a measure of reliability. Psychological Reports, 19(1):3–11.

[7] Bradley, R. A. and Terry, M. E. (1952). Rank Analysis of Incomplete Block Design: The Method of Paired Comparisons. Biometrika, 39(3-4):324–345.

[8] Burges, C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N., and Hullender, G. (2005). Learning to rank using gradient descent. In Proceedings of the 22nd International Conference on Machine Learning, pages 89–96.

[9] Chen, H., He, G., Yuan, L., Cui, G., Su, H., and Zhu, J. (2024). Noise contrastive alignment of language models with explicit rewards.

[10] Cui, G., Yuan, L., Ding, N., Yao, G., Zhu, W., Ni, Y., Xie, G., Liu, Z., and Sun, M. (2023). Ultrafeedback: Boosting language models with high-quality feedback.

[11] Dao, T. (2023). Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691.

[12] Ding, N., Chen, Y., Xu, B., Qin, Y., Hu, S., Liu, Z., Sun, M., and Zhou, B. (2023). Enhancing chat language models by scaling high-quality instructional conversations. In Bouamor, H., Pino, J., and Bali, K., editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 3029–3051, Singapore. Association for Comput...

[13] D'Oosterlinck, K., Xu, W., Develder, C., Demeester, T., Singh, A., Potts, C., Kiela, D., and Mehri, S. (2024). Anchored preference optimization and contrastive revisions: Addressing underspecification in alignment.

[14] Dubois, Y., Galambosi, B., Liang, P., and Hashimoto, T. B. (2024). Length-controlled alpacaeval: A simple way to debias automatic evaluators. arXiv preprint arXiv:2404.04475.

[15] Gorbatovski, A., Shaposhnikov, B., Malakhov, A., Surnachev, N., Aksenov, Y., Maksimov, I., Balagansky, N., and Gavrilov, D. (2024). Learn your reference model for real good alignment. arXiv preprint arXiv:2404.09656.

[16] Han, J., Jiang, M., Song, Y., Ermon, S., and Xu, M. (2024). f-po: Generalizing preference optimization with f-divergence minimization. arXiv preprint arXiv:2410.21662.

[17] Hong, J., Lee, N., and Thorne, J. (2024). Orpo: Monolithic preference optimization without reference model.

[18] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. CoRR, abs/1412.6980.

[19] Li, H. (2011). A short introduction to learning to rank. IEICE Transactions on Information and Systems, 94(10):1854–1862.

[20] Li, T., Chiang, W.-L., Frick, E., Dunlap, L., Wu, T., Zhu, B., Gonzalez, J. E., and Stoica, I. (2024). From crowdsourced data to high-quality benchmarks: Arena-hard and benchbuilder pipeline.

[21] Li, X., Zhang, T., Dubois, Y., Taori, R., Gulrajani, I., Guestrin, C., Liang, P., and Hashimoto, T. B. (2023). Alpacaeval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval

[22] Liu, T., Qin, Z., Wu, J., Shen, J., Khalman, M., Joshi, R., Zhao, Y., Saleh, M., Baumgartner, S., Liu, J., et al. (2024). Lipo: Listwise preference optimization through learning-to-rank. arXiv preprint arXiv:2402.01878.

[23] Liu, T.-Y. et al. (2009). Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 3(3):225–331.

[24] McGraw, K. O. and Wong, S. P. (1996). Forming inferences about some intraclass correlation coefficients. Psychological Methods, 1(1):30.

[25] Melnikov, V., Hüllermeier, E., Kaimann, D., Frick, B., and Gupta, P. (2016). Pairwise versus pointwise ranking: A case study. Schedae Informaticae, pages 73–83.

[26] Meng, Y., Xia, M., and Chen, D. (2024). Simpo: Simple preference optimization with a reference-free reward. arXiv preprint arXiv:2405.14734.

[27] Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano, P. F., Leike, J., and Lowe, R. (2022). Training language models to follow instructions with human feedback. In Koyejo, S., Mohamed, S., Agar...

[28] Pal, A., Karkhanis, D., Dooley, S., Roberts, M., Naidu, S., and White, C. (2024). Smaug: Fixing failure modes of preference optimisation with dpo-positive. arXiv preprint arXiv:2402.13228.

[29] Rafailov, R., Chittepu, Y., Park, R., Sikchi, H., Hejna, J., Knox, B., Finn, C., and Niekum, S. (2024). Scaling laws for reward model overoptimization in direct alignment algorithms. arXiv preprint arXiv:2406.02900.

[30] Rafailov, R., Sharma, A., Mitchell, E., Manning, C. D., Ermon, S., and Finn, C. (2023). Direct preference optimization: Your language model is secretly a reward model. In Thirty-seventh Conference on Neural Information Processing Systems.

[31] Rasley, J., Rajbhandari, S., Ruwase, O., and He, Y. (2020). Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 3505–3506.

[32] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal policy optimization algorithms. CoRR, abs/1707.06347.

[33] Searle, S. R., Casella, G., and McCulloch, C. E. (2009). Variance components. John Wiley & Sons.

[34] Shrout, P. E. and Fleiss, J. L. (1979). Intraclass correlations: uses in assessing rater reliability. Psychological Bulletin, 86(2):420.

[35] Stiennon, N., Ouyang, L., Wu, J., Ziegler, D. M., Lowe, R., Voss, C., Radford, A., Amodei, D., and Christiano, P. (2020). Learning to summarize from human feedback. In NeurIPS.

[36] Sun, S., Zhang, Y., Bukharin, A., Mosallanezhad, D., Zeng, J., Singhal, S., Shen, G., Renduchintala, A., Konuk, T., Dong, Y., et al. (2025). Reward-aware preference optimization: A unified mathematical framework for model alignment. arXiv preprint arXiv:2502.00203.

[37] Tang, Y., Guo, Z. D., Zheng, Z., Calandriello, D., Munos, R., Rowland, M., Richemond, P. H., Valko, M., Pires, B. Á., and Piot, B. (2024). Generalized preference optimization: A unified approach to offline alignment. arXiv preprint arXiv:2402.05749.

[38] Tunstall, L., Beeching, E., Lambert, N., Rajani, N., Rasul, K., Belkada, Y., Huang, S., von Werra, L., Fourrier, C., Habib, N., et al. (2023). Zephyr: Direct distillation of lm alignment. arXiv preprint arXiv:2310.16944.

[39] Tutnov, R., Grosnit, A., and Bou-Ammar, H. (2025). Many of your dpos are secretly one: Attempting unification through mutual information. arXiv preprint arXiv:2501.01544.

[40] Wang, R., Sun, J., Hua, S., and Fang, Q. (2024). Asft: Aligned supervised fine-tuning through absolute likelihood.

[41] Welleck, S., Kulikov, I., Roller, S., Dinan, E., Cho, K., and Weston, J. (2019). Neural text generation with unlikelihood training. arXiv preprint arXiv:1908.04319.

[42] Xiao, T., Yuan, Y., Zhu, H., Li, M., and Honavar, V. G. (2024). Cal-dpo: Calibrated direct preference optimization for language model alignment.

[43] Xu, S., Fu, W., Gao, J., Ye, W., Liu, W., Mei, Z., Wang, G., Yu, C., and Wu, Y. (2024). Is dpo superior to ppo for llm alignment? a comprehensive study. arXiv preprint arXiv:2404.10719.

[44] Zhao, H., Winata, G. I., Das, A., Zhang, S.-X., Yao, D. D., Tang, W., and Sahu, S. (2024). Rainbowpo: A unified framework for combining improvements in preference optimization. arXiv preprint arXiv:2410.04203.

[45] Zhou, C., Liu, P., Xu, P., Iyer, S., Sun, J., Mao, Y., Ma, X., Efrat, A., Yu, P., Yu, L., et al. (2024). Lima: Less is more for alignment. Advances in Neural Information Processing Systems, 36.

[1] [1]

online" 'onlinestring :=

ENTRY address archivePrefix author booktitle chapter edition editor eid eprint eprinttype howpublished institution journal key month note number organization pages publisher school series title type volume year doi pubmed url lastchecked label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block STRING...

work page

[2] [2]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...

work page

[3] [3]

Llama 3 model card

AI@Meta (2024). Llama 3 model card

work page 2024

[4] [4]

G., Rowland, M., Piot, B., Guo, D., Calandriello, D., Valko, M., and Munos, R

Azar, M. G., Rowland, M., Piot, B., Guo, D., Calandriello, D., Valko, M., and Munos, R. (2023). A general theoretical paradigm to understand learning from human preferences

work page 2023

[5] [5]

Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., Drain, D., Fort, S., Ganguli, D., Henighan, T. J., Joseph, N., Kadavath, S., Kernion, J., Conerly, T., El-Showk, S., Elhage, N., Hatfield-Dodds, Z., Hernandez, D., Hume, T., Johnston, S., Kravec, S., Lovitt, L., Nanda, N., Olsson, C., Amodei, D., Brown, T. B., Clark, J., McCandlish, S., ...

work page internal anchor Pith review Pith/arXiv arXiv 2022

[6] [6]

Bartko, J. J. (1966). The intraclass correlation coefficient as a measure of reliability. Psychological reports , 19(1):3--11

work page 1966

[7] [7]

Bradley, R. A. and Terry, M. E. (1952). Rank Analysis of Inclomplete Block Design: The Method of Paired Comparisons . Biometrika , 39(3-4):324--345

work page 1952

[8] [8]

Burges, C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N., and Hullender, G. (2005). Learning to rank using gradient descent. In Proceedings of the 22nd international conference on Machine learning , pages 89--96

work page 2005

[9] [9]

Chen, H., He, G., Yuan, L., Cui, G., Su, H., and Zhu, J. (2024). Noise contrastive alignment of language models with explicit rewards

work page 2024

[10] [10]

Cui, G., Yuan, L., Ding, N., Yao, G., Zhu, W., Ni, Y., Xie, G., Liu, Z., and Sun, M. (2023). Ultrafeedback: Boosting language models with high-quality feedback

work page 2023

[11] [11]

Dao, T. (2023). Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691

work page internal anchor Pith review Pith/arXiv arXiv 2023

[12] [12]

Ding, N., Chen, Y., Xu, B., Qin, Y., Hu, S., Liu, Z., Sun, M., and Zhou, B. (2023). Enhancing chat language models by scaling high-quality instructional conversations. In Bouamor, H., Pino, J., and Bali, K., editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pages 3029--3051, Singapore. Association for Comput...

work page 2023

[13] [13]

D'Oosterlinck, K., Xu, W., Develder, C., Demeester, T., Singh, A., Potts, C., Kiela, D., and Mehri, S. (2024). Anchored preference optimization and contrastive revisions: Addressing underspecification in alignment

work page 2024

[14] [14]

Dubois, Y., Galambosi, B., Liang, P., and Hashimoto, T. B. (2024). Length-controlled alpacaeval: A simple way to debias automatic evaluators. arXiv preprint arXiv:2404.04475

work page internal anchor Pith review Pith/arXiv arXiv 2024

[15] [15]

Gorbatovski, A., Shaposhnikov, B., Malakhov, A., Surnachev, N., Aksenov, Y., Maksimov, I., Balagansky, N., and Gavrilov, D. (2024). Learn your reference model for real good alignment. arXiv preprint arXiv:2404.09656

work page arXiv 2024

[16] [16]

Han, J., Jiang, M., Song, Y., Ermon, S., and Xu, M. (2024). f -po: Generalizing preference optimization with f -divergence minimization. arXiv preprint arXiv:2410.21662

work page arXiv 2024

[17] [17]

Hong, J., Lee, N., and Thorne, J. (2024). Orpo: Monolithic preference optimization without reference model

work page 2024

[18] [18]

Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. CoRR , abs/1412.6980

work page internal anchor Pith review Pith/arXiv arXiv 2014

[19] [19]

Li, H. (2011). A short introduction to learning to rank. IEICE TRANSACTIONS on Information and Systems , 94(10):1854--1862

work page 2011

[20] [20]

E., and Stoica, I

Li, T., Chiang, W.-L., Frick, E., Dunlap, L., Wu, T., Zhu, B., Gonzalez, J. E., and Stoica, I. (2024). From crowdsourced data to high-quality benchmarks: Arena-hard and benchbuilder pipeline

[21]

Li, X., Zhang, T., Dubois, Y., Taori, R., Gulrajani, I., Guestrin, C., Liang, P., and Hashimoto, T. B. (2023). Alpacaeval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval

[22]

Liu, T., Qin, Z., Wu, J., Shen, J., Khalman, M., Joshi, R., Zhao, Y., Saleh, M., Baumgartner, S., Liu, J., et al. (2024). Lipo: Listwise preference optimization through learning-to-rank. arXiv preprint arXiv:2402.01878

[23]

Liu, T.-Y. et al. (2009). Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 3(3):225--331

[24]

McGraw, K. O. and Wong, S. P. (1996). Forming inferences about some intraclass correlation coefficients. Psychological methods, 1(1):30

[25]

Melnikov, V., Hüllermeier, E., Kaimann, D., Frick, B., and Gupta, P. (2016). Pairwise versus pointwise ranking: A case study. Schedae Informaticae, pages 73--83

[26]

Meng, Y., Xia, M., and Chen, D. (2024). Simpo: Simple preference optimization with a reference-free reward. arXiv preprint arXiv:2405.14734

[27]

Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano, P. F., Leike, J., and Lowe, R. (2022). Training language models to follow instructions with human feedback. In Koyejo, S., Mohamed, S., Agar...

[28]

Pal, A., Karkhanis, D., Dooley, S., Roberts, M., Naidu, S., and White, C. (2024). Smaug: Fixing failure modes of preference optimisation with dpo-positive. arXiv preprint arXiv:2402.13228

[29]

Rafailov, R., Chittepu, Y., Park, R., Sikchi, H., Hejna, J., Knox, B., Finn, C., and Niekum, S. (2024). Scaling laws for reward model overoptimization in direct alignment algorithms. arXiv preprint arXiv:2406.02900

[30]

Rafailov, R., Sharma, A., Mitchell, E., Manning, C. D., Ermon, S., and Finn, C. (2023). Direct preference optimization: Your language model is secretly a reward model. In Thirty-seventh Conference on Neural Information Processing Systems

[31]

Rasley, J., Rajbhandari, S., Ruwase, O., and He, Y. (2020). Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 3505--3506

[32]

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal policy optimization algorithms. CoRR, abs/1707.06347

[33]

Searle, S. R., Casella, G., and McCulloch, C. E. (2009). Variance components. John Wiley & Sons

[34]

Shrout, P. E. and Fleiss, J. L. (1979). Intraclass correlations: uses in assessing rater reliability. Psychological bulletin, 86(2):420

[35]

Stiennon, N., Ouyang, L., Wu, J., Ziegler, D. M., Lowe, R., Voss, C., Radford, A., Amodei, D., and Christiano, P. (2020). Learning to summarize from human feedback. In NeurIPS

[36]

Sun, S., Zhang, Y., Bukharin, A., Mosallanezhad, D., Zeng, J., Singhal, S., Shen, G., Renduchintala, A., Konuk, T., Dong, Y., et al. (2025). Reward-aware preference optimization: A unified mathematical framework for model alignment. arXiv preprint arXiv:2502.00203

[37]

Tang, Y., Guo, Z. D., Zheng, Z., Calandriello, D., Munos, R., Rowland, M., Richemond, P. H., Valko, M., Pires, B. Á., and Piot, B. (2024). Generalized preference optimization: A unified approach to offline alignment. arXiv preprint arXiv:2402.05749

[38]

Tunstall, L., Beeching, E., Lambert, N., Rajani, N., Rasul, K., Belkada, Y., Huang, S., von Werra, L., Fourrier, C., Habib, N., et al. (2023). Zephyr: Direct distillation of lm alignment. arXiv preprint arXiv:2310.16944

[39]

Tutnov, R., Grosnit, A., and Bou-Ammar, H. (2025). Many of your dpos are secretly one: Attempting unification through mutual information. arXiv preprint arXiv:2501.01544

[40]

Wang, R., Sun, J., Hua, S., and Fang, Q. (2024). Asft: Aligned supervised fine-tuning through absolute likelihood

[41]

Welleck, S., Kulikov, I., Roller, S., Dinan, E., Cho, K., and Weston, J. (2019). Neural text generation with unlikelihood training. arXiv preprint arXiv:1908.04319

[42]

Xiao, T., Yuan, Y., Zhu, H., Li, M., and Honavar, V. G. (2024). Cal-dpo: Calibrated direct preference optimization for language model alignment

[43]

Xu, S., Fu, W., Gao, J., Ye, W., Liu, W., Mei, Z., Wang, G., Yu, C., and Wu, Y. (2024). Is dpo superior to ppo for llm alignment? a comprehensive study. arXiv preprint arXiv:2404.10719

[44]

Zhao, H., Winata, G. I., Das, A., Zhang, S.-X., Yao, D. D., Tang, W., and Sahu, S. (2024). Rainbowpo: A unified framework for combining improvements in preference optimization. arXiv preprint arXiv:2410.04203

[45]

Zhou, C., Liu, P., Xu, P., Iyer, S., Sun, J., Mao, Y., Ma, X., Efrat, A., Yu, P., Yu, L., et al. (2024). Lima: Less is more for alignment. Advances in Neural Information Processing Systems, 36
