ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.
hub
Neural text generation with unlikelihood training
14 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
Self-attention acts as a covariance readout that unifies in-context learning via population gradient descent and repetitive generation via asymptotic Markov behavior.
DPO derives the optimal policy directly from human preferences via a reparameterized reward model, solving the RLHF objective with only a binary classification loss and no sampling or separate reward model.
Reinforcement learning on a reward model trained from human summary comparisons produces summaries humans prefer over supervised fine-tuning or human references on TL;DR and transfers to CNN/DM.
CLORE augments correct on-policy rollouts by deleting repetitive and irrelevant segments then optimizes with auxiliary DPO to improve accuracy-efficiency trade-off on math benchmarks.
Interaction-layer antidistillation watermarks use system-prompt-induced behavioral markers like explicit follow-up questions that transfer to distilled student models at 45-89% relative fidelity and can be audited via black-box LLM-as-judge queries.
SOMA estimates a local response manifold from early turns and adapts a small surrogate model via divergence-maximizing prompts and localized LoRA fine-tuning for efficient multi-turn serving.
Annotation-anchored training reduces semantic diversity collapse in post-trained language models by a factor of six compared to standard supervised fine-tuning while preserving instruction-following and improving with scale.
UAG is a universal avoidance generation method that increases multi-branch diversity in diffusion and transformer models by penalizing output similarity, delivering up to 1.9x higher diversity with 4.4x speed and 1/64th the FLOPs of prior methods.
A controlled unification of direct alignment algorithms shows the ranking objective (pairwise vs pointwise) drives alignment quality more than the scalar score optimized.
Agent Q integrates MCTS-guided search, self-critique, and off-policy DPO to train LLM agents that outperform behavior cloning and reinforced fine-tuning baselines in WebShop and achieve up to 95.4% success in real-world booking scenarios.
CTRL is a large conditional transformer language model that uses naturally occurring control codes to steer text generation style and content.
SPEAR enables online federated LLM fine-tuning by using feedback-guided self-play to create contrastive pairs trained with maximum likelihood on correct completions and confidence-weighted unlikelihood on incorrect ones, outperforming baselines without ground-truth contexts.
POVID generates AI-created preference data to fine-tune vision-language models with DPO, reducing hallucinations and improving benchmark scores.
citing papers explorer
-
ORPO: Monolithic Preference Optimization without Reference Model
ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.
-
Self-Attention as a Covariance Readout: A Unified View of In-Context Learning and Repetition
Self-attention acts as a covariance readout that unifies in-context learning via population gradient descent and repetitive generation via asymptotic Markov behavior.
-
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
DPO derives the optimal policy directly from human preferences via a reparameterized reward model, solving the RLHF objective with only a binary classification loss and no sampling or separate reward model.
-
Learning to summarize from human feedback
Reinforcement learning on a reward model trained from human summary comparisons produces summaries humans prefer over supervised fine-tuning or human references on TL;DR and transfers to CNN/DM.
-
CLORE: Content-Level Optimization for Reasoning Efficiency
CLORE augments correct on-policy rollouts by deleting repetitive and irrelevant segments then optimizes with auxiliary DPO to improve accuracy-efficiency trade-off on math benchmarks.
-
Asking Back: Interaction-Layer Antidistillation Watermarks
Interaction-layer antidistillation watermarks use system-prompt-induced behavioral markers like explicit follow-up questions that transfer to distilled student models at 45-89% relative fidelity and can be audited via black-box LLM-as-judge queries.
-
SOMA: Efficient Multi-turn LLM Serving via Small Language Model
SOMA estimates a local response manifold from early turns and adapts a small surrogate model via divergence-maximizing prompts and localized LoRA fine-tuning for efficient multi-turn serving.
-
Annotations Mitigate Post-Training Mode Collapse
Annotation-anchored training reduces semantic diversity collapse in post-trained language models by a factor of six compared to standard supervised fine-tuning while preserving instruction-following and improving with scale.
-
A Universal Avoidance Method for Diverse Multi-branch Generation
UAG is a universal avoidance generation method that increases multi-branch diversity in diffusion and transformer models by penalizing output similarity, delivering up to 1.9x higher diversity with 4.4x speed and 1/64th the FLOPs of prior methods.
-
The Differences Between Direct Alignment Algorithms are a Blur
A controlled unification of direct alignment algorithms shows the ranking objective (pairwise vs pointwise) drives alignment quality more than the scalar score optimized.
-
Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents
Agent Q integrates MCTS-guided search, self-critique, and off-policy DPO to train LLM agents that outperform behavior cloning and reinforced fine-tuning baselines in WebShop and achieve up to 95.4% success in real-world booking scenarios.
-
CTRL: A Conditional Transformer Language Model for Controllable Generation
CTRL is a large conditional transformer language model that uses naturally occurring control codes to steer text generation style and content.
-
Self-Play Enhancement via Advantage-Weighted Refinement in Online Federated LLM Fine-Tuning with Real-Time Feedback
SPEAR enables online federated LLM fine-tuning by using feedback-guided self-play to create contrastive pairs trained with maximum likelihood on correct completions and confidence-weighted unlikelihood on incorrect ones, outperforming baselines without ground-truth contexts.
-
Aligning Modalities in Vision Large Language Models via Preference Fine-tuning
POVID generates AI-created preference data to fine-tune vision-language models with DPO, reducing hallucinations and improving benchmark scores.