hub

Neural text generation with unlikelihood training

Welleck, S · 2019 · arXiv 1908.04319

14 Pith papers cite this work. Polarity classification is still indexing.

14 Pith papers citing it

read on arXiv browse 14 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

method 2 background 1

citation-polarity summary

use method 2 background 1

representative citing papers

ORPO: Monolithic Preference Optimization without Reference Model

cs.CL · 2024-03-12 · conditional · novelty 8.0

ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.

Self-Attention as a Covariance Readout: A Unified View of In-Context Learning and Repetition

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

Self-attention acts as a covariance readout that unifies in-context learning via population gradient descent and repetitive generation via asymptotic Markov behavior.

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

cs.LG · 2023-05-29 · accept · novelty 7.0

DPO derives the optimal policy directly from human preferences via a reparameterized reward model, solving the RLHF objective with only a binary classification loss and no sampling or separate reward model.

Learning to summarize from human feedback

cs.CL · 2020-09-02 · conditional · novelty 7.0

Reinforcement learning on a reward model trained from human summary comparisons produces summaries humans prefer over supervised fine-tuning or human references on TL;DR and transfers to CNN/DM.

CLORE: Content-Level Optimization for Reasoning Efficiency

cs.AI · 2026-05-21 · unverdicted · novelty 6.0

CLORE augments correct on-policy rollouts by deleting repetitive and irrelevant segments then optimizes with auxiliary DPO to improve accuracy-efficiency trade-off on math benchmarks.

Asking Back: Interaction-Layer Antidistillation Watermarks

cs.CR · 2026-05-15 · unverdicted · novelty 6.0

Interaction-layer antidistillation watermarks use system-prompt-induced behavioral markers like explicit follow-up questions that transfer to distilled student models at 45-89% relative fidelity and can be audited via black-box LLM-as-judge queries.

SOMA: Efficient Multi-turn LLM Serving via Small Language Model

cs.CL · 2026-05-11 · unverdicted · novelty 6.0

SOMA estimates a local response manifold from early turns and adapts a small surrogate model via divergence-maximizing prompts and localized LoRA fine-tuning for efficient multi-turn serving.

Annotations Mitigate Post-Training Mode Collapse

cs.CL · 2026-05-11 · unverdicted · novelty 6.0

Annotation-anchored training reduces semantic diversity collapse in post-trained language models by a factor of six compared to standard supervised fine-tuning while preserving instruction-following and improving with scale.

A Universal Avoidance Method for Diverse Multi-branch Generation

cs.CL · 2026-04-19 · unverdicted · novelty 6.0

UAG is a universal avoidance generation method that increases multi-branch diversity in diffusion and transformer models by penalizing output similarity, delivering up to 1.9x higher diversity with 4.4x speed and 1/64th the FLOPs of prior methods.

The Differences Between Direct Alignment Algorithms are a Blur

cs.LG · 2025-02-03 · unverdicted · novelty 6.0

A controlled unification of direct alignment algorithms shows the ranking objective (pairwise vs pointwise) drives alignment quality more than the scalar score optimized.

Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents

cs.AI · 2024-08-13 · unverdicted · novelty 6.0

Agent Q integrates MCTS-guided search, self-critique, and off-policy DPO to train LLM agents that outperform behavior cloning and reinforced fine-tuning baselines in WebShop and achieve up to 95.4% success in real-world booking scenarios.

CTRL: A Conditional Transformer Language Model for Controllable Generation

cs.CL · 2019-09-11 · unverdicted · novelty 6.0

CTRL is a large conditional transformer language model that uses naturally occurring control codes to steer text generation style and content.

Self-Play Enhancement via Advantage-Weighted Refinement in Online Federated LLM Fine-Tuning with Real-Time Feedback

cs.LG · 2026-05-08 · unverdicted · novelty 5.0

SPEAR enables online federated LLM fine-tuning by using feedback-guided self-play to create contrastive pairs trained with maximum likelihood on correct completions and confidence-weighted unlikelihood on incorrect ones, outperforming baselines without ground-truth contexts.

Aligning Modalities in Vision Large Language Models via Preference Fine-tuning

cs.LG · 2024-02-18 · unverdicted · novelty 5.0

POVID generates AI-created preference data to fine-tune vision-language models with DPO, reducing hallucinations and improving benchmark scores.

citing papers explorer

Showing 14 of 14 citing papers.

ORPO: Monolithic Preference Optimization without Reference Model cs.CL · 2024-03-12 · conditional · none · ref 136
ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.
Self-Attention as a Covariance Readout: A Unified View of In-Context Learning and Repetition cs.LG · 2026-05-11 · unverdicted · none · ref 29
Self-attention acts as a covariance readout that unifies in-context learning via population gradient descent and repetitive generation via asymptotic Markov behavior.
Direct Preference Optimization: Your Language Model is Secretly a Reward Model cs.LG · 2023-05-29 · accept · none · ref 48
DPO derives the optimal policy directly from human preferences via a reparameterized reward model, solving the RLHF objective with only a binary classification loss and no sampling or separate reward model.
Learning to summarize from human feedback cs.CL · 2020-09-02 · conditional · none · ref 64
Reinforcement learning on a reward model trained from human summary comparisons produces summaries humans prefer over supervised fine-tuning or human references on TL;DR and transfers to CNN/DM.
CLORE: Content-Level Optimization for Reasoning Efficiency cs.AI · 2026-05-21 · unverdicted · none · ref 47
CLORE augments correct on-policy rollouts by deleting repetitive and irrelevant segments then optimizes with auxiliary DPO to improve accuracy-efficiency trade-off on math benchmarks.
Asking Back: Interaction-Layer Antidistillation Watermarks cs.CR · 2026-05-15 · unverdicted · none · ref 40
Interaction-layer antidistillation watermarks use system-prompt-induced behavioral markers like explicit follow-up questions that transfer to distilled student models at 45-89% relative fidelity and can be audited via black-box LLM-as-judge queries.
SOMA: Efficient Multi-turn LLM Serving via Small Language Model cs.CL · 2026-05-11 · unverdicted · none · ref 49
SOMA estimates a local response manifold from early turns and adapts a small surrogate model via divergence-maximizing prompts and localized LoRA fine-tuning for efficient multi-turn serving.
Annotations Mitigate Post-Training Mode Collapse cs.CL · 2026-05-11 · unverdicted · none · ref 16
Annotation-anchored training reduces semantic diversity collapse in post-trained language models by a factor of six compared to standard supervised fine-tuning while preserving instruction-following and improving with scale.
A Universal Avoidance Method for Diverse Multi-branch Generation cs.CL · 2026-04-19 · unverdicted · none · ref 29
UAG is a universal avoidance generation method that increases multi-branch diversity in diffusion and transformer models by penalizing output similarity, delivering up to 1.9x higher diversity with 4.4x speed and 1/64th the FLOPs of prior methods.
The Differences Between Direct Alignment Algorithms are a Blur cs.LG · 2025-02-03 · unverdicted · none · ref 41
A controlled unification of direct alignment algorithms shows the ranking objective (pairwise vs pointwise) drives alignment quality more than the scalar score optimized.
Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents cs.AI · 2024-08-13 · unverdicted · none · ref 158
Agent Q integrates MCTS-guided search, self-critique, and off-policy DPO to train LLM agents that outperform behavior cloning and reinforced fine-tuning baselines in WebShop and achieve up to 95.4% success in real-world booking scenarios.
CTRL: A Conditional Transformer Language Model for Controllable Generation cs.CL · 2019-09-11 · unverdicted · none · ref 48
CTRL is a large conditional transformer language model that uses naturally occurring control codes to steer text generation style and content.
Self-Play Enhancement via Advantage-Weighted Refinement in Online Federated LLM Fine-Tuning with Real-Time Feedback cs.LG · 2026-05-08 · unverdicted · none · ref 41
SPEAR enables online federated LLM fine-tuning by using feedback-guided self-play to create contrastive pairs trained with maximum likelihood on correct completions and confidence-weighted unlikelihood on incorrect ones, outperforming baselines without ground-truth contexts.
Aligning Modalities in Vision Large Language Models via Preference Fine-tuning cs.LG · 2024-02-18 · unverdicted · none · ref 76
POVID generates AI-created preference data to fine-tune vision-language models with DPO, reducing hallucinations and improving benchmark scores.

Neural text generation with unlikelihood training

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer