Title resolution pending

Direct Preference Optimization: Your Language Model is Secretly a Reward Model · 2023

5 Pith papers cite this work. Polarity classification is still indexing.

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

representative citing papers

ORPO: Monolithic Preference Optimization without Reference Model

cs.CL · 2024-03-12 · conditional · novelty 8.0

ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.

RouterBench: A Benchmark for Multi-LLM Routing System

cs.LG · 2024-03-18 · unverdicted · novelty 7.0

RouterBench supplies a standardized benchmark, 405k+ inference dataset, theoretical framework, and comparative analysis for multi-LLM routing systems.

Dual Hierarchical Dialogue Policy Learning for Legal Inquisitive Conversational Agents

cs.CL · 2026-05-13 · unverdicted · novelty 6.0 · 2 refs

A dual hierarchical RL framework with two agents coordinates high-level dialogue strategy and low-level question generation to emulate judicial questioning and extract key information from Supreme Court arguments, outperforming baselines.

Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs

cs.LG · 2024-02-22 · conditional · novelty 6.0

REINFORCE-style variants outperform PPO, DPO, and RAFT in RLHF for LLMs by removing unnecessary PPO components and adapting the simpler method to LLM alignment characteristics.

A Survey on Knowledge Distillation of Large Language Models

cs.CL · 2024-02-20 · accept · novelty 3.0

A comprehensive survey of knowledge distillation for LLMs structured around algorithms, skill enhancement, and vertical applications, highlighting data augmentation as a key enabler.

citing papers explorer

Showing 5 of 5 citing papers.

ORPO: Monolithic Preference Optimization without Reference Model · cs.CL · 2024-03-12 · conditional · none · ref 67
RouterBench: A Benchmark for Multi-LLM Routing System · cs.LG · 2024-03-18 · unverdicted · none · ref 57
Dual Hierarchical Dialogue Policy Learning for Legal Inquisitive Conversational Agents · cs.CL · 2026-05-13 · unverdicted · none · ref 92 · 2 links
Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs · cs.LG · 2024-02-22 · conditional · none · ref 61
A Survey on Knowledge Distillation of Large Language Models · cs.CL · 2024-02-20 · accept · none · ref 73
