Length desensitization in direct preference optimization

Wei Liu, Yang Bai, Chengcheng Han, Rongxiang Weng, Jun Xu, Xuezhi Cao, Jingang Wang, Xunliang Cai · 2024 · arXiv 2409.06411

5 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

TiCo: Time-Controllable Spoken Dialogue Model

cs.CL · 2026-03-23 · unverdicted · novelty 7.0

TiCo enables spoken dialogue models to follow explicit time constraints in generated responses using Spoken Time Markers and reinforcement learning with verifiable rewards, cutting duration error by 2.7x over its backbone.

Learning to Control Summaries with Score Ranking

cs.CL · 2026-04-19 · unverdicted · novelty 6.0

A score-ranking loss enables controllable summarization by aligning outputs to evaluation scores, matching SOTA performance with dimension-specific control on LLaMA, Qwen, and Mistral.

GroupDPO: Memory efficient Group-wise Direct Preference Optimization

cs.CL · 2026-04-17 · unverdicted · novelty 6.0

GroupDPO decouples group-wise preference optimization during backpropagation to cut peak memory while keeping the same gradients, allowing larger groups and consistent gains over single-pair DPO plus an NLL term on positives.

Modularized Reinforcement Learning on LLMs: From MDP Creation to Exploration and Learning

cs.LG · 2026-06-20 · unverdicted · novelty 5.0

Survey mapping RL techniques onto LLM training and highlighting gaps in value-based, off-policy, and bootstrapping methods.

From Coarse to Fine: Self-Adaptive Hierarchical Planning for LLM Agents

cs.AI · 2026-04-25 · unverdicted · novelty 5.0

AdaPlan-H enables LLM agents to generate self-adaptive hierarchical plans that adjust detail level to task difficulty, improving success rates in multi-step tasks.

citing papers explorer

From Coarse to Fine: Self-Adaptive Hierarchical Planning for LLM Agents

cs.AI · 2026-04-25 · unverdicted · none · ref 13

AdaPlan-H enables LLM agents to generate self-adaptive hierarchical plans that adjust detail level to task difficulty, improving success rates in multi-step tasks.
