Aligning Large Language Models via Fine-grained Supervision

Dehong Xu; Faisal Ladhak; Jaeyoung Do; Liang Qiu; Minseok Kim

arxiv: 2406.02756 · v1 · pith:QWXZ7FABnew · submitted 2024-06-04 · 💻 cs.CL · cs.AI · cs.LG

classification 💻 cs.CL · cs.AI · cs.LG

keywords model, feedback, fine-grained, alignment, approach, dataset, human, language

Abstract

Pre-trained large-scale language models (LLMs) excel at producing coherent articles, yet their outputs may be untruthful, toxic, or misaligned with user expectations. Current approaches use reinforcement learning with human feedback (RLHF) to improve model alignment, transforming coarse human preferences over LLM outputs into a feedback signal that guides the model learning process. However, because this approach operates on sequence-level feedback, it lacks the precision to identify the exact parts of the output that affect user preferences. To address this gap, we propose a method to enhance LLM alignment through fine-grained token-level supervision. Specifically, we ask annotators to minimally edit the less preferred responses within the standard reward modeling dataset to make them more favorable, ensuring changes are made only where necessary while retaining most of the original content. The refined dataset is used to train a token-level reward model, which is then used to train our fine-grained Proximal Policy Optimization (PPO) model. Our experimental results demonstrate that this approach achieves an absolute improvement of up to $5.1\%$ in win rate against the reference model, compared with the traditional PPO model.
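The abstract does not say how token-level labels are obtained from the annotators' minimal edits. One plausible reading is that a diff between the original response and its edited version marks which tokens changed. The sketch below illustrates that idea only: the function name `token_level_labels`, the whitespace tokenization, and the label values (-1 for tokens removed from the original, +1 for tokens introduced by the edit, 0 for unchanged tokens) are all assumptions for illustration, not the authors' implementation.

```python
from difflib import SequenceMatcher


def token_level_labels(original, edited):
    """Hypothetical helper: label each token by whether the edit touched it.

    Assumed labeling scheme (not taken from the paper):
      -1  token in the original that the annotator removed or replaced
      +1  token in the edited response that the annotator added
       0  token left unchanged
    """
    orig_labels = [0] * len(original)
    edit_labels = [0] * len(edited)
    matcher = SequenceMatcher(a=original, b=edited, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # unchanged span keeps label 0 on both sides
        for i in range(i1, i2):  # tokens deleted or replaced in the original
            orig_labels[i] = -1
        for j in range(j1, j2):  # tokens inserted or substituted by the edit
            edit_labels[j] = 1
    return orig_labels, edit_labels


# Toy example with whitespace tokens (real systems would use model tokenizers).
original = "The capital of Australia is Sydney .".split()
edited = "The capital of Australia is Canberra .".split()
orig_labels, edit_labels = token_level_labels(original, edited)
print(orig_labels)  # [0, 0, 0, 0, 0, -1, 0]
print(edit_labels)  # [0, 0, 0, 0, 0, 1, 0]
```

Because the edits are minimal, most labels stay at 0 and the nonzero labels concentrate on the tokens that drove the preference, which is the kind of localized signal a sequence-level preference label cannot provide.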

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Selective-Advantage Entropy-Adaptive Horizon GRPO: Asymmetric Token-Level Discounting for Efficient Reinforcement Learning of Language Models
cs.LG 2026-06 unverdicted novelty 5.0

SA-AH-GRPO applies asymmetric entropy-based discounting only to negative-advantage trajectories in GRPO, yielding similar peak Pass@1 accuracy with 3.6x lower training variance on GSM8K for Qwen 2.5 models.

SAID: Safety-Aware Intent Defense via Prefix Probing for Large Language Models
cs.CR 2025-10 unverdicted novelty 5.0

SAID is a training-free defense that distills obfuscated prompts into intents, probes them with safety prefixes, and rejects if any intent is unsafe, claiming SOTA jailbreak resistance on open LLMs.