Aligning Large Language Models via Fine-grained Supervision
read the original abstract
Pre-trained large-scale language models (LLMs) excel at producing coherent articles, yet their outputs may be untruthful, toxic, or fail to align with user expectations. Current approaches focus on using reinforcement learning with human feedback (RLHF) to improve model alignment, which works by transforming coarse human preferences of LLM outputs into a feedback signal that guides the model learning process. However, because this approach operates on sequence-level feedback, it lacks the precision to identify the exact parts of the output affecting user preferences. To address this gap, we propose a method to enhance LLM alignment through fine-grained token-level supervision. Specifically, we ask annotators to minimally edit less preferred responses within the standard reward modeling dataset to make them more favorable, ensuring changes are made only where necessary while retaining most of the original content. The refined dataset is used to train a token-level reward model, which is then used for training our fine-grained Proximal Policy Optimization (PPO) model. Our experiment results demonstrate that this approach can achieve up to an absolute improvement of $5.1\%$ in LLM performance, in terms of win rate against the reference model, compared with the traditional PPO model.
This paper has not been read by Pith yet.
Forward citations
Cited by 2 Pith papers
-
Selective-Advantage Entropy-Adaptive Horizon GRPO: Asymmetric Token-Level Discounting for Efficient Reinforcement Learning of Language Models
SA-AH-GRPO applies asymmetric entropy-based discounting only to negative-advantage trajectories in GRPO, yielding similar peak Pass@1 accuracy with 3.6x lower training variance on GSM8K for Qwen 2.5 models.
-
SAID: Safety-Aware Intent Defense via Prefix Probing for Large Language Models
SAID is a training-free defense that distills obfuscated prompts into intents, probes them with safety prefixes, and rejects if any intent is unsafe, claiming SOTA jailbreak resistance on open LLMs.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.