arXiv preprint arXiv:2505.02666 , year=

A survey on progress in llm alignment from the perspective of reward design , author= · 2025 · arXiv 2505.02666

2 Pith papers cite this work. Polarity classification is still indexing.

2 Pith papers citing it

read on arXiv browse 2 citing papers

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Safety, Security, and Cognitive Risks in World Models

cs.CR · 2026-04-01 · unverdicted · novelty 6.0

World models enable efficient AI planning but create risks from adversarial corruption, goal misgeneralization, and human bias, demonstrated via attacks that amplify errors and reduce rewards on models like RSSM and DreamerV3.

ClaHF: A Human Feedback-inspired Reinforcement Learning Framework for Improving Classification Tasks

cs.LG · 2026-05-17 · unverdicted · novelty 4.0

ClaHF converts instance labels into preference signals via candidate predictions and a reward model, then applies RL optimization to improve text classification accuracy and calibration.

citing papers explorer

Showing 2 of 2 citing papers.

Safety, Security, and Cognitive Risks in World Models cs.CR · 2026-04-01 · unverdicted · none · ref 68
World models enable efficient AI planning but create risks from adversarial corruption, goal misgeneralization, and human bias, demonstrated via attacks that amplify errors and reduce rewards on models like RSSM and DreamerV3.
ClaHF: A Human Feedback-inspired Reinforcement Learning Framework for Improving Classification Tasks cs.LG · 2026-05-17 · unverdicted · none · ref 26
ClaHF converts instance labels into preference signals via candidate predictions and a reward model, then applies RL optimization to improve text classification accuracy and calibration.

arXiv preprint arXiv:2505.02666 , year=

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer