Scaling laws for reward model overoptimization

2023

2 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Same Words, Different Judgments: How Preferences Vary Across Modalities

cs.SD · 2026-02-26 · unverdicted · novelty 7.0

Human preferences for the same semantic content show near-chance agreement between text and audio, with audio raters using narrower decision thresholds, less length bias, and more user-oriented criteria.

RedDiffuser: Auditing Multimodal Safety Failures in Vision-Language Models via Reinforced Diffusion

cs.CV · 2025-03-08 · unverdicted · novelty 6.0

RedDiffuser is a reinforced diffusion framework that generates adversarial visual contexts to audit and expose widespread multimodal safety failures in VLMs, increasing unsafe response rates by up to 10.69% on LLaVA with transfer to other models.

citing papers explorer

Showing 2 of 2 citing papers.

Same Words, Different Judgments: How Preferences Vary Across Modalities

cs.SD · 2026-02-26 · unverdicted · none · ref 26

RedDiffuser: Auditing Multimodal Safety Failures in Vision-Language Models via Reinforced Diffusion

cs.CV · 2025-03-08 · unverdicted · none · ref 40
