hub Mixed citations

BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset

Mixed citation behavior. Most common role is background (67%).

31 Pith papers citing it

citation-role summary: background 4, dataset 2

representative citing papers

Theoretical Limits of Language Model Alignment

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

The maximum reward gain under KL-regularized LM alignment is a Jeffreys divergence term, estimable as covariance from base samples, with best-of-N approaching the theoretical limit.

Quality Is Not a Safety Proxy Under Quantization

cs.LG · 2026-06-08 · conditional · novelty 6.0

Across 51 quantized checkpoints, quality metrics fail to predict safety drops in 36 pairings and 10 hidden-danger cases, while a new RTSI screen routes all 10 dangerous rows to testing at matched bucket size.

Alignment Dynamics in LLM Fine-Tuning

cs.LG · 2026-05-18 · unverdicted · novelty 6.0

The paper introduces a dynamical model that decomposes alignment updates in LLM fine-tuning into rebound and driving forces and predicts a rehearsal priming effect.

The Safety-Aware Denoiser for Text Diffusion Models

cs.LG · 2026-04-28 · unverdicted · novelty 5.0

Safety-Aware Denoiser integrates safety guidance into the denoising steps of text diffusion models to reduce unsafe generations while maintaining quality.
