hub Mixed citations

BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset

Mixed citation behavior. Most common role is background (67%).

27 Pith papers citing it
Background 67% of classified citations

citation-role summary

background: 4 · dataset: 2

representative citing papers

Theoretical Limits of Language Model Alignment

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

The maximum reward gain under KL-regularized LM alignment is a Jeffreys divergence term, estimable as covariance from base samples, with best-of-N approaching the theoretical limit.

Alignment Dynamics in LLM Fine-Tuning

cs.LG · 2026-05-18 · unverdicted · novelty 6.0

The paper introduces a dynamical model that decomposes alignment updates in LLM fine-tuning into rebound and driving forces and predicts a rehearsal priming effect.

The Safety-Aware Denoiser for Text Diffusion Models

cs.LG · 2026-04-28 · unverdicted · novelty 5.0

Safety-Aware Denoiser integrates safety guidance into the denoising steps of text diffusion models to reduce unsafe generations while maintaining quality.

TrustLLM: Trustworthiness in Large Language Models

cs.CL · 2024-01-10 · unverdicted · novelty 5.0

TrustLLM defines eight trustworthiness principles, builds a benchmark spanning six dimensions, and evaluates 16 LLMs. Proprietary models generally lead, though some open-source models come close, and over-calibration toward trustworthiness can hurt utility.

ShieldGemma: Generative AI Content Moderation Based on Gemma

cs.CL · 2024-07-31 · unverdicted · novelty 4.0

ShieldGemma delivers a family of Gemma2-based classifiers that outperform Llama Guard and WildGuard on public safety benchmarks, and it introduces a synthetic-data curation pipeline for safety tasks.
