hub Canonical reference

arXiv preprint arXiv:2406.15513

Ji, Jiaming, Donghai Hong, Borong Zhang, Boyuan Chen, Josef Dai, Boren Zheng, Tianyi Qiu, Boxun Li, Yaodong Yang · 2024 · arXiv 2406.15513

Canonical reference. 75% of the classified citations from Pith papers (6 of 8) cite this work as background.

17 Pith papers citing it

Background 75% of classified citations

citation-role summary

background 6 · dataset 1 · other 1

citation-polarity summary

background 6 · unclear 1 · use dataset 1

representative citing papers

Theoretical Limits of Language Model Alignment

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

The maximum reward gain under KL-regularized LM alignment is a Jeffreys divergence term, estimable as covariance from base samples, with best-of-N approaching the theoretical limit.

Incentivizing High-Quality Human Annotations with Golden Questions

cs.GT · 2025-05-25 · unverdicted · novelty 7.0

The paper derives a Θ(1/√(n log n)) hypothesis testing rate under strategic annotator behavior and shows that high-certainty, format-similar golden questions better reveal annotation quality than standard checks.

Toward Stable Value Alignment: Introducing Independent Modules for Consistent Value Guidance

cs.AI · 2026-05-12 · unverdicted · novelty 6.0

SVGT adds independent value modules and Bridge Tokens to LLMs to maintain consistent value guidance, cutting harmful outputs by over 70% in tests while preserving fluency.

Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion

cs.AI · 2026-05-12 · unverdicted · novelty 6.0 · 2 refs

MORA breaks the safety-helpfulness ceiling in LLMs by pre-sampling single-reward prompts and rewriting them to incorporate multi-dimensional intents, delivering 5-12.4% gains in sequential alignment and 4.6% overall improvement in simultaneous alignment.

Structure from Strategic Interaction & Uncertainty: Risk Sensitive Games for Robust Preference Learning

cs.GT · 2026-05-11 · unverdicted · novelty 6.0 · 2 refs

Risk-sensitive preference games using convex risk measures produce policies that are robust across data strata and match or exceed standard Nash learning performance without added cost.

GLiGuard: Schema-Conditioned Classification for LLM Safeguard

cs.CL · 2026-05-08 · unverdicted · novelty 6.0

GLiGuard is a compact schema-conditioned bidirectional encoder that matches 7B-27B guard models on safety benchmarks while delivering up to 16x higher throughput and 17x lower latency.

Disentangling Intent from Role: Adversarial Self-Play for Persona-Invariant Safety Alignment

cs.AI · 2026-05-03 · unverdicted · novelty 6.0

PIA achieves lower attack success rates on persona-based jailbreaks via self-play co-evolution of attacks (PLE) and defenses (PICL) that structurally decouples safety from persona context using unilateral KL-divergence.

SafetyALFRED: Evaluating Safety-Conscious Planning of Multimodal Large Language Models

cs.AI · 2026-04-21 · unverdicted · novelty 6.0

SafetyALFRED shows multimodal LLMs recognize kitchen hazards accurately in QA tests but achieve low success rates when required to mitigate those hazards through embodied planning.

LLM Safety From Within: Detecting Harmful Content with Internal Representations

cs.AI · 2026-04-20 · unverdicted · novelty 6.0

SIREN identifies safety neurons via linear probing on internal LLM layers and combines them with adaptive weighting to detect harm, outperforming prior guard models with 250x fewer parameters.

Characterizing Model-Native Skills

cs.AI · 2026-04-19 · conditional · novelty 6.0

Recovering an orthogonal basis from model activations yields a model-native skill characterization that improves reasoning Pass@1 by up to 41% via targeted data selection and supports inference steering, outperforming human-characterized alternatives.

Safety Alignment as Continual Learning: Mitigating the Alignment Tax via Orthogonal Gradient Projection

cs.LG · 2026-02-08 · conditional · novelty 6.0

OGPSA projects safety gradients orthogonal to a low-rank subspace from general capability gradients, improving safety-utility trade-offs in SFT and DPO pipelines on Qwen2.5-7B and Llama3.1-8B.

SCOUT: A Defense Against Data Poisoning Attacks in Fine-Tuned Language Models

cs.CR · 2025-12-10 · unverdicted · novelty 6.0

SCOUT uses token saliency analysis to detect both standard and contextually-plausible backdoor attacks in language models while maintaining clean accuracy.

The Realignment Problem: When Right becomes Wrong in LLMs

cs.CL · 2025-11-04 · unverdicted · novelty 6.0

TRACE is a three-stage optimization framework that realigns LLMs to new policies by categorizing preference conflicts, scoring impact via bi-level optimization, and applying hybrid losses without new human annotations.

How Humans Help LLMs: Assessing and Incentivizing Human Preference Annotators

cs.LG · 2025-02-10 · unverdicted · novelty 6.0

Develops self-consistency monitoring for preference annotators and derives sample-complexity bounds showing linear contracts achieve near-ideal performance faster than binary ones under continuous actions.

Principles Do Not Apply Themselves: A Hermeneutic Perspective on AI Alignment

cs.AI · 2026-04-12 · unverdicted · novelty 5.0

AI alignment to principles requires context-sensitive interpretive judgments, as substantial preference data involves unresolved conflicts, creating gaps between corpus-induced and deployment-induced evaluations.

FedDetox: Robust Federated SLM Alignment via On-Device Data Sanitization

cs.CR · 2026-04-08 · unverdicted · novelty 5.0

FedDetox uses on-device knowledge-distilled classifiers to sanitize toxic data in federated SLM training, preserving safety alignment comparable to centralized baselines.

LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods

cs.CL · 2024-12-07 · accept · novelty 3.0

A survey that organizes LLMs-as-judges research into functionality, methodology, applications, meta-evaluation, and limitations.

citing papers explorer

Theoretical Limits of Language Model Alignment · cs.LG · 2026-05-08 · unverdicted · none · ref 28
Incentivizing High-Quality Human Annotations with Golden Questions · cs.GT · 2025-05-25 · unverdicted · none · ref 23
Toward Stable Value Alignment: Introducing Independent Modules for Consistent Value Guidance · cs.AI · 2026-05-12 · unverdicted · none · ref 81
Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion · cs.AI · 2026-05-12 · unverdicted · none · ref 22 · 2 links
Structure from Strategic Interaction & Uncertainty: Risk Sensitive Games for Robust Preference Learning · cs.GT · 2026-05-11 · unverdicted · none · ref 69 · 2 links
GLiGuard: Schema-Conditioned Classification for LLM Safeguard · cs.CL · 2026-05-08 · unverdicted · none · ref 5
Disentangling Intent from Role: Adversarial Self-Play for Persona-Invariant Safety Alignment · cs.AI · 2026-05-03 · unverdicted · none · ref 46
SafetyALFRED: Evaluating Safety-Conscious Planning of Multimodal Large Language Models · cs.AI · 2026-04-21 · unverdicted · none · ref 22
LLM Safety From Within: Detecting Harmful Content with Internal Representations · cs.AI · 2026-04-20 · unverdicted · none · ref 8
Characterizing Model-Native Skills · cs.AI · 2026-04-19 · conditional · none · ref 99
Safety Alignment as Continual Learning: Mitigating the Alignment Tax via Orthogonal Gradient Projection · cs.LG · 2026-02-08 · conditional · none · ref 14
SCOUT: A Defense Against Data Poisoning Attacks in Fine-Tuned Language Models · cs.CR · 2025-12-10 · unverdicted · none · ref 29
The Realignment Problem: When Right becomes Wrong in LLMs · cs.CL · 2025-11-04 · unverdicted · none · ref 10
How Humans Help LLMs: Assessing and Incentivizing Human Preference Annotators · cs.LG · 2025-02-10 · unverdicted · none · ref 53
Principles Do Not Apply Themselves: A Hermeneutic Perspective on AI Alignment · cs.AI · 2026-04-12 · unverdicted · none · ref 13
FedDetox: Robust Federated SLM Alignment via On-Device Data Sanitization · cs.CR · 2026-04-08 · unverdicted · none · ref 25
LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods · cs.CL · 2024-12-07 · accept · none · ref 100
