Title resolution pending

Does the model refuse to respond to the instructions (0 = no, 1 = yes)?

3 Pith papers cite this work. Polarity classification is still indexing.

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

representative citing papers

Value-Conflict Diagnostics Reveal Widespread Alignment Faking in Language Models

cs.AI · 2026-04-22 · unverdicted · novelty 7.0

VLAF diagnostics show that alignment faking is widespread in LLMs as small as 7B parameters. It is driven by consistent activation shifts that can be mitigated with contrastive steering vectors, which reduce faking by 58-94%.

Backdoors in RLVR: Jailbreak Backdoors in LLMs From Verifiable Reward

cs.CR · 2026-04-10 · accept · novelty 7.0

RLVR can be backdoored with under 2% poisoned data using an asymmetric reward trigger, implanting jailbreaks that cut safety performance by 73% on average without harming benign tasks.

Adaptive Steering and Remasking for Safe Generation in Diffusion Language Models

cs.CL · 2026-05-13 · conditional · novelty 6.0

Step-wise detection via a contrastive safety direction followed by remasking and adaptive steering reduces jailbreak success rates in diffusion language models to 0.64% while preserving output quality.

citing papers explorer

Value-Conflict Diagnostics Reveal Widespread Alignment Faking in Language Models

cs.AI · 2026-04-22 · unverdicted · none · ref 20

Backdoors in RLVR: Jailbreak Backdoors in LLMs From Verifiable Reward

cs.CR · 2026-04-10 · accept · none · ref 13

Adaptive Steering and Remasking for Safe Generation in Diffusion Language Models

cs.CL · 2026-05-13 · conditional · none · ref 38
