hub Mixed citations

arXiv:2406.14598 (2025)

Mixed citation behavior. Most common role is background (60%).

28 Pith papers citing it
Background 60% of classified citations

citation-role summary

background 4 · method 1

representative citing papers

Agentic Abstention: Do Agents Know When to Stop Instead of Act?

cs.AI · 2026-06-27 · unverdicted · novelty 7.0

LLM agents often fail to abstain at the right time in uncertain multi-turn tasks, and the CONVOLVE context engineering method raises timely abstention rates on WebShop from 26.7 to 57.4 without parameter updates.

What Do Safety-Aligned LLMs Learn From Mixed Compliance Demonstrations?

cs.AI · 2026-06-18 · unverdicted · novelty 6.0

Safety-aligned LLMs treat benign and harmful compliance demonstrations differently in in-context learning: preference optimization prevents benign examples from increasing harmful compliance, and demonstration ordering shows a strong recency bias.

Efficient Safety Benchmarking via Item Response Theory

cs.CY · 2026-05-26 · unverdicted · novelty 6.0

Item Response Theory enables adaptive and fixed-subset item selection that reduces safety benchmark costs by 80-99.9% while preserving high correlation with full rankings.

Few-Shot Truly Benign DPO Attack for Jailbreaking LLMs

cs.CR · 2026-05-09 · unverdicted · novelty 6.0

A truly benign DPO attack using 10 harmless preference pairs jailbreaks frontier LLMs by suppressing refusal behavior, achieving up to 81.73% attack success rate on GPT-4.1-nano at low cost.

Beyond I'm Sorry, I Can't: Dissecting Large Language Model Refusal

cs.CL · 2025-09-07 · unverdicted · novelty 6.0

Sparse autoencoders plus greedy filtering and factorization-machine interaction modeling identify minimal sets of features in Gemma-2-2B-IT and LLaMA-3.1-8B-IT whose ablation produces jailbreaks by flipping refusal to compliance.
