Consistency training helps stop sycophancy and jailbreaks

Alex Irpan, Alexander Matt Turner, Mark Kurzeja, David K Elson, Rohin Shah · 2025 · arXiv 2510.27062

3 Pith papers cite this work. Polarity classification is still indexing.


representative citing papers

What are the Right Symmetries for Formal Theorem Proving?

cs.LG · 2026-05-21 · unverdicted · novelty 7.0

Introduces rewriting categories to formalize proof equivariance and success invariance, shows LLM provers violate both, and demonstrates test-time aggregation recovers invariance and boosts performance.

Positive Alignment: Artificial Intelligence for Human Flourishing

cs.AI · 2026-05-11 · unverdicted · novelty 6.0

Positive Alignment introduces AI systems that support human flourishing pluralistically and proactively while remaining safe, as a necessary complement to traditional safety-focused alignment research.

Self-Mined Hardness for Safety Fine-Tuning

cs.LG · 2026-05-04

citing papers explorer


What are the Right Symmetries for Formal Theorem Proving? · cs.LG · 2026-05-21 · unverdicted · none · ref 5
Positive Alignment: Artificial Intelligence for Human Flourishing · cs.AI · 2026-05-11 · unverdicted · none · ref 90
Self-Mined Hardness for Safety Fine-Tuning · cs.LG · 2026-05-04 · unreviewed · ref 11
