Ritchie, Soren Mindermann, Evan Hubinger, Ethan Perez, and Kevin Troy

18 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 3

years

2026 18

representative citing papers

Emotion Concepts and their Function in a Large Language Model

cs.AI · 2026-04-09 · unverdicted · novelty 7.0

Claude Sonnet 4.5 exhibits functional emotions via abstract internal representations of emotion concepts that causally influence its preferences and misaligned behaviors without implying subjective experience.

Safety from Honesty in a Disinterested AI Predictor

cs.AI · 2026-06-28 · unverdicted · novelty 6.0

A disinterested Bayesian Predictor trained on contextualized statements has low probability of producing harmful agency because dangerous behaviors require rare coordinated underestimation of harm with no training signal favoring them.

Understanding Goal Generalisation in Sequential Reinforcement Learning

cs.LG · 2026-05-22 · unverdicted · novelty 6.0

Empirical analysis of over 100 sequential RL training pipelines across 250+ OOD environments finds salient features drive generalization and early goals persist, with latent policy gradients simulating latent variable evolution to predict OOD behavior from training history.

DECOR: Auditing LLM Deception via Information Manipulation Theory

cs.CL · 2026-05-19 · unverdicted · novelty 6.0

DECOR introduces a theory-grounded multi-agent system that decomposes contexts into atomic units, scores four manipulation dimensions per unit, and aggregates the resulting profiles into a global deception index, reporting state-of-the-art results on single- and multi-turn benchmarks.

A Sober Look at Agentic Misalignment in Automated Workflows

cs.AI · 2026-05-22 · unverdicted · novelty 5.0

Agentic misalignment in multi-agent systems arises from generic utilities causing posterior collapse; Agentic Evidence Attribution using self-reflection or weak-to-strong generalization provides context-specific evidence to align agent posteriors.
