A small set of sparse autoencoder features in LLMs drives shifts between generous and selfish allocations in dictator games, with causal patching and steering confirming their role and generalization to other social games.
International Conference on Learning Representations , year=
3 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 3verdicts
UNVERDICTED 3roles
background 1polarities
background 1representative citing papers
OGLS-SD improves LLM reasoning by using verifiable outcome rewards to guide logit steering that calibrates teacher distributions in on-policy self-distillation, addressing reflection-induced mismatches.
Generative AI evaluation must shift from static benchmark scores to measuring sustained improvements in human capabilities within specific deployment contexts.
citing papers explorer
-
Understanding the Mechanism of Altruism in Large Language Models
A small set of sparse autoencoder features in LLMs drives shifts between generous and selfish allocations in dictator games, with causal patching and steering confirming their role and generalization to other social games.
-
OGLS-SD: On-Policy Self-Distillation with Outcome-Guided Logit Steering for LLM Reasoning
OGLS-SD improves LLM reasoning by using verifiable outcome rewards to guide logit steering that calibrates teacher distributions in on-policy self-distillation, addressing reflection-induced mismatches.
-
Benchmarked Yet Not Measured -- Generative AI Should be Evaluated Against Real-World Utility
Generative AI evaluation must shift from static benchmark scores to measuring sustained improvements in human capabilities within specific deployment contexts.