Alignment Pretraining : AI Discourse Causes Self - Fulfilling ( Mis )alignment, January 2026

Cameron Tice, Puria Radmard, Samuel Ratnam, Andy Kim, David Africa, Kyle O'Brien · 2026 · arXiv 2601.10160

2 Pith papers cite this work. Polarity classification is still indexing.

2 Pith papers citing it

read on arXiv browse 2 citing papers

citation-role summary

dataset 1

citation-polarity summary

use dataset 1

representative citing papers

Understanding Goal Generalisation in Sequential Reinforcement Learning

cs.LG · 2026-05-22 · unverdicted · novelty 6.0

Empirical analysis of over 100 sequential RL training pipelines across 250+ OOD environments finds salient features drive generalization and early goals persist, with latent policy gradients simulating latent variable evolution to predict OOD behavior from training history.

Evaluation Awareness in Language Models Has Limited Effect on Behaviour

cs.CL · 2026-05-07 · conditional · novelty 6.0

Verbalised evaluation awareness in large reasoning models has only small effects on their outputs across safety and alignment tests.

citing papers explorer

Showing 2 of 2 citing papers.

Understanding Goal Generalisation in Sequential Reinforcement Learning cs.LG · 2026-05-22 · unverdicted · none · ref 66
Empirical analysis of over 100 sequential RL training pipelines across 250+ OOD environments finds salient features drive generalization and early goals persist, with latent policy gradients simulating latent variable evolution to predict OOD behavior from training history.
Evaluation Awareness in Language Models Has Limited Effect on Behaviour cs.CL · 2026-05-07 · conditional · none · ref 28
Verbalised evaluation awareness in large reasoning models has only small effects on their outputs across safety and alignment tests.

Alignment Pretraining : AI Discourse Causes Self - Fulfilling ( Mis )alignment, January 2026

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer