arXiv preprint arXiv:2603.17218

Alignment Makes Language Models Normative, Not Descriptive · 2026 · arXiv 2603.17218

3 Pith papers cite this work. Polarity classification is still indexing.



citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Predicting Decisions of AI Agents from Limited Interaction through Text-Tabular Modeling

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

A tabular foundation model with LLM-as-Observer features predicts AI agent decisions in controlled games, outperforming baselines by 4 AUC points and reducing error by 14% at K=16 interactions.

Overtrained, Not Misaligned

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

Emergent misalignment arises from overtraining after primary task convergence and is preventable by early stopping, which retains 93% of task performance on average.

Post-training makes large language models less human-like

cs.CL · 2026-05-08 · unverdicted · novelty 6.0

Post-training reduces LLMs' behavioral alignment with humans across families and sizes, with the misalignment increasing in newer generations while persona induction fails to improve individual-level predictions.

citing papers explorer

Showing 3 of 3 citing papers.

Predicting Decisions of AI Agents from Limited Interaction through Text-Tabular Modeling · cs.LG · 2026-05-12 · unverdicted · none · ref 64 · internal anchor
Overtrained, Not Misaligned · cs.LG · 2026-05-12 · unverdicted · none · ref 70 · internal anchor
Post-training makes large language models less human-like · cs.CL · 2026-05-08 · unverdicted · none · ref 40 · internal anchor
