pith. sign in

arXiv preprint arXiv:2603.17218 , year=

3 Pith papers cite this work. Polarity classification is still indexing.

3 Pith papers citing it

citation-role summary

background 1

citation-polarity summary

fields

cs.LG 2 cs.CL 1

years

2026 3

verdicts

UNVERDICTED 3

roles

background 1

polarities

background 1

representative citing papers

Overtrained, Not Misaligned

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

Emergent misalignment arises from overtraining after primary task convergence and is preventable by early stopping, which retains 93% of task performance on average.

Post-training makes large language models less human-like

cs.CL · 2026-05-08 · unverdicted · novelty 6.0

Post-training reduces LLMs' behavioral alignment with humans across families and sizes, with the misalignment increasing in newer generations while persona induction fails to improve individual-level predictions.

citing papers explorer

Showing 3 of 3 citing papers.

  • Predicting Decisions of AI Agents from Limited Interaction through Text-Tabular Modeling cs.LG · 2026-05-12 · unverdicted · none · ref 64 · internal anchor

    A tabular foundation model with LLM-as-Observer features predicts AI agent decisions in controlled games, outperforming baselines by 4 AUC points and 14% lower error at K=16 interactions.

  • Overtrained, Not Misaligned cs.LG · 2026-05-12 · unverdicted · none · ref 70 · internal anchor

    Emergent misalignment arises from overtraining after primary task convergence and is preventable by early stopping, which retains 93% of task performance on average.

  • Post-training makes large language models less human-like cs.CL · 2026-05-08 · unverdicted · none · ref 40 · internal anchor

    Post-training reduces LLMs' behavioral alignment with humans across families and sizes, with the misalignment increasing in newer generations while persona induction fails to improve individual-level predictions.