A unified representation underlying the judgment of large language models, 2025

Yi-Long Lu, Jiajun Song, Wei Wang · 2025 · arXiv 2510.27328

3 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

How's it going? Reinforcement learning in language models recruits a functional welfare axis

cs.LG · 2026-05-28 · unverdicted · novelty 7.0

Reinforcement learning recruits rather than creates a functional welfare axis in language models, as reward and punishment vectors from a maze task generalize to unrelated settings and appear in pretrain-only models.

Probing Persona-Dependent Preferences in Language Models

cs.CL · 2026-05-13 · unverdicted · novelty 6.0

Linear probes on residual-stream activations identify a shared preference vector in LLMs that tracks choices across prompts and causally steers decisions even for anti-correlated personas.

How LLMs Detect and Correct Their Own Errors: The Role of Internal Confidence Signals

cs.LG · 2026-04-24 · unverdicted · novelty 6.0

LLMs implement a second-order confidence architecture where the PANL activation encodes both error likelihood and the ability to correct it, beyond verbal confidence or log-probabilities.

citing papers explorer

How's it going? Reinforcement learning in language models recruits a functional welfare axis
cs.LG · 2026-05-28 · unverdicted · none · ref 18

Probing Persona-Dependent Preferences in Language Models
cs.CL · 2026-05-13 · unverdicted · none · ref 22

How LLMs Detect and Correct Their Own Errors: The Role of Internal Confidence Signals
cs.LG · 2026-04-24 · unverdicted · none · ref 15
