arXiv preprint arXiv:2510.20487

Tim Tian Hua, Andrew Qin, Samuel Marks, Neel Nanda · 2025 · arXiv 2510.20487

7 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Steer-to-Detect: Probing Hidden Representations for Detection of LLM-Generated Texts

stat.AP · 2026-05-13 · unverdicted · novelty 6.0

Steer-to-Detect learns a steering vector injected into LLM hidden states to boost class separability and applies hypothesis testing with finite-sample Type I/II error guarantees for generated-text detection.

Evaluation Awareness in Language Models Has Limited Effect on Behaviour

cs.CL · 2026-05-07 · conditional · novelty 6.0

Verbalised evaluation awareness in large reasoning models has only small effects on their outputs across safety and alignment tests.

Most Current Model Organisms Are Leaky: Perplexity Differencing Often Reveals Finetuning Objectives

cs.CL · 2026-05-01 · unverdicted · novelty 6.0 · 2 refs

Perplexity differencing on completions from short random prefills surfaces finetuning objectives in the vast majority of tested model organisms across sizes and types.

How LLMs Detect and Correct Their Own Errors: The Role of Internal Confidence Signals

cs.LG · 2026-04-24 · unverdicted · novelty 6.0

LLMs implement a second-order confidence architecture where the PANL activation encodes both error likelihood and the ability to correct it, beyond verbal confidence or log-probabilities.

Causal Evidence that Language Models use Confidence to Drive Behavior

cs.LG · 2026-03-23 · unverdicted · novelty 6.0

Language models deploy multidimensional internal confidence representations and threshold-based policies to control abstention behavior, with causal support from activation steering experiments.

How do LLMs Compute Verbal Confidence

cs.CL · 2026-03-18 · unverdicted · novelty 6.0

Mechanistic experiments on Gemma 3 27B, Qwen 2.5 7B and Magistral Small 24B show verbal confidence is cached at post-answer positions from answer tokens and captures richer answer-quality information beyond token log-probabilities.

Internal Deployment in the AI Act

cs.CY · 2025-12-05 · unverdicted · novelty 4.0

Interpretations of Articles 2(1), 2(6), and 2(8) of the AI Act support applying the regulation to internal AI deployment while allowing for R&D exceptions, with the provisions viewed as complementary.

citing papers explorer

Evaluation Awareness in Language Models Has Limited Effect on Behaviour

cs.CL · 2026-05-07 · conditional · none · ref 13

Verbalised evaluation awareness in large reasoning models has only small effects on their outputs across safety and alignment tests.
