pith. sign in

From Imitation to Introspection: Probing Self-Consciousness in Language Models

1 Pith paper cite this work. Polarity classification is still indexing.

1 Pith paper citing it

fields

cs.CL 1

years

2026 1

verdicts

UNVERDICTED 1

clear filters

representative citing papers

Can LLMs Reliably Self-Report Adversarial Prefills, and How?

cs.CL · 2026-06-22 · unverdicted · novelty 6.0 · 2 refs

No tested LLM reliably self-reports adversarial prefill attacks on its outputs; introspective signals are largely refusal-mediated, probe-dependent, and only partially improvable by targeted training.

citing papers explorer

Showing 0 of 0 citing papers after filters.

No citing papers match the current filters.