From Imitation to Introspection: Probing Self-Consciousness in Language Models

Chen, Sirui, Yu, Shu, Zhao, Shengjie, Lu, Chaochao · 2025 · DOI 10.18653/v1/2025.findings-acl.392

1 Pith paper cites this work. Polarity classification is still indexing.

representative citing papers

Can LLMs Reliably Self-Report Adversarial Prefills, and How?

cs.CL · 2026-06-22 · unverdicted · novelty 6.0 · 2 refs

No tested LLM reliably self-reports adversarial prefill attacks on its outputs; introspective signals are largely refusal-mediated, probe-dependent, and only partially improvable by targeted training.
