
How Context Shapes Truth: Geometric Transformations of Statement-level Truth Representations in LLMs

1 Pith paper cites this work. Polarity classification is still indexing.

abstract

Large Language Models (LLMs) often encode whether a statement is true as a vector in their residual stream activations. These vectors, known as truth vectors, have been studied in prior work, but how they change when context is introduced remains unexplored. We study this question by measuring (1) the directional change ($\theta$) between the truth vectors with and without context and (2) the magnitude of the truth vector with context relative to its magnitude without context. Across four LLMs and four datasets, we find that (1) the truth vectors with and without context are roughly orthogonal in early layers, converge in middle layers, and may stabilize or continue increasing in later layers; (2) adding context generally increases the truth vector magnitude, i.e., it amplifies the separation between true and false representations in the activation space; (3) larger models distinguish relevant from irrelevant context mainly through directional change ($\theta$), while smaller models show this distinction through magnitude differences. We also find that context conflicting with parametric knowledge produces larger geometric changes than context aligned with parametric knowledge. To the best of our knowledge, this is the first work to provide a geometric characterization of how context transforms the truth vector in the activation space of LLMs.
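
The two quantities named in the abstract, the angle $\theta$ between truth vectors and their relative magnitude, can be illustrated with a minimal sketch. The abstract does not say how the truth vector is extracted, so this example assumes a difference-of-means estimate (mean activation of true statements minus mean activation of false statements). The function names and toy activations are hypothetical and are not taken from the paper.

```python
import math

def mean_vector(rows):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def truth_vector(true_acts, false_acts):
    """ASSUMPTION: truth vector = mean(true activations) - mean(false activations).
    The paper may use a different estimator (e.g. a trained probe)."""
    mean_true = mean_vector(true_acts)
    mean_false = mean_vector(false_acts)
    return [t - f for t, f in zip(mean_true, mean_false)]

def norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in v))

def angle_degrees(u, v):
    """Angle theta between two vectors, in degrees."""
    cos = sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))
    cos = max(-1.0, min(1.0, cos))  # guard against floating-point drift
    return math.degrees(math.acos(cos))

def relative_magnitude(v_with_context, v_without_context):
    """Norm of the with-context vector divided by the no-context norm."""
    return norm(v_with_context) / norm(v_without_context)

# Toy 2-D activations at a single layer (hypothetical values).
no_ctx = truth_vector([[1.0, 0.0], [3.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]])
with_ctx = truth_vector([[0.0, 4.0], [0.0, 4.0]], [[0.0, 0.0], [0.0, 0.0]])

theta = angle_degrees(no_ctx, with_ctx)        # orthogonal toy vectors
ratio = relative_magnitude(with_ctx, no_ctx)   # values above 1 mean context amplified the separation
print(theta, ratio)
```

In a real analysis, these two numbers would be computed once per layer from residual stream activations, which gives the layer-wise profile the abstract describes.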

fields

cs.CL 1

years

2026 1

verdicts

UNVERDICTED 1


representative citing papers

ToxiREX: A Dataset on Toxic REasoning in ConteXt

cs.CL · 2026-06-26 · unverdicted · novelty 6.0

ToxiREX is a new dataset of 128k Reddit comments in six languages with hierarchical annotations for implicit toxicity in conversational context based on an existing reasoning schema.

citing papers explorer

Showing 1 of 1 citing paper after filters.

  • ToxiREX: A Dataset on Toxic REasoning in ConteXt · cs.CL · 2026-06-26 · unverdicted · none · ref 42 · internal anchor
