
arxiv: 2510.06700 · v3 · submitted 2025-10-08 · 💻 cs.CL

How Language Models Conflate Logical Validity with Plausibility: A Representational Analysis of Content Effects

Pith reviewed 2026-05-18 09:42 UTC · model grok-4.3

keywords language models · logical validity · plausibility · content effects · representational geometry · steering vectors · debiasing · reasoning biases

The pith

Language models represent logical validity and plausibility as aligned linear directions in their activation space, causing them to conflate the two during reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models show content effects in which the plausibility of a statement's meaning influences whether they judge it as logically valid. The paper investigates the internal representations that produce this behavior. It finds that both validity and plausibility are encoded as linear directions and that these directions lie close to each other in the model's geometry. Because of this alignment, activating one concept shifts judgments about the other. The authors further show that the strength of the alignment across models predicts how strongly content effects appear in their outputs, and they build vectors that pull the two directions apart to reduce the bias.

Core claim

Both validity and plausibility are linearly represented and strongly aligned in representational geometry, leading models to conflate plausibility with validity. Using steering vectors, the work demonstrates that plausibility vectors can causally bias validity judgements, and vice versa, and that the degree of alignment between these two concepts predicts the magnitude of behavioral content effects across models. Debiasing vectors can be constructed that disentangle the concepts, reducing content effects and improving reasoning accuracy.

What carries the argument

Linear directions in activation space for the concepts of validity and plausibility, which are extracted by probes and used as steering vectors; their geometric alignment is the mechanism that produces conflation.
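As a minimal sketch of what "geometric alignment" between two linear directions means, the snippet below estimates each direction as a difference of class means and compares the two with cosine similarity. Difference-of-means is a simple stand-in chosen here for illustration; the paper extracts its directions with probes, whose details are not given in this summary. The function names and the 3-dimensional toy "activations" are hypothetical and are not data from the paper.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def diff_of_means(pos, neg):
    """Direction pointing from the mean of `neg` to the mean of `pos`."""
    mean_pos, mean_neg = mean_vector(pos), mean_vector(neg)
    return [a - b for a, b in zip(mean_pos, mean_neg)]

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy activations, invented purely for illustration (assumption).
valid = [[1.0, 0.0, 0.0], [3.0, 0.0, 0.0]]
invalid = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
plausible = [[1.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
implausible = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

validity_dir = diff_of_means(valid, invalid)
plausibility_dir = diff_of_means(plausible, implausible)

# Values near 1 mean the two directions are nearly parallel;
# values near 0 mean they are close to orthogonal.
alignment = cosine(validity_dir, plausibility_dir)
print(validity_dir, plausibility_dir, round(alignment, 4))
```

A cosine well above zero in this kind of measurement is what the summary calls alignment: moving along one direction also moves the representation along the other.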

Load-bearing premise

The linear directions isolated by probes and steering vectors capture the abstract concepts of validity and plausibility without substantial mixing from other correlated features.

What would settle it

A demonstration that content effects remain unchanged after applying the debiasing vectors, or that the measured alignment between directions fails to predict effect size in a new set of models and tasks.

Original abstract

Both humans and large language models (LLMs) exhibit content effects: biases in which the plausibility of the semantic content of a reasoning problem influences judgments regarding its logical validity. While this phenomenon in humans is best explained by the dual-process theory of reasoning, the mechanisms behind content effects in LLMs remain unclear. In this work, we address this issue by investigating how LLMs encode the concepts of validity and plausibility within their internal representations. We show that both concepts are linearly represented and strongly aligned in representational geometry, leading models to conflate plausibility with validity. Using steering vectors, we demonstrate that plausibility vectors can causally bias validity judgements, and vice versa, and that the degree of alignment between these two concepts predicts the magnitude of behavioral content effects across models. Finally, we construct debiasing vectors that disentangle these concepts, reducing content effects and improving reasoning accuracy. Our findings advance understanding of how abstract logical concepts are represented in LLMs and highlight representational interventions as a path toward more logical systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that LLMs linearly represent both logical validity and semantic plausibility, and that these representations are strongly aligned in activation space; this alignment causes models to conflate the two concepts and produce content effects in reasoning. The authors support the claim with linear probes that extract the directions, measurements of their geometric alignment, causal steering interventions showing mutual biasing, a correlation between alignment strength and behavioral content effects across models, and debiasing vectors that reduce the effects while improving accuracy.

Significance. If the core results hold, the work offers a mechanistic account of content effects grounded in representational geometry and causal interventions, moving beyond purely behavioral observations. The cross-model predictive link between alignment and content-effect magnitude, the steering-based causal evidence, and the debiasing construction are clear strengths. Together they provide both explanatory insight into how abstract logical concepts are encoded and a concrete path for representational interventions that could improve logical reasoning in LLMs.

major comments (2)
  1. [linear probing and representational geometry analysis] In the linear probing and representational geometry sections, the assumption that the extracted directions isolate the abstract notions of validity and plausibility (rather than correlated semantic or lexical features) is load-bearing for the alignment, cross-prediction, and steering results. Additional controls—such as training probes on datasets that hold lexical overlap and topic constant while varying validity/plausibility—would be required to rule out the alternative that the reported geometric alignment reflects content correlations instead of conceptual conflation.
  2. [causal steering and debiasing sections] In the causal steering and debiasing sections, the debiasing vectors presuppose that the original validity and plausibility directions are already sufficiently pure; if residual contamination from correlated features remains, the observed reduction in content effects may not demonstrate successful disentanglement of the intended abstract concepts.
minor comments (2)
  1. [methods] The description of probe training details (regularization, layer selection, and cross-validation procedure) could be expanded for reproducibility.
  2. [figures] Figure captions for the steering results should explicitly state the baseline condition and report confidence intervals or statistical tests for the observed bias shifts.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and the positive assessment of the significance of our work. We respond to each major comment below and outline the revisions we will make to address the concerns raised.

  1. Referee: [linear probing and representational geometry analysis] In the linear probing and representational geometry sections, the assumption that the extracted directions isolate the abstract notions of validity and plausibility (rather than correlated semantic or lexical features) is load-bearing for the alignment, cross-prediction, and steering results. Additional controls—such as training probes on datasets that hold lexical overlap and topic constant while varying validity/plausibility—would be required to rule out the alternative that the reported geometric alignment reflects content correlations instead of conceptual conflation.

    Authors: We thank the referee for this important observation. Our current probing methodology employs datasets with varied lexical content and topics across conditions to mitigate such confounds, as described in the methods section. However, we recognize that more stringent controls would strengthen the evidence for conceptual rather than correlational alignment. In the revised manuscript, we will add experiments training linear probes on controlled datasets where lexical overlap and topic are held constant (e.g., using templated sentences with matched semantics but differing validity/plausibility). This addition will directly address the alternative explanation and support our interpretation of the geometric alignment as reflecting conflation of abstract concepts. revision: yes

  2. Referee: [causal steering and debiasing sections] In the causal steering and debiasing sections, the debiasing vectors presuppose that the original validity and plausibility directions are already sufficiently pure; if residual contamination from correlated features remains, the observed reduction in content effects may not demonstrate successful disentanglement of the intended abstract concepts.

    Authors: We agree that the effectiveness of the debiasing vectors depends on the purity of the underlying directions. To mitigate this, our construction of debiasing vectors involves orthogonalizing the validity direction against the plausibility direction, and we demonstrate through steering experiments that this leads to reduced content effects and improved accuracy on logical reasoning tasks. We will expand the revised paper to include additional controls, such as measuring residual correlations with semantic features after debiasing and verifying that the intervention does not introduce new biases unrelated to the target concepts. These additions will provide stronger evidence that the observed improvements reflect successful disentanglement of validity and plausibility representations. revision: partial
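The orthogonalization step mentioned in the second response can be sketched as a single Gram-Schmidt projection: subtract from the validity direction its component along the plausibility direction. This is only an illustration of that one operation under assumed toy vectors; the helper names are hypothetical, and the full construction of the paper's debiasing vectors is not specified in this summary.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def orthogonalize(v, p):
    """Remove from `v` its component along `p` (one Gram-Schmidt step)."""
    scale = dot(v, p) / dot(p, p)
    return [a - scale * b for a, b in zip(v, p)]

# Toy directions, invented for illustration (assumption).
validity_dir = [2.0, 0.0, 0.0]
plausibility_dir = [1.0, 1.0, 0.0]

debiased_dir = orthogonalize(validity_dir, plausibility_dir)

# After the projection, the result has zero overlap with plausibility.
print(debiased_dir, dot(debiased_dir, plausibility_dir))
```

The referee's concern applies directly to this operation: it removes only the component along the measured plausibility direction, so any contamination of that direction by correlated features carries over into the result.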

Circularity Check

0 steps flagged

No significant circularity; empirical measurements and interventions are self-contained


The paper's central claims rest on direct experimental procedures: training linear probes on validity and plausibility labels, computing cosine alignment between the resulting directions, applying steering interventions to measure causal effects on judgments, and correlating alignment scores with observed content-effect magnitudes across models. These steps are falsifiable against held-out data and do not reduce to any fitted parameter being relabeled as a prediction or to a self-citation chain that supplies the target result. Standard linear-probe techniques are referenced but function only as tools; the alignment, steering, and debiasing outcomes supply independent empirical content.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The analysis assumes that linear directions in activation space can be meaningfully interpreted as encoding distinct semantic concepts and that steering along those directions produces targeted causal effects without major unintended side effects on other model behaviors.
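One common form of steering, shown here as a hedged sketch, adds a scaled unit direction to a hidden state. Whether the paper uses exactly this additive form, which layers it targets, and what strengths it uses are not stated in this summary; the names `steer`, `projection`, and `alpha`, and the toy vectors, are assumptions for illustration.

```python
import math

def unit(v):
    """Rescale a non-zero vector to unit length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def steer(hidden, direction, alpha):
    """Additive steering: hidden + alpha * unit(direction)."""
    d = unit(direction)
    return [h + alpha * x for h, x in zip(hidden, d)]

def projection(hidden, direction):
    """Scalar component of `hidden` along `direction`."""
    d = unit(direction)
    return sum(h * x for h, x in zip(hidden, d))

# Toy hidden state and direction, invented for illustration (assumption).
hidden = [0.5, -1.0, 2.0]
direction = [3.0, 4.0, 0.0]

steered = steer(hidden, direction, alpha=2.0)

# The component along the steering direction grows by alpha;
# components orthogonal to it are left unchanged.
shift = projection(steered, direction) - projection(hidden, direction)
print(steered, shift)
```

The assumption recorded in the ledger is that such an intervention changes only the targeted concept. In this sketch that holds by construction for orthogonal components, but any other feature correlated with the steering direction would shift as well.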

axioms (1)
  • domain assumption: Linear representations in LLM hidden states can isolate abstract concepts such as logical validity and semantic plausibility.
    Invoked throughout the representational analysis and steering sections; central to all claims about alignment and causal influence.

pith-pipeline@v0.9.0 · 5714 in / 1399 out tokens · 36192 ms · 2026-05-18T09:42:36.555245+00:00 · methodology

