How Language Models Conflate Logical Validity with Plausibility: A Representational Analysis of Content Effects

Leonardo Bertolazzi; Raffaella Bernardi; Sandro Pezzelle

arxiv: 2510.06700 · v3 · submitted 2025-10-08 · 💻 cs.CL

Pith reviewed 2026-05-18 09:42 UTC · model grok-4.3

classification 💻 cs.CL

keywords: language models, logical validity, plausibility, content effects, representational geometry, steering vectors, debiasing, reasoning biases

The pith

Language models represent logical validity and plausibility as aligned linear directions in their activation space, causing them to conflate the two during reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models show content effects in which the plausibility of a statement's meaning influences whether they judge it as logically valid. The paper investigates the internal representations that produce this behavior. It finds that both validity and plausibility are encoded as linear directions and that these directions lie close to each other in the model's geometry. Because of this alignment, activating one concept shifts judgments about the other. The authors further show that the strength of the alignment across models predicts how strongly content effects appear in their outputs, and they build vectors that pull the two directions apart to reduce the bias.

Core claim

Both validity and plausibility are linearly represented and strongly aligned in representational geometry, leading models to conflate plausibility with validity. Using steering vectors, the work demonstrates that plausibility vectors can causally bias validity judgements, and vice versa, and that the degree of alignment between these two concepts predicts the magnitude of behavioral content effects across models. Debiasing vectors can be constructed that disentangle the concepts, reducing content effects and improving reasoning accuracy.
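The causal claim can be illustrated with a toy sketch. It assumes a steering intervention that simply adds a scaled direction to a hidden state, and it uses a linear validity readout as a stand-in for the model's judgment. The 2-d vectors, `steer`, and `alpha` are hypothetical names and values chosen for illustration, not the paper's setup.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def steer(hidden, direction, alpha):
    """Add alpha * direction to a hidden state (the basic steering update)."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

# Hypothetical unit-norm directions in a 2-d toy space (not taken from any model).
validity_dir = [1.0, 0.0]
plausibility_dir = [0.8, 0.6]   # cosine with validity_dir is 0.8
orthogonal_dir = [0.0, 1.0]     # cosine with validity_dir is 0.0

hidden = [0.0, 0.0]
alpha = 2.0

# Change in the validity readout after steering along each direction.
base = dot(hidden, validity_dir)
shift_aligned = dot(steer(hidden, plausibility_dir, alpha), validity_dir) - base
shift_orthogonal = dot(steer(hidden, orthogonal_dir, alpha), validity_dir) - base

print(shift_aligned, shift_orthogonal)
```

In this toy setting the readout shift equals `alpha` times the cosine between the two unit directions, so steering along an aligned plausibility direction moves the validity readout while steering along an orthogonal direction leaves it unchanged.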

What carries the argument

Linear directions in activation space for the concepts of validity and plausibility, which are extracted by probes and used as steering vectors; their geometric alignment is the mechanism that produces conflation.
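A minimal sketch of how two concept directions and their alignment might be computed follows. It assumes a difference-of-means estimator in place of a trained probe, which is a simplification and not necessarily the paper's procedure. The activation values are invented 3-d toys, and `diff_of_means` and `cosine` are illustrative helper names.

```python
import math

def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def diff_of_means(pos, neg):
    """Direction from the negative-class mean to the positive-class mean."""
    mean_pos, mean_neg = mean_vector(pos), mean_vector(neg)
    return [a - b for a, b in zip(mean_pos, mean_neg)]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(a * a for a in v))
    return dot / (norm_u * norm_v)

# Toy "activations" made up for illustration; not real model data.
valid_acts = [[1.0, 0.9, 0.0], [1.2, 1.1, 0.1]]
invalid_acts = [[0.0, 0.1, 0.0], [0.2, -0.1, 0.1]]
plausible_acts = [[0.9, 1.0, 0.5], [1.1, 1.2, 0.4]]
implausible_acts = [[0.1, 0.0, 0.0], [-0.1, 0.2, 0.1]]

validity_dir = diff_of_means(valid_acts, invalid_acts)
plausibility_dir = diff_of_means(plausible_acts, implausible_acts)

# A cosine near 1 means the two directions are closely aligned.
alignment = cosine(validity_dir, plausibility_dir)
print(round(alignment, 3))
```

A cosine near 1 corresponds to the "strongly aligned" case the paper describes; a value near 0 would indicate that the two concepts are encoded along roughly independent directions.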

Load-bearing premise

The linear directions isolated by probes and steering vectors capture the abstract concepts of validity and plausibility without substantial mixing from other correlated features.

What would settle it

A demonstration that content effects remain unchanged after applying the debiasing vectors, or that the measured alignment between directions fails to predict effect size in a new set of models and tasks.

Original abstract

Both humans and large language models (LLMs) exhibit content effects: biases in which the plausibility of the semantic content of a reasoning problem influences judgments regarding its logical validity. While this phenomenon in humans is best explained by the dual-process theory of reasoning, the mechanisms behind content effects in LLMs remain unclear. In this work, we address this issue by investigating how LLMs encode the concepts of validity and plausibility within their internal representations. We show that both concepts are linearly represented and strongly aligned in representational geometry, leading models to conflate plausibility with validity. Using steering vectors, we demonstrate that plausibility vectors can causally bias validity judgements, and vice versa, and that the degree of alignment between these two concepts predicts the magnitude of behavioral content effects across models. Finally, we construct debiasing vectors that disentangle these concepts, reducing content effects and improving reasoning accuracy. Our findings advance understanding of how abstract logical concepts are represented in LLMs and highlight representational interventions as a path toward more logical systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper shows validity and plausibility live in aligned linear directions inside LLMs, and steering plus debiasing vectors can reduce the resulting content effects.

The core finding is that LLMs encode logical validity and semantic plausibility as linearly separable directions that sit close to each other in activation space. This alignment correlates with how much models let plausibility override validity on syllogisms, and the authors show you can steer one direction to change judgments on the other. They also build debiasing vectors that pull the two apart and improve accuracy without retraining the whole model. That combination of geometry, causal intervention, and a practical fix is the main new piece here. It extends earlier linear representation work to the specific case of content effects in deduction, and the cross-model prediction of behavioral bias strength gives the geometric story some external grounding.

The experiments follow a clean pipeline: probe for each concept, measure alignment, run steering, check behavioral correlation, then debias. The results look consistent with the abstract claims.

The soft spot is the usual one with these methods. The directions recovered by the probes could still mix in surface cues like lexical overlap or topic familiarity rather than isolating the abstract notions of validity and plausibility. The debiasing step assumes the original directions are clean enough to subtract from, and the paper would be stronger with more controls or ablation on what else those vectors capture. Still, the causal steering results and the link to observed error rates give the story more weight than pure correlational geometry.

This paper is for people who work on mechanistic interpretability of reasoning in LLMs. Anyone already using linear probes or activation steering will get immediate value from the specific application and the debiasing construction. It is solid enough to deserve a serious referee; the central claims are testable and the intervention is reproducible in principle.

Referee Report

2 major / 2 minor

Summary. The paper claims that LLMs linearly represent both logical validity and semantic plausibility, with these representations strongly aligned in activation space; this alignment causes models to conflate the two concepts and produce content effects in reasoning. The authors demonstrate this via linear probes to extract directions, geometric alignment measurements, causal steering interventions showing mutual biasing, correlation between alignment strength and behavioral content effects across models, and debiasing vectors that reduce the effects while improving accuracy.

Significance. If the core results hold, the work offers a mechanistic account of content effects grounded in representational geometry and causal interventions, moving beyond purely behavioral observations. The cross-model predictive link between alignment and content-effect magnitude, together with the steering-based causal evidence and the debiasing construction, constitute clear strengths. These elements provide both explanatory insight into how abstract logical concepts are encoded and a concrete path for representational interventions that could improve logical reasoning in LLMs.

major comments (2)

[linear probing and representational geometry analysis] In the linear probing and representational geometry sections, the assumption that the extracted directions isolate the abstract notions of validity and plausibility (rather than correlated semantic or lexical features) is load-bearing for the alignment, cross-prediction, and steering results. Additional controls—such as training probes on datasets that hold lexical overlap and topic constant while varying validity/plausibility—would be required to rule out the alternative that the reported geometric alignment reflects content correlations instead of conceptual conflation.
[causal steering and debiasing sections] In the causal steering and debiasing sections, the debiasing vectors presuppose that the original validity and plausibility directions are already sufficiently pure; if residual contamination from correlated features remains, the observed reduction in content effects may not demonstrate successful disentanglement of the intended abstract concepts.

minor comments (2)

[methods] The description of probe training details (regularization, layer selection, and cross-validation procedure) could be expanded for reproducibility.
[figures] Figure captions for the steering results should explicitly state the baseline condition and report confidence intervals or statistical tests for the observed bias shifts.
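As one way to act on the request for confidence intervals, here is a hedged sketch of a percentile bootstrap over per-item bias shifts. The `shifts` values are placeholders invented for illustration, and `bootstrap_ci` is a hypothetical helper; the paper may use a different statistical procedure.

```python
import random

def bootstrap_ci(values, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    means = []
    for _ in range(n_resamples):
        # Resample with replacement and record the resample mean.
        sample = [values[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    tail = (1.0 - level) / 2.0
    lower = means[int(tail * n_resamples)]
    upper = means[int((1.0 - tail) * n_resamples) - 1]
    return lower, upper

# Placeholder bias shifts (steered minus baseline); NOT results from the paper.
shifts = [0.12, 0.08, 0.15, 0.10, 0.09, 0.14, 0.11, 0.13]
mean_shift = sum(shifts) / len(shifts)
lower, upper = bootstrap_ci(shifts)
print(mean_shift, lower, upper)
```

An interval that excludes zero would indicate that the steering shift is unlikely to be resampling noise, which is the kind of statement the figure captions could then report alongside the baseline condition.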

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and the positive assessment of the significance of our work. We respond to each major comment below and outline the revisions we will make to address the concerns raised.

Referee: [linear probing and representational geometry analysis] In the linear probing and representational geometry sections, the assumption that the extracted directions isolate the abstract notions of validity and plausibility (rather than correlated semantic or lexical features) is load-bearing for the alignment, cross-prediction, and steering results. Additional controls—such as training probes on datasets that hold lexical overlap and topic constant while varying validity/plausibility—would be required to rule out the alternative that the reported geometric alignment reflects content correlations instead of conceptual conflation.

Authors: We thank the referee for this important observation. Our current probing methodology employs datasets with varied lexical content and topics across conditions to mitigate such confounds, as described in the methods section. However, we recognize that more stringent controls would strengthen the evidence for conceptual rather than correlational alignment. In the revised manuscript, we will add experiments training linear probes on controlled datasets where lexical overlap and topic are held constant (e.g., using templated sentences with matched semantics but differing validity/plausibility). This addition will directly address the alternative explanation and support our interpretation of the geometric alignment as reflecting conflation of abstract concepts.
revision: yes

Referee: [causal steering and debiasing sections] In the causal steering and debiasing sections, the debiasing vectors presuppose that the original validity and plausibility directions are already sufficiently pure; if residual contamination from correlated features remains, the observed reduction in content effects may not demonstrate successful disentanglement of the intended abstract concepts.

Authors: We agree that the effectiveness of the debiasing vectors depends on the purity of the underlying directions. To mitigate this, our construction of debiasing vectors involves orthogonalizing the validity direction against the plausibility direction, and we demonstrate through steering experiments that this leads to reduced content effects and improved accuracy on logical reasoning tasks. We will expand the revised paper to include additional controls, such as measuring residual correlations with semantic features after debiasing and verifying that the intervention does not introduce new biases unrelated to the target concepts. These additions will provide stronger evidence that the observed improvements reflect successful disentanglement of validity and plausibility representations.
revision: partial
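The orthogonalization step mentioned in the rebuttal can be sketched as a single projection removal (one Gram-Schmidt step). This assumes the debiasing vector is the validity direction with its plausibility component subtracted; the vectors and the `orthogonalize` helper are illustrative, and the paper's actual construction may differ.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def orthogonalize(v, p):
    """Remove from v its projection onto p, leaving a vector orthogonal to p."""
    scale = dot(v, p) / dot(p, p)
    return [a - scale * b for a, b in zip(v, p)]

# Hypothetical toy directions (not extracted from any model).
validity_dir = [2.0, 1.0]
plausibility_dir = [1.0, 0.0]

debiased_dir = orthogonalize(validity_dir, plausibility_dir)

# The debiased direction has no remaining component along plausibility.
print(debiased_dir, dot(debiased_dir, plausibility_dir))
```

The design point is that steering along `debiased_dir` cannot move a linear plausibility readout, since the two are orthogonal by construction. Whether that also removes contamination from other correlated features is the open question the referee raises.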

Circularity Check

0 steps flagged

No significant circularity; empirical measurements and interventions are self-contained

The paper's central claims rest on direct experimental procedures: training linear probes on validity and plausibility labels, computing cosine alignment between the resulting directions, applying steering interventions to measure causal effects on judgments, and correlating alignment scores with observed content-effect magnitudes across models. These steps are falsifiable against held-out data and do not reduce to any fitted parameter being relabeled as a prediction or to a self-citation chain that supplies the target result. Standard linear-probe techniques are referenced but function only as tools; the alignment, steering, and debiasing outcomes supply independent empirical content.
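The cross-model step (correlating alignment with content-effect magnitude) can be sketched with a plain Pearson correlation. The numbers below are placeholders, one pair per hypothetical model, and are not results from the paper; the paper may also use a different correlation statistic.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Placeholder values for four hypothetical models; NOT data from the paper.
alignment_scores = [0.2, 0.4, 0.6, 0.8]        # cosine between the two directions
content_effects = [0.05, 0.12, 0.15, 0.30]     # behavioral content-effect size

r = pearson(alignment_scores, content_effects)
print(round(r, 3))
```

Because the alignment scores come from activations and the content effects come from model outputs, the two quantities are measured independently, which is why this correlation is not circular.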

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The analysis assumes that linear directions in activation space can be meaningfully interpreted as encoding distinct semantic concepts and that steering along those directions produces targeted causal effects without major unintended side effects on other model behaviors.

axioms (1)

domain assumption: Linear representations in LLM hidden states can isolate abstract concepts such as logical validity and semantic plausibility.
Invoked throughout the representational analysis and steering sections; central to all claims about alignment and causal influence.

pith-pipeline@v0.9.0 · 5714 in / 1399 out tokens · 36192 ms · 2026-05-18T09:42:36.555245+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean: LogicNat recovery theorem (equivNat, embed_injective)
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "both concepts are linearly represented and strongly aligned in representational geometry, leading models to conflate plausibility with validity"

IndisputableMonolith/Foundation/LogicAsFunctionalEquation.lean: SatisfiesLawsOfLogic and derivedCost uniqueness
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "we construct debiasing vectors that disentangle these concepts"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.