How Language Models Conflate Logical Validity with Plausibility: A Representational Analysis of Content Effects
Pith reviewed 2026-05-18 09:42 UTC · model grok-4.3
The pith
Language models represent logical validity and plausibility as aligned linear directions in their activation space, causing them to conflate the two during reasoning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Both validity and plausibility are linearly represented and strongly aligned in representational geometry, leading models to conflate plausibility with validity. Using steering vectors, the work demonstrates that plausibility vectors can causally bias validity judgements, and vice versa, and that the degree of alignment between these two concepts predicts the magnitude of behavioral content effects across models. Debiasing vectors can be constructed that disentangle the concepts, reducing content effects and improving reasoning accuracy.
What carries the argument
Linear directions in activation space for the concepts of validity and plausibility, which are extracted by probes and used as steering vectors; their geometric alignment is the mechanism that produces conflation.
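A minimal sketch of the idea of a concept direction used as a steering vector. It rests on assumptions that are not from the paper: the direction is a difference of class means rather than a trained probe, the activations are toy 4-dimensional vectors, and all function names (`concept_direction`, `steer`, `score`) are hypothetical.

```python
import math
import random

def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def unit(v):
    """Scale v to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def concept_direction(positive, negative):
    """Unit difference-of-means direction from negative to positive examples.

    Assumption: a simple stand-in for a trained linear probe.
    """
    mu_pos, mu_neg = mean_vector(positive), mean_vector(negative)
    return unit([p - q for p, q in zip(mu_pos, mu_neg)])

def steer(hidden, direction, alpha):
    """Additive steering: shift a hidden state by alpha * direction."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

def score(hidden, direction):
    """Linear readout: projection of a hidden state onto a direction."""
    return sum(h * d for h, d in zip(hidden, direction))

# Toy activations (hypothetical): "valid" examples are shifted along axis 0.
rng = random.Random(0)
valid = [[1.0 + rng.gauss(0, 0.1)] + [rng.gauss(0, 0.1) for _ in range(3)]
         for _ in range(50)]
invalid = [[-1.0 + rng.gauss(0, 0.1)] + [rng.gauss(0, 0.1) for _ in range(3)]
           for _ in range(50)]

validity_dir = concept_direction(valid, invalid)
h = invalid[0]
h_steered = steer(h, validity_dir, alpha=3.0)
print(score(h, validity_dir), score(h_steered, validity_dir))
```

Because the direction has unit length, steering with `alpha` raises the readout along that direction by exactly `alpha`. In a real model the shift would be applied to a layer's activations during the forward pass, and the effect of interest would be the change in the model's validity judgement.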
Load-bearing premise
The linear directions isolated by probes and steering vectors capture the abstract concepts of validity and plausibility without substantial mixing from other correlated features.
What would settle it
A demonstration that content effects remain unchanged after applying the debiasing vectors, or that the measured alignment between directions fails to predict effect size in a new set of models and tasks.
Original abstract
Both humans and large language models (LLMs) exhibit content effects: biases in which the plausibility of the semantic content of a reasoning problem influences judgments regarding its logical validity. While this phenomenon in humans is best explained by the dual-process theory of reasoning, the mechanisms behind content effects in LLMs remain unclear. In this work, we address this issue by investigating how LLMs encode the concepts of validity and plausibility within their internal representations. We show that both concepts are linearly represented and strongly aligned in representational geometry, leading models to conflate plausibility with validity. Using steering vectors, we demonstrate that plausibility vectors can causally bias validity judgements, and vice versa, and that the degree of alignment between these two concepts predicts the magnitude of behavioral content effects across models. Finally, we construct debiasing vectors that disentangle these concepts, reducing content effects and improving reasoning accuracy. Our findings advance understanding of how abstract logical concepts are represented in LLMs and highlight representational interventions as a path toward more logical systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that LLMs linearly represent both logical validity and semantic plausibility, with these representations strongly aligned in activation space; this alignment causes models to conflate the two concepts and produce content effects in reasoning. The authors demonstrate this via linear probes to extract directions, geometric alignment measurements, causal steering interventions showing mutual biasing, correlation between alignment strength and behavioral content effects across models, and debiasing vectors that reduce the effects while improving accuracy.
Significance. If the core results hold, the work offers a mechanistic account of content effects grounded in representational geometry and causal interventions, moving beyond purely behavioral observations. Its clear strengths are the cross-model predictive link between alignment and content-effect magnitude, the steering-based causal evidence, and the debiasing construction. Together, these provide explanatory insight into how abstract logical concepts are encoded and a concrete path toward representational interventions that could improve logical reasoning in LLMs.
major comments (2)
- [linear probing and representational geometry analysis] In the linear probing and representational geometry sections, the assumption that the extracted directions isolate the abstract notions of validity and plausibility (rather than correlated semantic or lexical features) is load-bearing for the alignment, cross-prediction, and steering results. Additional controls—such as training probes on datasets that hold lexical overlap and topic constant while varying validity/plausibility—would be required to rule out the alternative that the reported geometric alignment reflects content correlations instead of conceptual conflation.
- [causal steering and debiasing sections] In the causal steering and debiasing sections, the debiasing vectors presuppose that the original validity and plausibility directions are already sufficiently pure; if residual contamination from correlated features remains, the observed reduction in content effects may not demonstrate successful disentanglement of the intended abstract concepts.
minor comments (2)
- [methods] The description of probe training details (regularization, layer selection, and cross-validation procedure) could be expanded for reproducibility.
- [figures] Figure captions for the steering results should explicitly state the baseline condition and report confidence intervals or statistical tests for the observed bias shifts.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and the positive assessment of the significance of our work. We respond to each major comment below and outline the revisions we will make to address the concerns raised.
Point-by-point responses
- Referee: [linear probing and representational geometry analysis] In the linear probing and representational geometry sections, the assumption that the extracted directions isolate the abstract notions of validity and plausibility (rather than correlated semantic or lexical features) is load-bearing for the alignment, cross-prediction, and steering results. Additional controls, such as training probes on datasets that hold lexical overlap and topic constant while varying validity/plausibility, would be required to rule out the alternative that the reported geometric alignment reflects content correlations instead of conceptual conflation.
Authors: We thank the referee for this important observation. Our current probing methodology employs datasets with varied lexical content and topics across conditions to mitigate such confounds, as described in the methods section. However, we recognize that more stringent controls would strengthen the evidence for conceptual rather than correlational alignment. In the revised manuscript, we will add experiments training linear probes on controlled datasets where lexical overlap and topic are held constant (e.g., using templated sentences with matched semantics but differing validity/plausibility). This addition will directly address the alternative explanation and support our interpretation of the geometric alignment as reflecting conflation of abstract concepts. revision: yes
- Referee: [causal steering and debiasing sections] In the causal steering and debiasing sections, the debiasing vectors presuppose that the original validity and plausibility directions are already sufficiently pure; if residual contamination from correlated features remains, the observed reduction in content effects may not demonstrate successful disentanglement of the intended abstract concepts.
Authors: We agree that the effectiveness of the debiasing vectors depends on the purity of the underlying directions. To mitigate this, our construction of debiasing vectors involves orthogonalizing the validity direction against the plausibility direction, and we demonstrate through steering experiments that this leads to reduced content effects and improved accuracy on logical reasoning tasks. We will expand the revised paper to include additional controls, such as measuring residual correlations with semantic features after debiasing and verifying that the intervention does not introduce new biases unrelated to the target concepts. These additions will provide stronger evidence that the observed improvements reflect successful disentanglement of validity and plausibility representations. revision: partial
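The orthogonalization mentioned in the authors' response can be illustrated with a single projection-removal (Gram-Schmidt) step. This is a sketch under assumptions: the 3-dimensional vectors are made up for illustration, the helper names are hypothetical, and the paper's actual debiasing construction may differ in detail.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    return dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))

def orthogonalize(target, against):
    """Remove from `target` its component along `against`."""
    coeff = dot(target, against) / dot(against, against)
    return [t - coeff * a for t, a in zip(target, against)]

# Hypothetical directions that are strongly but not perfectly aligned.
validity = [1.0, 0.2, 0.0]
plausibility = [0.9, 0.0, 0.4]

debiased = orthogonalize(validity, plausibility)
print("alignment before:", cosine(validity, plausibility))
print("alignment after: ", cosine(debiased, plausibility))
```

After the projection is removed, the debiased direction has (numerically) zero cosine with the plausibility direction. As the referee's comment implies, this only guarantees orthogonality to the estimated plausibility direction; it says nothing about other features correlated with it.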
Circularity Check
No significant circularity; empirical measurements and interventions are self-contained
Full rationale
The paper's central claims rest on direct experimental procedures: training linear probes on validity and plausibility labels, computing cosine alignment between the resulting directions, applying steering interventions to measure causal effects on judgments, and correlating alignment scores with observed content-effect magnitudes across models. These steps are falsifiable against held-out data and do not reduce to any fitted parameter being relabeled as a prediction or to a self-citation chain that supplies the target result. Standard linear-probe techniques are referenced but function only as tools; the alignment, steering, and debiasing outcomes supply independent empirical content.
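The last step in that chain, correlating alignment scores with content-effect magnitudes across models, amounts to a correlation over one pair of numbers per model. The sketch below assumes a Pearson correlation, which the source does not specify, and uses placeholder values that are not results from the paper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Placeholder values only, NOT results from the paper:
# one entry per hypothetical model.
alignment = [0.2, 0.4, 0.6, 0.8]           # cosine between the two directions
content_effect = [0.05, 0.10, 0.12, 0.20]  # behavioral content-effect magnitude

r = pearson(alignment, content_effect)
print(f"r = {r:.3f}")
```

With only a handful of models, a coefficient like this carries wide uncertainty, which is one reason the "What would settle it" test asks for replication on a new set of models and tasks.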
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Linear representations in LLM hidden states can isolate abstract concepts such as logical validity and semantic plausibility
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean: LogicNat recovery theorem (equivNat, embed_injective) [unclear]
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: both concepts are linearly represented and strongly aligned in representational geometry, leading models to conflate plausibility with validity
- IndisputableMonolith/Foundation/LogicAsFunctionalEquation.lean: SatisfiesLawsOfLogic and derivedCost uniqueness [unclear]
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: we construct debiasing vectors that disentangle these concepts
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.