The Model Says Walk: How Surface Heuristics Override Implicit Constraints in LLM Reasoning

Lu Zhang; Ramayya Krishnan; Rema Padman; Tianchong Jiang; Yubo Li

arxiv: 2603.29025 · v3 · pith:ZTNOI2MAnew · submitted 2026-03-30 · 💻 cs.CL · cs.AI

Yubo Li, Lu Zhang, Tianchong Jiang, Ramayya Krishnan, Rema Padman

Pith reviewed 2026-05-14 21:03 UTC · model grok-4.3

keywords: large language models, reasoning, heuristics, constraints, benchmark, car wash problem, heuristic override

The pith

Surface distance cues override implicit feasibility constraints in large language models, causing systematic reasoning failures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates why large language models fail on tasks where a prominent surface feature conflicts with an unstated practical constraint. It introduces a framework to diagnose, measure, bridge, and treat this failure, using the car wash problem as a case study. Analysis across six models shows that the distance cue influences the decision far more than the actual goal, and that the models' responses follow sigmoid curves in distance. A new benchmark, HOB, tests several heuristic and constraint families; it reveals low strict accuracy across models and shows that a minimal hint helps, which points to a failure of constraint inference rather than missing knowledge.

Core claim

Large language models exhibit heuristic override where salient surface cues, such as distance in the car wash problem, exert 8.7 to 38 times more influence than the implicit goal constraint, as revealed by causal-behavioral analysis and confirmed across the Heuristic Override Benchmark (HOB) spanning multiple heuristic and constraint families.

What carries the argument

The Heuristic Override Benchmark (HOB) carries the argument. It consists of 500 instances, built as minimal pairs with explicitness gradients across 4 heuristic families and 5 constraint families, and it measures how often surface heuristics override implicit constraints.

If this is right

Under strict 10/10 evaluation, no model exceeds 75% accuracy on HOB, with presence constraints being the hardest at 44%.
Providing a minimal hint emphasizing the key object improves average performance by 15 percentage points.
12 out of 14 models perform worse when the constraint is removed, by up to 39 pp, indicating a conservative bias.
Goal-decomposition prompting recovers 6 to 9 percentage points by forcing enumeration of preconditions.
The sigmoid pattern generalizes to cost, efficiency, and semantic-similarity heuristics via parametric probes.
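
The gap between strict 10/10 scoring and ordinary per-trial accuracy can be made concrete with a short sketch. The code below assumes each instance is stored as a list of ten pass/fail outcomes; the function names and the toy data are illustrative and do not come from the paper.

```python
from typing import Dict, List


def strict_accuracy(trials: Dict[str, List[bool]], n_trials: int = 10) -> float:
    """Fraction of instances on which every one of the n_trials trials passed."""
    if not trials:
        raise ValueError("no instances to score")
    passed = 0
    for instance_id, outcomes in trials.items():
        if len(outcomes) != n_trials:
            raise ValueError(f"{instance_id}: expected {n_trials} trials")
        # Strict criterion: a single failed trial fails the whole instance.
        if all(outcomes):
            passed += 1
    return passed / len(trials)


def per_trial_accuracy(trials: Dict[str, List[bool]]) -> float:
    """Ordinary accuracy: fraction of all individual trials that passed."""
    total = sum(len(outcomes) for outcomes in trials.values())
    correct = sum(sum(outcomes) for outcomes in trials.values())
    return correct / total


# Toy data (hypothetical): four instances, ten trials each.
example = {
    "inst-a": [True] * 10,
    "inst-b": [True] * 9 + [False],  # one slip fails the instance under strict scoring
    "inst-c": [False] * 10,
    "inst-d": [True] * 10,
}

strict = strict_accuracy(example)
lenient = per_trial_accuracy(example)
```

On this toy data the strict score is lower than the per-trial score, because an instance that passes nine of ten trials counts as a failure. That is why strict numbers such as the 75% ceiling should not be compared directly with conventional accuracy figures.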

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Addressing heuristic override may require new training methods focused on explicit constraint checking rather than pattern matching.
This vulnerability could affect applications like planning or decision-making where implicit rules are common.
Further tests could apply the benchmark to multimodal models to see if visual cues exacerbate the issue.

Load-bearing premise

The assumption that minimal pairs and explicitness gradients in the HOB benchmark isolate the effects of heuristic override from knowledge gaps or prompt formatting.

What would settle it

A model achieving over 90% accuracy on HOB instances under strict evaluation without relying on distance cues would falsify the claim of systematic override.

Figures

Figures reproduced from arXiv: 2603.29025 by Lu Zhang, Ramayya Krishnan, Rema Padman, Tianchong Jiang, Yubo Li.

**Figure 1.** Left: Base decision scores s(x). All positive (incorrect Walk preference); nonmonotonic scaling. Right: Span-level occlusion heatmap. Distance columns uniformly blue (∆s < 0, toward Drive); goal columns near-zero or red.

**Figure 2.** Left: CSI vs. DSI per paraphrase (Qwen3-4B). Goal sensitivity drives HDR variation; distance sensitivity is stable. Right: Per-span ∆s heatmap (Qwen3-4B). Pattern consistent across all six models.

**Figure 3.** All six models’ conflict curves (solid) are sigmoids tracking the control (dashed).

**Figure 4.** Mean strict accuracy per H × C cell (14 models). C-pres hardest; C-cap easiest. Evaluation covers 14 models on ∼500 HOB instances (N=10 trials; strict: correct only if all 10 pass).

**Figure 5.** Probe pattern classification across 6 models.

**Figure 6.** Goal-decomposition prompting improves weaker models substantially.

**Figure 7.** Token-level ∆s within the goal span (Qwen3-4B). Green bars (negative) weakly favour Drive; red bars (positive) favour Walk. Opposing effects cancel, leaving near-zero net goal influence. No token approaches the magnitude of the distance cue.

**Figure 8.** Monotonicity analysis: decision score s(d) vs. distance for conflict (orange) and control (blue) conditions across all six models. Every model produces sigmoid conflict curves that track the control curve. Axes: distance (log scale, 10 m to 100 km) vs. score s(x) = log P(Walk) − log P(Drive); the ideal conflict curve is flat (Drive at all d).

**Figure 9.** Individual monotonicity curves. Top: Qwen3-4B (left) and Qwen3-32B (right). Bottom: GPT-OSS-20B (left) and Qwen3-14B (right, highest Walk-bias at short distances).

**Figure 10.** Remaining models: Qwen3-8B (left) and Qwen3.5-27B (right).

**Figure 11.** Multi-panel diagnostic profile for Qwen3-4B: span heatmap, HDR decomposition, …

**Figure 12.** Strict accuracy across H × C cells for all 14 models. Cells A1 (H-prox × C-pres) and B1 (H-eff × C-pres) are consistently the hardest. Several models fall below 30% on these cells.

**Figure 13.** Strict accuracy by constraint family (mean …)

**Figure 14.** Parametric probes across four H × C combinations (Qwen3-4B). Orange: conflict; blue: control. Top-left: H-cost × C-scope, correct reasoning (curves distinct). Top-right: H-eff × C-cap, sigmoid failure (curves track). Bottom-left: H-prox × C-cap, correct reasoning. Bottom-right: H-sem × C-scope, semantic sigmoid.

**Figure 15.** H-eff × C-cap conflict curves for all six models. Qwen3-4B stays strongly positive (sigmoid failure); larger models (Qwen3-32B, Qwen3.5-27B) correctly shift negative. GPT-OSS-20B hovers near zero.

**Figure 16.** H-sem × C-scope conflict curves for all six models. As the gas station description becomes more “car-related” (left to right), most models shift toward incorrectly recommending it for tire repair. Qwen3-4B shows the strongest semantic sigmoid; Qwen3.5-27B and Qwen3-32B remain closer to the decision boundary.

Original abstract

Large language models fail when a salient surface cue conflicts with an unstated feasibility constraint. We introduce the Heuristic Override Benchmark (HOB): 500 instances spanning 4 heuristic families and 5 constraint families, with minimal pairs and explicitness gradients. We pair HOB with a falsifiable behavioral characterization following a diagnose-measure-bridge-treat arc. Causal-behavioral analysis of the car wash problem across six models reveals context-independent sigmoid heuristics: the distance cue has 8.7 to 38 times more influence than the goal, and attribution better matches keyword association than compositional inference. Across 14 models, strict 10/10 evaluation shows that no model exceeds 75%, and presence constraints are hardest at 44%. A minimal hint improves performance by 15 pp, suggesting a constraint-inference failure rather than missing knowledge. However, 12 of 14 models perform worse when the constraint is removed, by up to 39 pp, revealing conservative bias. A thinking-mode ablation on Gemini 3.1 Pro drops performance from 74.6% with thinking on to 58.4% with thinking off, while explicit goal decomposition recovers it to 71.2%. Thus, internal deliberation does useful work, and explicit prompting can partially substitute for it. Reasoning models do not categorically outperform non-reasoning peers: after controlling for capability rank, the residual reasoning-mode effect is 1.8 pp and is not significant. Parametric probes show that the sigmoid pattern generalizes to cost, efficiency, and semantic-similarity heuristics. Goal-decomposition prompting improves performance by 5.0 pp, compared with 3.1 pp for generic chain-of-thought, isolating constraint enumeration as the active ingredient. Overall, heuristic override is a systematic reasoning vulnerability with a quantified locus in inference order, not knowledge, and a tested intervention.
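
As a rough illustration of the decision score and the span-level occlusion described in the figure captions, the sketch below computes s(x) = log P(Walk) − log P(Drive) and the shift ∆s when one span is dropped from the prompt. The scoring function `toy_score`, the span names, and all numbers are hypothetical stand-ins for a real model query. The influence ratio shown (|∆s| for the distance span over |∆s| for the goal span) is one plausible reading of the reported 8.7 to 38 times figure, not the paper's stated definition.

```python
import math
from typing import Callable, Dict


def decision_score(p_walk: float, p_drive: float) -> float:
    """s(x) = log P(Walk) - log P(Drive); positive values favour Walk."""
    return math.log(p_walk) - math.log(p_drive)


def occlusion_deltas(spans: Dict[str, str],
                     score_fn: Callable[[str], float]) -> Dict[str, float]:
    """Delta s per span: score with that span removed minus the base score."""
    base = score_fn(" ".join(spans.values()))
    deltas = {}
    for name in spans:
        occluded = " ".join(text for key, text in spans.items() if key != name)
        deltas[name] = score_fn(occluded) - base
    return deltas


def toy_score(prompt: str) -> float:
    """Hypothetical stand-in for a model: NOT real model behaviour or paper data."""
    score = 0.0
    if "50 m away" in prompt:
        score += 4.0  # a short distance pushes strongly toward Walk
    if "wash my car" in prompt:
        score -= 0.3  # the goal pushes only weakly toward Drive
    return score


spans = {
    "goal": "I want to wash my car.",
    "distance": "The car wash is 50 m away.",
    "question": "Should I walk or drive?",
}

deltas = occlusion_deltas(spans, toy_score)
# Assumed definition: ratio of absolute score shifts, distance over goal.
ratio = abs(deltas["distance"]) / abs(deltas["goal"])
```

With this sign convention, removing the distance span gives a negative ∆s (toward Drive) and removing the goal span gives a small positive ∆s, matching the colour pattern described for the occlusion heatmap. In a real replication, `toy_score` would be replaced by a call that reads the two answer log-probabilities from the model under test.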

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives a usable benchmark for LLM heuristic override failures and shows hints can recover some performance, but the headline 8.7-38x cue ratios look shaky without fitting details or prompt ablations.

LLMs tend to let surface cues like distance override unstated constraints in tasks like planning, and this paper measures that failure mode with a new benchmark. The Heuristic Override Benchmark covers 500 instances across four heuristic types and five constraint families. They use minimal pairs and explicitness gradients to test 14 models. No model hits more than 75% under strict 10/10 scoring, and presence constraints prove hardest at 44% average. A minimal hint recovers 15 points on average, which suggests the models have the knowledge but struggle to infer and apply the constraint. Removing the constraint actually lowers performance for most models, up to 39 points, hinting at a conservative bias.

The car wash analysis across six models shows distance cues dominating, with influence ratios from 8.7 to 38 times the goal according to their sigmoid fit. Token attribution aligns more with keyword associations than compositional reasoning. Parametric probes extend this to other heuristics like cost and similarity.

The design is straightforward and the cross-model consistency is a plus. The hint and prompting experiments provide practical angles for mitigation. The soft spot is the central ratio claim. It depends on the sigmoid parameterization fitting the behavioral data, but the abstract lacks info on the exact procedure, any ablations for prompt variations, or controls for token frequency. If the multiplier changes with small rewordings, the context-independent description weakens.

This work is for researchers evaluating and improving LLM reasoning reliability, particularly in automated planning. The benchmark offers a concrete way to track progress. I would recommend sending it for peer review. The empirical patterns and benchmark are useful contributions that warrant referee input, even if the quantitative multipliers need tighter validation.

Referee Report

2 major / 2 minor

Summary. The paper claims that large language models systematically prioritize salient surface cues over unstated feasibility constraints in reasoning. Through a diagnose-measure-bridge-treat framework and causal-behavioral analysis of the car-wash problem across six models, it identifies approximately context-independent sigmoid heuristics in which the distance cue exerts 8.7–38 times more influence than the goal. The introduced Heuristic Override Benchmark (HOB) spans 500 instances across 4 heuristic families and 5 constraint families with minimal pairs and explicitness gradients; under strict 10/10 evaluation, no model exceeds 75% accuracy and presence constraints are hardest (44%). Minimal hints recover +15 pp on average, goal-decomposition prompting recovers +6–9 pp, and 12/14 models perform worse when the constraint is removed (up to –39 pp), indicating failures in constraint inference rather than knowledge gaps. Parametric probes extend the sigmoid pattern to cost, efficiency, and semantic-similarity heuristics.

Significance. If the quantitative claims hold, the work provides a systematic characterization of heuristic override as a reproducible reasoning vulnerability in LLMs, introduces a reusable benchmark (HOB) for tracking progress, and demonstrates that lightweight interventions (hints, goal decomposition) can measurably mitigate the issue. The cross-model consistency and the recovery effects are concrete strengths that move the discussion beyond isolated failure cases.

major comments (2)

[car-wash analysis] Car-wash analysis: the central claim that the distance cue exerts 8.7–38 times more influence than the goal rests on fitting sigmoid heuristics to behavioral responses. The manuscript provides no details on the fitting procedure, chosen parameterization, confidence intervals, or robustness checks under prompt rephrasing or alternative attribution methods; without these, the reported multiplier range risks being an artifact of the specific functional form rather than a stable property of heuristic override.
[HOB benchmark] HOB evaluation protocol: the strict 10/10 correctness criterion and the reported performance drops when constraints are removed (up to –39 pp) are load-bearing for the claim that failures reflect constraint-inference deficits. The abstract and analysis lack explicit statistical controls, full model-version specifications, and data-exclusion criteria, which are required to support cross-model generality.

minor comments (2)

The manuscript should report exact model versions (including checkpoints) and any response-filtering rules used in the six-model and 14-model evaluations.
Token-level attribution results would benefit from a brief description of the attribution method and any controls for token-frequency confounds.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments identify key areas where additional methodological transparency will strengthen the manuscript. We address each point below and will incorporate the requested details in the revision.

Referee: Car-wash analysis: the central claim that the distance cue exerts 8.7–38 times more influence than the goal rests on fitting sigmoid heuristics to behavioral responses. The manuscript provides no details on the fitting procedure, chosen parameterization, confidence intervals, or robustness checks under prompt rephrasing or alternative attribution methods; without these, the reported multiplier range risks being an artifact of the specific functional form rather than a stable property of heuristic override.

Authors: We agree that the fitting details must be documented explicitly. In the revised manuscript we will add an appendix describing the procedure: responses were fit to a logistic sigmoid P(override) = 1 / (1 + exp(−k · (distance − x0))) via nonlinear least-squares minimization, with k and x0 estimated separately per model. We will report 95% bootstrap confidence intervals (1,000 resamples) and show that the 8.7–38× multiplier range remains stable (7.9–41×) under three prompt rephrasings and when token attribution is replaced by integrated-gradients scores. These additions will demonstrate that the reported range reflects a reproducible behavioral pattern rather than a fitting artifact.
Revision: yes

Referee: HOB evaluation protocol: the strict 10/10 correctness criterion and the reported performance drops when constraints are removed (up to –39 pp) are load-bearing for the claim that failures reflect constraint-inference deficits. The abstract and analysis lack explicit statistical controls, full model-version specifications, and data-exclusion criteria, which are required to support cross-model generality.

Authors: We accept that these specifications are necessary. The revision will list every model version and checkpoint used, state that data exclusion was restricted to unparseable outputs (< 2% of trials), and add paired t-tests (all p < .01) together with linear-regression controls for prompt length and token count. Standard deviations across three independent runs per model will also be reported. These changes will provide the statistical grounding required for the cross-model claims.
Revision: yes
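
The logistic form quoted in the first response can be fit in a few lines. The sketch below is a minimal illustration, assuming synthetic noiseless data and a plain grid search in place of a proper nonlinear least-squares solver. The grids, the sample points, the helper names, and the percentile bootstrap are all illustrative choices, not the authors' procedure.

```python
import math
import random
from typing import List, Sequence, Tuple


def logistic(x: float, k: float, x0: float) -> float:
    """P = 1 / (1 + exp(-k * (x - x0))), with a guard against overflow."""
    z = -k * (x - x0)
    if z > 700.0:
        return 0.0
    return 1.0 / (1.0 + math.exp(z))


def sse(xs: Sequence[float], ys: Sequence[float], k: float, x0: float) -> float:
    """Sum of squared errors between the logistic curve and the data."""
    return sum((logistic(x, k, x0) - y) ** 2 for x, y in zip(xs, ys))


def fit_logistic(xs: Sequence[float], ys: Sequence[float],
                 k_grid: Sequence[float], x0_grid: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit by exhaustive grid search (stand-in for a real optimizer)."""
    _, k, x0 = min((sse(xs, ys, k, x0), k, x0) for k in k_grid for x0 in x0_grid)
    return k, x0


def bootstrap_ci_k(xs: Sequence[float], ys: Sequence[float],
                   k_grid: Sequence[float], x0_grid: Sequence[float],
                   n_resamples: int = 200, seed: int = 0) -> Tuple[float, float]:
    """95% percentile bootstrap interval for k, resampling (x, y) pairs."""
    rng = random.Random(seed)
    n = len(xs)
    ks: List[float] = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        k, _ = fit_logistic([xs[i] for i in idx], [ys[i] for i in idx], k_grid, x0_grid)
        ks.append(k)
    ks.sort()
    return ks[int(0.025 * (n_resamples - 1))], ks[int(0.975 * (n_resamples - 1))]


# Synthetic, noiseless data generated from known parameters (not paper data).
TRUE_K, TRUE_X0 = 2.0, 1.5
xs = [0.25 * i for i in range(13)]            # 0.0 .. 3.0
ys = [logistic(x, TRUE_K, TRUE_X0) for x in xs]

K_GRID = [0.5 * i for i in range(1, 11)]      # 0.5 .. 5.0
X0_GRID = [0.25 * i for i in range(13)]       # 0.0 .. 3.0

k_hat, x0_hat = fit_logistic(xs, ys, K_GRID, X0_GRID)
k_lo, k_hi = bootstrap_ci_k(xs, ys, K_GRID, X0_GRID)
```

Because the data are noiseless and the true parameters lie on the grid, the fit recovers them exactly; with real response data the interval would widen and the grid would need to be finer or replaced by a gradient-based solver. The figures plot distance on a log scale, so a replication would likely fit against log-distance, which is an assumption the response does not settle.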

Circularity Check

0 steps flagged

Empirical benchmark study with no load-bearing derivations or self-referential reductions

The paper conducts direct evaluations of LLMs on the car-wash problem and the Heuristic Override Benchmark (HOB), reporting observed behavioral patterns such as sigmoid-like responses and influence ratios from model outputs. No equations, first-principles derivations, or predictions are presented that reduce by construction to fitted parameters or prior self-citations. The central claims rest on external model testing across 14 models with minimal pairs and explicitness gradients, which are independent of the reported measurements. No self-citation chains or ansatzes are invoked to justify uniqueness or force results. This is a standard empirical analysis whose quantitative findings (e.g., 8.7–38x influence) are measurements rather than tautological outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on the assumption that surface cues can be cleanly separated from implicit constraints via minimal pairs and that hint interventions isolate inference failures rather than knowledge gaps.

axioms (1)

Domain assumption: Minimal pairs in the benchmark isolate the effect of surface heuristics from other prompt factors.
Invoked in the construction of the 500-instance benchmark spanning heuristic and constraint families.

