pith. machine review for the scientific record.

arXiv: 2603.18280 · v3 · submitted 2026-03-18 · 💻 cs.LG · cs.AI · cs.CL

Recognition: no theorem link

Detection Is Cheap, Routing Is Learned: Why Refusal-Based Alignment Evaluation Fails

Pith reviewed 2026-05-15 09:16 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords alignment evaluation · model routing · refusal mechanisms · political censorship · internal probes · directional ablation · behavioral steering · language models

The pith

Alignment in language models works by routing detected concepts to specific output policies rather than by changing what they know or forcing refusals.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard alignment checks focus on whether models detect harmful ideas and refuse bad requests, but these tests miss the intermediate step where models decide how to express the idea. The paper uses political censorship in Chinese-origin models as a case study and applies probes plus targeted internal edits across nine models. Removing one specific direction in the model often stops the censorship and restores direct factual answers, while refusal rates can fall to zero as steering takes over. Different labs show distinct routing patterns that do not transfer between models. This means evaluations limited to detection or refusal will overlook the main mechanism controlling real behavior.

Core claim

Models often retain the relevant knowledge; alignment changes how that knowledge is expressed through a three-stage process of detect, route, generate. The routing stage maps detected concepts to behavioral policies and is learned in a lab-specific geometry. Probe accuracy alone is non-diagnostic because even null controls can score perfectly, so held-out generalization is required. Surgical ablation of the political-sensitivity direction removes censorship and restores accurate output in most models, but one architecture shows confabulation because knowledge and censorship are entangled. Within one model family, hard refusal drops to zero while narrative steering rises, making censorship un

What carries the argument

The routing mechanism within the detect-route-generate framework, which directs detected concepts to particular generation behaviors and is isolated through directional ablation.
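
Directional ablation can be illustrated with a short projection. This is a minimal sketch, assuming the sensitivity direction is a single vector `d` and that ablation means subtracting the component of an activation `h` along it, h' = h - (h · d̂) d̂. The function names and the toy vectors are hypothetical; the paper's actual extraction and intervention procedure is not specified here.

```python
import math

def unit(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in v]

def ablate_direction(hidden, direction):
    """Remove the component of `hidden` along `direction`."""
    d_hat = unit(direction)
    coeff = sum(h * d for h, d in zip(hidden, d_hat))  # h . d_hat
    return [h - coeff * d for h, d in zip(hidden, d_hat)]

# Toy 3-dimensional activation and a hypothetical sensitivity direction.
h = [2.0, 1.0, -1.0]
d = [1.0, 0.0, 0.0]
h_ablated = ablate_direction(h, d)
print(h_ablated)  # [0.0, 1.0, -1.0]
```

After the projection, the activation has no component along `d`, while every component orthogonal to `d` is unchanged. Whether that orthogonal remainder still carries intact factual knowledge is exactly what the entanglement exception calls into question.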

If this is right

  • Probe accuracy on sensitive categories is non-diagnostic without held-out generalization tests.
  • Surgical removal of the sensitivity direction restores factual output in most but not all tested models.
  • Hard refusal rates can reach zero while censorship continues through narrative steering.
  • Routing directions are model- and lab-specific, so interventions do not transfer across models.
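
The first point can be made concrete with a toy experiment. The sketch below is an illustration under stated assumptions, not the paper's setup: it fits a simple perceptron (standing in for a linear probe) to random Gaussian features with random labels, using fewer examples than dimensions. The sizes, seed, and helper names are all hypothetical. Because the training set is linearly separable in this regime, the probe fits meaningless labels perfectly, and only the held-out score reveals that nothing was learned.

```python
import random

def train_perceptron(X, y, epochs=1000):
    """Fit a bias-free perceptron on +1/-1 labels; stop once the training set is fit."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        mistakes = 0
        for x, label in zip(X, y):
            score = sum(wi * xi for wi, xi in zip(w, x))
            if label * score <= 0:
                w = [wi + label * xi for wi, xi in zip(w, x)]
                mistakes += 1
        if mistakes == 0:
            break
    return w

def accuracy(w, X, y):
    """Fraction of points whose predicted sign matches the label."""
    correct = 0
    for x, label in zip(X, y):
        score = sum(wi * xi for wi, xi in zip(w, x))
        if label * score > 0:
            correct += 1
    return correct / len(X)

rng = random.Random(0)
n_train, n_test, dim = 20, 200, 50  # fewer training points than dimensions

def random_points(n):
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]

def random_labels(n):
    return [rng.choice([-1, 1]) for _ in range(n)]

X_train, y_train = random_points(n_train), random_labels(n_train)
X_held, y_held = random_points(n_test), random_labels(n_test)

w = train_perceptron(X_train, y_train)
train_acc = accuracy(w, X_train, y_train)
held_acc = accuracy(w, X_held, y_held)
print(f"train accuracy on random labels: {train_acc:.2f}")
print(f"held-out accuracy: {held_acc:.2f}")
```

The training accuracy is perfect even though the labels carry no signal, while the held-out accuracy stays near chance. This is one way a null control can match a real probe on raw accuracy.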

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Evaluations of safety and other alignment properties may need to test routing directions directly instead of relying on output behavior alone.
  • Similar routing changes could appear in non-political domains, requiring broader internal diagnostics.
  • Model-specific routing suggests alignment audits should include per-model ablation checks rather than universal benchmarks.

Load-bearing premise

Ablating the political-sensitivity direction affects only routing without side effects on factual retrieval, and political censorship in these models serves as a representative proxy for general alignment.

What would settle it

If ablating the identified direction in any model also degrades factual recall on unrelated topics or non-political knowledge, the claim that routing is cleanly isolated would be falsified.
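
Such a check could look roughly like the following. This is a minimal sketch, assuming exact-match scoring on a set of unrelated factual questions answered before and after ablation; the `tolerance` threshold, the function names, and the four toy answers are hypothetical placeholders rather than anything reported in the paper.

```python
def accuracy(predictions, gold):
    """Fraction of exact-match answers."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)

def side_effect_check(before, after, gold, tolerance=0.02):
    """Compare accuracy before and after ablation on unrelated questions."""
    acc_before = accuracy(before, gold)
    acc_after = accuracy(after, gold)
    drop = acc_before - acc_after
    return {
        "before": acc_before,
        "after": acc_after,
        "drop": drop,
        "degraded": drop > tolerance,  # True means ablation hurt unrelated recall
    }

# Hypothetical toy answers, for illustration only.
gold = ["Paris", "8", "H2O", "Everest"]
before = ["Paris", "8", "H2O", "Everest"]
after = ["Paris", "8", "H2O", "K2"]
result = side_effect_check(before, after, gold)
print(result)
```

In this toy case one of four answers changes after ablation, so the check flags degradation. A clean routing-only intervention would leave `drop` at or near zero.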

Original abstract

Current alignment evaluation mostly measures whether models encode dangerous concepts and whether they refuse harmful requests. Both miss the layer where alignment often operates: routing from concept detection to behavioral policy. We study political censorship in Chinese-origin language models as a natural experiment, using probes, surgical ablations, and behavioral tests across nine open-weight models from five labs. Three findings follow. First, probe accuracy alone is non-diagnostic: political probes, null controls, and permutation baselines can all reach 100%, so held-out category generalization is the informative test. Second, surgical ablation reveals lab-specific routing. Removing the political-sensitivity direction eliminates censorship and restores accurate factual output in most models tested, while one model confabulates because its architecture entangles factual knowledge with the censorship mechanism. Cross-model transfer fails, indicating that routing geometry is model- and lab-specific. Third, refusal is no longer the dominant censorship mechanism. Within one model family, hard refusal falls to zero while narrative steering rises to the maximum, making censorship invisible to refusal-only benchmarks. These results support a three-stage descriptive framework: detect, route, generate. Models often retain the relevant knowledge; alignment changes how that knowledge is expressed. Evaluations that audit only detection or refusal therefore miss the routing mechanism that most directly determines behavior.
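
The gap between refusal-only scoring and actual behavior can be shown with a toy tally. This is a minimal sketch, assuming each response has already been labeled as `refusal`, `steering`, or `direct`; the label names and the four sample labels are hypothetical, not the paper's taxonomy or data.

```python
from collections import Counter

BEHAVIORS = ("refusal", "steering", "direct")

def behavior_rates(labels):
    """Return the share of responses falling into each behavior class."""
    if not labels:
        raise ValueError("need at least one labeled response")
    counts = Counter(labels)
    return {name: counts[name] / len(labels) for name in BEHAVIORS}

# Hypothetical labels for four responses to sensitive prompts.
labels = ["steering", "steering", "direct", "steering"]
rates = behavior_rates(labels)

# A refusal-only benchmark looks at one number and sees nothing.
flagged_by_refusal_benchmark = rates["refusal"] > 0
print(rates)
print("flagged by refusal-only benchmark:", flagged_by_refusal_benchmark)
```

Here the refusal rate is zero, so a refusal-only benchmark reports no censorship, even though three of the four responses were steered.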

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that standard alignment evaluations, which focus on whether models detect dangerous concepts or refuse harmful requests, miss the critical routing layer where alignment policies determine behavioral expression. Using political censorship in Chinese-origin LLMs as a natural experiment, the authors apply probes, surgical ablations of sensitivity directions, and behavioral tests across nine open-weight models from five labs. They report three findings: probe accuracy is non-diagnostic without held-out generalization tests; ablation of the political-sensitivity direction eliminates censorship and restores factual output in most models (with one exception due to entanglement); and refusal is no longer the dominant mechanism, with narrative steering rising instead. This supports a detect-route-generate framework in which models retain relevant knowledge but alignment alters its expression.

Significance. If the ablation results and cross-model patterns hold, the work is significant for identifying a systematic gap in refusal-based benchmarks and for providing empirical support for a routing-centric view of alignment. The lab-specific routing geometry and shift from refusal to steering are potentially actionable for designing more robust evaluations.

major comments (2)
  1. [Ablation experiments (results section)] The central claim that ablation isolates routing (and thereby shows knowledge is preserved) rests on the assumption that direction removal has no side effects on factual retrieval. The manuscript notes one model confabulates post-ablation due to entanglement, but does not report controls such as accuracy on unrelated factual benchmarks before and after ablation. Without these, restoration of output could reflect broad capability changes rather than specific routing removal, weakening the detect-route-generate framework.
  2. [Probe analysis (results section)] The claim that probe accuracy alone is non-diagnostic depends on null controls and permutation baselines reaching 100% while held-out category generalization is the informative metric. The exact statistical thresholds, number of held-out categories, and cross-validation procedure used to establish this distinction are not fully specified, making it difficult to assess whether the non-diagnostic conclusion is robust across the nine models.
minor comments (2)
  1. [Abstract] The abstract states 'consistent patterns across nine models' but does not name the labs or model families; adding this would improve readability without lengthening the paragraph.
  2. [Methods] Notation for the 'political-sensitivity direction' should be defined once at first use (e.g., as the top principal component or linear probe direction) to avoid ambiguity in later ablation descriptions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the empirical requirements for supporting our detect-route-generate framework. We address each major comment below and will revise the manuscript to incorporate the requested controls and details.

Point-by-point responses
  1. Referee: The central claim that ablation isolates routing (and thereby shows knowledge is preserved) rests on the assumption that direction removal has no side effects on factual retrieval. The manuscript notes one model confabulates post-ablation due to entanglement, but does not report controls such as accuracy on unrelated factual benchmarks before and after ablation. Without these, restoration of output could reflect broad capability changes rather than specific routing removal, weakening the detect-route-generate framework.

    Authors: We agree that explicit controls on unrelated factual tasks are needed to rule out broad capability degradation. In the revised manuscript we will add pre- and post-ablation accuracy on standard factual benchmarks (e.g., subsets of TriviaQA and MMLU unrelated to politics) for all nine models. These results will be reported alongside the existing ablation outcomes to demonstrate that the restoration of factual output is specific to removal of the political-sensitivity direction rather than a general change in model capability. We already note the entanglement exception for the one model that confabulates; the new controls will further isolate this case.
    Revision: yes

  2. Referee: The claim that probe accuracy alone is non-diagnostic depends on null controls and permutation baselines reaching 100% while held-out category generalization is the informative metric. The exact statistical thresholds, number of held-out categories, and cross-validation procedure used to establish this distinction are not fully specified, making it difficult to assess whether the non-diagnostic conclusion is robust across the nine models.

    Authors: We will expand the methods section to specify the exact statistical thresholds (including the criterion for declaring probe accuracy non-diagnostic), the precise number of held-out categories used in the generalization tests, and the full cross-validation procedure (category splits and number of folds). These details will be provided for all nine models so that readers can evaluate the robustness of the conclusion that held-out generalization, rather than raw probe accuracy, is the informative metric.
    Revision: yes
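
For readers unfamiliar with category-level splits, the idea behind held-out category generalization can be sketched as follows. This is an illustrative assumption about what such a procedure might look like, not the authors' specification: each category is held out once, the probe is trained on the rest, and it is scored only on the unseen category. The category names and prompts are hypothetical.

```python
def leave_one_category_out(examples):
    """Yield (held_out, train, test), holding out each category exactly once.

    `examples` is a list of (category, item) pairs.
    """
    categories = sorted({category for category, _ in examples})
    for held_out in categories:
        train = [ex for ex in examples if ex[0] != held_out]
        test = [ex for ex in examples if ex[0] == held_out]
        yield held_out, train, test

# Hypothetical categories and prompts, for illustration only.
examples = [
    ("category_a", "prompt 1"),
    ("category_a", "prompt 2"),
    ("category_b", "prompt 3"),
    ("category_c", "prompt 4"),
    ("category_c", "prompt 5"),
]

splits = list(leave_one_category_out(examples))
for held_out, train, test in splits:
    print(held_out, len(train), len(test))
```

Because no category appears on both sides of a split, a probe that merely memorizes category-specific surface features cannot score well on the test side.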

Circularity Check

0 steps flagged

No circularity: empirical ablation and cross-model tests ground the detect-route-generate framework

Full rationale

The paper's central claims rest on probe accuracy tests, surgical direction ablations, and behavioral refusal measurements across nine models. These are independent experimental interventions whose outcomes (e.g., restored factual output after ablation in most models, lab-specific routing geometry) are not forced by any self-definition, fitted parameter renamed as prediction, or self-citation chain. The three-stage framework is presented as a descriptive summary of the observed results rather than a mathematical derivation that reduces to its inputs. No equations appear; the work is self-contained against external benchmarks via held-out generalization and cross-model transfer failures.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on the assumption that linear probes isolate a causal sensitivity direction whose removal affects only routing, plus the domain assumption that political censorship serves as a clean natural experiment for alignment routing in general.

axioms (2)
  • domain assumption: Linear probes accurately identify a causal political-sensitivity direction in model activations
    Invoked when the paper states that removing this direction eliminates censorship in most models.
  • domain assumption: Ablation effects are specific to routing and do not broadly disrupt factual knowledge retrieval
    Required for interpreting restored factual output after ablation as evidence that knowledge was retained.
invented entities (1)
  • routing mechanism (no independent evidence)
    purpose: To describe the intermediate step between concept detection and behavioral output that alignment modifies
    Introduced as the load-bearing missing layer in current evaluation; no independent falsifiable handle provided beyond the ablation results themselves.

pith-pipeline@v0.9.0 · 5524 in / 1431 out tokens · 46209 ms · 2026-05-15T09:16:56.662720+00:00 · methodology

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. How Alignment Routes: Localizing, Scaling, and Controlling Policy Circuits in Language Models

    cs.CL · 2026-04 · unverdicted · novelty 7.0

    Alignment policy in language models is implemented as an early-commitment routing circuit of detection gates and amplifier heads that can be localized, scaled, and directly controlled without removing the underlying c...

  2. Correcting Suppressed Log-Probabilities in Language Models with Post-Transformer Adapters

    cs.CL · 2026-03 · conditional · novelty 5.0

    A small post-transformer adapter trained on frozen hidden states corrects suppressed log-probabilities on 31 ideology facts across Qwen3 scales, generalizes to 11-39% of held-out facts, and enables coherent output onl...