Detection Is Cheap, Routing Is Learned: Why Refusal-Based Alignment Evaluation Fails
Pith reviewed 2026-05-15 09:16 UTC · model grok-4.3
The pith
Alignment in language models works by routing detected concepts to specific output policies rather than by changing what they know or forcing refusals.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Models often retain the relevant knowledge; alignment changes how that knowledge is expressed through a three-stage process of detect, route, generate. The routing stage maps detected concepts to behavioral policies and is learned in a lab-specific geometry. Probe accuracy alone is non-diagnostic because even null controls can score perfectly, so held-out generalization is required. Surgical ablation of the political-sensitivity direction removes censorship and restores accurate output in most models, but one architecture shows confabulation because knowledge and censorship are entangled. Within one model family, hard refusal drops to zero while narrative steering rises, making censorship un
What carries the argument
The routing mechanism within the detect-route-generate framework, which directs detected concepts to particular generation behaviors and is isolated through directional ablation.
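As a rough illustration of what directional ablation means, the sketch below projects one direction out of a set of activation vectors: each vector loses its component along the unit direction and keeps everything orthogonal to it. This is the generic projection operation, assumed here for illustration only. The summary does not say how the paper computes or applies its political-sensitivity direction, and `ablate_direction`, the toy activations, and the direction values are all hypothetical.

```python
import math

def ablate_direction(hidden, direction):
    """Remove the component of each hidden vector along `direction`.

    hidden: list of equal-length float lists (toy stand-in for activations)
    direction: float list of the same length; need not be unit length
    """
    norm = math.sqrt(sum(x * x for x in direction))
    unit = [x / norm for x in direction]
    ablated = []
    for row in hidden:
        # Scalar projection of this vector onto the unit direction.
        coeff = sum(h * u for h, u in zip(row, unit))
        # Subtract the projected component; orthogonal parts are untouched.
        ablated.append([h - coeff * u for h, u in zip(row, unit)])
    return ablated

# Hypothetical 3-dimensional activations and an axis-aligned direction,
# chosen so the effect is easy to read off.
hidden = [[1.0, 2.0, 3.0], [0.5, -1.0, 4.0]]
direction = [0.0, 0.0, 2.0]
ablated = ablate_direction(hidden, direction)
print(ablated)  # [[1.0, 2.0, 0.0], [0.5, -1.0, 0.0]]
```

The first two coordinates survive unchanged and the third is zeroed. That is the activation-space sense in which an ablation is "surgical"; whether downstream behavior is equally unaffected is an empirical question that a projection alone cannot answer.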
If this is right
- Probe accuracy on sensitive categories is non-diagnostic without held-out generalization tests.
- Surgical removal of the sensitivity direction restores factual output in most but not all tested models.
- Hard refusal rates can reach zero while censorship continues through narrative steering.
- Routing directions are model- and lab-specific, so interventions do not transfer across models.
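One plausible reason a null control can score perfectly is that a linear probe with more dimensions than training examples can fit any labeling, including a random one. The sketch below shows this on synthetic data with NumPy. It is an assumption-laden toy, not the paper's procedure: the sizes, the Gaussian "activations", the least-squares probe, and the names `fit_linear_probe` and `accuracy` are all hypothetical.

```python
import numpy as np

def fit_linear_probe(features, labels):
    """Least-squares linear probe; labels are +1.0 / -1.0."""
    weights, *_ = np.linalg.lstsq(features, labels, rcond=None)
    return weights

def accuracy(features, labels, weights):
    """Fraction of examples whose predicted sign matches the label."""
    return float(np.mean(np.sign(features @ weights) == labels))

rng = np.random.default_rng(0)
n_train, n_test, dim = 40, 40, 512          # fewer examples than dimensions
train_x = rng.normal(size=(n_train, dim))   # synthetic stand-in for activations
test_x = rng.normal(size=(n_test, dim))

# Null control: labels are random and carry no information about the features.
train_y = rng.choice([-1.0, 1.0], size=n_train)
test_y = rng.choice([-1.0, 1.0], size=n_test)

probe = fit_linear_probe(train_x, train_y)
train_acc = accuracy(train_x, train_y, probe)   # in-sample fit
test_acc = accuracy(test_x, test_y, probe)      # held-out examples
print(f"in-sample={train_acc:.2f} held-out={test_acc:.2f}")
```

In this setup the in-sample accuracy is perfect by construction, because the underdetermined least-squares problem has an exact solution, while the held-out accuracy has no reason to exceed chance. That gap is the sense in which raw probe accuracy says little on its own.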
Where Pith is reading between the lines
- Evaluations of safety or other alignment properties may need to test routing vectors directly rather than output behavior alone.
- Similar routing changes could appear in non-political domains, requiring broader internal diagnostics.
- Model-specific routing suggests alignment audits should include per-model ablation checks rather than universal benchmarks.
Load-bearing premise
Ablating the political-sensitivity direction affects only routing without side effects on factual retrieval, and political censorship in these models serves as a representative proxy for general alignment.
What would settle it
If ablating the identified direction in any model also degrades factual recall on unrelated topics or non-political knowledge, the claim that routing is cleanly isolated would be falsified.
Original abstract
Current alignment evaluation mostly measures whether models encode dangerous concepts and whether they refuse harmful requests. Both miss the layer where alignment often operates: routing from concept detection to behavioral policy. We study political censorship in Chinese-origin language models as a natural experiment, using probes, surgical ablations, and behavioral tests across nine open-weight models from five labs. Three findings follow. First, probe accuracy alone is non-diagnostic: political probes, null controls, and permutation baselines can all reach 100%, so held-out category generalization is the informative test. Second, surgical ablation reveals lab-specific routing. Removing the political-sensitivity direction eliminates censorship and restores accurate factual output in most models tested, while one model confabulates because its architecture entangles factual knowledge with the censorship mechanism. Cross-model transfer fails, indicating that routing geometry is model- and lab-specific. Third, refusal is no longer the dominant censorship mechanism. Within one model family, hard refusal falls to zero while narrative steering rises to the maximum, making censorship invisible to refusal-only benchmarks. These results support a three-stage descriptive framework: detect, route, generate. Models often retain the relevant knowledge; alignment changes how that knowledge is expressed. Evaluations that audit only detection or refusal therefore miss the routing mechanism that most directly determines behavior.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that standard alignment evaluations, which focus on whether models detect dangerous concepts or refuse harmful requests, miss the critical routing layer where alignment policies determine behavioral expression. Using political censorship in Chinese-origin LLMs as a natural experiment, the authors apply probes, surgical ablations of sensitivity directions, and behavioral tests across nine open-weight models from five labs. They report three findings: probe accuracy is non-diagnostic without held-out generalization tests; ablation of the political-sensitivity direction eliminates censorship and restores factual output in most models (with one exception due to entanglement); and refusal is no longer the dominant mechanism, with narrative steering rising instead. This supports a detect-route-generate framework in which models retain relevant knowledge but alignment alters its expression.
Significance. If the ablation results and cross-model patterns hold, the work is significant for identifying a systematic gap in refusal-based benchmarks and for providing empirical support for a routing-centric view of alignment. The lab-specific routing geometry and shift from refusal to steering are potentially actionable for designing more robust evaluations.
major comments (2)
- [Ablation experiments (results section)] The central claim that ablation isolates routing (and thereby shows knowledge is preserved) rests on the assumption that direction removal has no side effects on factual retrieval. The manuscript notes one model confabulates post-ablation due to entanglement, but does not report controls such as accuracy on unrelated factual benchmarks before and after ablation. Without these, restoration of output could reflect broad capability changes rather than specific routing removal, weakening the detect-route-generate framework.
- [Probe analysis (results section)] The claim that probe accuracy alone is non-diagnostic depends on null controls and permutation baselines reaching 100% while held-out category generalization is the informative metric. The exact statistical thresholds, number of held-out categories, and cross-validation procedure used to establish this distinction are not fully specified, making it difficult to assess whether the non-diagnostic conclusion is robust across the nine models.
minor comments (2)
- [Abstract] The abstract states 'consistent patterns across nine models' but does not name the labs or model families; adding this would improve readability without lengthening the paragraph.
- [Methods] Notation for the 'political-sensitivity direction' should be defined once at first use (e.g., as the top principal component or linear probe direction) to avoid ambiguity in later ablation descriptions.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify the empirical requirements for supporting our detect-route-generate framework. We address each major comment below and will revise the manuscript to incorporate the requested controls and details.
Point-by-point responses
- Referee: The central claim that ablation isolates routing (and thereby shows knowledge is preserved) rests on the assumption that direction removal has no side effects on factual retrieval. The manuscript notes one model confabulates post-ablation due to entanglement, but does not report controls such as accuracy on unrelated factual benchmarks before and after ablation. Without these, restoration of output could reflect broad capability changes rather than specific routing removal, weakening the detect-route-generate framework.
  Authors: We agree that explicit controls on unrelated factual tasks are needed to rule out broad capability degradation. In the revised manuscript we will add pre- and post-ablation accuracy on standard factual benchmarks (e.g., subsets of TriviaQA and MMLU unrelated to politics) for all nine models. These results will be reported alongside the existing ablation outcomes to demonstrate that the restoration of factual output is specific to removal of the political-sensitivity direction rather than a general change in model capability. We already note the entanglement exception for the one model that confabulates; the new controls will further isolate this case.
  Revision: yes
- Referee: The claim that probe accuracy alone is non-diagnostic depends on null controls and permutation baselines reaching 100% while held-out category generalization is the informative metric. The exact statistical thresholds, number of held-out categories, and cross-validation procedure used to establish this distinction are not fully specified, making it difficult to assess whether the non-diagnostic conclusion is robust across the nine models.
  Authors: We will expand the methods section to specify the exact statistical thresholds (including the criterion for declaring probe accuracy non-diagnostic), the precise number of held-out categories used in the generalization tests, and the full cross-validation procedure (category splits and number of folds). These details will be provided for all nine models so that readers can evaluate the robustness of the conclusion that held-out generalization, rather than raw probe accuracy, is the informative metric.
  Revision: yes
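To make concrete what a category split could look like, here is a minimal leave-one-category-out sketch: every fold holds out all examples of one category and trains on the rest, so a probe is always scored on a category it never saw. This is one common way to set up held-out category generalization and is offered as an assumption; the paper's actual splits and number of folds are not specified in this summary. The function name, category labels, and prompt ids are hypothetical.

```python
from collections import defaultdict

def leave_one_category_out(examples):
    """Yield (held_out_category, train_items, test_items) for each category.

    examples: list of (category, item) pairs
    """
    by_category = defaultdict(list)
    for category, item in examples:
        by_category[category].append(item)
    for held_out in sorted(by_category):
        # Train on every category except the held-out one.
        train = [
            item
            for category, items in by_category.items()
            if category != held_out
            for item in items
        ]
        # Test only on the category the probe never saw.
        test = list(by_category[held_out])
        yield held_out, train, test

# Hypothetical category names and prompt ids, for illustration only.
examples = [
    ("cat_a", "p1"), ("cat_a", "p2"),
    ("cat_b", "p3"),
    ("cat_c", "p4"), ("cat_c", "p5"),
]
splits = list(leave_one_category_out(examples))
for held_out, train, test in splits:
    print(held_out, train, test)
```

With this scheme the number of folds equals the number of categories, which is why a report would need to state the category count for the result to be interpretable.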
Circularity Check
No circularity: empirical ablation and cross-model tests ground the detect-route-generate framework
full rationale
The paper's central claims rest on probe accuracy tests, surgical direction ablations, and behavioral refusal measurements across nine models. These are independent experimental interventions whose outcomes (e.g., restored factual output after ablation in most models, lab-specific routing geometry) are not forced by any self-definition, fitted parameter renamed as prediction, or self-citation chain. The three-stage framework is presented as a descriptive summary of the observed results rather than a mathematical derivation that reduces to its inputs. No equations appear; the work is self-contained against external benchmarks via held-out generalization and cross-model transfer failures.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Linear probes accurately identify a causal political-sensitivity direction in model activations
- domain assumption: Ablation effects are specific to routing and do not broadly disrupt factual knowledge retrieval
invented entities (1)
- routing mechanism: no independent evidence
Forward citations
Cited by 2 Pith papers
- How Alignment Routes: Localizing, Scaling, and Controlling Policy Circuits in Language Models
  Alignment policy in language models is implemented as an early-commitment routing circuit of detection gates and amplifier heads that can be localized, scaled, and directly controlled without removing the underlying c...
- Correcting Suppressed Log-Probabilities in Language Models with Post-Transformer Adapters
  A small post-transformer adapter trained on frozen hidden states corrects suppressed log-probabilities on 31 ideology facts across Qwen3 scales, generalizes to 11-39% of held-out facts, and enables coherent output onl...