Language steering in latent space to mitigate unintended code-switching

Alexey Zaytsev; Andrey Goncharov; Nikolai Kondusov

arxiv: 2510.13849 · v3 · submitted 2025-10-11 · 💻 cs.CL · cs.LG

Pith reviewed 2026-05-18 07:12 UTC · model grok-4.3

keywords: code-switching, latent space steering, principal component analysis, multilingual LLMs, inference-time intervention, language control, token embeddings, parallel translations

The pith

Steering token embeddings along a PCA-derived direction from parallel translations reduces unintended code-switching by 63-99 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Multilingual LLMs often mix languages unintentionally in their outputs, which lowers reliability for tasks that need consistent language use. The paper introduces latent-space language steering, a method that finds a language direction with PCA on a small set of parallel translations and shifts token embeddings along that axis during inference. This controls the output language without retraining the model. Tests on Llama-3.2 and Qwen2.5 show sharp drops in code-switching metrics, high language classification accuracy from one component, and preserved semantics with almost no added cost. The analysis also indicates that language identity becomes nearly linearly separable in the final layers.

Core claim

The central claim is that a single linear direction in token embedding space, obtained by PCA on embeddings of parallel translations, captures language identity. Steering embeddings along this direction at inference time suppresses unintended code-switching while leaving semantic content intact, as shown by 95-99 percent classification accuracy, up to 55 percent lower next-token divergence, and 63-99 percent reduction in Code-Switching Index across four language pairs on Llama-3.2 with p less than 0.001.

What carries the argument

The first principal component extracted from embeddings of parallel translations, applied as a steering vector to shift token representations toward the target language during generation.
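The extraction step can be sketched on synthetic data. This is a minimal illustration, not the authors' code: it assumes PCA is run on the pooled, mean-centered embeddings of both languages, and the array names, dimensions, and the size of the language offset are all hypothetical stand-ins for real model embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for model embeddings: 200 parallel sentence pairs in a
# 64-dimensional space. Each pair shares a "semantic" vector; the second
# language adds a fixed offset (assumed here to lie along axis 0).
n_pairs, dim = 200, 64
semantic = rng.normal(size=(n_pairs, dim))
lang_offset = np.zeros(dim)
lang_offset[0] = 6.0
emb_a = semantic + 0.1 * rng.normal(size=(n_pairs, dim))
emb_b = semantic + lang_offset + 0.1 * rng.normal(size=(n_pairs, dim))

def language_direction(emb_a, emb_b):
    """Return the unit-norm first principal component of the pooled embeddings."""
    pooled = np.vstack([emb_a, emb_b])
    centered = pooled - pooled.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]

v = language_direction(emb_a, emb_b)
print("|v[0]| =", abs(v[0]))  # close to 1 when language dominates the variance
```

In this toy setup the first component recovers the language axis only because the language offset contributes more variance than any single semantic dimension. If content variance were larger, the leading component would track content instead.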

If this is right

Reduces Code-Switching Index by 63-99 percent on Llama-3.2 across four language pairs while preserving semantics.
Achieves 95-99 percent accuracy in classifying the generated language using only the first principal component.
Lowers next-token distributional divergence by up to 55 percent across tested language pairs.
Requires only minimal parallel data for calibration and adds negligible computational overhead.
Shows language representations concentrate in the final layers with near-perfect linear separability.
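The single-component classification claim can be illustrated with a threshold on the first-component score. The sketch below uses synthetic embeddings and a midpoint threshold; both are assumptions made for illustration, since the page does not state the paper's classifier or data.

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 32

def make_split(n):
    """Synthetic stand-in for embeddings of n parallel sentence pairs."""
    semantic = rng.normal(size=(n, dim))
    offset = np.zeros(dim)
    offset[0] = 6.0  # assumed language offset along one axis
    a = semantic + 0.1 * rng.normal(size=(n, dim))
    b = semantic + offset + 0.1 * rng.normal(size=(n, dim))
    return a, b

cal_a, cal_b = make_split(100)    # calibration pairs
test_a, test_b = make_split(500)  # held-out pairs

# Fit: first principal component of the pooled, centered calibration set.
pooled = np.vstack([cal_a, cal_b])
mean = pooled.mean(axis=0)
_, _, vt = np.linalg.svd(pooled - mean, full_matrices=False)
v = vt[0]

# Threshold halfway between the two calibration class means on the PC1 score.
score_a = (cal_a - mean) @ v
score_b = (cal_b - mean) @ v
threshold = 0.5 * (score_a.mean() + score_b.mean())
b_is_high = score_b.mean() > score_a.mean()  # the sign of a PCA axis is arbitrary

def predict_is_b(x):
    """Classify rows of x as language B using only the PC1 score."""
    high = (x - mean) @ v > threshold
    return high if b_is_high else ~high

accuracy = 0.5 * (np.mean(~predict_is_b(test_a)) + np.mean(predict_is_b(test_b)))
print(f"held-out accuracy from one component: {accuracy:.3f}")
```

The orientation check (`b_is_high`) matters because PCA fixes a direction only up to sign, so the same component can come out pointing toward either language.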

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same linear-steering approach could be tested on other controllable attributes such as formality or topic focus if those traits also align with single directions in embedding space.
Interventions might be most effective when restricted to the deeper layers where language identity is shown to concentrate.
The calibration step with parallel data could be repeated for new target languages to extend control without full model retraining.
If the direction proves prompt-stable, it could reduce reliance on prompt engineering for consistent multilingual output.

Load-bearing premise

Language identity is linearly separable along one stable direction derived from a small set of parallel translations, and shifting along it works reliably across new prompts without harming other capabilities.

What would settle it

The claim would be undermined if applying the steering vector to a fresh set of prompts outside the calibration translations produced no reduction in code-switching, or if it changed the semantic meaning of the generated text.

Original abstract

Multilingual Large Language Models (LLMs) often exhibit hallucinations such as unintended code-switching, reducing reliability in downstream tasks. We propose latent-space language steering, a lightweight inference-time method that identifies language directions via Principal Component Analysis (PCA) on parallel translations and steers token embeddings along these axes to control language identity. Our approach mitigates code-switching while preserving semantics with negligible computational overhead and requires only minimal parallel data for calibration. Empirically, we achieve 95-99% language classification accuracy using a single principal component and reduce next-token distributional divergence by up to 55% across multiple language pairs on Qwen2.5 and Llama-3.2 models. Generation-based evaluation on Llama-3.2 further demonstrates 63-99% reduction in Code-Switching Index across four language pairs (p < 0.001). We further analyze the layer-wise evolution of language representations, revealing that language identity concentrates in final layers with near-perfect linear separability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

PCA steering from parallel data cuts code-switching in the tests but risks capturing content variance instead of a pure language axis.

The paper extracts a direction via PCA on parallel translations and adds a scaled version of it to token embeddings at inference time to push generations toward one language. On Llama-3.2 and Qwen2.5 across four pairs this produces the reported drops in code-switching index and distributional divergence while keeping overhead low and using only small calibration sets. They also track how linear separability improves in later layers, which is a concrete observation worth noting. The quantitative claims come with p-values and look reproducible from the abstract numbers.

The combination of parallel-data PCA plus embedding offset for this exact failure mode is not in the cited prior steering work, so that part is new. The low data requirement and layer analysis are the practical strengths.

The main soft spot is the one the stress-test flags: if the parallel sentences share topic or length patterns, the first component can mix those signals with language identity. Steering would then work best on prompts close to the calibration data and might drift semantics on unrelated inputs. The abstract claims semantic preservation but does not detail the checks or the exact steering formula, so it is hard to judge robustness without the full methods. Only two models and four pairs also keeps the scope narrow.

This is for engineers who need a quick inference fix for multilingual reliability and for interpretability people tracking language directions. A reader who wants a lightweight, measurable trick would get value from the numbers and the layer plot. I would send it to peer review; the results are concrete enough that referees can verify the implementation and test the generality concern directly.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes latent-space language steering: a lightweight inference-time intervention that extracts language directions via PCA on token embeddings from a small set of parallel translations and adds scaled versions of the leading principal component to steer generation toward a target language. On Llama-3.2 and Qwen2.5 the method yields 95-99% language-classification accuracy from a single component, up to 55% reduction in next-token distributional divergence, and 63-99% reduction in Code-Switching Index (p<0.001) across four language pairs while claiming semantic preservation; it also reports that language identity becomes linearly separable in the final layers.

Significance. If the central claim holds, the work supplies a training-free, low-overhead technique that requires only minimal parallel data to enforce language consistency in multilingual LLMs. The quantitative gains on held-out generations and the layer-wise separability analysis would constitute a practical contribution to controllable multilingual generation.

major comments (2)

[§4] Generation-based evaluation: the 63-99% CSI reduction is the primary empirical support for the central claim, yet the manuscript provides no explicit controls demonstrating that the first PC is orthogonal to semantic or topical variance in the calibration pairs. Without such a check (e.g., correlation with sentence embeddings or out-of-domain prompt tests), the reported gains could reflect calibration-set separability rather than a general language axis.
[§3.2] Steering implementation: the paper states that steering is applied to token embeddings during generation, but does not report whether the same direction remains effective when prompts diverge in length, topic, or style from the parallel calibration set. This directly bears on the stability assumption underlying the 95-99% accuracy and CSI results.

minor comments (2)

[Abstract] The phrase 'preserving semantics' is used without naming the concrete metric (e.g., cosine similarity of sentence embeddings or human ratings); a one-sentence clarification would improve readability.
[Results] Table or figure captions: several quantitative claims (e.g., 'up to 55% divergence reduction') would benefit from explicit reference to the corresponding table or figure number.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which have helped clarify aspects of our method's robustness. We address each major comment below and indicate revisions to the manuscript where we agree changes are warranted.

Referee: [§4] Generation-based evaluation: the 63-99% CSI reduction is the primary empirical support for the central claim, yet the manuscript provides no explicit controls demonstrating that the first PC is orthogonal to semantic or topical variance in the calibration pairs. Without such a check (e.g., correlation with sentence embeddings or out-of-domain prompt tests), the reported gains could reflect calibration-set separability rather than a general language axis.

Authors: We agree that an explicit check would strengthen the interpretation that the leading principal component isolates language rather than semantic or topical factors. The calibration data consists of parallel translations, which hold semantics fixed by construction and thus make language the dominant source of variance; however, this is an implicit rather than demonstrated property. In the revised manuscript we add (i) Pearson correlations between the first PC loadings and sentence embeddings from a separate semantic encoder on the calibration pairs (showing low correlation) and (ii) a supplementary evaluation on out-of-domain prompts drawn from unrelated topics. These additions directly address the concern while preserving the original experimental design.
revision: yes

Referee: [§3.2] Steering implementation: the paper states that steering is applied to token embeddings during generation, but does not report whether the same direction remains effective when prompts diverge in length, topic, or style from the parallel calibration set. This directly bears on the stability assumption underlying the 95-99% accuracy and CSI results.

Authors: The CSI and accuracy results in §4 were obtained on held-out generations whose prompts differ substantially in length, topic, and stylistic register from the short parallel calibration sentences; these prompts were sampled from diverse multilingual corpora to simulate realistic code-switching scenarios. The consistent 63-99% CSI reductions across these divergent prompts already provide evidence of stability. To make the distinction explicit, the revision clarifies in §3.2 that the reported metrics use held-out prompts and adds a short table summarizing performance stratified by prompt length and topical distance from the calibration set.
revision: partial
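The correlation control proposed in the first response could look roughly like the following. This is a hedged sketch on synthetic data: it reads "first PC loadings" as per-sentence scores on the first component, and the array names, sizes, and the separate semantic encoder are hypothetical. Here the semantic features are independent of the model embeddings by construction, so low correlations are expected and say nothing about the paper's real data.

```python
import numpy as np

rng = np.random.default_rng(3)
n, dim_model, dim_sem = 300, 32, 8

# Hypothetical inputs: per-sentence language labels, embeddings from a
# separate semantic encoder, and model embeddings with a language offset.
lang = rng.integers(0, 2, size=n)
semantic = rng.normal(size=(n, dim_sem))
model_emb = rng.normal(size=(n, dim_model))
model_emb[:, 0] += 6.0 * lang  # assumed language offset along axis 0

# Per-sentence score on the first principal component of the model embeddings.
centered = model_emb - model_emb.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
pc1_scores = centered @ vt[0]

def pearson_per_dim(scores, features):
    """Pearson r between the scores and each feature column."""
    return np.array([
        np.corrcoef(scores, features[:, j])[0, 1]
        for j in range(features.shape[1])
    ])

r_sem = pearson_per_dim(pc1_scores, semantic)
r_lang = np.corrcoef(pc1_scores, lang)[0, 1]
print("max |r| with semantic dims:", np.abs(r_sem).max())
print("|r| with language label:   ", abs(r_lang))
```

A check of this kind only rules out linear dependence on the chosen semantic features; it does not by itself show that the direction generalizes to out-of-domain prompts.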

Circularity Check

0 steps flagged

Low circularity: PCA on external parallel data with held-out generation metrics

The derivation applies standard PCA to a small set of external parallel translations to extract a leading component treated as a language direction, then adds/subtracts this vector from token embeddings at inference time. The primary reported outcome—63-99% CSI reduction with p<0.001—is measured on generated text from held-out prompts rather than being a direct algebraic consequence of the fitted components. The 95-99% classification accuracy is presented as supporting evidence of linear separability but does not constitute the core mitigation claim. No self-citation chains, self-definitional equations, or fitted parameters renamed as predictions reduce the central result to its inputs by construction; the approach remains empirically falsifiable on independent generations.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on the empirical observation that language identity is captured by the top principal component of hidden states from parallel sentences; no new physical entities or untested axioms are introduced.

free parameters (1)

steering strength coefficient
Scaling factor applied to the language direction vector during embedding shift; value chosen to balance language control against semantic preservation.
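The role of the steering strength can be seen in the update quoted further down this page as Eq. (2), h̃ = h − s (h · v) v. The sketch below applies that formula to random vectors; the shapes and the assumption that v is normalized to unit length are illustrative choices, not details confirmed by the page.

```python
import numpy as np

def steer(h, v, s):
    """Apply h - s * (h . v) * v to each row of h, with v scaled to unit length."""
    v = v / np.linalg.norm(v)
    proj = h @ v                       # scalar projection of each row onto v
    return h - s * np.outer(proj, v)

rng = np.random.default_rng(2)
h = rng.normal(size=(5, 16))           # hypothetical hidden states for 5 tokens
v = rng.normal(size=16)                # hypothetical language direction
unit_v = v / np.linalg.norm(v)

for s in (0.0, 0.5, 1.0):
    steered = steer(h, v, s)
    residual = np.abs(steered @ unit_v).max()
    print(f"s={s:.1f}  max |component along v| = {residual:.4f}")
```

For unit-norm v, the steered state's component along v equals (1 − s)(h · v), so s = 0 leaves h unchanged and s = 1 removes the component entirely, while everything orthogonal to v is untouched for any s.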

axioms (1)

domain assumption: Language identity is linearly separable in the token embedding space of the final layers.
Invoked to justify using a single principal component for steering.

pith-pipeline@v0.9.0 · 5702 in / 1392 out tokens · 30368 ms · 2026-05-18T07:12:42.879885+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear
Paper passage: We identify language-specific directions via PCA on parallel translations... $v^{(\ell)} = \arg\max \ldots$ (1); $\tilde{h}^{(\ell)}_t = h^{(\ell)}_t - s\,(h^{(\ell)}_t \cdot v^{(\ell)})\,v^{(\ell)}$ (2)

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: language identity concentrates in final layers with near-perfect linear separability

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.