arxiv: 2605.05892 · v1 · submitted 2026-05-07 · 💻 cs.CL · cs.LG

Beyond Steering Vector: Flow-based Activation Steering for Inference-Time Intervention

Zehao Jin, Ruixuan Deng, Junran Wang, Xinjie Shen, Chao Zhang

Pith reviewed 2026-05-08 10:54 UTC · model grok-4.3

keywords activation steering · flow-based models · inference-time intervention · velocity field · language model control · AxBench · generalization · activation geometry

The pith

A learned velocity field steers language model activations more effectively than prompting on unseen concepts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Language models can be controlled at inference by altering their internal activations rather than their weights. Earlier steering techniques relied on simple fixed adjustments that stay the same across tokens and use only one step. The paper proposes that these restrictions cause poor results compared to prompting and tests a new approach that learns a full flow of changes over time. This flow method, conditioned on the target concept, achieves better performance than prompting on a large benchmark of held-out concepts. The resulting paths in activation space turn out to be curved and to vary by token, indicating that activation geometry is more complex than earlier models assumed.

Core claim

FLAS learns a general, concept-conditioned velocity field v_t(h,t,c) that transports unsteered activations to steered ones without relying on fixed, single-step, position-invariant transforms. On AxBench, this is the first learned method to consistently outperform prompting, with held-out harmonic means of 1.015 on Gemma-2-2B-IT and 1.113 on Gemma-2-9B-IT. Analysis reveals that the learned flows follow curved, multi-step, token-varying trajectories.
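The headline numbers are harmonic means, which behave differently from arithmetic averages. The sketch below assumes the AxBench convention of three per-response judge scores (concept, instruction, fluency), each on a 0 to 2 scale; the function name and the example scores are hypothetical and are not taken from the paper.

```python
def harmonic_mean(scores):
    """Harmonic mean of non-negative scores; zero if any score is zero."""
    if not scores:
        raise ValueError("need at least one score")
    if any(s < 0 for s in scores):
        raise ValueError("scores must be non-negative")
    if any(s == 0 for s in scores):
        return 0.0
    return len(scores) / sum(1.0 / s for s in scores)


# Hypothetical (concept, instruction, fluency) scores, each in [0, 2].
balanced = harmonic_mean([1.0, 1.0, 1.0])
lopsided = harmonic_mean([2.0, 2.0, 0.2])
print(balanced, lopsided)
```

The second case has a higher arithmetic mean than the first but a lower harmonic mean, because one weak component dominates. Under this aggregate, a method cannot score well by pushing the concept score up while fluency collapses.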

What carries the argument

A concept-conditioned velocity field v_t(h,t,c) that defines the direction and magnitude of activation changes over a continuous time parameter t.
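To make the idea of integrating a velocity field concrete, here is a minimal sketch of N-step Euler integration on a toy 2-D activation. The two velocity functions are hypothetical stand-ins (the paper's field is a learned network), and the uniform step size T/N is an assumption.

```python
import math


def euler_flow(h, velocity, concept, T=1.0, n_steps=3):
    """Integrate dh/dt = velocity(h, t, concept) from t=0 to t=T with Euler steps.

    Returns the whole path [h_0, h_1, ..., h_N] so its shape can be inspected.
    """
    dt = T / n_steps
    path = [list(h)]
    for k in range(n_steps):
        t = k * dt
        v = velocity(path[-1], t, concept)
        path.append([x + dt * vx for x, vx in zip(path[-1], v)])
    return path


def constant_field(h, t, concept):
    # Ignores h and t: equivalent to adding a fixed steering vector.
    return list(concept)


def rotating_field(h, t, concept):
    # Toy field whose direction turns with t, so the path bends.
    angle = t * math.pi / 2
    cx, cy = concept
    return [cx * math.cos(angle) - cy * math.sin(angle),
            cx * math.sin(angle) + cy * math.cos(angle)]


h0 = [0.0, 0.0]
c = [1.0, 0.0]  # hypothetical 2-D "concept" vector
straight = euler_flow(h0, constant_field, c, T=2.0, n_steps=4)
curved = euler_flow(h0, rotating_field, c, T=2.0, n_steps=4)
print(straight[-1], curved[-1])
```

With the constant field, every step adds the same vector and the path is a straight line of length proportional to T, which is the fixed steering-vector setting. With the time-dependent field, the same integrator produces a bent path, which is the kind of behavior a learned field is free to exhibit.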

Load-bearing premise

That the limitations of existing steering methods are mainly due to their fixed single-step position-invariant assumptions, and that a learned general velocity field can overcome these limitations.

What would settle it

Observing that a non-flow steering method using multi-step or position-varying adjustments matches or exceeds FLAS performance on the AxBench held-out sets would challenge the necessity of the learned velocity field.

Figures

Figures reproduced from arXiv: 2605.05892 by Chao Zhang, Junran Wang, Ruixuan Deng, Xinjie Shen, Zehao Jin.

**Figure 1.** FLAS model architecture overview. The velocity field v_θ(h, t, c) transports activations at layer ℓ of a frozen base LM. A frozen concept encoder ϕ produces concept representations consumed by a single FlowBlock via cross-attention. The flow is integrated by N-step Euler, shared between training and inference, yielding a steered activation h′ = φ_T(h). The entire base language model (base LM) is frozen; on…

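**Figure 2.** Held-in results on Gemma-2-2B-IT, layer 20. FLAS exceeds the in-context prompting baseline by +0.294 and HyperSteer by +0.283.

| Category | Method | Gemma-2-2B-IT Held-in | Gemma-2-2B-IT Held-out | Gemma-2-9B-IT Held-in | Gemma-2-9B-IT Held-out |
| --- | --- | --- | --- | --- | --- |
| Prompting | Prompting | 0.731 | 0.762 | 1.075 | 1.091 |
| Fine-tuning | LoReFT | 0.722 | — | 0.777 | — |
| Fine-tuning | SFT | 0.714 | — | — | — |
| Fine-tuning | LoRA | 0.641 | — | 0.602 | — |
| Fine-tuning | RePS | 0.606 | — | 0.892 | — |
| Activation Steering | HyperSteer | 0.742 | 0.608 | 1.091 | 0.934 |
| Activation Steering | ReFT-r1 | 0.509 | — | 0.630 | — |
| Activation Steering | AcT (A… | | | | |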

**Figure 3.** Concept scaling. Held-out harmonic mean versus the number of training concepts. We investigate how FLAS performance scales with the number of training concepts. We train models on subsets of 9, 500, 1.9k, 5.5k, and the full 16k concepts with identical hyperparameters, and evaluate on the same held-out concepts at T = 2. As shown in …

**Figure 4.** Figure 4 contrasts FLAS with three baselines on Gemma-2-9B-IT: ReFT-r1, DiffMean, and AcT [21] (reproduced at layer 20, see Appendix F). All three baselines collapse at higher strengths, while FLAS steadily improves concept score and maintains high instruction and fluency across the entire range. This robustness is not an artifact of training data abundance …

**Figure 5.** Flow time across training-set sizes (Gemma-2-2B-IT held-out). Score decomposition versus T for five concept scales. Shaded bands show ±1 std, clipped to [0, 2].

**Figure 6.** Figure 6 visualizes the flow trajectories projected onto the leading principal components of the displacement vectors across various concepts, prompts, and integration steps. The trajectories are not straight lines. Every concept’s path leaves the origin in a shared direction, executes a pronounced bend, and then enters a concept-specific region. Once the bend completes, T controls how far along the concept-specifi…

**Figure 7.** Figure 7 quantifies the per-step structure of the learned flow. At larger flow times (T = 2.0 and T = 3.0), the late steps point in mutually consistent directions (cosine similarity > 0.7), while the early steps are markedly misaligned with these later directions (cosine similarity < 0.25). This separation between early and late step directions provides quantitative evidence that the bending observed in …

**Figure 8.** Per-token displacement cosines (N = 3, T = 2). Left: mean pairwise cosine of total displacements h_N − h_0 across token positions. Right: distribution of off-diagonal cosines (µ = 0.294, σ = 0.133). Per-token steering is far from uniform.

7 Limitations and Future Work. Our evaluation focuses on AxBench because it provides large-scale natural-language concepts, allowing us to test FLAS on zero-shot extrapolation …

**Figure 9.** Figure 9 decomposes the score across T ∈ [0.5, 4.0] on Gemma-2-2B-IT held-out concepts. Without L_div the concept score plateaus near 1.05 around T ≈ 1.5 and then declines, while the full configuration climbs monotonically and reaches 1.33 at T = 4. At large flow times the LM-only variant also suffers a sharp collapse in all scores (in particular, the fluency score drops to around 0.2 at T = 4 versus 0.85 for the full con…
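The cosine statistics quoted for Figures 7 and 8 can be illustrated with a small sketch. It assumes a trajectory is stored as a list of activation vectors h_0, …, h_N; the helper names and the 2-D toy data are hypothetical and are not the authors' analysis code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (nu * nv)


def step_displacements(path):
    """Per-step displacement vectors h_{k+1} - h_k along one trajectory."""
    return [[b - a for a, b in zip(p, q)] for p, q in zip(path, path[1:])]


def mean_pairwise_cosine(vectors):
    """Mean cosine over all unordered pairs of distinct vectors."""
    pairs = [(i, j) for i in range(len(vectors))
             for j in range(i + 1, len(vectors))]
    return sum(cosine(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)


# Toy trajectory: one early step in one direction, then two aligned late steps.
path = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
steps = step_displacements(path)
early_vs_late = cosine(steps[0], steps[-1])
late_vs_late = cosine(steps[1], steps[2])

# Toy total displacements h_N - h_0 at three token positions.
totals = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
token_mean = mean_pairwise_cosine(totals)
print(early_vs_late, late_vs_late, token_mean)
```

A low early-versus-late cosine together with a high late-versus-late cosine is the signature of a bend followed by a straight run. A mean cross-token cosine well below 1 indicates that different positions are moved in different directions, which a single position-invariant vector cannot do.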

Original abstract

Activation steering has emerged as a promising alternative for controlling language-model behavior at inference time by modifying intermediate representations while keeping model parameters frozen. However, large-scale evaluations such as AxBench show that existing steering methods are often outperformed by simple in-context prompting and generalize poorly to unseen concepts. We hypothesize that these limitations arise from unvalidated simplifying assumptions shared across prior methods, which typically restrict steering interventions to fixed, single-step, position-invariant transforms. We propose FLAS (Flow-based Activation Steering), which learns a general, concept-conditioned velocity field $v_t(h,t,c)$ that transports unsteered activations to steered ones without relying on these assumptions. On AxBench, FLAS is the first learned method to consistently outperform prompting, reaching held-out harmonic means of $1.015$ on Gemma-2-2B-IT and $1.113$ on Gemma-2-9B-IT without per-concept tuning. Analysis of the learned flow shows curved, multi-step, token-varying trajectories, which suggests that previous hypotheses on activation space geometry might be incomplete.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

FLAS learns a concept-conditioned velocity field for multi-step activation steering and beats prompting on AxBench, but the gains could easily come from extra capacity rather than the flow assumptions.

The main takeaway is that this paper replaces the usual single fixed steering vector with a learned time- and concept-dependent velocity field v_t(h,t,c) that produces curved trajectories, and it reports the first case where a learned method beats in-context prompting on held-out AxBench concepts for Gemma models. That formulation is genuinely new within activation engineering and the trajectory analysis is a useful qualitative check on what the model actually does. The numbers (harmonic means around 1.015 and 1.113) are presented cleanly and the claim of no per-concept tuning is stated plainly. The paper does a reasonable job of motivating why prior single-step, position-invariant methods might be limited.

The soft spots are more substantial. The abstract and available details give no training procedure, no regularization, no error bars, and no ablation against simpler position-dependent or multi-step baselines, so it is impossible to tell whether the lift comes from the flow geometry or from the model simply having more parameters to fit the benchmark distribution. The stress-test concern about overfitting therefore lands: without capacity-matched controls the causal link between relaxed assumptions and performance remains untested.

The work is aimed at people already working on inference-time intervention and activation steering. A reader in that subfield will get a clear new direction to try, even if the current evidence is thin. It deserves a serious referee because the core idea is well-motivated and the benchmark results are at least falsifiable; the review process can force the missing ablations and statistical checks that are needed before the claims can be trusted.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces FLAS (Flow-based Activation Steering), which learns a concept-conditioned velocity field v_t(h,t,c) to transport unsteered activations to steered ones in LLMs at inference time. It hypothesizes that prior steering methods underperform prompting on AxBench due to their fixed, single-step, position-invariant assumptions, and claims that FLAS relaxes these to achieve superior held-out performance (harmonic means 1.015 on Gemma-2-2B-IT and 1.113 on Gemma-2-9B-IT) without per-concept tuning. Qualitative analysis of the learned flows is presented as evidence of curved, multi-step, token-varying trajectories that challenge prior views of activation-space geometry.

Significance. If the outperformance and its attribution to the relaxed assumptions hold after controls, this would advance inference-time intervention by demonstrating a more flexible, generalizable steering approach that avoids per-concept tuning. The flow analysis could inform future work on activation geometry. The practical advantage of consistent gains over prompting on held-out concepts is notable, but the overall significance hinges on whether the velocity field provides genuine generalization benefits beyond capacity increases.

major comments (3)

[Experiments] The central claim that FLAS outperforms prompting specifically because it drops the fixed/single-step/position-invariant restrictions is load-bearing but unsupported by ablations; the learned v_t(h,t,c) has higher capacity than parameter-free steering vectors, so gains on AxBench could arise from expressivity fitting the benchmark distribution rather than curved trajectories per se. No comparison to simpler position-dependent or multi-step baselines is described.
[Results] The held-out harmonic means (1.015/1.113) are reported as point estimates without error bars, number of runs, or statistical significance tests, undermining the assertion of 'consistent' outperformance and making it impossible to assess reliability of the gaps over prompting.
[Method] No details on regularization, capacity controls, or overfitting mitigation for the concept-conditioned velocity field are provided, leaving the overfitting risk unaddressed despite the learned nature of the model and the emphasis on generalization to unseen concepts.

minor comments (2)

[Method] The notation v_t(h,t,c) is introduced without an explicit equation defining the flow ODE or the training objective; adding this would clarify how the velocity field is optimized.
[Introduction] The abstract states FLAS is 'the first learned method' to outperform prompting, but this would be strengthened by a brief comparison table in the introduction or related work section citing the specific prior learned methods evaluated on AxBench.
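For orientation, the flow that the notation refers to can be written out generically. This is a sketch based only on the Figure 1 caption (N-step Euler integration, steered activation h′ = φ_T(h)). The uniform step size T/N and the symbol s_c are assumptions introduced here, and the paper's training objective is left out because it is not stated in the material above.

```latex
% Flow ODE and its time-T flow map
\frac{\mathrm{d}h(t)}{\mathrm{d}t} = v_\theta\bigl(h(t), t, c\bigr),
\qquad h(0) = h,
\qquad h' = \varphi_T(h) = h(T)

% N-step Euler discretization with uniform step T/N (assumed)
h_{k+1} = h_k + \frac{T}{N}\, v_\theta(h_k, t_k, c),
\qquad t_k = \frac{kT}{N},
\quad k = 0, \dots, N-1,
\qquad h' \approx h_N

% Special case: a constant field v_\theta \equiv s_c gives an additive steering vector
h' = h + T\, s_c
```

In the constant case the Euler sum is exact for any N, and T acts as the steering strength. Any dependence of the field on h, t, or token position is what takes the method beyond that special case.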

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below and outline revisions to strengthen the empirical support, statistical reporting, and methodological transparency of the manuscript.

Referee: [Experiments] The central claim that FLAS outperforms prompting specifically because it drops the fixed/single-step/position-invariant restrictions is load-bearing but unsupported by ablations; the learned v_t(h,t,c) has higher capacity than parameter-free steering vectors, so gains on AxBench could arise from expressivity fitting the benchmark distribution rather than curved trajectories per se. No comparison to simpler position-dependent or multi-step baselines is described.

Authors: We acknowledge that isolating the contribution of the relaxed assumptions requires targeted controls. While the velocity field is trained jointly across concepts without per-concept tuning (providing a form of capacity regularization), we agree that direct comparisons to simpler position-dependent single-step or multi-step baselines would better attribute gains to trajectory curvature rather than expressivity alone. The qualitative flow visualizations already illustrate multi-step, token-varying paths, but we will add the requested baseline comparisons in the revised experiments section.

revision: partial

Referee: [Results] The held-out harmonic means (1.015/1.113) are reported as point estimates without error bars, number of runs, or statistical significance tests, undermining the assertion of 'consistent' outperformance and making it impossible to assess reliability of the gaps over prompting.

Authors: We agree that reporting only point estimates limits assessment of reliability. In the revision we will rerun the held-out evaluations with multiple random seeds, report standard deviations or error bars, and include statistical significance tests (e.g., paired t-tests) against the prompting baseline to substantiate the consistency claim.

revision: yes

Referee: [Method] No details on regularization, capacity controls, or overfitting mitigation for the concept-conditioned velocity field are provided, leaving the overfitting risk unaddressed despite the learned nature of the model and the emphasis on generalization to unseen concepts.

Authors: We will expand the method section to specify the regularization (weight decay, flow-matching loss weighting), network capacity (layer sizes and conditioning mechanism), and training protocol (early stopping on a validation split of concepts) used to mitigate overfitting. These details will clarify how generalization to held-out concepts is encouraged.

revision: yes
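The paired test mentioned in the second response could look roughly like the sketch below. It computes only the t statistic and degrees of freedom with the standard library (no p-value), and the pairing unit (seed or held-out concept) is left open. The score lists are hypothetical placeholders, not results from the paper.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(a, b):
    """Return (t statistic, degrees of freedom) for paired samples a[i], b[i]."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least two pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)  # sample standard deviation of the paired differences
    if sd == 0.0:
        raise ValueError("paired differences have zero variance")
    return mean(diffs) / (sd / math.sqrt(len(diffs))), len(diffs) - 1


# Hypothetical placeholder scores, NOT results from the paper.
steered = [1.02, 1.00, 1.03, 0.99, 1.01]
prompted = [0.77, 0.76, 0.75, 0.78, 0.74]
t_stat, dof = paired_t_statistic(steered, prompted)
print(t_stat, dof)
```

Pairing matters here because both methods are evaluated on the same items: the test looks at per-item differences, so variation shared by both methods cancels out.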

Circularity Check

0 steps flagged

No circularity; empirical learning and held-out evaluation are independent of inputs

The paper learns a concept-conditioned velocity field v_t(h,t,c) from training data and reports empirical outperformance on held-out AxBench concepts (harmonic means 1.015/1.113). No equations reduce the claimed superiority to a quantity defined by construction from the same inputs, nor do any load-bearing steps rely on self-citations that themselves assume the target result. The hypothesis about prior methods' fixed/single-step assumptions is tested via the new model's performance rather than presupposed; the derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Based on abstract only; the velocity field is learned rather than hand-specified, so no obvious hand-tuned free parameters are visible. The approach assumes activation space admits a useful continuous flow without stating explicit axioms.

invented entities (1)

concept-conditioned velocity field v_t(h,t,c) · no independent evidence
purpose: To define a general transport map from unsteered to steered activations that varies with time, position, and concept
Introduced as the core modeling choice to relax prior fixed-transform assumptions; no independent evidence provided in abstract

Reference graph

Works this paper leans on

43 extracted references · 7 canonical work pages · 2 internal anchors

[1] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, and Amanda Askell, et al. Language models are few-shot learners, July 2020.
[2] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, and Alex Vaughan, et al. The llama 3 herd of models, November 2024.
[3] Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, and Alexandre Ramé, et al. Gemma 2: Improving open language models at a practical size, October 2024.
[4] Usman Anwar, Abulhair Saparov, Javier Rando, Daniel Paleka, Miles Turpin, Peter Hase, Ekdeep Singh Lubana, Erik Jenner, Stephen Casper, and Oliver Sourbut, et al. Foundational challenges in assuring alignment and safety of large language models. Transactions on Machine Learning Research, 2024. ISSN 2835-8856. URL https://openreview.net/forum?id=oVTkOs8Pka...
[5] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models, October 2021.
[6] Suhas Kotha, Jacob Mitchell Springer, and Aditi Raghunathan. Understanding catastrophic forgetting in language models via implicit inference, April 2024.
[7] Yun Luo, Zhen Yang, Fandong Meng, Yafu Li, Jie Zhou, and Yue Zhang. An empirical study of catastrophic forgetting in large language models during continual fine-tuning, 2025. URL https://arxiv.org/abs/2308.08747
[8] Seongheon Park, Xuefeng Du, Min-Hsuan Yeh, Haobo Wang, and Yixuan Li. Steer llm latents for hallucination detection, 2025. URL https://arxiv.org/abs/2503.01917
[9] Gerard Boxo, Ryan Socha, Daniel Yoo, and Shivam Raval. Caught in the act: a mechanistic approach to detecting deception, 2025. URL https://arxiv.org/abs/2508.19505
[10] Bruce W. Lee, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Erik Miehling, Pierre Dognin, Manish Nagireddy, and Amit Dhurandhar. Programming refusal with conditional activation steering, February 2025.
[11] Michel Frising and Daniel Balcells. Linear personality probing and steering in llms: A big five study, January 2026.
[12] Shu Yang, Shenzhe Zhu, Liang Liu, Lijie Hu, Mengdi Li, and Di Wang. Exploring the personality traits of llms through latent features steering, 2025. URL https://arxiv.org/abs/2410.10863
[13] Mario Giulianelli, Jacqueline Harding, Florian Mohnert, Dieuwke Hupkes, and Willem Zuidema. Under the hood: Using diagnostic classifiers to investigate and improve how language models track agreement information, 2021. URL https://arxiv.org/abs/1808.08079
[14] Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J. Vazquez, Ulisse Mini, and Monte MacDiarmid. Steering language models with activation engineering, October 2024.
[15] Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, and Ann-Kathrin Dombrowski, et al. Representation engineering: A top-down approach to ai transparency, March 2025.
[16] Zhengxuan Wu, Aryaman Arora, Atticus Geiger, Zheng Wang, Jing Huang, Dan Jurafsky, Christopher D. Manning, and Christopher Potts. Axbench: Steering llms? even simple baselines outperform sparse autoencoders, March 2025.
[17] Kiho Park, Yo Joong Choe, and Victor Veitch. The linear representation hypothesis and the geometry of large language models, July 2024.
[18] Nina Panickssery, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, and Alexander Matt Turner. Steering llama 2 via contrastive activation addition, July 2024.
[19] Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, and Andy Jones, et al. Scaling monosemanticity: Extracting interpretable features from claude 3 sonnet. Transformer Circuits Thread, 2024. URL https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html
[20] Zhengxuan Wu, Aryaman Arora, Zheng Wang, Atticus Geiger, Dan Jurafsky, Christopher D. Manning, and Christopher Potts. Reft: Representation finetuning for language models, May 2024.
[21] Pau Rodriguez, Arno Blaas, Michal Klein, Luca Zappella, Nicholas Apostoloff, Marco Cuturi, and Xavier Suau. Controlling language and diffusion models by transporting activations, November 2024.
[22] Zejia You, Chunyuan Deng, and Hanjie Chen. Spherical steering: Geometry-aware activation rotation for language models, February 2026.
[23] Shivam Raval, Hae Jin Song, Linlin Wu, Abir Harrasse, Jeff M. Phillips, Fazl Barez, and Amirali Abdullah. Curveball steering: The right direction to steer isn’t always linear, March 2026.
[24] Narmeen Oozeer, Luke Marks, Shreyans Jain, Fazl Barez, and Amirali Abdullah. Beyond linear steering: Unified multi-attribute control for language models, April 2026.
[25] Jiuding Sun, Sidharth Baskaran, Zhengxuan Wu, Michael Sklar, Christopher Potts, and Atticus Geiger. Hypersteer: Activation steering at scale with hypernetworks, June 2025.
[26] Hanyu Wang, Bochuan Cao, Yuanpu Cao, and Jinghui Chen. Truthflow: Truthful llm generation via representation flow correction, February 2025.
[27] Yawei Li, Benjamin Bergner, Yinghan Zhao, Vihang Prakash Patil, Bei Chen, and Cheng Wang. Steering large reasoning models towards concise reasoning via flow matching, February 2026.
[28] Hongjue Zhao, Haosen Sun, Jiangtao Kong, Xiaochang Li, Qineng Wang, Liwei Jiang, Qi Zhu, Tarek Abdelzaher, Yejin Choi, and Manling Li, et al. Odesteer: A unified ode-based steering framework for llm alignment, February 2026.
[29] Van-Cuong Pham and Thien Huu Nguyen. Householder pseudo-rotation: A novel approach to activation editing in llms with direction-magnitude perspective, December 2024.
[30] Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling, February 2023.
[31] Alexander Tong, Kilian Fatras, Nikolay Malkin, Guillaume Huguet, Yanlei Zhang, Jarrid Rector-Brooks, Guy Wolf, and Yoshua Bengio. Improving and generalizing flow-based generative models with minibatch optimal transport, March 2024.
[32] Yaron Lipman, Marton Havasi, Peter Holderrieth, Neta Shaul, Matt Le, Brian Karrer, Ricky T. Q. Chen, David Lopez-Paz, Heli Ben-Hamu, and Itai Gat. Flow matching guide and code, December 2024.
[33] Heli Ben-Hamu, Samuel Cohen, Joey Bose, Brandon Amos, Aditya Grover, Maximilian Nickel, Ricky T. Q. Chen, and Yaron Lipman. Matching normalizing flows and probability paths on manifolds, July 2022.
[34] Quan Dao, Hao Phung, Binh Nguyen, and Anh Tran. Flow matching in latent space, July 2023.
[35] Alexander Modell, Patrick Rubin-Delanchy, and Nick Whiteley. The origins of representation manifolds in large language models, May 2025.
[36] Lucrezia Valeriani, Diego Doimo, Francesca Cuturello, Alessandro Laio, Alessio Ansuini, and Alberto Cazzaniga. The geometry of hidden representations of large transformer models, October 2023.
[37] Mohamed A. Mabrok. Latent semantic manifolds in large language models, March 2026.
[38] Elena Facco, Maria d’Errico, Alex Rodriguez, and Alessandro Laio. Estimating the intrinsic dimension of datasets by a minimal neighborhood information. Scientific Reports, 7(1):12140, September 2017. ISSN 2045-2322. doi: 10.1038/s41598-017-11873-y
[39] Viacheslav Yusupov, Danil Maksimov, Ameliia Alaeva, Anna Vasileva, Anna Antipina, Tatyana Zaitseva, Alina Ermilova, Evgeny Burnaev, and Egor Shvetsov. From internal representations to text quality: A geometric approach to llm evaluation, September 2025.
[40] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance, July 2022.
[41] OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, and Sam Altman, et al. Gpt-4 technical report, March 2024.
[42] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, and Chenxu Lv, et al. Qwen3 technical report. https://arxiv.org/abs/2505.09388v1, May 2025.
[43] Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B. Hashimoto. Length-controlled alpacaeval: A simple way to debias automatic evaluators, March 2025.

A Training Details

| Parameter | Value |
| --- | --- |
| Base model | Gemma-2-2B-IT / Gemma-2-9B-IT (frozen) |
| Steering layer ℓ | 20 |
| FlowBlock count B | 1 |
| FlowBlock trainable parameters | 97.6M on 2B, 255M on 9B |
| Optimizer | AdamW with ... |