pith. machine review for the scientific record.

arxiv: 2605.05892 · v1 · submitted 2026-05-07 · 💻 cs.CL · cs.LG

Beyond Steering Vector: Flow-based Activation Steering for Inference-Time Intervention

Pith reviewed 2026-05-08 10:54 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords activation steering · flow-based models · inference-time intervention · velocity field · language model control · AxBench · generalization · activation geometry

The pith

A learned velocity field steers language model activations more effectively than prompting on unseen concepts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Language models can be controlled at inference by altering their internal activations rather than their weights. Earlier steering techniques relied on simple fixed adjustments that stay the same across tokens and use only one step. The paper proposes that these restrictions cause poor results compared to prompting and tests a new approach that learns a full flow of changes over time. This flow method, conditioned on the target concept, achieves better performance than prompting on a large benchmark of held-out concepts. The resulting paths in activation space turn out to be curved and to vary by token, indicating that activation geometry is more complex than earlier models assumed.

Core claim

FLAS learns a general, concept-conditioned velocity field v_t(h,t,c) that transports unsteered activations to steered ones without relying on fixed, single-step, position-invariant transforms. On AxBench, this is the first learned method to consistently outperform prompting, with held-out harmonic means of 1.015 on Gemma-2-2B-IT and 1.113 on Gemma-2-9B-IT. Analysis reveals that the learned flows follow curved, multi-step, token-varying trajectories.
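The figure captions on this page break the score into concept, instruction, and fluency components on a 0 to 2 scale. Assuming the headline "harmonic mean" combines those three sub-scores for each response (the usual AxBench convention, which this page does not spell out), a minimal sketch of the aggregation looks like this. The function name `harmonic_mean_score` is hypothetical.

```python
def harmonic_mean_score(concept: float, instruct: float, fluency: float) -> float:
    """Harmonic mean of three non-negative sub-scores.

    Assumption (not stated on this page): each sub-score lies in [0, 2]
    and the benchmark score is the harmonic mean of the three.
    """
    scores = (concept, instruct, fluency)
    if any(s < 0 for s in scores):
        raise ValueError("sub-scores must be non-negative")
    # A single zero sub-score drives the harmonic mean to zero.
    if any(s == 0 for s in scores):
        return 0.0
    return len(scores) / sum(1.0 / s for s in scores)
```

Because the harmonic mean is dominated by the weakest component, a method that raises the concept score while degrading fluency is penalized heavily, which is relevant to the baseline collapse described in the Figure 4 text.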

What carries the argument

A concept-conditioned velocity field v_t(h,t,c) that defines the direction and magnitude of activation changes over a continuous time parameter t.
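The Figure 1 caption on this page says the flow is integrated by N-step Euler. A minimal, dependency-free sketch of such an integration loop follows, assuming time runs from 0 to a flow time T in N equal steps. The names `euler_steer` and `toy_velocity` are hypothetical, and the toy field is only a stand-in for the learned network, not the paper's FlowBlock.

```python
from typing import Callable, List, Sequence

Vector = List[float]
# Hypothetical signature: velocity(h, t, c) returns dh/dt with the same length as h.
VelocityField = Callable[[Sequence[float], float, Sequence[float]], Vector]


def euler_steer(
    h0: Sequence[float],
    concept: Sequence[float],
    velocity: VelocityField,
    flow_time: float = 1.0,
    num_steps: int = 3,
) -> List[Vector]:
    """Forward-Euler integration of dh/dt = velocity(h, t, c) on [0, flow_time].

    Returns the whole trajectory [h_0, h_1, ..., h_N]; the last entry is the
    steered activation.
    """
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    dt = flow_time / num_steps
    h = list(h0)
    trajectory = [list(h)]
    for k in range(num_steps):
        t = k * dt  # time at the start of step k
        v = velocity(h, t, concept)
        h = [hi + dt * vi for hi, vi in zip(h, v)]
        trajectory.append(list(h))
    return trajectory


def toy_velocity(h: Sequence[float], t: float, c: Sequence[float]) -> Vector:
    """Illustrative field only: pulls h toward the concept vector c."""
    return [ci - hi for hi, ci in zip(h, c)]
```

A fixed steering vector corresponds to the special case where `velocity` ignores `h` and `t` and `num_steps` is 1. Letting the field depend on the current activation is what allows the path to bend and to differ from token to token.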

Load-bearing premise

That the limitations of existing steering methods are mainly due to their fixed single-step position-invariant assumptions, and that a learned general velocity field can overcome these limitations.

What would settle it

Observing that a non-flow steering method using multi-step or position-varying adjustments matches or exceeds FLAS performance on the AxBench held-out sets would challenge the necessity of the learned velocity field.

Figures

Figures reproduced from arXiv: 2605.05892 by Chao Zhang, Junran Wang, Ruixuan Deng, Xinjie Shen, Zehao Jin.

Figure 1: FLAS model architecture overview. The velocity field v_θ(h, t, c) transports activations at layer ℓ of a frozen base LM. A frozen concept encoder ϕ produces concept representations consumed by a single FlowBlock via cross-attention. The flow is integrated by N-step Euler, shared between training and inference, yielding a steered activation h′ = φ_T(h). The entire base language model (base LM) is frozen; on…
Figure 2: Held-in results on Gemma-2-2B-IT, layer 20. FLAS exceeds the in-context prompting baseline by +0.294 and HyperSteer by +0.283.

                         Gemma-2-2B-IT         Gemma-2-9B-IT
Method                   Held-in   Held-out    Held-in   Held-out
Prompting                0.731     0.762       1.075     1.091
Fine-tuning
  LoReFT                 0.722     —           0.777     —
  SFT                    0.714     —           —         —
  LoRA                   0.641     —           0.602     —
  RePS                   0.606     —           0.892     —
Activation Steering
  HyperSteer             0.742     0.608       1.091     0.934
  ReFT-r1                0.509     —           0.630     —
  AcT (A…
Figure 3: Concept scaling. Held-out harmonic mean versus the number of training concepts. We investigate how FLAS performance scales with the number of training concepts. We train models on subsets of 9, 500, 1.9k, 5.5k, and the full 16k concepts with identical hyperparameters, and evaluate on the same held-out concepts at T = 2. As shown in…
Figure 4 contrasts FLAS with three baselines on Gemma-2-9B-IT: ReFT-r1, DiffMean, and AcT [21] (reproduced at layer 20, see Appendix F). All three baselines collapse at higher strengths, while FLAS steadily improves concept score and maintains high instruction and fluency across the entire range. This robustness is not an artifact of training data abundance…
Figure 5: Flow time across training-set sizes (Gemma-2-2B-IT held-out). Score decomposition versus T for five concept scales. Shaded bands show ±1 std, clipped to [0, 2].
Figure 6 visualizes the flow trajectories projected onto the leading principal components of the displacement vectors across various concepts, prompts, and integration steps. The trajectories are not straight lines. Every concept's path leaves the origin in a shared direction, executes a pronounced bend, and then enters a concept-specific region. Once the bend completes, T controls how far along the concept-specifi…
Figure 7 quantifies the per-step structure of the learned flow. At larger flow times (T = 2.0 and T = 3.0), the late steps point in mutually consistent directions (cosine similarity > 0.7), while the early steps are markedly misaligned with these later directions (cosine similarity < 0.25). This separation between early and late step directions provides quantitative evidence that the bending observed in…
Figure 8: Per-token displacement cosines (N = 3, T = 2). Left: mean pairwise cosine of total displacements h_N − h_0 across token positions. Right: distribution of off-diagonal cosines (µ = 0.294, σ = 0.133). Per-token steering is far from uniform.

7 Limitations and Future Work. Our evaluation focuses on AxBench because it provides large-scale natural-language concepts, allowing us to test FLAS on zero-shot extrapolation …
Figure 9 decomposes the score across T ∈ [0.5, 4.0] on Gemma-2-2B-IT held-out concepts. Without L_div the concept score plateaus near 1.05 around T ≈ 1.5 and then declines, while the full configuration climbs monotonically and reaches 1.33 at T = 4. At large flow times the LM-only variant also suffers a sharp collapse in all scores (especially, fluency score drops to around 0.2 at T = 4 versus 0.85 for the full con…
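The Figure 7 and Figure 8 texts both rest on cosine similarities between displacement vectors: between integration steps in one case, and between token positions in the other. The sketch below shows one way such a diagnostic could be computed from a stored trajectory. It is an illustration under assumed names (`step_displacements`, `pairwise_cosines`), not the authors' analysis code.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def step_displacements(trajectory: Sequence[Sequence[float]]) -> List[List[float]]:
    """Per-step displacements h_{k+1} - h_k along a trajectory [h_0, ..., h_N]."""
    return [
        [b - a for a, b in zip(prev, nxt)]
        for prev, nxt in zip(trajectory, trajectory[1:])
    ]


def pairwise_cosines(vectors: Sequence[Sequence[float]]) -> List[List[float]]:
    """Square matrix whose (i, j) entry is the cosine between vectors i and j."""
    return [[cosine(u, v) for v in vectors] for u in vectors]
```

For a straight-line trajectory every entry of `pairwise_cosines(step_displacements(trajectory))` is 1. Low values between early and late steps, with high values among late steps, would match the bend described for Figure 7. Calling `pairwise_cosines` on the per-token total displacements h_N − h_0 gives the kind of matrix summarized in Figure 8.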
Original abstract

Activation steering has emerged as a promising alternative for controlling language-model behavior at inference time by modifying intermediate representations while keeping model parameters frozen. However, large-scale evaluations such as AxBench show that existing steering methods are often outperformed by simple in-context prompting and generalize poorly to unseen concepts. We hypothesize that these limitations arise from unvalidated simplifying assumptions shared across prior methods, which typically restrict steering interventions to fixed, single-step, position-invariant transforms. We propose FLAS (Flow-based Activation Steering), which learns a general, concept-conditioned velocity field $v_t(h,t,c)$ that transports unsteered activations to steered ones without relying on these assumptions. On AxBench, FLAS is the first learned method to consistently outperform prompting, reaching held-out harmonic means of $1.015$ on Gemma-2-2B-IT and $1.113$ on Gemma-2-9B-IT without per-concept tuning. Analysis of the learned flow shows curved, multi-step, token-varying trajectories, which suggests that previous hypotheses on activation space geometry might be incomplete.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces FLAS (Flow-based Activation Steering), which learns a concept-conditioned velocity field v_t(h,t,c) to transport unsteered activations to steered ones in LLMs at inference time. It hypothesizes that prior steering methods underperform prompting on AxBench due to their fixed, single-step, position-invariant assumptions, and claims that FLAS relaxes these to achieve superior held-out performance (harmonic means 1.015 on Gemma-2-2B-IT and 1.113 on Gemma-2-9B-IT) without per-concept tuning. Qualitative analysis of the learned flows is presented as evidence of curved, multi-step, token-varying trajectories that challenge prior views of activation-space geometry.

Significance. If the outperformance and its attribution to the relaxed assumptions hold after controls, this would advance inference-time intervention by demonstrating a more flexible, generalizable steering approach that avoids per-concept tuning. The flow analysis could inform future work on activation geometry. The practical advantage of consistent gains over prompting on held-out concepts is notable, but the overall significance hinges on whether the velocity field provides genuine generalization benefits beyond capacity increases.

major comments (3)
  1. [Experiments] The central claim that FLAS outperforms prompting specifically because it drops the fixed/single-step/position-invariant restrictions is load-bearing but unsupported by ablations; the learned v_t(h,t,c) has higher capacity than parameter-free steering vectors, so gains on AxBench could arise from expressivity fitting the benchmark distribution rather than curved trajectories per se. No comparison to simpler position-dependent or multi-step baselines is described.
  2. [Results] The held-out harmonic means (1.015/1.113) are reported as point estimates without error bars, number of runs, or statistical significance tests, undermining the assertion of 'consistent' outperformance and making it impossible to assess reliability of the gaps over prompting.
  3. [Method] No details on regularization, capacity controls, or overfitting mitigation for the concept-conditioned velocity field are provided, leaving the overfitting risk unaddressed despite the learned nature of the model and the emphasis on generalization to unseen concepts.
minor comments (2)
  1. [Method] The notation v_t(h,t,c) is introduced without an explicit equation defining the flow ODE or the training objective; adding this would clarify how the velocity field is optimized.
  2. [Introduction] The abstract states FLAS is 'the first learned method' to outperform prompting, but this would be strengthened by a brief comparison table in the introduction or related work section citing the specific prior learned methods evaluated on AxBench.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below and outline revisions to strengthen the empirical support, statistical reporting, and methodological transparency of the manuscript.

  1. Referee: [Experiments] The central claim that FLAS outperforms prompting specifically because it drops the fixed/single-step/position-invariant restrictions is load-bearing but unsupported by ablations; the learned v_t(h,t,c) has higher capacity than parameter-free steering vectors, so gains on AxBench could arise from expressivity fitting the benchmark distribution rather than curved trajectories per se. No comparison to simpler position-dependent or multi-step baselines is described.

    Authors: We acknowledge that isolating the contribution of the relaxed assumptions requires targeted controls. While the velocity field is trained jointly across concepts without per-concept tuning (providing a form of capacity regularization), we agree that direct comparisons to simpler position-dependent single-step or multi-step baselines would better attribute gains to trajectory curvature rather than expressivity alone. The qualitative flow visualizations already illustrate multi-step, token-varying paths, but we will add the requested baseline comparisons in the revised experiments section. revision: partial

  2. Referee: [Results] The held-out harmonic means (1.015/1.113) are reported as point estimates without error bars, number of runs, or statistical significance tests, undermining the assertion of 'consistent' outperformance and making it impossible to assess reliability of the gaps over prompting.

    Authors: We agree that reporting only point estimates limits assessment of reliability. In the revision we will rerun the held-out evaluations with multiple random seeds, report standard deviations or error bars, and include statistical significance tests (e.g., paired t-tests) against the prompting baseline to substantiate the consistency claim. revision: yes

  3. Referee: [Method] No details on regularization, capacity controls, or overfitting mitigation for the concept-conditioned velocity field are provided, leaving the overfitting risk unaddressed despite the learned nature of the model and the emphasis on generalization to unseen concepts.

    Authors: We will expand the method section to specify the regularization (weight decay, flow-matching loss weighting), network capacity (layer sizes and conditioning mechanism), and training protocol (early stopping on a validation split of concepts) used to mitigate overfitting. These details will clarify how generalization to held-out concepts is encouraged. revision: yes
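The second response proposes paired t-tests across seeds. As a sketch of what that computation involves, assume each random seed yields one held-out harmonic mean for FLAS and one for the prompting baseline on the same evaluation set. The function name `paired_t_statistic` is hypothetical, and the block deliberately contains no scores, since this page reports none per seed.

```python
import math
import statistics
from typing import Sequence, Tuple


def paired_t_statistic(
    method_scores: Sequence[float], baseline_scores: Sequence[float]
) -> Tuple[float, int]:
    """Paired t statistic and degrees of freedom for matched score lists.

    Entry i of each list must come from the same seed / evaluation split.
    """
    if len(method_scores) != len(baseline_scores):
        raise ValueError("score lists must have the same length")
    n = len(method_scores)
    if n < 2:
        raise ValueError("need at least two paired observations")
    diffs = [m - b for m, b in zip(method_scores, baseline_scores)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0.0:
        raise ValueError("all paired differences are identical; t is undefined")
    t_stat = mean_diff / (sd_diff / math.sqrt(n))
    return t_stat, n - 1
```

The p-value follows from the Student t distribution with the returned degrees of freedom. `scipy.stats.ttest_rel` computes the same statistic together with a p-value if SciPy is available. With only a handful of seeds the test has little power, so reporting the per-seed differences alongside it would be more informative than the p-value alone.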

Circularity Check

0 steps flagged

No circularity; empirical learning and held-out evaluation are independent of inputs


The paper learns a concept-conditioned velocity field v_t(h,t,c) from training data and reports empirical outperformance on held-out AxBench concepts (harmonic means 1.015/1.113). No equations reduce the claimed superiority to a quantity defined by construction from the same inputs, nor do any load-bearing steps rely on self-citations that themselves assume the target result. The hypothesis about prior methods' fixed/single-step assumptions is tested via the new model's performance rather than presupposed; the derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Based on abstract only; the velocity field is learned rather than hand-specified, so no obvious hand-tuned free parameters are visible. The approach assumes activation space admits a useful continuous flow without stating explicit axioms.

invented entities (1)
  • concept-conditioned velocity field v_t(h,t,c) (no independent evidence)
    purpose: To define a general transport map from unsteered to steered activations that varies with time, position, and concept
    Introduced as the core modeling choice to relax prior fixed-transform assumptions; no independent evidence provided in abstract

pith-pipeline@v0.9.0 · 5489 in / 1198 out tokens · 30239 ms · 2026-05-08T10:54:36.246612+00:00 · methodology


Reference graph

Works this paper leans on

43 extracted references · 7 canonical work pages · 2 internal anchors

  [1] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, and Amanda Askell, et al. Language models are few-shot learners, July 2020

  [2] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, and Alex Vaughan, et al. The llama 3 herd of models, November 2024

  [3] Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, and Alexandre Ramé, et al. Gemma 2: Improving open language models at a practical size, October 2024

  [4] Usman Anwar, Abulhair Saparov, Javier Rando, Daniel Paleka, Miles Turpin, Peter Hase, Ekdeep Singh Lubana, Erik Jenner, Stephen Casper, and Oliver Sourbut, et al. Foundational challenges in assuring alignment and safety of large language models. Transactions on Machine Learning Research, 2024. ISSN 2835-8856. URL https://openreview.net/forum?id=oVTkOs8Pka...

  [5] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models, October 2021

  [6] Suhas Kotha, Jacob Mitchell Springer, and Aditi Raghunathan. Understanding catastrophic forgetting in language models via implicit inference, April 2024

  [7] Yun Luo, Zhen Yang, Fandong Meng, Yafu Li, Jie Zhou, and Yue Zhang. An empirical study of catastrophic forgetting in large language models during continual fine-tuning, 2025. URL https://arxiv.org/abs/2308.08747

  [8] Seongheon Park, Xuefeng Du, Min-Hsuan Yeh, Haobo Wang, and Yixuan Li. Steer llm latents for hallucination detection, 2025. URL https://arxiv.org/abs/2503.01917

  [9] Gerard Boxo, Ryan Socha, Daniel Yoo, and Shivam Raval. Caught in the act: a mechanistic approach to detecting deception, 2025. URL https://arxiv.org/abs/2508.19505

  [10] Bruce W. Lee, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Erik Miehling, Pierre Dognin, Manish Nagireddy, and Amit Dhurandhar. Programming refusal with conditional activation steering, February 2025

  [11] Michel Frising and Daniel Balcells. Linear personality probing and steering in llms: A big five study, January 2026

  [12] Shu Yang, Shenzhe Zhu, Liang Liu, Lijie Hu, Mengdi Li, and Di Wang. Exploring the personality traits of llms through latent features steering, 2025. URL https://arxiv.org/abs/2410.10863

  [13] Mario Giulianelli, Jacqueline Harding, Florian Mohnert, Dieuwke Hupkes, and Willem Zuidema. Under the hood: Using diagnostic classifiers to investigate and improve how language models track agreement information, 2021. URL https://arxiv.org/abs/1808.08079

  [14] Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J. Vazquez, Ulisse Mini, and Monte MacDiarmid. Steering language models with activation engineering, October 2024

  [15] Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, and Ann-Kathrin Dombrowski, et al. Representation engineering: A top-down approach to ai transparency, March 2025

  [16] Zhengxuan Wu, Aryaman Arora, Atticus Geiger, Zheng Wang, Jing Huang, Dan Jurafsky, Christopher D. Manning, and Christopher Potts. Axbench: Steering llms? even simple baselines outperform sparse autoencoders, March 2025

  [17] Kiho Park, Yo Joong Choe, and Victor Veitch. The linear representation hypothesis and the geometry of large language models, July 2024

  [18] Nina Panickssery, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, and Alexander Matt Turner. Steering llama 2 via contrastive activation addition, July 2024

  [19] Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, and Andy Jones, et al. Scaling monosemanticity: Extracting interpretable features from claude 3 sonnet. Transformer Circuits Thread, 2024. URL https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html

  [20] Zhengxuan Wu, Aryaman Arora, Zheng Wang, Atticus Geiger, Dan Jurafsky, Christopher D. Manning, and Christopher Potts. Reft: Representation finetuning for language models, May 2024

  [21] Pau Rodriguez, Arno Blaas, Michal Klein, Luca Zappella, Nicholas Apostoloff, Marco Cuturi, and Xavier Suau. Controlling language and diffusion models by transporting activations, November 2024

  [22] Zejia You, Chunyuan Deng, and Hanjie Chen. Spherical steering: Geometry-aware activation rotation for language models, February 2026

  [23] Shivam Raval, Hae Jin Song, Linlin Wu, Abir Harrasse, Jeff M. Phillips, Fazl Barez, and Amirali Abdullah. Curveball steering: The right direction to steer isn't always linear, March 2026

  [24] Narmeen Oozeer, Luke Marks, Shreyans Jain, Fazl Barez, and Amirali Abdullah. Beyond linear steering: Unified multi-attribute control for language models, April 2026

  [25] Jiuding Sun, Sidharth Baskaran, Zhengxuan Wu, Michael Sklar, Christopher Potts, and Atticus Geiger. Hypersteer: Activation steering at scale with hypernetworks, June 2025

  [26] Hanyu Wang, Bochuan Cao, Yuanpu Cao, and Jinghui Chen. Truthflow: Truthful llm generation via representation flow correction, February 2025

  [27] Yawei Li, Benjamin Bergner, Yinghan Zhao, Vihang Prakash Patil, Bei Chen, and Cheng Wang. Steering large reasoning models towards concise reasoning via flow matching, February 2026

  [28] Hongjue Zhao, Haosen Sun, Jiangtao Kong, Xiaochang Li, Qineng Wang, Liwei Jiang, Qi Zhu, Tarek Abdelzaher, Yejin Choi, and Manling Li, et al. Odesteer: A unified ode-based steering framework for llm alignment, February 2026

  [29] Van-Cuong Pham and Thien Huu Nguyen. Householder pseudo-rotation: A novel approach to activation editing in llms with direction-magnitude perspective, December 2024

  [30] Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling, February 2023

  [31] Alexander Tong, Kilian Fatras, Nikolay Malkin, Guillaume Huguet, Yanlei Zhang, Jarrid Rector-Brooks, Guy Wolf, and Yoshua Bengio. Improving and generalizing flow-based generative models with minibatch optimal transport, March 2024

  [32] Yaron Lipman, Marton Havasi, Peter Holderrieth, Neta Shaul, Matt Le, Brian Karrer, Ricky T. Q. Chen, David Lopez-Paz, Heli Ben-Hamu, and Itai Gat. Flow matching guide and code, December 2024

  [33] Heli Ben-Hamu, Samuel Cohen, Joey Bose, Brandon Amos, Aditya Grover, Maximilian Nickel, Ricky T. Q. Chen, and Yaron Lipman. Matching normalizing flows and probability paths on manifolds, July 2022

  [34] Quan Dao, Hao Phung, Binh Nguyen, and Anh Tran. Flow matching in latent space, July 2023

  [35] Alexander Modell, Patrick Rubin-Delanchy, and Nick Whiteley. The origins of representation manifolds in large language models, May 2025

  [36] Lucrezia Valeriani, Diego Doimo, Francesca Cuturello, Alessandro Laio, Alessio Ansuini, and Alberto Cazzaniga. The geometry of hidden representations of large transformer models, October 2023

  [37] Mohamed A. Mabrok. Latent semantic manifolds in large language models, March 2026

  [38] Elena Facco, Maria d'Errico, Alex Rodriguez, and Alessandro Laio. Estimating the intrinsic dimension of datasets by a minimal neighborhood information. Scientific Reports, 7(1):12140, September 2017. ISSN 2045-2322. doi: 10.1038/s41598-017-11873-y

  [39] Viacheslav Yusupov, Danil Maksimov, Ameliia Alaeva, Anna Vasileva, Anna Antipina, Tatyana Zaitseva, Alina Ermilova, Evgeny Burnaev, and Egor Shvetsov. From internal representations to text quality: A geometric approach to llm evaluation, September 2025

  [40] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance, July 2022

  [41] OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, and Sam Altman, et al. Gpt-4 technical report, March 2024

  [42] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, and Chenxu Lv, et al. Qwen3 technical report. https://arxiv.org/abs/2505.09388v1, May 2025

  [43] Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B. Hashimoto. Length-controlled alpacaeval: A simple way to debias automatic evaluators, March 2025

A Training Details

Parameter                        Value
Base model                       Gemma-2-2B-IT / Gemma-2-9B-IT (frozen)
Steering layer ℓ                 20
FlowBlock count B                1
FlowBlock trainable parameters   97.6M on 2B, 255M on 9B
Optimizer                        AdamW with ...