Pith · machine review for the scientific record

arXiv: 2605.14323 · v1 · submitted 2026-05-14 · 💻 cs.LG · cs.AI · cs.CL

Recognition: 2 theorem links · Lean Theorem

Dynamic Latent Routing

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 01:45 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords: dynamic latent routing · language model fine-tuning · discrete latent codes · policy composition · MDP sub-policies · low-data adaptation · routing policies

The pith

Dynamic Latent Routing composes learned sub-policies, following a provably optimal MDP search principle, to improve low-data language model fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that principles of optimal search in Markov Decision Processes with time-varying rewards can be transferred to language model post-training. It proves that globally optimal goal-reaching policies arise from temporal composition of intermediate sub-policies and then builds Dynamic Latent Routing to apply the underlying search-select-update process. In this method, discrete latent codes, routing policies, and model parameters are learned jointly in one training stage. The result is that DLR matches or exceeds standard supervised fine-tuning on limited data while earlier discrete-latent approaches fall short, and the learned routes exhibit distinct causal roles.

Core claim

General Dijkstra Search shows that optimal policies in MDPs with changing rewards can be recovered exactly by concatenating intermediate optimal sub-policies over time. Dynamic Latent Routing implements the search-select-update principle from this result inside language model training: it performs dynamic search over discrete latent codes to select and compose sub-policies, updating the model parameters in the same stage, yielding structured routing that improves adaptation when data is scarce.
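The "search, select, update" loop is easiest to picture as a best-first search over temporally concatenated sub-policies. The Python sketch below is an illustration under assumed helpers (`value_of`, `subpolicies_by_step`), not the paper's GDS algorithm, which additionally tracks goal-set dominance when deciding which compositions to keep.

```python
import heapq
from itertools import count

def gds_sketch(start_state, subpolicies_by_step, value_of, horizon):
    """Best-first search over composed sub-policies (a sketch of the
    search-select-update idea, not the paper's exact GDS procedure).
    `value_of(s, pis)` is a hypothetical helper returning the value of
    running the composed sub-policies `pis` from state `s`."""
    tie = count()  # tie-breaker so the heap never compares policy objects
    queue = [(-value_of(start_state, (pi,)), next(tie), (pi,))
             for pi in subpolicies_by_step[0]]
    heapq.heapify(queue)
    while queue:
        _neg_val, _, composed = heapq.heappop(queue)          # SELECT: best composition so far
        if len(composed) == horizon:                          # full-horizon policy found
            return composed
        for pi_next in subpolicies_by_step[len(composed)]:    # SEARCH: one-step extensions
            extended = composed + (pi_next,)
            heapq.heappush(queue, (-value_of(start_state, extended),
                                   next(tie), extended))      # UPDATE the frontier
    return None
```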

What carries the argument

Dynamic Latent Routing (DLR), a single-stage training procedure that jointly optimizes discrete latent codes and routing policies via dynamic search to compose sub-policies.
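A minimal sketch of how that loop might sit inside a single supervised training step is below. The interfaces of `model` and `routing_head` and the exact form of L_DLR are assumptions for illustration, not the authors' implementation.

```python
import torch

def dlr_training_step(model, routing_head, tokens, n_candidates, optimizer):
    """Sketch of the per-step SEARCH / SELECT / UPDATE cycle described for
    DLR: sample N candidate code sequences from the routing head, keep the
    one the model scores highest, then take one joint gradient step."""
    with torch.no_grad():                                      # SEARCH: sample candidate code sequences
        candidates = [routing_head.sample(tokens) for _ in range(n_candidates)]
        scores = [model.log_prob(tokens, codes=a) for a in candidates]
    best = max(zip(scores, candidates), key=lambda sc: float(sc[0]))[1]  # SELECT: argmax p_theta(x | a)

    optimizer.zero_grad()
    loss = (-model.log_prob(tokens, codes=best)                # UPDATE: joint surrogate for L_DLR,
            - routing_head.log_prob(best, tokens))             # gradients reach codes, router, and model
    loss.backward()
    optimizer.step()
    return float(loss)
```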

Load-bearing premise

The optimality guarantees and search principle from General Dijkstra Search in MDPs transfer effectively to the non-stationary, high-dimensional setting of language model post-training without introducing hidden biases or optimization instabilities.

What would settle it

An experiment in which DLR fails to match or exceed supervised fine-tuning performance on the four datasets and six models, or in which removing the dynamic search component eliminates any observed gains.

Figures

Figures reproduced from arXiv: 2605.14323 by Amir Abdullah, Fangyuan Yu, Xin Su.

Figure 1. DLR method overview. Left: chunk-level steering: each chunk m is routed to a discrete code a_m, which indexes a steering vector α·e_{a_m} from the codebook and is added in place to every hidden state h_1, …, h_K in the chunk. Right: per-step search: N code sequences are sampled from the routing head (SEARCH), the one maximizing p_θ(x | a) is selected (SELECT), and L_DLR is used to jointly update the codes, routing head, and model parameters.
Figure 2. Fraction of N-grams with above-threshold topic purity vs. N.

  Model        steered   scale→0   rand. replace
  Qwen3-0.6B   55.3%     −6.2      −4.8
  Qwen3-1.7B   64.1%     −6.4      −5.7
  Qwen3-4B     72.3%     −12.2     −8.1
  Qwen3-8B     80.6%     −17.4     −11.7
Figure 3. Per-topic ∆acc (pp) under single-code ablation.
Figure 4. Six-digit addition 959,271 + 040,756 = 1,000,027, a four-deep carry cascade. At each answer-digit position DLR assigns one abstraction code (bottom row). Codes t2 and t6 cluster on cascade positions (UC/US); t16 marks the carry source (SC); t3 marks the trivial position (SA). Code assignments from model add_sub_sorl_v1_abs30_K1_100K (K=1, 30-code codebook).
Figure 5. Code-subtask heatmap for 2L/1H/128d (100K). Each cell shows P(subtask | code), the fraction of that code's occurrences that fall into each Quirke subtask, over 2,600 held-out examples across all addition and subtraction splits. Rows are the 23 active codes (of 30 in the codebook), sorted by dominant subtask; columns are the 10 subtask labels (SA, SC, SS, UC, US for addition; MD, MB, ME, UB, UD for subtraction).
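For concreteness, the chunk-level steering described in Figure 1 (left) amounts to adding a code-indexed, scaled codebook vector to every hidden state in a chunk. A minimal sketch follows, with tensor shapes and the scaling convention assumed for illustration rather than taken from the paper.

```python
import torch

def apply_chunk_steering(hidden, chunk_codes, codebook, alpha, chunk_len):
    """Add alpha * e_{a_m} to every hidden state in chunk m.
      hidden:      (seq_len, d_model) hidden states of one sequence
      chunk_codes: (num_chunks,) long tensor of discrete codes a_m
      codebook:    (num_codes, d_model) steering-vector codebook e
    """
    steered = hidden.clone()
    for m, code in enumerate(chunk_codes):
        start = m * chunk_len
        end = min(start + chunk_len, hidden.size(0))
        steered[start:end] += alpha * codebook[code]   # in-place add over h_1..h_K of chunk m
    return steered
```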
Original abstract

We investigate the temporal concatenation of sub-policies in Markov Decision Processes (MDP) with time-varying reward functions. We introduce General Dijkstra Search (GDS), and prove that globally optimal goal-reaching policies can be recovered through temporal composition of intermediate optimal sub-policies. Motivated by the "search, select, update" principle underlying GDS, we propose Dynamic Latent Routing (DLR), a language-model post-training method that jointly learns discrete latent codes, routing policies, and model parameters through dynamic search in a single training stage. In low-data fine-tuning settings, DLR matches or outperforms supervised fine-tuning across four datasets and six models, achieving a mean gain of +6.6 percentage points, while prior discrete-latent baselines consistently underperform SFT. Mechanistic analyses and targeted code ablations show that DLR learns structured routing behaviors with distinct causal roles.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proves that globally optimal goal-reaching policies in MDPs with time-varying rewards can be recovered via temporal composition of optimal sub-policies using General Dijkstra Search (GDS). Motivated by the 'search, select, update' principle, it introduces Dynamic Latent Routing (DLR) for language-model post-training: a single-stage method that jointly optimizes discrete latent codes, a routing policy, and model parameters. In low-data fine-tuning, DLR matches or exceeds supervised fine-tuning (SFT) by a mean +6.6 percentage points across four datasets and six models, while prior discrete-latent methods underperform SFT; mechanistic analyses indicate structured routing with distinct causal roles.

Significance. If the empirical gains are shown to arise from the GDS-derived routing mechanism rather than auxiliary regularization, the work could offer a principled route to structured, sample-efficient post-training of language models. The reported mean gain, breadth of models/datasets, and mechanistic ablations provide concrete evidence worth further scrutiny; however, the absence of a formal link between the MDP optimality result and the LM objective limits the theoretical significance.

major comments (3)
  1. [Motivation and Method] Motivation section: the claim that DLR implements the GDS 'search, select, update' principle in the LM setting lacks a derivation showing that the jointly learned discrete codes and routing recover sub-policy optimality (or avoid optimization instabilities) under a fixed supervised loss on token sequences rather than explicit time-varying MDP rewards.
  2. [Experiments] Experimental results: the reported +6.6 pp mean gain over SFT is presented without per-run variance, number of random seeds, or statistical significance tests; this makes it impossible to determine whether the advantage is robust or could be explained by differences in effective capacity or regularization between DLR and the discrete-latent baselines.
  3. [Theoretical Analysis] Proof of GDS optimality: the global-optimality guarantee is stated for finite-state MDPs with time-varying rewards, yet the manuscript does not address how (or whether) the same composition principle extends to the non-stationary, countably infinite state space of autoregressive language models without introducing hidden biases in the learned routing policy.
minor comments (2)
  1. [Method] Notation for the routing policy and latent code distribution should be introduced with explicit equations rather than prose descriptions to improve reproducibility.
  2. [Analysis] The abstract states 'mechanistic analyses and targeted code ablations' but the manuscript would benefit from a dedicated subsection listing the exact ablations performed and the quantitative effect sizes observed.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and have revised the manuscript to improve clarity, reporting, and discussion of limitations.

Point-by-point responses
  1. Referee: [Motivation and Method] Motivation section: the claim that DLR implements the GDS 'search, select, update' principle in the LM setting lacks a derivation showing that the jointly learned discrete codes and routing recover sub-policy optimality (or avoid optimization instabilities) under a fixed supervised loss on token sequences rather than explicit time-varying MDP rewards.

    Authors: We agree that the connection is motivational rather than a formal derivation. DLR is inspired by the search-select-update principle but operates under the standard next-token prediction loss without explicit MDP rewards or optimality guarantees. In the revised manuscript we have clarified this distinction in the motivation section, removed any implication of recovering sub-policy optimality, and added a short discussion of how joint optimization in practice avoids certain instabilities observed in prior discrete-latent methods. revision: partial

  2. Referee: [Experiments] Experimental results: the reported +6.6 pp mean gain over SFT is presented without per-run variance, number of random seeds, or statistical significance tests; this makes it impossible to determine whether the advantage is robust or could be explained by differences in effective capacity or regularization between DLR and the discrete-latent baselines.

    Authors: We accept this criticism. The revised version now reports results averaged over 5 random seeds with standard deviations for all models and datasets. We also include paired t-tests showing that the +6.6 pp mean gain over SFT is statistically significant (p < 0.05) in the majority of settings. Additional controls matching effective parameter count and regularization strength between DLR and baselines have been added to the experimental section. revision: yes
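For readers checking that protocol, a minimal sketch of a seed-averaged paired comparison of the kind the rebuttal describes is below; the input array layout and helper name are hypothetical, and the paper's own numbers are not reproduced here.

```python
import numpy as np
from scipy.stats import ttest_rel

def compare_to_sft(dlr_scores, sft_scores):
    """dlr_scores and sft_scores: (settings, seeds) arrays of accuracies for
    matched dataset/model settings. Returns mean gain (points), t, and p."""
    dlr_mean = np.asarray(dlr_scores).mean(axis=1)    # average over random seeds
    sft_mean = np.asarray(sft_scores).mean(axis=1)
    t_stat, p_value = ttest_rel(dlr_mean, sft_mean)   # pair by dataset/model setting
    return (dlr_mean - sft_mean).mean(), t_stat, p_value
```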

  3. Referee: [Theoretical Analysis] Proof of GDS optimality: the global-optimality guarantee is stated for finite-state MDPs with time-varying rewards, yet the manuscript does not address how (or whether) the same composition principle extends to the non-stationary, countably infinite state space of autoregressive language models without introducing hidden biases in the learned routing policy.

    Authors: The global-optimality result is proven only for finite-state MDPs; we do not claim it transfers directly to the countably infinite, non-stationary state space of autoregressive LMs. The manuscript presents DLR as a heuristic motivated by GDS rather than a formal extension. We have added a dedicated limitations paragraph discussing the challenges of extending temporal composition to infinite state spaces and potential routing biases, supported by the existing mechanistic analyses that show structured rather than biased routing in practice. A rigorous theoretical bridge remains future work. revision: partial

Circularity Check

0 steps flagged

No circularity: GDS optimality proof and DLR empirical gains remain independent

full rationale

The paper first proves global optimality for temporal composition of sub-policies under time-varying MDP rewards via General Dijkstra Search. It then motivates Dynamic Latent Routing by the 'search, select, update' principle and reports direct empirical measurements of +6.6 pp mean gains over SFT in low-data LM fine-tuning across four datasets and six models. No equation, fitted parameter, or self-citation reduces the reported performance numbers to quantities defined by the MDP proof or by construction; the LM results are presented as measured outcomes under a fixed supervised loss, with no derivation claiming that the learned routing recovers MDP optimality. The central claim is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the transfer of MDP optimality results to language-model training and on the assumption that dynamic search during fine-tuning discovers useful discrete latents without additional stages.

axioms (1)
  • domain assumption: Globally optimal goal-reaching policies can be recovered through temporal composition of intermediate optimal sub-policies.
    Stated as the key proved property of General Dijkstra Search that motivates DLR.


