Pith · machine review for the scientific record

arXiv: 2605.14323 · v1 · submitted 2026-05-14 · 💻 cs.LG · cs.AI · cs.CL

Recognition: 2 theorem links · Lean Theorem

Dynamic Latent Routing

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 01:45 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords: dynamic latent routing · language model fine-tuning · discrete latent codes · policy composition · MDP sub-policies · low-data adaptation · routing policies

The pith

Dynamic Latent Routing composes learned sub-policies, following a provably optimal MDP search principle, to improve low-data language model fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that principles of optimal search in Markov Decision Processes with time-varying rewards can be transferred to language model post-training. It proves that globally optimal goal-reaching policies arise from temporal composition of intermediate sub-policies and then builds Dynamic Latent Routing to apply the underlying search-select-update process. In this method, discrete latent codes, routing policies, and model parameters are learned jointly in one training stage. The result is that DLR matches or exceeds standard supervised fine-tuning on limited data while earlier discrete-latent approaches fall short, and the learned routes exhibit distinct causal roles.

Core claim

General Dijkstra Search shows that optimal policies in MDPs with changing rewards can be recovered exactly by concatenating intermediate optimal sub-policies over time. Dynamic Latent Routing implements the search-select-update principle from this result inside language model training: it performs dynamic search over discrete latent codes to select and compose sub-policies, updating the model parameters in the same stage, yielding structured routing that improves adaptation when data is scarce.
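The "search, select, update" loop is easiest to picture as a best-first search over temporally concatenated sub-policies. The Python sketch below is an illustration under assumed helpers (`value_of`, `subpolicies_by_step`), not the paper's GDS algorithm, which additionally tracks goal-set dominance when deciding which compositions to keep.

```python
import heapq
from itertools import count

def gds_sketch(start_state, subpolicies_by_step, value_of, horizon):
    """Best-first search over composed sub-policies (a sketch of the
    search-select-update idea, not the paper's exact GDS procedure).
    `value_of(s, pis)` is a hypothetical helper returning the value of
    running the composed sub-policies `pis` from state `s`."""
    tie = count()  # tie-breaker so the heap never compares policy objects
    queue = [(-value_of(start_state, (pi,)), next(tie), (pi,))
             for pi in subpolicies_by_step[0]]
    heapq.heapify(queue)
    while queue:
        _neg_val, _, composed = heapq.heappop(queue)          # SELECT: best composition so far
        if len(composed) == horizon:                          # full-horizon policy found
            return composed
        for pi_next in subpolicies_by_step[len(composed)]:    # SEARCH: one-step extensions
            extended = composed + (pi_next,)
            heapq.heappush(queue, (-value_of(start_state, extended),
                                   next(tie), extended))      # UPDATE the frontier
    return None
```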

What carries the argument

Dynamic Latent Routing (DLR), a single-stage training procedure that jointly optimizes discrete latent codes and routing policies via dynamic search to compose sub-policies.
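A minimal sketch of how that loop might sit inside a single supervised training step is below. The interfaces of `model` and `routing_head` and the exact form of L_DLR are assumptions for illustration, not the authors' implementation.

```python
import torch

def dlr_training_step(model, routing_head, tokens, n_candidates, optimizer):
    """Sketch of the per-step SEARCH / SELECT / UPDATE cycle described for
    DLR: sample N candidate code sequences from the routing head, keep the
    one the model scores highest, then take one joint gradient step."""
    with torch.no_grad():                                      # SEARCH: sample candidate code sequences
        candidates = [routing_head.sample(tokens) for _ in range(n_candidates)]
        scores = [model.log_prob(tokens, codes=a) for a in candidates]
    best = max(zip(scores, candidates), key=lambda sc: float(sc[0]))[1]  # SELECT: argmax p_theta(x | a)

    optimizer.zero_grad()
    loss = (-model.log_prob(tokens, codes=best)                # UPDATE: joint surrogate for L_DLR,
            - routing_head.log_prob(best, tokens))             # gradients reach codes, router, and model
    loss.backward()
    optimizer.step()
    return float(loss)
```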

Load-bearing premise

The optimality guarantees and search principle from General Dijkstra Search in MDPs transfer effectively to the non-stationary, high-dimensional setting of language model post-training without introducing hidden biases or optimization instabilities.

What would settle it

An experiment in which DLR fails to match or exceed supervised fine-tuning performance on the four datasets and six models, or in which removing the dynamic search component eliminates any observed gains.

Figures

Figures reproduced from arXiv: 2605.14323 by Amir Abdullah, Fangyuan Yu, Xin Su.

Figure 1. DLR method overview. Left: chunk-level steering: each chunk m is routed to a discrete code a_m, which indexes a steering vector α·e_{a_m} from the codebook and is added in place to every hidden state h_1, …, h_K in the chunk. Right: per-step search: N code sequences are sampled from the routing head (SEARCH), the one maximizing p_θ(x | a) is selected (SELECT), and L_DLR is used to jointly update the codes, routing head, and model parameters.
Figure 2. Fraction of N-grams with above-threshold topic purity vs. N.

  Model        steered   scale→0   rand. replace
  Qwen3-0.6B   55.3%     −6.2      −4.8
  Qwen3-1.7B   64.1%     −6.4      −5.7
  Qwen3-4B     72.3%     −12.2     −8.1
  Qwen3-8B     80.6%     −17.4     −11.7
Figure 3. Per-topic ∆acc (pp) under single-code ablation.
Figure 4. Six-digit addition 959,271 + 040,756 = 1,000,027, a four-deep carry cascade. At each answer-digit position DLR assigns one abstraction code (bottom row). Codes t2 and t6 cluster on cascade positions (UC/US); t16 marks the carry source (SC); t3 marks the trivial position (SA). Code assignments from model add_sub_sorl_v1_abs30_K1_100K (K=1, 30-code codebook).
Figure 5. Code-subtask heatmap for 2L/1H/128d (100K). Each cell shows P(subtask | code), the fraction of that code's occurrences that fall into each Quirke subtask, over 2,600 held-out examples across all addition and subtraction splits. Rows are the 23 active codes (of 30 in the codebook), sorted by dominant subtask; columns are the 10 subtask labels (SA, SC, SS, UC, US for addition; MD, MB, ME, UB, UD for subtraction).
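For concreteness, the chunk-level steering described in Figure 1 (left) amounts to adding a code-indexed, scaled codebook vector to every hidden state in a chunk. A minimal sketch follows, with tensor shapes and the scaling convention assumed for illustration rather than taken from the paper.

```python
import torch

def apply_chunk_steering(hidden, chunk_codes, codebook, alpha, chunk_len):
    """Add alpha * e_{a_m} to every hidden state in chunk m.
      hidden:      (seq_len, d_model) hidden states of one sequence
      chunk_codes: (num_chunks,) long tensor of discrete codes a_m
      codebook:    (num_codes, d_model) steering-vector codebook e
    """
    steered = hidden.clone()
    for m, code in enumerate(chunk_codes):
        start = m * chunk_len
        end = min(start + chunk_len, hidden.size(0))
        steered[start:end] += alpha * codebook[code]   # in-place add over h_1..h_K of chunk m
    return steered
```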
Original abstract

We investigate the temporal concatenation of sub-policies in Markov Decision Processes (MDP) with time-varying reward functions. We introduce General Dijkstra Search (GDS), and prove that globally optimal goal-reaching policies can be recovered through temporal composition of intermediate optimal sub-policies. Motivated by the "search, select, update" principle underlying GDS, we propose Dynamic Latent Routing (DLR), a language-model post-training method that jointly learns discrete latent codes, routing policies, and model parameters through dynamic search in a single training stage. In low-data fine-tuning settings, DLR matches or outperforms supervised fine-tuning across four datasets and six models, achieving a mean gain of +6.6 percentage points, while prior discrete-latent baselines consistently underperform SFT. Mechanistic analyses and targeted code ablations show that DLR learns structured routing behaviors with distinct causal roles.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proves that globally optimal goal-reaching policies in MDPs with time-varying rewards can be recovered via temporal composition of optimal sub-policies using General Dijkstra Search (GDS). Motivated by the 'search, select, update' principle, it introduces Dynamic Latent Routing (DLR) for language-model post-training: a single-stage method that jointly optimizes discrete latent codes, a routing policy, and model parameters. In low-data fine-tuning, DLR matches or exceeds supervised fine-tuning (SFT) by a mean +6.6 percentage points across four datasets and six models, while prior discrete-latent methods underperform SFT; mechanistic analyses indicate structured routing with distinct causal roles.

Significance. If the empirical gains are shown to arise from the GDS-derived routing mechanism rather than auxiliary regularization, the work could offer a principled route to structured, sample-efficient post-training of language models. The reported mean gain, breadth of models/datasets, and mechanistic ablations provide concrete evidence worth further scrutiny; however, the absence of a formal link between the MDP optimality result and the LM objective limits the theoretical significance.

major comments (3)
  1. [Motivation and Method] Motivation section: the claim that DLR implements the GDS 'search, select, update' principle in the LM setting lacks a derivation showing that the jointly learned discrete codes and routing recover sub-policy optimality (or avoid optimization instabilities) under a fixed supervised loss on token sequences rather than explicit time-varying MDP rewards.
  2. [Experiments] Experimental results: the reported +6.6 pp mean gain over SFT is presented without per-run variance, number of random seeds, or statistical significance tests; this makes it impossible to determine whether the advantage is robust or could be explained by differences in effective capacity or regularization between DLR and the discrete-latent baselines.
  3. [Theoretical Analysis] Proof of GDS optimality: the global-optimality guarantee is stated for finite-state MDPs with time-varying rewards, yet the manuscript does not address how (or whether) the same composition principle extends to the non-stationary, countably infinite state space of autoregressive language models without introducing hidden biases in the learned routing policy.
minor comments (2)
  1. [Method] Notation for the routing policy and latent code distribution should be introduced with explicit equations rather than prose descriptions to improve reproducibility.
  2. [Analysis] The abstract states 'mechanistic analyses and targeted code ablations' but the manuscript would benefit from a dedicated subsection listing the exact ablations performed and the quantitative effect sizes observed.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and have revised the manuscript to improve clarity, reporting, and discussion of limitations.

Point-by-point responses
  1. Referee: [Motivation and Method] Motivation section: the claim that DLR implements the GDS 'search, select, update' principle in the LM setting lacks a derivation showing that the jointly learned discrete codes and routing recover sub-policy optimality (or avoid optimization instabilities) under a fixed supervised loss on token sequences rather than explicit time-varying MDP rewards.

    Authors: We agree that the connection is motivational rather than a formal derivation. DLR is inspired by the search-select-update principle but operates under the standard next-token prediction loss without explicit MDP rewards or optimality guarantees. In the revised manuscript we have clarified this distinction in the motivation section, removed any implication of recovering sub-policy optimality, and added a short discussion of how joint optimization in practice avoids certain instabilities observed in prior discrete-latent methods. revision: partial

  2. Referee: [Experiments] Experimental results: the reported +6.6 pp mean gain over SFT is presented without per-run variance, number of random seeds, or statistical significance tests; this makes it impossible to determine whether the advantage is robust or could be explained by differences in effective capacity or regularization between DLR and the discrete-latent baselines.

    Authors: We accept this criticism. The revised version now reports results averaged over 5 random seeds with standard deviations for all models and datasets. We also include paired t-tests showing that the +6.6 pp mean gain over SFT is statistically significant (p < 0.05) in the majority of settings. Additional controls matching effective parameter count and regularization strength between DLR and baselines have been added to the experimental section. revision: yes
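For readers checking that protocol, a minimal sketch of a seed-averaged paired comparison of the kind the rebuttal describes is below; the input array layout and helper name are hypothetical, and the paper's own numbers are not reproduced here.

```python
import numpy as np
from scipy.stats import ttest_rel

def compare_to_sft(dlr_scores, sft_scores):
    """dlr_scores and sft_scores: (settings, seeds) arrays of accuracies for
    matched dataset/model settings. Returns mean gain (points), t, and p."""
    dlr_mean = np.asarray(dlr_scores).mean(axis=1)    # average over random seeds
    sft_mean = np.asarray(sft_scores).mean(axis=1)
    t_stat, p_value = ttest_rel(dlr_mean, sft_mean)   # pair by dataset/model setting
    return (dlr_mean - sft_mean).mean(), t_stat, p_value
```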

  3. Referee: [Theoretical Analysis] Proof of GDS optimality: the global-optimality guarantee is stated for finite-state MDPs with time-varying rewards, yet the manuscript does not address how (or whether) the same composition principle extends to the non-stationary, countably infinite state space of autoregressive language models without introducing hidden biases in the learned routing policy.

    Authors: The global-optimality result is proven only for finite-state MDPs; we do not claim it transfers directly to the countably infinite, non-stationary state space of autoregressive LMs. The manuscript presents DLR as a heuristic motivated by GDS rather than a formal extension. We have added a dedicated limitations paragraph discussing the challenges of extending temporal composition to infinite state spaces and potential routing biases, supported by the existing mechanistic analyses that show structured rather than biased routing in practice. A rigorous theoretical bridge remains future work. revision: partial

Circularity Check

0 steps flagged

No circularity: GDS optimality proof and DLR empirical gains remain independent

full rationale

The paper first proves global optimality for temporal composition of sub-policies under time-varying MDP rewards via General Dijkstra Search. It then motivates Dynamic Latent Routing by the 'search, select, update' principle and reports direct empirical measurements of +6.6 pp mean gains over SFT in low-data LM fine-tuning across four datasets and six models. No equation, fitted parameter, or self-citation reduces the reported performance numbers to quantities defined by the MDP proof or by construction; the LM results are presented as measured outcomes under a fixed supervised loss, with no derivation claiming that the learned routing recovers MDP optimality. The central claim is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the transfer of MDP optimality results to language-model training and on the assumption that dynamic search during fine-tuning discovers useful discrete latents without additional stages.

axioms (1)
  • domain assumption: Globally optimal goal-reaching policies can be recovered through temporal composition of intermediate optimal sub-policies.
    Stated as the key proved property of General Dijkstra Search that motivates DLR.


