Recognition: 2 theorem links
· Lean Theorem · Dynamic Latent Routing
Pith reviewed 2026-05-15 01:45 UTC · model grok-4.3
The pith
Dynamic Latent Routing composes learned sub-policies, guided by an optimality result for temporal policy composition, to improve low-data language model fine-tuning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
General Dijkstra Search shows that optimal policies in MDPs with changing rewards can be recovered exactly by concatenating intermediate optimal sub-policies over time. Dynamic Latent Routing implements the search-select-update principle from this result inside language model training: it performs dynamic search over discrete latent codes to select and compose sub-policies, updating the model parameters in the same stage, yielding structured routing that improves adaptation when data is scarce.
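The GDS procedure itself is not reproduced on this page, so the following is only a toy analogue under invented dynamics: a deterministic MDP with time-varying step costs, where a globally optimal goal-reaching plan is recovered by Dijkstra search over time-indexed states, i.e. by temporally composing one-step optimal choices.

```python
import heapq
from itertools import product

# Toy illustration (not the paper's GDS): a deterministic chain MDP with
# time-varying step costs. A globally optimal goal-reaching plan is found
# by Dijkstra search over (time, state) pairs, composing per-step choices.

STATES = range(4)
ACTIONS = {0: +1, 1: -1}           # move right / move left on a line
GOAL, HORIZON = 3, 5

def step(s, a):
    return max(0, min(3, s + ACTIONS[a]))

def cost(t, s, a):                 # time-varying cost (stands in for -reward)
    return 1 + (t % 2) * (a == 0)  # moving right is pricier on odd steps

def gds_like_search(s0):
    # Dijkstra over the time-augmented graph: entries are (cost, t, state, plan).
    pq, best = [(0, 0, s0, ())], {}
    while pq:
        c, t, s, plan = heapq.heappop(pq)
        if s == GOAL:
            return c, plan         # first goal pop is globally optimal
        if best.get((t, s), float("inf")) <= c or t == HORIZON:
            continue
        best[(t, s)] = c
        for a in ACTIONS:
            heapq.heappush(pq, (c + cost(t, s, a), t + 1, step(s, a), plan + (a,)))
    return float("inf"), None

def brute_force(s0):
    # Exhaustive check over all plans up to the horizon.
    out = (float("inf"), None)
    for n in range(1, HORIZON + 1):
        for plan in product(ACTIONS, repeat=n):
            s, c = s0, 0
            for t, a in enumerate(plan):
                c += cost(t, s, a)
                s = step(s, a)
            if s == GOAL:
                out = min(out, (c, plan))
    return out

print(gds_like_search(0)[0] == brute_force(0)[0])
```

The search and the exhaustive optimum agree on this toy problem; the paper's claim is the corresponding guarantee for composing optimal sub-policies over reward-change intervals, not this single-step special case.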
What carries the argument
Dynamic Latent Routing (DLR), a single-stage training procedure that jointly optimizes discrete latent codes and routing policies via dynamic search to compose sub-policies.
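The paper's actual objective is not reproduced in this summary; the numpy sketch below (invented setup and losses) only illustrates the single-stage "search, select, update" shape: evaluate every discrete code, select the best-fitting sub-model, and update that sub-model together with the router in the same loop.

```python
import numpy as np

# Schematic sketch only (invented setup, not the paper's objective): K
# discrete latent codes index K linear sub-models. For each example we
# SEARCH over all codes, SELECT the best-fitting sub-model, and UPDATE
# both that sub-model and a router toward the selected code.

rng = np.random.default_rng(0)
K, D, LR = 2, 3, 0.1
experts = rng.normal(size=(K, D))      # one linear "sub-policy" per code
router = np.zeros((K, D))              # logits for code selection

# Data mixing two hidden "skills" (opposite linear maps).
w_true = np.stack([np.ones(D), -np.ones(D)])
X = rng.normal(size=(200, D))
skill = rng.integers(0, 2, size=200)
y = np.einsum("nd,nd->n", X, w_true[skill])

def routed_error():
    # Mean squared error under the best code per example.
    return float(np.mean([np.min((experts @ x - t) ** 2) for x, t in zip(X, y)]))

before = routed_error()
for _ in range(3):                     # single-stage joint training
    for x, t in zip(X, y):
        losses = (experts @ x - t) ** 2         # SEARCH over all codes
        k = int(np.argmin(losses))              # SELECT the best code
        err = experts[k] @ x - t                # UPDATE the selected sub-model...
        experts[k] -= LR * 2 * err * x
        logits = router @ x
        p = np.exp(logits - np.max(logits))
        p /= p.sum()                            # ...and the router (softmax CE grad)
        grad = np.outer(p, x)
        grad[k] -= x
        router -= LR * grad
after = routed_error()
print(after < before)
```

The design point this illustrates is that search, selection, and parameter updates happen in one stage, in contrast to pipelines that first fix discrete codes and then train a model on top of them.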
Load-bearing premise
The optimality guarantees and search principle from General Dijkstra Search in MDPs transfer effectively to the non-stationary, high-dimensional setting of language model post-training without introducing hidden biases or optimization instabilities.
What would settle it
An experiment in which DLR fails to match or exceed supervised fine-tuning on the four datasets and six models, or an ablation establishing whether removing the dynamic search component eliminates the observed gains.
read the original abstract
We investigate the temporal concatenation of sub-policies in Markov Decision Processes (MDPs) with time-varying reward functions. We introduce General Dijkstra Search (GDS), and prove that globally optimal goal-reaching policies can be recovered through temporal composition of intermediate optimal sub-policies. Motivated by the "search, select, update" principle underlying GDS, we propose Dynamic Latent Routing (DLR), a language-model post-training method that jointly learns discrete latent codes, routing policies, and model parameters through dynamic search in a single training stage. In low-data fine-tuning settings, DLR matches or outperforms supervised fine-tuning across four datasets and six models, achieving a mean gain of +6.6 percentage points, while prior discrete-latent baselines consistently underperform SFT. Mechanistic analyses and targeted code ablations show that DLR learns structured routing behaviors with distinct causal roles.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proves that globally optimal goal-reaching policies in MDPs with time-varying rewards can be recovered via temporal composition of optimal sub-policies using General Dijkstra Search (GDS). Motivated by the 'search, select, update' principle, it introduces Dynamic Latent Routing (DLR) for language-model post-training: a single-stage method that jointly optimizes discrete latent codes, a routing policy, and model parameters. In low-data fine-tuning, DLR matches or exceeds supervised fine-tuning (SFT), with a mean gain of +6.6 percentage points across four datasets and six models, while prior discrete-latent methods underperform SFT; mechanistic analyses indicate structured routing with distinct causal roles.
Significance. If the empirical gains are shown to arise from the GDS-derived routing mechanism rather than auxiliary regularization, the work could offer a principled route to structured, sample-efficient post-training of language models. The reported mean gain, breadth of models/datasets, and mechanistic ablations provide concrete evidence worth further scrutiny; however, the absence of a formal link between the MDP optimality result and the LM objective limits the theoretical significance.
major comments (3)
- [Motivation and Method] Motivation section: the claim that DLR implements the GDS 'search, select, update' principle in the LM setting lacks a derivation showing that the jointly learned discrete codes and routing recover sub-policy optimality (or avoid optimization instabilities) under a fixed supervised loss on token sequences rather than explicit time-varying MDP rewards.
- [Experiments] Experimental results: the reported +6.6 pp mean gain over SFT is presented without per-run variance, number of random seeds, or statistical significance tests; this makes it impossible to determine whether the advantage is robust or could be explained by differences in effective capacity or regularization between DLR and the discrete-latent baselines.
- [Theoretical Analysis] Proof of GDS optimality: the global-optimality guarantee is stated for finite-state MDPs with time-varying rewards, yet the manuscript does not address how (or whether) the same composition principle extends to the non-stationary, countably infinite state space of autoregressive language models without introducing hidden biases in the learned routing policy.
minor comments (2)
- [Method] Notation for the routing policy and latent code distribution should be introduced with explicit equations rather than prose descriptions to improve reproducibility.
- [Analysis] The abstract states 'mechanistic analyses and targeted code ablations' but the manuscript would benefit from a dedicated subsection listing the exact ablations performed and the quantitative effect sizes observed.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and have revised the manuscript to improve clarity, reporting, and discussion of limitations.
read point-by-point responses
-
Referee: [Motivation and Method] Motivation section: the claim that DLR implements the GDS 'search, select, update' principle in the LM setting lacks a derivation showing that the jointly learned discrete codes and routing recover sub-policy optimality (or avoid optimization instabilities) under a fixed supervised loss on token sequences rather than explicit time-varying MDP rewards.
Authors: We agree that the connection is motivational rather than a formal derivation. DLR is inspired by the search-select-update principle but operates under the standard next-token prediction loss without explicit MDP rewards or optimality guarantees. In the revised manuscript we have clarified this distinction in the motivation section, removed any implication of recovering sub-policy optimality, and added a short discussion of how joint optimization in practice avoids certain instabilities observed in prior discrete-latent methods. revision: partial
-
Referee: [Experiments] Experimental results: the reported +6.6 pp mean gain over SFT is presented without per-run variance, number of random seeds, or statistical significance tests; this makes it impossible to determine whether the advantage is robust or could be explained by differences in effective capacity or regularization between DLR and the discrete-latent baselines.
Authors: We accept this criticism. The revised version now reports results averaged over 5 random seeds with standard deviations for all models and datasets. We also include paired t-tests showing that the +6.6 pp mean gain over SFT is statistically significant (p < 0.05) in the majority of settings. Additional controls matching effective parameter count and regularization strength between DLR and baselines have been added to the experimental section. revision: yes
-
Referee: [Theoretical Analysis] Proof of GDS optimality: the global-optimality guarantee is stated for finite-state MDPs with time-varying rewards, yet the manuscript does not address how (or whether) the same composition principle extends to the non-stationary, countably infinite state space of autoregressive language models without introducing hidden biases in the learned routing policy.
Authors: The global-optimality result is proven only for finite-state MDPs; we do not claim it transfers directly to the countably infinite, non-stationary state space of autoregressive LMs. The manuscript presents DLR as a heuristic motivated by GDS rather than a formal extension. We have added a dedicated limitations paragraph discussing the challenges of extending temporal composition to infinite state spaces and potential routing biases, supported by the existing mechanistic analyses that show structured rather than biased routing in practice. A rigorous theoretical bridge remains future work. revision: partial
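The seed-averaged reporting the rebuttal promises can be illustrated with invented numbers (these are hypothetical accuracies, not the paper's results); a paired t-test across seeds looks like:

```python
import math
from statistics import mean, stdev

# Hypothetical per-seed accuracies (invented numbers, NOT the paper's
# results), illustrating the reporting protocol described in the rebuttal:
# seed-averaged means plus a paired t-test of DLR against SFT.

sft = [61.2, 59.8, 60.5, 60.9, 59.6]
dlr = [67.4, 66.1, 68.0, 66.9, 67.7]

diffs = [d - s for d, s in zip(dlr, sft)]
n = len(diffs)
# Paired t statistic: mean difference over its standard error.
t_stat = mean(diffs) / (stdev(diffs) / math.sqrt(n))
print(f"mean gain = {mean(diffs):+.2f} pp, t({n - 1}) = {t_stat:.2f}")
# Two-sided critical value for 4 degrees of freedom at alpha = 0.05 is ~2.776.
print("significant at p < 0.05:", abs(t_stat) > 2.776)
```

Pairing by seed matters here: it removes the shared per-seed variation (data order, initialization) that an unpaired test would count against the effect.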
Circularity Check
No circularity: GDS optimality proof and DLR empirical gains remain independent
full rationale
The paper first proves global optimality for temporal composition of sub-policies under time-varying MDP rewards via General Dijkstra Search. It then motivates Dynamic Latent Routing by the 'search, select, update' principle and reports direct empirical measurements of +6.6 pp mean gains over SFT in low-data LM fine-tuning across four datasets and six models. No equation, fitted parameter, or self-citation reduces the reported performance numbers to quantities defined by the MDP proof or by construction; the LM results are presented as measured outcomes under a fixed supervised loss, with no derivation claiming that the learned routing recovers MDP optimality. The central claim therefore rests on independent empirical benchmarks rather than on quantities the proof defines.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Globally optimal goal-reaching policies can be recovered through temporal composition of intermediate optimal sub-policies.
Reference graph
Works this paper leans on
- [1] Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. In Advances in Neural Information Processing Systems, volume 30, 2017.
- [2] Anthropic. Introducing Claude Opus 4.7. https://www.anthropic.com/news/claude-opus-4-7, 2026.
- [3] Pierre-Luc Bacon, Jean Harb, and Doina Precup. The option-critic architecture. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31, 2017.
- [4] André Barreto, Will Dabney, Rémi Munos, Jonathan J Hunt, Tom Schaul, Hado van Hasselt, and David Silver. Successor features for transfer in reinforcement learning. In Advances in Neural Information Processing Systems, volume 30, 2017.
- [5] Yonatan Belinkov. Probing classifiers: Promises, shortcomings, and advances. Computational Linguistics, 48(1):207–219, 2022. URL https://aclanthology.org/2022.cl-1.7/.
- [6] Steven Bills, Nick Cammarata, Dan Mossing, Henk Tillman, Leo Gao, Gabriel Goh, Ilya Sutskever, Jan Leike, Jeff Wu, and William Saunders. Language models can explain neurons in language models. Technical report, OpenAI, 2023. URL https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html.
- [7] Tianshi Cao, Jingkang Wang, Yining Zhang, and Sivabalan Manivasagam. Zero-shot compositional policy learning via language grounding. arXiv preprint arXiv:2004.07200, 2020.
- [8] Runjin Chen, Zhenyu Zhang, Junyuan Hong, Souvik Kundu, and Zhangyang Wang. SEAL: Steerable reasoning calibration of large language models for free. In Conference on Language Modeling (COLM), 2025.
- [9] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
- [10] Peter Dayan and Geoffrey E Hinton. Feudal reinforcement learning. In Advances in Neural Information Processing Systems, volume 5. Morgan Kaufmann, 1993.
- [11] DeepSeek-AI. DeepSeek-V4 model card. https://fe-static.deepseek.com/chat/transparency/deepseek-V4-model-card-EN.pdf, 2026.
- [12] Thomas G. Dietterich. Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research, 13:227–303, 2000.
- [13] Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, and Christopher Olah. Toy models of superposition. Transformer Circuits Thread, 2022.
- [14] Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. Diversity is all you need: Learning skills without a reward function. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=SJx63jRqFm.
- [15] Mor Geva, Daniel Khashabi, Elad Segal, Tushar Khot, Dan Roth, and Jonathan Berant. Did Aristotle use a laptop? A question answering benchmark with implicit reasoning strategies. Transactions of the Association for Computational Linguistics, 9:346–361, 2021.
- [16] Sachin Goyal, Ziwei Ji, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar, and Vaishnavh Nagarajan. Think before you speak: Training language models with pause tokens. In The Twelfth International Conference on Learning Representations, 2024. URL https://arxiv.org/abs/2310.02226.
- [17] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
- [18] Stevan Harnad. The symbol grounding problem. Physica D: Nonlinear Phenomena, 42(1–3):335–346, 1990.
- [19] Ali Hatamizadeh, Syeda Nahida Akter, Shrimai Prabhumoye, Jan Kautz, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, and Yejin Choi. RLP: Reinforcement as a pretraining objective. In International Conference on Learning Representations (ICLR), 2026. URL https://arxiv.org/abs/2510.01265.
- [20] Robert Huben, Hoagy Cunningham, Logan Riggs Smith, Aidan Ewart, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=F76bwRSLeK.
- [21] Jonathan J Hunt, Andre Barreto, Timothy P Lillicrap, and Nicolas Heess. Composing entropic policies using divergence correction. In Proceedings of the 36th International Conference on Machine Learning, pages 2911–2920. PMLR, 2019.
- [22] Kenneth Li, Aspen K. Hopkins, David Bau, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Emergent world representations: Exploring a sequence model trained on a synthetic task. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=DeG07_TcZvT.
- [23] Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. In Advances in Neural Information Processing Systems, volume 35, 2022.
- [24] Marlos C. Machado, André Barreto, Doina Precup, and Michael Bowling. Temporal abstraction in reinforcement learning with the successor representation. Journal of Machine Learning Research, 2023. arXiv:2110.05740.
- [25] Moonshot AI. Kimi K2.6 tech blog: Advancing open-source coding. https://www.kimi.com/blog/kimi-k2-6, 2026.
- [26] Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. Progress measures for grokking via mechanistic interpretability. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=9XFSbDPmdW.
- [27] Neel Nanda, Andrew Lee, and Martin Wattenberg. Emergent linear representations in world models of self-supervised sequence models. In Proceedings of the 6th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, 2023. URL https://aclanthology.org/2023.blackboxnlp-1.2/.
- [28] Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, et al. In-context learning and induction heads, 2022.
- [29] OpenAI. GPT-5.5 system card. https://openai.com/index/gpt-5-5-system-card/, 2026.
- [30] Gonçalo Santos Paulo, Alex Troy Mallen, Caden Juang, and Nora Belrose. Automatically interpreting millions of features in large language models. In Proceedings of the 42nd International Conference on Machine Learning, volume 267 of Proceedings of Machine Learning Research. PMLR, 2025. URL https://openreview.net/forum?id=EemtbhJOXc.
- [31] Jacob Pfau, William Merrill, and Samuel R. Bowman. Let's think dot by dot: Hidden computation in transformer language models. In Conference on Language Modeling (COLM), 2024.
- [32] Philip Quirke and Fazl Barez. Understanding addition in transformers. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=rIx1YXVWZb.
- [33] Philip Quirke, Clement Neo, and Fazl Barez. Understanding addition and subtraction in transformers. arXiv preprint arXiv:2402.02619, 2024. URL https://arxiv.org/abs/2402.02619.
- [34]
- [35] Qwen Team. Qwen3.5: A natively multimodal foundation model. https://www.alibabagroup.com/en-US/document-1960233590314762240, 2026.
- [36] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. Technical report, OpenAI, 2019.
- [37] Keshav Ramji, Tahira Naseem, and Ramón Fernandez Astudillo. Thinking without words: Efficient latent reasoning with abstract chain-of-thought. In Latent & Implicit Thinking Workshop at the International Conference on Learning Representations (ICLR), 2026. URL https://arxiv.org/abs/2604.22709.
- [38] Tom Schaul, Daniel Horgan, Karol Gregor, and David Silver. Universal value function approximators. In Proceedings of the 32nd International Conference on Machine Learning, pages 1312–1320. PMLR, 2015.
- [39] Alok N. Shah, Khush Gupta, Keshav Ramji, and Pratik Chaudhari. Language modeling with learned meta-tokens. In Proceedings of the 42nd International Conference on Machine Learning (ICML), 2025. URL https://openreview.net/pdf?id=oaHYnLldHM.
- [40] Archit Sharma, Shixiang Shane Gu, Sergey Levine, Vikash Kumar, and Karol Hausman. Dynamics-aware unsupervised discovery of skills. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=HJgLZR4KvH.
- [41] Vincent Siu, Nicholas Crispino, David Park, Nathan W. Henry, Zhun Wang, Yang Liu, Dawn Song, and Chenguang Wang. SteeringSafety: A systematic safety evaluation framework of representation steering in LLMs. In NeurIPS 2025 Workshop on Socially Responsible and Trustworthy Foundation Models, 2025. URL https://arxiv.org/abs/2509.13450.
- [42] DiJia Su, Hanlin Zhu, Yingchen Xu, Jiantao Jiao, Yuandong Tian, and Qinqing Zheng. Token assorted: Mixing latent and text tokens for improved language model reasoning. In ICML 2025 Workshop on Long-Context Foundation Models (LCFM), 2025. URL https://arxiv.org/abs/2502.03275.
- [43] Yucheng Sun, Alessandro Stolfo, and Mrinmaya Sachan. Probing for arithmetic errors in language models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025. URL https://aclanthology.org/2025.emnlp-main.411/.
- [44] Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction. MIT Press, 2nd edition, 2018.
- [45] Richard S. Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1–2):181–211, 1999.
- [46] Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Proceedings of NAACL-HLT, pages 4149–4158, 2019.
- [47] Xinyu Tang, Xiaolei Wang, Zhihao Lv, Yingqian Min, Wayne Xin Zhao, Binbin Hu, Ziqi Liu, and Zhiqiang Zhang. Unlocking general long chain-of-thought reasoning capabilities of large language models via representation engineering. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vienna, Austria, 2025.
- [48] Bowei Tian, Xuntao Lyu, Meng Liu, Hongyi Wang, and Ang Li. Why representation engineering works: A theoretical and empirical study in vision-language models. arXiv:2503.22720, 2025.
- [49] Emanuel Todorov. Compositionality of optimal control laws. In Advances in Neural Information Processing Systems, volume 22, 2009.
- [50] Benjamin Van Niekerk, Steven James, Adam Earle, and Benjamin Rosman. Composing value functions in reinforcement learning. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 6401–6409. PMLR, 2019.
- [51] Alexander Sasha Vezhnevets, Simon Osindero, Tom Schaul, Nicolas Heess, Max Jaderberg, David Silver, and Koray Kavukcuoglu. FeUdal networks for hierarchical reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, pages 3540–3549. PMLR, 2017.
- [52] Kevin Ro Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. Interpretability in the wild: a circuit for indirect object identification in GPT-2 small. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=NpsVSN6o4ul.
- [53] Ashkan Yousefpour, Taeheon Kim, Ryan Sungmo Kwon, Seungbeen Lee, Wonje Jeung, Seungju Han, Alvin Wan, Harrison Ngan, Youngjae Yu, and Jonghyun Choi. Representation bending for large language model safety. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL), 2025. URL https://aclanthology.org/2025.acl-long.1173/.
- [54] Wei Zhang, Chaoqun Wan, Yonggang Zhang, Yiu-Ming Cheung, Xinmei Tian, Xu Shen, and Jieping Ye. Interpreting and improving large language models in arithmetic calculation. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research. PMLR, 2024. URL https://proceedings.mlr.press/v235/zhang24bk.html.
- [55] Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael J. Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, J. Zico Kolter, and Dan Hendrycks. Representation engineering: A top-down approach to AI transparency, 2023.