TARPO: Token-Wise Latent-Explicit Reasoning via Action-Routing Policy Optimization

Jianye Wang; Liting Zhang; Qicheng Li; Shiwan Zhao; Xuyang Zhao; Zichen Xu

arxiv: 2606.05859 · v1 · pith:RXQJYKANnew · submitted 2026-06-04 · 💻 cs.CL

Liting Zhang, Shiwan Zhao, Xuyang Zhao, Zichen Xu, Jianye Wang, Qicheng Li

Pith reviewed 2026-06-28 01:38 UTC · model grok-4.3

classification 💻 cs.CL

keywords latent reasoning · reinforcement learning · action routing · token-wise decisions · LLM reasoning · policy optimization · hybrid reasoning

The pith

TARPO uses reinforcement learning to train LLMs to decide, at each token, whether to emit an explicit token or to reason in continuous latent space.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents TARPO as a reinforcement learning approach that lets models switch between generating discrete tokens and performing latent reasoning on a per-token basis. A lightweight router samples a binary choice from the current hidden state, which keeps the policy stochastic enough for exploration while the backbone and router train jointly on a shared advantage signal. Experiments on Qwen2.5 models from 1.5B to 7B and Llama-3.1-8B show consistent improvements over both explicit CoT and prior latent reasoning RL methods across multiple benchmarks. The work demonstrates that models learn adaptive switching patterns without destabilizing training.

Core claim

TARPO is a pure RL framework in which an action head router observes the hidden state at each step and samples a binary decision to route either to discrete token generation from the vocabulary or to continuous latent reasoning; the LLM and router are optimized end-to-end with group-relative advantages, producing better reasoning performance than baselines that stay fixed in one mode.
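The shared group-relative advantage can be illustrated with a small sketch. It assumes the common GRPO-style normalization (reward minus group mean, divided by group standard deviation) and a plain REINFORCE-style surrogate; the paper's exact objective, clipping, and KL terms are not given on this page, and the function names here are hypothetical.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts for the same prompt.

    Assumption: GRPO-style (r - mean) / (std + eps); not confirmed for TARPO.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def surrogate_loss(advantage, token_logps, route_logps):
    """Loss for one rollout: a single advantage weights both the token
    log-probabilities (backbone) and the routing log-probabilities (router)."""
    return -advantage * (sum(token_logps) + sum(route_logps))


# Four rollouts for one prompt with binary correctness rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Because one scalar multiplies both sums, a rollout that scores above its group average raises the probability of its tokens and of its routing choices together.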

What carries the argument

The lightweight action head router that samples a binary mode-selection decision from the current hidden state to choose between explicit token output and latent reasoning.
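A minimal sketch of such a router, assuming it is a single linear layer over the hidden state followed by a two-way softmax. The class name, mode labels, weight scale, and hidden size are illustrative choices, not details taken from the paper.

```python
import math
import random

EXPLICIT, LATENT = 0, 1  # hypothetical labels for the two modes


class ActionHeadRouter:
    """Toy linear head: hidden state -> 2 logits -> sampled mode."""

    def __init__(self, hidden_size, bias=(0.0, 0.0), seed=0):
        rng = random.Random(seed)
        # One small random weight row per mode (assumed initialization).
        self.weights = [
            [rng.gauss(0.0, 0.02) for _ in range(hidden_size)] for _ in range(2)
        ]
        self.bias = list(bias)
        self.rng = rng

    def probs(self, hidden):
        logits = [
            sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(self.weights, self.bias)
        ]
        top = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(z - top) for z in logits]
        total = sum(exps)
        return [e / total for e in exps]

    def sample(self, hidden):
        """Return (mode, log-probability of the sampled mode)."""
        p = self.probs(hidden)
        mode = EXPLICIT if self.rng.random() < p[EXPLICIT] else LATENT
        return mode, math.log(p[mode])


router = ActionHeadRouter(hidden_size=8)
hidden_state = [0.1 * i for i in range(8)]
mode, logp = router.sample(hidden_state)
```

The returned log-probability is what a policy-gradient update would need for the routing decision; sampling rather than taking the argmax is what keeps the routing policy stochastic.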

If this is right

Models learn to insert latent steps only where they help, avoiding unnecessary explicit tokens.
Joint optimization keeps training dynamics stable while exploring both reasoning modes.
The same routing mechanism works across model families and sizes from 1.5B to 8B parameters.
Adaptive switching reduces reliance on fixed-length explicit chains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The binary router could be extended to multi-way decisions that include different latent granularities.
If the router generalizes, it may allow hybrid reasoning on tasks where pure latent methods currently underperform.
Ablating the router after training would test whether the learned policy truly depends on per-step adaptation.

Load-bearing premise

A simple router looking only at the current hidden state can produce routing decisions that preserve enough randomness for effective exploration in the combined discrete-latent policy.

What would settle it

Rerun the same benchmarks with the router removed, or with its sampling made deterministic. If those variants show no improvement over, or fall below, the non-routed latent and explicit baselines, the reported gains depend on stochastic per-token routing.
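The ablation arms can be sketched as a single routing function with a switch. This is an illustrative harness only: the arm names and the per-step probabilities are hypothetical, and "greedy" stands in for deterministic routing.

```python
import random


def route(p_explicit, arm, rng):
    """Return True to emit a discrete token, False to take a latent step."""
    if arm == "sampled":        # stochastic routing
        return rng.random() < p_explicit
    if arm == "greedy":         # deterministic routing: always the likelier mode
        return p_explicit >= 0.5
    if arm == "explicit_only":  # router removed, explicit baseline
        return True
    if arm == "latent_only":    # router removed, latent baseline
        return False
    raise ValueError(f"unknown arm: {arm}")


rng = random.Random(0)
p_per_step = [0.9, 0.6, 0.4, 0.8]  # made-up router outputs for four steps

greedy_path = [route(p, "greedy", rng) for p in p_per_step]
sampled_paths = {
    tuple(route(p, "sampled", rng) for p in p_per_step) for _ in range(200)
}
```

Greedy routing always yields the same path for the same probabilities, while sampling visits several distinct paths, which is the exploration difference the ablation is meant to isolate.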

Figures

Figures reproduced from arXiv: 2606.05859 by Jianye Wang, Liting Zhang, Qicheng Li, Shiwan Zhao, Xuyang Zhao, Zichen Xu.

**Figure 2.** Comparison of training curves between TARPO and baselines (GRPO, HRPO) across different model

**Figure 3.** Analysis of TARPO action head bias initialization and KL (Qwen2.5-3B-Instruct on MATH). (a) and (b)

**Figure 4.** A case study on a MATH500 problem. (a) GRPO outputs with token entropy highlighted in blue. (b)

**Figure 5.** Example Prompt for Mathematical Reasoning in TARPO.

**Figure 6.** Example Prompt for GPQA-Diamond and ARC-C in TARPO.

**Figure 7.** Example Prompt for HumanEval in TARPO.

Table text extracted with this caption (each cell lists P@1 / P@32 / M@32):

| Model | Bias | Act-KL | GSM8K | MATH | MATH500 | AMC23 | OlympiadBench |
|---|---|---|---|---|---|---|---|
| Qwen2.5-7B-Instruct | [4.6, 0] | ✓ | 89.92 / 97.27 / 93.18 | 70.22 / 88.22 / 77.04 | 70.26 / 87.80 / 76.20 | 53.52 / 95.00 / 65.00 | 30.69 / 54.22 / 43.11 |
| Qwen2.5-7B-Instruct | [2.2, 0] | ✓ | 89.82 / 97.50 / 93.18 | 69.93 / 88.28 / 76.98 | 69.91 / 88.20 / 77.60 | 52.03 / 90.00 / 62.50 | 30.56 / 54.22 / 42.07 |

Qwen2.5-1.5B-Instruct [4.6,…

**Figure 8.** Effect of action head initial bias (b0) and action KL regularization on Qwen2.5-7B-Instruct trained on GSM8K, under four configurations combining two bias settings (b0 = [2.2, 0] and b0 = [4.6, 0]) with and without action KL penalty

**Figure 9.** Effect of action head initial bias (b0) and action KL regularization on Qwen2.5-3B-Instruct trained on GSM8K.

Table text extracted with this caption (each cell lists P@1 / P@32 / M@32 / #Tok):

| Model | Bias | Act-KL | GPQA-Diamond | ARC-C | HumanEval |
|---|---|---|---|---|---|
| Qwen2.5-3B-Instruct | [4.6, 0] | × | 28.09 / 88.38 / 30.30 / 532.3 | 74.01 / 98.89 / 84.04 / 189.4 | 60.25 / 89.63 / 55.49 / 301.7 |
| Qwen2.5-3B-Instruct | [4.6, 0] | ✓ | 28.41 / 90.91 / 29.29 / 568.3 | 74.55 / 97.87 / 85.49 / 213.7 | 63.62 / 88.41 / 61.59 / 256.1 |

**Figure 10.** Training curves of Qwen2.5-3B-Instruct on DAPO-MATH-17k. From left to right: Training Entropy,

**Figure 11.** Training curves of Qwen2.5-7B-Instruct on MATH.

**Figure 12.** Training curves of Llama-3.1-8B-Instruct on MATH.
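The bias settings b0 = [4.6, 0] and b0 = [2.2, 0] named in the Figure 8 and Figure 9 captions can be read as initial routing probabilities. The sketch below assumes the action head applies a softmax over two logits and that the weight contribution is negligible at initialization; the captions do not say which mode the first entry refers to, so it is called "first mode" here.

```python
import math


def first_mode_probability(bias):
    """Two-way softmax over the bias alone reduces to a sigmoid of the gap."""
    b0, b1 = bias
    return 1.0 / (1.0 + math.exp(-(b0 - b1)))


p_high = first_mode_probability([4.6, 0.0])  # about 0.990
p_low = first_mode_probability([2.2, 0.0])   # about 0.900
```

Under these assumptions, the larger bias starts the router choosing the first mode about 99% of the time, and the smaller bias about 90% of the time.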

Original abstract

Latent reasoning has emerged as a promising alternative to discrete Chain-of-Thought (CoT) in large language models (LLMs), enabling more expressive reasoning by operating over continuous representations. However, the inherently deterministic nature of continuous representations limits policy exploration in reinforcement learning (RL). To address this, we propose TARPO (Token-Wise Latent-Explicit Reasoning via Action-Routing Policy Optimization), a pure RL framework that adaptively switches between discrete token generation and continuous latent reasoning at each step. TARPO introduces a lightweight action head router that observes the current hidden state and samples a routing decision from a binary mode-selection space, preserving the stochasticity of discrete token sampling from the vocabulary. The LLM backbone and router are jointly optimized end-to-end with a shared group-relative advantage signal. Extensive experiments across Qwen2.5 (from 1.5B to 7B) and Llama-3.1-8B backbones demonstrate that TARPO consistently outperforms existing explicit and latent reasoning RL baselines across diverse benchmarks. Further analysis shows that TARPO learns adaptive token-wise switching behaviors while maintaining stable training dynamics. Our code is available at https://github.com/NKU-LITI/TARPO-master.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

TARPO adds a binary router for per-token latent/explicit switching in pure RL, but the abstract supplies zero experimental details so the outperformance claim cannot be checked.

The paper's actual addition is a small action head that, at every token, samples a binary decision to either emit a real token or stay in continuous latent space. The router sees only the current hidden state and the whole system trains end-to-end with one shared group-relative advantage. That design is new enough in the RL-for-LLM literature and they release the code, which is useful.

They report results on Qwen2.5 models from 1.5B to 7B plus Llama-3.1-8B and say the method beats both pure explicit and pure latent RL baselines on several benchmarks while keeping training stable. The token-wise adaptivity is the part that could matter for people who want to avoid committing to one reasoning style.

The obvious problem is that none of this is shown. The abstract names no baselines, reports no numbers, and gives no ablation on the router or the advantage sharing. Without those tables it is impossible to tell whether the router actually improves anything or whether the gains are just from extra compute or different hyperparameters.

The entropy concern also needs checking. A binary choice carries at most one bit; if the router learns to always pick the same mode the exploration benefit disappears. The abstract claims it preserves stochasticity but does not describe temperature, Gumbel noise, or router entropy regularization, so that part remains unverified.
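The one-bit ceiling and the collapse risk can be made concrete with the binary entropy function. This is a generic sketch, not code from the paper; averaging it over routing steps is one hypothetical way to produce the router-entropy curve the concern calls for.

```python
import math


def binary_entropy_bits(p):
    """Entropy in bits of a two-way choice that picks one mode with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # a fully collapsed router carries no randomness
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def mean_router_entropy(probs):
    """Average per-step routing entropy over a rollout or a batch."""
    return sum(binary_entropy_bits(p) for p in probs) / len(probs)


balanced = binary_entropy_bits(0.5)       # the 1-bit maximum
near_collapse = binary_entropy_bits(0.99)  # well below 0.1 bit
```

A mean that drifts toward zero during training would indicate that the router has settled on one mode almost everywhere.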

This is for the narrow group working on RL post-training of LLMs. It is worth sending to review if the full paper contains the missing controls and the released code reproduces the claimed gains; otherwise the contribution stays too thin to evaluate.

Referee Report

2 major / 1 minor

Summary. The paper proposes TARPO, a pure RL framework for LLMs that adaptively routes between discrete token generation and continuous latent reasoning at each step via a lightweight action head router. The router observes the hidden state and samples from a binary mode-selection space; the backbone and router are jointly optimized end-to-end using a shared group-relative advantage. Experiments on Qwen2.5 (1.5B–7B) and Llama-3.1-8B models report consistent outperformance over explicit and latent RL baselines across diverse benchmarks, with additional analysis of learned switching behavior and training stability. Code is released.

Significance. If the empirical gains and the router's exploration properties hold under scrutiny, TARPO would offer a practical way to combine the expressivity of latent reasoning with the stochasticity of discrete sampling inside a single RL loop. The end-to-end joint optimization and released code are positive features for reproducibility.

major comments (2)

[Abstract / router description] The central claim that the binary router 'preserves the stochasticity of discrete token sampling' (Abstract) rests on an unverified assumption: that a 1-bit decision per step, conditioned only on the hidden state and trained with the same group-relative advantage, will not collapse under joint optimization. No entropy regularization, temperature schedule, or router-entropy curves are referenced to counteract this risk; without such evidence the advantage over purely latent baselines is not yet load-bearing.
[Abstract] The outperformance claim is stated without any quantitative support in the provided abstract (no metrics, baselines, or ablation tables). Even if the full experimental section exists, the absence of these details in the summary makes it impossible to evaluate whether the router mechanism, rather than other implementation choices, drives the reported gains.

minor comments (1)

Notation for the binary mode-selection space and the precise interface between the router output and the LLM forward pass should be formalized with an equation or pseudocode for clarity.
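One possible formalization of the kind requested here, written as an assumption rather than as the paper's own notation: $h_t$ is the hidden state, $\pi_\phi$ the router, $\pi_\theta$ the vocabulary distribution, $E$ the token embedding, and $g$ an unspecified map that produces the latent input.

```latex
a_t \sim \pi_\phi(\cdot \mid h_t), \qquad a_t \in \{\mathrm{exp}, \mathrm{lat}\}

x_{t+1} =
\begin{cases}
  E(y_t), \quad y_t \sim \pi_\theta(\cdot \mid h_t) & \text{if } a_t = \mathrm{exp} \\
  g(h_t) & \text{if } a_t = \mathrm{lat}
\end{cases}

\log \pi(a_t, y_t \mid h_t)
  = \log \pi_\phi(a_t \mid h_t)
  + \mathbb{1}[a_t = \mathrm{exp}] \, \log \pi_\theta(y_t \mid h_t)
```

In this reading, a latent step contributes only the routing term to the log-probability, which is why the router's stochasticity matters for exploration.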

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the router's stochastic properties and the abstract's clarity. We address each major comment below and indicate the revisions we will incorporate.

Referee: [Abstract / router description] The central claim that the binary router 'preserves the stochasticity of discrete token sampling' (Abstract) rests on an unverified assumption: that a 1-bit decision per step, conditioned only on the hidden state and trained with the same group-relative advantage, will not collapse under joint optimization. No entropy regularization, temperature schedule, or router-entropy curves are referenced to counteract this risk; without such evidence the advantage over purely latent baselines is not yet load-bearing.

Authors: We agree that the manuscript would benefit from explicit evidence that the router does not collapse. The router samples from a binary mode-selection space at each step, which introduces stochasticity by construction. The full manuscript analyzes learned switching behavior showing non-trivial, context-dependent routing decisions. To directly address the concern about potential collapse under joint optimization, we will add router-entropy curves and related analysis in the revised version.

Revision: yes

Referee: [Abstract] The outperformance claim is stated without any quantitative support in the provided abstract (no metrics, baselines, or ablation tables). Even if the full experimental section exists, the absence of these details in the summary makes it impossible to evaluate whether the router mechanism, rather than other implementation choices, drives the reported gains.

Authors: We acknowledge that the current abstract lacks specific quantitative results, which limits immediate evaluation. The full paper reports consistent gains on Qwen2.5 and Llama-3.1 models across benchmarks, but to improve clarity we will revise the abstract to include key metrics (e.g., average improvements) and name the primary baselines.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical framework with independent components

The paper introduces TARPO as an RL method with a new lightweight router for binary mode selection, jointly optimized via group-relative advantage, and supports its claims solely through benchmark experiments on Qwen and Llama models. No derivation chain, first-principles predictions, or mathematical results are presented that reduce to fitted parameters, self-definitions, or self-citation chains. The router's stochasticity claim is an architectural definition, not a tautological prediction, and performance outperformance is an empirical observation rather than a constructed result. The framework is self-contained against external benchmarks with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Only abstract available; no details on specific free parameters, background axioms, or additional invented entities beyond the described router component.

invented entities (1)

action head router (no independent evidence)
purpose: samples a binary routing decision from the hidden state to switch between discrete and latent modes
note: lightweight component introduced to enable adaptive switching while preserving stochasticity

