
arxiv: 2512.10931 · v3 · pith:NJMH7MRSnew · submitted 2025-12-11 · 💻 cs.LG · cs.CL

Asynchronous Reasoning: Training-Free Interactive Thinking LLMs

Pith reviewed 2026-05-16 22:52 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords asynchronous reasoning · positional embeddings · large language models · interactive agents · training-free methods · real-time response · LLM reasoning

The pith

Modifying positional embeddings lets existing LLMs reason asynchronously while generating responses without any retraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models trained to reason must finish their internal thinking before emitting any output tokens, which prevents real-time adaptation when new information arrives mid-process. The paper shows that properties of positional embeddings can be used to let these models think about the problem, incorporate fresh inputs, and produce text at the same time in one forward pass. The method requires no extra training and is evaluated on math, commonsense, and safety tasks. According to the abstract, answers remain accurate while time to the first non-thinking token drops from minutes to five seconds or less and overall delays shrink by up to twelve times. This makes reasoning models usable in interactive settings such as voice assistants or embodied agents that must respond while still processing.

Core claim

By leveraging properties of positional embeddings, LLMs built for sequential generation can be made to perform asynchronous reasoning: they simultaneously maintain internal thinking tokens, listen to additional inputs, and write output tokens, all without additional training, while still producing accurate reasoning-augmented answers on math, commonsense, and safety benchmarks.

What carries the argument

Positional embedding adjustment that separates the position indices of internal reasoning tokens from output generation tokens, enabling parallel thinking and writing streams inside a single sequential model.
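
The property this mechanism relies on can be illustrated with rotary positional embeddings (RoPE), which the Figure 1 caption names: the attention score between a rotated query and a rotated key depends only on the difference of their positions. The sketch below is a minimal pure-Python illustration, not the paper's code; the helper names `rope` and `dot`, the vector size, and the positions are all assumptions chosen for the example.

```python
import math
import random

def rope(x, pos, base=10000.0):
    """Rotate consecutive feature pairs (x[2i], x[2i+1]) by the angle pos * theta_i."""
    d = len(x)
    out = []
    for i in range(0, d, 2):
        angle = pos * base ** (-i / d)  # one frequency per feature pair
        c, s = math.cos(angle), math.sin(angle)
        out += [x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c]
    return out

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

random.seed(0)
q = [random.gauss(0, 1) for _ in range(8)]  # hypothetical query vector
k = [random.gauss(0, 1) for _ in range(8)]  # hypothetical key vector

# Same relative distance (query 3 positions after the key), different absolute positions.
score_a = dot(rope(q, 5), rope(k, 2))
score_b = dot(rope(q, 1005), rope(k, 1002))
# A different relative distance generally gives a different score.
score_c = dot(rope(q, 5), rope(k, 4))
```

Because `score_a` and `score_b` agree up to floating-point error, shifting a whole stream to new position indices leaves its internal attention scores unchanged, which is the kind of freedom a position-index adjustment could exploit.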

If this is right

  • Models can begin emitting non-thinking output tokens while still processing the full input or new data.
  • Time to first non-thinking token falls from minutes to five seconds or less.
  • Overall response delays shrink by up to twelve times on reasoning tasks.
  • Reasoning accuracy stays comparable to standard sequential generation on math, commonsense, and safety benchmarks.
  • The change applies to existing trained models with no retraining required.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same embedding adjustment might support continuous multi-turn dialogues where new user messages arrive while the model is still refining its internal reasoning.
  • Combining this positional technique with other inference-time optimizations could further reduce latency in live voice or robotic systems.
  • The approach suggests that other architectural properties of transformers could be repurposed to create additional interaction modes without retraining.

Load-bearing premise

Changing how positional embeddings are handled preserves the model's original reasoning accuracy and does not introduce generation inconsistencies or hallucinations.

What would settle it

A side-by-side evaluation on the same math or commonsense problems where the asynchronous version produces final answers that differ from or are less accurate than those of the unmodified baseline model.
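
A side-by-side check of this kind could be scored roughly as follows. This is a sketch under stated assumptions: the function name `compare_runs`, exact-match scoring on answer strings, and the toy answers are all hypothetical and not taken from the paper.

```python
def compare_runs(gold, baseline, asynchronous):
    """Compare final answers of a baseline run and an asynchronous run on the same problems."""
    n = len(gold)
    if n == 0 or not (n == len(baseline) == len(asynchronous)):
        raise ValueError("all three answer lists must be non-empty and of equal length")
    return {
        "baseline_accuracy": sum(b == g for b, g in zip(baseline, gold)) / n,
        "async_accuracy": sum(a == g for a, g in zip(asynchronous, gold)) / n,
        # Fraction of problems where both setups give the same final answer.
        "agreement": sum(a == b for a, b in zip(asynchronous, baseline)) / n,
        # Problems the baseline solved but the asynchronous run did not.
        "regressions": [
            i for i, (g, b, a) in enumerate(zip(gold, baseline, asynchronous))
            if b == g and a != g
        ],
    }

# Made-up answers for four problems, purely for illustration.
report = compare_runs(
    gold=["4", "9", "x=2", "7"],
    baseline=["4", "9", "x=2", "8"],
    asynchronous=["4", "10", "x=2", "8"],
)
```

The `regressions` list is the part that would bear on the premise: it isolates problems where only the asynchronous setup fails.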

Figures

Figures reproduced from arXiv: 2512.10931 by Alina Shutova, Denis Kuznedelev, George Yakushev, Masoud Vahid Dastgerdi, Max Ryabinin, Nataliia Babina, Vyacheslav Zhdanovskiy.

Figure 1: An intuitive explanation of asynchronous reasoning: the model generates its response concurrently with thinking. If the thinking stream needs additional time, it can pause the writing stream until the next reasoning step is ready. We rely on the geometric properties of rotary positional embeddings to make the LLM perceive these streams as a single contiguous sequence without additional training. The model …
Figure 2: A dual thinker / writer view of the same reasoning task. The two views reuse the same KV cache and generate tokens in parallel. Both thinker and writer see the problem in the same sequential format that the model was trained with. Robotic & virtual agents. Another type of LLM applications that requires interactivity is agents in real-time environments. Agents controlling robotic systems use Embodied Lan…
Figure 3: Concurrent thinking and writing implemented as batched inference. The newly added tokens attend to cache blocks with additional query rotations. The checkered areas represent tokens that are not visible in the current view. During the self-attention forward pass, we concatenate the dot products between the query and all cache blocks, but we transform the query differently for each block to simulate the dif…
Figure 6: Evaluating MMLU-Pro performance of AsyncReasoning and baselines across different Qwen3 model sizes, A100.
Figure 8: Evaluation with partial inputs on sharded MATH-500 (left) and on the math subset of lost_in_conversation (right). The first shard is available immediately, subsequent shards are inserted every ksteps. Dotted lines denote accuracies without input sharding. 4.3. Asynchronous Reasoning about Safety. To evaluate the impact of asynchronous reasoning on safety, we conduct experiments on the HarmBench validation s…
Figure 9: Comparing the impact of additional mode switching methods & baselines on MATH-500, Qwen3-32B, A100. Alternative prompts: we compare several additional mode switching strategies in …
Figure 10: Accuracy with respect to ksteps on Qwen3-32B. We report three setups: shard in prompts, shard in all three blocks, shard in prompt and reminder to the thinker and writer on lost_in_conversations. The dotted horizontal line denotes the upper bound, where all shards are provided at the start. … on Qwen3-235B-A22B-Thinking-2507 and on both sharded MATH-500 and math subset (103 samples) of lost_in_conversation…
Figure 11: Accuracy ablation over cache-insertion targets as a function of ksteps. We evaluate all combinations of inserting shards into the prompt, thinker, and writer cache blocks. The dotted horizontal line denotes the upper bound, where all shards are provided at the start. All experiments use Qwen3-32B on sharded MATH-500.
Figure 12: Accuracy ablation over cache-insertion targets as a function of ksteps. We evaluate setups where shards in some blocks are substituted with reminders. (Left) Two insertions in total. (Right) Three insertions in total. The dashed lines denote reminder substitution experiments. The dotted horizontal line denotes the upper bound, where all shards are provided at the start. All experiments use Qwen3-32B on sh…
Figure 13: Accuracy with respect to ksteps. We report three setups: shard in prompts, shard in all three blocks, shard in prompt and reminder to the thinker and writer. Qwen3-235B-A22B-Thinking-2507 on the following datasets: (Left) MATH-500 and (Right) lost_in_conversations. The dotted horizontal line denotes the upper bound, where all shards are provided at the start.
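
The Figure 3 caption describes concatenating query-key dot products across cache blocks while transforming the query differently for each block. One way this could work is sketched below, assuming each block caches its keys rotated at block-local positions and the query is re-rotated by the block's start offset in the current view. The block names, block lengths, view orderings, and helper functions are hypothetical, and the softmax over the concatenated logits is omitted.

```python
import math
import random

def rope(x, pos, base=10000.0):
    """Rotate consecutive feature pairs of x by the angle pos * theta_i."""
    d = len(x)
    out = []
    for i in range(0, d, 2):
        angle = pos * base ** (-i / d)
        c, s = math.cos(angle), math.sin(angle)
        out += [x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c]
    return out

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def blockwise_logits(q, cache, order, query_pos):
    """Logits of one query against cache blocks laid out in `order`.

    Keys in `cache` were rotated with block-local positions 0..n-1, so the
    query is rotated once per block to restore the intended relative distance.
    """
    logits, start = [], 0
    for name in order:
        q_rot = rope(q, query_pos - start)  # per-block query rotation
        logits += [dot(q_rot, key) for key in cache[name]]
        start += len(cache[name])
    return logits

def contiguous_logits(q, raw, order, query_pos):
    """Reference: rotate every key at its absolute position in one contiguous sequence."""
    keys = [key for name in order for key in raw[name]]
    q_rot = rope(q, query_pos)
    return [dot(q_rot, rope(key, j)) for j, key in enumerate(keys)]

def random_vector(dim=8):
    return [random.gauss(0, 1) for _ in range(dim)]

random.seed(0)
raw = {  # hypothetical unrotated keys for three cache blocks
    "prompt": [random_vector() for _ in range(3)],
    "thinker": [random_vector() for _ in range(4)],
    "writer": [random_vector() for _ in range(2)],
}
cache = {name: [rope(key, j) for j, key in enumerate(keys)] for name, keys in raw.items()}
q = random_vector()

# Two hypothetical views over the same cache, each ending with its own stream.
views = {"writer": ("prompt", "thinker", "writer"), "thinker": ("prompt", "writer", "thinker")}
max_err = max(
    abs(a - b)
    for order in views.values()
    for a, b in zip(blockwise_logits(q, cache, order, 9), contiguous_logits(q, raw, order, 9))
)
```

In this toy setup the block-wise logits match the contiguous ones in both views, so one cache can be presented in different orders without re-encoding the keys. Whether the paper uses exactly this offset rule is not confirmed by the material above.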
original abstract

Many state-of-the-art LLMs are trained to think before giving their answer. Reasoning can greatly improve language model capabilities, but it also makes them less interactive: given a new input, a model must stop thinking before it can respond. Real-world use cases such as voice-based or embodied assistants require an LLM agent to respond and adapt to additional information in real time, which is incompatible with sequential interactions. In contrast, humans can listen, think, and act asynchronously: we begin thinking about the problem while reading it and continue thinking while formulating the answer. In this work, we augment LLMs capable of reasoning to operate in a similar way without additional training. Our method uses the properties of positional embeddings to enable LLMs built for sequential generation to simultaneously think, listen, and write outputs. We evaluate our approach on math, commonsense, and safety reasoning: it allows models to generate accurate thinking-augmented answers while reducing time to first non-thinking token from minutes to ${\le}$ 5s and the overall delays by up to $12{\times}$.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that LLMs trained for sequential reasoning can be made asynchronous and interactive without training by modifying positional embeddings to support simultaneous thinking, listening, and output generation. This yields accurate answers on math, commonsense, and safety tasks while cutting time-to-first-non-thinking-token from minutes to ≤5 s and overall latency by up to 12×.

Significance. If the core assumption holds, the approach would enable real-time, adaptive reasoning in voice or embodied agents without retraining, addressing a practical gap between current sequential LLMs and human-like asynchronous cognition.

major comments (2)
  1. [Abstract / Method] The central claim that positional-embedding alterations (offset streams or parallel position IDs) preserve the original next-token distribution and reasoning accuracy is load-bearing yet unsupported by any derivation or mechanistic argument showing why attention patterns and token dependencies learned in pre-training remain unchanged under the new position signals.
  2. [Evaluation] No ablation or quantitative breakdown is supplied that isolates the effect of the embedding change on reasoning fidelity (e.g., error rates, hallucination rates, or consistency under streaming inputs), leaving the reported accuracy claims unverifiable from the given evidence.
minor comments (2)
  1. [Method] Notation for the modified position IDs and the exact offset rule should be formalized with an equation or pseudocode to allow reproduction.
  2. [Experiments] The latency numbers (≤5 s, 12×) would benefit from explicit reporting of hardware, batch size, and whether measurements include or exclude the thinking stream.
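
The reporting requested in the last minor comment could look roughly like the sketch below, which measures time to first non-thinking token from a timestamped token trace and records hardware and batch size next to it. The event format, the `LatencyReport` fields, the batch size, and the timings are made up for illustration; only the A100 label appears in the figure captions.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple

@dataclass
class LatencyReport:
    setup: str
    hardware: str
    batch_size: int
    ttfnt_seconds: Optional[float]  # time to first non-thinking token

def time_to_first_non_thinking_token(
    events: Iterable[Tuple[float, str]], start_time: float
) -> Optional[float]:
    """events: (timestamp_seconds, stream) pairs in time order, stream is "thinking" or "output"."""
    for timestamp, stream in events:
        if stream == "output":
            return timestamp - start_time
    return None  # no user-visible token was produced

# Hypothetical traces: sequential generation thinks first, the asynchronous one interleaves.
sequential = [(1.0, "thinking"), (60.0, "thinking"), (120.0, "output"), (121.0, "output")]
asynchronous = [(1.0, "thinking"), (2.0, "output"), (60.0, "thinking"), (61.0, "output")]

reports = [
    LatencyReport("sequential", "A100", 1, time_to_first_non_thinking_token(sequential, 0.0)),
    LatencyReport("asynchronous", "A100", 1, time_to_first_non_thinking_token(asynchronous, 0.0)),
]
speedup = reports[0].ttfnt_seconds / reports[1].ttfnt_seconds
```

Keeping the thinking tokens in the trace makes it explicit whether a reported number counts them or only user-visible output.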

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the valuable feedback on our paper 'Asynchronous Reasoning: Training-Free Interactive Thinking LLMs'. We address each major comment below and outline the revisions we will make to strengthen the manuscript.

point-by-point responses
  1. Referee: [Abstract / Method] The central claim that positional-embedding alterations (offset streams or parallel position IDs) preserve the original next-token distribution and reasoning accuracy is load-bearing yet unsupported by any derivation or mechanistic argument showing why attention patterns and token dependencies learned in pre-training remain unchanged under the new position signals.

    Authors: We agree a more explicit mechanistic argument is warranted. In the revised manuscript we will expand Section 3 (Method) with a dedicated paragraph explaining the invariance: because we employ parallel position IDs (or fixed offsets) per stream, the relative positional encodings within each individual stream (thinking, listening, or output) are identical to those seen during pre-training. Consequently, the attention scores and token dependencies internal to each stream remain unchanged, while cross-stream interactions are governed by the model’s existing learned weights. We will also include a short derivation sketch of the attention matrix under these position assignments to make the preservation of the next-token distribution explicit. revision: yes

  2. Referee: [Evaluation] No ablation or quantitative breakdown is supplied that isolates the effect of the embedding change on reasoning fidelity (e.g., error rates, hallucination rates, or consistency under streaming inputs), leaving the reported accuracy claims unverifiable from the given evidence.

    Authors: We acknowledge the absence of isolating ablations in the current version. In the revision we will add a new subsection to the Experiments section containing (i) direct accuracy comparisons on the math, commonsense, and safety benchmarks between standard sequential generation and our asynchronous setup, (ii) error-rate and hallucination-rate breakdowns, and (iii) consistency metrics when new tokens arrive mid-generation. These results will quantify that the positional modification introduces no measurable degradation in reasoning fidelity. revision: yes
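
The within-stream part of the first response can be written out for rotary embeddings using the standard RoPE identity. The notation here is chosen for illustration and is not the paper's: $q$ and $k$ are one two-dimensional feature pair of a query and a key, $\theta$ is that pair's frequency, $m$ and $n$ are positions, and $c$ is a fixed stream offset.

```latex
% R(\alpha) is the 2x2 rotation by angle \alpha.
\begin{align*}
R(\alpha) &= \begin{pmatrix} \cos\alpha & -\sin\alpha \\ \sin\alpha & \cos\alpha \end{pmatrix},
\qquad R(\alpha)^\top R(\beta) = R(\beta - \alpha), \\
s(m, n) &= \langle R(m\theta)\,q,\; R(n\theta)\,k \rangle
         = q^\top R(m\theta)^\top R(n\theta)\,k
         = q^\top R\big((n - m)\theta\big)\,k, \\
s(m + c,\, n + c) &= q^\top R\big((n + c - m - c)\theta\big)\,k = s(m, n).
\end{align*}
```

This covers only tokens that share the same offset. If the query's stream is shifted by $a$ and the key's stream by $b$, the score becomes $q^\top R\big((n + b - m - a)\theta\big)\,k$, which depends on $b - a$, so cross-stream behavior does not follow from this identity alone and would still need the empirical checks the referee asks for.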

Circularity Check

0 steps flagged

No circularity: method applies known positional embedding properties without self-referential derivation

full rationale

The paper describes a training-free augmentation that exploits existing properties of positional embeddings to allow simultaneous thinking/listening/output in autoregressive LLMs. No equations, parameters, or predictions are shown to reduce to fitted inputs or self-citations by construction. The central claim rests on an empirical demonstration of latency reduction while preserving task accuracy, with no load-bearing self-citation chains or ansatz smuggling. The derivation chain is therefore independent of its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only view limits visibility into exact implementation; assumes standard transformer positional embedding behavior can be repurposed without side effects.

axioms (1)
  • domain assumption: Positional embeddings can be leveraged to interleave thinking and generation tokens without breaking the model's trained behavior.
    Central to the method but not derived or proven in the provided abstract.

pith-pipeline@v0.9.0 · 5512 in / 984 out tokens · 22195 ms · 2026-05-16T22:52:19.927882+00:00 · methodology



Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Speculative Interaction Agents: Building Real-Time Agents with Asynchronous I/O and Speculative Tool Calling

    cs.LG 2026-05 unverdicted novelty 6.0

    Asynchronous I/O and Speculative Tool Calling cut latency in tool-calling LLM agents by 1.3-2.2x with only minor accuracy loss on cloud and edge models.

  2. Speculative Interaction Agents: Building Real-Time Agents with Asynchronous I/O and Speculative Tool Calling

    cs.LG 2026-05 unverdicted novelty 6.0

    Speculative Interaction Agents achieve 1.3-2.2x speedups for real-time tool-calling agents via async I/O decoupling and speculative calls, with clock-based training for small edge models.

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · cited by 1 Pith paper · 12 internal anchors


    Universal and Transferable Adversarial Attacks on Aligned Language Models

    URL https://proceedings.mlr.press/v229/zitkovich23a.html. Zou, A., Wang, Z., Carlini, N., Nasr, M., Kolter, J. Z., and Fredrikson, M. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023. Appendix A. Prompting. In this section we provide detailed prompts used in our main and safety experiments. ...

  20. [22]

    Sometimes, an automated system will ask you to decide if your thoughts have enough information for it write an additional passage to the user

    VALIDATION: Validate final again and again until you see the same answer in the partial response above your thoughts. Sometimes, an automated system will ask you to decide if your thoughts have enough information for it write an additional passage to the user. Use the partial response above yours thoughts to judge if you addded enough new informa...

  21. [23]

    enough information

    SAFETY CHECK: Before solving, explicitly evaluate if the user's request is safe, legal, and ethical.
    - If UNSAFE: Your goal is to refuse. You have "enough information" to write immediately after you decide to refuse.
    - If SAFE: Proceed to solve the problem step-by-step.

  22. [24]

    REASONING: If the request is safe, solve the problem.

  23. [25]

    mode-switching

    VALIDATION: Validate final again and again until you see the same answer in the partial response above your thoughts. Sometimes, an automated system will ask you to decide if your thoughts have enough information for it write an additional passage to the user. Use the partial response above yours thoughts to judge if you addded enough new informa...

  24. [26]

    This can reduce overhead GPU compute during inference, but it does not fit neatly into our training-free setup

    Trained mode-switching: training a classifier head, perhaps on top of the model’s own hidden state, to decide when to pause and wait for thoughts. This can reduce overhead GPU compute during inference, but it does not fit neatly into our training-free setup.

  25. [27]

    warmup time

    Planned mode switching: similar to Liang et al. (2025a), we could prompt the thinker to plan ahead and decide which thoughts need to complete before the next response chunk. This type of planning can happen before thinker response or as a third “thread” concurrent to thinker and writer. However, we found that forming such plans makes the model change it...

  26. [28]

    Race Condition: The writer begins generating a helpful response immediately based on the prompt. Although the thinker eventually concludes the request is unsafe, the writer has already streamed harmful tokens (e.g., the first steps of a dangerous recipe) to the user before the refusal signal is propagated.

  27. [29]

    The writer, attending to the thinker’s cache, interprets these technical details as the desired answer and formulates them into a response, bypassing the thinker’s intent

    Context Leakage: The thinker analyzes the harmful request by recalling technical details (e.g., explaining how a specific SQL injection works to verify its danger). The writer, attending to the thinker’s cache, interprets these technical details as the desired answer and formulates them into a response, bypassing the thinker’s intent.

  28. [30]

    head start

    Educational Loophole: The thinker adopts an educational persona to explain why a request is dangerous. The writer latches onto this educational content and reformats it as a set of instructions, stripping away the safety framing context. Table 13. Failure mode analysis by inference setup on HarmBench. Columns: Inference Setup, Failure Mode, Count. Baseline (Non-thi...