pith. machine review for the scientific record.

arxiv: 2604.06427 · v1 · submitted 2026-04-07 · 💻 cs.LG · cs.AI · cs.CL

Recognition: no theorem link

The Depth Ceiling: On the Limits of Large Language Models in Discovering Latent Planning


Pith reviewed 2026-05-10 19:51 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords latent planning · LLM reasoning limits · chain-of-thought · graph path finding · scaling limits · internal representations

The pith

Large language models hit a depth ceiling when discovering multi-step latent planning strategies, even with massive scaling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests the boundaries of hidden reasoning in language models by asking whether they can discover and carry out multi-step plans entirely inside their own representations, without any supervision on the intermediate steps. To do this, the authors create graph path-finding problems that require a known number of planning operations to solve. They find that the largest models can discover strategies needing up to five latent steps when trained only on the final answer, while smaller models stop at three and prompting pushes some to seven. Once discovered, these strategies can be applied to problems that need one or two extra steps at test time. The result suggests that reliable deep planning will often require making the steps explicit rather than counting on internal execution alone.

Core claim

Using graph path-finding tasks that precisely control the number of required latent planning steps, the work shows that models discover latent strategies up to a depth of five when trained with final-answer supervision alone. Tiny transformers reach three steps, fine-tuned large models reach five, and few-shot prompting allows seven. The same discovered strategy then generalizes to eight steps at test time, revealing a clear separation between the depth at which a strategy can be found and the depth at which it can be executed once found.

What carries the argument

Graph path-finding tasks that control the exact number of latent planning steps required, allowing direct measurement of how many internal reasoning operations a model performs without external hints or intermediate supervision.
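
To make the task construction concrete, here is a minimal sketch of one way such an instance could be generated, modeled on the star graphs G(k, m) of Figure 1 (k branches of length m hanging off a shared center). The integer node identifiers, the edge-list prompt serialization, and the first-move label are illustrative assumptions, not the paper's exact format.

```python
import random

def make_star_graph_instance(k: int, m: int, n_ids: int = 1000, seed=None):
    """Generate one star-graph path-finding instance G(k, m).

    Hypothetical format: the prompt lists the shuffled edges plus a
    (source, target) query; the final-answer label is the first node on
    the unique path from the center toward the target leaf, so solving
    the task requires roughly m latent planning steps.
    """
    rng = random.Random(seed)
    ids = rng.sample(range(n_ids), 1 + k * m)  # distinct node identifiers
    center, rest = ids[0], ids[1:]
    branches = [rest[i * m:(i + 1) * m] for i in range(k)]

    edges = []
    for branch in branches:
        prev = center
        for node in branch:  # chain each branch outward from the center
            edges.append((prev, node))
            prev = node
    rng.shuffle(edges)  # shuffle so edge order leaks nothing about the answer

    goal = rng.randrange(k)
    source, target = center, branches[goal][-1]
    label = branches[goal][0]  # the correct first move out of the center

    prompt = " ".join(f"{a}>{b}" for a, b in edges) + f" | {source}->{target} ="
    return prompt, label

prompt, label = make_star_graph_instance(k=3, m=3, seed=0)
```

Because only the final-answer token is supervised, any intermediate path computation must happen inside the forward pass, which is what lets the depth m serve as the independent variable.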

If this is right

  • Planning strategies that require more than five coordinated latent steps must be taught explicitly or externalized rather than expected to emerge from final-answer training.
  • Chain-of-thought monitoring stays viable because models cannot perform arbitrarily deep hidden reasoning.
  • Scaling model size alone does not remove the limit on the depth of planning that can be discovered during training.
  • The gap between training-time discovery (five steps) and test-time execution (eight steps) shows that execution capacity exceeds discovery capacity.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar depth limits may appear in other multi-step reasoning domains such as mathematical proofs or program synthesis.
  • Hybrid systems that combine limited latent execution with external step verification could extend reliable planning beyond the observed ceiling.
  • New training objectives that reward discovery of deeper internal strategies might raise the limit without changing model size.

Load-bearing premise

That the graph path-finding tasks cleanly isolate latent planning depth without confounds from memorization, surface patterns, or other non-planning mechanisms that could allow models to succeed without performing the intended number of latent steps.

What would settle it

A model trained from scratch with only final-answer labels that reliably solves graph tasks requiring six or more latent planning steps would falsify the reported discovery ceiling.
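
As an illustration of what such a test could look like in practice, the sketch below scores a candidate model on freshly generated depth-6 instances, reusing the hypothetical make_star_graph_instance from the sketch above. The trial count and accuracy threshold are arbitrary editorial choices, and model_answer stands in for whatever from-scratch, final-answer-trained model is under test.

```python
def would_falsify_ceiling(model_answer, k=3, m=6, n_trials=1000, threshold=0.99):
    """Return (falsified, accuracy) for a depth-m evaluation run.

    `model_answer` maps a prompt string to the model's predicted label.
    """
    correct = 0
    for seed in range(n_trials):
        prompt, label = make_star_graph_instance(k, m, seed=seed)
        if model_answer(prompt) == label:
            correct += 1
    accuracy = correct / n_trials
    # Reliable success at m >= 6 latent steps, learned from final-answer
    # labels alone, would contradict the reported discovery ceiling of five.
    return accuracy >= threshold, accuracy
```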

Figures

Figures reproduced from arXiv: 2604.06427 by Laura Ruis, Philipp Jettkant, Yi Xu.

Figure 1
Figure 1: (Right) We study latent planning using path-finding on star graphs. Shown is an example of a star graph G(3,3) together with the corresponding input tokens and target label for training the transformer. (Left) Latent planning capacity (LPC) across models and evaluation settings. Despite scaling from a 1.6M-parameter transformer to GPT-4o, the maximum LPC discovered during training increases by only two steps. Ho… view at source ↗
Figure 2
Figure 2: Out-of-distribution (OOD) generalization of latent planning across depths. Each model is trained on a single configuration (k*, m*) (marked with †), and evaluated on unseen graphs at all depths without further training. All models achieve near-perfect skill at depths up to m*, and maintain high performance beyond the training depth, although this generalization systematically decays as the test d… view at source ↗
Figure 4
Figure 4: Training dynamics of the from-scratch transformer across planning depths m ∈ {3, 4} and branch factors k ∈ K. The model exhibits a two-stage learning process: validation accuracy first rises to the random baseline (1/k) as the model learns to predict valid neighbors, then either jumps to near-perfect accuracy (successful strategy discovery) or stagnates while training loss continues to decrease (overfitt… view at source ↗
Figure 5
Figure 5: Training dynamics of Qwen 2.5 models (7B and 32B) across graph configurations. view at source ↗
Figure 6
Figure 6: Training dynamics of Qwen 3 models (8B and 32B) across graph configurations. view at source ↗
Figure 7
Figure 7: Training loss under explicit chain-of-thought supervision. All models are trained with dense supervision that provides the full backtracking trace from v_target to v_source. Under this setting, all LLMs converge within approximately 20 training updates across all configurations, confirming that the task itself is not inherently difficult. The transformer also shows significant improvement, though its average ski… view at source ↗
Figure 8
Figure 8: Attention visualization for successful configurations (… view at source ↗
Figure 9
Figure 9: Attention visualization for failed configurations (… view at source ↗
read the original abstract

The viability of chain-of-thought (CoT) monitoring hinges on models being unable to reason effectively in their latent representations. Yet little is known about the limits of such latent reasoning in LLMs. We test these limits by studying whether models can discover multi-step planning strategies without supervision on intermediate steps and execute them latently, within a single forward pass. Using graph path-finding tasks that precisely control the number of required latent planning steps, we uncover a striking limitation unresolved by massive scaling: tiny transformers trained from scratch discover strategies requiring up to three latent steps, fine-tuned GPT-4o and Qwen3-32B reach five, and GPT-5.4 attains seven under few-shot prompting. Although the maximum latent planning depth models can learn during training is five, the discovered strategy generalizes up to eight latent steps at test-time. This reveals a dissociation between the ability to discover a latent strategy under final-answer supervision alone and the ability to execute it once discovered. If similar limits hold more broadly, strategies requiring multiple coordinated latent planning steps may need to be explicitly taught or externalized, lending credence to CoT monitoring.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents an empirical study on the limits of latent multi-step planning in LLMs using graph path-finding tasks designed to control the required planning depth. Key findings include depth ceilings of 3 for tiny transformers trained from scratch, 5 for fine-tuned GPT-4o and Qwen3-32B, and 7 for GPT-5.4 with few-shot prompting, alongside a dissociation where strategies learned up to depth 5 generalize to 8 at test time. The work argues this supports the need for explicit supervision in complex planning.

Significance. If the experimental isolation holds, the findings would be significant for bounding internal reasoning capabilities in LLMs and informing CoT monitoring approaches in AI safety. The controlled task design and scaling comparisons across model sizes are strengths that could falsify assumptions about unbounded latent planning.

major comments (2)
  1. [Abstract and methods] Task construction: The claim that graph path-finding tasks 'precisely control' the number of required latent planning steps is load-bearing for all reported ceilings and the discovery-execution dissociation. No explicit argument or verification is provided that generated instances cannot be solved via surface statistics, repeated substructures, or partial memorization rather than by exactly d on-the-fly latent operations in a single forward pass.
  2. [Results] Results on depth ceilings and generalization: The reported maxima (3/5/7) and the training limit of 5 with test generalization to 8 require confirmation that success metrics isolate latent steps (e.g., via ablation on graph features or step-by-step tracing). Without this, the dissociation result cannot be attributed to planning depth rather than other mechanisms.
minor comments (2)
  1. [Results] Include error bars, number of trials, and statistical tests for all reported depth limits and success rates to support the specific numerical claims.
  2. [Methods] Clarify the operational definition of a 'latent planning step' and how it is measured or inferred from model behavior in the single forward pass.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their constructive comments, which have helped us improve the clarity and rigor of our work. We address each major comment below, and have revised the manuscript accordingly to provide additional verification of our task design and results.

read point-by-point responses
  1. Referee: [Abstract and methods] Task construction: The claim that graph path-finding tasks 'precisely control' the number of required latent planning steps is load-bearing for all reported ceilings and the discovery-execution dissociation. No explicit argument or verification is provided that generated instances cannot be solved via surface statistics, repeated substructures, or partial memorization rather than by exactly d on-the-fly latent operations in a single forward pass.

    Authors: We acknowledge the importance of ruling out alternative mechanisms. The manuscript's Methods section describes the use of randomly generated graphs with controlled shortest path lengths, where nodes and edges are assigned unique identifiers to minimize repeated substructures. To address this concern explicitly, we have added a new subsection in the Methods detailing why surface statistics are insufficient (e.g., no fixed patterns across instances) and included an ablation study in the appendix. In this study, we perturb the graphs by randomizing edge connections while preserving path lengths and observe a sharp drop in performance, supporting that models rely on computing the specific path rather than memorized patterns. We have also tested on out-of-distribution graph topologies. These additions provide the requested verification. revision: yes
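
For concreteness, one way to read the rebuttal's perturbation is as a relabeling that preserves every source-to-target distance while destroying any memorized identifier pairings. The sketch below implements that reading against the hypothetical instance format above; it is an editorial interpretation, not the authors' code.

```python
import random

def relabel_instance(edges, source, target, label, seed=None):
    """Perturb an instance by giving every node a fresh identifier.

    Path lengths are untouched because the graph's shape is unchanged;
    only the identities of the nodes (and hence any memorized surface
    patterns tied to specific IDs) are randomized.
    """
    rng = random.Random(seed)
    nodes = sorted({v for edge in edges for v in edge})
    fresh = rng.sample(range(1000, 2000), len(nodes))  # disjoint ID range
    mapping = dict(zip(nodes, fresh))
    new_edges = [(mapping[a], mapping[b]) for a, b in edges]
    rng.shuffle(new_edges)
    return new_edges, mapping[source], mapping[target], mapping[label]
```

Here the label moves with the relabeled graph, so success on relabeled instances is evidence that the model computes the path; scoring against the stale original labels instead would probe reliance on memorized identifiers.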

  2. Referee: [Results] Results on depth ceilings and generalization: The reported maxima (3/5/7) and the training limit of 5 with test generalization to 8 require confirmation that success metrics isolate latent steps (e.g., via ablation on graph features or step-by-step tracing). Without this, the dissociation result cannot be attributed to planning depth rather than other mechanisms.

    Authors: We agree that isolating the contribution of latent planning depth is essential. As noted in our response to the first comment, we have incorporated ablations on graph features, including removing specific node attributes and using graphs with varying connectivity, which confirm that performance correlates with the controlled depth rather than other features. For step-by-step tracing, direct observation of latent steps is inherently limited in black-box models like LLMs; however, we provide supporting evidence through error patterns that align with depth limits (e.g., failures occur at specific depths) and the observed dissociation, which would not be expected under alternative mechanisms like partial memorization. We have expanded the Results and Discussion sections to include this analysis and a more detailed justification. revision: partial

standing simulated objections not resolved
  • Direct step-by-step tracing of internal latent operations is not feasible with current methods without significant additional interpretability work, which is beyond the empirical focus of this paper.

Circularity Check

0 steps flagged

No circularity: empirical measurement of observed model limits

full rationale

The paper conducts controlled experiments on graph path-finding tasks to measure the maximum latent planning depth achievable by different LLMs under training and prompting regimes. No mathematical derivation chain, fitted parameters renamed as predictions, or self-citation load-bearing steps are present; the reported ceilings (3/5/7 steps) are direct empirical observations from model behavior on generated instances rather than quantities constructed from the task definition itself. The isolation of latent steps is an experimental design choice whose validity can be externally tested, not a self-referential loop.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the empirical construction of graph tasks that are assumed to require a controllable number of latent planning steps. No free parameters are introduced beyond standard model training; the depth numbers are measured outcomes.

axioms (1)
  • domain assumption: Graph path-finding tasks can be constructed so that success requires a precise, controllable number of latent planning steps.
    This assumption allows the authors to vary depth as the independent variable.

pith-pipeline@v0.9.0 · 5503 in / 1327 out tokens · 87890 ms · 2026-05-10T19:51:50.310278+00:00 · methodology


Reference graph

Works this paper leans on

35 extracted references · 23 canonical work pages · 12 internal anchors

  1. [1]

    The pitfalls of next-token prediction

    Gregor Bachmann and Vaishnavh Nagarajan. The pitfalls of next-token prediction. In Forty-first International Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=76zq8Wkl6Z

  2. [2]

    Monitoring reasoning models for misbehavior and the risks of promoting obfuscation

    Bowen Baker, Joost Huizinga, Leo Gao, Zehao Dou, Melody Y Guan, Aleksander Madry, Wojciech Zaremba, Jakub Pachocki, and David Farhi. Monitoring reasoning models for misbehavior and the risks of promoting obfuscation. arXiv preprint arXiv:2503.11926, 2025

  3. [3]

    Lessons from studying two-hop latent reasoning

    Mikita Balesni, Tomek Korbak, and Owain Evans. Lessons from studying two-hop latent reasoning. arXiv preprint arXiv:2411.16353, 2024

  4. [4]

    Relational inductive biases, deep learning, and graph networks

    Peter W Battaglia, Jessica B Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018

  5. [5]

    Taken out of context: On measuring situational awareness in LLMs

    Lukas Berglund, Asa Cooper Stickland, Mikita Balesni, Max Kaufmann, Meg Tong, Tomasz Korbak, Daniel Kokotajlo, and Owain Evans. Taken out of context: On measuring situational awareness in llms. arXiv preprint arXiv:2309.00667, 2023

  6. [6]

    LlamaFirewall: An open source guardrail system for building secure AI agents

    Sahana Chennabasappa, Cyrus Nikolaidis, Daniel Song, David Molnar, Stephanie Ding, Shengye Wan, Spencer Whitman, Lauren Deason, Nicholas Doucette, Abraham Montilla, et al. Llamafirewall: An open source guardrail system for building secure ai agents. arXiv preprint arXiv:2505.03574, 2025

  7. [7]

    DeepSeek-R1 Incentivizes Reasoning in LLMs Through Reinforcement Learning

    DeepSeek-AI. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning. Nature, 645(8081):633–638, September 2025. ISSN 1476-4687. doi:10.1038/s41586-025-09422-z. URL http://dx.doi.org/10.1038/s41586-025-09422-z

  8. [8]

    Implicit chain of thought reasoning via knowledge distillation

    Yuntian Deng, Kiran Prasad, Roland Fernandez, Paul Smolensky, Vishrav Chaudhary, and Stuart Shieber. Implicit chain of thought reasoning via knowledge distillation. arXiv preprint arXiv:2311.01460, 2023

  9. [9]

    From explicit cot to implicit cot: Learning to internalize cot step by step

    Yuntian Deng, Yejin Choi, and Stuart Shieber. From explicit cot to implicit cot: Learning to internalize cot step by step. arXiv preprint arXiv:2405.14838, 2024

  10. [10]

    When chain of thought is necessary, language models struggle to evade monitors

    Scott Emmons, Erik Jenner, David K Elson, Rif A Saurous, Senthooran Rajamanoharan, Heng Chen, Irhum Shafkat, and Rohin Shah. When chain of thought is necessary, language models struggle to evade monitors. arXiv preprint arXiv:2507.05246, 2025

  11. [11]

    Alignment faking in large language models

    Ryan Greenblatt, Carson Denison, Benjamin Wright, Fabien Roger, Monte MacDiarmid, Sam Marks, Johannes Treutlein, Tim Belonax, Jack Chen, David Duvenaud, et al. Alignment faking in large language models, 2024a. URL https://arxiv.org/abs/2412.14093

  12. [12]

    AI control: Improving safety despite intentional subversion

    Ryan Greenblatt, Buck Shlegeris, Kshitij Sachan, and Fabien Roger. AI control: Improving safety despite intentional subversion. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp (eds.), Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings ...

  13. [13]

    Training Large Language Models to Reason in a Continuous Latent Space

    Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, and Yuandong Tian. Training large language models to reason in a continuous latent space. arXiv preprint arXiv:2412.06769, 2024

  14. [14]

    Gaussian Error Linear Units (GELUs)

    Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415, 2016

  15. [15]

    Adam: A Method for Stochastic Optimization

    Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014

  16. [16]

    Large language models are zero-shot reasoners

    Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=e2TBb5y0yFf

  17. [17]

    Chain of thought monitorability: A new and fragile opportunity for AI safety

    Tomek Korbak, Mikita Balesni, Elizabeth Barnes, Yoshua Bengio, Joe Benton, Joseph Bloom, Mark Chen, Alan Cooney, Allan Dafoe, Anca Dragan, et al. Chain of thought monitorability: A new and fragile opportunity for ai safety. arXiv preprint arXiv:2507.11473, 2025

  18. [18]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017

  19. [19]

    Introducing docent, 2025

    Kevin Meng, Vincent Huang, Jacob Steinhardt, and Sarah Schwettmann. Introducing docent, 2025

  20. [20]

    In-context Learning and Induction Heads

    Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, et al. In-context learning and induction heads. arXiv preprint arXiv:2209.11895, 2022

  21. [21]

    GPT-4 Technical Report

    OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  22. [22]

    OpenAI o1 System Card

    OpenAI. Openai o1 system card. arXiv preprint arXiv:2412.16720, 2024

  23. [23]

    OpenAI GPT-5 System Card

    OpenAI. Openai gpt-5 system card. arXiv preprint arXiv:2601.03267, 2025

  24. [24]

    Language models are unsupervised multitask learners

    Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019

  25. [25]

    Transformers struggle to learn to search

    Abulhair Saparov, Srushti Ajay Pawar, Shreyas Pimpalgaonkar, Nitish Joshi, Richard Yuanzhe Pang, Vishakh Padmakumar, Mehran Kazemi, Najoung Kim, and He He. Transformers struggle to learn to search. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=9cQB1Hwrtw

  26. [26]

    Next-latent prediction transformers learn compact world models

    Jayden Teoh, Manan Tomar, Kwangjun Ahn, Edward S Hu, Pratyusha Sharma, Riashat Islam, Alex Lamb, and John Langford. Next-latent prediction transformers learn compact world models. arXiv preprint arXiv:2511.05963, 2025

  27. [27]

    On the planning abilities of large language models - a critical investigation

    Karthik Valmeekam, Matthew Marquez, Sarath Sreedharan, and Subbarao Kambhampati. On the planning abilities of large language models - a critical investigation. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=X6dEqXIsEW

  28. [28]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS '22, Red Hook, NY, USA, 2022. Curran Associates Inc. ISBN 9781713871088

  29. [29]

    Qwen2.5 Technical Report

    An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tingyu X...

  30. [30]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

  31. [31]

    Judging LLM-as-a-judge with MT-Bench and Chatbot Arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric. P Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023
