The Geometric Reasoner: Manifold-Informed Latent Foresight Search for Long-Context Reasoning

Ben Wang; Ren Zhuang; Shuifa Sun

arxiv: 2601.18832 · v3 · submitted 2026-01-25 · 💻 cs.LG · cs.AI

The Geometric Reasoner: Manifold-Informed Latent Foresight Search for Long-Context Reasoning

Ren Zhuang , Ben Wang , Shuifa Sun This is my paper

Pith reviewed 2026-05-16 10:40 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords long-context reasoningchain-of-thoughttest-time computelatent foresight searchgeometric regularizerstraining-free inferencetrajectory coverageKV cache management

0 comments

The pith

The Geometric Reasoner improves long chain-of-thought coverage by scoring latent anchors with look-ahead estimates and geometric regularizers at each chunk boundary.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents The Geometric Reasoner (TGR) as a training-free method to resolve the cost-coverage trade-off in scaling test-time compute for extended reasoning chains. It performs manifold-informed latent foresight search, where candidate anchors receive scores from a lightweight look-ahead estimate plus soft geometric regularizers that favor smooth trajectories and broader exploration. Chunk-wise KV cache resets keep memory linear in chunk length rather than quadratic in total context. On math and code benchmarks the approach raises the area under the Pass@k curve by up to 13 points on an 8B model while adding only modest overhead. The central claim is that these geometric constraints on latent trajectories deliver higher robust coverage without any model training.

Core claim

TGR is a training-free framework that performs manifold-informed latent foresight search under strict memory bounds. At each chunk boundary it scores candidate latent anchors via a lightweight look-ahead estimate combined with soft geometric regularizers that encourage smooth trajectories and diverse exploration, then resets the KV cache chunk-wise to keep memory linear in chunk length. On challenging math and code benchmarks this yields up to 13-point gains in area under the Pass@k curve with 1.1–1.3 times overhead.

What carries the argument

Manifold-informed latent foresight search that scores candidate latent anchors at chunk boundaries using a lightweight look-ahead estimate plus soft geometric regularizers for smoothness and diversity.

If this is right

Higher robust coverage becomes available on existing models without retraining.
Memory stays linear in chunk length rather than growing with total context length.
Exploration improves while redundant trajectories decrease under fixed compute budgets.
The same scoring mechanism can be applied at inference time to any base model that supports chunked KV caching.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The regularizers could be tuned further to target specific failure modes such as repetitive loops in code generation.
If the latent manifold structure generalizes across domains, similar scoring could apply to long-horizon planning tasks outside math and code.
Combining the method with modest distillation of the look-ahead scorer might reduce the 1.1–1.3 overhead while preserving coverage gains.

Load-bearing premise

That scoring latent anchors with a lightweight look-ahead estimate and soft geometric regularizers will produce measurably higher trajectory coverage without any training or high extra cost.

What would settle it

An experiment showing no AUC gain, or memory usage exceeding linear scaling, when the same models run the method on the reported math and code benchmarks.

Figures

Figures reproduced from arXiv: 2601.18832 by Ben Wang, Ren Zhuang, Shuifa Sun.

**Figure 1.** Figure 1: TGR-Latent consistently outperforms baselines on Qwen3-8B. Manifold-informed latent foresight search steering converts modest inference-time compute into robust coverage without weight updates. et al., 2024; Zheng et al., 2025). While effective for singlesample accuracy, these methods demand substantial training compute and can collapse the trajectory distribution (Yue et al., 2025; Srivastava & Aggarwal… view at source ↗

**Figure 2.** Figure 2: Overview of reasoning frameworks. Unlike (a) test-time sampling, exploring trajectories without explicit structure, or (b) reinforcement learning, which internalizes preferences through costly training, (c) TGR introduces a training-free inference-time search over the latent manifold. It selects optimal chunk-level anchors via a soft geometric score combining foresight, bumpiness, and uniformity, then inje… view at source ↗

**Figure 3.** Figure 3: TGR dominates the inference efficiency frontier. Left: Pass@k curves on MATH500 reveal that TGR-Latent sustains marginal gains beyond k = 32 where baselines plateau. Middle & Right: On the cost–robustness plane, TGR-Latent occupies the upper-left corner, achieving the highest AUC at moderate token cost on both math and code benchmarks. clusion that soft geometric scoring improves the conversion rate from i… view at source ↗

**Figure 4.** Figure 4: Left: Latent-space mode diversity. RL-tuned baselines collapse into a unimodal cone, while TGR preserves a well-dispersed distribution, capturing a fuller range of valid reasoning paths. Right: Hyperparameter robustness. AUC increases with rollout depth s and beam width K, but with diminishing returns. ploration (Yue et al., 2025; Srivastava & Aggarwal, 2025). Training-time architectural regularization emb… view at source ↗

**Figure 5.** Figure 5: Training stage modulates inference-time controllability. TGR-Latent yields substantially larger gains on the SFT model (top), while improvement narrows after RL optimization (bottom), suggesting that inference-time search benefits models whose trajectory distribution retains residual flexibility. 6. Conclusion We introduced TGR, a training-free inference-time framework that steers long-horizon reasoning … view at source ↗

read the original abstract

Scaling test-time compute enhances long chain-of-thought (CoT) reasoning, yet existing approaches face a fundamental trade-off between computational cost and coverage quality: either incurring high training expense or yielding redundant trajectories. We introduce The Geometric Reasoner (TGR), a training-free framework that performs manifold-informed latent foresight search under strict memory bounds. At each chunk boundary, TGR scores candidate latent anchors via a lightweight look-ahead estimate combined with soft geometric regularizers that encourage smooth trajectories and diverse exploration. Chunk-wise KV cache resets keep memory linear in chunk length. On challenging math and code benchmarks, TGR improves robust trajectory coverage, measured by the area under the Pass@k curve (AUC), by up to 13 points on Qwen3-8B, with negligible overhead of about 1.1--1.3 times.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The manuscript introduces The Geometric Reasoner (TGR), a training-free framework for long-context reasoning that performs manifold-informed latent foresight search. At chunk boundaries it scores candidate latent anchors with a lightweight look-ahead estimate plus soft geometric regularizers, while using chunk-wise KV-cache resets to enforce linear memory scaling. The central empirical claim is that this yields up to 13-point gains in AUC under the Pass@k curve on math and code benchmarks (e.g., Qwen3-8B) at 1.1–1.3× overhead.

Significance. If the reported coverage gains prove reproducible, the result would be significant: it supplies a concrete, training-free mechanism that improves trajectory diversity without the usual training or quadratic-memory penalties, directly addressing the cost-coverage trade-off in test-time scaling for long CoT. The geometric-regularizer formulation on latent anchors is a distinctive technical contribution that could be adopted or extended by other inference-time search methods.

major comments (3)

[§4.2] §4.2 (Latent Foresight Search): the scoring function that combines the look-ahead estimate with the soft geometric regularizers is described only qualitatively; no explicit equation or pseudocode is given for the anchor selection criterion, which is load-bearing for both the reproducibility of the 13-point AUC claim and the assertion of negligible overhead.
[§5.1] §5.1 and Table 2: the reported AUC improvements (up to 13 points) are presented without standard deviations, confidence intervals, or statistical significance tests across the N runs, undermining the robustness claim that is central to the paper’s contribution.
[§5.3] §5.3 (Implementation Details): hyperparameters of the geometric regularizers and the precise form of the lightweight look-ahead estimator are omitted, making it impossible to verify that the method truly operates without hidden training cost or post-hoc tuning.

minor comments (3)

[Abstract] Abstract: the term 'manifold-informed' is introduced without a brief parenthetical gloss, which would help readers unfamiliar with the geometric framing.
[Figure 3] Figure 3 caption: axis labels and legend entries are too small to read at standard print size; enlarge or simplify.
[Related Work] Related Work section: citation to recent test-time scaling papers (e.g., on latent-space search) is sparse; adding two or three key references would strengthen context.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the positive assessment of the work's significance and for the detailed comments on reproducibility. We address each major point below and will revise the manuscript to incorporate the requested clarifications and additions.

read point-by-point responses

Referee: [§4.2] §4.2 (Latent Foresight Search): the scoring function that combines the look-ahead estimate with the soft geometric regularizers is described only qualitatively; no explicit equation or pseudocode is given for the anchor selection criterion, which is load-bearing for both the reproducibility of the 13-point AUC claim and the assertion of negligible overhead.

Authors: We agree that the scoring function requires an explicit formulation. In the revised manuscript we will add a precise equation in §4.2 that defines the anchor selection criterion as the sum of the lightweight look-ahead estimate and the weighted soft geometric regularizers. We will also include pseudocode for the full latent foresight search step at chunk boundaries. These additions will make the 13-point AUC claim and the overhead analysis fully reproducible while preserving the original method. revision: yes
Referee: [§5.1] §5.1 and Table 2: the reported AUC improvements (up to 13 points) are presented without standard deviations, confidence intervals, or statistical significance tests across the N runs, undermining the robustness claim that is central to the paper’s contribution.

Authors: We acknowledge the omission of variability measures. Although multiple random seeds were used in the reported experiments, standard deviations and confidence intervals were not included. In the revision we will update §5.1 and Table 2 with these statistics together with the results of paired significance tests across runs. This will directly strengthen the robustness claim without changing the reported AUC gains. revision: yes
Referee: [§5.3] §5.3 (Implementation Details): hyperparameters of the geometric regularizers and the precise form of the lightweight look-ahead estimator are omitted, making it impossible to verify that the method truly operates without hidden training cost or post-hoc tuning.

Authors: We agree that these implementation details must be supplied. The revised §5.3 will list the exact hyperparameter values (weighting coefficients for the smoothness and diversity regularizers) and the closed-form expression for the look-ahead estimator. All values match those used to obtain the reported results, confirming the training-free nature and the 1.1–1.3× overhead. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper introduces a training-free framework whose core components (lightweight look-ahead scoring, soft geometric regularizers, chunk-wise KV resets) are presented as explicit engineering choices. Performance is measured directly via AUC on external math and code benchmarks with no internal derivation that reduces a claimed prediction to a fitted parameter or self-citation by construction. No load-bearing step equates the output to the input via definition or renaming; the empirical gains are tested against independent benchmarks rather than derived tautologically.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the domain assumption of manifold structure in latent representations and the introduction of latent anchors as a new entity for the search process.

axioms (1)

domain assumption Latent space of language models admits a manifold structure suitable for geometric regularization to encourage smooth and diverse trajectories.
Invoked to justify the use of soft geometric regularizers at chunk boundaries.

invented entities (1)

latent anchors no independent evidence
purpose: Candidate points in latent space for foresight search and scoring.
New concept introduced as part of the search mechanism without external validation mentioned.

pith-pipeline@v0.9.0 · 5442 in / 1136 out tokens · 27752 ms · 2026-05-16T10:40:27.519737+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

scores candidate latent anchors via a lightweight look-ahead estimate combined with soft geometric regularizers that encourage smooth trajectories and diverse exploration
IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean J_uniquely_calibrated_via_higher_derivative unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Score(a;c, z) = Vfore(a;c) − λb Pbum(a;c) − λu Puni(a;z)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering
cs.AI 2026-04 unverdicted novelty 6.0

HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · cited by 1 Pith paper · 21 internal anchors

[1]

Phi-4-reasoning Technical Report

Abdin, M., Agarwal, S., Awadallah, A., Balachandran, V ., Behl, H., Chen, L., de Rosa, G., Gunasekar, S., Javaheripi, M., Joshi, N., et al. Phi-4-reasoning technical report. arXiv preprint arXiv:2504.21318,

work page internal anchor Pith review arXiv
[2]

The Bayesian Geometry of Transformer Attention

URL https: //arxiv.org/abs/2512.22471. Aghajohari, M., Chitsaz, K., Kazemnejad, A., Chandar, S., Sordoni, A., Courville, A., and Reddy, S. The markovian thinker: Architecture-agnostic linear scaling of reasoning. arXiv preprint arXiv:2510.06557,

work page internal anchor Pith review arXiv
[3]

Longformer: The Long-Document Transformer

Beltagy, I., Peters, M. E., and Cohan, A. Long- former: The long-document transformer.arXiv preprint arXiv:2004.05150,

work page internal anchor Pith review Pith/arXiv arXiv 2004
[4]

A Conceptual Introduction to Hamiltonian Monte Carlo

Betancourt, M. A conceptual introduction to hamiltonian monte carlo.arXiv preprint arXiv:1701.02434,

work page internal anchor Pith review Pith/arXiv arXiv
[5]

D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in neural information processing systems, 33: 1877–1901,

work page 1901
[6]

Evaluating Large Language Models Trained on Code

Chen, M. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374,

work page internal anchor Pith review Pith/arXiv arXiv
[7]

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

Dao, T. Flashattention-2: Faster attention with bet- ter parallelism and work partitioning.arXiv preprint arXiv:2307.08691,

work page internal anchor Pith review Pith/arXiv arXiv
[8]

The curved spacetime of transformer architectures.arXiv preprint arXiv:2511.03060,

Di Sipio, R., Diaz-Rodriguez, J., and Serrano, L. The curved spacetime of transformer architectures.arXiv preprint arXiv:2511.03060,

work page arXiv
[9]

Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models

Gao, B., Song, F., Yang, Z., Cai, Z., Miao, Y ., Dong, Q., Li, L., Ma, C., Chen, L., Xu, R., et al. Omni-math: A universal olympiad level mathematic benchmark for large language models.arXiv preprint arXiv:2410.07985,

work page internal anchor Pith review Pith/arXiv arXiv
[10]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al. Deepseek-r1: In- centivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948,

work page internal anchor Pith review Pith/arXiv arXiv
[11]

Measuring Mathematical Problem Solving With the MATH Dataset

Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., and Steinhardt, J. Measuring math- ematical problem solving with the math dataset.arXiv preprint arXiv:2103.03874,

work page internal anchor Pith review Pith/arXiv arXiv
[12]

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

Jain, N., Han, K., Gu, A., Li, W.-D., Yan, F., Zhang, T., Wang, S., Solar-Lezama, A., Sen, K., and Stoica, I. 9 TGR: Manifold-Informed Latent Foresight Search Livecodebench: Holistic and contamination free eval- uation of large language models for code.arXiv preprint arXiv:2403.07974,

work page internal anchor Pith review Pith/arXiv arXiv
[13]

LaDiR: Latent Diffusion Enhances LLMs for Text Reasoning

Kang, H., Zhang, Y ., Kuang, N. L., Majamaki, N., Jaitly, N., Ma, Y .-A., and Qin, L. Ladir: Latent diffusion enhances llms for text reasoning.arXiv preprint arXiv:2510.04573,

work page internal anchor Pith review Pith/arXiv arXiv
[14]

Reasoning with Sampling: Your Base Model is Smarter Than You Think

Karan, A. and Du, Y . Reasoning with sampling: Your base model is smarter than you think.arXiv preprint arXiv:2510.14901,

work page internal anchor Pith review arXiv
[15]

Muennighoff, N., Yang, Z., Shi, W., Li, X

URL https://huggingface.co/ datasets/math-ai/aime25. Muennighoff, N., Yang, Z., Shi, W., Li, X. L., Fei-Fei, L., Hajishirzi, H., Zettlemoyer, L., Liang, P., Cand `es, E., and Hashimoto, T. B. s1: Simple test-time scaling. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 20286– 20332,

work page 2025
[16]

Simko: Simple pass@ k policy optimization.arXiv preprint arXiv:2510.14807, 2025

Peng, R., Ren, Y ., Yu, Z., Liu, W., and Wen, Y . Simko: Simple pass@ k policy optimization.arXiv preprint arXiv:2510.14807,

work page arXiv
[17]

Proximal Policy Optimization Algorithms

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347,

work page internal anchor Pith review Pith/arXiv arXiv
[18]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y ., Wu, Y ., et al. Deepseekmath: Push- ing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300,

work page internal anchor Pith review Pith/arXiv arXiv
[19]

Srivastava, S. S. and Aggarwal, V . A technical survey of rein- forcement learning techniques for large language models. arXiv preprint arXiv:2507.04136,

work page arXiv
[20]

Carbon: Cali- brated best-of-n sampling improves test-time reasoning

Tang, Y .-C., Chen, P.-Y ., and Cavallaro, A. Carbon: Cali- brated best-of-n sampling improves test-time reasoning. arXiv preprint arXiv:2510.15674,

work page arXiv
[21]

Reasoning aware self-consistency: Leveraging reasoning paths for efficient llm sampling

Wan, G., Wu, Y ., Chen, J., and Li, S. Reasoning aware self-consistency: Leveraging reasoning paths for efficient llm sampling. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Associa- tion for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 3613–3635,

work page 2025
[22]

Self-Consistency Improves Chain of Thought Reasoning in Language Models

Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., and Zhou, D. Self-consistency im- proves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171,

work page internal anchor Pith review Pith/arXiv arXiv
[23]

mHC: Manifold-Constrained Hyper-Connections

URLhttps://arxiv.org/abs/2512.24880. Xu, F., Hao, Q., Zong, Z., Wang, J., Zhang, Y ., Wang, J., Lan, X., Gong, J., Ouyang, T., Meng, F., et al. To- wards large reasoning models: A survey of reinforced reasoning with large language models.arXiv preprint arXiv:2501.09686,

work page internal anchor Pith review Pith/arXiv arXiv
[24]

Qwen3 Technical Report

Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388,

work page internal anchor Pith review Pith/arXiv arXiv
[25]

Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

10 TGR: Manifold-Informed Latent Foresight Search Yue, Y ., Chen, Z., Lu, R., Zhao, A., Wang, Z., Song, S., and Huang, G. Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model?arXiv preprint arXiv:2504.13837,

work page internal anchor Pith review Pith/arXiv arXiv
[26]

A Survey of Reinforcement Learning for Large Reasoning Models

Zhang, K., Zuo, Y ., He, B., Sun, Y ., Liu, R., Jiang, C., Fan, Y ., Tian, K., Jia, G., Li, P., et al. A survey of reinforcement learning for large reasoning models.arXiv preprint arXiv:2509.08827, 2025a. Zhang, Q., Lyu, F., Sun, Z., Wang, L., Zhang, W., Hua, W., Wu, H., Guo, Z., Wang, Y ., Muennighoff, N., et al. A survey on test-time scaling in large la...

work page internal anchor Pith review arXiv
[27]

Group Sequence Policy Optimization

Zheng, C., Liu, S., Li, M., Chen, X.-H., Yu, B., Gao, C., Dang, K., Liu, Y ., Men, R., Yang, A., et al. Group sequence policy optimization.arXiv preprint arXiv:2507.18071,

work page internal anchor Pith review Pith/arXiv arXiv
[28]

Pan, Zhangyang Wang, Yuandong Tian, and Kai Sheng Tai

Zhu, H., Zhang, Z., Huang, H., Su, D., Liu, Z., Zhao, J., Fedorov, I., Pirsiavash, H., Sha, Z., Lee, J., et al. The path not taken: Rlvr provably learns off the principals.arXiv preprint arXiv:2511.08567,

work page arXiv
[29]

BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions

Zhuo, T. Y ., Vu, M. C., Chim, J., Hu, H., Yu, W., Widyasari, R., Yusuf, I. N. B., Zhan, H., He, J., Paul, I., et al. Big- codebench: Benchmarking code generation with diverse function calls and complex instructions.arXiv preprint arXiv:2406.15877,

work page internal anchor Pith review Pith/arXiv arXiv

[1] [1]

Phi-4-reasoning Technical Report

Abdin, M., Agarwal, S., Awadallah, A., Balachandran, V ., Behl, H., Chen, L., de Rosa, G., Gunasekar, S., Javaheripi, M., Joshi, N., et al. Phi-4-reasoning technical report. arXiv preprint arXiv:2504.21318,

work page internal anchor Pith review arXiv

[2] [2]

The Bayesian Geometry of Transformer Attention

URL https: //arxiv.org/abs/2512.22471. Aghajohari, M., Chitsaz, K., Kazemnejad, A., Chandar, S., Sordoni, A., Courville, A., and Reddy, S. The markovian thinker: Architecture-agnostic linear scaling of reasoning. arXiv preprint arXiv:2510.06557,

work page internal anchor Pith review arXiv

[3] [3]

Longformer: The Long-Document Transformer

Beltagy, I., Peters, M. E., and Cohan, A. Long- former: The long-document transformer.arXiv preprint arXiv:2004.05150,

work page internal anchor Pith review Pith/arXiv arXiv 2004

[4] [4]

A Conceptual Introduction to Hamiltonian Monte Carlo

Betancourt, M. A conceptual introduction to hamiltonian monte carlo.arXiv preprint arXiv:1701.02434,

work page internal anchor Pith review Pith/arXiv arXiv

[5] [5]

D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in neural information processing systems, 33: 1877–1901,

work page 1901

[6] [6]

Evaluating Large Language Models Trained on Code

Chen, M. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374,

work page internal anchor Pith review Pith/arXiv arXiv

[7] [7]

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

Dao, T. Flashattention-2: Faster attention with bet- ter parallelism and work partitioning.arXiv preprint arXiv:2307.08691,

work page internal anchor Pith review Pith/arXiv arXiv

[8] [8]

The curved spacetime of transformer architectures.arXiv preprint arXiv:2511.03060,

Di Sipio, R., Diaz-Rodriguez, J., and Serrano, L. The curved spacetime of transformer architectures.arXiv preprint arXiv:2511.03060,

work page arXiv

[9] [9]

Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models

Gao, B., Song, F., Yang, Z., Cai, Z., Miao, Y ., Dong, Q., Li, L., Ma, C., Chen, L., Xu, R., et al. Omni-math: A universal olympiad level mathematic benchmark for large language models.arXiv preprint arXiv:2410.07985,

work page internal anchor Pith review Pith/arXiv arXiv

[10] [10]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al. Deepseek-r1: In- centivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948,

work page internal anchor Pith review Pith/arXiv arXiv

[11] [11]

Measuring Mathematical Problem Solving With the MATH Dataset

Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., and Steinhardt, J. Measuring math- ematical problem solving with the math dataset.arXiv preprint arXiv:2103.03874,

work page internal anchor Pith review Pith/arXiv arXiv

[12] [12]

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

Jain, N., Han, K., Gu, A., Li, W.-D., Yan, F., Zhang, T., Wang, S., Solar-Lezama, A., Sen, K., and Stoica, I. 9 TGR: Manifold-Informed Latent Foresight Search Livecodebench: Holistic and contamination free eval- uation of large language models for code.arXiv preprint arXiv:2403.07974,

work page internal anchor Pith review Pith/arXiv arXiv

[13] [13]

LaDiR: Latent Diffusion Enhances LLMs for Text Reasoning

Kang, H., Zhang, Y ., Kuang, N. L., Majamaki, N., Jaitly, N., Ma, Y .-A., and Qin, L. Ladir: Latent diffusion enhances llms for text reasoning.arXiv preprint arXiv:2510.04573,

work page internal anchor Pith review Pith/arXiv arXiv

[14] [14]

Reasoning with Sampling: Your Base Model is Smarter Than You Think

Karan, A. and Du, Y . Reasoning with sampling: Your base model is smarter than you think.arXiv preprint arXiv:2510.14901,

work page internal anchor Pith review arXiv

[15] [15]

Muennighoff, N., Yang, Z., Shi, W., Li, X

URL https://huggingface.co/ datasets/math-ai/aime25. Muennighoff, N., Yang, Z., Shi, W., Li, X. L., Fei-Fei, L., Hajishirzi, H., Zettlemoyer, L., Liang, P., Cand `es, E., and Hashimoto, T. B. s1: Simple test-time scaling. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 20286– 20332,

work page 2025

[16] [16]

Simko: Simple pass@ k policy optimization.arXiv preprint arXiv:2510.14807, 2025

Peng, R., Ren, Y ., Yu, Z., Liu, W., and Wen, Y . Simko: Simple pass@ k policy optimization.arXiv preprint arXiv:2510.14807,

work page arXiv

[17] [17]

Proximal Policy Optimization Algorithms

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347,

work page internal anchor Pith review Pith/arXiv arXiv

[18] [18]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y ., Wu, Y ., et al. Deepseekmath: Push- ing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300,

work page internal anchor Pith review Pith/arXiv arXiv

[19] [19]

Srivastava, S. S. and Aggarwal, V . A technical survey of rein- forcement learning techniques for large language models. arXiv preprint arXiv:2507.04136,

work page arXiv

[20] [20]

Carbon: Cali- brated best-of-n sampling improves test-time reasoning

Tang, Y .-C., Chen, P.-Y ., and Cavallaro, A. Carbon: Cali- brated best-of-n sampling improves test-time reasoning. arXiv preprint arXiv:2510.15674,

work page arXiv

[21] [21]

Reasoning aware self-consistency: Leveraging reasoning paths for efficient llm sampling

Wan, G., Wu, Y ., Chen, J., and Li, S. Reasoning aware self-consistency: Leveraging reasoning paths for efficient llm sampling. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Associa- tion for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 3613–3635,

work page 2025

[22] [22]

Self-Consistency Improves Chain of Thought Reasoning in Language Models

Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., and Zhou, D. Self-consistency im- proves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171,

work page internal anchor Pith review Pith/arXiv arXiv

[23] [23]

mHC: Manifold-Constrained Hyper-Connections

URLhttps://arxiv.org/abs/2512.24880. Xu, F., Hao, Q., Zong, Z., Wang, J., Zhang, Y ., Wang, J., Lan, X., Gong, J., Ouyang, T., Meng, F., et al. To- wards large reasoning models: A survey of reinforced reasoning with large language models.arXiv preprint arXiv:2501.09686,

work page internal anchor Pith review Pith/arXiv arXiv

[24] [24]

Qwen3 Technical Report

Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388,

work page internal anchor Pith review Pith/arXiv arXiv

[25] [25]

Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

10 TGR: Manifold-Informed Latent Foresight Search Yue, Y ., Chen, Z., Lu, R., Zhao, A., Wang, Z., Song, S., and Huang, G. Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model?arXiv preprint arXiv:2504.13837,

work page internal anchor Pith review Pith/arXiv arXiv

[26] [26]

A Survey of Reinforcement Learning for Large Reasoning Models

Zhang, K., Zuo, Y ., He, B., Sun, Y ., Liu, R., Jiang, C., Fan, Y ., Tian, K., Jia, G., Li, P., et al. A survey of reinforcement learning for large reasoning models.arXiv preprint arXiv:2509.08827, 2025a. Zhang, Q., Lyu, F., Sun, Z., Wang, L., Zhang, W., Hua, W., Wu, H., Guo, Z., Wang, Y ., Muennighoff, N., et al. A survey on test-time scaling in large la...

work page internal anchor Pith review arXiv

[27] [27]

Group Sequence Policy Optimization

Zheng, C., Liu, S., Li, M., Chen, X.-H., Yu, B., Gao, C., Dang, K., Liu, Y ., Men, R., Yang, A., et al. Group sequence policy optimization.arXiv preprint arXiv:2507.18071,

work page internal anchor Pith review Pith/arXiv arXiv

[28] [28]

Pan, Zhangyang Wang, Yuandong Tian, and Kai Sheng Tai

Zhu, H., Zhang, Z., Huang, H., Su, D., Liu, Z., Zhao, J., Fedorov, I., Pirsiavash, H., Sha, Z., Lee, J., et al. The path not taken: Rlvr provably learns off the principals.arXiv preprint arXiv:2511.08567,

work page arXiv

[29] [29]

BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions

Zhuo, T. Y ., Vu, M. C., Chim, J., Hu, H., Yu, W., Widyasari, R., Yusuf, I. N. B., Zhan, H., He, J., Paul, I., et al. Big- codebench: Benchmarking code generation with diverse function calls and complex instructions.arXiv preprint arXiv:2406.15877,

work page internal anchor Pith review Pith/arXiv arXiv