DenseSteer: Steering Small Language Models towards Dense Math Reasoning

Jung-Eun Kim; Shuhang Lin; Yang Ouyang

arxiv: 2605.29247 · v1 · pith:UIUWJZUInew · submitted 2026-05-28 · 💻 cs.AI · cs.CL· cs.LG

DenseSteer: Steering Small Language Models towards Dense Math Reasoning

Yang Ouyang , Shuhang Lin , Jung-Eun Kim This is my paper

Pith reviewed 2026-06-29 07:43 UTC · model grok-4.3

classification 💻 cs.AI cs.CLcs.LG

keywords dense reasoningsmall language modelsmath reasoninginference-time steeringchain-of-thoughtinternal representation modulationnegative log-likelihood

0 comments

The pith

Steering small models toward dense reasoning raises math accuracy without raising uncertainty.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper finds that stronger reasoning tends to use fewer steps but pack more information into each step. From this pattern in larger models the authors derive DenseSteer, a method that adjusts internal activations in small models at inference time to favor the same pattern. Experiments show the change lifts accuracy on math benchmarks while leaving token-level negative log-likelihood unchanged. The approach requires no additional training. A reader would care because it points to a low-cost way to improve reasoning structure in models that are already small.

Core claim

Analyses of the Qwen-2.5 family show that proficient reasoning correlates with fewer reasoning steps but higher information density per step. DenseSteer uses this observation to modulate internal representations of small models toward dense reasoning patterns during inference, producing consistent accuracy gains on mathematical problem-solving benchmarks without increasing token-level negative log-likelihood.

What carries the argument

DenseSteer, a training-free inference-time steering framework that modulates internal representations toward dense reasoning patterns.

If this is right

Small models (<=3B) achieve higher accuracy on multi-step math tasks.
Reasoning improvements occur without extra training or higher token uncertainty.
Dense reasoning functions as a controllable structural property rather than an emergent effect of scale alone.
The method transfers the observed step-density pattern across different small models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same modulation approach might transfer to non-mathematical reasoning domains that also rely on step-wise generation.
If dense reasoning can be induced this way, combining it with other inference-time controls could further narrow the performance gap between small and large models.
Testing whether the accuracy gains persist when the target pattern is taken from models outside the original family would clarify how general the density signal is.

Load-bearing premise

The correlation between fewer steps and higher per-step density observed in one model family can be induced in other small models by adjusting their internal states at inference time.

What would settle it

Applying DenseSteer to a small model on math benchmarks and finding neither an accuracy increase nor stable negative log-likelihood would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.29247 by Jung-Eun Kim, Shuhang Lin, Yang Ouyang.

**Figure 1.** Figure 1: Analysis of Reasoning Patterns on GSM8K. For each model and each GSM8K test problem, we sample 8 responses. Error bars denote ±1 standard error of the mean (SEM), computed over question-level averages. Across Qwen2.5-Instruct models, stronger models often use fewer reasoning steps and higher perstep density than smaller, lower-accuracy models. This coarse trend motivates our focus on dense, model-aligned … view at source ↗

**Figure 2.** Figure 2: Overview of the proposed DenseSteer framework. Given a query, the target model first produces a baseline reasoning trace. This trace is minimally rewritten into a dense variant while preserving semantic content, forming a contrastive pair. Hidden-state differences between dense and sparse variants are aggregated to extract a steering vector at a selected layer. At inference time, this steering vector is in… view at source ↗

**Figure 3.** Figure 3: Distribution of token-level NLL evaluated under Qwen2.5-3B-Instruct on 200 GSM8K samples. (a) Reference reasoning traces generated by same-family (Qwen2.5- 7B-Instruct) and cross-family (Llama-3.2-8B-Instruct) models. (b) Comparison between the same-family reference traces and Dense-Rewriting target-model responses. The black dotted line indicates the target model’s self-likelihood baseline. generated by … view at source ↗

**Figure 4.** Figure 4: Sensitivity analysis on Qwen2.5-3B-Instruct. Metric trends for reasoning steps (top), density (middle), and total tokens (bottom) are shown across varying λ. Middle layers (L16/L17) exhibit the most effective and stable control over reasoning length, while early layers show limited sensitivity, and later layers tend to be unstable. and directly enforces the desired reasoning behavior at the representation … view at source ↗

**Figure 5.** Figure 5: Impact of Steering Strength on Accuracy. Reasoning accuracy on GSM8K improves for Layer 17 (middle layer) as the steering coefficient λ increases. This demonstrates that steering toward the dense reasoning effectively enhances model performance [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 6.** Figure 6: Impact of Steering Strength on Token-Level NLL. Average token-level negative log-likelihood (NLL) of steered reasoning traces under the base model as a function of λ. Steering towards dense reasoning reduces NLL, indicating improved alignment with the model’s intrinsic generation prior. Takeaway 4 Middle layers act as the primary control point for Dense Reasoning. Lambda Sensitivity. Beyond steering loca… view at source ↗

**Figure 7.** Figure 7: DAS under steering controls [PITH_FULL_IMAGE:figures/full_fig_p015_7.png] view at source ↗

**Figure 8.** Figure 8: Qualitative rewrite audit. The dense rewrite primarily merges adjacent redundant steps while preserving the original computations and final answer [PITH_FULL_IMAGE:figures/full_fig_p016_8.png] view at source ↗

**Figure 9.** Figure 9: Calibration set-size ablation. DenseSteer and InFamilySteer reach strong performance with few contrastive pairs and remain most effective around 25–50 pairs. 16 [PITH_FULL_IMAGE:figures/full_fig_p016_9.png] view at source ↗

read the original abstract

Large language models (LLMs) demonstrate strong chain-of-thought (CoT) reasoning abilities, while smaller models (<= 3B parameters) significantly underperform on multi-step reasoning tasks. Based on empirical analyses of the Qwen-2.5 model family on math reasoning benchmarks, we find that more proficient reasoning is associated with fewer reasoning steps but higher information density per step, a property we term Dense Reasoning. Motivated by this observation, we propose DenseSteer, a training-free inference-time steering framework that enhances small-model reasoning by modulating internal representations toward dense reasoning patterns. Experiments show that our method yields consistent accuracy improvements without increasing token-level Negative Log-Likelihood, highlighting dense reasoning as an effective structural approach to mathematical problem solving.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

DenseSteer claims a training-free steering trick lifts small-model math accuracy by targeting denser reasoning steps, but the abstract supplies no methods, numbers, or controls to check if it works.

read the letter

The paper's main point is that better math reasoning in the Qwen-2.5 family shows up as fewer steps but higher information density per step, and they introduce DenseSteer to steer other small models toward that pattern at inference time without any training.

What stands out is the training-free angle and the focus on internal representation modulation rather than prompt engineering or fine-tuning. That keeps the method lightweight, which fits the goal of helping models under 3B parameters.

The soft spots are clear from the abstract alone. No implementation details appear on how the steering actually modulates the representations, no baseline comparisons are mentioned, and no accuracy deltas, variance, or statistical checks are given. The central transfer claim—that the Qwen pattern can be induced elsewhere—rests on an unshown empirical step. Without those pieces the gains could be real or could be noise.

This work would interest people building practical inference methods for small models on reasoning tasks. A reader already working on activation engineering or test-time compute might pick up the dense-reasoning framing as a new target.

The paper deserves a serious referee if the full version supplies the missing experiments and ablations; the idea is straightforward enough that clear results would make it worth citing or extending.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes DenseSteer, a training-free inference-time steering framework that modulates internal representations in small language models (≤3B parameters) to induce 'dense reasoning' patterns—fewer reasoning steps with higher per-step information density—observed empirically in the Qwen-2.5 family on math benchmarks. It claims this yields consistent accuracy gains without increasing token-level negative log-likelihood.

Significance. If the claimed results hold with proper controls, the work would provide evidence that structural properties of reasoning (information density) can be induced via representation steering at inference time, offering a lightweight, training-free route to improve small-model math performance. The absence of NLL degradation would be a notable practical advantage.

major comments (2)

[Abstract] Abstract: the central claim of 'consistent accuracy improvements' is stated without any description of the small models tested, the math benchmarks used, baselines, number of runs, statistical significance, or implementation details of the steering mechanism, leaving the empirical result unsupported by visible evidence.
[Abstract] Abstract: the transfer assumption—that dense-reasoning patterns observed in Qwen-2.5 can be induced in other small models via internal representation modulation—is presented without ablations, controls for the steering vectors, or verification that the modulation actually produces fewer steps/higher density in the target models.

minor comments (1)

[Abstract] The abstract would benefit from naming the specific small models and benchmarks to allow readers to assess the scope of the claimed gains.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed feedback on the abstract of our manuscript. We will revise the abstract to better support our claims with key experimental details. We address each major comment below.

read point-by-point responses

Referee: [Abstract] Abstract: the central claim of 'consistent accuracy improvements' is stated without any description of the small models tested, the math benchmarks used, baselines, number of runs, statistical significance, or implementation details of the steering mechanism, leaving the empirical result unsupported by visible evidence.

Authors: The abstract is intended as a high-level summary, with full details provided in the body of the paper, including the specific models (≤3B parameters), benchmarks, and the steering approach in Section 3. Nevertheless, we agree that including more specifics in the abstract would strengthen it. We will revise the abstract to mention the models tested, the benchmarks, that results are consistent across multiple runs, and briefly note the steering mechanism. revision: yes
Referee: [Abstract] Abstract: the transfer assumption—that dense-reasoning patterns observed in Qwen-2.5 can be induced in other small models via internal representation modulation—is presented without ablations, controls for the steering vectors, or verification that the modulation actually produces fewer steps/higher density in the target models.

Authors: The empirical results in the manuscript demonstrate accuracy gains on target models when applying the steering derived from Qwen-2.5 patterns, supporting the transfer. Controls for the steering are implicit in the comparison to baselines. However, we recognize that explicit ablations and direct verification of reduced step count and increased density in the target models' outputs are not detailed in the current version. We will add such verification in the revised manuscript to more rigorously support the transfer assumption. revision: partial

Circularity Check

0 steps flagged

No significant circularity

full rationale

The provided abstract and description contain no equations, fitted parameters, predictions, or self-citations. The method is explicitly training-free and observation-driven from external Qwen-2.5 analyses, with the central claim resting on empirical accuracy improvements rather than any derivation that reduces to its own inputs by construction. This is the expected self-contained case.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no details on parameters, axioms, or entities are provided.

pith-pipeline@v0.9.1-grok · 5655 in / 1074 out tokens · 24664 ms · 2026-06-29T07:43:53.452908+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Closing the Prior-Posterior Loop: Self-Reflective Molecular Design with Analysis-Driven LLM Iteration
physics.chem-ph 2026-06 unverdicted novelty 6.0

LLM molecular design framework uses self-reflection on full physicochemical data from first-principles calculations to achieve low deviation on HOMO-LUMO gaps and generalize to other properties.

Reference graph

Works this paper leans on

22 extracted references · 20 canonical work pages · cited by 1 Pith paper · 13 internal anchors

[1]

GPT-4 Technical Report

Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774,

work page internal anchor Pith review Pith/arXiv arXiv
[2]

Anthropic

URL https://arxiv.org/abs/ 2306.13649. Anthropic. Introducing claude,

work page arXiv
[3]

Language Models are Few-Shot Learners

URL https:// arxiv.org/abs/2005.14165. Chen, R., Zhang, Z., Hong, J., Kundu, S., and Wang, Z. Seal: Steerable reasoning calibration of large language models for free.arXiv preprint arXiv:2504.07986, 2025a. Chen, X., Zhou, S., Liang, K., Sun, X., and Liu, X. Skip- thinking: Chunk-wise chain-of-thought distillation en- able smaller language models to reason...

work page internal anchor Pith review Pith/arXiv arXiv 2005
[4]

DeepSeek-V3 Technical Report

URL https: //arxiv.org/abs/2412.19437. Gu, Y ., Dong, L., Wei, F., and Huang, M. Minillm: Knowl- edge distillation of large language models. InInterna- tional Conference on Learning Representations, volume 2024, pp. 32694–32717,

work page internal anchor Pith review Pith/arXiv arXiv 2024
[5]

OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems

URL https://arxiv.org/abs/2402.14008. Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring mas- sive multitask language understanding.arXiv preprint arXiv:2009.03300,

work page internal anchor Pith review Pith/arXiv arXiv 2009
[6]

Measuring Mathematical Problem Solving With the MATH Dataset

URLhttps://arxiv.org/abs/2103.03874. Hinton, G., Vinyals, O., and Dean, J. Distilling the knowledge in a neural network,

work page internal anchor Pith review Pith/arXiv arXiv
[7]

Distilling the Knowledge in a Neural Network

URL https: //arxiv.org/abs/1503.02531. Højer, B., Jarvis, O., and Heinrich, S. Improving reasoning performance in large language models via representation engineering.arXiv preprint arXiv:2504.19483,

work page internal anchor Pith review Pith/arXiv arXiv
[8]

Y ., Ra- masubramanian, B., and Poovendran, R

Li, Y ., Yue, X., Xu, Z., Jiang, F., Niu, L., Lin, B. Y ., Ra- masubramanian, B., and Poovendran, R. Small models struggle to learn from strong reasoners.arXiv preprint arXiv:2502.12143,

work page arXiv
[9]

Logiqa: A challenge dataset for machine reading comprehension with logical reasoning.arXiv preprint arXiv:2007.08124,

Liu, J., Cui, L., Liu, H., Huang, D., Wang, Y ., and Zhang, Y . Logiqa: A challenge dataset for machine reading comprehension with logical reasoning.arXiv preprint arXiv:2007.08124,

work page arXiv 2007
[10]

Steering Llama 2 via Contrastive Activation Addition

URL https: //arxiv.org/abs/2312.06681. Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y . K., Wu, Y ., and Guo, D. Deepseekmath: Pushing the limits of mathemat- ical reasoning in open language models,

work page internal anchor Pith review Pith/arXiv arXiv
[11]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

URL https://arxiv.org/abs/2402.03300. Son, W., Na, J., Choi, J., and Hwang, W. Densely guided knowledge distillation using multiple teacher as- sistants,

work page internal anchor Pith review Pith/arXiv arXiv
[12]

Suzgun, M., Scales, N., Schärli, N., Gehrmann, S., Tay, Y ., Chung, H

URL https://arxiv.org/abs/ 2009.08825. Suzgun, M., Scales, N., Schärli, N., Gehrmann, S., Tay, Y ., Chung, H. W., Chowdhery, A., Le, Q., Chi, E., Zhou, D., et al. Challenging big-bench tasks and whether chain-of- thought can solve them. InFindings of the Association for Computational Linguistics: ACL 2023, pp. 13003–13051,

work page arXiv 2009
[13]

Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E., and Lample, G

URL https: //transformer-circuits.pub/2024/ scaling-monosemanticity/index.html. Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E., and Lample, G. Llama: Open and efficient foundation lan- guage models,

2024
[14]

LLaMA: Open and Efficient Foundation Language Models

URL https://arxiv.org/ abs/2302.13971. Turner, A. M., Thiergart, L., Leech, G., Udell, D., Vazquez, J. J., Mini, U., and MacDiarmid, M. Steering language 10 DenseSteer: Steering Small Language Models towards Dense Math Reasoning models with activation engineering,

work page internal anchor Pith review Pith/arXiv arXiv
[15]

Steering Language Models With Activation Engineering

URL https: //arxiv.org/abs/2308.10248. Wang, Z., Wang, Y ., Zhang, Z., Zhou, Z., Jin, H., Hu, T., Sun, J., Li, Z., Zhang, Y ., and Xu, Z.-Q. J. Un- derstanding the language model to solve the symbolic multi-step reasoning problem from the perspective of buffer mechanism,

work page internal anchor Pith review Pith/arXiv arXiv
[16]

org/abs/2405.15302

URL https://arxiv. org/abs/2405.15302. Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., and Zhou, D. Chain-of-thought prompting elicits reasoning in large language models. arXiv:2201.11903,

work page arXiv
[17]

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

URL https://arxiv.org/abs/ 2201.11903. Xu, H., Peng, B., Awadalla, H., Chen, D., Chen, Y .-C., Gao, M., Kim, Y . J., Li, Y ., Ren, L., Shen, Y ., Wang, S., Xu, W., Gao, J., and Chen, W. Phi-4-mini-reasoning: Exploring the limits of small reasoning language models in math.arXiv preprint arXiv:2504.21233,

work page internal anchor Pith review Pith/arXiv arXiv
[18]

Yang, A., Zhang, B., Hui, B., Gao, B., Yu, B., Li, C., Liu, D., Tu, J., Zhou, J., Lin, J., Lu, K., Xue, M., Lin, R., Liu, T., Ren, X., and Zhang, Z

URL https://arxiv.org/abs/2504.21233. Yang, A., Zhang, B., Hui, B., Gao, B., Yu, B., Li, C., Liu, D., Tu, J., Zhou, J., Lin, J., Lu, K., Xue, M., Lin, R., Liu, T., Ren, X., and Zhang, Z. Qwen2.5-math techni- cal report: Toward mathematical expert model via self- improvement,

work page arXiv
[19]

Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement

URL https://arxiv.org/ abs/2409.12122. Yang, Y ., Sun, H., Wang, J., Qi, Q., Zhuang, Z., Wang, H., Ren, P., Wang, J., and Liao, J. Unveiling internal reason- ing modes in LLMs: A deep dive into latent reasoning vs. factual shortcuts with attribute rate ratio. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 218...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[20]

Yang, Z., Qi, P., Zhang, S., Bengio, Y ., Cohen, W., Salakhutdinov, R., and Manning, C. D. HotpotQA: A dataset for diverse, explainable multi-hop question an- swering. In Riloff, E., Chiang, D., Hockenmaier, J., and Tsujii, J. (eds.),Proceedings of the 2018 Confer- ence on Empirical Methods in Natural Language Pro- cessing, pp. 2369–2380, Brussels, Belgiu...

2018
[21]

Cohen, Ruslan Salakhut- dinov, and Christopher D

Association for Computational Lin- guistics. doi: 10.18653/v1/D18-1259. URL https: //aclanthology.org/D18-1259/. Zou, A., Phan, L., Chen, S., Campbell, J., Guo, P., Ren, R., Pan, A., Yin, X., Mazeika, M., Dombrowski, A.-K., Goel, S., Li, N., Byun, M. J., Wang, Z., Mallen, A., Basart, S., Koyejo, S., Song, D., Fredrikson, M., Kolter, J. Z., and Hendrycks, ...

work page doi:10.18653/v1/d18-1259
[22]

Representation Engineering: A Top-Down Approach to AI Transparency

URL https://arxiv.org/abs/2310.01405. 11 DenseSteer: Steering Small Language Models towards Dense Math Reasoning Appendix A. Prompts details A.1. Baseline: Chain-of-Thought Prompt Normal Chain-of-Thought Prompt Solve the following math problem. Present the final answer in the format: Final Answer: \\boxed{your_answer}.\nProblem: {{problem}}\nAnswer: A.2. ...

work page internal anchor Pith review Pith/arXiv arXiv

[1] [1]

GPT-4 Technical Report

Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

Anthropic

URL https://arxiv.org/abs/ 2306.13649. Anthropic. Introducing claude,

work page arXiv

[3] [3]

Language Models are Few-Shot Learners

URL https:// arxiv.org/abs/2005.14165. Chen, R., Zhang, Z., Hong, J., Kundu, S., and Wang, Z. Seal: Steerable reasoning calibration of large language models for free.arXiv preprint arXiv:2504.07986, 2025a. Chen, X., Zhou, S., Liang, K., Sun, X., and Liu, X. Skip- thinking: Chunk-wise chain-of-thought distillation en- able smaller language models to reason...

work page internal anchor Pith review Pith/arXiv arXiv 2005

[4] [4]

DeepSeek-V3 Technical Report

URL https: //arxiv.org/abs/2412.19437. Gu, Y ., Dong, L., Wei, F., and Huang, M. Minillm: Knowl- edge distillation of large language models. InInterna- tional Conference on Learning Representations, volume 2024, pp. 32694–32717,

work page internal anchor Pith review Pith/arXiv arXiv 2024

[5] [5]

OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems

URL https://arxiv.org/abs/2402.14008. Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring mas- sive multitask language understanding.arXiv preprint arXiv:2009.03300,

work page internal anchor Pith review Pith/arXiv arXiv 2009

[6] [6]

Measuring Mathematical Problem Solving With the MATH Dataset

URLhttps://arxiv.org/abs/2103.03874. Hinton, G., Vinyals, O., and Dean, J. Distilling the knowledge in a neural network,

work page internal anchor Pith review Pith/arXiv arXiv

[7] [7]

Distilling the Knowledge in a Neural Network

URL https: //arxiv.org/abs/1503.02531. Højer, B., Jarvis, O., and Heinrich, S. Improving reasoning performance in large language models via representation engineering.arXiv preprint arXiv:2504.19483,

work page internal anchor Pith review Pith/arXiv arXiv

[8] [8]

Y ., Ra- masubramanian, B., and Poovendran, R

Li, Y ., Yue, X., Xu, Z., Jiang, F., Niu, L., Lin, B. Y ., Ra- masubramanian, B., and Poovendran, R. Small models struggle to learn from strong reasoners.arXiv preprint arXiv:2502.12143,

work page arXiv

[9] [9]

Logiqa: A challenge dataset for machine reading comprehension with logical reasoning.arXiv preprint arXiv:2007.08124,

Liu, J., Cui, L., Liu, H., Huang, D., Wang, Y ., and Zhang, Y . Logiqa: A challenge dataset for machine reading comprehension with logical reasoning.arXiv preprint arXiv:2007.08124,

work page arXiv 2007

[10] [10]

Steering Llama 2 via Contrastive Activation Addition

URL https: //arxiv.org/abs/2312.06681. Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y . K., Wu, Y ., and Guo, D. Deepseekmath: Pushing the limits of mathemat- ical reasoning in open language models,

work page internal anchor Pith review Pith/arXiv arXiv

[11] [11]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

URL https://arxiv.org/abs/2402.03300. Son, W., Na, J., Choi, J., and Hwang, W. Densely guided knowledge distillation using multiple teacher as- sistants,

work page internal anchor Pith review Pith/arXiv arXiv

[12] [12]

Suzgun, M., Scales, N., Schärli, N., Gehrmann, S., Tay, Y ., Chung, H

URL https://arxiv.org/abs/ 2009.08825. Suzgun, M., Scales, N., Schärli, N., Gehrmann, S., Tay, Y ., Chung, H. W., Chowdhery, A., Le, Q., Chi, E., Zhou, D., et al. Challenging big-bench tasks and whether chain-of- thought can solve them. InFindings of the Association for Computational Linguistics: ACL 2023, pp. 13003–13051,

work page arXiv 2009

[13] [13]

Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E., and Lample, G

URL https: //transformer-circuits.pub/2024/ scaling-monosemanticity/index.html. Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E., and Lample, G. Llama: Open and efficient foundation lan- guage models,

2024

[14] [14]

LLaMA: Open and Efficient Foundation Language Models

URL https://arxiv.org/ abs/2302.13971. Turner, A. M., Thiergart, L., Leech, G., Udell, D., Vazquez, J. J., Mini, U., and MacDiarmid, M. Steering language 10 DenseSteer: Steering Small Language Models towards Dense Math Reasoning models with activation engineering,

work page internal anchor Pith review Pith/arXiv arXiv

[15] [15]

Steering Language Models With Activation Engineering

URL https: //arxiv.org/abs/2308.10248. Wang, Z., Wang, Y ., Zhang, Z., Zhou, Z., Jin, H., Hu, T., Sun, J., Li, Z., Zhang, Y ., and Xu, Z.-Q. J. Un- derstanding the language model to solve the symbolic multi-step reasoning problem from the perspective of buffer mechanism,

work page internal anchor Pith review Pith/arXiv arXiv

[16] [16]

org/abs/2405.15302

URL https://arxiv. org/abs/2405.15302. Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., and Zhou, D. Chain-of-thought prompting elicits reasoning in large language models. arXiv:2201.11903,

work page arXiv

[17] [17]

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

URL https://arxiv.org/abs/ 2201.11903. Xu, H., Peng, B., Awadalla, H., Chen, D., Chen, Y .-C., Gao, M., Kim, Y . J., Li, Y ., Ren, L., Shen, Y ., Wang, S., Xu, W., Gao, J., and Chen, W. Phi-4-mini-reasoning: Exploring the limits of small reasoning language models in math.arXiv preprint arXiv:2504.21233,

work page internal anchor Pith review Pith/arXiv arXiv

[18] [18]

Yang, A., Zhang, B., Hui, B., Gao, B., Yu, B., Li, C., Liu, D., Tu, J., Zhou, J., Lin, J., Lu, K., Xue, M., Lin, R., Liu, T., Ren, X., and Zhang, Z

URL https://arxiv.org/abs/2504.21233. Yang, A., Zhang, B., Hui, B., Gao, B., Yu, B., Li, C., Liu, D., Tu, J., Zhou, J., Lin, J., Lu, K., Xue, M., Lin, R., Liu, T., Ren, X., and Zhang, Z. Qwen2.5-math techni- cal report: Toward mathematical expert model via self- improvement,

work page arXiv

[19] [19]

Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement

URL https://arxiv.org/ abs/2409.12122. Yang, Y ., Sun, H., Wang, J., Qi, Q., Zhuang, Z., Wang, H., Ren, P., Wang, J., and Liao, J. Unveiling internal reason- ing modes in LLMs: A deep dive into latent reasoning vs. factual shortcuts with attribute rate ratio. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 218...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[20] [20]

Yang, Z., Qi, P., Zhang, S., Bengio, Y ., Cohen, W., Salakhutdinov, R., and Manning, C. D. HotpotQA: A dataset for diverse, explainable multi-hop question an- swering. In Riloff, E., Chiang, D., Hockenmaier, J., and Tsujii, J. (eds.),Proceedings of the 2018 Confer- ence on Empirical Methods in Natural Language Pro- cessing, pp. 2369–2380, Brussels, Belgiu...

2018

[21] [21]

Cohen, Ruslan Salakhut- dinov, and Christopher D

Association for Computational Lin- guistics. doi: 10.18653/v1/D18-1259. URL https: //aclanthology.org/D18-1259/. Zou, A., Phan, L., Chen, S., Campbell, J., Guo, P., Ren, R., Pan, A., Yin, X., Mazeika, M., Dombrowski, A.-K., Goel, S., Li, N., Byun, M. J., Wang, Z., Mallen, A., Basart, S., Koyejo, S., Song, D., Fredrikson, M., Kolter, J. Z., and Hendrycks, ...

work page doi:10.18653/v1/d18-1259

[22] [22]

Representation Engineering: A Top-Down Approach to AI Transparency

URL https://arxiv.org/abs/2310.01405. 11 DenseSteer: Steering Small Language Models towards Dense Math Reasoning Appendix A. Prompts details A.1. Baseline: Chain-of-Thought Prompt Normal Chain-of-Thought Prompt Solve the following math problem. Present the final answer in the format: Final Answer: \\boxed{your_answer}.\nProblem: {{problem}}\nAnswer: A.2. ...

work page internal anchor Pith review Pith/arXiv arXiv