
arxiv: 2605.20369 · v1 · pith:HUQEYUCZnew · submitted 2026-05-19 · 💻 cs.CL · cs.AI · cs.LG

DEL: Digit Entropy Loss for Numerical Learning of Large Language Models

Pith reviewed 2026-05-21 07:31 UTC · model grok-4.3

keywords digit entropy loss · numerical learning · large language models · mathematical reasoning · floating-point optimization · supervised entropy · binary cross-entropy · loss function

The pith

Digit Entropy Loss improves number prediction in LLMs by supervising digit probabilities with binary cross-entropy while dropping numerical distance terms and extending to floating-point values.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models often produce inaccurate numbers in math and code tasks even when they reason correctly about other parts of a problem. Standard maximum likelihood training does not target numerical accuracy, and recent penalty methods that add numerical distance create either overly peaked or overly flat digit distributions. The paper shows that these methods share a criterion-distance structure and introduces Digit Entropy Loss to replace the unsupervised entropy objective with a supervised version that conditions on prior digits, applies binary cross-entropy, removes the distance term entirely, and treats decimal points and digits uniformly so the model learns complete floating-point numbers. If the approach holds, models would generate more accurate numerical answers on reasoning benchmarks without the distribution problems of earlier losses.
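The contrast between overly peaked and overly flat digit distributions can be made concrete with Shannon entropy. The sketch below uses three made-up next-digit distributions over the digits 0 to 9 (the probabilities are illustrative assumptions, not values from the paper) and shows that a fully peaked distribution has zero entropy while a uniform one reaches the maximum of log 10.

```python
import math

def entropy(probs):
    """Shannon entropy in nats of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Hypothetical next-digit distributions over 0-9; the correct digit is 7.
over_sharpened = [0.0] * 10
over_sharpened[7] = 1.0          # all probability mass on a single digit

over_flattened = [0.1] * 10      # uniform: no preference for any digit

moderate = [0.01] * 10           # most mass on 7, a little on its neighbours
moderate[7] = 0.85
moderate[6] = 0.04
moderate[8] = 0.04

for name, dist in [("over-sharpened", over_sharpened),
                   ("moderate", moderate),
                   ("over-flattened", over_flattened)]:
    print(f"{name:15s} entropy = {entropy(dist):.4f} nats")
```

Entropy alone says nothing about whether the mass sits on the right digit, which is why the summary stresses a supervised rather than unsupervised entropy objective.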

Core claim

Existing numerical learning methods for LLMs follow a criterion-distance formulation in which the criterion defines the optimization pattern and the distance term supplies a geometric prior. Digit Entropy Loss reformulates the conventional unsupervised entropy optimization through three changes: it uses digit conditional probability together with binary cross-entropy to make entropy optimization supervised; it removes the distance term to sidestep over-sharpening and over-flattening; and it generalizes the objective from integers to full floating-point numbers that include decimal digits and points. On seven mathematical reasoning benchmarks and four LLMs the resulting loss yields higher end

What carries the argument

Digit Entropy Loss (DEL), which converts unsupervised entropy optimization into a supervised objective by conditioning on previous digits, applying binary cross-entropy, discarding the numerical distance term, and treating decimal points as ordinary tokens so the loss operates over entire floating-point numbers.
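To make the ingredients named above tangible, here is a minimal sketch of a per-step binary cross-entropy over digit and decimal-point symbols. It is not the authors' implementation (their code is in the repository linked in the abstract). The 11-symbol vocabulary, the softmax normalization, the one-hot target, and the function names `bce_step`, `number_loss`, and `peaked_logits` are all assumptions made for illustration. The per-step logits are assumed to come from an autoregressive model that has already seen the preceding symbols, so the resulting probabilities are conditional on the prefix.

```python
import math

SYMBOLS = list("0123456789.")  # assumed vocabulary: ten digits plus the decimal point

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def bce_step(logits, target_symbol, eps=1e-12):
    """Binary cross-entropy between symbol probabilities and a one-hot target."""
    probs = softmax(logits)
    target = SYMBOLS.index(target_symbol)
    loss = 0.0
    for j, p in enumerate(probs):
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep the logarithms finite
        loss -= math.log(p) if j == target else math.log(1.0 - p)
    return loss

def number_loss(step_logits, number_text):
    """Average the per-step loss over every symbol, decimal point included."""
    assert len(step_logits) == len(number_text)
    total = sum(bce_step(l, s) for l, s in zip(step_logits, number_text))
    return total / len(number_text)

def peaked_logits(symbol, strength):
    """Hypothetical model output that favours one symbol."""
    logits = [0.0] * len(SYMBOLS)
    logits[SYMBOLS.index(symbol)] = strength
    return logits

target = "3.14"
good = [peaked_logits(s, 5.0) for s in target]     # favours the correct symbol
wrong = [peaked_logits("9", 5.0) for _ in target]  # always favours "9"
print(number_loss(good, target), number_loss(wrong, target))
```

Note that no numerical distance between the predicted and true value appears anywhere in this sketch; the decimal point is scored exactly like a digit.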

If this is right

  • DEL produces higher overall prediction accuracy than prior numerical losses on mathematical reasoning tasks.
  • DEL yields smaller numerical distance errors on the same benchmarks.
  • DEL supports optimization over floating-point numbers that contain decimal points and digits.
  • DEL applies across multiple LLMs including CodeLlama, Mistral, DeepSeek, and Qwen-2.5.
  • DEL expands the training objective from isolated digits to complete numbers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar supervised entropy formulations could be tested on other structured sequential outputs such as dates or measurements.
  • Removing explicit distance penalties may simplify training pipelines that currently combine multiple loss terms for numerical data.
  • The approach invites direct comparison with token-level calibration methods that also aim to shape probability distributions without geometric penalties.
  • Extending the same digit-level conditioning to code-generation settings that require precise numeric literals could be checked in follow-up experiments.

Load-bearing premise

That removing the numerical distance term and guiding entropy with supervised binary cross-entropy on digit conditional probabilities will avoid over-sharpening and over-flattening while successfully extending integer learning to floating-point numbers.

What would settle it

If DEL fails to produce higher overall prediction accuracy or lower numerical distance than the compared losses when evaluated on the same seven benchmarks with CodeLlama, Mistral, DeepSeek, and Qwen-2.5, the central claim would be falsified.
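The test above depends on two separate measurements: prediction accuracy and numerical distance. The sketch below shows why they are worth reporting together. The metric definitions (float equality for accuracy, mean absolute difference for distance) and the toy predictions are assumptions for illustration; the paper's exact evaluation protocol is not reproduced here.

```python
def exact_match_accuracy(preds, golds):
    """Fraction of predictions whose numeric value equals the gold value."""
    return sum(float(p) == float(g) for p, g in zip(preds, golds)) / len(golds)

def mean_abs_distance(preds, golds):
    """Mean absolute difference between predicted and gold values."""
    return sum(abs(float(p) - float(g)) for p, g in zip(preds, golds)) / len(golds)

golds = ["3.14", "42", "0.5"]
preds_a = ["3.14", "41", "0.5"]   # hypothetical loss A: misses by 1
preds_b = ["3.15", "42", "0.5"]   # hypothetical loss B: misses by about 0.01

for name, preds in [("A", preds_a), ("B", preds_b)]:
    acc = exact_match_accuracy(preds, golds)
    dist = mean_abs_distance(preds, golds)
    print(f"loss {name}: accuracy = {acc:.3f}, mean abs distance = {dist:.4f}")
```

Both toy systems get two of three answers right, yet their distance scores differ by two orders of magnitude, so a loss could win on one metric and lose on the other.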

Original abstract

Number prediction stands as a fundamental capability of large language models (LLMs) in mathematical problem-solving and code generation. The widely adopted maximum likelihood estimation (MLE) for LLM training is not tailored to number prediction. Recently, penalty-driven approaches, e.g., Number Token Loss and Discretized Distance Loss, introduce an inductive bias of numerical distance but induce over-sharpened and over-flattened digit distributions, respectively. In this paper, we make an in-depth analysis on LLM numerical learning, and show that existing numerical learning methods conceptually follow a criterion-distance formulation, where the criterion term represents optimization pattern and the distance term instills geometric prior. Consequently, we present Digit Entropy Loss (DEL) for auto-regressive numerical learning, which reformulates the conventional unsupervised entropy optimization in three key designs: leveraging digit conditional probability and binary cross-entropy to guide the entropy optimization into a supervised manner; deprecating the distance term to bypass the issue of numerical distance; and generalizing the integer-based numerical learning to floating-point number optimization, enabling more accurate number prediction. Our DEL formulation can incorporate integers, decimals, and decimal points, expanding the learning objective from a single digit to the floating-point number domain. Experiments conducted on seven mathematical reasoning benchmarks with four representative LLMs, including CodeLlama, Mistral, DeepSeek, and Qwen-2.5, demonstrate that DEL consistently outperforms its counterparts in both overall prediction accuracy and numerical distance. Source codes are at https://github.com/PolyU-VCLab/DEL

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Digit Entropy Loss (DEL) for numerical learning in LLMs. It frames prior methods (MLE, Number Token Loss, Discretized Distance Loss) as criterion-distance formulations and introduces DEL via three changes: supervised entropy optimization using digit conditional probabilities and binary cross-entropy, removal of the numerical distance term, and extension from integers to floating-point numbers by incorporating decimal points into the token sequence. Experiments across seven mathematical reasoning benchmarks and four LLMs (CodeLlama, Mistral, DeepSeek, Qwen-2.5) report consistent gains in both prediction accuracy and numerical distance metrics.

Significance. If the central claims hold, DEL offers a parameter-free alternative that sidesteps over-sharpening and over-flattening while extending numerical supervision to decimals. The multi-LLM, multi-benchmark evaluation and public code release provide a reproducible empirical foundation for improved number prediction in mathematical reasoning and code generation tasks.

major comments (2)
  1. [Section 3] The generalization to floating-point numbers incorporates decimal points but applies uniform binary cross-entropy across all digits without place-value weighting (e.g., scaling post-decimal digits by 10^{-k}). Because numerical distance is evaluated in absolute or log scale, the per-digit supervision signal lacks explicit magnitude awareness; this assumption is load-bearing for the claim that DEL successfully extends integer learning to floats while improving numerical distance.
  2. [Results section] The reported outperformance on numerical distance is presented as evidence that deprecating the distance term succeeds, yet the manuscript does not include an ablation that isolates the effect of uniform digit treatment on decimal-heavy subsets of the benchmarks. Without this, it remains unclear whether the observed gains in distance metrics are robust to the lack of place-value scaling.
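The place-value concern in the first major comment can be illustrated with simple arithmetic. The sketch below takes a made-up gold value, corrupts exactly one digit at each position, and records the absolute error. Every corruption is a single wrong symbol, which a uniform per-digit loss would count once, but the numerical error shrinks by a factor of ten per position. The helper names and the gold value are illustrative assumptions.

```python
from decimal import Decimal

def replace_symbol(number_text, index, new_digit):
    """Return number_text with the symbol at index swapped for new_digit."""
    return number_text[:index] + new_digit + number_text[index + 1:]

def abs_error(gold_text, pred_text):
    """Exact absolute difference between two decimal strings."""
    return abs(Decimal(pred_text) - Decimal(gold_text))

gold = "12.34"
errors = []
for index, ch in enumerate(gold):
    if ch == ".":
        continue  # only digits are corrupted in this illustration
    wrong_digit = str((int(ch) + 1) % 10)
    pred = replace_symbol(gold, index, wrong_digit)
    errors.append(abs_error(gold, pred))
    print(f"position {index}: predicted {pred}, absolute error {errors[-1]}")
```

The recorded errors are 10, 1, 0.1, and 0.01. Whether an autoregressive model can absorb this asymmetry from context alone, as the simulated rebuttal argues, is exactly what the requested ablation would test.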
minor comments (2)
  1. The description of how decimal points are tokenized and how the conditional probability is computed over the extended vocabulary could be accompanied by an explicit equation or pseudocode for clarity.
  2. Details on the exact data splits, number of evaluation runs, and any statistical significance testing for the benchmark results are not fully elaborated, which would strengthen the empirical claims.
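As an indication of what the pseudocode requested in the first minor comment might look like, the sketch below maps a decimal string to symbol ids and lists the (prefix, target) pairs an autoregressive model would be trained on. The character-level vocabulary is an assumption: real LLM tokenizers may merge several digits into one token, and the paper's actual tokenization may differ.

```python
# Assumed vocabulary: ids 0-9 for the digits, id 10 for the decimal point.
VOCAB = {symbol: i for i, symbol in enumerate("0123456789.")}

def tokenize_number(number_text):
    """Map a decimal string to symbol ids, one per digit or decimal point."""
    if number_text.count(".") > 1 or not all(ch in VOCAB for ch in number_text):
        raise ValueError(f"not a plain decimal number: {number_text!r}")
    return [VOCAB[ch] for ch in number_text]

def prefix_target_pairs(number_text):
    """List (prefix ids, next id) pairs, one per symbol of the number."""
    ids = tokenize_number(number_text)
    return [(ids[:t], ids[t]) for t in range(len(ids))]

for prefix, target in prefix_target_pairs("3.14"):
    print(f"given {prefix} predict {target}")
```

Each pair corresponds to one conditional probability of the next symbol given the symbols before it, with the decimal point treated as just another target.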

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, providing clarifications on our design choices and indicating where revisions have been made to strengthen the presentation and empirical support.

  1. Referee: [Section 3] The generalization to floating-point numbers incorporates decimal points but applies uniform binary cross-entropy across all digits without place-value weighting (e.g., scaling post-decimal digits by 10^{-k}). Because numerical distance is evaluated in absolute or log scale, the per-digit supervision signal lacks explicit magnitude awareness; this assumption is load-bearing for the claim that DEL successfully extends integer learning to floats while improving numerical distance.

    Authors: We appreciate the referee's observation on the uniform binary cross-entropy in the floating-point generalization. This uniformity is a deliberate choice to preserve a parameter-free formulation that directly supervises digit-level conditional probabilities via binary cross-entropy, allowing the autoregressive model to capture positional significance through sequence context rather than explicit scaling. Introducing place-value weights would add hyperparameters that risk reintroducing the over-sharpening or over-flattening issues we sought to avoid by deprecating the distance term. Benchmarks in our evaluation contain floating-point numbers, and the observed gains in both accuracy and numerical distance metrics indicate that the supervised entropy optimization supplies adequate signal. In the revised manuscript we have expanded Section 3 with a dedicated paragraph explaining this design rationale and its implications for magnitude awareness.

    revision: partial

  2. Referee: [Results section] The reported outperformance on numerical distance is presented as evidence that deprecating the distance term succeeds, yet the manuscript does not include an ablation that isolates the effect of uniform digit treatment on decimal-heavy subsets of the benchmarks. Without this, it remains unclear whether the observed gains in distance metrics are robust to the lack of place-value scaling.

    Authors: We agree that an ablation isolating uniform digit treatment on decimal-heavy subsets would provide stronger evidence of robustness. Our primary experiments already span seven mathematical reasoning benchmarks that include a range of floating-point expressions, with consistent improvements in numerical distance after removing the distance term. To directly address the concern, the revised results section now includes a post-hoc breakdown on subsets with elevated decimal density; the gains in distance metrics remain stable under this analysis, supporting that the supervised entropy approach does not rely on place-value scaling for its benefits.

    revision: yes

Circularity Check

0 steps flagged

No significant circularity; DEL is an independent reformulation


The paper derives DEL by analyzing existing methods as following a criterion-distance formulation, then defines a new loss that uses digit conditional probability with binary cross-entropy for supervised entropy optimization, explicitly deprecates the distance term, and extends the formulation to include decimal points and floating-point numbers. This construction is presented directly via new equations and design choices rather than by fitting parameters to data subsets or reducing predictions to inputs by construction. No load-bearing self-citations, uniqueness theorems from prior author work, or smuggled ansatzes are invoked to justify the core claim. Experiments on seven external benchmarks with four LLMs provide independent validation, confirming the derivation chain remains self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard machine-learning assumptions about loss optimization and the validity of the three design choices for numerical learning; no free parameters or new entities are introduced in the abstract.

axioms (1)
  • domain assumption: Reformulating unsupervised entropy optimization into a supervised form using digit conditional probability and binary cross-entropy provides a valid inductive bias for numerical learning.
    This is the core conceptual step described in the abstract's analysis of existing methods.

pith-pipeline@v0.9.0 · 5825 in / 1250 out tokens · 39644 ms · 2026-05-21T07:31:38.883732+00:00 · methodology


