DEL: Digit Entropy Loss for Numerical Learning of Large Language Models
Pith reviewed 2026-05-21 07:31 UTC · model grok-4.3
The pith
Digit Entropy Loss improves number prediction in LLMs by supervising digit probabilities with binary cross-entropy while dropping numerical distance terms and extending to floating-point values.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Existing numerical learning methods for LLMs follow a criterion-distance formulation in which the criterion defines the optimization pattern and the distance term supplies a geometric prior. Digit Entropy Loss reformulates the conventional unsupervised entropy optimization through three changes: it uses digit conditional probability together with binary cross-entropy to make entropy optimization supervised; it removes the distance term to sidestep over-sharpening and over-flattening; and it generalizes the objective from integers to full floating-point numbers that include decimal digits and points. On seven mathematical reasoning benchmarks and four LLMs the resulting loss yields higher end
What carries the argument
Digit Entropy Loss (DEL), which converts unsupervised entropy optimization into a supervised objective by conditioning on previous digits, applying binary cross-entropy, discarding the numerical distance term, and treating decimal points as ordinary tokens so the loss operates over entire floating-point numbers.
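As a rough illustration of the mechanism described above, the following Python sketch applies binary cross-entropy to a per-position distribution over digit symbols. It is not the paper's exact formulation, which this page does not reproduce. The 11-symbol vocabulary `VOCAB`, the helper names, the one-hot target, and the averaging over positions are all assumptions made for the example. In practice the logits would come from an LLM conditioned on the preceding digits rather than being supplied by hand.

```python
import math

# Assumption: a hypothetical 11-symbol vocabulary, ten digits plus the decimal point.
VOCAB = list("0123456789.")

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def digit_bce(probs, target_index, eps=1e-12):
    """Binary cross-entropy between a symbol distribution and a one-hot target."""
    loss = 0.0
    for k, p in enumerate(probs):
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep the logarithms finite
        y = 1.0 if k == target_index else 0.0
        loss -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return loss

def number_loss(per_step_logits, target):
    """Average the per-position BCE over every symbol of the target string.

    per_step_logits[t] stands in for the model's logits at position t,
    already conditioned on target[:t] (teacher forcing).
    """
    assert len(per_step_logits) == len(target)
    losses = [
        digit_bce(softmax(logits), VOCAB.index(symbol))
        for logits, symbol in zip(per_step_logits, target)
    ]
    return sum(losses) / len(losses)

def peaked_logits(symbol, scale=8.0):
    """Toy logits that put most of the mass on one symbol."""
    return [scale if s == symbol else 0.0 for s in VOCAB]

target = "3.14"
confident = number_loss([peaked_logits(c) for c in target], target)
uniform = number_loss([[0.0] * len(VOCAB) for _ in target], target)
print(confident, uniform)
```

The decimal point is scored like any other symbol here, and no term depends on how far the predicted value is from the target, which mirrors the two design choices named above.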
If this is right
- DEL produces higher overall prediction accuracy than prior numerical losses on mathematical reasoning tasks.
- DEL yields smaller numerical distance errors on the same benchmarks.
- DEL supports optimization over floating-point numbers that contain decimal points and digits.
- DEL applies across multiple LLMs including CodeLlama, Mistral, DeepSeek, and Qwen-2.5.
- DEL expands the training objective from isolated digits to complete numbers.
Where Pith is reading between the lines
- Similar supervised entropy formulations could be tested on other structured sequential outputs such as dates or measurements.
- Removing explicit distance penalties may simplify training pipelines that currently combine multiple loss terms for numerical data.
- The approach invites direct comparison with token-level calibration methods that also aim to shape probability distributions without geometric penalties.
- Extending the same digit-level conditioning to code-generation settings that require precise numeric literals could be checked in follow-up experiments.
Load-bearing premise
That removing the numerical distance term and guiding entropy with supervised binary cross-entropy on digit conditional probabilities will avoid over-sharpening and over-flattening while successfully extending integer learning to floating-point numbers.
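To make the two failure modes in this premise concrete, the sketch below computes Shannon entropy for three hand-written digit distributions. The distributions and their labels are illustrative assumptions only. The page does not give the paper's formal definitions of over-sharpening and over-flattening.

```python
import math

def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Illustrative distributions over the ten digits, with 7 as the target digit.
over_sharpened = [0.0] * 10
over_sharpened[7] = 1.0            # all mass on a single digit
over_flattened = [0.1] * 10        # mass spread evenly over every digit
in_between = [0.0] * 10
in_between[6], in_between[7], in_between[8] = 0.1, 0.8, 0.1

for name, dist in [("sharp", over_sharpened),
                   ("between", in_between),
                   ("flat", over_flattened)]:
    print(name, entropy(dist))
```

Entropy is lowest for the one-hot distribution and highest for the uniform one, so an objective that shapes entropy directly is controlling exactly the quantity that separates the two extremes.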
What would settle it
If DEL fails to produce higher overall prediction accuracy or lower numerical distance than the compared losses when evaluated on the same seven benchmarks with CodeLlama, Mistral, DeepSeek, and Qwen-2.5, the central claim would be falsified.
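The two quantities in this test can be sketched as follows. This is a minimal example under stated assumptions: the function names are hypothetical, exact match is taken as string equality, and numerical distance is taken as mean absolute error. The page does not specify the paper's exact metric definitions.

```python
def exact_match_accuracy(predictions, targets):
    """Fraction of predictions whose string equals the target string."""
    assert len(predictions) == len(targets) and targets
    hits = sum(1 for p, t in zip(predictions, targets) if p == t)
    return hits / len(targets)

def mean_absolute_distance(predictions, targets):
    """Mean absolute difference between predicted and target values."""
    assert len(predictions) == len(targets) and targets
    total = sum(abs(float(p) - float(t)) for p, t in zip(predictions, targets))
    return total / len(targets)

# Toy data, not taken from any benchmark.
targets = ["12.5", "9", "100"]
predictions = ["12.5", "7", "100"]
print(exact_match_accuracy(predictions, targets))
print(mean_absolute_distance(predictions, targets))
```

String equality treats "12.50" and "12.5" as different answers, so a real evaluation would need to settle on a normalization rule before comparing losses.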
original abstract
Number prediction stands as a fundamental capability of large language models (LLMs) in mathematical problem-solving and code generation. The widely adopted maximum likelihood estimation (MLE) for LLM training is not tailored to number prediction. Recently, penalty-driven approaches, e.g., Number Token Loss and Discretized Distance Loss, introduce an inductive bias of numerical distance but induce over-sharpened and over-flattened digit distributions, respectively. In this paper, we make an in-depth analysis on LLM numerical learning, and show that existing numerical learning methods conceptually follow a criterion-distance formulation, where the criterion term represents optimization pattern and the distance term instills geometric prior. Consequently, we present Digit Entropy Loss (DEL) for auto-regressive numerical learning, which reformulates the conventional unsupervised entropy optimization in three key designs: leveraging digit conditional probability and binary cross-entropy to guide the entropy optimization into a supervised manner; deprecating the distance term to bypass the issue of numerical distance; and generalizing the integer-based numerical learning to floating-point number optimization, enabling more accurate number prediction. Our DEL formulation can incorporate integers, decimals, and decimal points, expanding the learning objective from a single digit to the floating-point number domain. Experiments conducted on seven mathematical reasoning benchmarks with four representative LLMs, including CodeLlama, Mistral, DeepSeek, and Qwen-2.5, demonstrate that DEL consistently outperforms its counterparts in both overall prediction accuracy and numerical distance. Source codes are at https://github.com/PolyU-VCLab/DEL
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Digit Entropy Loss (DEL) for numerical learning in LLMs. It frames prior methods (MLE, Number Token Loss, Discretized Distance Loss) as criterion-distance formulations and introduces DEL via three changes: supervised entropy optimization using digit conditional probabilities and binary cross-entropy, removal of the numerical distance term, and extension from integers to floating-point numbers by incorporating decimal points into the token sequence. Experiments across seven mathematical reasoning benchmarks and four LLMs (CodeLlama, Mistral, DeepSeek, Qwen-2.5) report consistent gains in both prediction accuracy and numerical distance metrics.
Significance. If the central claims hold, DEL offers a parameter-free alternative that sidesteps over-sharpening and over-flattening while extending numerical supervision to decimals. The multi-LLM, multi-benchmark evaluation and public code release provide a reproducible empirical foundation for improved number prediction in mathematical reasoning and code generation tasks.
major comments (2)
- [Section 3] The generalization to floating-point numbers incorporates decimal points but applies uniform binary cross-entropy across all digits without place-value weighting (e.g., scaling post-decimal digits by 10^{-k}). Because numerical distance is evaluated in absolute or log scale, the per-digit supervision signal lacks explicit magnitude awareness; this assumption is load-bearing for the claim that DEL successfully extends integer learning to floats while improving numerical distance.
- [Results section] The reported outperformance on numerical distance is presented as evidence that deprecating the distance term succeeds, yet the manuscript does not include an ablation that isolates the effect of uniform digit treatment on decimal-heavy subsets of the benchmarks. Without this, it remains unclear whether the observed gains in distance metrics are robust to the lack of place-value scaling.
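The place-value concern in the first major comment can be made concrete with a small sketch. It pairs each digit with the power of ten it multiplies, then compares two predictions that each get exactly one digit wrong. The helper names are hypothetical, and the code illustrates the referee's point rather than anything the paper implements.

```python
def place_values(number):
    """Pair each digit of a decimal string with the power of ten it multiplies."""
    int_part, _, frac_part = number.partition(".")
    pairs = []
    for i, digit in enumerate(int_part):
        pairs.append((digit, 10.0 ** (len(int_part) - 1 - i)))
    for k, digit in enumerate(frac_part, start=1):
        pairs.append((digit, 10.0 ** (-k)))  # k-th digit after the point scales by 10^{-k}
    return pairs

def wrong_digits(predicted, target):
    """Count mismatched positions between two equal-length strings."""
    assert len(predicted) == len(target)
    return sum(1 for p, t in zip(predicted, target) if p != t)

def numerical_distance(predicted, target):
    """Absolute difference between the two values."""
    return abs(float(predicted) - float(target))

target = "12.34"
for predicted in ("12.35", "22.34"):
    print(predicted, wrong_digits(predicted, target),
          numerical_distance(predicted, target))
```

Both predictions differ from the target in a single position, yet their numerical distances differ by roughly three orders of magnitude. A uniform per-digit loss sees the two errors as equivalent unless the model learns positional significance from context, which is the point at issue.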
minor comments (2)
- The description of how decimal points are tokenized and how the conditional probability is computed over the extended vocabulary could be accompanied by an explicit equation or pseudocode for clarity.
- Details on the exact data splits, number of evaluation runs, and any statistical significance testing for the benchmark results are not fully elaborated, which would strengthen the empirical claims.
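The first minor comment asks for pseudocode covering tokenization and the conditional probability. One possible reading is sketched below, assuming one token per character and a callable `conditional_prob(prefix, token)` standing in for the LLM. Both are assumptions for illustration: real tokenizers may merge digits, and the page does not describe the paper's actual procedure.

```python
import math

def tokenize_number(text):
    """Split a numeric string into single-character tokens, keeping the point."""
    allowed = set("0123456789.")
    if not text or any(ch not in allowed for ch in text) or text.count(".") > 1:
        raise ValueError(f"not a plain decimal number: {text!r}")
    return list(text)

def sequence_log_prob(tokens, conditional_prob):
    """Chain rule: sum over t of log p(token_t | tokens before t)."""
    total = 0.0
    for t, token in enumerate(tokens):
        total += math.log(conditional_prob(tuple(tokens[:t]), token))
    return total

def uniform_model(prefix, token):
    """Stand-in for an LLM: each of the 11 symbols is equally likely."""
    return 1.0 / 11.0

tokens = tokenize_number("3.14")
print(tokens)
print(sequence_log_prob(tokens, uniform_model))
```

Because the decimal point is an ordinary token, the same chain-rule product covers integers and floating-point numbers without a separate code path.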
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, providing clarifications on our design choices and indicating where revisions have been made to strengthen the presentation and empirical support.
point-by-point responses
- Referee: [Section 3] The generalization to floating-point numbers incorporates decimal points but applies uniform binary cross-entropy across all digits without place-value weighting (e.g., scaling post-decimal digits by 10^{-k}). Because numerical distance is evaluated in absolute or log scale, the per-digit supervision signal lacks explicit magnitude awareness; this assumption is load-bearing for the claim that DEL successfully extends integer learning to floats while improving numerical distance.
  Authors: We appreciate the referee's observation on the uniform binary cross-entropy in the floating-point generalization. This uniformity is a deliberate choice to preserve a parameter-free formulation that directly supervises digit-level conditional probabilities via binary cross-entropy, allowing the autoregressive model to capture positional significance through sequence context rather than explicit scaling. Introducing place-value weights would add hyperparameters that risk reintroducing the over-sharpening or over-flattening issues we sought to avoid by deprecating the distance term. Benchmarks in our evaluation contain floating-point numbers, and the observed gains in both accuracy and numerical distance metrics indicate that the supervised entropy optimization supplies adequate signal. In the revised manuscript we have expanded Section 3 with a dedicated paragraph explaining this design rationale and its implications for magnitude awareness.
  revision: partial
- Referee: [Results section] The reported outperformance on numerical distance is presented as evidence that deprecating the distance term succeeds, yet the manuscript does not include an ablation that isolates the effect of uniform digit treatment on decimal-heavy subsets of the benchmarks. Without this, it remains unclear whether the observed gains in distance metrics are robust to the lack of place-value scaling.
  Authors: We agree that an ablation isolating uniform digit treatment on decimal-heavy subsets would provide stronger evidence of robustness. Our primary experiments already span seven mathematical reasoning benchmarks that include a range of floating-point expressions, with consistent improvements in numerical distance after removing the distance term. To directly address the concern, the revised results section now includes a post-hoc breakdown on subsets with elevated decimal density; the gains in distance metrics remain stable under this analysis, supporting that the supervised entropy approach does not rely on place-value scaling for its benefits.
  revision: yes
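A post-hoc breakdown of the kind mentioned in the second response could look like the sketch below. The definition of decimal density (share of characters after the point), the 0.25 threshold, the dictionary keys, and the function names are all hypothetical. The page does not say how the authors defined "elevated decimal density".

```python
def decimal_density(answer):
    """Fraction of characters that sit after the decimal point (assumed definition)."""
    _, sep, frac = answer.partition(".")
    return len(frac) / len(answer) if sep else 0.0

def split_by_decimal_density(examples, threshold=0.25):
    """Partition examples into decimal-heavy and decimal-light subsets."""
    heavy = [ex for ex in examples if decimal_density(ex["target"]) >= threshold]
    light = [ex for ex in examples if decimal_density(ex["target"]) < threshold]
    return heavy, light

def mean_abs_error(examples):
    """Mean absolute error over a subset; NaN when the subset is empty."""
    if not examples:
        return float("nan")
    total = sum(abs(float(ex["prediction"]) - float(ex["target"])) for ex in examples)
    return total / len(examples)

# Toy records, not drawn from any benchmark.
examples = [
    {"target": "3.14159", "prediction": "3.14"},
    {"target": "42", "prediction": "40"},
    {"target": "100.5", "prediction": "100.5"},
]
heavy, light = split_by_decimal_density(examples)
print(len(heavy), mean_abs_error(heavy))
print(len(light), mean_abs_error(light))
```

Reporting the distance metric separately for each subset, per loss function, would show whether gains hold where place value matters most.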
Circularity Check
No significant circularity; DEL is an independent reformulation
full rationale
The paper derives DEL by analyzing existing methods as following a criterion-distance formulation, then defines a new loss that uses digit conditional probability with binary cross-entropy for supervised entropy optimization, explicitly deprecates the distance term, and extends the formulation to include decimal points and floating-point numbers. This construction is presented directly via new equations and design choices rather than by fitting parameters to data subsets or reducing predictions to inputs by construction. No load-bearing self-citations, uniqueness theorems from prior author work, or smuggled ansatzes are invoked to justify the core claim. Experiments on seven external benchmarks with four LLMs provide independent validation, confirming the derivation chain remains self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Reformulating unsupervised entropy optimization into a supervised form using digit conditional probability and binary cross-entropy provides a valid inductive bias for numerical learning.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel [unclear]
  Relation between the paper passage and the cited Recognition theorem.
  DEL reformulates unsupervised entropy optimization via digit conditional probability and binary cross-entropy; deprecates the distance term; generalizes to floating-point with place weighting u(t).
- IndisputableMonolith/Foundation/BranchSelection.lean: branch_selection [unclear]
  Relation between the paper passage and the cited Recognition theorem.
  Criterion-distance formulation (NTL, DIST2Loss) vs. DEL entropy criterion without geometric prior.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.