DEL: Digit Entropy Loss for Numerical Learning of Large Language Models

Chenhang He; Lei Zhang; Ming-Ming Cheng; Shihao Wang; Yuxuan Li; Zhaohui Zheng

arxiv: 2605.20369 · v1 · pith:HUQEYUCZnew · submitted 2026-05-19 · 💻 cs.CL · cs.AI · cs.LG

Zhaohui Zheng, Chenhang He, Shihao Wang, Yuxuan Li, Ming-Ming Cheng, Lei Zhang

Pith reviewed 2026-05-21 07:31 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG

keywords digit entropy loss · numerical learning · large language models · mathematical reasoning · floating-point optimization · supervised entropy · binary cross-entropy · loss function

The pith

Digit Entropy Loss improves number prediction in LLMs by supervising digit probabilities with binary cross-entropy while dropping numerical distance terms and extending to floating-point values.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models often produce inaccurate numbers in math and code tasks even when they reason correctly about other parts of a problem. Standard maximum likelihood training does not target numerical accuracy, and recent penalty methods that add numerical distance create either overly peaked or overly flat digit distributions. The paper shows that these methods share a criterion-distance structure and introduces Digit Entropy Loss to replace the unsupervised entropy objective with a supervised version that conditions on prior digits, applies binary cross-entropy, removes the distance term entirely, and treats decimal points and digits uniformly so the model learns complete floating-point numbers. If the approach holds, models would generate more accurate numerical answers on reasoning benchmarks without the distribution problems of earlier losses.
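To make "overly peaked" and "overly flat" concrete, the sketch below computes the Shannon entropy of two hypothetical next-digit distributions over the tokens 0-9. The probabilities are invented for illustration and do not come from the paper.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Hypothetical next-digit distributions over the tokens 0-9 (illustrative only).
peaked = [0.991] + [0.001] * 9   # nearly all mass on a single digit
flat = [0.1] * 10                # uniform over all ten digits

h_peaked = shannon_entropy(peaked)
h_flat = shannon_entropy(flat)
print(f"peaked: {h_peaked:.3f} nats, flat: {h_flat:.3f} nats")
```

A uniform distribution over ten digits reaches the maximum entropy of ln(10) nats, while the peaked one sits close to zero. The over-sharpening and over-flattening described above correspond to being pushed toward these two extremes.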

Core claim

Existing numerical learning methods for LLMs follow a criterion-distance formulation in which the criterion defines the optimization pattern and the distance term supplies a geometric prior. Digit Entropy Loss reformulates the conventional unsupervised entropy optimization through three changes: it uses digit conditional probability together with binary cross-entropy to make entropy optimization supervised; it removes the distance term to sidestep over-sharpening and over-flattening; and it generalizes the objective from integers to full floating-point numbers that include decimal digits and points. On seven mathematical reasoning benchmarks and four LLMs the resulting loss yields higher end

What carries the argument

Digit Entropy Loss (DEL), which converts unsupervised entropy optimization into a supervised objective by conditioning on previous digits, applying binary cross-entropy, discarding the numerical distance term, and treating decimal points as ordinary tokens so the loss operates over entire floating-point numbers.
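One way to read this description is sketched below. It is an assumption-laden illustration, not the authors' implementation: the 11-token numeric vocabulary, the softmax over that vocabulary, the averaging, and the names `bce_step` and `digit_bce_loss` are all hypothetical. The per-step logits stand in for what an autoregressive model would produce given the previously generated characters.

```python
import math

# Assumed numeric vocabulary: ten digits plus the decimal point.
NUMERIC_TOKENS = list("0123456789.")

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def bce_step(logits, target_token, eps=1e-12):
    """Binary cross-entropy between per-token probabilities and a one-hot target."""
    probs = softmax(logits)
    loss = 0.0
    for token, p in zip(NUMERIC_TOKENS, probs):
        y = 1.0 if token == target_token else 0.0
        loss -= y * math.log(p + eps) + (1.0 - y) * math.log(1.0 - p + eps)
    return loss / len(NUMERIC_TOKENS)

def digit_bce_loss(step_logits, target_number):
    """Average the per-step loss over every character of the target number."""
    assert len(step_logits) == len(target_number)
    losses = [bce_step(l, t) for l, t in zip(step_logits, target_number)]
    return sum(losses) / len(losses)

def logits_for(token, scale):
    """Toy logits that favour one token by `scale` (illustrative helper)."""
    return [scale if t == token else 0.0 for t in NUMERIC_TOKENS]

target = "3.14"  # the decimal point is supervised like any digit
confident = [logits_for(ch, 8.0) for ch in target]
uniform = [[0.0] * len(NUMERIC_TOKENS) for _ in target]

confident_loss = digit_bce_loss(confident, target)
uniform_loss = digit_bce_loss(uniform, target)
print(confident_loss, uniform_loss)
```

No numerical distance appears anywhere in this sketch: a wrong digit is penalised the same way regardless of how far its value is from the target, which is the property the summary above attributes to DEL.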

If this is right

DEL produces higher overall prediction accuracy than prior numerical losses on mathematical reasoning tasks.
DEL yields smaller numerical distance errors on the same benchmarks.
DEL supports optimization over floating-point numbers that contain decimal points and digits.
DEL applies across multiple LLMs including CodeLlama, Mistral, DeepSeek, and Qwen-2.5.
DEL expands the training objective from isolated digits to complete numbers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar supervised entropy formulations could be tested on other structured sequential outputs such as dates or measurements.
Removing explicit distance penalties may simplify training pipelines that currently combine multiple loss terms for numerical data.
The approach invites direct comparison with token-level calibration methods that also aim to shape probability distributions without geometric penalties.
Extending the same digit-level conditioning to code-generation settings that require precise numeric literals could be checked in follow-up experiments.

Load-bearing premise

That removing the numerical distance term and guiding entropy with supervised binary cross-entropy on digit conditional probabilities will avoid over-sharpening and over-flattening while successfully extending integer learning to floating-point numbers.

What would settle it

If DEL fails to produce higher overall prediction accuracy or lower numerical distance than the compared losses when evaluated on the same seven benchmarks with CodeLlama, Mistral, DeepSeek, and Qwen-2.5, the central claim would be falsified.

Original abstract

Number prediction stands as a fundamental capability of large language models (LLMs) in mathematical problem-solving and code generation. The widely adopted maximum likelihood estimation (MLE) for LLM training is not tailored to number prediction. Recently, penalty-driven approaches, e.g., Number Token Loss and Discretized Distance Loss, introduce an inductive bias of numerical distance but induce over-sharpened and over-flattened digit distributions, respectively. In this paper, we make an in-depth analysis on LLM numerical learning, and show that existing numerical learning methods conceptually follow a criterion-distance formulation, where the criterion term represents optimization pattern and the distance term instills geometric prior. Consequently, we present Digit Entropy Loss (DEL) for auto-regressive numerical learning, which reformulates the conventional unsupervised entropy optimization in three key designs: leveraging digit conditional probability and binary cross-entropy to guide the entropy optimization into a supervised manner; deprecating the distance term to bypass the issue of numerical distance; and generalizing the integer-based numerical learning to floating-point number optimization, enabling more accurate number prediction. Our DEL formulation can incorporate integers, decimals, and decimal points, expanding the learning objective from a single digit to the floating-point number domain. Experiments conducted on seven mathematical reasoning benchmarks with four representative LLMs, including CodeLlama, Mistral, DeepSeek, and Qwen-2.5, demonstrate that DEL consistently outperforms its counterparts in both overall prediction accuracy and numerical distance. Source codes are at https://github.com/PolyU-VCLab/DEL
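The criterion-distance reading can be illustrated with a toy single-digit loss: a cross-entropy criterion plus a weighted distance term. The distance term used here, the expected absolute difference between digit values and the target, is chosen only for illustration and is not claimed to be the exact form used by Number Token Loss, Discretized Distance Loss, or the paper. The weight `lam` and all function names are hypothetical.

```python
import math

def cross_entropy(probs, target):
    """Criterion term: negative log-probability of the target digit."""
    return -math.log(probs[target])

def expected_abs_distance(probs, target):
    """Distance term: expected |digit - target| under the predicted distribution."""
    return sum(p * abs(d - target) for d, p in enumerate(probs))

def criterion_distance_loss(probs, target, lam=1.0):
    return cross_entropy(probs, target) + lam * expected_abs_distance(probs, target)

def two_point(target, other, p_target=0.6):
    """Toy distribution over digits 0-9 with mass on `target` and `other` only."""
    probs = [0.0] * 10
    probs[target] = p_target
    probs[other] = 1.0 - p_target
    return probs

target = 7
near_miss = two_point(target, 8)  # leftover mass on a neighbouring digit
far_miss = two_point(target, 1)   # leftover mass on a distant digit

for lam in (0.0, 1.0):
    print(lam,
          criterion_distance_loss(near_miss, target, lam),
          criterion_distance_loss(far_miss, target, lam))
```

With `lam = 0` the two predictions receive the same loss, since only the probability of the correct digit matters. With `lam > 0` the far miss is penalised more, which is the geometric prior that the distance term contributes and that DEL drops.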

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

DEL gives a supervised digit-entropy loss that drops distance penalties and adds floats via decimal points in the sequence, with reported gains on math benchmarks, but the uniform per-digit treatment raises questions for decimal error control.

The main point is that DEL turns numerical learning into a supervised task on digit probabilities with binary cross-entropy, removes the distance penalty to avoid over-sharpening or over-flattening, and generalizes from integers to floats by adding decimal points to the sequence. This stands out from the prior work on Number Token Loss and Discretized Distance Loss because it explicitly drops the geometric prior and uses a supervised entropy guide instead. The experiments back this up with gains across CodeLlama, Mistral, DeepSeek, and Qwen-2.5 on seven math reasoning benchmarks, both in accuracy and in reducing numerical distance. The public code link makes it easier to verify.

A weaker part is the floating-point extension. The loss treats digits before and after the decimal point the same, without any weighting by place value, such as 10 raised to the negative of the digit's position. Since real numerical error depends on magnitude, this uniform per-digit signal might not push the model to get the more significant decimal places right. The stress test highlights this, and without more details on decimal-specific results or comparisons, it is not clear how robust the generalization is.

This work targets people building or fine-tuning LLMs for tasks that involve precise numbers, such as math problem solving or code generation. A reader looking for practical loss modifications will get value from the three explicit design choices and the benchmark results. It deserves a serious referee because the formulation is distinct and the claims are testable with the released code. I would send it for peer review to sort out the details on floats and the stats.

Referee Report

2 major / 2 minor

Summary. The paper proposes Digit Entropy Loss (DEL) for numerical learning in LLMs. It frames prior methods (MLE, Number Token Loss, Discretized Distance Loss) as criterion-distance formulations and introduces DEL via three changes: supervised entropy optimization using digit conditional probabilities and binary cross-entropy, removal of the numerical distance term, and extension from integers to floating-point numbers by incorporating decimal points into the token sequence. Experiments across seven mathematical reasoning benchmarks and four LLMs (CodeLlama, Mistral, DeepSeek, Qwen-2.5) report consistent gains in both prediction accuracy and numerical distance metrics.

Significance. If the central claims hold, DEL offers a parameter-free alternative that sidesteps over-sharpening and over-flattening while extending numerical supervision to decimals. The multi-LLM, multi-benchmark evaluation and public code release provide a reproducible empirical foundation for improved number prediction in mathematical reasoning and code generation tasks.

major comments (2)

[Section 3] The generalization to floating-point numbers incorporates decimal points but applies uniform binary cross-entropy across all digits without place-value weighting (e.g., scaling post-decimal digits by 10^{-k}). Because numerical distance is evaluated in absolute or log scale, the per-digit supervision signal lacks explicit magnitude awareness; this assumption is load-bearing for the claim that DEL successfully extends integer learning to floats while improving numerical distance.
[Results section] The reported outperformance on numerical distance is presented as evidence that deprecating the distance term succeeds, yet the manuscript does not include an ablation that isolates the effect of uniform digit treatment on decimal-heavy subsets of the benchmarks. Without this, it remains unclear whether the observed gains in distance metrics are robust to the lack of place-value scaling.
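The magnitude concern raised in these comments can be seen in a small worked example. The sketch below compares a uniform per-character mismatch count with the absolute numerical error for two hypothetical predictions of the same target. It illustrates the general property of uniform per-position signals only; it does not evaluate DEL itself, and the helper names are made up for this example.

```python
from decimal import Decimal

def char_mismatches(pred, target):
    """Count positions where two equal-length number strings differ."""
    assert len(pred) == len(target)
    return sum(a != b for a, b in zip(pred, target))

def abs_error(pred, target):
    """Absolute numerical error, computed exactly with Decimal."""
    return abs(Decimal(pred) - Decimal(target))

target = "3.14"
for pred in ("3.15", "4.14"):
    print(pred, char_mismatches(pred, target), abs_error(pred, target))
```

Both predictions differ from the target in exactly one position, yet their absolute errors differ by a factor of 100. Whether an autoregressive model recovers this difference from sequence context alone is the empirical question the referee asks the authors to isolate.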

minor comments (2)

The description of how decimal points are tokenized and how the conditional probability is computed over the extended vocabulary could be accompanied by an explicit equation or pseudocode for clarity.
Details on the exact data splits, number of evaluation runs, and any statistical significance testing for the benchmark results are not fully elaborated, which would strengthen the empirical claims.
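As a rough idea of what the pseudocode requested in the first minor comment might look like, the sketch below splits a decimal literal into per-character prediction steps, pairing each target token with the prefix it is conditioned on. This is a hypothetical illustration: the paper's actual tokenization is not described in this review and in practice depends on each model's tokenizer. The function name `number_to_steps` is invented here.

```python
def number_to_steps(number):
    """Return (prefix, next_token) pairs for a number string, one per character.

    Each pair represents one conditional prediction: the next token given
    all previously emitted digits and, if present, the decimal point.
    """
    allowed = set("0123456789.")
    if not number or any(ch not in allowed for ch in number) or number.count(".") > 1:
        raise ValueError(f"not a plain decimal literal: {number!r}")
    return [(number[:i], ch) for i, ch in enumerate(number)]

steps = number_to_steps("3.14")
for prefix, token in steps:
    print(repr(prefix), "->", repr(token))
```

In this layout the decimal point is simply one more target token, so an integer such as "42" and a float such as "3.14" are handled by the same loop.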

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, providing clarifications on our design choices and indicating where revisions have been made to strengthen the presentation and empirical support.

Referee: [Section 3] The generalization to floating-point numbers incorporates decimal points but applies uniform binary cross-entropy across all digits without place-value weighting (e.g., scaling post-decimal digits by 10^{-k}). Because numerical distance is evaluated in absolute or log scale, the per-digit supervision signal lacks explicit magnitude awareness; this assumption is load-bearing for the claim that DEL successfully extends integer learning to floats while improving numerical distance.

Authors: We appreciate the referee's observation on the uniform binary cross-entropy in the floating-point generalization. This uniformity is a deliberate choice to preserve a parameter-free formulation that directly supervises digit-level conditional probabilities via binary cross-entropy, allowing the autoregressive model to capture positional significance through sequence context rather than explicit scaling. Introducing place-value weights would add hyperparameters that risk reintroducing the over-sharpening or over-flattening issues we sought to avoid by deprecating the distance term. Benchmarks in our evaluation contain floating-point numbers, and the observed gains in both accuracy and numerical distance metrics indicate that the supervised entropy optimization supplies adequate signal. In the revised manuscript we have expanded Section 3 with a dedicated paragraph explaining this design rationale and its implications for magnitude awareness.
revision: partial

Referee: [Results section] The reported outperformance on numerical distance is presented as evidence that deprecating the distance term succeeds, yet the manuscript does not include an ablation that isolates the effect of uniform digit treatment on decimal-heavy subsets of the benchmarks. Without this, it remains unclear whether the observed gains in distance metrics are robust to the lack of place-value scaling.

Authors: We agree that an ablation isolating uniform digit treatment on decimal-heavy subsets would provide stronger evidence of robustness. Our primary experiments already span seven mathematical reasoning benchmarks that include a range of floating-point expressions, with consistent improvements in numerical distance after removing the distance term. To directly address the concern, the revised results section now includes a post-hoc breakdown on subsets with elevated decimal density; the gains in distance metrics remain stable under this analysis, supporting that the supervised entropy approach does not rely on place-value scaling for its benefits.
revision: yes

Circularity Check

0 steps flagged

No significant circularity; DEL is an independent reformulation

The paper derives DEL by analyzing existing methods as following a criterion-distance formulation, then defines a new loss that uses digit conditional probability with binary cross-entropy for supervised entropy optimization, explicitly deprecates the distance term, and extends the formulation to include decimal points and floating-point numbers. This construction is presented directly via new equations and design choices rather than by fitting parameters to data subsets or reducing predictions to inputs by construction. No load-bearing self-citations, uniqueness theorems from prior author work, or smuggled ansatzes are invoked to justify the core claim. Experiments on seven external benchmarks with four LLMs provide independent validation, confirming the derivation chain remains self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on standard machine-learning assumptions about loss optimization and the validity of the three design choices for numerical learning; no free parameters or new entities are introduced in the abstract.

axioms (1)

domain assumption: Reformulating unsupervised entropy optimization into a supervised form using digit conditional probability and binary cross-entropy provides a valid inductive bias for numerical learning.
This is the core conceptual step described in the abstract's analysis of existing methods.

pith-pipeline@v0.9.0 · 5825 in / 1250 out tokens · 39644 ms · 2026-05-21T07:31:38.883732+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

Relation between the paper passage and the cited Recognition theorem.

DEL reformulates unsupervised entropy optimization via digit conditional probability and binary cross-entropy; deprecates the distance term; generalizes to floating-point with place weighting u(t).
IndisputableMonolith/Foundation/BranchSelection.lean branch_selection unclear

Relation between the paper passage and the cited Recognition theorem.

Criterion-distance formulation (NTL, DIST2Loss) vs. DEL entropy criterion without geometric prior.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

70 extracted references · 70 canonical work pages · 13 internal anchors

[1]

Math-shepherd: Verify and reinforce llms step-by-step without human annotations

Peiyi Wang, Lei Li, Zhihong Shao, Runxin Xu, Damai Dai, Yifei Li, Deli Chen, Yu Wu, and Zhifang Sui. Math-shepherd: Verify and reinforce llms step-by-step without human annotations. InACL, pages 9426–9439, 2024

work page 2024
[2]

DART-Math: Difficulty-aware rejection tuning for mathematical problem-solving

Yuxuan Tong, Xiwen Zhang, Rui Wang, Ruidong Wu, and Junxian He. DART-Math: Difficulty-aware rejection tuning for mathematical problem-solving. InNeurIPS, volume 37, pages 7821–7846, 2024

work page 2024
[3]

Mathscale: Scaling instruction tuning for mathematical reasoning

Zhengyang Tang, Xingxing Zhang, Benyou Wang, and Furu Wei. Mathscale: Scaling instruction tuning for mathematical reasoning. InICML, pages 47885–47900, 2024

work page 2024
[4]

Code Llama: Open Foundation Models for Code

Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, et al. Code Llama: Open foundation models for code.arXiv preprint arXiv:2308.12950, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[5]

Mammoth: Building math generalist models through hybrid instruction tuning

Xiang Yue, Xingwei Qu, Ge Zhang, Yao Fu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. Mammoth: Building math generalist models through hybrid instruction tuning. InICLR, 2024

work page 2024
[6]

Openmathinstruct-1: A 1.8 million math instruction tuning dataset

ShubhamToshniwal, IvanMoshkov, SeanNarenthiran, DariaGitman, FeiJia, andIgorGitman. Openmathinstruct-1: A 1.8 million math instruction tuning dataset. InNeurIPS, volume 37, pages 34737–34774, 2024

work page 2024
[7]

Class-based n-gram models of natural language.Computational linguistics, 18(4):467–480, 1992

Peter F Brown, Vincent J Della Pietra, Peter V Desouza, Jennifer C Lai, and Robert L Mercer. Class-based n-gram models of natural language.Computational linguistics, 18(4):467–480, 1992

work page 1992
[8]

The curious case of neural text degeneration

Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. In ICLR, 2020

work page 2020
[9]

MixCE: Training autoregressive language models by mixing forward and reverse cross-entropies

Shiyue Zhang, Shijie Wu, Ozan Irsoy, Steven Lu, Mohit Bansal, Mark Dredze, and David Rosenberg. MixCE: Training autoregressive language models by mixing forward and reverse cross-entropies. InACL, pages 9027–9050, 2023

work page 2023
[10]

Tailoring language generation models under total variation distance

Haozhe Ji, Pei Ke, Zhipeng Hu, Rongsheng Zhang, and Minlie Huang. Tailoring language generation models under total variation distance. InICLR, 2023

work page 2023
[11]

Siyu Ren, Zhiyong Wu, and Kenny Q. Zhu. EMO: Earth mover distance optimization for auto-regressive language modeling. InICLR, 2024

work page 2024
[12]

Regress, don’t guess–a regression-like loss on number tokens for language models

Jonas Zausinger, Lars Pennig, Anamarija Kozina, Sean Sdahl, Julian Sikora, Adrian Dendorfer, Timofey Kuznetsov, Mohamad Hagog, Nina Wiedemann, Kacper Chlodny, et al. Regress, don’t guess–a regression-like loss on number tokens for language models. InICML, 2025

work page 2025
[13]

Teaching metric distance to autoregressive multimodal foundational models

Jiwan Chung, Saejin Kim, Yongrae Jo, Jaewoo Park, Dongjun Min, and Youngjae Yu. Teaching metric distance to autoregressive multimodal foundational models. InICLR, 2026

work page 2026
[14]

Beyond the 80/20 rule: High-entropy minority tokens drive effective reinforcement learning for llm reasoning

Shenzhi Wang, Le Yu, Chang Gao, Chujie Zheng, Shixuan Liu, Rui Lu, Kai Dang, Xiong-Hui Chen, Jianxin Yang, Zhenru Zhang, et al. Beyond the 80/20 rule: High-entropy minority tokens drive effective reinforcement learning for llm reasoning. InNeurIPS, 2025

work page 2025
[15]

Minimum error rate training in statistical machine translation

Franz Josef Och. Minimum error rate training in statistical machine translation. InACL, pages 160–167, 2003

work page 2003
[16]

Neural machine translation of rare words with subword units

Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. InACL, pages 1715–1725, 2016

work page 2016
[17]

Towards end-to-end speech recognition with recurrent neural networks

Alex Graves and Navdeep Jaitly. Towards end-to-end speech recognition with recurrent neural networks. InICML, pages 1764–1772, 2014

work page 2014
[18]

Robust speech recognition via large-scale weak supervision

Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. InICML, pages 28492–28518, 2023

work page 2023
[19]

Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks

Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. InICML, pages 369–376, 2006. Visual Computing Lab·The Hong Kong Polytechnic University 10 / 22

work page 2006
[20]

Trocr: Transformer-basedopticalcharacterrecognitionwithpre-trainedmodels

Minghao Li, Tengchao Lv, Jingye Chen, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, and Furu Wei. Trocr: Transformer-basedopticalcharacterrecognitionwithpre-trainedmodels. InAAAI,pages13094–13102, 2023

work page 2023
[21]

Long short-term memory.Neural computation, 9(8):1735–1780, 1997

Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory.Neural computation, 9(8):1735–1780, 1997

work page 1997
[22]

Attention is all you need.NeurIPS, 2017

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.NeurIPS, 2017

work page 2017
[23]

A neural probabilistic language model

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb):1137–1155, 2003

work page 2003
[24]

Efficient Estimation of Word Representations in Vector Space

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space.arXiv preprint arXiv:1301.3781, 2013

work page internal anchor Pith review Pith/arXiv arXiv 2013
[25]

BERT: Pre-training of deep bidirectional transformers for language understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. InNAACL, pages 4171–4186, 2019

work page 2019
[26]

Language models are few-shot learners

TomBrown,BenjaminMann,NickRyder,MelanieSubbiah,JaredDKaplan,PrafullaDhariwal,ArvindNeelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. InNeurIPS, pages 1877–1901, 2020

work page 1901
[27]

Chain-of-thought prompting elicits reasoning in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. InNeurIPS, pages 24824–24837, 2022

work page 2022
[28]

Compositional chain-of-thought prompting for large multimodal models

Chancharik Mitra, Brandon Huang, Trevor Darrell, and Roei Herzig. Compositional chain-of-thought prompting for large multimodal models. InCVPR, pages 14420–14431, 2024

work page 2024
[29]

Tree of thoughts: Deliberate problem solving with large language models

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. InNeurIPS, pages 11809–11822, 2023

work page 2023
[30]

Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks.Transactions on Machine Learning Research, 2023

Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W Cohen. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks.Transactions on Machine Learning Research, 2023

work page 2023
[31]

PAL: Program-aided language models

Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. PAL: Program-aided language models. InICML, pages 10764–10799, 2023

work page 2023
[32]

How well do large language models perform in arithmetic tasks?arXiv preprint arXiv:2304.02015, 2023

Zheng Yuan, Hongyi Yuan, Chuanqi Tan, Wei Wang, and Songfang Huang. How well do large language models perform in arithmetic tasks?arXiv preprint arXiv:2304.02015, 2023

work page arXiv 2023
[33]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[34]

Measuring mathematical problem solving with the math dataset.NeurIPS, 2021

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset.NeurIPS, 2021

work page 2021
[35]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. InNeurIPS, pages 34892–34916, 2023

work page 2023
[36]

InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. InCVPR, pages 24185–24198, 2024

work page 2024
[37]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-VL: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[38]

Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of Machine Learning Research, 21(140):1–67, 2020

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of Machine Learning Research, 21(140):1–67, 2020

work page 2020
[39]

Language models are unsupervised multitask learners.OpenAI blog, 1(8):9, 2019

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners.OpenAI blog, 1(8):9, 2019

work page 2019
[40]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023. Visual Computing Lab·The Hong Kong Polytechnic University 11 / 22

work page internal anchor Pith review Pith/arXiv arXiv 2023
[41]

GPT-4 Technical Report

JoshAchiam,StevenAdler,SandhiniAgarwal,LamaAhmad,IlgeAkkaya,FlorenciaLeoniAleman,DiogoAlmeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[42]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[43]

Qwen Technical Report

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[44]

Qwen2 Technical Report

Qwen Team et al. Qwen2 technical report.arXiv preprint arXiv:2407.10671, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[45]

Qwen2.5 technical report, 2025

Qwen, :, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li,...

work page 2025
[46]

Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7b, 2023

work page 2023
[47]

Judging llm-as-a-judge with mt-bench and chatbot arena

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. InNeurIPS, volume 36, pages 46595–46623, 2023

work page 2023
[48]

A mathematical theory of communication.The Bell System Technical Journal, 27(3): 379–423, 1948

Claude Elwood Shannon. A mathematical theory of communication.The Bell System Technical Journal, 27(3): 379–423, 1948

work page 1948
[49]

Griffiths, and Ilia Sucholutsky

Raja Marjieh, Veniamin Veselovsky, Thomas L. Griffiths, and Ilia Sucholutsky. What is a number, that a large language model may know it? InNeurIPS Workshop, 2025

work page 2025
[50]

Benford’s curse: Tracing digit bias to numerical hallucination in LLMs

Jiandong Shao, Yao Lu, and Jianfei Yang. Benford’s curse: Tracing digit bias to numerical hallucination in LLMs. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

work page 2025
[51]

Visualizing data using t-sne.Journal of Machine Learning Research, 9(11), 2008

Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-sne.Journal of Machine Learning Research, 9(11), 2008

[52] Zhiwei He, Tian Liang, Jiahao Xu, Qiuzhi Liu, Xingyu Chen, Yue Wang, Linfeng Song, Dian Yu, Zhenwen Liang, Wenxuan Wang, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, and Dong Yu. DeepMath-103K: A large-scale, challenging, decontaminated, and verifiable mathematical dataset for advancing reasoning. arXiv preprint arXiv:2504.11456, 2025.

[53] Ross Taylor, Marcin Kardas, Guillem Cucurull, Thomas Scialom, Anthony Hartshorn, Elvis Saravia, Andrew Poulton, Viktor Kerkez, and Robert Stojnic. Galactica: A large language model for science. arXiv preprint arXiv:2211.09085, 2022.

[54] Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu. MetaMath: Bootstrap your own mathematical questions for large language models. In ICLR, 2024.

[55] Chen Li, Weiqi Wang, Jingcheng Hu, Yixuan Wei, Nanning Zheng, Han Hu, Zheng Zhang, and Houwen Peng. Common 7B language models already possess strong math capabilities. arXiv preprint arXiv:2403.04706, 2024.

[56] Haipeng Luo, Qingfeng Sun, Can Xu, Pu Zhao, Jianguang Lou, Chongyang Tao, Xiubo Geng, Qingwei Lin, Shifeng Chen, and Dongmei Zhang. WizardMath: Empowering mathematical reasoning for large language models via reinforced Evol-Instruct. arXiv preprint arXiv:2308.09583, 2023.

[57] Jujie He, Jiacai Liu, Chris Yuhao Liu, Rui Yan, Chaojie Wang, Peng Cheng, Xiaoyu Zhang, Fuxiang Zhang, Jiacheng Xu, Wei Shen, et al. Skywork Open Reasoner 1 technical report. arXiv preprint arXiv:2505.22312, 2025.

[58] Arkil Patel, Satwik Bhattamishra, and Navin Goyal. Are NLP models really able to solve simple math word problems? In NAACL, pages 2080–2094, 2021.

[59] Rik Koncel-Kedziorski, Subhro Roy, Aida Amini, Nate Kushman, and Hannaneh Hajishirzi. MAWPS: A math word problem repository. In NAACL, pages 1152–1157, 2016.

[60] Wang Ling, Dani Yogatama, Chris Dyer, and Phil Blunsom. Program induction by rationale generation: Learning to solve and explain algebraic word problems. In ACL, pages 158–167, 2017.

[61] Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin Wang, Amin Saied, Weizhu Chen, and Nan Duan. AGIEval: A human-centric benchmark for evaluating foundation models. In NAACL, pages 2299–2314, 2024.

[62] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In ICLR, 2021.

[63] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. Transformers: State-of-the-art natural language processing. In EMNLP, pages 38–45, 2020.

[64] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019.

[65] Jinghao Bian, Mingtao Feng, Weisheng Dong, Fangfang Wu, Jianqiao Luo, Yaonan Wang, and Guangming Shi. Feature information driven position Gaussian distribution estimation for tiny object detection. In CVPR, pages 30376–30386, 2025.

[66] Shivam Agarwal, Zimin Zhang, Lifan Yuan, Jiawei Han, and Hao Peng. The unreasonable effectiveness of entropy minimization in LLM reasoning. In NeurIPS, 2025.

[67] Ziniu Li, Congliang Chen, Tian Xu, Zeyu Qin, Jiancong Xiao, Zhi-Quan Luo, and Ruoyu Sun. Preserving diversity in supervised fine-tuning of large language models. In ICLR, 2025.

[68] Qiufu Li, Huibin Xiao, and Linlin Shen. BCE vs. CE in deep feature learning. In ICML, 2025.

[69] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In ICCV, pages 2980–2988, 2017.

[70] Freda Shi, Xinyun Chen, Kanishka Misra, Nathan Scales, David Dohan, Ed H Chi, Nathanael Schärli, and Denny Zhou. Large language models can be easily distracted by irrelevant context. In ICML, pages 31210–31227, 2023.

Appendix

In the appendix, we provide the following materials: A. More analysis ...


