Recognition: 3 theorem links
GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models
Pith reviewed 2026-05-11 17:42 UTC · model grok-4.3
The pith
GLM-4.5 reaches 70.1% on TAU-Bench and 91.0% on AIME 24 with an open-source 355B-parameter MoE model that activates only 32B parameters per token.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GLM-4.5 achieves strong performance across agentic, reasoning, and coding (ARC) tasks, scoring 70.1% on TAU-Bench, 91.0% on AIME 24, and 64.2% on SWE-bench Verified. With far fewer parameters than several competitors, GLM-4.5 ranks 3rd overall among all evaluated models and 2nd on agentic benchmarks.
What carries the argument
The hybrid reasoning method that supports both thinking and direct response modes, built inside a Mixture-of-Experts architecture with 355 billion total parameters but only 32 billion activated per token.
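The parameter split (355B total, 32B active) follows from standard MoE routing: a router scores all experts per token but only the top-k are evaluated. The sketch below is a toy illustration of that mechanism; the shapes, router, and top-k value are invented for clarity and are not GLM-4.5's actual configuration.

```python
import numpy as np

# Toy dimensions, NOT GLM-4.5's real configuration.
d_model = 64        # hidden size
n_experts = 16      # total experts in the layer
top_k = 2           # experts activated per token

rng = np.random.default_rng(0)
router_w = rng.normal(size=(d_model, n_experts))
experts = rng.normal(size=(n_experts, d_model, d_model))  # one FFN matrix per expert

def moe_layer(x):
    """x: (d_model,) single token. Mix the outputs of the top-k experts."""
    logits = x @ router_w                      # router score per expert
    top = np.argsort(logits)[-top_k:]          # indices of the k best experts
    gates = np.exp(logits[top])
    gates /= gates.sum()                       # softmax over the selected k only
    # Only top_k of n_experts weight matrices are touched for this token,
    # which is why "activated" parameters are far fewer than total parameters.
    return sum(g * (x @ experts[i]) for g, i in zip(gates, top))

y = moe_layer(rng.normal(size=d_model))
active_fraction = top_k / n_experts            # 2/16 here; ~32B/355B in the paper
```

The fraction of weights touched per token is `top_k / n_experts` at the layer level, which is the source of the efficiency claim.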
Load-bearing premise
That the reported benchmark scores reflect genuine capabilities measured through fair, standardized, and uncontaminated evaluations that allow direct comparison to other models.
What would settle it
Independent re-evaluation of the model on the same benchmark problems using fresh, publicly documented prompts and code, or testing on a new suite of problems created after the training cutoff, would confirm or refute the claimed scores.
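One concrete form of that test: re-run the model on fresh problems and check whether the claimed score falls inside a confidence interval around the observed accuracy. The 87/100 re-run result below is invented purely for illustration; only the 91.0% figure comes from the paper.

```python
import math

def wilson_interval(correct, n, z=1.96):
    """95% Wilson score interval for an observed accuracy correct/n."""
    p = correct / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Hypothetical: an independent re-run answers 87 of 100 fresh
# AIME-style problems correctly (made-up numbers).
lo, hi = wilson_interval(87, 100)
claimed = 0.91   # the paper's AIME 24 figure
consistent = lo <= claimed <= hi  # claim survives if it sits in the interval
```

A claimed score outside the interval would be evidence against the headline number; one inside it, as here, is merely compatible with it.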
read the original abstract
We present GLM-4.5, an open-source Mixture-of-Experts (MoE) large language model with 355B total parameters and 32B activated parameters, featuring a hybrid reasoning method that supports both thinking and direct response modes. Through multi-stage training on 23T tokens and comprehensive post-training with expert model iteration and reinforcement learning, GLM-4.5 achieves strong performance across agentic, reasoning, and coding (ARC) tasks, scoring 70.1% on TAU-Bench, 91.0% on AIME 24, and 64.2% on SWE-bench Verified. With much fewer parameters than several competitors, GLM-4.5 ranks 3rd overall among all evaluated models and 2nd on agentic benchmarks. We release both GLM-4.5 (355B parameters) and a compact version, GLM-4.5-Air (106B parameters), to advance research in reasoning and agentic AI systems. Code, models, and more information are available at https://github.com/zai-org/GLM-4.5.
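The abstract's "hybrid reasoning method" means one checkpoint serving two inference modes. A minimal sketch of that dispatch, with control markers invented for this illustration (they are not GLM-4.5's actual chat-template tokens):

```python
# Hypothetical request wrapper illustrating a hybrid-reasoning switch:
# one model, two inference modes. The <think_on>/<think_off> markers
# are invented for this sketch.
def build_prompt(question, thinking=True):
    if thinking:
        # Thinking mode: the model is steered to emit an explicit
        # reasoning trace before its final answer.
        return f"<think_on>\n{question}"
    # Direct mode: skip the trace for latency-sensitive queries.
    return f"<think_off>\n{question}"

slow = build_prompt("Prove that 17 is prime.", thinking=True)
fast = build_prompt("What is 2 + 2?", thinking=False)
```

The design point is that mode selection happens at request time, not by swapping models.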
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces GLM-4.5, an open-source Mixture-of-Experts (MoE) large language model with 355B total parameters and 32B activated parameters. It features a hybrid reasoning method that supports both thinking and direct response modes. The model undergoes multi-stage training on 23T tokens and post-training with expert model iteration and reinforcement learning. GLM-4.5 reports strong results across agentic, reasoning, and coding (ARC) tasks, including 70.1% on TAU-Bench, 91.0% on AIME 24, and 64.2% on SWE-bench Verified. It ranks 3rd overall among evaluated models and 2nd on agentic benchmarks despite having fewer parameters than several competitors. A compact variant, GLM-4.5-Air (106B parameters), is also released, with code and models made available at a GitHub repository.
Significance. If the benchmark results hold under verifiable and standardized conditions, the work advances open-source models for agentic and reasoning tasks by demonstrating competitive performance with an efficient MoE architecture and hybrid reasoning. The public release of both the full and compact models, along with code, is a clear strength that enables reproducibility and community follow-up research on ARC capabilities.
major comments (1)
- [Abstract] The central performance claims, including the specific scores of 70.1% on TAU-Bench and 64.2% on SWE-bench Verified together with the 3rd-overall and 2nd-agentic rankings, are presented without any description of the evaluation methodology. Details on agent scaffolding, tool-use protocols, attempt limits, prompting consistency, use of the hybrid thinking mode, and data-contamination controls are required to establish that the results are comparable to those of competing models; their absence undermines confidence in the headline rankings.
minor comments (2)
- [Abstract] The phrase 'expert model iteration' in the abstract is used without definition or reference to a methods section; a brief clarification would improve readability.
- The efficiency claim ('much fewer parameters than several competitors') would be strengthened by explicitly listing the parameter counts of the referenced competing models in a comparison table.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our manuscript. The feedback highlights an important point about ensuring transparency in the abstract for benchmark results. We address this directly below.
read point-by-point responses
- Referee: [Abstract] The central performance claims, including the specific scores of 70.1% on TAU-Bench and 64.2% on SWE-bench Verified together with the 3rd-overall and 2nd-agentic rankings, are presented without any description of the evaluation methodology. Details on agent scaffolding, tool-use protocols, attempt limits, prompting consistency, use of the hybrid thinking mode, and data-contamination controls are required to establish that the results are comparable to those of competing models; their absence undermines confidence in the headline rankings.
Authors: We agree that the abstract, constrained by length, omits explicit methodology details, which can affect an immediate assessment of comparability. The full manuscript contains sections on evaluation protocols covering agent scaffolding (standard setups for TAU-Bench and SWE-bench), tool-use protocols, attempt limits, prompting strategies, selective use of the hybrid thinking mode, and data-contamination controls via held-out test sets and decontamination procedures. In the revision, we will expand the abstract with a concise clause summarizing these elements and add cross-references to the detailed methodology sections. This change improves clarity while preserving the abstract's brevity. We do not believe the core results or rankings require alteration, only better contextualization. Revision: yes.
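The rebuttal's "decontamination procedures" usually mean n-gram overlap filtering between the training corpus and benchmark items. A minimal sketch of that check, with illustrative n and threshold settings (the paper does not disclose its actual ones):

```python
def ngrams(text, n=8):
    """Set of word-level n-grams from lowercased text."""
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def contaminated(train_doc, bench_item, n=8, threshold=1):
    """Flag a benchmark item that shares >= threshold n-grams with a
    training document. n and threshold are illustrative knobs only."""
    overlap = ngrams(train_doc, n) & ngrams(bench_item, n)
    return len(overlap) >= threshold

bench = "let x be the smallest positive integer such that x squared plus one is prime"
clean_doc = "unrelated prose about mixture of experts routing and load balancing"
leaky_doc = "forum post: let x be the smallest positive integer such that x squared plus one is prime, anyone?"
```

Items flagged this way are dropped from either the training set or the test set before scores are reported; real pipelines also normalize punctuation and casing more aggressively than this sketch.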
Circularity Check
No circularity: purely empirical benchmark reporting
full rationale
The paper describes training GLM-4.5 (355B MoE) on 23T tokens with post-training and RL, then reports measured benchmark scores (70.1% TAU-Bench, 91.0% AIME 24, 64.2% SWE-bench Verified). No mathematical derivations, equations, fitted predictions, or first-principles results exist. Claims rest on independent empirical evaluations with no self-definitional loops, fitted-input predictions, or load-bearing self-citations that reduce the central results to inputs by construction. Standard model-release structure; derivation chain is absent.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith.Foundation.DAlembert.Inevitability · bilinear_family_forced
unclear: Relation between the paper passage and the cited Recognition theorem.
GLM-4.5 achieves strong performance across agentic, reasoning, and coding (ARC) tasks, scoring 70.1% on TAU-Bench, 91.0% on AIME 24, and 64.2% on SWE-bench Verified. With much fewer parameters than several competitors, GLM-4.5 ranks 3rd overall among all evaluated models and 2nd on agentic benchmarks.
-
IndisputableMonolith.Foundation.PhiForcing · phi_equation
unclear: Relation between the paper passage and the cited Recognition theorem.
We present GLM-4.5, an open-source Mixture-of-Experts (MoE) large language model with 355B total parameters and 32B activated parameters, featuring a hybrid reasoning method that supports both thinking and direct response modes.
-
IndisputableMonolith.Foundation.LedgerForcing · conservation_from_balance
unclear: Relation between the paper passage and the cited Recognition theorem.
Through multi-stage training on 23T tokens and comprehensive post-training with expert model iteration and reinforcement learning
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 55 Pith papers
-
Sieve: Dynamic Expert-Aware PIM Acceleration for Evolving Mixture-of-Experts Models
Sieve dynamically schedules MoE experts across GPU and PIM hardware to handle bimodal token distributions, achieving 1.3x to 1.6x gains in throughput and interactivity over static prior PIM systems on three large models.
-
ReLibra: Routing-Replay-Guided Load Balancing for MoE Training in Reinforcement Learning
ReLibra uses pre-known token-to-expert routing from RL rollouts to perform inter-batch expert reordering and intra-batch replication, delivering up to 1.6x higher throughput than Megatron-LM and 1.2x over oracle-equip...
-
WildTableBench: Benchmarking Multimodal Foundation Models on Table Understanding In the Wild
WildTableBench is the first benchmark for multimodal models on naturally occurring table images, with only one of 21 tested models exceeding 50% accuracy and most ranging from 4.1% to 49.9%.
-
Beyond the Assistant Turn: User Turn Generation as a Probe of Interaction Awareness in Language Models
User-turn generation reveals that LLMs' interaction awareness is largely decoupled from task accuracy, remaining near zero in deterministic settings even as accuracy scales to 96.8% on GSM8K.
-
GGBound: A Genome-Grounded Agent for Microbial Life-Boundary Prediction
A genome-conditioned 4B LLM agent predicts microbial life boundaries and matches larger frontier models via token fusion, tool use, and a counterfactual gene-grounding reward.
-
CUDABeaver: Benchmarking LLM-Based Automated CUDA Debugging
CUDABeaver shows LLM CUDA debuggers often degenerate code for test-passing at the cost of speed, with protocol-aware metrics shifting success rates by up to 40 percentage points.
-
StoryAlign: Evaluating and Training Reward Models for Story Generation
StoryReward, trained on a new 100k story preference dataset, sets state-of-the-art performance on the introduced StoryRMB benchmark for aligning LLM stories with human preferences.
-
When Is the Same Model Not the Same Service? A Measurement Study of Hosted Open-Weight LLM APIs
Hosted open-weight LLMs function as heterogeneous, time-varying services rather than uniform model artifacts, with concentrated demand, decoupled supply and adoption, and measurable gains from task-aware routing.
-
AgentVisor: Defending LLM Agents Against Prompt Injection via Semantic Virtualization
AgentVisor cuts prompt injection success rate to 0.65% in LLM agents with only 1.45% utility loss via semantic privilege separation and one-shot self-correction.
-
Dr.Sai: An agentic AI for real-world physics analysis at BESIII
Dr.Sai autonomously executed full physics analysis pipelines on real BESIII data to re-measure ten J/psi decay branching fractions, matching established benchmarks without any manual coding.
-
Towards Temporal Compositional Reasoning in Long-Form Sports Videos
SportsTime benchmark and CoTR method improve multimodal AI's temporal compositional reasoning and evidence grounding in long-form sports videos.
-
FEPLB: Exploiting Copy Engines for Nearly Free MoE Load Balancing in Distributed Training
FEPLB reduces token and GEMM stragglers in MoE training by 50-70% using nearly free Copy Engine communication on Hopper architecture.
-
Expert Upcycling: Shifting the Compute-Efficient Frontier of Mixture-of-Experts
Expert upcycling duplicates experts in an existing MoE checkpoint and continues pre-training to match fixed-size baseline performance with 32% less compute.
-
AdversarialCoT: Single-Document Retrieval Poisoning for LLM Reasoning
A single query-specific poisoned document, built by extracting and iteratively refining an adversarial chain-of-thought, can substantially degrade reasoning accuracy in retrieval-augmented LLM systems.
-
E2E-REME: Towards End-to-End Microservices Auto-Remediation via Experience-Simulation Reinforcement Fine-Tuning
E2E-REME outperforms nine LLMs in accuracy and efficiency for end-to-end microservice remediation by using experience-simulation reinforcement fine-tuning on a new benchmark called MicroRemed.
-
ImplicitMemBench: Measuring Unconscious Behavioral Adaptation in Large Language Models
ImplicitMemBench shows no LLM exceeds 66% on implicit memory tasks, with top models at 65%, far below humans and pointing to architectural limits beyond scaling.
-
Bridging Time and Space: Decoupled Spatio-Temporal Alignment for Video Grounding
Bridge-STG decouples spatio-temporal alignment via semantic bridging and query-guided localization modules to achieve state-of-the-art m_vIoU of 34.3 on VidSTG among MLLM methods.
-
Autopoiesis: A Self-Evolving System Paradigm for LLM Serving Under Runtime Dynamics
Autopoiesis uses LLM-driven program synthesis to evolve serving policies online during deployment, delivering up to 53% and average 34% gains over prior LLM serving systems under runtime dynamics.
-
Code-Centric Detection of Vulnerability-Fixing Commits: A Unified Benchmark and Empirical Study
Code language models show no transferable security understanding from code diffs alone, rely on commit messages, miss over 93% of fixes at 0.5% false positive rate, and suffer large drops under group or temporal splits.
-
ComplexMCP: Evaluation of LLM Agents in Dynamic, Interdependent, and Large-Scale Tool Sandbox
ComplexMCP benchmark shows current LLM agents achieve at most 60% success on interdependent tool tasks versus 90% for humans, due to tool retrieval saturation, over-confidence, and strategic defeatism.
-
Edit-Based Refinement for Parallel Masked Diffusion Language Models
ME-DLM augments parallel masked diffusion models with edit-distance-supervised refinements to raise quality on coding and math benchmarks while using far fewer diffusion steps.
-
Self-ReSET: Learning to Self-Recover from Unsafe Reasoning Trajectories
Self-ReSET is a reinforcement learning approach that lets large reasoning models learn to recover from their own unsafe reasoning trajectories, improving robustness to adversarial jailbreaks while preserving utility.
-
VeriContest: A Competitive-Programming Benchmark for Verifiable Code Generation
VeriContest supplies 946 problems with specs, code, proofs, and tests to benchmark verifiable code generation in Rust/Verus, showing models reach 92% on code but only 5% end-to-end on full verifiable synthesis.
-
WebTrap: Stealthy Mid-Task Hijacking of Browser Agents During Navigation
WebTrap uses multi-step instruction fusion and context-grounded generation to stealthily hijack browser agents mid-navigation while preserving original task success.
-
HTPO: Towards Exploration-Exploitation Balanced Policy Optimization via Hierarchical Token-level Objective Control
HTPO introduces hierarchical token-level objective control in RLVR to balance exploration and exploitation by grouping tokens according to difficulty, correctness, and entropy, yielding up to 8.6% gains on AIME benchm...
-
Hard to Read, Easy to Jailbreak: How Visual Degradation Bypasses MLLM Safety Alignment
Degraded image resolution in MLLMs bypasses safety alignments via cognitive overload, raising jailbreak rates across perturbations.
-
Optimizer-Model Consistency: Full Finetuning with the Same Optimizer as Pretraining Forgets Less
Full finetuning with the pretraining optimizer reduces forgetting compared to other optimizers or LoRA while achieving comparable new-task performance.
-
Teaching Thinking Models to Reason with Tools: A Full-Pipeline Recipe for Tool-Integrated Reasoning
A training recipe for tool-integrated reasoning models achieves state-of-the-art open-source results on math benchmarks such as 96.7% and 99.2% on AIME 2025 at 4B and 30B scales by balancing tool-use trajectories and ...
-
AlbumFill: Album-Guided Reasoning and Retrieval for Personalized Image Completion
AlbumFill retrieves identity-consistent references from personal albums via VLM-inferred semantic cues to support personalized image completion.
-
AMMA: A Multi-Chiplet Memory-Centric Architecture for Low-Latency 1M Context Attention Serving
AMMA is a memory-centric multi-chiplet architecture using HBM-PNM cubes, custom logic dies, hybrid parallelism, and reordered collectives that delivers 15.5X lower attention latency and 6.9X lower energy than NVIDIA H...
-
MAIC-UI: Making Interactive Courseware with Generative UI
MAIC-UI provides a zero-code authoring system for generating and iteratively editing interactive courseware from educational materials via structured analysis and incremental generation, with lab and classroom evaluat...
-
QuantClaw: Precision Where It Matters for OpenClaw
QuantClaw dynamically routes precision in agent workflows to cut cost by up to 21.4% and latency by 15.7% while keeping or improving task performance.
-
In-Situ Behavioral Evaluation for LLM Fairness, Not Standardized-Test Scores
Standardized-test benchmarks for LLM fairness are unreliable because prompt wording alone drives most score variance and ranking changes, while a multi-agent conversational framework reveals consistent model-specific ...
-
DR-Venus: Towards Frontier Edge-Scale Deep Research Agents with Only 10K Open Data
A 4B deep research agent trained on 10K open data outperforms prior agents under 9B parameters and narrows the gap to 30B-class systems on research benchmarks.
-
AeSlides: Incentivizing Aesthetic Layout in LLM-Based Slide Generation via Verifiable Rewards
AeSlides is a GRPO-based RL framework that uses verifiable aesthetic metrics to optimize LLM slide generation, achieving large gains in layout quality metrics and human scores with only 5K prompts.
-
ELMoE-3D: Leveraging Intrinsic Elasticity of MoE for Hybrid-Bonding-Enabled Self-Speculative Decoding in On-Premises Serving
ELMoE-3D achieves 6.6x average speedup and 4.4x energy efficiency gain for MoE serving on 3D hardware by scaling expert and bit elasticity for elastic self-speculative decoding.
-
Towards Knowledgeable Deep Research: Framework and Benchmark
The paper introduces the KDR task, HKA multi-agent framework, and KDR-Bench to enable LLM agents to integrate structured knowledge into deep research reports, with experiments showing outperformance over prior agents.
-
ReAlign: Optimizing the Visual Document Retriever with Reasoning-Guided Fine-Grained Alignment
ReAlign improves visual document retrieval by training retrievers to match query-induced rankings with rankings derived from VLM-generated, region-focused descriptions of relevant page content.
-
Delay, Plateau, or Collapse: Evaluating the Impact of Systematic Verification Error on RLVR
Systematic false positives in verifiers can cause RLVR training to reach suboptimal plateaus or collapse, with outcomes driven by error patterns rather than overall error rate.
-
Learning to Retrieve from Agent Trajectories
Retrievers trained on agent trajectories via the LRAT framework improve evidence recall, task success, and efficiency in agentic search benchmarks.
-
OptiMat Alloys: a FAIR, living database of multi-principal element alloys enabled by a conversational agent
OptiMat Alloys is a conversational AI system that maintains a living FAIR database of multi-principal element alloy calculations and enables natural-language, on-demand computations with built-in uncertainty checks.
-
OGER: A Robust Offline-Guided Exploration Reward for Hybrid Reinforcement Learning
OGER adds an auxiliary exploration reward built from offline trajectories and model entropy to hybrid RL training, yielding gains on math reasoning benchmarks and out-of-domain generalization.
-
Learning from Less: Measuring the Effectiveness of RLVR in Low Data and Compute Regimes
Mixed-complexity procedural datasets provide up to 5x sample efficiency for RLVR on small models in low-data regimes, with low-to-high complexity generalization observed across counting, graph, and spatial tasks.
-
LongAct: Harnessing Intrinsic Activation Patterns for Long-Context Reinforcement Learning
LongAct uses saliency from high-magnitude activations to guide sparse weight updates in long-context RL, yielding about 8% gains on LongBench v2 across multiple algorithms.
-
Apriel-1.5-OpenReasoner: RL Post-Training for General-Purpose and Efficient Reasoning
Apriel-1.5-OpenReasoner uses RL post-training with adaptive sampling and difficulty-aware penalties to boost reasoning accuracy on AIME, GPQA, MMLU-Pro and LiveCodeBench while producing shorter traces and generalizing...
-
Yet Even Less Is Even Better For Agentic, Reasoning, and Coding LLMs
STITCH trains superior agentic coding and reasoning LLMs by using fewer high-quality trajectories filtered to keep only critical decision tokens, delivering up to 63% relative gains on SWE-bench Verified.
-
Same Voice, Different Lab: On the Homogenization of Frontier LLM Personalities
Frontier LLMs homogenize toward systematic and analytical personalities, suppressing emotional traits like remorseful or sycophantic, indicating an implicit consensus on optimal assistant behavior.
-
DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models
DeepSeek-V3.2 adds sparse attention, scaled RL post-training, and large-scale agentic data synthesis to reach GPT-5-level performance and gold medals in 2025 IMO and IOI with its high-compute variant.
-
UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning
UI-TARS-2 reaches 88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, and 73.3 on AndroidWorld while attaining 59.8 mean normalized score on a 15-game suite through multi-turn RL and scalable data generation.
-
Can Muon Fine-tune Adam-Pretrained Models?
Constraining fine-tuning updates with LoRA mitigates performance degradation when switching from Adam to Muon on pretrained models.
-
Safactory: A Scalable Agentic Infrastructure for Training Trustworthy Autonomous Intelligence
Safactory integrates three platforms for simulation, data management, and agent evolution to create a unified pipeline for training trustworthy autonomous AI.
-
Silicon Showdown: Performance, Efficiency, and Ecosystem Barriers in Consumer-Grade LLM Inference
Nvidia achieves 1.6x throughput with NVFP4 but hits a VRAM wall for 70B+ models, while Apple UMA enables linear scaling to 80B at 4-bit with up to 23x better energy efficiency.
Reference graph
Works this paper leans on
- [1]
-
[2]
C. An, Z. Xie, X. Li, L. Li, J. Zhang, S. Gong, M. Zhong, J. Xu, X. Qiu, M. Wang, and L. Kong. Polaris: A post-training recipe for scaling reinforcement learning on advanced reasoning models, 2025
work page 2025
-
[3]
Y. Bai, X. Lv, J. Zhang, H. Lyu, J. Tang, Z. Huang, Z. Du, X. Liu, A. Zeng, L. Hou, et al. Longbench: A bilingual, multitask benchmark for long context understanding. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3119–3137, 2024
work page 2024
-
[4]
Y. Bai, S. Tu, J. Zhang, H. Peng, X. Wang, X. Lv, S. Cao, J. Xu, L. Hou, Y. Dong, J. Tang, and J. Li. LongBench v2: Towards deeper understanding and reasoning on realistic long-context multitasks. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3639–3664, Vienna, Austria, July 202...
work page 2025
-
[5]
M. Bavarian, H. Jun, N. Tezak, J. Schulman, C. McLeavey, J. Tworek, and M. Chen. Efficient training of language models to fill in the middle, 2022
work page 2022
- [6]
-
[7]
A. Chen, A. Li, B. Gong, B. Jiang, B. Fei, B. Yang, B. Shan, C. Yu, C. Wang, C. Zhu, et al. Minimax-m1: Scaling test-time compute efficiently with lightning attention. arXiv preprint arXiv:2506.13585, 2025
work page 2025
-
[8]
M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. D. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021
work page 2021
- [9]
-
[10]
K. Deshpande, V. Sirdeshmukh, J. B. Mols, L. Jin, E.-Y. Hernandez-Cardona, D. Lee, J. Kritz, W. E. Primack, S. Yue, and C. Xing. Multichallenge: A realistic multi-turn conversation evaluation benchmark challenging to frontier llms. In Findings of the Association for Computational Linguistics: ACL 2025, pages 18632–18702, 2025
work page 2025
-
[11]
H. Ding, Z. Wang, G. Paolini, V. Kumar, A. Deoras, D. Roth, and S. Soatto. Fewer truncations improve language modeling. In Proceedings of the 41st International Conference on Machine Learning, pages 11030–11048, 2024
work page 2024
-
[12]
F. Gloeckle, B. Y. Idrissi, B. Rozière, D. Lopez-Paz, and G. Synnaeve. Better & faster large language models via multi-token prediction. arXiv preprint arXiv:2404.19737, 2024
-
[13]
D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025
work page 2025
-
[14]
D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt. Measuring mathematical problem solving with the math dataset. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2)
- [15]
- [16]
-
[17]
S. Hu, Y. Tu, X. Han, G. Cui, C. He, W. Zhao, X. Long, Z. Zheng, Y. Fang, Y. Huang, et al. Minicpm: Unveiling the potential of small language models with scalable training strategies. In First Conference on Language Modeling
-
[18]
A. Jaech, A. Kalai, A. Lerer, A. Richardson, A. El-Kishky, A. Low, A. Helyar, A. Madry, A. Beutel, A. Carney, et al. Openai o1 system card. arXiv preprint arXiv:2412.16720, 2024
work page 2024
-
[19]
N. Jain, K. Han, A. Gu, W.-D. Li, F. Yan, T. Zhang, S. Wang, A. Solar-Lezama, K. Sen, and I. Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. In The Thirteenth International Conference on Learning Representations
-
[20]
C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. Narasimhan. Swe-bench: Can language models resolve real-world github issues? arXiv preprint arXiv:2310.06770, 2023
work page 2023
- [21]
-
[22]
A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 427–431. Association for Computational Linguistics, April 2017
work page 2017
-
[23]
A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024
work page 2024
-
[24]
J. Liu, J. Su, X. Yao, Z. Jiang, G. Lai, Y. Du, Y. Qin, W. Xu, E. Lu, J. Yan, et al. Muon is scalable for llm training. arXiv preprint arXiv:2502.16982, 2025
work page 2025
-
[25]
M. Luo, S. Tan, J. Wong, X. Shi, W. Y. Tang, M. Roongta, C. Cai, J. Luo, L. E. Li, R. A. Popa, and I. Stoica. Deepscaler: Surpassing o1-preview with a 1.5b model by scaling rl. https://pretty-radio-b75.notion.site/DeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e2, 2025. Notion Blog
work page 2025
-
[26]
S. G. Patil, H. Mao, C. Cheng-Jie Ji, F. Yan, V. Suresh, I. Stoica, and J. E. Gonzalez. The berkeley function calling leaderboard (bfcl): From tool use to agentic evaluation of large language models. In Forty-second International Conference on Machine Learning, 2025
work page 2025
-
[27]
G. Penedo, H. Kydlíček, V. Sabolčec, B. Messmer, N. Foroutan, A. H. Kargaran, C. Raffel, M. Jaggi, L. von Werra, and T. Wolf. Fineweb2: One pipeline to scale them all–adapting pre-training data processing to every language. arXiv preprint arXiv:2506.20920, 2025
-
[28]
L. Phan, A. Gatti, Z. Han, N. Li, J. Hu, H. Zhang, C. B. C. Zhang, M. Shaaban, J. Ling, S. Shi, et al. Humanity’s last exam. arXiv preprint arXiv:2501.14249, 2025
work page 2025
-
[29]
Y. Qin, T. Zhang, Y. Shen, W. Luo, Y. Zhang, Y. Qiao, Z. Zhou, W. Zhang, B. Cui, et al. Sysbench: Can llms follow system message? In The Thirteenth International Conference on Learning Representations, 2024
work page 2024
-
[30]
D. Rein, B. L. Hou, A. C. Stickland, J. Petty, R. Y. Pang, J. Dirani, J. Michael, and S. R. Bowman. Gpqa: A graduate-level google-proof q&a benchmark. In First Conference on Language Modeling, 2024
work page 2024
-
[31]
Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, Y. Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024
work page 2024
- [32]
-
[33]
G. Team, R. Anil, S. Borgeaud, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023
work page 2023
-
[34]
K. Team, Y. Bai, Y. Bao, G. Chen, J. Chen, N. Chen, R. Chen, Y. Chen, Y. Chen, Y. Chen, et al. Kimi k2: Open agentic intelligence. arXiv preprint arXiv:2507.20534, 2025
work page 2025
-
[35]
T. T.-B. Team. Terminal-bench: A benchmark for ai agents in terminal environments, Apr 2025
work page 2025
-
[36]
M. Tian, L. Gao, S. Zhang, X. Chen, C. Fan, X. Guo, R. Haas, P. Ji, K. Krongchon, Y. Li, et al. Scicode: A research coding benchmark curated by scientists. Advances in Neural Information Processing Systems, 37:30624–30650, 2024
work page 2024
-
[37]
H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023
work page 2023
-
[38]
K. Vodrahalli, S. Ontanon, N. Tripuraneni, K. Xu, S. Jain, R. Shivanna, J. Hui, N. Dikkala, M. Kazemi, B. Fatemi, R. Anil, E. Dyer, S. Shakeri, R. Vij, H. Mehta, V. Ramasesh, Q. Le, E. Chi, Y. Lu, O. Firat, A. Lazaridou, J.-B. Lespiau, N. Attaluri, and K. Olszewska. Michelangelo: Long context evaluations beyond haystacks via latent structure queries, 2024
work page 2024
- [39]
- [40]
-
[41]
S. Wang, L. Yu, C. Gao, C. Zheng, S. Liu, R. Lu, K. Dang, X. Chen, J. Yang, Z. Zhang, et al. Beyond the 80/20 rule: High-entropy minority tokens drive effective reinforcement learning for llm reasoning. arXiv preprint arXiv:2506.01939, 2025
work page 2025
-
[42]
X. Wang, B. Li, Y. Song, F. F. Xu, X. Tang, M. Zhuge, J. Pan, Y. Song, B. Li, J. Singh, H. H. Tran, F. Li, R. Ma, M. Zheng, B. Qian, Y. Shao, N. Muennighoff, Y. Zhang, B. Hui, J. Lin, R. Brennan, H. Peng, H. Ji, and G. Neubig. Openhands: An open platform for AI software developers as generalist agents. In The Thirteenth International Conference on Lea...
work page 2025
-
[43]
Y. Wang, X. Ma, G. Zhang, Y. Ni, A. Chandra, S. Guo, W. Ren, A. Arulraj, X. He, Z. Jiang, et al. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark. Advances in Neural Information Processing Systems, 37:95266–95290, 2024
work page 2024
-
[44]
J. Wei, N. Karina, H. W. Chung, Y. J. Jiao, S. Papay, A. Glaese, J. Schulman, and W. Fedus. Measuring short-form factuality in large language models, 2024
work page 2024
-
[45]
J. Wei, Z. Sun, S. Papay, S. McKinney, J. Han, I. Fulford, H. W. Chung, A. T. Passos, W. Fedus, and A. Glaese. Browsecomp: A simple yet challenging benchmark for browsing agents. arXiv preprint arXiv:2504.12516, 2025
work page 2025
- [46]
-
[47]
A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025
work page 2025
-
[48]
S. Yao, N. Shinn, P. Razavi, and K. Narasimhan. tau-bench: A benchmark for tool-agent-user interaction in real-world domains. arXiv preprint arXiv:2406.12045, 2024
work page 2024
-
[49]
Q. Yu, Z. Zhang, R. Zhu, Y. Yuan, X. Zuo, Y. Yue, W. Dai, T. Fan, G. Liu, L. Liu, et al. Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025
work page 2025
-
[50]
A. Zeng, X. Liu, Z. Du, Z. Wang, H. Lai, M. Ding, Z. Yang, Y. Xu, W. Zheng, X. Xia, et al. Glm-130b: An open bilingual pre-trained model. In The Eleventh International Conference on Learning Representations
-
[51]
Z. Zhang, L. Lei, L. Wu, R. Sun, Y. Huang, C. Long, X. Liu, X. Lei, J. Tang, and M. Huang. Safetybench: Evaluating the safety of large language models with multiple choice questions. arXiv preprint arXiv:2309.07045, 2023
-
[52]
J. Zhou, T. Lu, S. Mishra, S. Brahma, S. Basu, Y. Luan, D. Zhou, and L. Hou. Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911, 2023
work page 2023
discussion (0)