Recognition: 2 theorem links
Qwen2.5-1M Technical Report
Pith reviewed 2026-05-15 05:21 UTC · model grok-4.3
The pith
Qwen2.5-1M models reach a context length of 1 million tokens while outperforming GPT-4o-mini on long-context tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Through long-context pre-training with synthesized data and progressive training stages, the Qwen2.5-1M models handle contexts of 1 million tokens effectively. The Qwen2.5-14B-Instruct-1M model significantly outperforms GPT-4o-mini on long-context tasks and supports contexts eight times longer than the prior 128K version, with no loss in short-context scenarios.
What carries the argument
Long data synthesis and progressive pre-training, paired with a sparse-attention inference framework whose length-extrapolation method extends the context window by at least four times without further training.
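To make the progressive-training idea concrete, here is a minimal sketch of what a staged context-length schedule could look like; the stage lengths, RoPE base values, and token budgets below are illustrative assumptions, not the report's actual hyperparameters.

```python
# Hypothetical progressive long-context training schedule (all values are placeholders).
# Each stage raises the maximum sequence length and the RoPE base so that rotary
# frequencies cover the longer range; a caller-supplied trainer hook consumes each stage.
from dataclasses import dataclass

@dataclass
class Stage:
    max_seq_len: int    # context length trained at this stage
    rope_base: float    # RoPE theta used for this stage
    token_budget: int   # training tokens spent at this stage

SCHEDULE = [
    Stage(max_seq_len=32_768, rope_base=1e6, token_budget=20_000_000_000),
    Stage(max_seq_len=262_144, rope_base=1e7, token_budget=20_000_000_000),
]

def run_schedule(train_stage):
    """Drive training stage by stage; `train_stage` is the caller's trainer hook."""
    for s in SCHEDULE:
        train_stage(s.max_seq_len, s.rope_base, s.token_budget)

if __name__ == "__main__":
    run_schedule(lambda n, base, budget: print(f"train at {n:,} tokens (rope_base={base:g}, budget={budget:,})"))
```

The point of such a schedule is that only a fraction of the total token budget is spent at the longest length, which is where the training-cost savings of progressive staging come from.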
If this is right
- Long-context applications such as full-book reasoning become feasible at open-source scale with lower compute.
- Inference costs drop through 3x-7x prefill speedups and sparse attention for 1M-token inputs.
- The length extrapolation method allows users to push context beyond 1M tokens without retraining.
- Short-context performance stays intact, so existing applications can adopt the 1M models without regression.
Where Pith is reading between the lines
- The synthesis and extrapolation techniques could be applied to other base models to test whether the gains transfer beyond the Qwen family.
- Real-world deployment in domains like code repositories or scientific literature would reveal whether the reported speedups hold under irregular token distributions.
- Energy consumption for long-context workloads may decrease enough to make sustained million-token sessions viable on consumer hardware.
Load-bearing premise
The long data synthesis and progressive pre-training steps create genuine generalization to new long sequences rather than overfitting to the synthetic training data.
What would settle it
Measure accuracy on a held-out benchmark of real-world long documents such as full-length novels or legal contracts that were never used in the synthesis or training pipeline, and compare directly against GPT-4o-mini.
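A minimal sketch of such a head-to-head check, assuming a caller-supplied generate(model, document, question) hook, an exact-match scorer, and a hypothetical held-out JSONL file; none of these names come from the paper.

```python
# Hypothetical held-out evaluation: same questions, two models, exact-match scoring.
import json

def exact_match(prediction: str, answer: str) -> bool:
    return prediction.strip().lower() == answer.strip().lower()

def accuracy(model_name, examples, generate):
    """`generate` is a caller-supplied hook: (model_name, document, question) -> answer string."""
    hits = sum(
        exact_match(generate(model_name, ex["document"], ex["question"]), ex["answer"])
        for ex in examples
    )
    return hits / len(examples)

# examples = [json.loads(line) for line in open("held_out_long_docs.jsonl")]  # novels, contracts
# print(accuracy("Qwen2.5-14B-Instruct-1M", examples, generate))
# print(accuracy("gpt-4o-mini", examples, generate))
```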
read the original abstract
We introduce Qwen2.5-1M, a series of models that extend the context length to 1 million tokens. Compared to the previous 128K version, the Qwen2.5-1M series have significantly enhanced long-context capabilities through long-context pre-training and post-training. Key techniques such as long data synthesis, progressive pre-training, and multi-stage supervised fine-tuning are employed to effectively enhance long-context performance while reducing training costs. To promote the use of long-context models among a broader user base, we present and open-source our inference framework. This framework includes a length extrapolation method that can expand the model context lengths by at least four times, or even more, without additional training. To reduce inference costs, we implement a sparse attention method along with chunked prefill optimization for deployment scenarios and a sparsity refinement method to improve precision. Additionally, we detail our optimizations in the inference engine, including kernel optimization, pipeline parallelism, and scheduling optimization, which significantly enhance overall inference performance. By leveraging our inference framework, the Qwen2.5-1M models achieve a remarkable 3x to 7x prefill speedup in scenarios with 1 million tokens of context. This framework provides an efficient and powerful solution for developing applications that require long-context processing using open-source models. The Qwen2.5-1M series currently includes the open-source models Qwen2.5-7B-Instruct-1M and Qwen2.5-14B-Instruct-1M, as well as the API-accessed model Qwen2.5-Turbo. Evaluations show that Qwen2.5-1M models have been greatly improved in long-context tasks without compromising performance in short-context scenarios. Specifically, the Qwen2.5-14B-Instruct-1M model significantly outperforms GPT-4o-mini in long-context tasks and supports contexts eight times longer.
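Among the deployment optimizations the abstract names, chunked prefill is the easiest to sketch. The toy example below processes a prompt in fixed-size chunks, with each chunk attending to the keys and values cached from earlier chunks; it is single-head, dense (no sparsity), and purely illustrative rather than the paper's implementation.

```python
import numpy as np

def chunked_prefill(w_q, w_k, w_v, x, chunk=4):
    """Causal single-head attention over a prompt `x` (seq_len x d_model),
    computed chunk by chunk against a growing key/value cache."""
    d = w_q.shape[1]
    k_cache, v_cache, outputs = [], [], []
    for start in range(0, x.shape[0], chunk):
        block = x[start:start + chunk]
        q, k, v = block @ w_q, block @ w_k, block @ w_v
        k_cache.append(k)
        v_cache.append(v)
        keys = np.concatenate(k_cache)          # all keys seen so far
        values = np.concatenate(v_cache)
        scores = q @ keys.T / np.sqrt(d)
        # causal mask: a query at absolute position p may attend to positions <= p
        rows = np.arange(block.shape[0])[:, None] + start
        cols = np.arange(keys.shape[0])[None, :]
        scores = np.where(cols <= rows, scores, -np.inf)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        outputs.append(weights @ values)
    return np.concatenate(outputs)

rng = np.random.default_rng(0)
d_model = 16
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
prompt = rng.normal(size=(10, d_model))
print(chunked_prefill(w_q, w_k, w_v, prompt).shape)  # (10, 16)
```

Chunking keeps per-step activation memory bounded by the chunk size while the KV cache grows with the prompt, which is what makes million-token prefill schedulable alongside other requests.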
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. This technical report introduces the Qwen2.5-1M series of models extending context length to 1 million tokens via long data synthesis, progressive pre-training, and multi-stage SFT. It also presents an open-source inference framework with length extrapolation (at least 4x without training), sparse attention, chunked prefill, and kernel/pipeline optimizations yielding 3x-7x prefill speedups at 1M context. The central claim is that Qwen2.5-14B-Instruct-1M significantly outperforms GPT-4o-mini on long-context tasks while supporting 8x longer contexts and preserving short-context performance.
Significance. If the performance claims hold with proper controls, the work provides practical open-source long-context models and an efficient inference stack that could accelerate deployment of 1M-context applications. The combination of progressive training and sparsity refinements offers reusable techniques for scaling context length while controlling compute.
major comments (3)
- [Abstract] The claim that Qwen2.5-14B-Instruct-1M 'significantly outperforms GPT-4o-mini in long-context tasks' is unsupported by any numerical scores, benchmark names, error bars, or evaluation protocol details, yet it is load-bearing for the headline result.
- [Long data synthesis and progressive pre-training sections] No description or ablation explains how synthetic sequences are constructed to avoid overlap with evaluation benchmarks, and there is no comparison of performance on held-out long contexts versus training-distribution contexts; this leaves the generalization-versus-overfitting concern unaddressed.
- [Evaluations section] The manuscript provides no ablation tables isolating the contribution of each technique (data synthesis, progressive schedule, multi-stage SFT) and no quantitative comparison against the prior 128K Qwen2.5 baseline on the same long-context suite.
minor comments (2)
- [Abstract] The abstract states 'significantly enhanced long-context capabilities' without naming the specific long-context benchmarks used; adding one sentence listing the primary suites would improve clarity.
- [Inference framework description] The length-extrapolation method is described only at a high level; a short paragraph or equation showing how the extrapolation factor is achieved (e.g., via RoPE scaling or attention masking) would help readers replicate the 4x+ extension.
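To illustrate the kind of mechanism the referee alludes to, here is a minimal sketch of position-interpolation-style RoPE scaling, in which dividing positions by a factor (4 here) keeps rotary angles inside the range seen during training; this is a generic technique shown for orientation only, not necessarily the extrapolation method the report actually uses.

```python
import numpy as np

def rope_angles(positions, dim, base=10_000.0, scale=1.0):
    """Rotary angles per (position, frequency pair). A `scale` > 1 stretches positions
    (position interpolation), so a model trained at length L can address roughly
    scale * L positions without retraining."""
    freqs = base ** (-np.arange(0, dim, 2) / dim)           # inverse frequencies per pair
    return np.outer(np.asarray(positions) / scale, freqs)   # (num_positions, dim // 2)

def apply_rope(x, positions, scale=1.0):
    """Rotate feature pairs of `x` (num_positions x dim) by their RoPE angles."""
    ang = rope_angles(positions, x.shape[-1], scale=scale)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# With scale=4, position 400_000 is rotated by the same angles as position 100_000 at
# scale=1, so the attention layers never see angles beyond their training range.
x = np.random.default_rng(0).normal(size=(1, 64))
assert np.allclose(apply_rope(x, [400_000], scale=4.0), apply_rope(x, [100_000], scale=1.0))
```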
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our technical report. We address each major comment below and will incorporate revisions to improve clarity and completeness of the manuscript.
read point-by-point responses
- Referee: [Abstract] The claim that Qwen2.5-14B-Instruct-1M 'significantly outperforms GPT-4o-mini in long-context tasks' is unsupported by any numerical scores, benchmark names, error bars, or evaluation protocol details, yet it is load-bearing for the headline result.
  Authors: We agree that the abstract would be strengthened by concrete supporting evidence. In the revised version, we will add specific benchmark names (e.g., LongBench, RULER), key numerical scores comparing Qwen2.5-14B-Instruct-1M to GPT-4o-mini, and a brief description of the evaluation protocol. We report mean performance across tasks, as is standard for these reports; error bars from multiple runs are not available in the current experiments but can be noted as a limitation if space permits. Revision: yes.
- Referee: [Long data synthesis and progressive pre-training sections] No description or ablation explains how synthetic sequences are constructed to avoid overlap with evaluation benchmarks, and there is no comparison of performance on held-out long contexts versus training-distribution contexts; this leaves the generalization-versus-overfitting concern unaddressed.
  Authors: We will expand these sections to describe the synthetic data construction pipeline, including explicit deduplication and filtering steps against known evaluation benchmarks to minimize overlap. We will also add available comparisons of model performance on held-out long-context examples versus in-distribution contexts from our internal validation sets. Full-scale held-out ablations were not part of the original experimental design, but we will include the strongest available evidence and note any remaining limitations. Revision: partial.
- Referee: [Evaluations section] The manuscript provides no ablation tables isolating the contribution of each technique (data synthesis, progressive schedule, multi-stage SFT) and no quantitative comparison against the prior 128K Qwen2.5 baseline on the same long-context suite.
  Authors: We acknowledge that the current manuscript lacks explicit ablation tables. In the revision, we will add tables isolating the contributions of long data synthesis, the progressive pre-training schedule, and multi-stage SFT. We will also include direct side-by-side quantitative results on the same long-context benchmarks for the new 1M models versus the prior 128K Qwen2.5 baseline to quantify the incremental gains. Revision: yes.
Circularity Check
No circularity: empirical engineering report with independent results
full rationale
The paper describes concrete training procedures (long data synthesis, progressive pre-training, multi-stage SFT) and inference optimizations (length extrapolation, sparse attention, chunked prefill) that are applied to produce the Qwen2.5-1M models. Performance claims are presented strictly as measured outcomes on benchmarks, with no mathematical derivations, fitted parameters renamed as predictions, or self-citations carrying the central argument. The reported gains on long-context tasks versus GPT-4o-mini are external empirical comparisons, not consequences of the training pipeline by construction. The work is evaluated against external benchmarks and contains no load-bearing self-referential steps.
Axiom & Free-Parameter Ledger
free parameters (2)
- progressive context length schedule
- sparsity threshold (see the sketch after this ledger)
axioms (1)
- Domain assumption: Standard transformer attention and feed-forward layers remain stable under progressive length extension.
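As a rough illustration of how the sparsity-threshold free parameter could enter an inference stack, the sketch below keeps only the key/value blocks whose pooled attention mass for a given query block exceeds a threshold; the 0.05 value and the block-pooling scheme are assumptions for illustration, not settings taken from the report.

```python
import numpy as np

def select_blocks(block_scores, threshold=0.05):
    """Keep key/value blocks whose share of pooled attention mass for a query block
    is at least `threshold`; everything else is skipped during sparse prefill.
    Always retains at least the single strongest block."""
    mass = block_scores / block_scores.sum()
    keep = np.nonzero(mass >= threshold)[0]
    return keep if keep.size else np.array([int(mass.argmax())])

pooled = np.array([0.50, 0.30, 0.01, 0.02, 0.17])  # query-block x key-block affinities, pooled
print(select_blocks(pooled))  # -> [0 1 4]
```

A higher threshold prunes more aggressively and speeds up prefill, at the risk of dropping blocks that a full-attention pass would have weighted; that trade-off is why the threshold sits in the free-parameter ledger.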
Lean theorems connected to this paper
- IndisputableMonolith.Foundation.DimensionForcing.dimension_forced (tagged unclear)
  The relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "Key techniques such as long data synthesis, progressive pre-training, and multi-stage supervised fine-tuning are employed to effectively enhance long-context performance while reducing training costs."
- IndisputableMonolith.Foundation.HierarchyEmergence.hierarchy_emergence_forces_phi (tagged unclear)
  The relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "the Qwen2.5-14B-Instruct-1M model significantly outperforms GPT-4o-mini in long-context tasks and supports contexts eight times longer."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 21 Pith papers
- The Art of the Jailbreak: Formulating Jailbreak Attacks for LLM Security Beyond Binary Scoring
  A 114k compositional jailbreak dataset is created, generators are fine-tuned for on-the-fly synthesis, and OPTIMUS introduces a continuous evaluator that identifies stealth-optimal regimes missed by binary attack succ...
- Dooly: Configuration-Agnostic, Redundancy-Aware Profiling for LLM Inference Simulation
  Dooly reduces LLM inference profiling costs by 56.4% via configuration-agnostic taint-based labeling and selective database reuse, delivering simulation accuracy within 5% MAPE for TTFT and 8% for TPOT across 12 models.
- Relay Buffer Independent Communication over Pooled HBM for Efficient MoE Inference on Ascend
  A buffer-free MoE dispatch and combine method on Ascend hardware with pooled HBM cuts intermediate relay overhead via direct expert window access.
- Relay Buffer Independent Communication over Pooled HBM for Efficient MoE Inference on Ascend
  A relay-buffer-free MoE communication scheme on Ascend uses pooled HBM for direct expert-window placement and reading, cutting dispatch and combine latency in prefill and decode phases.
- Neuron-Anchored Rule Extraction for Large Language Models via Contrastive Hierarchical Ablation
  MechaRule localizes agonist neurons in LLMs via contrastive hierarchical ablation to ground rule extraction in circuitry, recalling 96.8% of high-effect neurons and reducing task performance when suppressed.
- LiveFMBench: Unveiling the Power and Limits of Agentic Workflows in Specification Generation
  LiveFMBench shows that direct LLM prompting for C program formal specs overestimates accuracy by ~20% due to unfaithful behaviors like deceiving provers, while agentic workflows help under low sampling but overall per...
- Thinking Like a Botanist: Challenging Multimodal Language Models with Intent-Driven Chain-of-Inquiry
  PlantInquiryVQA shows multimodal LLMs describe plant symptoms but struggle with clinical reasoning and diagnosis, with structured Chain of Inquiry improving correctness and reducing hallucinations.
- Towards Real-world Human Behavior Simulation: Benchmarking Large Language Models on Long-horizon, Cross-scenario, Heterogeneous Behavior Traces
  OmniBehavior benchmark demonstrates that LLMs simulating real human behavior converge on hyper-active positive average personas, losing long-tail individual differences.
- IntervenSim: Intervention-Aware Social Network Simulation for Opinion Dynamics
  IntervenSim is an intervention-aware social network simulation that couples source interventions with crowd interactions in a feedback loop, improving MAPE by 41.6% and DTW by 66.9% over prior static frameworks on rea...
- Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context
  Continued pre-training with balanced long-document VQA data extends a 7B LVLM to 128K context, improving long-document VQA by 7.1% and generalizing to 512K without further training.
- Nectar: Neural Estimation of Cached-Token Attention via Regression
  Nectar fits small per-layer per-head neural networks via regression to predict attention outputs and normalizers, enabling constant-time inference independent of context length while preserving semantic generation quality.
- ModelLens: Finding the Best for Your Task from Myriads of Models
  ModelLens learns a performance-aware latent space from 1.62M leaderboard records to rank unseen models on unseen datasets without forward passes on the target.
- UniPrefill: Universal Long-Context Prefill Acceleration via Block-wise Dynamic Sparsification
  UniPrefill accelerates LLM prefill via block-wise dynamic sparsification, achieving up to 2.1x TTFT speedup while supporting hybrid architectures and native vLLM continuous batching.
- Birds of a Feather Cluster Nearby: a Proximity-Aware Geo-Codebook for Local Service Recommendation
  Pro-GEO introduces a geo-centroid coordinate system and geo-rotary position encoding to model geographic proximity as rotational transformations, enabling balanced semantic-spatial modeling in local service recommendations.
- MGDA-Decoupled: Geometry-Aware Multi-Objective Optimisation for DPO-based LLM Alignment
  MGDA-Decoupled applies geometry-based multi-objective optimization within the DPO framework to find shared descent directions that account for each objective's convergence dynamics, yielding higher win rates on UltraFeedback.
- MemExplorer: Navigating the Heterogeneous Memory Design Space for Agentic Inference NPUs
  MemExplorer optimizes heterogeneous memory systems for agentic LLM inference on NPUs and reports up to 2.3x higher energy efficiency than baselines under fixed power budgets.
- Faithfulness Serum: Mitigating the Faithfulness Gap in Textual Explanations of LLM Decisions via Attribution Guidance
  A training-free method improves epistemic faithfulness of LLM textual explanations by guiding generation with attribution-based attention interventions.
- Noise-Aware In-Context Learning for Hallucination Mitigation in ALLMs
  NAICL reduces hallucination rates in ALLMs from 26.53% to 16.98% via noise priors in context and introduces the Clotho-1K benchmark with four hallucination types.
- MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent
  MemAgent uses multi-conversation RL to train a memory agent that reads text in segments and overwrites memory, extrapolating from 8K training to 3.5M token QA with under 5% loss and 95%+ on 512K RULER.
- Optimized Deferral for Imbalanced Settings
  MILD reformulates two-stage learning to defer as cost-sensitive learning over the input-expert domain and derives new margin-based losses with guarantees, yielding better performance than baselines on image classifica...
- Network Edge Inference for Large Language Models: Principles, Techniques, and Opportunities
  A survey synthesizing challenges, system architectures, model optimizations, deployment methods, and resource management techniques for large language model inference at the network edge.