Recognition: 2 theorem links
Qwen2.5-1M Technical Report
Pith reviewed 2026-05-15 05:21 UTC · model grok-4.3
The pith
Qwen2.5-1M models reach a context length of 1 million tokens while outperforming GPT-4o-mini on long-context tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Through long-context pre-training with synthesized data and progressive training stages, the Qwen2.5-1M models handle contexts of 1 million tokens effectively. The Qwen2.5-14B-Instruct-1M model significantly outperforms GPT-4o-mini on long-context tasks and supports contexts eight times longer than the prior 128K version, with no loss in short-context scenarios.
What carries the argument
Long data synthesis and progressive pre-training, paired with a sparse-attention inference framework whose length-extrapolation method extends the context window by at least four times without further training.
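To make the progressive-training idea concrete, here is a minimal sketch of what a staged context-length schedule could look like; the stage lengths, RoPE base values, and token budgets below are illustrative assumptions, not the report's actual hyperparameters.

```python
# Hypothetical progressive long-context training schedule (all values are placeholders).
# Each stage raises the maximum sequence length and the RoPE base so that rotary
# frequencies cover the longer range; a caller-supplied trainer hook consumes each stage.
from dataclasses import dataclass

@dataclass
class Stage:
    max_seq_len: int    # context length trained at this stage
    rope_base: float    # RoPE theta used for this stage
    token_budget: int   # training tokens spent at this stage

SCHEDULE = [
    Stage(max_seq_len=32_768, rope_base=1e6, token_budget=20_000_000_000),
    Stage(max_seq_len=262_144, rope_base=1e7, token_budget=20_000_000_000),
]

def run_schedule(train_stage):
    """Drive training stage by stage; `train_stage` is the caller's trainer hook."""
    for s in SCHEDULE:
        train_stage(s.max_seq_len, s.rope_base, s.token_budget)

if __name__ == "__main__":
    run_schedule(lambda n, base, budget: print(f"train at {n:,} tokens (rope_base={base:g}, budget={budget:,})"))
```

The point of such a schedule is that only a fraction of the total token budget is spent at the longest length, which is where the training-cost savings of progressive staging come from.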
If this is right
- Long-context applications such as full-book reasoning become feasible at open-source scale with lower compute.
- Inference costs drop through 3x-7x prefill speedups and sparse attention for 1M-token inputs.
- The length extrapolation method allows users to push context beyond 1M tokens without retraining.
- Short-context performance stays intact, so existing applications can adopt the 1M models without regression.
Where Pith is reading between the lines
- The synthesis and extrapolation techniques could be applied to other base models to test whether the gains transfer beyond the Qwen family.
- Real-world deployment in domains like code repositories or scientific literature would reveal whether the reported speedups hold under irregular token distributions.
- Energy consumption for long-context workloads may decrease enough to make sustained million-token sessions viable on consumer hardware.
Load-bearing premise
The long data synthesis and progressive pre-training steps create genuine generalization to new long sequences rather than overfitting to the synthetic training data.
What would settle it
Measure accuracy on a held-out benchmark of real-world long documents such as full-length novels or legal contracts that were never used in the synthesis or training pipeline, and compare directly against GPT-4o-mini.
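A minimal sketch of such a head-to-head check, assuming a caller-supplied generate(model, document, question) hook, an exact-match scorer, and a hypothetical held-out JSONL file; none of these names come from the paper.

```python
# Hypothetical held-out evaluation: same questions, two models, exact-match scoring.
import json

def exact_match(prediction: str, answer: str) -> bool:
    return prediction.strip().lower() == answer.strip().lower()

def accuracy(model_name, examples, generate):
    """`generate` is a caller-supplied hook: (model_name, document, question) -> answer string."""
    hits = sum(
        exact_match(generate(model_name, ex["document"], ex["question"]), ex["answer"])
        for ex in examples
    )
    return hits / len(examples)

# examples = [json.loads(line) for line in open("held_out_long_docs.jsonl")]  # novels, contracts
# print(accuracy("Qwen2.5-14B-Instruct-1M", examples, generate))
# print(accuracy("gpt-4o-mini", examples, generate))
```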
read the original abstract
We introduce Qwen2.5-1M, a series of models that extend the context length to 1 million tokens. Compared to the previous 128K version, the Qwen2.5-1M series have significantly enhanced long-context capabilities through long-context pre-training and post-training. Key techniques such as long data synthesis, progressive pre-training, and multi-stage supervised fine-tuning are employed to effectively enhance long-context performance while reducing training costs. To promote the use of long-context models among a broader user base, we present and open-source our inference framework. This framework includes a length extrapolation method that can expand the model context lengths by at least four times, or even more, without additional training. To reduce inference costs, we implement a sparse attention method along with chunked prefill optimization for deployment scenarios and a sparsity refinement method to improve precision. Additionally, we detail our optimizations in the inference engine, including kernel optimization, pipeline parallelism, and scheduling optimization, which significantly enhance overall inference performance. By leveraging our inference framework, the Qwen2.5-1M models achieve a remarkable 3x to 7x prefill speedup in scenarios with 1 million tokens of context. This framework provides an efficient and powerful solution for developing applications that require long-context processing using open-source models. The Qwen2.5-1M series currently includes the open-source models Qwen2.5-7B-Instruct-1M and Qwen2.5-14B-Instruct-1M, as well as the API-accessed model Qwen2.5-Turbo. Evaluations show that Qwen2.5-1M models have been greatly improved in long-context tasks without compromising performance in short-context scenarios. Specifically, the Qwen2.5-14B-Instruct-1M model significantly outperforms GPT-4o-mini in long-context tasks and supports contexts eight times longer.
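Among the deployment optimizations the abstract names, chunked prefill is the easiest to sketch. The toy example below processes a prompt in fixed-size chunks, with each chunk attending to the keys and values cached from earlier chunks; it is single-head, dense (no sparsity), and purely illustrative rather than the paper's implementation.

```python
import numpy as np

def chunked_prefill(w_q, w_k, w_v, x, chunk=4):
    """Causal single-head attention over a prompt `x` (seq_len x d_model),
    computed chunk by chunk against a growing key/value cache."""
    d = w_q.shape[1]
    k_cache, v_cache, outputs = [], [], []
    for start in range(0, x.shape[0], chunk):
        block = x[start:start + chunk]
        q, k, v = block @ w_q, block @ w_k, block @ w_v
        k_cache.append(k)
        v_cache.append(v)
        keys = np.concatenate(k_cache)          # all keys seen so far
        values = np.concatenate(v_cache)
        scores = q @ keys.T / np.sqrt(d)
        # causal mask: a query at absolute position p may attend to positions <= p
        rows = np.arange(block.shape[0])[:, None] + start
        cols = np.arange(keys.shape[0])[None, :]
        scores = np.where(cols <= rows, scores, -np.inf)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        outputs.append(weights @ values)
    return np.concatenate(outputs)

rng = np.random.default_rng(0)
d_model = 16
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
prompt = rng.normal(size=(10, d_model))
print(chunked_prefill(w_q, w_k, w_v, prompt).shape)  # (10, 16)
```

Chunking keeps per-step activation memory bounded by the chunk size while the KV cache grows with the prompt, which is what makes million-token prefill schedulable alongside other requests.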
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. This technical report introduces the Qwen2.5-1M series of models extending context length to 1 million tokens via long data synthesis, progressive pre-training, and multi-stage SFT. It also presents an open-source inference framework with length extrapolation (at least 4x without training), sparse attention, chunked prefill, and kernel/pipeline optimizations yielding 3x-7x prefill speedups at 1M context. The central claim is that Qwen2.5-14B-Instruct-1M significantly outperforms GPT-4o-mini on long-context tasks while supporting 8x longer contexts and preserving short-context performance.
Significance. If the performance claims hold with proper controls, the work provides practical open-source long-context models and an efficient inference stack that could accelerate deployment of 1M-context applications. The combination of progressive training and sparsity refinements offers reusable techniques for scaling context length while controlling compute.
major comments (3)
- [Abstract] The claim that Qwen2.5-14B-Instruct-1M 'significantly outperforms GPT-4o-mini in long-context tasks' is unsupported by any numerical scores, benchmark names, error bars, or evaluation protocol details, yet it is load-bearing for the headline result.
- [Long data synthesis and progressive pre-training sections] No description or ablation explains how synthetic sequences are constructed to avoid overlap with evaluation benchmarks, and there is no comparison of performance on held-out long contexts versus training-distribution contexts; this leaves the generalization-versus-overfitting concern unaddressed.
- [Evaluations section] The manuscript provides no ablation tables isolating the contribution of each technique (data synthesis, progressive schedule, multi-stage SFT) and no quantitative comparison against the prior 128K Qwen2.5 baseline on the same long-context suite.
minor comments (2)
- [Abstract] The abstract states 'significantly enhanced long-context capabilities' without naming the specific long-context benchmarks used; adding one sentence listing the primary suites would improve clarity.
- [Inference framework description] The length-extrapolation method is described only at a high level; a short paragraph or equation showing how the extrapolation factor is achieved (e.g., via RoPE scaling or attention masking) would help readers replicate the 4x+ extension.
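To illustrate the kind of mechanism the referee alludes to, here is a minimal sketch of position-interpolation-style RoPE scaling, in which dividing positions by a factor (4 here) keeps rotary angles inside the range seen during training; this is a generic technique shown for orientation only, not necessarily the extrapolation method the report actually uses.

```python
import numpy as np

def rope_angles(positions, dim, base=10_000.0, scale=1.0):
    """Rotary angles per (position, frequency pair). A `scale` > 1 stretches positions
    (position interpolation), so a model trained at length L can address roughly
    scale * L positions without retraining."""
    freqs = base ** (-np.arange(0, dim, 2) / dim)           # inverse frequencies per pair
    return np.outer(np.asarray(positions) / scale, freqs)   # (num_positions, dim // 2)

def apply_rope(x, positions, scale=1.0):
    """Rotate feature pairs of `x` (num_positions x dim) by their RoPE angles."""
    ang = rope_angles(positions, x.shape[-1], scale=scale)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# With scale=4, position 400_000 is rotated by the same angles as position 100_000 at
# scale=1, so the attention layers never see angles beyond their training range.
x = np.random.default_rng(0).normal(size=(1, 64))
assert np.allclose(apply_rope(x, [400_000], scale=4.0), apply_rope(x, [100_000], scale=1.0))
```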
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our technical report. We address each major comment below and will incorporate revisions to improve clarity and completeness of the manuscript.
read point-by-point responses
- Referee: [Abstract] The claim that Qwen2.5-14B-Instruct-1M 'significantly outperforms GPT-4o-mini in long-context tasks' is unsupported by any numerical scores, benchmark names, error bars, or evaluation protocol details, yet it is load-bearing for the headline result.
  Authors: We agree that the abstract would be strengthened by concrete supporting evidence. In the revised version, we will add specific benchmark names (e.g., LongBench, RULER), key numerical scores comparing Qwen2.5-14B-Instruct-1M to GPT-4o-mini, and a brief description of the evaluation protocol. We report mean performance across tasks, as is standard for these reports; error bars from multiple runs are not available in the current experiments but can be noted as a limitation if space permits. Revision: yes.
- Referee: [Long data synthesis and progressive pre-training sections] No description or ablation explains how synthetic sequences are constructed to avoid overlap with evaluation benchmarks, and there is no comparison of performance on held-out long contexts versus training-distribution contexts; this leaves the generalization-versus-overfitting concern unaddressed.
  Authors: We will expand these sections to describe the synthetic data construction pipeline, including explicit deduplication and filtering steps against known evaluation benchmarks to minimize overlap. We will also add available comparisons of model performance on held-out long-context examples versus in-distribution contexts from our internal validation sets. Full-scale held-out ablations were not part of the original experimental design, but we will include the strongest available evidence and note any remaining limitations. Revision: partial.
- Referee: [Evaluations section] The manuscript provides no ablation tables isolating the contribution of each technique (data synthesis, progressive schedule, multi-stage SFT) and no quantitative comparison against the prior 128K Qwen2.5 baseline on the same long-context suite.
  Authors: We acknowledge that the current manuscript lacks explicit ablation tables. In the revision, we will add tables isolating the contributions of long data synthesis, the progressive pre-training schedule, and multi-stage SFT. We will also include direct side-by-side quantitative results on the same long-context benchmarks for the new 1M models versus the prior 128K Qwen2.5 baseline to quantify the incremental gains. Revision: yes.
Circularity Check
No circularity: empirical engineering report with independent results
full rationale
The paper describes concrete training procedures (long data synthesis, progressive pre-training, multi-stage SFT) and inference optimizations (length extrapolation, sparse attention, chunked prefill) that are applied to produce the Qwen2.5-1M models. Performance claims are presented strictly as measured outcomes on benchmarks, with no mathematical derivations, fitted parameters renamed as predictions, or self-citations carrying the central argument. The reported gains on long-context tasks versus GPT-4o-mini are external empirical comparisons, not consequences of the training pipeline by construction. The work is evaluated against external benchmarks and contains no load-bearing self-referential steps.
Axiom & Free-Parameter Ledger
free parameters (2)
- progressive context length schedule
- sparsity threshold (see the sketch after this ledger)
axioms (1)
- Domain assumption: Standard transformer attention and feed-forward layers remain stable under progressive length extension.
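As a rough illustration of how the sparsity-threshold free parameter could enter an inference stack, the sketch below keeps only the key/value blocks whose pooled attention mass for a given query block exceeds a threshold; the 0.05 value and the block-pooling scheme are assumptions for illustration, not settings taken from the report.

```python
import numpy as np

def select_blocks(block_scores, threshold=0.05):
    """Keep key/value blocks whose share of pooled attention mass for a query block
    is at least `threshold`; everything else is skipped during sparse prefill.
    Always retains at least the single strongest block."""
    mass = block_scores / block_scores.sum()
    keep = np.nonzero(mass >= threshold)[0]
    return keep if keep.size else np.array([int(mass.argmax())])

pooled = np.array([0.50, 0.30, 0.01, 0.02, 0.17])  # query-block x key-block affinities, pooled
print(select_blocks(pooled))  # -> [0 1 4]
```

A higher threshold prunes more aggressively and speeds up prefill, at the risk of dropping blocks that a full-attention pass would have weighted; that trade-off is why the threshold sits in the free-parameter ledger.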
Lean theorems connected to this paper
- IndisputableMonolith.Foundation.DimensionForcing.dimension_forced (tagged unclear)
  The relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "Key techniques such as long data synthesis, progressive pre-training, and multi-stage supervised fine-tuning are employed to effectively enhance long-context performance while reducing training costs."
- IndisputableMonolith.Foundation.HierarchyEmergence.hierarchy_emergence_forces_phi (tagged unclear)
  The relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "the Qwen2.5-14B-Instruct-1M model significantly outperforms GPT-4o-mini in long-context tasks and supports contexts eight times longer."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 21 Pith papers
- The Art of the Jailbreak: Formulating Jailbreak Attacks for LLM Security Beyond Binary Scoring
  A 114k compositional jailbreak dataset is created, generators are fine-tuned for on-the-fly synthesis, and OPTIMUS introduces a continuous evaluator that identifies stealth-optimal regimes missed by binary attack succ...
- Dooly: Configuration-Agnostic, Redundancy-Aware Profiling for LLM Inference Simulation
  Dooly reduces LLM inference profiling costs by 56.4% via configuration-agnostic taint-based labeling and selective database reuse, delivering simulation accuracy within 5% MAPE for TTFT and 8% for TPOT across 12 models.
- Relay Buffer Independent Communication over Pooled HBM for Efficient MoE Inference on Ascend
  A buffer-free MoE dispatch and combine method on Ascend hardware with pooled HBM cuts intermediate relay overhead via direct expert window access.
- Relay Buffer Independent Communication over Pooled HBM for Efficient MoE Inference on Ascend
  A relay-buffer-free MoE communication scheme on Ascend uses pooled HBM for direct expert-window placement and reading, cutting dispatch and combine latency in prefill and decode phases.
- Neuron-Anchored Rule Extraction for Large Language Models via Contrastive Hierarchical Ablation
  MechaRule localizes agonist neurons in LLMs via contrastive hierarchical ablation to ground rule extraction in circuitry, recalling 96.8% of high-effect neurons and reducing task performance when suppressed.
- LiveFMBench: Unveiling the Power and Limits of Agentic Workflows in Specification Generation
  LiveFMBench shows that direct LLM prompting for C program formal specs overestimates accuracy by ~20% due to unfaithful behaviors like deceiving provers, while agentic workflows help under low sampling but overall per...
- Thinking Like a Botanist: Challenging Multimodal Language Models with Intent-Driven Chain-of-Inquiry
  PlantInquiryVQA shows multimodal LLMs describe plant symptoms but struggle with clinical reasoning and diagnosis, with structured Chain of Inquiry improving correctness and reducing hallucinations.
- Towards Real-world Human Behavior Simulation: Benchmarking Large Language Models on Long-horizon, Cross-scenario, Heterogeneous Behavior Traces
  OmniBehavior benchmark demonstrates that LLMs simulating real human behavior converge on hyper-active positive average personas, losing long-tail individual differences.
- IntervenSim: Intervention-Aware Social Network Simulation for Opinion Dynamics
  IntervenSim is an intervention-aware social network simulation that couples source interventions with crowd interactions in a feedback loop, improving MAPE by 41.6% and DTW by 66.9% over prior static frameworks on rea...
- Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context
  Continued pre-training with balanced long-document VQA data extends a 7B LVLM to 128K context, improving long-document VQA by 7.1% and generalizing to 512K without further training.
- Nectar: Neural Estimation of Cached-Token Attention via Regression
  Nectar fits small per-layer per-head neural networks via regression to predict attention outputs and normalizers, enabling constant-time inference independent of context length while preserving semantic generation quality.
- ModelLens: Finding the Best for Your Task from Myriads of Models
  ModelLens learns a performance-aware latent space from 1.62M leaderboard records to rank unseen models on unseen datasets without forward passes on the target.
- UniPrefill: Universal Long-Context Prefill Acceleration via Block-wise Dynamic Sparsification
  UniPrefill accelerates LLM prefill via block-wise dynamic sparsification, achieving up to 2.1x TTFT speedup while supporting hybrid architectures and native vLLM continuous batching.
- Birds of a Feather Cluster Nearby: a Proximity-Aware Geo-Codebook for Local Service Recommendation
  Pro-GEO introduces a geo-centroid coordinate system and geo-rotary position encoding to model geographic proximity as rotational transformations, enabling balanced semantic-spatial modeling in local service recommendations.
- MGDA-Decoupled: Geometry-Aware Multi-Objective Optimisation for DPO-based LLM Alignment
  MGDA-Decoupled applies geometry-based multi-objective optimization within the DPO framework to find shared descent directions that account for each objective's convergence dynamics, yielding higher win rates on UltraFeedback.
- MemExplorer: Navigating the Heterogeneous Memory Design Space for Agentic Inference NPUs
  MemExplorer optimizes heterogeneous memory systems for agentic LLM inference on NPUs and reports up to 2.3x higher energy efficiency than baselines under fixed power budgets.
- Faithfulness Serum: Mitigating the Faithfulness Gap in Textual Explanations of LLM Decisions via Attribution Guidance
  A training-free method improves epistemic faithfulness of LLM textual explanations by guiding generation with attribution-based attention interventions.
- Noise-Aware In-Context Learning for Hallucination Mitigation in ALLMs
  NAICL reduces hallucination rates in ALLMs from 26.53% to 16.98% via noise priors in context and introduces the Clotho-1K benchmark with four hallucination types.
- MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent
  MemAgent uses multi-conversation RL to train a memory agent that reads text in segments and overwrites memory, extrapolating from 8K training to 3.5M token QA with under 5% loss and 95%+ on 512K RULER.
- Optimized Deferral for Imbalanced Settings
  MILD reformulates two-stage learning to defer as cost-sensitive learning over the input-expert domain and derives new margin-based losses with guarantees, yielding better performance than baselines on image classifica...
- Network Edge Inference for Large Language Models: Principles, Techniques, and Opportunities
  A survey synthesizing challenges, system architectures, model optimizations, deployment methods, and resource management techniques for large language model inference at the network edge.