pith. machine review for the scientific record.

arxiv: 2403.17297 · v1 · submitted 2024-03-26 · 💻 cs.CL · cs.AI

Recognition: 2 theorem links · Lean Theorem

InternLM2 Technical Report

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 11:40 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords InternLM2 · large language model · open-source LLM · pre-training · COOL RLHF · long-context modeling · benchmark evaluation · model alignment

The pith

InternLM2 outperforms prior open-source LLMs on 30 benchmarks, long-context tasks up to 200k tokens, and subjective evaluations via staged pre-training and COOL RLHF alignment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces InternLM2 as an open-source large language model built to exceed earlier versions through careful pre-training on mixed text, code, and long-sequence data. Training begins at 4k token contexts and scales to 32k tokens, which supports strong results on extended dependency tests such as the 200k-token needle-in-a-haystack evaluation. Alignment proceeds with supervised fine-tuning followed by a new Conditional Online RLHF method designed to resolve conflicting human preferences and limit reward hacking. Models from multiple training stages and sizes are released to let others observe how capabilities develop. These steps matter for readers because they supply concrete, reproducible techniques that narrow the performance difference between open and closed models across broad capability measures.
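The 200k needle-in-a-haystack evaluation mentioned above is easy to make concrete. The sketch below builds test prompts with the needle placed at varying relative depths of the haystack; the filler and needle strings are placeholders, not the paper's actual test material:

```python
def build_niah_prompt(filler_tokens: list[str], needle: str,
                      depth: float) -> str:
    """Insert the needle at a relative depth in [0, 1] of the haystack."""
    assert 0.0 <= depth <= 1.0
    pos = int(depth * len(filler_tokens))
    toks = filler_tokens[:pos] + [needle] + filler_tokens[pos:]
    return " ".join(toks)

haystack = ["word"] * 1000  # stand-in for long filler text
needle = "THE-SECRET-IS-42"

# Sweep insertion positions across the context, the usual NIAH protocol.
prompts = [build_niah_prompt(haystack, needle, d)
           for d in (0.0, 0.25, 0.5, 0.75, 1.0)]
assert all(needle in p for p in prompts)
assert prompts[0].split().index(needle) == 0
assert prompts[-1].split().index(needle) == 1000
```

The model is then asked to retrieve the needle from each prompt; a depth-by-context-length grid of retrieval accuracies is the standard way the result is reported.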

Core claim

InternLM2 outperforms its predecessors across six evaluation dimensions and thirty benchmarks, exhibits effective long-context modeling after progressive training from 4k to 32k tokens, and improves open-ended subjective responses through supervised fine-tuning combined with Conditional Online Reinforcement Learning from Human Feedback that mitigates preference conflicts and reward hacking.

What carries the argument

The staged pre-training pipeline that scales context length while incorporating diverse text, code, and long-context data, paired with the Conditional Online RLHF alignment procedure.
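The report describes Conditional Online RLHF only at a high level here. One way to picture the "conditional" part, as a hedged sketch rather than the authors' implementation, is a single reward model that scores the same response differently depending on which preference condition is supplied:

```python
from typing import Callable

RewardFn = Callable[[str, str], float]

class ConditionalRewardModel:
    """Toy stand-in: one scorer selected by a condition tag, mirroring
    the idea that a single reward model can arbitrate conflicting
    preferences by conditioning on them rather than averaging them."""
    def __init__(self, scorers: dict[str, RewardFn]):
        self.scorers = scorers

    def score(self, condition: str, prompt: str, response: str) -> float:
        return self.scorers[condition](prompt, response)

# Illustrative scorers only: "helpful" rewards length, "safe" penalizes
# a blocked term. Real reward models are learned, not hand-written.
rm = ConditionalRewardModel({
    "helpful": lambda p, r: min(len(r) / 100, 1.0),
    "safe": lambda p, r: 0.0 if "forbidden" in r else 1.0,
})

assert rm.score("helpful", "q", "x" * 200) == 1.0
assert rm.score("safe", "q", "a forbidden thing") == 0.0
```

Under this picture, conflicting annotations (a long answer that is helpful but unsafe) are resolved by asking which condition is active, which is the preference-conflict problem the abstract says COOL RLHF addresses.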

Load-bearing premise

The chosen thirty benchmarks and subjective evaluations measure general capabilities fairly without selection bias or prompt sensitivity that would alter the reported rankings.

What would settle it

Re-running the same models on a different collection of thirty benchmarks or an alternative set of subjective prompts that produces a higher ranking for a predecessor model would falsify the outperformance claim.

original abstract

The evolution of Large Language Models (LLMs) like ChatGPT and GPT-4 has sparked discussions on the advent of Artificial General Intelligence (AGI). However, replicating such advancements in open-source models has been challenging. This paper introduces InternLM2, an open-source LLM that outperforms its predecessors in comprehensive evaluations across 6 dimensions and 30 benchmarks, long-context modeling, and open-ended subjective evaluations through innovative pre-training and optimization techniques. The pre-training process of InternLM2 is meticulously detailed, highlighting the preparation of diverse data types including text, code, and long-context data. InternLM2 efficiently captures long-term dependencies, initially trained on 4k tokens before advancing to 32k tokens in pre-training and fine-tuning stages, exhibiting remarkable performance on the 200k "Needle-in-a-Haystack" test. InternLM2 is further aligned using Supervised Fine-Tuning (SFT) and a novel Conditional Online Reinforcement Learning from Human Feedback (COOL RLHF) strategy that addresses conflicting human preferences and reward hacking. By releasing InternLM2 models in different training stages and model sizes, we provide the community with insights into the model's evolution.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces InternLM2, an open-source LLM claimed to outperform predecessors across 6 dimensions and 30 benchmarks, long-context modeling (including a 200k needle-in-a-haystack test after scaling from 4k to 32k context), and open-ended subjective evaluations. It details pre-training on diverse text/code/long-context data, SFT, and a novel COOL RLHF strategy to mitigate conflicting preferences and reward hacking, while releasing models at multiple training stages and sizes.

Significance. If the performance claims are robust, the work provides a valuable open-source model with demonstrated long-context capabilities and a new RLHF variant, offering community insights into training dynamics. The release of intermediate checkpoints strengthens reproducibility and allows external verification of the claimed innovations in pre-training and alignment.

major comments (2)
  1. [Abstract and §4 (Evaluations)] Abstract and evaluation sections: the central claim of outperformance on 30 benchmarks across 6 dimensions rests on reported scores without error bars, ablation tables, prompt-template details, or data-decontamination analysis. This makes it impossible to assess whether gains are stable or sensitive to evaluation choices.
  2. [Long-context modeling and Needle-in-a-Haystack] Long-context section: the 200k needle-in-a-haystack result is presented without controls for retrieval tricks or alternative prompt regimes, leaving open whether the reported capability stems from the claimed pre-training schedule or from test-specific artifacts.
minor comments (2)
  1. [Alignment section] Notation for COOL RLHF hyperparameters is introduced without an explicit equation or pseudocode listing the conditional reward formulation.
  2. [Pre-training] The data-mixture description would benefit from a table showing token counts per category (text, code, long-context) at each training stage.
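The missing-error-bars objection in major comment 1 is cheap to address in principle: per-example bootstrap resampling yields a confidence interval for any accuracy-style benchmark score. A generic sketch, not tied to the paper's evaluation code:

```python
import random

def bootstrap_ci(per_example: list[int], n_boot: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap CI for mean accuracy over 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(per_example)
    means = sorted(
        sum(rng.choices(per_example, k=n)) / n for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# 70% accuracy on 200 examples: the interval, not the point score,
# is what a "model A beats model B" ranking claim should be tested against.
outcomes = [1] * 140 + [0] * 60
lo, hi = bootstrap_ci(outcomes)
assert lo < 0.70 < hi
```

Overlapping intervals between two models on a 200-example benchmark would mean the headline ranking is within noise, which is exactly the stability question the referee raises.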

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the thorough review of our manuscript on InternLM2. We appreciate the feedback on the evaluation sections and have revised the paper to incorporate additional details and clarifications as outlined in our point-by-point responses below.

point-by-point responses
  1. Referee: [Abstract and §4 (Evaluations)] Abstract and evaluation sections: the central claim of outperformance on 30 benchmarks across 6 dimensions rests on reported scores without error bars, ablation tables, prompt-template details, or data-decontamination analysis. This makes it impossible to assess whether gains are stable or sensitive to evaluation choices.

    Authors: We agree that additional transparency would strengthen the presentation. In the revised manuscript, we have added error bars for the primary benchmarks (computed over multiple evaluation seeds where feasible), a summary table of the prompt templates employed, and a concise description of our data decontamination procedure (n-gram overlap filtering against standard evaluation corpora). Comprehensive ablation tables for all 30 benchmarks would expand the paper substantially; we have therefore included key ablations in an appendix and released the full evaluation scripts and prompts with the model checkpoints to enable independent verification. These changes directly address concerns about stability and sensitivity to evaluation choices. revision: partial

  2. Referee: [Long-context modeling and Needle-in-a-Haystack] Long-context section: the 200k needle-in-a-haystack result is presented without controls for retrieval tricks or alternative prompt regimes, leaving open whether the reported capability stems from the claimed pre-training schedule or from test-specific artifacts.

    Authors: The 200k needle-in-a-haystack evaluation follows the standard protocol established in prior work. In the revised version, we have added results across varied needle insertion positions and multiple prompt formulations. These controls show consistent retrieval accuracy, supporting that the observed capability derives from the progressive context-length scaling in pre-training (4k to 32k tokens) rather than test-specific artifacts. We have also expanded the description of the long-context data mixture and training schedule in Section 3 to further clarify the source of the improvement. revision: yes
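The decontamination step named in response 1 (n-gram overlap filtering against evaluation corpora) can be sketched generically; the window size and threshold below are illustrative, not the paper's settings:

```python
def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Word-level n-grams of a document."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_contaminated(train_doc: str, eval_docs: list[str],
                    n: int = 8, threshold: int = 1) -> bool:
    """Flag a training document that shares at least `threshold`
    n-grams with any evaluation document."""
    doc_grams = ngrams(train_doc, n)
    return any(len(doc_grams & ngrams(e, n)) >= threshold for e in eval_docs)

eval_set = ["the quick brown fox jumps over the lazy dog near the river"]
leak = "noise noise the quick brown fox jumps over the lazy dog noise"
clean = "completely unrelated sentence about training language models at scale"
assert is_contaminated(leak, eval_set, n=8)
assert not is_contaminated(clean, eval_set, n=8)
```

Production pipelines index the evaluation n-grams once (e.g. in a hash set or Bloom filter) rather than recomputing them per training document, but the filtering criterion is the same.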

Circularity Check

0 steps flagged

No significant circularity; claims rest on external benchmarks

full rationale

The paper reports empirical performance on 30 public benchmarks, long-context tests, and subjective evaluations without any mathematical derivations, equations, or predictions that reduce to self-defined quantities or fitted inputs by construction. Training procedures (pre-training data mixtures, context extension from 4k to 32k, SFT, and COOL RLHF) are described procedurally but contain no self-referential steps where a claimed result is equivalent to its own inputs. Any self-citations to prior InternLM work are not load-bearing for the central outperformance claims, which remain independently verifiable on external suites.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 1 invented entity

The central claims rest on the validity of standard LLM benchmarks, the assumption that reported training stages are faithfully reproduced by released weights, and the premise that COOL RLHF resolves reward hacking without introducing new unmeasured biases.

free parameters (2)
  • context length schedule
    4k to 32k token progression chosen during pre-training and fine-tuning
  • RLHF reward model hyperparameters
    Parameters of the conditional online RLHF procedure fitted to human preference data
axioms (2)
  • domain assumption: Public benchmarks measure general language capability without significant contamination or prompt sensitivity
    Invoked when claiming outperformance across 30 benchmarks
  • domain assumption: Released model checkpoints match the described training stages
    Required for any downstream reproduction or inspection
invented entities (1)
  • COOL RLHF (no independent evidence)
    purpose: Alignment strategy that addresses conflicting human preferences and reward hacking
    Newly introduced Conditional Online Reinforcement Learning from Human Feedback variant
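The ledger's "RLHF reward model hyperparameters" entry hides the usual knob RLHF turns against reward hacking: a KL-style penalty that pulls the policy back toward the SFT reference. A toy per-token objective, purely illustrative and not the paper's formulation:

```python
def rlhf_objective(reward: float, logp_policy: float, logp_ref: float,
                   kl_coef: float = 0.1) -> float:
    """Reward minus a KL-style penalty for drifting from the reference
    (SFT) policy -- the standard guard against reward hacking."""
    kl_term = logp_policy - logp_ref  # per-token log-ratio estimate
    return reward - kl_coef * kl_term

# Same raw reward, but the drifted policy is penalized.
on_ref = rlhf_objective(1.0, logp_policy=-2.0, logp_ref=-2.0)
drifted = rlhf_objective(1.0, logp_policy=-1.0, logp_ref=-2.0)
assert on_ref == 1.0
assert drifted < on_ref
```

Whether COOL RLHF relies on this mechanism, its conditional reward formulation, or both is exactly what the referee's request for an explicit equation or pseudocode would settle.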



Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. The Wittgensteinian Representation Hypothesis: Is Language the Attractor of Multimodal Convergence?

    cs.AI 2026-05 unverdicted novelty 7.0

    Language representations serve as the asymptotic attractor for convergence in independently trained multimodal neural networks due to feature density asymmetry.

  2. VTAgent: Agentic Keyframe Anchoring for Evidence-Aware Video TextVQA

    cs.CV 2026-05 unverdicted novelty 7.0

    VTAgent uses a question-guided agent to anchor keyframes for evidence-aware Video TextVQA, delivering up to +12 accuracy and new SOTA results via training-free operation plus SFT and RL.

  3. StoryAlign: Evaluating and Training Reward Models for Story Generation

    cs.CL 2026-05 unverdicted novelty 7.0

    StoryReward, trained on a new 100k story preference dataset, sets state-of-the-art performance on the introduced StoryRMB benchmark for aligning LLM stories with human preferences.

  4. CompliBench: Benchmarking LLM Judges for Compliance Violation Detection in Dialogue Systems

    cs.CL 2026-04 unverdicted novelty 7.0

    CompliBench uses simulation and adversarial flaw injection to create labeled dialogue data showing that top proprietary LLMs perform poorly at spotting guideline violations while fine-tuned smaller models outperform t...

  5. TaxPraBen: A Scalable Benchmark for Structured Evaluation of LLMs in Chinese Real-World Tax Practice

    cs.CL 2026-04 unverdicted novelty 7.0

    TaxPraBen is a new benchmark with 14 datasets and a structured evaluation method for measuring LLM performance on Chinese real-world tax tasks and scenarios.

  6. Visual-ERM: Reward Modeling for Visual Equivalence

    cs.CV 2026-03 unverdicted novelty 7.0

    Visual-ERM is a new multimodal reward model that supplies fine-grained visual feedback for training vision-language models on chart-to-code, table, and SVG tasks, yielding measurable gains over prior rewards.

  7. MegaScale-Omni: A Hyper-Scale, Workload-Resilient System for MultiModal LLM Training in Production

    cs.DC 2026-05 unverdicted novelty 6.0

    MegaScale-Omni delivers 1.27x-7.57x higher throughput for dynamic multimodal LLM training by decoupling encoder and LLM parallelism, using unified colocation, and applying adaptive workload balancing.

  8. CAR: Query-Guided Confidence-Aware Reranking for Retrieval-Augmented Generation

    cs.CL 2026-05 unverdicted novelty 6.0

    CAR reranks documents in RAG by promoting those that increase generator confidence (via answer consistency sampling) and demoting those that decrease it, yielding NDCG@5 gains on BEIR datasets that correlate with F1 i...

  9. Decompose and Recompose: Reasoning New Skills from Existing Abilities for Cross-Task Robotic Manipulation

    cs.RO 2026-05 unverdicted novelty 6.0

    Decompose and Recompose decomposes seen robotic demonstrations into skill-action alignments and recomposes them via visual-semantic retrieval and planning to enable zero-shot cross-task generalization.

  10. Decoding the Delta: Unifying Remote Sensing Change Detection and Understanding with Multimodal Large Language Models

    cs.CV 2026-04 unverdicted novelty 6.0

    Delta-LLaVA adds Change-Enhanced Attention, Change-SEG with prior embeddings, and Local Causal Attention to MLLMs to overcome temporal blindness, outperforming general models on a new unified benchmark for bi- and tri...

  11. BadSkill: Backdoor Attacks on Agent Skills via Model-in-Skill Poisoning

    cs.CR 2026-04 unverdicted novelty 6.0

    BadSkill poisons embedded models in agent skills to achieve up to 99.5% attack success rate on triggered tasks with only 3% poison rate while preserving normal behavior on non-trigger inputs.

  12. HAWK: Head Importance-Aware Visual Token Pruning in Multimodal Models

    cs.CV 2026-04 unverdicted novelty 6.0

    HAWK is a training-free method that prunes over 80% of visual tokens in MLLMs while retaining 96% accuracy by using head importance weights and text-guided attention to select task-relevant tokens.

  13. Personalized RewardBench: Evaluating Reward Models with Human Aligned Personalization

    cs.CL 2026-04 unverdicted novelty 6.0

    Personalized RewardBench reveals that state-of-the-art reward models reach only 75.94% accuracy on personalized preferences and shows stronger correlation with downstream BoN and PPO performance than prior benchmarks.

  14. InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    cs.CV 2025-08 unverdicted novelty 6.0

    InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...

  15. Visual-RFT: Visual Reinforcement Fine-Tuning

    cs.CV 2025-03 conditional novelty 6.0

    Visual-RFT applies reinforcement learning with verifiable perception rewards to improve large vision-language models on fine-grained classification, few-shot detection, and grounding tasks.

  16. Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    cs.CV 2024-12 unverdicted novelty 6.0

    InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.

  17. Why Do Vision Language Models Struggle To Recognize Human Emotions?

    cs.CV 2026-04 unverdicted novelty 5.0

    VLMs fail at dynamic facial expression recognition because web-scale pretraining exacerbates long-tailed class bias and sparse frame sampling misses micro-expressions; a multi-stage context enrichment strategy using l...

  18. Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

    cs.AI 2025-03 unverdicted novelty 5.0

    The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.

  19. How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

    cs.CV 2024-04 unverdicted novelty 4.0

    InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.

Reference graph

Works this paper leans on

172 extracted references · 172 canonical work pages · cited by 19 Pith papers · 28 internal anchors

  1. [1]

    https://github.com/MicrosoftDocs/azure-docs/blob/main/articles/ai-services/openai/includes/chat-markup-language.md

    chat markup language. https://github.com/MicrosoftDocs/azure-docs/blob/main/articles/ai-services/openai/includes/chat-markup-language.md. Accessed: 2024-02-06

  2. [2]

    https://github.com/ggerganov/llama.cpp, 2023

    llama.cpp: Port of facebook's llama model in c/c++. https://github.com/ggerganov/llama.cpp, 2023

  3. [3]

    GQA: training generalized multi-query transformer models from multi-head checkpoints

    Joshua Ainslie, James Lee - Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebr \' o n, and Sumit Sanghai. GQA: training generalized multi-query transformer models from multi-head checkpoints. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapo...

  4. [6]

    Cibench: Evaluating your llms with a code interpreter plugin

    Anonymous. Cibench: Evaluating your llms with a code interpreter plugin. In Openreview, 2024 a . URL https://openreview.net/forum?id=O8jmCw5puG

  5. [7]

    Mathbench: Evaluating the theory and application proficiency of llms with a hierarchical mathematics benchmark

    Anonymous. Mathbench: Evaluating the theory and application proficiency of llms with a hierarchical mathematics benchmark. In Openreview, 2024 b . URL https://openreview.net/forum?id=4vRO48RwVG

  6. [11]

    Training a helpful and harmless assistant with reinforcement learning from human feedback, 2022

    Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, ...

  7. [17]

    Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert - Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litw...

  8. [21]

    Amsp: Reducing communication overhead of zero for efficient llm training, 2024 b

    Qiaoling Chen, Qinghao Hu, Guoteng Wang, Yingtong Xiong, Ting Huang, Xun Chen, Yang Gao, Hang Yan, Yonggang Wen, Tianwei Zhang, and Peng Sun. Amsp: Reducing communication overhead of zero for efficient llm training, 2024 b

  9. [22]

    Theoremqa: A theorem-driven question answering dataset

    Wenhu Chen, Ming Yin, Max Ku, Pan Lu, Yixin Wan, Xueguang Ma, Jianyu Xu, Xinyi Wang, and Tony Xia. Theoremqa: A theorem-driven question answering dataset. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023 , pp.\ 7889--7901....

  10. [24]

    Agent-flan: Designing data and methods of effective agent tuning for large language models

    Zehui Chen, Kuikun Liu, Qiuchen Wang, Jiangning Liu, Dahua Lin, Kai Chen, and Feng Zhao. Agent-flan: Designing data and methods of effective agent tuning for large language models. work in progress, 2024 c

  11. [25]

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradb...

  12. [26]

    Christiano, Jan Leike, Tom B

    Paul F. Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 30: Annual Conference on Neural In...

  13. [28]

    Lmdeploy: A toolkit for compressing, deploying, and serving llm

    LMDeploy Contributors. Lmdeploy: A toolkit for compressing, deploying, and serving llm. https://github.com/InternLM/lmdeploy, 2023 a

  14. [29]

    Opencompass: A universal evaluation platform for foundation models

    OpenCompass Contributors. Opencompass: A universal evaluation platform for foundation models. https://github.com/open-compass/opencompass, 2023 b

  15. [30]

    Ultrafeedback: Boosting language models with high-quality feedback, 2023

    Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Wei Zhu, Yuan Ni, Guotong Xie, Zhiyuan Liu, and Maosong Sun. Ultrafeedback: Boosting language models with high-quality feedback, 2023

  16. [31]

    Safe rlhf: Safe reinforcement learning from human feedback, 2023

    Josef Dai, Xuehai Pan, Ruiyang Sun, Jiaming Ji, Xinbo Xu, Mickel Liu, Yizhou Wang, and Yaodong Yang. Safe rlhf: Safe reinforcement learning from human feedback, 2023

  17. [34]

    Understanding dataset difficulty with V -usable information

    Kawin Ethayarajh, Yejin Choi, and Swabha Swayamdipta. Understanding dataset difficulty with V -usable information. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato (eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp.\ 5988--60...

  18. [35]

    Scaling laws for reward model overoptimization

    Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization. Oct 2022

  19. [37]

    Dirk Groeneveld, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, Ananya Harsh Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, Shane Arora, David Atkinson, Russell Authur, Khyathi Raghavi Chandu, Arman Cohan, Jennifer Dumas, Yanai Elazar, Yuling Gu, Jack Hessel, Tushar Khot, William Merrill, Jacob Morrison, Niklas Muennighoff, Aakan...

  20. [38]

    Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Y. Wu, Y. K. Li, Fuli Luo, Yingfei Xiong, and Wenfeng Liang. Deepseek-coder: When the large language model meets programming - the rise of code intelligence. CoRR, abs/2401.14196, 2024

  21. [40]

    Measuring mathematical problem solving with the MATH dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. In Joaquin Vanschoren and Sai - Kit Yeung (eds.), Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 202...

  22. [41]

    Characterization of large language model development in the datacenter

    Qinghao Hu, Zhisheng Ye, Zerui Wang, Guoteng Wang, Meng Zhang, Qiaoling Chen, Peng Sun, Dahua Lin, Xiaolin Wang, Yingwei Luo, et al. Characterization of large language model development in the datacenter. In USENIX Symposium on Networked Systems Design and Implementation (NSDI’24), 2024

  23. [42]

    Gpipe: Efficient training of giant neural networks using pipeline parallelism

    Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. Gpipe: Efficient training of giant neural networks using pipeline parallelism. Advances in neural information processing systems, 32, 2019

  24. [47]

    Reducing activation recomputation in large transformer models

    Vijay Anand Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi, and Bryan Catanzaro. Reducing activation recomputation in large transformer models. Proceedings of Machine Learning and Systems, 5, 2023

  25. [48]

    Toutanova, Llion Jones, Ming-Wei Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov

    Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Matthew Kelcey, Jacob Devlin, Kenton Lee, Kristina N. Toutanova, Llion Jones, Ming-Wei Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. Natural questions: a benchmark for question answering research. Transac...

  26. [49]

    Fabbri, Caiming Xiong, Shafiq Joty, and Chien - Sheng Wu

    Philippe Laban, Wojciech Kryscinski, Divyansh Agarwal, Alexander R. Fabbri, Caiming Xiong, Shafiq Joty, and Chien - Sheng Wu. Summedits: Measuring LLM ability at factual reasoning through the lens of summarization. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, E...

  27. [51]

    Ds-1000: A natural and reliable benchmark for data science code generation

    Yuhang Lai, Chengxi Li, Yiming Wang, Tianyi Zhang, Ruiqi Zhong, Luke Zettlemoyer, Wen-tau Yih, Daniel Fried, Sida Wang, and Tao Yu. Ds-1000: A natural and reliable benchmark for data science code generation. In International Conference on Machine Learning, pp.\ 18319--18345. PMLR, 2023

  28. [52]

    Cmmlu: Measuring massive multitask language understanding in chinese, 2023 a

    Haonan Li, Yixuan Zhang, Fajri Koto, Yifei Yang, Hai Zhao, Yeyun Gong, Nan Duan, and Timothy Baldwin. Cmmlu: Measuring massive multitask language understanding in chinese, 2023 a

  29. [53]

    Hashimoto

    Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Alpacaeval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval, 2023 b

  30. [58]

    Dynamically scaled rope further increases performance of long context llama with zero fine-tuning, July 2023

    LocalLLaMA. Dynamically scaled rope further increases performance of long context llama with zero fine-tuning, July 2023. URL https://www.reddit.com/r/LocalLLaMA/comments/14mrgpr/dynamically_scaled_rope_further_increases/

  31. [59]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019 . OpenReview.net, 2019. URL https://openreview.net/forum?id=Bkg6RiCqY7

  32. [60]

    Longwanjuan: Towards systematic measurement for long text quality, 2024

    Kai Lv, Xiaoran Liu, Qipeng Guo, Hang Yan, Conghui He, Xipeng Qiu, and Dahua Lin. Longwanjuan: Towards systematic measurement for long text quality, 2024

  33. [61]

    Categorizing variants of goodhart’s law

    David Manheim and Scott Garrabrant. Categorizing variants of goodhart’s law. arXiv: Artificial Intelligence,arXiv: Artificial Intelligence, Mar 2018

  34. [63]

    Mixed precision training

    Sharan Narang, Gregory Diamos, Erich Elsen, Paulius Micikevicius, Jonah Alben, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, et al. Mixed precision training. In Int. Conf. on Learning Representation, 2017

  35. [65]

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human fee...

  36. [71]

    Zero: Memory optimizations toward training trillion parameter models

    Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pp.\ 1--16. IEEE, 2020

  37. [72]

    Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters

    Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp.\ 3505--3506, 2020

  38. [74]

    Chi, James Caverlee, Julian J

    Noveen Sachdeva, Benjamin Coleman, Wang - Cheng Kang, Jianmo Ni, Lichan Hong, Ed H. Chi, James Caverlee, Julian J. McAuley, and Derek Zhiyuan Cheng. How to train data-efficient llms. CoRR, abs/2402.09668, 2024

  39. [80]

    Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul Christiano

    Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul Christiano. Learning to summarize from human feedback. In NeurIPS, 2020

  40. [81]

    Investigating prior knowledge for challenging chinese machine reading comprehension

    Kai Sun, Dian Yu, Dong Yu, and Claire Cardie. Investigating prior knowledge for challenging chinese machine reading comprehension. Transactions of the Association for Computational Linguistics, 2020. URL https://arxiv.org/abs/1904.09679v3

  41. [83]

    NLLB Team, Marta R. Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, Anna Sun, Skyler Wang, Guillaume Wenzek, Al Youngblood, Bapi Akula, Loic Barrault, Gabriel Mejia Gonzalez, Prangthip Hansanti, John Hoffman, Semarley Jarrett, Kaushik Ram Sadagopan, Dirk Rowe,...

  42. [86]

    Gomez, Lukasz Kaiser, and Illia Polosukhin

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 30: Annual Conference o...

  43. [87]

    Skywork: A more open bilingual foundation model, 2023

    Tianwen Wei, Liang Zhao, Lichang Zhang, Bo Zhu, Lijie Wang, Haihua Yang, Biye Li, Cheng Cheng, Weiwei Lü, Rui Hu, Chenxia Li, Liu Yang, Xilin Luo, Xuejie Wu, Lunan Liu, Wenjun Cheng, Peng Cheng, Jianhao Zhang, Xiaoyu Zhang, Lei Lin, Xiaokun Wang, Yutuan Ma, Chuanhai Dong, Yanqi Sun, Yifu Chen, Yongyi Peng, Xiaojuan Liang, Shuicheng Yan, Han Fang, and Yahu...

  44. [88]

    QuRating: Selecting high-quality data for training language models

    Alexander Wettig, Aatmik Gupta, Saumya Malik, and Danqi Chen. QuRating: Selecting high-quality data for training language models. CoRR, abs/2402.09739, 2024

  45. [93]

    ReAct: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R. Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. URL https://openreview.net/pdf?id=WE_vluYUL-X

  46. [94]

    InternLM-Math: Open math large language models toward verifiable reasoning, 2024

    Huaiyuan Ying, Shuo Zhang, Linyang Li, Zhejian Zhou, Yunfan Shao, Zhaoye Fei, Yichuan Ma, Jiawei Hong, Kuikun Liu, Ziyi Wang, Yudong Wang, Zijian Wu, Shuaibin Li, Fengzhe Zhou, Hongwei Liu, Songyang Zhang, Wenwei Zhang, Hang Yan, Xipeng Qiu, Jiayu Wang, Kai Chen, and Dahua Lin. InternLM-Math: Open math large language models toward verifiable reasoning, 2024

  47. [96]

    GLM-130B: An open bilingual pre-trained model

    Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, Weng Lam Tam, Zixuan Ma, Yufei Xue, Jidong Zhai, Wenguang Chen, Zhiyuan Liu, Peng Zhang, Yuxiao Dong, and Jie Tang. GLM-130B: An open bilingual pre-trained model. In ICLR, 2023

  48. [97]

    Root mean square layer normalization

    Biao Zhang and Rico Sennrich. Root mean square layer normalization. In NeurIPS, 2019

  49. [98]

    Evaluating the performance of large language models on GAOKAO benchmark

    Xiaotian Zhang, Chunyang Li, Yi Zong, Zhengyu Ying, Liang He, and Xipeng Qiu. Evaluating the performance of large language models on GAOKAO benchmark, 2023

  50. [99]

    MiCS: Near-linear scaling for training gigantic model on public cloud

    Zhen Zhang, Shuai Zheng, Yida Wang, Justin Chiu, George Karypis, Trishul Chilimbi, Mu Li, and Xin Jin. MiCS: Near-linear scaling for training gigantic model on public cloud. Proceedings of the VLDB Endowment, 16(1):37-50, 2022

  51. [100]

    Judging LLM-as-a-judge with MT-Bench and Chatbot Arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena, 2023

  52. [102]

    AGIEval: A human-centric benchmark for evaluating foundation models, 2023

    Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin Wang, Amin Saied, Weizhu Chen, and Nan Duan. AGIEval: A human-centric benchmark for evaluating foundation models, 2023

  53. [104]

    LMDeploy: A toolkit for compressing, deploying, and serving LLM

    LMDeploy Contributors. LMDeploy: A toolkit for compressing, deploying, and serving LLM. https://github.com/InternLM/lmdeploy, 2023

  54. [105]

    Nemotron-4 15B technical report

    Nemotron-4 15B technical report. arXiv preprint arXiv:2402.16819, 2024

  55. [106]

    Scaling learning algorithms towards AI

    Yoshua Bengio and Yann LeCun. Scaling learning algorithms towards AI. In Large-Scale Kernel Machines. MIT Press, 2007

  56. [107]

    Accelerating collective communication in data parallel training across deep learning frameworks

    Accelerating collective communication in data parallel training across deep learning frameworks. In 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22), 2022

  57. [108]

    Characterization of large language model development in the datacenter

    Characterization of large language model development in the datacenter. In USENIX Symposium on Networked Systems Design and Implementation (NSDI '24), 2024

  58. [109]

    Ray: A distributed framework for emerging AI applications

    Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I. Jordan, and Ion Stoica. Ray: A distributed framework for emerging AI applications. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), 2018

  59. [110]

    Code needs comments: Enhancing code LLMs with comment augmentation

    Code needs comments: Enhancing code LLMs with comment augmentation. arXiv preprint arXiv:2402.13013, 2024

  60. [111]

    AMSP: Reducing communication overhead of ZeRO for efficient LLM training

    AMSP: Reducing communication overhead of ZeRO for efficient LLM training, 2024

  61. [112]

    DeepSpeed: System optimizations enable training deep learning models with over 100 billion parameters

    Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. DeepSpeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020

  62. [113]

    MiCS: Near-linear scaling for training gigantic model on public cloud

    Zhen Zhang, Shuai Zheng, Yida Wang, Justin Chiu, George Karypis, Trishul Chilimbi, Mu Li, and Xin Jin. MiCS: Near-linear scaling for training gigantic model on public cloud. Proceedings of the VLDB Endowment, 16(1):37-50, 2022

  63. [114]

    How to train data-efficient LLMs

    Noveen Sachdeva, Benjamin Coleman, Wang-Cheng Kang, Jianmo Ni, Lichan Hong, Ed H. Chi, James Caverlee, Julian J. McAuley, and Derek Zhiyuan Cheng. How to train data-efficient LLMs. CoRR, abs/2402.09668, 2024

  64. [115]

    QuRating: Selecting high-quality data for training language models

    Alexander Wettig, Aatmik Gupta, Saumya Malik, and Danqi Chen. QuRating: Selecting high-quality data for training language models. CoRR, abs/2402.09739, 2024

  65. [116]

    DeepSeek-Coder: When the large language model meets programming

    Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Y. Wu, Y. K. Li, Fuli Luo, Yingfei Xiong, and Wenfeng Liang. DeepSeek-Coder: When the large language model meets programming -- the rise of code intelligence. CoRR, abs/2401.14196, 2024

  66. [117]

    LongWanjuan: Towards systematic measurement for long text quality

    LongWanjuan: Towards systematic measurement for long text quality, 2024

  67. [118]

    OLMo: Accelerating the science of language models

    Dirk Groeneveld, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, Ananya Harsh Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, Shane Arora, David Atkinson, Russell Authur, Khyathi Raghavi Chandu, Arman Cohan, Jennifer Dumas, Yanai Elazar, Yuling Gu, Jack Hessel, Tushar Khot, et al. OLMo: Accelerating the science of language models. CoRR, abs/2402.00838, 2024

  68. [119]

    InternEvo: Efficient long-sequence large language model training via hybrid parallelism and redundant sharding

    InternEvo: Efficient long-sequence large language model training via hybrid parallelism and redundant sharding. arXiv preprint arXiv:2401.09149, 2024

  69. [120]

    ZeRO: Memory optimizations toward training trillion parameter models

    Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. ZeRO: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, 2020

  70. [121]

    Reducing activation recomputation in large transformer models

    Reducing activation recomputation in large transformer models. In Proceedings of Machine Learning and Systems, 2023

  71. [122]

    GPipe: Efficient training of giant neural networks using pipeline parallelism

    GPipe: Efficient training of giant neural networks using pipeline parallelism. In Advances in Neural Information Processing Systems, 2019

  72. [123]

    Megatron-LM: Training multi-billion parameter language models using model parallelism

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019

  73. [124]

    Mixed precision training

    Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu. Mixed precision training. In International Conference on Learning Representations, 2018

  74. [125]

    Efficient large-scale language model training on GPU clusters using Megatron-LM

    Efficient large-scale language model training on GPU clusters using Megatron-LM. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2021

  75. [126]

    Chat markup language (ChatML)

    OpenAI. Chat markup language (ChatML)

  76. [127]

    A fast learning algorithm for deep belief nets

    Geoffrey E. Hinton, Simon Osindero, and Yee Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527-1554, 2006

  77. [128]

    Deep learning

    Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT Press, 2016

  78. [129]

    Deep reinforcement learning from human preferences

    Paul F. Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In NeurIPS, 2017

  79. [130]

    Proximal policy optimization algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017

  80. [131]

    Learning to rank using gradient descent

    Christopher J. C. Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Gregory N. Hullender. Learning to rank using gradient descent. In ICML, 2005. doi:10.1145/1102351.1102363

Showing first 80 references.