InternLM2 Technical Report
Pith reviewed 2026-05-15 11:40 UTC · model grok-4.3
The pith
InternLM2 outperforms prior open-source LLMs on 30 benchmarks, on long-context tasks up to 200k tokens, and on open-ended subjective evaluations, via staged pre-training and COOL RLHF alignment.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
InternLM2 outperforms its predecessors across six evaluation dimensions and thirty benchmarks, exhibits effective long-context modeling after progressive training from 4k to 32k tokens, and improves open-ended subjective responses through supervised fine-tuning combined with Conditional Online Reinforcement Learning from Human Feedback that mitigates preference conflicts and reward hacking.
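The claim pairs training at 32k tokens with evaluation at 200k, which implies extrapolating well beyond the trained window. One common mechanism for that, NTK-style rescaling of the rotary-embedding base, can be sketched as below; this is an illustrative sketch of the general technique, not necessarily the paper's exact extrapolation method.

```python
def rope_inv_freqs(head_dim, base=10000.0):
    """Inverse frequencies used by rotary position embeddings (RoPE)."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]

def ntk_scaled_base(base, trained_len, target_len, head_dim):
    """NTK-style base rescaling: enlarge the RoPE base so that positions up to
    target_len stay roughly in-distribution for a model trained at trained_len."""
    scale = target_len / trained_len
    return base * scale ** (head_dim / (head_dim - 2))
```

With `head_dim=128`, scaling from a 32k training window to a 200k target enlarges the base, which shrinks the low rotary frequencies so the longest wavelengths span the extended context.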
What carries the argument
The staged pre-training pipeline that scales context length while incorporating diverse text, code, and long-context data, paired with the Conditional Online RLHF alignment procedure.
Load-bearing premise
The chosen thirty benchmarks and subjective evaluations measure general capabilities fairly without selection bias or prompt sensitivity that would alter the reported rankings.
What would settle it
Re-running the same models on a different collection of thirty benchmarks or an alternative set of subjective prompts that produces a higher ranking for a predecessor model would falsify the outperformance claim.
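The falsification criterion above is, in effect, a claim about ranking stability under benchmark resampling. A minimal bootstrap sketch of that check (model names and scores below are illustrative placeholders, not the paper's numbers):

```python
import random

def rank_models(scores, idx):
    """Rank models by mean score over a chosen subset of benchmark indices."""
    means = {m: sum(s[i] for i in idx) / len(idx) for m, s in scores.items()}
    return sorted(means, key=means.get, reverse=True)

def top_model_stability(scores, n_resamples=1000, seed=0):
    """Fraction of bootstrap benchmark subsets in which the full-suite winner still ranks first."""
    rng = random.Random(seed)
    n = len(next(iter(scores.values())))
    full_winner = rank_models(scores, list(range(n)))[0]
    hits = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample benchmarks with replacement
        hits += rank_models(scores, idx)[0] == full_winner
    return hits / n_resamples
```

A stability near 1.0 suggests the ranking does not hinge on the particular benchmark collection; values well below 1.0 would support the objection that the outperformance claim is sensitive to benchmark selection.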
Original abstract
The evolution of Large Language Models (LLMs) like ChatGPT and GPT-4 has sparked discussions on the advent of Artificial General Intelligence (AGI). However, replicating such advancements in open-source models has been challenging. This paper introduces InternLM2, an open-source LLM that outperforms its predecessors in comprehensive evaluations across 6 dimensions and 30 benchmarks, long-context modeling, and open-ended subjective evaluations through innovative pre-training and optimization techniques. The pre-training process of InternLM2 is meticulously detailed, highlighting the preparation of diverse data types including text, code, and long-context data. InternLM2 efficiently captures long-term dependencies, initially trained on 4k tokens before advancing to 32k tokens in pre-training and fine-tuning stages, exhibiting remarkable performance on the 200k "Needle-in-a-Haystack" test. InternLM2 is further aligned using Supervised Fine-Tuning (SFT) and a novel Conditional Online Reinforcement Learning from Human Feedback (COOL RLHF) strategy that addresses conflicting human preferences and reward hacking. By releasing InternLM2 models in different training stages and model sizes, we provide the community with insights into the model's evolution.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces InternLM2, an open-source LLM claimed to outperform predecessors across 6 dimensions and 30 benchmarks, long-context modeling (including a 200k needle-in-a-haystack test after scaling from 4k to 32k context), and open-ended subjective evaluations. It details pre-training on diverse text/code/long-context data, SFT, and a novel COOL RLHF strategy to mitigate conflicting preferences and reward hacking, while releasing models at multiple training stages and sizes.
Significance. If the performance claims are robust, the work provides a valuable open-source model with demonstrated long-context capabilities and a new RLHF variant, offering community insights into training dynamics. The release of intermediate checkpoints strengthens reproducibility and allows external verification of the claimed innovations in pre-training and alignment.
major comments (2)
- [Abstract and §4 (Evaluations)] The central claim of outperformance on 30 benchmarks across 6 dimensions rests on reported scores without error bars, ablation tables, prompt-template details, or data-decontamination analysis, making it impossible to assess whether gains are stable or sensitive to evaluation choices.
- [Long-context modeling and Needle-in-a-Haystack] The 200k needle-in-a-haystack result is presented without controls for retrieval tricks or alternative prompt regimes, leaving open whether the reported capability stems from the claimed pre-training schedule or from test-specific artifacts.
minor comments (2)
- [Alignment section] Notation for COOL RLHF hyperparameters is introduced without an explicit equation or pseudocode listing the conditional reward formulation.
- [Pre-training] The data-mixture description would benefit from a table showing token counts per category (text, code, long-context) at each training stage.
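The first minor comment asks for the conditional reward in equation or pseudocode form. A hedged sketch of the routing idea behind a conditional reward model, in the COOL RLHF spirit: each preference domain is scored under its own condition (system prompt), so conflicting objectives such as helpfulness and harmlessness are not collapsed into one averaged reward. The condition strings and scoring function here are illustrative placeholders, not the paper's actual prompts or model.

```python
# Illustrative conditions; the paper's actual condition prompts are not reproduced here.
CONDITION_PROMPTS = {
    "helpful": "You are a helpful assistant.",
    "harmless": "You are a safety-first assistant.",
}

def conditional_reward(score_fn, domain, prompt, response):
    """Score (prompt, response) under the condition matching its preference domain,
    so each domain's preference data trains against its own conditioned reward."""
    condition = CONDITION_PROMPTS[domain]
    return score_fn(condition + "\n" + prompt, response)
```

The design point is that a single reward model can serve multiple, potentially conflicting, preference distributions by conditioning rather than averaging.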
Simulated Author's Rebuttal
Thank you for the thorough review of our manuscript on InternLM2. We appreciate the feedback on the evaluation sections and have revised the paper to incorporate additional details and clarifications as outlined in our point-by-point responses below.
Point-by-point responses
Referee: [Abstract and §4 (Evaluations)] The central claim of outperformance on 30 benchmarks across 6 dimensions rests on reported scores without error bars, ablation tables, prompt-template details, or data-decontamination analysis, making it impossible to assess whether gains are stable or sensitive to evaluation choices.
Authors: We agree that additional transparency would strengthen the presentation. In the revised manuscript, we have added error bars for the primary benchmarks (computed over multiple evaluation seeds where feasible), a summary table of the prompt templates employed, and a concise description of our data decontamination procedure (n-gram overlap filtering against standard evaluation corpora). Comprehensive ablation tables for all 30 benchmarks would expand the paper substantially; we have therefore included key ablations in an appendix and released the full evaluation scripts and prompts with the model checkpoints to enable independent verification. These changes directly address concerns about stability and sensitivity to evaluation choices. revision: partial
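The decontamination procedure the authors describe, n-gram overlap filtering against evaluation corpora, can be sketched as below. The 13-gram window is an assumed, commonly used choice, not a value stated in the rebuttal.

```python
def ngrams(tokens, n=13):
    """All contiguous n-grams of a token sequence, as a set of tuples."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(train_doc, eval_texts, n=13):
    """Flag a training document that shares any n-gram with an evaluation corpus."""
    eval_grams = set()
    for text in eval_texts:
        eval_grams |= ngrams(text.split(), n)
    return bool(ngrams(train_doc.split(), n) & eval_grams)
```

A production filter would normalize tokenization and hash the n-grams, but the decision rule is the same: any sufficiently long exact overlap with an evaluation item marks the training document for removal.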
Referee: [Long-context modeling and Needle-in-a-Haystack] The 200k needle-in-a-haystack result is presented without controls for retrieval tricks or alternative prompt regimes, leaving open whether the reported capability stems from the claimed pre-training schedule or from test-specific artifacts.
Authors: The 200k needle-in-a-haystack evaluation follows the standard protocol established in prior work. In the revised version, we have added results across varied needle insertion positions and multiple prompt formulations. These controls show consistent retrieval accuracy, supporting that the observed capability derives from the progressive context-length scaling in pre-training (4k to 32k tokens) rather than test-specific artifacts. We have also expanded the description of the long-context data mixture and training schedule in Section 3 to further clarify the source of the improvement. revision: yes
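The varied-insertion-position control described in the response can be sketched as a minimal harness. Here `answer_fn` stands in for a call to the model under test; the filler text, needle, and depths are placeholders, not the paper's evaluation data.

```python
def build_haystack(filler_words, needle, depth):
    """Place the needle sentence at a relative depth (0.0 = start, 1.0 = end) of the filler."""
    pos = int(depth * len(filler_words))
    return " ".join(filler_words[:pos] + [needle] + filler_words[pos:])

def retrieval_accuracy(answer_fn, filler_words, needle, fact, depths):
    """Fraction of insertion depths at which the model's answer contains the target fact."""
    hits = sum(fact in answer_fn(build_haystack(filler_words, needle, d)) for d in depths)
    return hits / len(depths)
```

Sweeping both depth and total context length, and checking that accuracy stays flat across the grid, is what distinguishes genuine long-range retrieval from position-specific artifacts.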
Circularity Check
No significant circularity; claims rest on external benchmarks
Full rationale
The paper reports empirical performance on 30 public benchmarks, long-context tests, and subjective evaluations without any mathematical derivations, equations, or predictions that reduce to self-defined quantities or fitted inputs by construction. Training procedures (pre-training data mixtures, context extension from 4k to 32k, SFT, and COOL RLHF) are described procedurally but contain no self-referential steps where a claimed result is equivalent to its own inputs. Any self-citations to prior InternLM work are not load-bearing for the central outperformance claims, which remain independently verifiable on external suites.
Axiom & Free-Parameter Ledger
free parameters (2)
- context length schedule
- RLHF reward model hyperparameters
axioms (2)
- domain assumption: Public benchmarks measure general language capability without significant contamination or prompt sensitivity
- domain assumption: Released model checkpoints match the described training stages
invented entities (1)
- COOL RLHF: no independent evidence
Forward citations
Cited by 19 Pith papers
- The Wittgensteinian Representation Hypothesis: Is Language the Attractor of Multimodal Convergence?
  Language representations serve as the asymptotic attractor for convergence in independently trained multimodal neural networks due to feature density asymmetry.
- VTAgent: Agentic Keyframe Anchoring for Evidence-Aware Video TextVQA
  VTAgent uses a question-guided agent to anchor keyframes for evidence-aware Video TextVQA, delivering up to +12 accuracy and new SOTA results via training-free operation plus SFT and RL.
- StoryAlign: Evaluating and Training Reward Models for Story Generation
  StoryReward, trained on a new 100k story preference dataset, sets state-of-the-art performance on the introduced StoryRMB benchmark for aligning LLM stories with human preferences.
- CompliBench: Benchmarking LLM Judges for Compliance Violation Detection in Dialogue Systems
  CompliBench uses simulation and adversarial flaw injection to create labeled dialogue data showing that top proprietary LLMs perform poorly at spotting guideline violations while fine-tuned smaller models outperform t...
- TaxPraBen: A Scalable Benchmark for Structured Evaluation of LLMs in Chinese Real-World Tax Practice
  TaxPraBen is a new benchmark with 14 datasets and a structured evaluation method for measuring LLM performance on Chinese real-world tax tasks and scenarios.
- Visual-ERM: Reward Modeling for Visual Equivalence
  Visual-ERM is a new multimodal reward model that supplies fine-grained visual feedback for training vision-language models on chart-to-code, table, and SVG tasks, yielding measurable gains over prior rewards.
- MegaScale-Omni: A Hyper-Scale, Workload-Resilient System for MultiModal LLM Training in Production
  MegaScale-Omni delivers 1.27x-7.57x higher throughput for dynamic multimodal LLM training by decoupling encoder and LLM parallelism, using unified colocation, and applying adaptive workload balancing.
- CAR: Query-Guided Confidence-Aware Reranking for Retrieval-Augmented Generation
  CAR reranks documents in RAG by promoting those that increase generator confidence (via answer consistency sampling) and demoting those that decrease it, yielding NDCG@5 gains on BEIR datasets that correlate with F1 i...
- Decompose and Recompose: Reasoning New Skills from Existing Abilities for Cross-Task Robotic Manipulation
  Decompose and Recompose decomposes seen robotic demonstrations into skill-action alignments and recomposes them via visual-semantic retrieval and planning to enable zero-shot cross-task generalization.
- Decoding the Delta: Unifying Remote Sensing Change Detection and Understanding with Multimodal Large Language Models
  Delta-LLaVA adds Change-Enhanced Attention, Change-SEG with prior embeddings, and Local Causal Attention to MLLMs to overcome temporal blindness, outperforming general models on a new unified benchmark for bi- and tri...
- BadSkill: Backdoor Attacks on Agent Skills via Model-in-Skill Poisoning
  BadSkill poisons embedded models in agent skills to achieve up to 99.5% attack success rate on triggered tasks with only 3% poison rate while preserving normal behavior on non-trigger inputs.
- HAWK: Head Importance-Aware Visual Token Pruning in Multimodal Models
  HAWK is a training-free method that prunes over 80% of visual tokens in MLLMs while retaining 96% accuracy by using head importance weights and text-guided attention to select task-relevant tokens.
- Personalized RewardBench: Evaluating Reward Models with Human Aligned Personalization
  Personalized RewardBench reveals that state-of-the-art reward models reach only 75.94% accuracy on personalized preferences and shows stronger correlation with downstream BoN and PPO performance than prior benchmarks.
- InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
  InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...
- Visual-RFT: Visual Reinforcement Fine-Tuning
  Visual-RFT applies reinforcement learning with verifiable perception rewards to improve large vision-language models on fine-grained classification, few-shot detection, and grounding tasks.
- Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
  InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
- Why Do Vision Language Models Struggle To Recognize Human Emotions?
  VLMs fail at dynamic facial expression recognition because web-scale pretraining exacerbates long-tailed class bias and sparse frame sampling misses micro-expressions; a multi-stage context enrichment strategy using l...
- Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models
  The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.
- How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
  InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.