pith. machine review for the scientific record.

arxiv: 2403.17297 · v1 · submitted 2024-03-26 · 💻 cs.CL · cs.AI

Recognition: 2 theorem links · Lean Theorem

InternLM2 Technical Report

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 11:40 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords InternLM2 · large language model · open-source LLM · pre-training · COOL RLHF · long-context modeling · benchmark evaluation · model alignment

The pith

InternLM2 outperforms prior open-source LLMs on 30 benchmarks, long-context tasks up to 200k tokens, and subjective evaluations via staged pre-training and COOL RLHF alignment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces InternLM2 as an open-source large language model built to exceed earlier versions through careful pre-training on mixed text, code, and long-sequence data. Training begins at 4k token contexts and scales to 32k tokens, which supports strong results on extended dependency tests such as the 200k-token needle-in-a-haystack evaluation. Alignment proceeds with supervised fine-tuning followed by a new Conditional Online RLHF method designed to resolve conflicting human preferences and limit reward hacking. Models from multiple training stages and sizes are released to let others observe how capabilities develop. These steps matter for readers because they supply concrete, reproducible techniques that narrow the performance difference between open and closed models across broad capability measures.
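The 200k needle-in-a-haystack evaluation mentioned above is easy to make concrete. The sketch below builds test prompts with the needle placed at varying relative depths of the haystack; the filler and needle strings are placeholders, not the paper's actual test material:

```python
def build_niah_prompt(filler_tokens: list[str], needle: str,
                      depth: float) -> str:
    """Insert the needle at a relative depth in [0, 1] of the haystack."""
    assert 0.0 <= depth <= 1.0
    pos = int(depth * len(filler_tokens))
    toks = filler_tokens[:pos] + [needle] + filler_tokens[pos:]
    return " ".join(toks)

haystack = ["word"] * 1000  # stand-in for long filler text
needle = "THE-SECRET-IS-42"

# Sweep insertion positions across the context, the usual NIAH protocol.
prompts = [build_niah_prompt(haystack, needle, d)
           for d in (0.0, 0.25, 0.5, 0.75, 1.0)]
assert all(needle in p for p in prompts)
assert prompts[0].split().index(needle) == 0
assert prompts[-1].split().index(needle) == 1000
```

The model is then asked to retrieve the needle from each prompt; a depth-by-context-length grid of retrieval accuracies is the standard way the result is reported.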

Core claim

InternLM2 outperforms its predecessors across six evaluation dimensions and thirty benchmarks, exhibits effective long-context modeling after progressive training from 4k to 32k tokens, and improves open-ended subjective responses through supervised fine-tuning combined with Conditional Online Reinforcement Learning from Human Feedback that mitigates preference conflicts and reward hacking.

What carries the argument

The staged pre-training pipeline that scales context length while incorporating diverse text, code, and long-context data, paired with the Conditional Online RLHF alignment procedure.
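The report describes Conditional Online RLHF only at a high level here. One way to picture the "conditional" part, as a hedged sketch rather than the authors' implementation, is a single reward model that scores the same response differently depending on which preference condition is supplied:

```python
from typing import Callable

RewardFn = Callable[[str, str], float]

class ConditionalRewardModel:
    """Toy stand-in: one scorer selected by a condition tag, mirroring
    the idea that a single reward model can arbitrate conflicting
    preferences by conditioning on them rather than averaging them."""
    def __init__(self, scorers: dict[str, RewardFn]):
        self.scorers = scorers

    def score(self, condition: str, prompt: str, response: str) -> float:
        return self.scorers[condition](prompt, response)

# Illustrative scorers only: "helpful" rewards length, "safe" penalizes
# a blocked term. Real reward models are learned, not hand-written.
rm = ConditionalRewardModel({
    "helpful": lambda p, r: min(len(r) / 100, 1.0),
    "safe": lambda p, r: 0.0 if "forbidden" in r else 1.0,
})

assert rm.score("helpful", "q", "x" * 200) == 1.0
assert rm.score("safe", "q", "a forbidden thing") == 0.0
```

Under this picture, conflicting annotations (a long answer that is helpful but unsafe) are resolved by asking which condition is active, which is the preference-conflict problem the abstract says COOL RLHF addresses.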

Load-bearing premise

The chosen thirty benchmarks and subjective evaluations measure general capabilities fairly without selection bias or prompt sensitivity that would alter the reported rankings.

What would settle it

Re-running the same models on a different collection of thirty benchmarks or an alternative set of subjective prompts that produces a higher ranking for a predecessor model would falsify the outperformance claim.

original abstract

The evolution of Large Language Models (LLMs) like ChatGPT and GPT-4 has sparked discussions on the advent of Artificial General Intelligence (AGI). However, replicating such advancements in open-source models has been challenging. This paper introduces InternLM2, an open-source LLM that outperforms its predecessors in comprehensive evaluations across 6 dimensions and 30 benchmarks, long-context modeling, and open-ended subjective evaluations through innovative pre-training and optimization techniques. The pre-training process of InternLM2 is meticulously detailed, highlighting the preparation of diverse data types including text, code, and long-context data. InternLM2 efficiently captures long-term dependencies, initially trained on 4k tokens before advancing to 32k tokens in pre-training and fine-tuning stages, exhibiting remarkable performance on the 200k "Needle-in-a-Haystack" test. InternLM2 is further aligned using Supervised Fine-Tuning (SFT) and a novel Conditional Online Reinforcement Learning from Human Feedback (COOL RLHF) strategy that addresses conflicting human preferences and reward hacking. By releasing InternLM2 models in different training stages and model sizes, we provide the community with insights into the model's evolution.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces InternLM2, an open-source LLM claimed to outperform predecessors across 6 dimensions and 30 benchmarks, long-context modeling (including a 200k needle-in-a-haystack test after scaling from 4k to 32k context), and open-ended subjective evaluations. It details pre-training on diverse text/code/long-context data, SFT, and a novel COOL RLHF strategy to mitigate conflicting preferences and reward hacking, while releasing models at multiple training stages and sizes.

Significance. If the performance claims are robust, the work provides a valuable open-source model with demonstrated long-context capabilities and a new RLHF variant, offering community insights into training dynamics. The release of intermediate checkpoints strengthens reproducibility and allows external verification of the claimed innovations in pre-training and alignment.

major comments (2)
  1. [Abstract and §4 (Evaluations)] Abstract and evaluation sections: the central claim of outperformance on 30 benchmarks across 6 dimensions rests on reported scores without error bars, ablation tables, prompt-template details, or data-decontamination analysis. This makes it impossible to assess whether gains are stable or sensitive to evaluation choices.
  2. [Long-context modeling and Needle-in-a-Haystack] Long-context section: the 200k needle-in-a-haystack result is presented without controls for retrieval tricks or alternative prompt regimes, leaving open whether the reported capability stems from the claimed pre-training schedule or from test-specific artifacts.
minor comments (2)
  1. [Alignment section] Notation for COOL RLHF hyperparameters is introduced without an explicit equation or pseudocode listing the conditional reward formulation.
  2. [Pre-training] The data-mixture description would benefit from a table showing token counts per category (text, code, long-context) at each training stage.
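The missing-error-bars objection in major comment 1 is cheap to address in principle: per-example bootstrap resampling yields a confidence interval for any accuracy-style benchmark score. A generic sketch, not tied to the paper's evaluation code:

```python
import random

def bootstrap_ci(per_example: list[int], n_boot: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap CI for mean accuracy over 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(per_example)
    means = sorted(
        sum(rng.choices(per_example, k=n)) / n for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# 70% accuracy on 200 examples: the interval, not the point score,
# is what a "model A beats model B" ranking claim should be tested against.
outcomes = [1] * 140 + [0] * 60
lo, hi = bootstrap_ci(outcomes)
assert lo < 0.70 < hi
```

Overlapping intervals between two models on a 200-example benchmark would mean the headline ranking is within noise, which is exactly the stability question the referee raises.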

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the thorough review of our manuscript on InternLM2. We appreciate the feedback on the evaluation sections and have revised the paper to incorporate additional details and clarifications as outlined in our point-by-point responses below.

point-by-point responses
  1. Referee: [Abstract and §4 (Evaluations)] Abstract and evaluation sections: the central claim of outperformance on 30 benchmarks across 6 dimensions rests on reported scores without error bars, ablation tables, prompt-template details, or data-decontamination analysis. This makes it impossible to assess whether gains are stable or sensitive to evaluation choices.

    Authors: We agree that additional transparency would strengthen the presentation. In the revised manuscript, we have added error bars for the primary benchmarks (computed over multiple evaluation seeds where feasible), a summary table of the prompt templates employed, and a concise description of our data decontamination procedure (n-gram overlap filtering against standard evaluation corpora). Comprehensive ablation tables for all 30 benchmarks would expand the paper substantially; we have therefore included key ablations in an appendix and released the full evaluation scripts and prompts with the model checkpoints to enable independent verification. These changes directly address concerns about stability and sensitivity to evaluation choices. revision: partial

  2. Referee: [Long-context modeling and Needle-in-a-Haystack] Long-context section: the 200k needle-in-a-haystack result is presented without controls for retrieval tricks or alternative prompt regimes, leaving open whether the reported capability stems from the claimed pre-training schedule or from test-specific artifacts.

    Authors: The 200k needle-in-a-haystack evaluation follows the standard protocol established in prior work. In the revised version, we have added results across varied needle insertion positions and multiple prompt formulations. These controls show consistent retrieval accuracy, supporting that the observed capability derives from the progressive context-length scaling in pre-training (4k to 32k tokens) rather than test-specific artifacts. We have also expanded the description of the long-context data mixture and training schedule in Section 3 to further clarify the source of the improvement. revision: yes
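The decontamination step named in response 1 (n-gram overlap filtering against evaluation corpora) can be sketched generically; the window size and threshold below are illustrative, not the paper's settings:

```python
def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Word-level n-grams of a document."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_contaminated(train_doc: str, eval_docs: list[str],
                    n: int = 8, threshold: int = 1) -> bool:
    """Flag a training document that shares at least `threshold`
    n-grams with any evaluation document."""
    doc_grams = ngrams(train_doc, n)
    return any(len(doc_grams & ngrams(e, n)) >= threshold for e in eval_docs)

eval_set = ["the quick brown fox jumps over the lazy dog near the river"]
leak = "noise noise the quick brown fox jumps over the lazy dog noise"
clean = "completely unrelated sentence about training language models at scale"
assert is_contaminated(leak, eval_set, n=8)
assert not is_contaminated(clean, eval_set, n=8)
```

Production pipelines index the evaluation n-grams once (e.g. in a hash set or Bloom filter) rather than recomputing them per training document, but the filtering criterion is the same.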

Circularity Check

0 steps flagged

No significant circularity; claims rest on external benchmarks

full rationale

The paper reports empirical performance on 30 public benchmarks, long-context tests, and subjective evaluations without any mathematical derivations, equations, or predictions that reduce to self-defined quantities or fitted inputs by construction. Training procedures (pre-training data mixtures, context extension from 4k to 32k, SFT, and COOL RLHF) are described procedurally but contain no self-referential steps where a claimed result is equivalent to its own inputs. Any self-citations to prior InternLM work are not load-bearing for the central outperformance claims, which remain independently verifiable on external suites.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 1 invented entity

The central claims rest on the validity of standard LLM benchmarks, the assumption that reported training stages are faithfully reproduced by released weights, and the premise that COOL RLHF resolves reward hacking without introducing new unmeasured biases.

free parameters (2)
  • context length schedule
    4k to 32k token progression chosen during pre-training and fine-tuning
  • RLHF reward model hyperparameters
    Parameters of the conditional online RLHF procedure fitted to human preference data
axioms (2)
  • domain assumption: Public benchmarks measure general language capability without significant contamination or prompt sensitivity
    Invoked when claiming outperformance across 30 benchmarks
  • domain assumption: Released model checkpoints match the described training stages
    Required for any downstream reproduction or inspection
invented entities (1)
  • COOL RLHF (no independent evidence)
    purpose: Alignment strategy that addresses conflicting human preferences and reward hacking
    Newly introduced Conditional Online Reinforcement Learning from Human Feedback variant
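The ledger's "RLHF reward model hyperparameters" entry hides the usual knob RLHF turns against reward hacking: a KL-style penalty that pulls the policy back toward the SFT reference. A toy per-token objective, purely illustrative and not the paper's formulation:

```python
def rlhf_objective(reward: float, logp_policy: float, logp_ref: float,
                   kl_coef: float = 0.1) -> float:
    """Reward minus a KL-style penalty for drifting from the reference
    (SFT) policy -- the standard guard against reward hacking."""
    kl_term = logp_policy - logp_ref  # per-token log-ratio estimate
    return reward - kl_coef * kl_term

# Same raw reward, but the drifted policy is penalized.
on_ref = rlhf_objective(1.0, logp_policy=-2.0, logp_ref=-2.0)
drifted = rlhf_objective(1.0, logp_policy=-1.0, logp_ref=-2.0)
assert on_ref == 1.0
assert drifted < on_ref
```

Whether COOL RLHF relies on this mechanism, its conditional reward formulation, or both is exactly what the referee's request for an explicit equation or pseudocode would settle.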



Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. The Wittgensteinian Representation Hypothesis: Is Language the Attractor of Multimodal Convergence?

    cs.AI 2026-05 unverdicted novelty 7.0

    Language representations serve as the asymptotic attractor for convergence in independently trained multimodal neural networks due to feature density asymmetry.

  2. VTAgent: Agentic Keyframe Anchoring for Evidence-Aware Video TextVQA

    cs.CV 2026-05 unverdicted novelty 7.0

    VTAgent uses a question-guided agent to anchor keyframes for evidence-aware Video TextVQA, delivering up to +12 accuracy and new SOTA results via training-free operation plus SFT and RL.

  3. StoryAlign: Evaluating and Training Reward Models for Story Generation

    cs.CL 2026-05 unverdicted novelty 7.0

    StoryReward, trained on a new 100k story preference dataset, sets state-of-the-art performance on the introduced StoryRMB benchmark for aligning LLM stories with human preferences.

  4. CompliBench: Benchmarking LLM Judges for Compliance Violation Detection in Dialogue Systems

    cs.CL 2026-04 unverdicted novelty 7.0

    CompliBench uses simulation and adversarial flaw injection to create labeled dialogue data showing that top proprietary LLMs perform poorly at spotting guideline violations while fine-tuned smaller models outperform t...

  5. TaxPraBen: A Scalable Benchmark for Structured Evaluation of LLMs in Chinese Real-World Tax Practice

    cs.CL 2026-04 unverdicted novelty 7.0

    TaxPraBen is a new benchmark with 14 datasets and a structured evaluation method for measuring LLM performance on Chinese real-world tax tasks and scenarios.

  6. Visual-ERM: Reward Modeling for Visual Equivalence

    cs.CV 2026-03 unverdicted novelty 7.0

    Visual-ERM is a new multimodal reward model that supplies fine-grained visual feedback for training vision-language models on chart-to-code, table, and SVG tasks, yielding measurable gains over prior rewards.

  7. MegaScale-Omni: A Hyper-Scale, Workload-Resilient System for MultiModal LLM Training in Production

    cs.DC 2026-05 unverdicted novelty 6.0

    MegaScale-Omni delivers 1.27x-7.57x higher throughput for dynamic multimodal LLM training by decoupling encoder and LLM parallelism, using unified colocation, and applying adaptive workload balancing.

  8. CAR: Query-Guided Confidence-Aware Reranking for Retrieval-Augmented Generation

    cs.CL 2026-05 unverdicted novelty 6.0

    CAR reranks documents in RAG by promoting those that increase generator confidence (via answer consistency sampling) and demoting those that decrease it, yielding NDCG@5 gains on BEIR datasets that correlate with F1 i...

  9. Decompose and Recompose: Reasoning New Skills from Existing Abilities for Cross-Task Robotic Manipulation

    cs.RO 2026-05 unverdicted novelty 6.0

    Decompose and Recompose decomposes seen robotic demonstrations into skill-action alignments and recomposes them via visual-semantic retrieval and planning to enable zero-shot cross-task generalization.

  10. Decoding the Delta: Unifying Remote Sensing Change Detection and Understanding with Multimodal Large Language Models

    cs.CV 2026-04 unverdicted novelty 6.0

    Delta-LLaVA adds Change-Enhanced Attention, Change-SEG with prior embeddings, and Local Causal Attention to MLLMs to overcome temporal blindness, outperforming general models on a new unified benchmark for bi- and tri...

  11. BadSkill: Backdoor Attacks on Agent Skills via Model-in-Skill Poisoning

    cs.CR 2026-04 unverdicted novelty 6.0

    BadSkill poisons embedded models in agent skills to achieve up to 99.5% attack success rate on triggered tasks with only 3% poison rate while preserving normal behavior on non-trigger inputs.

  12. HAWK: Head Importance-Aware Visual Token Pruning in Multimodal Models

    cs.CV 2026-04 unverdicted novelty 6.0

    HAWK is a training-free method that prunes over 80% of visual tokens in MLLMs while retaining 96% accuracy by using head importance weights and text-guided attention to select task-relevant tokens.

  13. Personalized RewardBench: Evaluating Reward Models with Human Aligned Personalization

    cs.CL 2026-04 unverdicted novelty 6.0

    Personalized RewardBench reveals that state-of-the-art reward models reach only 75.94% accuracy on personalized preferences and shows stronger correlation with downstream BoN and PPO performance than prior benchmarks.

  14. InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    cs.CV 2025-08 unverdicted novelty 6.0

    InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...

  15. Visual-RFT: Visual Reinforcement Fine-Tuning

    cs.CV 2025-03 conditional novelty 6.0

    Visual-RFT applies reinforcement learning with verifiable perception rewards to improve large vision-language models on fine-grained classification, few-shot detection, and grounding tasks.

  16. Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    cs.CV 2024-12 unverdicted novelty 6.0

    InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.

  17. Why Do Vision Language Models Struggle To Recognize Human Emotions?

    cs.CV 2026-04 unverdicted novelty 5.0

    VLMs fail at dynamic facial expression recognition because web-scale pretraining exacerbates long-tailed class bias and sparse frame sampling misses micro-expressions; a multi-stage context enrichment strategy using l...

  18. Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

    cs.AI 2025-03 unverdicted novelty 5.0

    The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.

  19. How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

    cs.CV 2024-04 unverdicted novelty 4.0

    InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.

Reference graph

Works this paper leans on

172 extracted references · 172 canonical work pages · cited by 19 Pith papers · 28 internal anchors

  1. [1]

    https://github.com/MicrosoftDocs/azure-docs/blob/main/articles/ai-services/openai/includes/chat-markup-language.md

    chat markup language. https://github.com/MicrosoftDocs/azure-docs/blob/main/articles/ai-services/openai/includes/chat-markup-language.md. Accessed: 2024-02-06

  2. [2]

    https://github.com/ggerganov/llama.cpp, 2023

    llama.cpp: Port of facebook's llama model in c/c++. https://github.com/ggerganov/llama.cpp, 2023

  3. [3]

    GQA: training generalized multi-query transformer models from multi-head checkpoints

    Joshua Ainslie, James Lee - Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebr \' o n, and Sumit Sanghai. GQA: training generalized multi-query transformer models from multi-head checkpoints. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapo...

  4. [6]

    Cibench: Evaluating your llms with a code interpreter plugin

    Anonymous. Cibench: Evaluating your llms with a code interpreter plugin. In Openreview, 2024 a . URL https://openreview.net/forum?id=O8jmCw5puG

  5. [7]

    Mathbench: Evaluating the theory and application proficiency of llms with a hierarchical mathematics benchmark

    Anonymous. Mathbench: Evaluating the theory and application proficiency of llms with a hierarchical mathematics benchmark. In Openreview, 2024 b . URL https://openreview.net/forum?id=4vRO48RwVG

  6. [11]

    Training a helpful and harmless assistant with reinforcement learning from human feedback, 2022

    Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, ...

  7. [17]

    Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert - Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litw...

  8. [21]

    Amsp: Reducing communication overhead of zero for efficient llm training, 2024 b

    Qiaoling Chen, Qinghao Hu, Guoteng Wang, Yingtong Xiong, Ting Huang, Xun Chen, Yang Gao, Hang Yan, Yonggang Wen, Tianwei Zhang, and Peng Sun. Amsp: Reducing communication overhead of zero for efficient llm training, 2024 b

  9. [22]

    Theoremqa: A theorem-driven question answering dataset

    Wenhu Chen, Ming Yin, Max Ku, Pan Lu, Yixin Wan, Xueguang Ma, Jianyu Xu, Xinyi Wang, and Tony Xia. Theoremqa: A theorem-driven question answering dataset. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023 , pp.\ 7889--7901....

  10. [24]

    Agent-flan: Designing data and methods of effective agent tuning for large language models

    Zehui Chen, Kuikun Liu, Qiuchen Wang, Jiangning Liu, Dahua Lin, Kai Chen, and Feng Zhao. Agent-flan: Designing data and methods of effective agent tuning for large language models. work in progress, 2024 c

  11. [25]

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradb...

  12. [26]

    Christiano, Jan Leike, Tom B

    Paul F. Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 30: Annual Conference on Neural In...

  13. [28]

    Lmdeploy: A toolkit for compressing, deploying, and serving llm

    LMDeploy Contributors. Lmdeploy: A toolkit for compressing, deploying, and serving llm. https://github.com/InternLM/lmdeploy, 2023 a

  14. [29]

    Opencompass: A universal evaluation platform for foundation models

    OpenCompass Contributors. Opencompass: A universal evaluation platform for foundation models. https://github.com/open-compass/opencompass, 2023 b

  15. [30]

    Ultrafeedback: Boosting language models with high-quality feedback, 2023

    Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Wei Zhu, Yuan Ni, Guotong Xie, Zhiyuan Liu, and Maosong Sun. Ultrafeedback: Boosting language models with high-quality feedback, 2023

  16. [31]

    Safe rlhf: Safe reinforcement learning from human feedback, 2023

    Josef Dai, Xuehai Pan, Ruiyang Sun, Jiaming Ji, Xinbo Xu, Mickel Liu, Yizhou Wang, and Yaodong Yang. Safe rlhf: Safe reinforcement learning from human feedback, 2023

  17. [34]

    Understanding dataset difficulty with V -usable information

    Kawin Ethayarajh, Yejin Choi, and Swabha Swayamdipta. Understanding dataset difficulty with V -usable information. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato (eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp.\ 5988--60...

  18. [35]

    Scaling laws for reward model overoptimization

    Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization. Oct 2022

  19. [37]

    Dirk Groeneveld, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, Ananya Harsh Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, Shane Arora, David Atkinson, Russell Authur, Khyathi Raghavi Chandu, Arman Cohan, Jennifer Dumas, Yanai Elazar, Yuling Gu, Jack Hessel, Tushar Khot, William Merrill, Jacob Morrison, Niklas Muennighoff, Aakan...

  20. [38]

    Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Y. Wu, Y. K. Li, Fuli Luo, Yingfei Xiong, and Wenfeng Liang. Deepseek-coder: When the large language model meets programming - the rise of code intelligence. CoRR, abs/2401.14196, 2024

  21. [40]

    Measuring mathematical problem solving with the MATH dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. In Joaquin Vanschoren and Sai - Kit Yeung (eds.), Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 202...

  22. [41]

    Characterization of large language model development in the datacenter

    Qinghao Hu, Zhisheng Ye, Zerui Wang, Guoteng Wang, Meng Zhang, Qiaoling Chen, Peng Sun, Dahua Lin, Xiaolin Wang, Yingwei Luo, et al. Characterization of large language model development in the datacenter. In USENIX Symposium on Networked Systems Design and Implementation (NSDI’24), 2024

  23. [42]

    Gpipe: Efficient training of giant neural networks using pipeline parallelism

    Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. Gpipe: Efficient training of giant neural networks using pipeline parallelism. Advances in neural information processing systems, 32, 2019

  24. [47]

    Reducing activation recomputation in large transformer models

    Vijay Anand Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi, and Bryan Catanzaro. Reducing activation recomputation in large transformer models. Proceedings of Machine Learning and Systems, 5, 2023

  25. [48]

    Toutanova, Llion Jones, Ming-Wei Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov

    Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Matthew Kelcey, Jacob Devlin, Kenton Lee, Kristina N. Toutanova, Llion Jones, Ming-Wei Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. Natural questions: a benchmark for question answering research. Transac...

  26. [49]

    Fabbri, Caiming Xiong, Shafiq Joty, and Chien - Sheng Wu

    Philippe Laban, Wojciech Kryscinski, Divyansh Agarwal, Alexander R. Fabbri, Caiming Xiong, Shafiq Joty, and Chien - Sheng Wu. Summedits: Measuring LLM ability at factual reasoning through the lens of summarization. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, E...

  27. [51]

    Ds-1000: A natural and reliable benchmark for data science code generation

    Yuhang Lai, Chengxi Li, Yiming Wang, Tianyi Zhang, Ruiqi Zhong, Luke Zettlemoyer, Wen-tau Yih, Daniel Fried, Sida Wang, and Tao Yu. Ds-1000: A natural and reliable benchmark for data science code generation. In International Conference on Machine Learning, pp.\ 18319--18345. PMLR, 2023

  28. [52]

    Cmmlu: Measuring massive multitask language understanding in chinese, 2023 a

    Haonan Li, Yixuan Zhang, Fajri Koto, Yifei Yang, Hai Zhao, Yeyun Gong, Nan Duan, and Timothy Baldwin. Cmmlu: Measuring massive multitask language understanding in chinese, 2023 a

  29. [53]

    Hashimoto

    Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Alpacaeval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval, 2023 b

  30. [58]

    Dynamically scaled rope further increases performance of long context llama with zero fine-tuning, July 2023

    LocalLLaMA. Dynamically scaled rope further increases performance of long context llama with zero fine-tuning, July 2023. URL https://www.reddit.com/r/LocalLLaMA/comments/14mrgpr/dynamically_scaled_rope_further_increases/

  31. [59]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019 . OpenReview.net, 2019. URL https://openreview.net/forum?id=Bkg6RiCqY7

  32. [60]

    Longwanjuan: Towards systematic measurement for long text quality, 2024

    Kai Lv, Xiaoran Liu, Qipeng Guo, Hang Yan, Conghui He, Xipeng Qiu, and Dahua Lin. Longwanjuan: Towards systematic measurement for long text quality, 2024

  33. [61]

    Categorizing variants of goodhart’s law

    David Manheim and Scott Garrabrant. Categorizing variants of goodhart’s law. arXiv: Artificial Intelligence,arXiv: Artificial Intelligence, Mar 2018

  34. [63]

    Mixed precision training

    Sharan Narang, Gregory Diamos, Erich Elsen, Paulius Micikevicius, Jonah Alben, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, et al. Mixed precision training. In Int. Conf. on Learning Representation, 2017

  35. [65]

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human fee...

  36. [71]

    Zero: Memory optimizations toward training trillion parameter models

    Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pp.\ 1--16. IEEE, 2020

  37. [72]

    Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters

    Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp.\ 3505--3506, 2020

  38. [74]

    Chi, James Caverlee, Julian J

    Noveen Sachdeva, Benjamin Coleman, Wang - Cheng Kang, Jianmo Ni, Lichan Hong, Ed H. Chi, James Caverlee, Julian J. McAuley, and Derek Zhiyuan Cheng. How to train data-efficient llms. CoRR, abs/2402.09668, 2024

  39. [80]

    Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul Christiano

    Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul Christiano. Learning to summarize from human feedback. In NeurIPS, 2020

  40. [81]

    Investigating prior knowledge for challenging chinese machine reading comprehension

    Kai Sun, Dian Yu, Dong Yu, and Claire Cardie. Investigating prior knowledge for challenging chinese machine reading comprehension. Transactions of the Association for Computational Linguistics, 2020. URL https://arxiv.org/abs/1904.09679v3

  41. [83]

    NLLB Team, Marta R. Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, Anna Sun, Skyler Wang, Guillaume Wenzek, Al Youngblood, Bapi Akula, Loic Barrault, Gabriel Mejia Gonzalez, Prangthip Hansanti, John Hoffman, Semarley Jarrett, Kaushik Ram Sadagopan, Dirk Rowe,...

  42. [86]

    Gomez, Lukasz Kaiser, and Illia Polosukhin

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 30: Annual Conference o...

  43. [87]

    Skywork: A more open bilingual foundation model, 2023

    Tianwen Wei, Liang Zhao, Lichang Zhang, Bo Zhu, Lijie Wang, Haihua Yang, Biye Li, Cheng Cheng, Weiwei Lü, Rui Hu, Chenxia Li, Liu Yang, Xilin Luo, Xuejie Wu, Lunan Liu, Wenjun Cheng, Peng Cheng, Jianhao Zhang, Xiaoyu Zhang, Lei Lin, Xiaokun Wang, Yutuan Ma, Chuanhai Dong, Yanqi Sun, Yifu Chen, Yongyi Peng, Xiaojuan Liang, Shuicheng Yan, Han Fang, and Yahu...

  44. [88]

    QuRating: Selecting high-quality data for training language models

    Alexander Wettig, Aatmik Gupta, Saumya Malik, and Danqi Chen. QuRating: Selecting high-quality data for training language models. CoRR, abs/2402.09739, 2024

  45. [93]

    ReAct: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R. Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. URL https://openreview.net/pdf?id=WE_vluYUL-X

  46. [94]

    InternLM-Math: Open math large language models toward verifiable reasoning, 2024

    Huaiyuan Ying, Shuo Zhang, Linyang Li, Zhejian Zhou, Yunfan Shao, Zhaoye Fei, Yichuan Ma, Jiawei Hong, Kuikun Liu, Ziyi Wang, Yudong Wang, Zijian Wu, Shuaibin Li, Fengzhe Zhou, Hongwei Liu, Songyang Zhang, Wenwei Zhang, Hang Yan, Xipeng Qiu, Jiayu Wang, Kai Chen, and Dahua Lin. InternLM-Math: Open math large language models toward verifiable reasoning, 2024

  47. [96]

    GLM-130B: An open bilingual pre-trained model

    Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, Weng Lam Tam, Zixuan Ma, Yufei Xue, Jidong Zhai, Wenguang Chen, Zhiyuan Liu, Peng Zhang, Yuxiao Dong, and Jie Tang. GLM-130B: An open bilingual pre-trained model. In ICLR, 2023

  48. [97]

    Root mean square layer normalization

    Biao Zhang and Rico Sennrich. Root mean square layer normalization. In NeurIPS, 2019

  49. [98]

    Evaluating the performance of large language models on GAOKAO benchmark

    Xiaotian Zhang, Chunyang Li, Yi Zong, Zhengyu Ying, Liang He, and Xipeng Qiu. Evaluating the performance of large language models on GAOKAO benchmark, 2023

  50. [99]

    MiCS: Near-linear scaling for training gigantic model on public cloud

    Zhen Zhang, Shuai Zheng, Yida Wang, Justin Chiu, George Karypis, Trishul Chilimbi, Mu Li, and Xin Jin. MiCS: Near-linear scaling for training gigantic model on public cloud. Proceedings of the VLDB Endowment, 16(1):37-50, 2022

  51. [100]

    Judging LLM-as-a-judge with MT-Bench and Chatbot Arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena, 2023

  52. [102]

    AGIEval: A human-centric benchmark for evaluating foundation models, 2023

    Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin Wang, Amin Saied, Weizhu Chen, and Nan Duan. AGIEval: A human-centric benchmark for evaluating foundation models, 2023

  53. [104]

    LMDeploy: A toolkit for compressing, deploying, and serving LLM

    LMDeploy Contributors. LMDeploy: A toolkit for compressing, deploying, and serving LLM. https://github.com/InternLM/lmdeploy, 2023

  54. [105]

    Nemotron-4 15B technical report

    Nemotron-4 15B technical report. arXiv preprint arXiv:2402.16819, 2024

  55. [106]

    Scaling learning algorithms towards AI

    Yoshua Bengio and Yann LeCun. Scaling learning algorithms towards AI. In Large-Scale Kernel Machines. MIT Press, 2007

  56. [107]

    Accelerating collective communication in data parallel training across deep learning frameworks

    Accelerating collective communication in data parallel training across deep learning frameworks. In 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22), 2022

  57. [108]

    Characterization of large language model development in the datacenter

    Characterization of large language model development in the datacenter. In USENIX Symposium on Networked Systems Design and Implementation (NSDI '24), 2024

  58. [109]

    Ray: A distributed framework for emerging AI applications

    Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I. Jordan, and Ion Stoica. Ray: A distributed framework for emerging AI applications. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), 2018

  59. [110]

    Code needs comments: Enhancing code LLMs with comment augmentation

    Code needs comments: Enhancing code LLMs with comment augmentation. arXiv preprint arXiv:2402.13013, 2024

  60. [111]

    AMSP: Reducing communication overhead of ZeRO for efficient LLM training

    AMSP: Reducing communication overhead of ZeRO for efficient LLM training, 2024

  61. [112]

    DeepSpeed: System optimizations enable training deep learning models with over 100 billion parameters

    Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. DeepSpeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020

  62. [113]

    MiCS: Near-linear scaling for training gigantic model on public cloud

    Zhen Zhang, Shuai Zheng, Yida Wang, Justin Chiu, George Karypis, Trishul Chilimbi, Mu Li, and Xin Jin. MiCS: Near-linear scaling for training gigantic model on public cloud. Proceedings of the VLDB Endowment, 16(1):37-50, 2022

  63. [114]

    How to train data-efficient LLMs

    Noveen Sachdeva, Benjamin Coleman, Wang-Cheng Kang, Jianmo Ni, Lichan Hong, Ed H. Chi, James Caverlee, Julian J. McAuley, and Derek Zhiyuan Cheng. How to train data-efficient LLMs. CoRR, abs/2402.09668, 2024

  64. [115]

    QuRating: Selecting high-quality data for training language models

    Alexander Wettig, Aatmik Gupta, Saumya Malik, and Danqi Chen. QuRating: Selecting high-quality data for training language models. CoRR, abs/2402.09739, 2024

  65. [116]

    DeepSeek-Coder: When the large language model meets programming

    Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Y. Wu, Y. K. Li, Fuli Luo, Yingfei Xiong, and Wenfeng Liang. DeepSeek-Coder: When the large language model meets programming -- the rise of code intelligence. CoRR, abs/2401.14196, 2024

  66. [117]

    LongWanjuan: Towards systematic measurement for long text quality

    LongWanjuan: Towards systematic measurement for long text quality, 2024

  67. [118]

    OLMo: Accelerating the science of language models

    Dirk Groeneveld, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, Ananya Harsh Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, Shane Arora, David Atkinson, Russell Authur, Khyathi Raghavi Chandu, Arman Cohan, Jennifer Dumas, Yanai Elazar, Yuling Gu, Jack Hessel, Tushar Khot, et al. OLMo: Accelerating the science of language models. CoRR, abs/2402.00838, 2024

  68. [119]

    InternEvo: Efficient long-sequence large language model training via hybrid parallelism and redundant sharding

    InternEvo: Efficient long-sequence large language model training via hybrid parallelism and redundant sharding. arXiv preprint arXiv:2401.09149, 2024

  69. [120]

    ZeRO: Memory optimizations toward training trillion parameter models

    Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. ZeRO: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, 2020

  70. [121]

    Reducing activation recomputation in large transformer models

    Reducing activation recomputation in large transformer models. In Proceedings of Machine Learning and Systems, 2023

  71. [122]

    GPipe: Efficient training of giant neural networks using pipeline parallelism

    GPipe: Efficient training of giant neural networks using pipeline parallelism. In Advances in Neural Information Processing Systems, 2019

  72. [123]

    Megatron-LM: Training multi-billion parameter language models using model parallelism

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019

  73. [124]

    Mixed precision training

    Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu. Mixed precision training. In International Conference on Learning Representations, 2018

  74. [125]

    Efficient large-scale language model training on GPU clusters using Megatron-LM

    Efficient large-scale language model training on GPU clusters using Megatron-LM. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2021

  75. [126]

    Chat markup language (ChatML)

    OpenAI. Chat markup language (ChatML)

  76. [127]

    A fast learning algorithm for deep belief nets

    Geoffrey E. Hinton, Simon Osindero, and Yee Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527-1554, 2006

  77. [128]

    Deep learning

    Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT Press, 2016

  78. [129]

    Deep reinforcement learning from human preferences

    Paul F. Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In NeurIPS, 2017

  79. [130]

    Proximal policy optimization algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017

  80. [131]

    Learning to rank using gradient descent

    Christopher J. C. Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Gregory N. Hullender. Learning to rank using gradient descent. In ICML, 2005. doi:10.1145/1102351.1102363

Showing first 80 references.