pith. machine review for the scientific record.

arXiv: 2402.08268 · v4 · submitted 2024-02-13 · 💻 cs.LG · published as a conference paper at ICLR 2025

Recognition: 3 theorem links

World Model on Million-Length Video And Language With Blockwise RingAttention

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 06:32 UTC · model grok-4.3

classification 💻 cs.LG
keywords: long context models · video language models · Blockwise RingAttention · million token sequences · context extension · sequence modeling · open source models · long video understanding

The pith

7B parameter models process video and language sequences exceeding 1 million tokens.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper shows how to build and train a family of 7-billion-parameter models that handle sequences longer than one million tokens for both pure language and combined video-language tasks. The work covers curating long-context training data, extending the context length step by step from 4,000 to 1 million tokens, and using an efficient open-source training method. A sympathetic reader would care because current models struggle with the long temporal horizons needed for general intelligence, and these models set new retrieval benchmarks while adding video understanding at that scale. The result is publicly released models that can take in entire long documents or extended videos as single inputs.

Core claim

We provide a comprehensive exploration of the full development process for producing 1M context language models and video-language models, setting new benchmarks in language retrieval and new capabilities in long video understanding. We detail our long context data curation process, progressive context extension from 4K to 1M tokens, and present an efficient open-source implementation for scalable training on long sequences. Additionally, we open-source a family of 7B parameter models capable of processing long text documents and videos exceeding 1M tokens.

What carries the argument

Blockwise RingAttention, a mechanism that processes attention in blocks arranged in a ring to enable memory-efficient training on sequences up to one million tokens.
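The mechanism can be illustrated numerically. The sketch below is a single-process simulation of the blockwise computation (our own minimal reconstruction, not the paper's implementation): it omits causal masking, device sharding, and ring communication, keeping only the core trick of folding one key/value block at a time into a running softmax via log-sum-exp rescaling.

```python
import numpy as np

def blockwise_ring_attention(q, k, v, block_size):
    """Exact attention computed one key/value block at a time, using the
    log-sum-exp trick to keep a numerically stable running softmax."""
    n, d = q.shape
    out = np.zeros_like(v)
    running_max = np.full(n, -np.inf)
    denom = np.zeros(n)
    for start in range(0, n, block_size):
        # In real ring attention this block arrives from the neighboring
        # device; here it is just a slice of the local arrays.
        kb = k[start:start + block_size]
        vb = v[start:start + block_size]
        scores = q @ kb.T / np.sqrt(d)
        new_max = np.maximum(running_max, scores.max(axis=1))
        rescale = np.exp(running_max - new_max)   # shrink old accumulators
        p = np.exp(scores - new_max[:, None])
        out = out * rescale[:, None] + p @ vb
        denom = denom * rescale + p.sum(axis=1)
        running_max = new_max
    return out / denom[:, None]

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(8, 4)) for _ in range(3))
# Reference: full materialized softmax attention.
s = q @ k.T / np.sqrt(4)
w = np.exp(s - s.max(axis=1, keepdims=True))
full = (w / w.sum(axis=1, keepdims=True)) @ v
assert np.allclose(blockwise_ring_attention(q, k, v, block_size=3), full)
```

In the distributed version, query shards stay resident on each device while key/value blocks rotate around the ring, so per-device activation memory stays linear in the local sequence shard rather than quadratic in the full length.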

If this is right

  • Models achieve new state-of-the-art results on long-document language retrieval benchmarks.
  • The same architecture supports previously unseen capabilities in understanding videos that span millions of tokens.
  • Open-source release of the 7B models and training code allows direct replication and further scaling experiments.
  • Progressive context extension combined with the attention method keeps training feasible on standard hardware clusters.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • These models could act as initial world models for agents that must reason over hour-long video streams or book-length texts.
  • The training recipe might transfer to other modalities such as audio or 3D scene sequences without major redesign.
  • Future tests could measure whether the same approach sustains quality at 10M tokens or whether new bottlenecks appear.

Load-bearing premise

That the combination of Blockwise RingAttention and progressive context extension from 4K to 1M tokens lets models use the entire context effectively without prohibitive compute costs or performance collapse.
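One common lever for this kind of staged extension, assumed here purely for illustration (the paper may use a different positional recipe), is growing the RoPE base frequency with each stage's context length; the stage lengths below follow the 32K→1M schedule the paper describes, read as powers of two.

```python
# Hedged sketch of a progressive context-extension schedule. The linear
# theta-scaling rule is our illustrative assumption, not the paper's
# confirmed recipe.
STAGE_LENGTHS = [32_768, 131_072, 262_144, 524_288, 1_048_576]  # 32K ... 1M

def scaled_rope_theta(context_len, base_len=4_096, base_theta=10_000.0):
    """Scale the RoPE base in proportion to context growth, a common
    heuristic for keeping rotary positions distinguishable at length."""
    return base_theta * (context_len / base_len)

schedule = [(n, scaled_rope_theta(n)) for n in STAGE_LENGTHS]
```

Each stage trains at its length before the next doubling, so the model adapts gradually instead of jumping from 4K to 1M in one step.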

What would settle it

A controlled evaluation in which the released 7B models show no improvement on 1M-token video retrieval or understanding tasks compared with the same architecture trained only to 4K context.
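Such an evaluation amounts to a position-aware needle probe. The harness below is hypothetical (names and filler are ours, not the paper's benchmark): plant a known fact at a controlled depth in filler text, then query any long-context model for it and score retrieval accuracy by depth.

```python
# Hypothetical needle-at-depth prompt builder; the "magic number" needle
# echoes the retrieval probe style quoted from the paper's appendix.
def build_needle_prompt(needle, depth_fraction, filler_units,
                        filler="the quick brown fox. "):
    """Place `needle` at roughly `depth_fraction` through a filler document
    made of `filler_units` repetitions of `filler`."""
    assert 0.0 <= depth_fraction <= 1.0
    n_before = int(filler_units * depth_fraction)
    return filler * n_before + needle + " " + filler * (filler_units - n_before)

needle = "The magic number for San Francisco is 2521233."
early = build_needle_prompt(needle, depth_fraction=0.05, filler_units=1_000)
late = build_needle_prompt(needle, depth_fraction=0.95, filler_units=1_000)
```

Sweeping `depth_fraction` over a grid and plotting accuracy against depth is exactly the ablation the referee report below asks for.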

Original abstract

Enabling long-context understanding remains a key challenge in scaling existing sequence models -- a crucial component in developing generally intelligent models that can process and operate over long temporal horizons that potentially consist of millions of tokens. In this paper, we aim to address these challenges by providing a comprehensive exploration of the full development process for producing 1M context language models and video-language models, setting new benchmarks in language retrieval and new capabilities in long video understanding. We detail our long context data curation process, progressive context extension from 4K to 1M tokens, and present an efficient open-source implementation for scalable training on long sequences. Additionally, we open-source a family of 7B parameter models capable of processing long text documents and videos exceeding 1M tokens.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims to develop and open-source a family of 7B-parameter models for language and video-language tasks that process sequences exceeding 1M tokens. It details a data curation pipeline, progressive context extension from 4K to 1M tokens, an efficient Blockwise RingAttention implementation for scalable training, and reports new benchmarks in long-context language retrieval together with novel capabilities in long video understanding.

Significance. If the central claim holds, the work would be significant for long-context sequence modeling: it supplies reproducible open-source 7B models, an efficient training recipe, and empirical evidence that 1M-token video-language understanding is feasible at this scale, directly addressing the temporal-horizon challenge highlighted in the abstract.

major comments (2)
  1. [§4, Table 2] The language-retrieval and video-understanding benchmarks report aggregate metrics but contain no ablations that place critical evidence at sequence extremes (e.g., retrieval accuracy when the relevant information is located in the first 100k tokens of a 1M-token video-language sequence). Without such tests the headline claim that the models “exploit information distributed across >1M tokens” rests on an unverified assumption about attention reach.
  2. [§3.2–§3.3] The description of the Blockwise RingAttention mechanism and the 4K→1M extension schedule does not quantify attention sparsity, memory scaling, or per-block gradient norms at the full 1M length; the claim that the mechanism “enables effective utilization of the full context length without prohibitive computational costs” therefore lacks the load-bearing measurements needed to support the scaling argument.
minor comments (2)
  1. [Figure 3] Figure 3 caption and axis labels use inconsistent token-length notation (1M vs. 1024k); standardize throughout.
  2. [Abstract] The open-source repository link is mentioned only in the abstract; add an explicit footnote or appendix entry with the exact commit hash and license.
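The gap named in the second major comment can at least be bounded with back-of-envelope arithmetic, ours and purely illustrative, not figures from the paper: naive attention materializes an n × n score matrix, while blockwise attention holds only an n × b slice per ring step.

```python
# Illustrative memory arithmetic for attention scores in bf16 (2 bytes).
def score_matrix_bytes(n_tokens, kv_cols, bytes_per_elem=2):
    """Bytes needed to hold an (n_tokens x kv_cols) score matrix."""
    return n_tokens * kv_cols * bytes_per_elem

n = 1_000_000
naive_bytes = score_matrix_bytes(n, n)       # full n x n scores: ~2 TB
ring_bytes = score_matrix_bytes(n, 4_096)    # one 4K-wide block: ~8 GB
```

In the sharded setting the query dimension is also split across devices, shrinking the per-device slice further; the point is only that the quadratic term never has to be materialized at once.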

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments point-by-point below and will revise the manuscript accordingly to strengthen the empirical support for our claims.

Point-by-point responses
  1. Referee: [§4, Table 2] The language-retrieval and video-understanding benchmarks report aggregate metrics but contain no ablations that place critical evidence at sequence extremes (e.g., retrieval accuracy when the relevant information is located in the first 100k tokens of a 1M-token video-language sequence). Without such tests the headline claim that the models “exploit information distributed across >1M tokens” rests on an unverified assumption about attention reach.

    Authors: We agree that explicit position-aware ablations would provide more direct evidence for full-context utilization. Our existing retrieval benchmarks already involve 1M-token sequences with information distributed across varying positions, but we will add targeted ablations in the revision (e.g., accuracy when the query-relevant content is placed in the first 100k tokens, middle, or final 100k tokens of 1M-token video-language inputs) to directly address this concern. revision: yes

  2. Referee: [§3.2–§3.3] The description of the Blockwise RingAttention mechanism and the 4K→1M extension schedule does not quantify attention sparsity, memory scaling, or per-block gradient norms at the full 1M length; the claim that the mechanism “enables effective utilization of the full context length without prohibitive computational costs” therefore lacks the load-bearing measurements needed to support the scaling argument.

    Authors: We will expand §3.2 and §3.3 in the revision to include the requested quantitative details: attention sparsity ratios observed at 1M length, memory scaling curves across context lengths, and per-block gradient norm statistics collected during progressive extension training. These measurements will be added to better substantiate the efficiency claims. revision: yes

Circularity Check

0 steps flagged

No circularity; long-context claims rest on empirical training and implementation details

Full rationale

The paper describes a training pipeline involving data curation, progressive context extension from 4K to 1M tokens, and an open-source Blockwise RingAttention implementation for 7B models. These steps are presented as engineering and optimization procedures whose effectiveness is evaluated through benchmarks, not derived by redefining inputs or fitting parameters that are then relabeled as predictions. No self-citation chains, uniqueness theorems, or ansatzes reduce the central claims to tautologies. The reported capabilities in language retrieval and video understanding are treated as outcomes of the described process rather than forced by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim depends on standard transformer scaling assumptions plus the unverified efficacy of the described attention variant and training schedule for maintaining performance at extreme lengths.

free parameters (1)
  • progressive context schedule
    The choice of starting at 4K and extending to 1M tokens is a training hyperparameter selected to enable scaling.
axioms (1)
  • domain assumption: Blockwise RingAttention computes attention efficiently over million-token sequences
    Invoked as the key enabler for practical training and inference at the target scale.

pith-pipeline@v0.9.0 · 5427 in / 1243 out tokens · 44657 ms · 2026-05-16T06:32:36.859746+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Foundation/DimensionForcing.lean eight_tick_forces_D3 · echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    we leverage recent advancements in scaling context window size, particularly Blockwise RingAttention (Liu et al., 2024; Liu and Abbeel, 2023), a technique that scales context size without approximations or overheads, enabling efficient training on long sequences... progressively increase the effective context length of the model across 5 stages: 32K, 128K, 256K, 512K, and 1M

  • IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean washburn_uniqueness_aczel · unclear

    We open-source a family of 7B parameter models capable of processing long text documents and videos exceeding 1M tokens, setting new benchmarks in language retrieval and new capabilities in long video understanding

  • IndisputableMonolith/Foundation/LawOfExistence.lean law_of_existence · unclear

    To address the scarcity of long-form conversational datasets, we developed a model-based question-answering technique, where a short-context model generates training data from books

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Learning to (Learn at Test Time): RNNs with Expressive Hidden States

    cs.LG 2024-07 conditional novelty 8.0

    TTT layers treat the hidden state as a trainable model updated at test time, allowing linear-complexity sequence models to scale perplexity reduction with context length unlike Mamba.

  2. RULER: What's the Real Context Size of Your Long-Context Language Models?

    cs.CL 2024-04 accept novelty 8.0

    RULER shows most long-context LMs drop sharply in performance on complex tasks as length and difficulty increase, with only half maintaining results at 32K tokens.

  3. Exploring Spatial Intelligence from a Generative Perspective

    cs.CV 2026-04 unverdicted novelty 7.0

    Fine-tuning multimodal models on a new synthetic spatial benchmark improves generative spatial compliance on real and synthetic tasks and transfers to better spatial understanding.

  4. Seeing the Scene Matters: Revealing Forgetting in Video Understanding Models with a Scene-Aware Long-Video Benchmark

    cs.CV 2026-03 unverdicted novelty 7.0

    SceneBench shows VLMs lose accuracy on scene-level questions in long videos due to forgetting, and Scene-RAG retrieval improves performance by 2.5%.

  5. A Unified and Controllable Framework for Layered Image Generation with Visual Effects

    cs.CV 2026-01 unverdicted novelty 7.0

    LASAGNA produces layered images with integrated visual effects in a single pass, enabling drift-free edits via alpha compositing while releasing a 48K dataset and a 242-sample benchmark.

  6. SpaceR: Reinforcing MLLMs in Video Spatial Reasoning

    cs.CV 2025-04 unverdicted novelty 7.0

    SpaceR uses a new verifiable dataset and map-imagination-augmented RLVR to reach SOTA spatial reasoning accuracy in MLLMs, exceeding GPT-4o on VSI-Bench.

  7. PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction

    cs.CV 2024-10 accept novelty 7.0

    PyramidDrop accelerates LVLMs by staged, similarity-based dropping of visual tokens that become redundant in deeper layers, delivering 40% faster training and 55% lower inference cost with comparable accuracy.

  8. Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation

    cs.CV 2024-10 unverdicted novelty 7.0

    Janus decouples visual encoding into task-specific pathways inside a single autoregressive transformer to unify multimodal understanding and generation while outperforming earlier unified models.

  9. MLVU: Benchmarking Multi-task Long Video Understanding

    cs.CV 2024-06 conditional novelty 7.0

    MLVU is a new benchmark for long video understanding that uses extended videos across diverse genres and multi-task evaluations, revealing that current MLLMs struggle significantly and degrade sharply with longer durations.

  10. Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

    cs.LG 2024-05 unverdicted novelty 7.0

    Transformers and SSMs are unified through structured state space duality, producing a 2-8X faster Mamba-2 model that remains competitive with Transformers.

  11. MUSE: Resolving Manifold Misalignment in Visual Tokenization via Topological Orthogonality

    cs.CV 2026-05 unverdicted novelty 6.0

    MUSE decouples reconstruction and semantic learning in visual tokenization via topological orthogonality, yielding SOTA generation quality and improved semantic performance over its teacher model.

  12. World2VLM: Distilling World Model Imagination into VLMs for Dynamic Spatial Reasoning

    cs.CV 2026-04 unverdicted novelty 6.0

    Distilling view-consistent future views and action-outcome supervision from a generative world model into a VLM via two-stage post-training improves dynamic spatial reasoning on SAT-Real, VSI-Bench and similar benchma...

  13. Refinement via Regeneration: Enlarging Modification Space Boosts Image Refinement in Unified Multimodal Models

    cs.CV 2026-04 unverdicted novelty 6.0

    Refinement via Regeneration (RvR) reformulates image refinement in unified multimodal models as conditional regeneration using prompt and semantic tokens from the initial image, yielding higher alignment scores than e...

  14. Seeing Without Eyes: 4D Human-Scene Understanding from Wearable IMUs

    cs.CV 2026-04 unverdicted novelty 6.0

    IMU-to-4D uses wearable IMU data and repurposed LLMs to predict coherent 4D human motion plus coarse scene structure, outperforming cascaded state-of-the-art pipelines in temporal stability.

  15. LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding

    cs.CV 2024-10 unverdicted novelty 6.0

    LongVU adaptively compresses long video tokens using DINOv2-based frame deduplication, text-guided cross-modal selection, and temporal spatial reduction to improve video-language understanding in MLLMs with minimal de...

  16. VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation

    cs.CV 2024-09 unverdicted novelty 6.0

    VILA-U unifies visual understanding and generation inside one autoregressive next-token prediction model, removing separate diffusion components while claiming near state-of-the-art results.

  17. SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation

    cs.CV 2024-04 unverdicted novelty 6.0

    SEED-X is a unified multimodal foundation model that handles multi-granularity visual semantics for both comprehension and generation across arbitrary image sizes and ratios.

  18. SnapKV: LLM Knows What You are Looking for Before Generation

    cs.CL 2024-04 conditional novelty 6.0

    SnapKV selects clustered important KV positions per attention head from an observation window at the prompt end, yielding 3.6x faster generation and 8.2x better memory efficiency on 16K-token inputs with comparable pe...

  19. Emerging Properties in Unified Multimodal Pretraining

    cs.CV 2025-05 unverdicted novelty 5.0

    BAGEL is a unified decoder-only model that develops emerging complex multimodal reasoning abilities after pretraining on large-scale interleaved data and outperforms prior open-source unified models.

  20. Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

    cs.AI 2025-01 conditional novelty 3.0

    Scaling data, model size, and training optimization on the Janus architecture yields better multimodal understanding and more stable, instruction-following text-to-image generation.

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · cited by 20 Pith papers · 22 internal anchors

  1. [1]

    Jointly training large autoregressive multimodal models

    Emanuele Aiello, Lili Yu, Yixin Nie, Armen Aghajanyan, and Barlas Oguz. Jointly training large autoregressive multimodal models. arXiv preprint arXiv:2309.15564,

  2. [2]

    OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models

    Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, et al. Openflamingo: An open-source framework for training large autoregressive vision-language models. arXiv preprint arXiv:2308.01390,

  3. [3]

    Longformer: The Long-Document Transformer

    Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150,

  4. [4]

    Striped Attention: Faster Ring Attention for Causal Transformers

    William Brandon, Aniruddha Nrusimha, Kevin Qian, Zachary Ankner, Tian Jin, Zhiye Song, and Jonathan Ragan-Kelley. Striped attention: Faster ring attention for causal transformers. arXiv preprint arXiv:2311.09431,

  5. [5]

    Language Models are Few-Shot Learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901,

  6. [6]

    ShareGPT4V: Improving Large Multi-Modal Models with Better Captions

    Lin Chen, Jisong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. Sharegpt4v: Improving large multi-modal models with better captions. arXiv preprint arXiv:2311.12793, 2023a. Shouyuan Chen, Sherman Wong, Liangjian Chen, and Yuandong Tian. Extending context window of large language models via positional interpolation. arXiv preprint.

  7. [7]

    Generating Long Sequences with Sparse Transformers

    Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509,

  8. [8]

    FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

    Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems, 35:16344–16359,

  9. [9]

    Enhancing Chat Language Models by Scaling High-quality Instructional Conversations

    Ning Ding, Yulin Chen, Bokai Xu, Yujia Qin, Zhi Zheng, Shengding Hu, Zhiyuan Liu, Maosong Sun, and Bowen Zhou. Enhancing chat language models by scaling high-quality instructional conversations. arXiv preprint arXiv:2305.14233,

  10. [10]

    Fully Sharded Data Parallel: faster AI training with fewer GPUs — engineering.fb.com

    Facebook. Fully Sharded Data Parallel: faster AI training with fewer GPUs — engineering.fb.com. https://engineering.fb.com/2021/07/15/open-source/fsdp/,

  11. [11]

    Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

    Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. arXiv preprint arXiv:2405.21075,

  12. [12]

    The Pile: An 800GB Dataset of Diverse Text for Language Modeling

    Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027,

  13. [13]

    World Models

    David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122,

  14. [14]

    Classifier-Free Diffusion Guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598,

  15. [15]

    Imagen Video: High Definition Video Generation with Diffusion Models

    Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022a. Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J. Fle...

  16. [16]

    Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?

    Jinhyuk Lee, Anthony Chen, Zhuyun Dai, Dheeru Dua, Devendra Singh Sachan, Michael Boratko, Yi Luan, Sébastien MR Arnold, Vincent Perot, Siddharth Dalmia, et al. Can long-context language models subsume retrieval, rag, sql, and more? arXiv preprint arXiv:2406.13121,

  17. [17]

    Lightseq: Sequence level parallelism for distributed training of long context transformers

    Dacheng Li, Rulin Shao, Anze Xie, Eric P Xing, Joseph E Gonzalez, Ion Stoica, Xuezhe Ma, and Hao Zhang. Lightseq: Sequence level parallelism for distributed training of long context transformers. arXiv preprint arXiv:2310.03294,

  18. [18]

    Sequence Parallelism: Making 4D Parallelism Possible

    Shenggui Li, Fuzhao Xue, Yongbin Li, and Yang You. Sequence parallelism: Making 4d parallelism possible. arXiv preprint arXiv:2105.13120,

  19. [19]

    Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

    Bin Lin, Bin Zhu, Yang Ye, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alignment before projection. arXiv preprint arXiv:2311.10122,

  20. [20]

    Improved Baselines with Visual Instruction Tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023a. Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023b. Xiaoran Liu, Hang Yan, Shuo Zhang, Chenxin An, Xipeng Qiu, and Dahua Lin. Scaling laws ...

  21. [21]

    Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

    Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. arXiv preprint arXiv:2306.05424,

  22. [22]

    aMUSEd: An Open MUSE Reproduction

    Suraj Patil, William Berman, Robin Rombach, and Patrick von Platen. amused: An open muse reproduction. arXiv preprint arXiv:2401.01808,

  23. [23]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, Timothy Lillicrap, Jean-baptiste Alayrac, Radu Soricut, Angeliki Lazaridou, Orhan Firat, Julian Schrittwieser, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530,

  24. [24]

    Code Llama: Open Foundation Models for Code

    Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950,

  25. [25]

    Alpaca: A strong, replicable instruction-following model

    Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. Alpaca: A strong, replicable instruction-following model. Stanford Center for Research on Foundation Models. https://crfm.stanford.edu/2023/03/13/alpaca.html, 3(6):7,

  26. [26]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805,

  27. [27]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a. Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay ...

  28. [28]

    Phenaki: Variable Length Video Generation from Open Domain Textual Description

    Ruben Villegas, Mohammad Babaeizadeh, Pieter-Jan Kindermans, Hernan Moraldo, Han Zhang, Mohammad Taghi Saffar, Santiago Castro, Julius Kunze, and Dumitru Erhan. Phenaki: Variable length video generation from open domain textual description. arXiv preprint arXiv:2210.02399,

  29. [29]

    InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation

    Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinhao Li, Guo Chen, Xinyuan Chen, Yaohui Wang, et al. Internvid: A large-scale video-text dataset for multimodal understanding and generation. arXiv preprint arXiv:2307.06942,

  30. [30]

    HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering

    Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W Cohen, Ruslan Salakhutdinov, and Christopher D Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600,

  31. [31]

    Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

    Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, et al. Scaling autoregressive models for content- rich text-to-image generation. arXiv preprint arXiv:2206.10789, 2(3):5,

  32. [32]

    LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention

    Renrui Zhang, Jiaming Han, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan Lu, Hongsheng Li, Peng Gao, and Yu Qiao. Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199,

  33. [33]

    Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv preprint arXiv:2306.05685,
