Recognition: 3 Lean theorem links
World Model on Million-Length Video And Language With Blockwise RingAttention
Pith reviewed 2026-05-16 06:32 UTC · model grok-4.3
The pith
7B parameter models process video and language sequences exceeding 1 million tokens.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We provide a comprehensive exploration of the full development process for producing 1M context language models and video-language models, setting new benchmarks in language retrieval and new capabilities in long video understanding. We detail our long context data curation process, progressive context extension from 4K to 1M tokens, and present an efficient open-source implementation for scalable training on long sequences. Additionally, we open-source a family of 7B parameter models capable of processing long text documents and videos exceeding 1M tokens.
What carries the argument
Blockwise RingAttention, a mechanism that processes attention in blocks arranged in a ring to enable memory-efficient training on sequences up to one million tokens.
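To make the mechanism concrete, here is a minimal single-process sketch of blockwise ring attention in Python/NumPy, not the paper's implementation: the ring of devices is simulated by a loop, each "device" owns one query block, KV blocks hop one neighbor per step, and softmax is accumulated online so the result is exact full attention. Block count, shapes, and the non-causal setting are illustrative assumptions.

```python
import numpy as np

def ring_attention(q_blocks, k_blocks, v_blocks):
    """Simulate Blockwise RingAttention on a single machine.

    Each of the R ring "devices" owns one query block; key/value
    blocks hop around the ring so every query block eventually sees
    every KV block, while only one KV block is resident per device
    at a time. Softmax is accumulated online (running max and
    denominator), so the result is exact, not an approximation.
    """
    R = len(q_blocks)
    d = q_blocks[0].shape[-1]
    outputs = []
    for i in range(R):                         # "device" i
        q = q_blocks[i]
        m = np.full(q.shape[0], -np.inf)       # running max logit
        den = np.zeros(q.shape[0])             # running softmax denominator
        acc = np.zeros_like(q)                 # unnormalized output
        for hop in range(R):                   # one KV block arrives per hop
            j = (i + hop) % R
            s = q @ k_blocks[j].T / np.sqrt(d)
            m_new = np.maximum(m, s.max(axis=-1))
            p = np.exp(s - m_new[:, None])
            scale = np.exp(m - m_new)          # rescale old accumulators
            den = den * scale + p.sum(axis=-1)
            acc = acc * scale[:, None] + p @ v_blocks[j]
            m = m_new
        outputs.append(acc / den[:, None])
    return np.concatenate(outputs)

# Usage: split a 1024-token toy sequence into 8 blocks of 128.
rng = np.random.default_rng(0)
qkv = [np.array_split(rng.normal(size=(1024, 64)), 8) for _ in range(3)]
out = ring_attention(*qkv)                     # shape (1024, 64)
```

On real hardware, each hop overlaps sending the current KV block to the next ring neighbor with computing attention on the block just received, which is how the communication cost is hidden.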
If this is right
- Models achieve new state-of-the-art results on long-document language retrieval benchmarks.
- The same architecture supports previously unseen capabilities in understanding videos that span millions of tokens.
- Open-source release of the 7B models and training code allows direct replication and further scaling experiments.
- Progressive context extension combined with the attention method keeps training feasible on standard hardware clusters.
Where Pith is reading between the lines
- These models could act as initial world models for agents that must reason over hour-long video streams or book-length texts.
- The training recipe might transfer to other modalities such as audio or 3D scene sequences without major redesign.
- Future tests could measure whether the same approach sustains quality at 10M tokens or whether new bottlenecks appear.
Load-bearing premise
That the combination of Blockwise RingAttention and progressive context extension from 4K to 1M tokens lets models use the entire context effectively without prohibitive compute costs or performance collapse.
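A minimal sketch of what such a progressive-extension loop could look like. The stage lengths follow the 32K-to-1M ladder the paper describes (starting from a 4K-context base model); the RoPE theta values, step counts, and the `set_rope_theta`/`train_fn` interfaces are illustrative assumptions, not the paper's hyperparameters.

```python
# Illustrative progressive context-extension schedule (assumed values).
STAGES = [
    # (context_length, rope_theta, train_steps) -- placeholders only
    (32_768,     1_000_000, 2000),
    (131_072,    5_000_000, 1000),
    (262_144,   10_000_000,  500),
    (524_288,   25_000_000,  250),
    (1_048_576, 50_000_000,  100),
]

def extend_context(model, train_fn):
    """Warm-start each stage from the previous one so the model
    adapts to longer sequences gradually rather than all at once."""
    for ctx_len, theta, steps in STAGES:
        model.set_rope_theta(theta)   # hypothetical helper: widens positional encoding
        model = train_fn(model, seq_len=ctx_len, steps=steps)
    return model
```

The design point is that each stage amortizes adaptation: most gradient steps happen at cheap short lengths, and only a small tail of training pays the full 1M-token cost.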
What would settle it
A controlled evaluation comparing the released 7B models against the same architecture trained only to 4K context on 1M-token video retrieval and understanding tasks. No improvement would refute the load-bearing premise; a clear, position-robust gain would support it.
original abstract
Enabling long-context understanding remains a key challenge in scaling existing sequence models -- a crucial component in developing generally intelligent models that can process and operate over long temporal horizons that potentially consist of millions of tokens. In this paper, we aim to address these challenges by providing a comprehensive exploration of the full development process for producing 1M context language models and video-language models, setting new benchmarks in language retrieval and new capabilities in long video understanding. We detail our long context data curation process, progressive context extension from 4K to 1M tokens, and present an efficient open-source implementation for scalable training on long sequences. Additionally, we open-source a family of 7B parameter models capable of processing long text documents and videos exceeding 1M tokens.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to develop and open-source a family of 7B-parameter models for language and video-language tasks that process sequences exceeding 1M tokens. It details a data curation pipeline, progressive context extension from 4K to 1M tokens, an efficient Blockwise RingAttention implementation for scalable training, and reports new benchmarks in long-context language retrieval together with novel capabilities in long video understanding.
Significance. If the central claim holds, the work would be significant for long-context sequence modeling: it supplies reproducible open-source 7B models, an efficient training recipe, and empirical evidence that 1M-token video-language understanding is feasible at this scale, directly addressing the temporal-horizon challenge highlighted in the abstract.
major comments (2)
- [§4 (Experiments), Table 2] The language-retrieval and video-understanding benchmarks report aggregate metrics but contain no ablations that place critical evidence at sequence extremes (e.g., retrieval accuracy when the relevant information is located in the first 100k tokens of a 1M-token video-language sequence). Without such tests the headline claim that the models “exploit information distributed across >1M tokens” rests on an unverified assumption about attention reach.
- [§3.2 (Blockwise RingAttention), §3.3 (Progressive Context Extension)] The description of the 4K→1M extension schedule does not quantify attention sparsity, memory scaling, or per-block gradient norms at the full 1M length; the claim that the mechanism “enables effective utilization of the full context length without prohibitive computational costs” therefore lacks the load-bearing measurements needed to support the scaling argument.
minor comments (2)
- [Figure 3] Figure 3 caption and axis labels use inconsistent token-length notation (1M vs. 1024k); standardize throughout.
- [Abstract] The open-source repository link is mentioned only in the abstract; add an explicit footnote or appendix entry with the exact commit hash and license.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the major comments point-by-point below and will revise the manuscript accordingly to strengthen the empirical support for our claims.
point-by-point responses
- Referee [§4 (Experiments), Table 2]: The language-retrieval and video-understanding benchmarks report aggregate metrics but contain no ablations that place critical evidence at sequence extremes (e.g., retrieval accuracy when the relevant information is located in the first 100k tokens of a 1M-token video-language sequence). Without such tests the headline claim that the models “exploit information distributed across >1M tokens” rests on an unverified assumption about attention reach.
  Authors: We agree that explicit position-aware ablations would provide more direct evidence for full-context utilization. Our existing retrieval benchmarks already involve 1M-token sequences with information distributed across varying positions, but we will add targeted ablations in the revision (e.g., accuracy when the query-relevant content is placed in the first 100k tokens, the middle, or the final 100k tokens of 1M-token video-language inputs) to directly address this concern; a sketch of such a position-controlled harness follows these responses. Revision planned: yes.
- Referee [§3.2, §3.3]: The description of the 4K→1M extension schedule does not quantify attention sparsity, memory scaling, or per-block gradient norms at the full 1M length; the claim that the mechanism “enables effective utilization of the full context length without prohibitive computational costs” therefore lacks the load-bearing measurements needed to support the scaling argument.
  Authors: We will expand §3.2 (Blockwise RingAttention) and §3.3 (Progressive Context Extension) in the revision to include the requested quantitative details: attention sparsity ratios observed at 1M length, memory scaling curves across context lengths, and per-block gradient norm statistics collected during progressive extension training. These measurements will better substantiate the efficiency claims. Revision planned: yes.
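The position-controlled harness referenced in the first response could look like the following sketch. The needle template, filler handling, and `model.generate` interface are illustrative assumptions, not the paper's evaluation code; the point is only to measure retrieval accuracy separately for early, middle, and late needle positions.

```python
import random

# Position-aware "needle" ablation sketch (assumed interfaces).
NEEDLE = "The magic number for San Francisco is {value}."
QUESTION = "What is the magic number for San Francisco?"

def needle_accuracy(model, filler_tokens, ctx_len=1_000_000,
                    positions=(0.05, 0.5, 0.95), trials=20):
    results = {}
    for pos in positions:
        hits = 0
        for _ in range(trials):
            value = str(random.randint(1_000_000, 9_999_999))
            at = int(pos * ctx_len)                 # insertion offset
            ctx = (filler_tokens[:at]
                   + NEEDLE.format(value=value).split()
                   + filler_tokens[at:ctx_len])
            answer = model.generate(" ".join(ctx) + "\n" + QUESTION)
            hits += value in answer                 # exact-match check
        results[pos] = hits / trials                # accuracy per bucket
    return results
```

A flat accuracy curve across the three buckets would support full-context utilization; a sharp drop at early positions would indicate the "lost in the middle" failure the referee is probing for.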
Circularity Check
No circularity; long-context claims rest on empirical training and implementation details
full rationale
The paper describes a training pipeline involving data curation, progressive context extension from 4K to 1M tokens, and an open-source Blockwise RingAttention implementation for 7B models. These steps are presented as engineering and optimization procedures whose effectiveness is evaluated through benchmarks, not derived by redefining inputs or fitting parameters that are then relabeled as predictions. No self-citation chains, uniqueness theorems, or ansatzes reduce the central claims to tautologies. The reported capabilities in language retrieval and video understanding are treated as outcomes of the described process rather than forced by construction.
Axiom & Free-Parameter Ledger
free parameters (1)
- progressive context schedule
axioms (1)
- Domain assumption: Blockwise RingAttention computes attention efficiently over million-token sequences.
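As a sanity check on this assumption, here is back-of-the-envelope arithmetic for attention activation memory at 1M tokens. The hidden size, device count, and fp16 precision are illustrative assumptions, not measurements from the paper.

```python
# Rough memory comparison: full attention vs. blockwise ring attention.
def attn_score_memory(seq_len, d_model, n_devices, bytes_per=2):
    block = seq_len // n_devices
    full = seq_len * seq_len * bytes_per   # full S x S score matrix
    ring = block * block * bytes_per       # one block-pair of scores
    kv = 2 * block * d_model * bytes_per   # one resident K, V block
    return full, ring + kv

full, ring = attn_score_memory(1_048_576, 4096, 256)
print(f"full attention scores: {full / 2**30:.0f} GiB per layer")   # ~2048 GiB
print(f"ring attention resident: {ring / 2**20:.0f} MiB per device")  # ~96 MiB
```

The per-device footprint depends only on the block size, not the total sequence length, which is the property the axiom leans on.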
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/DimensionForcing.lean · eight_tick_forces_D3 · echoes
  ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  "we leverage recent advancements in scaling context window size, particularly Blockwise RingAttention (Liu et al., 2024; Liu and Abbeel, 2023), a technique that scales context size without approximations or overheads, enabling efficient training on long sequences... progressively increase the effective context length of the model across 5 stages: 32K, 128K, 256K, 512K, and 1M"
- IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  UNCLEAR: the relation between the paper passage and the cited Recognition theorem is ambiguous.
  "We open-source a family of 7B parameter models capable of processing long text documents and videos exceeding 1M tokens, setting new benchmarks in language retrieval and new capabilities in long video understanding"
- IndisputableMonolith/Foundation/LawOfExistence.lean · law_of_existence · unclear
  UNCLEAR: the relation between the paper passage and the cited Recognition theorem is ambiguous.
  "To address the scarcity of long-form conversational datasets, we developed a model-based question-answering technique, where a short-context model generates training data from books"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- echoes: The paper passage has the same mathematical shape or conceptual pattern as the theorem, but is not a direct formal dependency.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 20 Pith papers
- Learning to (Learn at Test Time): RNNs with Expressive Hidden States. TTT layers treat the hidden state as a trainable model updated at test time, allowing linear-complexity sequence models to scale perplexity reduction with context length, unlike Mamba.
- RULER: What's the Real Context Size of Your Long-Context Language Models? RULER shows most long-context LMs drop sharply in performance on complex tasks as length and difficulty increase, with only half maintaining results at 32K tokens.
- Exploring Spatial Intelligence from a Generative Perspective. Fine-tuning multimodal models on a new synthetic spatial benchmark improves generative spatial compliance on real and synthetic tasks and transfers to better spatial understanding.
- Seeing the Scene Matters: Revealing Forgetting in Video Understanding Models with a Scene-Aware Long-Video Benchmark. SceneBench shows VLMs lose accuracy on scene-level questions in long videos due to forgetting, and Scene-RAG retrieval improves performance by 2.5%.
- A Unified and Controllable Framework for Layered Image Generation with Visual Effects. LASAGNA produces layered images with integrated visual effects in a single pass, enabling drift-free edits via alpha compositing while releasing a 48K dataset and a 242-sample benchmark.
- SpaceR: Reinforcing MLLMs in Video Spatial Reasoning. SpaceR uses a new verifiable dataset and map-imagination-augmented RLVR to reach SOTA spatial reasoning accuracy in MLLMs, exceeding GPT-4o on VSI-Bench.
- PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction. PyramidDrop accelerates LVLMs by staged, similarity-based dropping of visual tokens that become redundant in deeper layers, delivering 40% faster training and 55% lower inference cost with comparable accuracy.
- Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation. Janus decouples visual encoding into task-specific pathways inside a single autoregressive transformer to unify multimodal understanding and generation while outperforming earlier unified models.
- MLVU: Benchmarking Multi-task Long Video Understanding. MLVU is a new benchmark for long video understanding that uses extended videos across diverse genres and multi-task evaluations, revealing that current MLLMs struggle significantly and degrade sharply with longer durations.
- Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality. Transformers and SSMs are unified through structured state space duality, producing a 2-8X faster Mamba-2 model that remains competitive with Transformers.
- MUSE: Resolving Manifold Misalignment in Visual Tokenization via Topological Orthogonality. MUSE decouples reconstruction and semantic learning in visual tokenization via topological orthogonality, yielding SOTA generation quality and improved semantic performance over its teacher model.
- World2VLM: Distilling World Model Imagination into VLMs for Dynamic Spatial Reasoning. Distilling view-consistent future views and action-outcome supervision from a generative world model into a VLM via two-stage post-training improves dynamic spatial reasoning on SAT-Real, VSI-Bench, and similar benchmarks.
- Refinement via Regeneration: Enlarging Modification Space Boosts Image Refinement in Unified Multimodal Models. Refinement via Regeneration (RvR) reformulates image refinement in unified multimodal models as conditional regeneration using prompt and semantic tokens from the initial image, yielding higher alignment scores than e...
- Seeing Without Eyes: 4D Human-Scene Understanding from Wearable IMUs. IMU-to-4D uses wearable IMU data and repurposed LLMs to predict coherent 4D human motion plus coarse scene structure, outperforming cascaded state-of-the-art pipelines in temporal stability.
- LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding. LongVU adaptively compresses long video tokens using DINOv2-based frame deduplication, text-guided cross-modal selection, and temporal spatial reduction to improve video-language understanding in MLLMs with minimal de...
- VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation. VILA-U unifies visual understanding and generation inside one autoregressive next-token prediction model, removing separate diffusion components while claiming near state-of-the-art results.
- SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation. SEED-X is a unified multimodal foundation model that handles multi-granularity visual semantics for both comprehension and generation across arbitrary image sizes and ratios.
- SnapKV: LLM Knows What You are Looking for Before Generation. SnapKV selects clustered important KV positions per attention head from an observation window at the prompt end, yielding 3.6x faster generation and 8.2x better memory efficiency on 16K-token inputs with comparable performance.
- Emerging Properties in Unified Multimodal Pretraining. BAGEL is a unified decoder-only model that develops emerging complex multimodal reasoning abilities after pretraining on large-scale interleaved data and outperforms prior open-source unified models.
- Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling. Scaling data, model size, and training optimization on the Janus architecture yields better multimodal understanding and more stable, instruction-following text-to-image generation.
Reference graph
Works this paper leans on
- [1] Emanuele Aiello, Lili Yu, Yixin Nie, Armen Aghajanyan, and Barlas Oguz. Jointly training large autoregressive multimodal models. arXiv preprint arXiv:2309.15564.
- [2] Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, et al. OpenFlamingo: An open-source framework for training large autoregressive vision-language models. arXiv preprint arXiv:2308.01390.
- [3] Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150.
- [4] William Brandon, Aniruddha Nrusimha, Kevin Qian, Zachary Ankner, Tian Jin, Zhiye Song, and Jonathan Ragan-Kelley. Striped attention: Faster ring attention for causal transformers. arXiv preprint arXiv:2311.09431.
- [5] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.
- [6] Lin Chen, Jisong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. ShareGPT4V: Improving large multi-modal models with better captions. arXiv preprint arXiv:2311.12793; Shouyuan Chen, Sherman Wong, Liangjian Chen, and Yuandong Tian. Extending context window of large language models via positional interpolation. arXiv preprint...
- [7] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509.
- [8] Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. Advances in Neural Information Processing Systems, 35:16344–16359.
- [9] Ning Ding, Yulin Chen, Bokai Xu, Yujia Qin, Zhi Zheng, Shengding Hu, Zhiyuan Liu, Maosong Sun, and Bowen Zhou. Enhancing chat language models by scaling high-quality instructional conversations. arXiv preprint arXiv:2305.14233.
- [10] Facebook. Fully Sharded Data Parallel: faster AI training with fewer GPUs. https://engineering.fb.com/2021/07/15/open-source/fsdp/.
- [11] Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-MME: The first-ever comprehensive evaluation benchmark of multi-modal LLMs in video analysis. arXiv preprint arXiv:2405.21075.
- [12] Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027.
- [13] David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122.
- [14] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598.
- [15] Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P. Kingma, Ben Poole, Mohammad Norouzi, David J. Fleet, et al. Imagen Video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303; Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J. Fleet...
- [16] Jinhyuk Lee, Anthony Chen, Zhuyun Dai, Dheeru Dua, Devendra Singh Sachan, Michael Boratko, Yi Luan, Sébastien M. R. Arnold, Vincent Perot, Siddharth Dalmia, et al. Can long-context language models subsume retrieval, RAG, SQL, and more? arXiv preprint arXiv:2406.13121.
- [17] Dacheng Li, Rulin Shao, Anze Xie, Eric P. Xing, Joseph E. Gonzalez, Ion Stoica, Xuezhe Ma, and Hao Zhang. LightSeq: Sequence level parallelism for distributed training of long context transformers. arXiv preprint arXiv:2310.03294.
- [18] Shenggui Li, Fuzhao Xue, Yongbin Li, and Yang You. Sequence parallelism: Making 4D parallelism possible. arXiv preprint arXiv:2105.13120.
- [19] Bin Lin, Bin Zhu, Yang Ye, Munan Ning, Peng Jin, and Li Yuan. Video-LLaVA: Learning united visual representation by alignment before projection. arXiv preprint arXiv:2311.10122.
- [20] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744; Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485; Xiaoran Liu, Hang Yan, Shuo Zhang, Chenxin An, Xipeng Qiu, and Dahua Lin. Scaling laws ...
- [21] Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. Video-ChatGPT: Towards detailed video understanding via large vision and language models. arXiv preprint arXiv:2306.05424.
- [22] Suraj Patil, William Berman, Robin Rombach, and Patrick von Platen. aMUSEd: An open MUSE reproduction. arXiv preprint arXiv:2401.01808.
- [23] Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, Timothy Lillicrap, Jean-Baptiste Alayrac, Radu Soricut, Angeliki Lazaridou, Orhan Firat, Julian Schrittwieser, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530.
- [24] Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al. Code Llama: Open foundation models for code. arXiv preprint arXiv:2308.12950.
- [25] Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Alpaca: A strong, replicable instruction-following model. Stanford Center for Research on Foundation Models, https://crfm.stanford.edu/2023/03/13/alpaca.html, 3(6):7.
- [26] Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, et al. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805.
- [27] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971; Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay ...
- [28] Ruben Villegas, Mohammad Babaeizadeh, Pieter-Jan Kindermans, Hernan Moraldo, Han Zhang, Mohammad Taghi Saffar, Santiago Castro, Julius Kunze, and Dumitru Erhan. Phenaki: Variable length video generation from open domain textual description. arXiv preprint arXiv:2210.02399.
- [29] Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinhao Li, Guo Chen, Xinyuan Chen, Yaohui Wang, et al. InternVid: A large-scale video-text dataset for multimodal understanding and generation. arXiv preprint arXiv:2307.06942.
- [30] Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600.
- [31] Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, et al. Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789, 2(3):5.
- [32] Renrui Zhang, Jiaming Han, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan Lu, Hongsheng Li, Peng Gao, and Yu Qiao. LLaMA-Adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199.
- [33] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. arXiv preprint arXiv:2306.05685.