Recognition: 3 theorem links
Language Is Not All You Need: Aligning Perception with Language Models
Pith reviewed 2026-05-15 18:28 UTC · model grok-4.3
The pith
Kosmos-1 learns perception and language jointly from web-scale interleaved text and images, then performs zero-shot and few-shot tasks across modalities without any finetuning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Kosmos-1 is a Multimodal Large Language Model trained on web-scale multimodal corpora containing arbitrarily interleaved text and images, image-caption pairs, and text data; the resulting model achieves strong zero-shot and few-shot performance on language tasks, perception-language tasks, and vision tasks specified via text, with no gradient updates or task-specific finetuning required.
What carries the argument
Training a transformer on arbitrarily interleaved text-image sequences so that the same parameters support in-context learning across modalities.
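To make the data format concrete, here is a minimal sketch (not the paper's implementation) of how an interleaved document could be flattened into a single sequence for a decoder-only model; the boundary tokens <image> and </image> and the tokenizer / visual_encoder interfaces are illustrative assumptions rather than Kosmos-1's exact API.

```python
# Hedged sketch: flatten an interleaved [text, image, text, ...] document into
# one embedding sequence for a causal transformer. Token names and the
# tokenizer / visual_encoder interfaces are assumptions, not Kosmos-1's API.
from typing import List, Union

BOI, EOI = "<image>", "</image>"  # assumed image-boundary tokens

def flatten_document(segments: List[Union[str, object]],
                     tokenizer,        # assumed: .encode(str) -> ids, .embed(id) -> vector, .token_to_id(str) -> id
                     visual_encoder):  # assumed: image -> fixed-length list of embeddings
    """Return one embedding sequence mixing text tokens and image patch embeddings."""
    embeddings = []
    for seg in segments:
        if isinstance(seg, str):
            embeddings.extend(tokenizer.embed(t) for t in tokenizer.encode(seg))
        else:  # an image object
            embeddings.append(tokenizer.embed(tokenizer.token_to_id(BOI)))
            embeddings.extend(visual_encoder(seg))  # e.g. a few dozen patch embeddings
            embeddings.append(tokenizer.embed(tokenizer.token_to_id(EOI)))
    # The usual next-token objective is applied over the text positions, so one
    # set of transformer parameters learns to condition on both modalities in context.
    return embeddings
```

Few-shot prompting then amounts to prepending worked examples, whether text or images, to this same sequence at inference time.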
If this is right
- Knowledge transfers in both directions: language pretraining improves multimodal performance and multimodal training improves language performance.
- Document images can be fed directly for OCR-free NLP tasks such as question answering or summarization.
- Image recognition can be performed by supplying only a textual description of the desired classes (a code sketch follows this list).
- A single set of weights can handle multimodal dialogue that mixes text and images in the same conversation.
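The description-based recognition capability from the third bullet can be made concrete with a small sketch; the prompt wording and the score_continuation helper are assumptions for illustration, not the paper's protocol.

```python
# Hedged sketch of zero-shot "image recognition with descriptions": the classes
# exist only as text, and the model picks whichever class name it finds most
# likely as a continuation after seeing the image. score_continuation is a
# hypothetical log-likelihood helper, not an API from the paper.
def classify_by_description(image, class_descriptions, score_continuation):
    """class_descriptions: mapping from class name to a one-sentence description."""
    prompt = "Question: classify the image using one of the following categories.\n"
    for name, desc in class_descriptions.items():
        prompt += f"- {name}: {desc}\n"
    prompt += "Answer:"
    scores = {name: score_continuation(image, prompt, " " + name)
              for name in class_descriptions}
    return max(scores, key=scores.get)

# Example call (descriptions invented for illustration):
# classify_by_description(img,
#     {"golden retriever": "a medium-sized dog with a dense golden coat",
#      "tabby cat": "a domestic cat with striped markings"},
#     score_continuation=model_log_likelihood)
```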
Where Pith is reading between the lines
- The approach suggests that separate vision encoders may become unnecessary if interleaved training data is large enough.
- Similar training recipes could be tested on video or audio sequences to check whether the same model architecture scales to additional modalities.
- The Raven IQ dataset provides a concrete way to compare nonverbal reasoning across future multimodal models without relying on language mediation.
Load-bearing premise
Web-scale multimodal data already contains enough aligned signal that one model can acquire general cross-modal capabilities that transfer to new tasks without any adaptation.
What would settle it
Kosmos-1 scores no higher than a text-only language model on visual question answering when images are provided as input.
Original abstract
A big convergence of language, multimodal perception, action, and world modeling is a key step toward artificial general intelligence. In this work, we introduce Kosmos-1, a Multimodal Large Language Model (MLLM) that can perceive general modalities, learn in context (i.e., few-shot), and follow instructions (i.e., zero-shot). Specifically, we train Kosmos-1 from scratch on web-scale multimodal corpora, including arbitrarily interleaved text and images, image-caption pairs, and text data. We evaluate various settings, including zero-shot, few-shot, and multimodal chain-of-thought prompting, on a wide range of tasks without any gradient updates or finetuning. Experimental results show that Kosmos-1 achieves impressive performance on (i) language understanding, generation, and even OCR-free NLP (directly fed with document images), (ii) perception-language tasks, including multimodal dialogue, image captioning, visual question answering, and (iii) vision tasks, such as image recognition with descriptions (specifying classification via text instructions). We also show that MLLMs can benefit from cross-modal transfer, i.e., transfer knowledge from language to multimodal, and from multimodal to language. In addition, we introduce a dataset of Raven IQ test, which diagnoses the nonverbal reasoning capability of MLLMs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Kosmos-1, a multimodal large language model trained from scratch on web-scale corpora of interleaved text-image data, image-caption pairs, and text. It claims strong zero-shot and few-shot performance (via in-context learning and instruction following) across language understanding/generation, OCR-free document tasks, multimodal dialogue, image captioning, VQA, and vision tasks such as instruction-based image recognition, without any gradient updates or finetuning. The work also reports cross-modal transfer benefits and introduces a new Raven-style IQ test dataset to diagnose nonverbal reasoning in MLLMs.
Significance. If the empirical claims hold after addressing evaluation details, the results would demonstrate that web-scale aligned multimodal pretraining can produce general cross-modal capabilities that transfer to held-out tasks, supporting the broader thesis that perception-language alignment is a key step toward AGI-like convergence of modalities. The new Raven IQ dataset adds a useful diagnostic for nonverbal reasoning that is not language-mediated.
major comments (3)
- §4 (Experimental Setup) and §5 (Results): The central performance claims (e.g., zero-shot VQA, captioning, OCR-free NLP) are presented without explicit quantitative tables showing exact metrics, standard baselines (Flamingo, BLIP-2, etc.), error bars, or data exclusion criteria. This makes it impossible to judge whether the reported gains reflect genuine generalization or post-hoc prompt selection.
- §3.2 (Training Data) and §5.3 (Cross-modal Transfer): No decontamination statistics or overlap analysis are provided between the web-scale training corpora and common evaluation benchmarks (VQA v2, COCO, Raven matrices). Given that web data frequently contains near-duplicates of these benchmarks, the zero-shot transfer claims rest on an untested assumption that performance arises from alignment rather than memorization.
- §6 (Raven IQ Dataset): The new dataset is introduced as a diagnostic for nonverbal reasoning, but the paper provides no details on construction protocol, human validation, or controls for language leakage (e.g., textual descriptions of matrices). This is load-bearing for the claim that MLLMs can be evaluated on purely perceptual reasoning.
minor comments (3)
- Figure 1 and §3.1: The model architecture diagram and the description of the visual encoder + LLM integration use inconsistent notation for the special tokens (e.g., <image> vs. [IMG]); standardize and add a precise tokenization equation (a sketch of such an equation follows this list).
- Throughout §5: All reported numbers should include the exact prompt templates used and the number of in-context examples; several tables omit these details, reducing reproducibility.
- Related Work: Add explicit comparisons to concurrent MLLMs (Flamingo, PaLM-E) in the introduction and results tables rather than only in passing.
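For concreteness, the tokenization equation the first minor comment asks for could take the following form; the notation (boundary tokens, embedding matrix E, visual encoder V, projection W) is assumed here rather than taken from the paper. For a document with text tokens t_1, ..., t_n and one image x inserted after position a, the input embedding sequence could be written as

\[
H_0 = \big[\, E(t_1), \ldots, E(t_a),\; E(\texttt{<image>}),\; V(\mathbf{x})\,W,\; E(\texttt{</image>}),\; E(t_{a+1}), \ldots, E(t_n) \,\big],
\]

where E embeds discrete tokens, V(x) yields a fixed number of patch embeddings, W projects them into the language model's hidden size, and the decoder-only transformer is trained with the usual next-token objective over the text positions.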
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below and describe the revisions we will make to strengthen the manuscript.
Point-by-point responses
-
Referee: §4 (Experimental Setup) and §5 (Results): The central performance claims (e.g., zero-shot VQA, captioning, OCR-free NLP) are presented without explicit quantitative tables showing exact metrics, standard baselines (Flamingo, BLIP-2, etc.), error bars, or data exclusion criteria. This makes it impossible to judge whether the reported gains reflect genuine generalization or post-hoc prompt selection.
Authors: We agree that the results section would benefit from clearer quantitative presentation. In the revised manuscript we will add comprehensive tables in §5 that report exact metrics for every task, direct comparisons to standard baselines including Flamingo and BLIP-2, and explicit statements of evaluation protocols and any data exclusion criteria. Because the evaluations are single-run zero-shot and few-shot settings, we will note the absence of error bars and describe our prompt selection procedure to address concerns about post-hoc tuning. revision: yes
-
Referee: §3.2 (Training Data) and §5.3 (Cross-modal Transfer): No decontamination statistics or overlap analysis are provided between the web-scale training corpora and common evaluation benchmarks (VQA v2, COCO, Raven matrices). Given that web data frequently contains near-duplicates of these benchmarks, the zero-shot transfer claims rest on an untested assumption that performance arises from alignment rather than memorization.
Authors: We acknowledge the importance of decontamination for zero-shot claims. Given the scale of the training corpora, exhaustive overlap analysis is computationally prohibitive; however, we will add a new subsection in §3.2 describing the filtering steps we applied to remove known benchmark duplicates and will report any available overlap statistics. We will also clarify that the Raven dataset is newly constructed and therefore free of training overlap. While we cannot provide a complete decontamination audit, the cross-modal transfer results and performance on held-out tasks support generalization beyond simple memorization. revision: partial
-
Referee: §6 (Raven IQ Dataset): The new dataset is introduced as a diagnostic for nonverbal reasoning, but the paper provides no details on construction protocol, human validation, or controls for language leakage (e.g., textual descriptions of matrices). This is load-bearing for the claim that MLLMs can be evaluated on purely perceptual reasoning.
Authors: We thank the referee for highlighting this gap. In the revised §6 we will provide a detailed construction protocol, including how the matrices were procedurally generated, the human validation process used to ensure quality and correctness, and explicit controls against language leakage (e.g., matrices are presented purely visually with no accompanying textual descriptions during evaluation). revision: yes
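One way such a purely visual evaluation could be operationalized is sketched below; the append-and-score procedure and the sequence_loglikelihood helper are assumptions for illustration, not the protocol stated in the paper.

```python
# Hedged sketch: score a Raven-style item with no language mediation by
# completing the visual sequence with each candidate panel and letting the
# model's likelihood of the completed sequence decide. sequence_loglikelihood
# is a hypothetical helper (images in, log-probability out), not a paper API.
def answer_raven_item(context_panels, candidate_panels, sequence_loglikelihood):
    """context_panels: the eight given cells of the 3x3 matrix, in reading order.
    candidate_panels: the answer options; returns the index of the chosen option."""
    best_idx, best_score = None, float("-inf")
    for idx, candidate in enumerate(candidate_panels):
        completed = list(context_panels) + [candidate]  # purely visual input
        score = sequence_loglikelihood(completed)
        if score > best_score:
            best_idx, best_score = idx, score
    return best_idx

# Accuracy over the dataset is then the fraction of items where the returned
# index matches the ground-truth answer.
```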
Circularity Check
No circularity in derivation or claims
Full rationale
The paper introduces Kosmos-1 via training on web-scale multimodal corpora and reports empirical zero-shot/few-shot results on external benchmarks (VQA, captioning, OCR-free NLP, Raven IQ). No equations, derivations, or load-bearing steps are present that reduce by construction to fitted inputs, self-definitions, or self-citation chains. All performance claims rest on held-out task evaluations rather than internal reparameterization or renamed patterns. The central premise of cross-modal transfer is an empirical observation, not a mathematical identity.
Axiom & Free-Parameter Ledger
free parameters (1)
- model scale and training hyperparameters
axioms (1)
- Domain assumption: Web-scale interleaved text-image corpora contain sufficient cross-modal alignments to induce general perception-language capabilities.
Lean theorems connected to this paper
-
IndisputableMonolith.Cost.FunctionalEquation · washburn_uniqueness_aczel (tag: unclear)
Unclear relation between the paper passage and the cited Recognition theorem.
We train Kosmos-1 from scratch on web-scale multimodal corpora, including arbitrarily interleaved text and images, image-caption pairs, and text data. We evaluate various settings, including zero-shot, few-shot, and multimodal chain-of-thought prompting, on a wide range of tasks without any gradient updates or finetuning.
-
IndisputableMonolith.Foundation.DAlembert.Inevitability · bilinear_family_forced (tag: unclear)
Unclear relation between the paper passage and the cited Recognition theorem.
Experimental results show that Kosmos-1 achieves impressive performance on (i) language understanding, generation, and even OCR-free NLP (directly fed with document images), (ii) perception-language tasks, including multimodal dialogue, image captioning, visual question answering, and (iii) vision tasks, such as image recognition with descriptions.
-
IndisputableMonolith.Foundation.DimensionForcing · dimension_forced (tag: unclear)
Unclear relation between the paper passage and the cited Recognition theorem.
In addition, we introduce a dataset of Raven IQ test, which diagnoses the nonverbal reasoning capability of MLLMs.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 18 Pith papers
-
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models
LLaVA-NeXT-Interleave unifies multi-image, video, and 3D capabilities in large multimodal models via a new 1.18M-sample interleaved dataset and benchmark, achieving leading results across those tasks while preserving ...
-
3D-VLA: A 3D Vision-Language-Action Generative World Model
3D-VLA is a new embodied foundation model that uses a 3D LLM plus aligned diffusion models to generate future images and point clouds for improved reasoning and action planning in 3D environments.
-
VideoChat: Chat-Centric Video Understanding
VideoChat integrates video models and LLMs via a learnable interface for chat-based spatiotemporal and causal video reasoning, trained on a new video-centric instruction dataset.
-
Visual Instruction Tuning
LLaVA is trained on GPT-4 generated visual instruction data to achieve 85.1% relative performance to GPT-4 on synthetic multimodal tasks and 92.53% accuracy on Science QA.
-
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
LLaMA-Adapter turns frozen LLaMA 7B into a capable instruction follower using only 1.2M new parameters and zero-init attention, matching Alpaca while extending to image-conditioned reasoning on ScienceQA and COCO.
-
Retentive Network: A Successor to Transformer for Large Language Models
RetNet is a new sequence modeling architecture that delivers parallel training, constant-time inference, and competitive language modeling performance as a potential replacement for Transformers.
-
InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation
InternVid supplies 7M videos and LLM captions to train ViCLIP, which reaches leading zero-shot action recognition and competitive retrieval performance.
-
Kosmos-2: Grounding Multimodal Large Language Models to the World
Kosmos-2 grounds text to image regions by encoding refer expressions as Markdown links to sequences of location tokens and trains on a new GrIT dataset of grounded image-text pairs.
-
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
MME is a manually annotated benchmark evaluating MLLMs on perception and cognition across 14 subtasks to avoid data leakage and support fair model comparisons.
-
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
MiniGPT-4 shows that aligning a frozen vision encoder to Vicuna via one projection layer plus a second-stage detailed-description fine-tune produces GPT-4-like vision-language abilities including detailed captions, cr...
-
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face
HuggingGPT is an agent system where ChatGPT plans and orchestrates calls to Hugging Face models to solve complex multi-modal AI tasks.
-
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action
MM-REACT uses textual prompts to let ChatGPT collaborate with external vision experts for zero-shot multimodal reasoning and action on advanced visual tasks.
-
Qwen2.5-Omni Technical Report
Qwen2.5-Omni presents a multimodal model with block-wise encoders, TMRoPE position embeddings, and a Thinker-Talker architecture that enables simultaneous text and streaming speech generation while matching text perfo...
-
A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions
The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.
-
The Rise and Potential of Large Language Model Based Agents: A Survey
The paper surveys the origins, frameworks, applications, and open challenges of AI agents built on large language models.
-
OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models
OpenFlamingo provides open-source autoregressive vision-language models that achieve 80-89% of Flamingo performance on seven vision-language datasets.
-
Large Language Models: A Survey
The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.
-
A Survey of Large Language Models
This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.
Reference graph
Works this paper leans on
-
[1]
CM3: A causal masked multimodal model of the Internet
[AHR+22] Armen Aghajanyan, Bernie Huang, Candace Ross, Vladimir Karpukhin, Hu Xu, Naman Goyal, Dmytro Okhonko, Mandar Joshi, Gargi Ghosh, Mike Lewis, and Luke Zettlemoyer. CM3: A causal masked multimodal model of the Internet. ArXiv, abs/2201.07520,
-
[2]
Are Elephants Bigger than Butterflies? Reasoning about Sizes of Objects
[BHCF16] Hessam Bagherinezhad, Hannaneh Hajishirzi, Yejin Choi, and Ali Farhadi. Are elephants bigger than butterflies? Reasoning about sizes of objects. ArXiv, abs/1602.00753,
-
[3]
Language models are few-shot learners
[BMR+20] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litw...
-
[4]
BoolQ: Exploring the surprising difficulty of natural yes/no questions
[CLC+19] Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short ...
-
[5]
PaLM: Scaling Language Modeling with Pathways
[CND+22] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek B Rao, Parker Barnes, Yi Tay, Noam M. Shazeer, Vinodkumar Prabhakaran, ...
-
[6]
WebSRC: A dataset for web-based structural reading comprehension
[CZC+21] Xingyu Chen, Zihan Zhao, Lu Chen, JiaBao Ji, Danyang Zhang, Ao Luo, Yuxuan Xiong, and Kai Yu. WebSRC: A dataset for web-based structural reading comprehension. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 4173–4185, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics.
-
[7]
ImageNet: A large-scale hierarchical image database
[DDS+09] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2009), 20-25 June 2009, Miami, Florida, USA, pages 248–255. IEEE Computer Society, 2009.
-
[8]
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
[GBB+20] Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027,
-
[9]
Gaussian Error Linear Units (GELUs)
[HG16] Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415,
-
[10]
Language models are general-purpose interfaces
[HSD+22] Yaru Hao, Haoyu Song, Li Dong, Shaohan Huang, Zewen Chi, Wenhui Wang, Shuming Ma, and Furu Wei. Language models are general-purpose interfaces. ArXiv, abs/2206.06336,
-
[11]
Grounding language models to images for multimodal generation
[KSF23] Jing Yu Koh, Ruslan Salakhutdinov, and Daniel Fried. Grounding language models to images for multimodal generation. arXiv preprint arXiv:2301.13823,
-
[12]
The flan collection: Designing data and methods for effective instruction tuning
[LHV+23] Shayne Longpre, Le Hou, Tu Vu, Albert Webson, Hyung Won Chung, Yi Tay, Denny Zhou, Quoc V Le, Barret Zoph, Jason Wei, et al. The flan collection: Designing data and methods for effective instruction tuning. arXiv preprint arXiv:2301.13688,
-
[13]
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
[LLSH23] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. ArXiv, abs/2301.12597,
-
[14]
RoBERTa: A Robustly Optimized BERT Pretraining Approach
[LOG+19] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692,
-
[15]
Lsdsem 2017 shared task: The story cloze test
[MRL+17] Nasrin Mostafazadeh, Michael Roth, Annie Louis, Nathanael Chambers, and James Allen. LSDSem 2017 shared task: The story cloze test. In Proceedings of the 2nd Workshop on Linking Models of Lexical, Sentential and Discourse-level Semantics, pages 46–51, 2017.
-
[16]
TorchScale: Transformers at scale
[MWH+22] Shuming Ma, Hongyu Wang, Shaohan Huang, Wenhui Wang, Zewen Chi, Li Dong, Alon Benhaim, Barun Patra, Vishrav Chaudhary, Xia Song, and Furu Wei. TorchScale: Transformers at scale. CoRR, abs/2211.13184,
-
[17]
Transferring Knowledge from Vision to Language: How to Achieve It and How to Measure It?
[NHJ21] Tobias Norlund, Lovisa Hagström, and Richard Johansson. Transferring knowledge from vision to language: How to achieve it and how to measure it? ArXiv, abs/2109.11321,
-
[18]
LAION-5B: An open large-scale dataset for training next generation image-text models
[SBV+22] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. LAION-5B: An open large-scale dataset for training next generation image-text models. arXiv preprint arXiv:2210.08402,
-
[19]
Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning
[SDGS18] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, pages 2556–2565. Association for Computational Linguistics, 2018.
-
[20]
A length-extrapolatable transformer
[SDP+22] Yutao Sun, Li Dong, Barun Patra, Shuming Ma, Shaohan Huang, Alon Benhaim, Vishrav Chaudhary, Xia Song, and Furu Wei. A length-extrapolatable transformer. arXiv preprint arXiv:2212.10554,
-
[21]
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
[SPP+19] Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053,
-
[22]
Recursive deep models for semantic compositionality over a sentiment treebank
[SPW+13] Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, Seattle, Washington, USA, October 2013. Association for Computational Linguistics.
-
[23]
LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs
[SVB+21] Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. LAION-400M: Open dataset of CLIP-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114,
-
[24]
Image as a foreign language: BEiT pretraining for all vision and vision-language tasks
[WBD+22] Wenhui Wang, Hangbo Bao, Li Dong, Johan Bjorck, Zhiliang Peng, Qiang Liu, Kriti Aggarwal, Owais Mohammed, Saksham Singhal, Subhojit Som, and Furu Wei. Image as a foreign language: BEiT pretraining for all vision and vision-language tasks. ArXiv, abs/2208.10442,
- [25]
-
[26]
Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers
[WCW+23] Chengyi Wang, Sanyuan Chen, Yu Wu, Zi-Hua Zhang, Long Zhou, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, Lei He, Sheng Zhao, and Furu Wei. Neural codec language models are zero-shot text to speech synthesizers. ArXiv, abs/2301.02111,
-
[27]
DeepNet: Scaling Transformers to 1,000 layers
[WMD+22] Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Dongdong Zhang, and Furu Wei. DeepNet: Scaling Transformers to 1,000 layers. CoRR, abs/2203.00555,
-
[28]
Foundation Transformers
[WMH+22] Hongyu Wang, Shuming Ma, Shaohan Huang, Li Dong, Wenhui Wang, Zhiliang Peng, Yu Wu, Payal Bajaj, Saksham Singhal, Alon Benhaim, Barun Patra, Zhun Liu, Vishrav Chaudhary, Xia Song, and Furu Wei. Foundation transformers. CoRR, abs/2210.06423,
-
[29]
SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems
[WPN+19] Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. arXiv preprint arXiv:1905.00537,
-
[30]
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
[WWS+22] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903,
-
[31]
GIT: A generative image-to-text transformer for vision and language
[WYH+22] Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, and Lijuan Wang. GIT: A generative image-to-text transformer for vision and language. CoRR, abs/2205.14100,
-
[32]
Retrieval-augmented multimodal language modeling
[YAS+22] Michihiro Yasunaga, Armen Aghajanyan, Weijia Shi, Rich James, Jure Leskovec, Percy Liang, Mike Lewis, Luke Zettlemoyer, and Wen-tau Yih. Retrieval-augmented multimodal language modeling. ArXiv, abs/2211.12561,
-
[33]
(Reference text garbled in extraction; the captured span is Table 19 of the Kosmos-1 appendix, listing instruction tuning hyperparameters: 10,000 training steps, 375 warmup steps, batch sizes of 256, 32, 768, and 16 for instruction data, text corpora, image-caption pairs, and interleaved data, and a learning rate of 2e-5.)