Recognition: 2 theorem links · Lean Theorem
SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models
Pith reviewed 2026-05-17 02:58 UTC · model grok-4.3
The pith
Mixing the weights of LLMs trained on real-world and synthetic data, together with varied tuning tasks and visual embeddings, produces a single versatile multi-modal model.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By directly integrating weights from LLMs trained on real-world and synthetic data, jointly tuning on a curated set of visual instruction tasks with conflict-avoiding instructions, and extracting embeddings from multiple architectures and granularities, SPHINX attains superior multi-modal understanding across a wide range of applications, while an auxiliary mixing of image scales and sub-images enables strong parsing of high-resolution inputs.
What carries the argument
Joint mixing of model weights, tuning tasks, and visual embeddings, which directly combines parameters, instructions, and features from different sources to build one model.
If this is right
- Unfreezing the LLM plus weight mixing produces stronger vision-language alignment than frozen baselines.
- Task-specific instructions allow simultaneous training on region-level understanding, pose estimation, and document tasks without mutual degradation.
- Diverse visual embeddings from multiple networks and pre-training regimes supply more robust image representations to the language model.
- Mixing image scales and high-resolution sub-images yields improved fine-grained appearance capture on existing evaluation sets.
Where Pith is reading between the lines
- The same mixing principle could be tested on language-only or audio-visual models to check whether parameter-level integration generalizes beyond vision-language pairs.
- If weight mixing succeeds here, it raises the possibility that separate large-scale pre-training runs on different data distributions can be combined post hoc rather than retrained from scratch.
- Future variants might add a third weight source or additional task categories to probe the limits of conflict-free mixing.
Load-bearing premise
Directly integrating weights from LLMs trained on real-world and synthetic data will incorporate diverse semantics robustly, without conflicts or performance loss.
What would settle it
If the weight-mixed model scores lower than either the real-world-only or synthetic-only LLM on standard vision-language benchmarks, the mixing step would be shown to introduce net conflicts rather than gains.
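Spelled out as a decision rule, the test might look like this (a minimal sketch; `evaluate` is a placeholder for running a model on a benchmark and returning a scalar score, and the verdict labels are illustrative):

```python
# Sketch of the falsification test described above (higher score = better).
def weight_mix_verdict(evaluate, mixed_llm, real_only_llm, synth_only_llm, benchmark) -> str:
    s_mix = evaluate(mixed_llm, benchmark)
    s_real = evaluate(real_only_llm, benchmark)
    s_syn = evaluate(synth_only_llm, benchmark)
    if s_mix < min(s_real, s_syn):
        return "net conflict"   # mixing hurt relative to both single-domain LLMs
    if s_mix > max(s_real, s_syn):
        return "net gain"       # mixing beat both parent models
    return "inconclusive"       # mixed model lies between the two parents
```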
Original abstract
We present SPHINX, a versatile multi-modal large language model (MLLM) with a joint mixing of model weights, tuning tasks, and visual embeddings. First, for stronger vision-language alignment, we unfreeze the large language model (LLM) during pre-training, and introduce a weight mix strategy between LLMs trained by real-world and synthetic data. By directly integrating the weights from two domains, the mixed LLM can efficiently incorporate diverse semantics with favorable robustness. Then, to enable multi-purpose capabilities, we mix a variety of tasks for joint visual instruction tuning, and design task-specific instructions to avoid inter-task conflict. In addition to the basic visual question answering, we include more challenging tasks such as region-level understanding, caption grounding, document layout detection, and human pose estimation, contributing to mutual enhancement over different scenarios. Additionally, we propose to extract comprehensive visual embeddings from various network architectures, pre-training paradigms, and information granularity, providing language models with more robust image representations. Based on our proposed joint mixing, SPHINX exhibits superior multi-modal understanding capabilities on a wide range of applications. On top of this, we further propose an efficient strategy aiming to better capture fine-grained appearances of high-resolution images. With a mixing of different scales and high-resolution sub-images, SPHINX attains exceptional visual parsing and reasoning performance on existing evaluation benchmarks. We hope our work may cast a light on the exploration of joint mixing in future MLLM research. Code is released at https://github.com/Alpha-VLLM/LLaMA2-Accessory.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SPHINX, a multi-modal large language model that performs joint mixing of LLM weights (unfreezing and integrating parameters from real-world and synthetic data LLMs), a variety of visual instruction tuning tasks (including VQA, region-level understanding, caption grounding, document layout detection, and pose estimation) with task-specific instructions, and visual embeddings extracted from diverse network architectures, pre-training paradigms, and granularities. It further proposes an efficient high-resolution strategy mixing scales and sub-images, claiming superior multi-modal understanding and visual parsing on benchmarks.
Significance. If the empirical gains hold after proper controls, the joint mixing framework could offer a practical route to versatile MLLMs that combine robustness and diversity without separate expert modules; the release of code is a positive contribution for reproducibility.
major comments (3)
- [weight mix strategy section] The central claim that 'directly integrating the weights from two domains' efficiently incorporates diverse semantics with favorable robustness (abstract and weight-mix description) is load-bearing for the superiority argument, yet the manuscript provides no explicit definition of the mixing operator (simple average, task-vector addition, or learned gate), no analysis of activation/gradient conflicts, and no ablation isolating the weight-mix step from task and embedding mixing.
- [joint visual instruction tuning section] The assertion that task-specific instructions alone suffice to avoid inter-task conflict during joint visual instruction tuning is presented without quantitative evidence of interference (e.g., performance drop when mixing all tasks vs. sequential) or comparison to standard multi-task baselines; this undermines the 'mutual enhancement' claim across scenarios such as region-level understanding and pose estimation.
- [high-resolution strategy section] The high-resolution strategy of mixing different scales and sub-images is claimed to attain exceptional visual parsing, but the manuscript lacks a controlled comparison showing that the gains exceed those from simply increasing input resolution or using standard multi-scale patching, making it unclear whether the mixing itself is the decisive factor.
minor comments (2)
- [visual embeddings section] Notation for the visual embedding extraction (various architectures and granularities) is introduced without a compact diagram or table summarizing the sources and dimensions, which would aid clarity; a rough sketch of such a multi-encoder mix follows these comments.
- [abstract and experiments] The abstract states 'superior multi-modal understanding capabilities on a wide range of applications' but the manuscript should explicitly list the exact benchmarks and metrics used for each claim rather than referring generically to 'existing evaluation benchmarks'.
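For the kind of summary the first minor comment asks for, a minimal sketch of mixing embeddings from several visual encoders (the encoder names, token shapes, and projection scheme are assumptions for illustration, not the paper's exact design):

```python
# Sketch: concatenate visual tokens from several encoders (different architectures
# and pre-training regimes), each projected to a common LLM embedding width.
import torch
import torch.nn as nn


class MixedVisualEmbedding(nn.Module):
    """Concatenate per-encoder visual tokens, projected to a shared LLM width."""

    def __init__(self, encoders: dict, llm_dim: int = 4096):
        super().__init__()
        self.encoders = nn.ModuleDict(encoders)
        # One lazily-initialized projection per encoder, mapping its feature width to llm_dim.
        self.projections = nn.ModuleDict({name: nn.LazyLinear(llm_dim) for name in encoders})

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        tokens = []
        for name, encoder in self.encoders.items():
            feats = encoder(image)                      # assumed shape: (batch, num_tokens, feat_dim)
            tokens.append(self.projections[name](feats))
        return torch.cat(tokens, dim=1)                 # concatenate along the token axis


# Usage with hypothetical encoders that each return (batch, tokens, dim) features:
# mixer = MixedVisualEmbedding({"clip_vit": clip_vit, "dinov2": dinov2, "convnext": convnext})
# visual_tokens = mixer(images)  # prepended to the text tokens fed into the LLM
```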
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We have carefully reviewed each major comment and provide point-by-point responses below, including clarifications and planned revisions to address the concerns raised.
Point-by-point responses
Referee: [weight mix strategy section] The central claim that 'directly integrating the weights from two domains' efficiently incorporates diverse semantics with favorable robustness (abstract and weight-mix description) is load-bearing for the superiority argument, yet the manuscript provides no explicit definition of the mixing operator (simple average, task-vector addition, or learned gate), no analysis of activation/gradient conflicts, and no ablation isolating the weight-mix step from task and embedding mixing.
Authors: We agree that greater precision is needed here. The weight mixing is implemented as a direct parameter-wise average between the real-world and synthetic-data LLMs after a short alignment phase, as introduced in Section 3.2. We acknowledge the absence of an explicit formula, conflict analysis, and isolating ablation. In the revised manuscript we will add the mathematical definition of the operator, a short discussion of activation/gradient behavior, and a new ablation that holds task mixing and embedding mixing fixed while toggling only the weight-mix step. These additions will better substantiate the contribution of this component. revision: yes
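A minimal sketch of such a parameter-wise average, assuming matching architectures and a single interpolation coefficient beta applied to every tensor (the function name, default beta, and checkpoint paths are illustrative, not the released implementation):

```python
# Sketch of a parameter-wise weight mix between two LLM checkpoints.
# Assumes both state dicts come from the same architecture, so keys and shapes match.
import torch


def mix_state_dicts(real_sd: dict, synth_sd: dict, beta: float = 0.5) -> dict:
    """Return theta_mix = beta * theta_real + (1 - beta) * theta_syn, tensor by tensor."""
    assert real_sd.keys() == synth_sd.keys(), "checkpoints must share the same parameter names"
    mixed = {}
    for name, real_param in real_sd.items():
        synth_param = synth_sd[name]
        assert real_param.shape == synth_param.shape, f"shape mismatch for {name}"
        mixed[name] = beta * real_param + (1.0 - beta) * synth_param
    return mixed


# Usage (hypothetical checkpoint paths):
# real_sd = torch.load("llm_real_world.pt", map_location="cpu")
# synth_sd = torch.load("llm_synthetic.pt", map_location="cpu")
# model.load_state_dict(mix_state_dicts(real_sd, synth_sd, beta=0.5))
```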
Referee: [joint visual instruction tuning section] The assertion that task-specific instructions alone suffice to avoid inter-task conflict during joint visual instruction tuning is presented without quantitative evidence of interference (e.g., performance drop when mixing all tasks vs. sequential) or comparison to standard multi-task baselines; this undermines the 'mutual enhancement' claim across scenarios such as region-level understanding and pose estimation.
Authors: We appreciate this point. Task-specific instructions are used to condition the model on each task during joint training, as described in Section 4. However, we did not report a direct joint-versus-sequential comparison or a standard multi-task baseline. In the revision we will include new experiments that measure performance when all tasks are trained jointly with the proposed instructions versus sequential training and versus a vanilla multi-task baseline without task-specific prompts. These results will quantify interference (or its absence) and support the mutual-enhancement claim for tasks such as region-level understanding and pose estimation. revision: yes
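For illustration, a mixed-task batch with task-specific instruction prefixes could be assembled roughly as follows (the templates and task names are placeholders, not the paper's exact instruction set):

```python
# Sketch: wrap samples from different tasks with task-specific instructions so a
# single mixed batch can be trained jointly without ambiguous supervision.
TASK_TEMPLATES = {
    "vqa": "Answer the question about the image. Question: {question}",
    "region": "Describe the content inside the region {bbox} of the image.",
    "pose": "Detect the key points of the person in the image and list their coordinates.",
    "layout": "Detect the layout elements (title, paragraph, table, figure) in the document image.",
}


def build_sample(task: str, fields: dict, target: str) -> dict:
    """Attach a task-conditioned instruction to one training example."""
    instruction = TASK_TEMPLATES[task].format(**fields)
    return {"task": task, "instruction": instruction, "target": target}


mixed_batch = [
    build_sample("vqa", {"question": "What color is the car?"}, "red"),
    build_sample("region", {"bbox": "[0.2, 0.1, 0.6, 0.5]"}, "a dog lying on a sofa"),
    build_sample("pose", {}, "nose (112, 80); left shoulder (90, 140); ..."),
]
```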
Referee: [high-resolution strategy section] The high-resolution strategy of mixing different scales and sub-images is claimed to attain exceptional visual parsing, but the manuscript lacks a controlled comparison showing that the gains exceed those from simply increasing input resolution or using standard multi-scale patching, making it unclear whether the mixing itself is the decisive factor.
Authors: We thank the referee for this observation. Our high-resolution approach mixes multi-scale inputs with selected high-resolution sub-images to balance detail and efficiency. We recognize that a controlled comparison against simply raising resolution or using conventional multi-scale patching is missing. In the revised version we will add such experiments, reporting performance when using our mixing strategy versus equivalent higher-resolution inputs and versus standard multi-scale feature extraction, thereby isolating the benefit of the proposed mixing procedure. revision: yes
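One plausible reading of the scale-and-sub-image mixing, written as a sketch (the base resolution, grid size, and helper names are assumptions for illustration, not the paper's exact procedure):

```python
# Sketch: represent a high-resolution image as one downsampled global view plus a
# grid of full-resolution sub-images, all fed to the same visual encoder.
from PIL import Image


def mix_scales(image: Image.Image, base: int = 224, grid: int = 2) -> list:
    """Return [global_view, sub_image_0, ..., sub_image_{grid*grid-1}], each base x base."""
    views = [image.resize((base, base))]           # low-resolution global context
    hi = image.resize((base * grid, base * grid))  # high-resolution canvas
    for row in range(grid):
        for col in range(grid):
            box = (col * base, row * base, (col + 1) * base, (row + 1) * base)
            views.append(hi.crop(box))             # full-resolution local detail
    return views


# Usage: views = mix_scales(Image.open("doc_page.png").convert("RGB"))
# Each view is encoded separately and the resulting embeddings are given to the LLM.
```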
Circularity Check
No circularity: empirical mixing of existing components with no derivations or self-referential claims
Full rationale
The paper describes SPHINX as an empirical construction: unfreezing the LLM and directly integrating weights from real-world and synthetic LLMs, mixing diverse tasks with task-specific instructions, and extracting visual embeddings from multiple architectures. No equations, derivations, predictions, or uniqueness theorems appear in the provided text. Claims of superior performance rest on experimental benchmarks rather than any reduction of outputs to fitted inputs or self-citations by construction. The approach is self-contained as a practical combination of prior components without load-bearing circular steps.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Weight mixing between LLMs trained on real-world and synthetic data produces a model with diverse semantics and robustness.
- domain assumption: Task-specific instructions prevent inter-task conflict during joint visual instruction tuning.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: "we introduce a weight mix strategy between LLMs trained by real-world and synthetic data... θ_mix = β · θ_real + (1 − β) · θ_syn"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 19 Pith papers
- MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
  MMMU provides 11.5K heterogeneous college-level multimodal questions that current models solve at 56-59% accuracy, establishing a new standard for expert multimodal evaluation.
- Aligned Multi-View Scripts for Universal Chart-to-Code Generation
  Introduces an aligned multi-language dataset and a language-conditioned low-rank adapter for generating executable plotting code in Python, R, and LaTeX from chart images.
- AnchorSeg: Language Grounded Query Banks for Reasoning Segmentation
  AnchorSeg uses ordered query banks of latent reasoning tokens plus a spatial anchor token and a Token-Mask Cycle Consistency loss to achieve 67.7% gIoU and 68.1% cIoU on the ReasonSeg benchmark.
- VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents
  VisRAG achieves 20-40% better end-to-end performance than text-based RAG by directly embedding and retrieving document images with VLMs.
- LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models
  LLaVA-NeXT-Interleave unifies multi-image, video, and 3D capabilities in large multimodal models via a new 1.18M-sample interleaved dataset and benchmark, achieving leading results across those tasks while preserving ...
- MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?
  MathVerse is a benchmark that tests multi-modal LLMs on visual math by providing each problem in six versions with progressively less diagram and text information to measure true visual understanding.
- LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
  LLaMA-Adapter turns frozen LLaMA 7B into a capable instruction follower using only 1.2M new parameters and zero-init attention, matching Alpaca while extending to image-conditioned reasoning on ScienceQA and COCO.
- Foveated Reasoning: Stateful, Action-based Visual Focusing for Vision-Language Models
  Foveated Reasoner integrates foveation as stateful actions inside the autoregressive decoding loop of vision-language models, trained via cold-start supervision then reinforcement learning to achieve higher accuracy a...
- SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics
  SmolVLA is a small efficient VLA model that achieves performance comparable to 10x larger models while training on one GPU and deploying on consumer hardware via community data and chunked asynchronous action prediction.
- SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation
  SEED-X is a unified multimodal foundation model that handles multi-granularity visual semantics for both comprehension and generation across arbitrary image sizes and ratios.
- MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
  MM1 models achieve state-of-the-art few-shot multimodal results by pre-training on a careful mix of image-caption, interleaved, and text-only data with optimized image encoders.
- TempCompass: Do Video LLMs Really Understand Videos?
  TempCompass benchmark reveals that state-of-the-art Video LLMs have poor ability to perceive temporal aspects such as speed, direction, and ordering in videos.
- CogVLM: Visual Expert for Pretrained Language Models
  CogVLM adds a trainable visual expert inside frozen language model layers for deep vision-language fusion and reports state-of-the-art results on ten cross-modal benchmarks while preserving NLP performance.
- MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
  MME is a manually annotated benchmark evaluating MLLMs on perception and cognition across 14 subtasks to avoid data leakage and support fair model comparisons.
- Cognitive Pivot Points and Visual Anchoring: Unveiling and Rectifying Hallucinations in Multimodal Reasoning Models
  Multimodal reasoning models hallucinate at high-entropy cognitive bifurcation points due to loss of visual semantic anchoring, and the V-STAR training paradigm with HVAR rewards and FRM reflection mitigates this by re...
- InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output
  InternLM-XComposer-2.5 is a 7B vision-language model supporting up to 96K context that reaches GPT-4V-level performance on image, video, and multi-turn tasks and adds LoRA-driven text-image composition capabilities.
- Hallucination of Multimodal Large Language Models: A Survey
  The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.
- How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
  InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.
- A Survey on Multimodal Large Language Models
  This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.