Recognition: 2 theorem links · Lean Theorem
SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models
Pith reviewed 2026-05-17 02:58 UTC · model grok-4.3
The pith
Mixing the weights of LLMs trained on real-world and synthetic data, together with varied tuning tasks and visual embeddings, produces a single versatile multi-modal model.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By directly integrating weights from LLMs trained on real-world and synthetic data, jointly tuning on a curated set of visual instruction tasks with conflict-avoiding instructions, and extracting embeddings from multiple architectures and granularities, SPHINX attains superior multi-modal understanding across a wide range of applications, while an auxiliary mixing of image scales and sub-images enables strong parsing of high-resolution inputs.
What carries the argument
Joint mixing of model weights, tuning tasks, and visual embeddings, which directly combines parameters, instructions, and features from different sources to build one model.
If this is right
- Unfreezing the LLM plus weight mixing produces stronger vision-language alignment than frozen baselines.
- Task-specific instructions allow simultaneous training on region-level understanding, pose estimation, and document tasks without mutual degradation.
- Diverse visual embeddings from multiple networks and pre-training regimes supply more robust image representations to the language model.
- Mixing image scales and high-resolution sub-images yields improved fine-grained appearance capture on existing evaluation sets.
Where Pith is reading between the lines
- The same mixing principle could be tested on language-only or audio-visual models to check whether parameter-level integration generalizes beyond vision-language pairs.
- If weight mixing succeeds here, it raises the possibility that separate large-scale pre-training runs on different data distributions can be combined post hoc rather than retrained from scratch.
- Future variants might add a third weight source or additional task categories to probe the limits of conflict-free mixing.
Load-bearing premise
Directly integrating weights from LLMs trained on real-world and synthetic data will incorporate diverse semantics robustly, without conflicts or performance loss.
What would settle it
If the weight-mixed model scores lower than either the real-world-only or synthetic-only LLM on standard vision-language benchmarks, the mixing step would be shown to introduce net conflicts rather than gains.
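Spelled out as a decision rule, the test might look like this (a minimal sketch; `evaluate` is a placeholder for running a model on a benchmark and returning a scalar score, and the verdict labels are illustrative):

```python
# Sketch of the falsification test described above (higher score = better).
def weight_mix_verdict(evaluate, mixed_llm, real_only_llm, synth_only_llm, benchmark) -> str:
    s_mix = evaluate(mixed_llm, benchmark)
    s_real = evaluate(real_only_llm, benchmark)
    s_syn = evaluate(synth_only_llm, benchmark)
    if s_mix < min(s_real, s_syn):
        return "net conflict"   # mixing hurt relative to both single-domain LLMs
    if s_mix > max(s_real, s_syn):
        return "net gain"       # mixing beat both parent models
    return "inconclusive"       # mixed model lies between the two parents
```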
Original abstract
We present SPHINX, a versatile multi-modal large language model (MLLM) with a joint mixing of model weights, tuning tasks, and visual embeddings. First, for stronger vision-language alignment, we unfreeze the large language model (LLM) during pre-training, and introduce a weight mix strategy between LLMs trained by real-world and synthetic data. By directly integrating the weights from two domains, the mixed LLM can efficiently incorporate diverse semantics with favorable robustness. Then, to enable multi-purpose capabilities, we mix a variety of tasks for joint visual instruction tuning, and design task-specific instructions to avoid inter-task conflict. In addition to the basic visual question answering, we include more challenging tasks such as region-level understanding, caption grounding, document layout detection, and human pose estimation, contributing to mutual enhancement over different scenarios. Additionally, we propose to extract comprehensive visual embeddings from various network architectures, pre-training paradigms, and information granularity, providing language models with more robust image representations. Based on our proposed joint mixing, SPHINX exhibits superior multi-modal understanding capabilities on a wide range of applications. On top of this, we further propose an efficient strategy aiming to better capture fine-grained appearances of high-resolution images. With a mixing of different scales and high-resolution sub-images, SPHINX attains exceptional visual parsing and reasoning performance on existing evaluation benchmarks. We hope our work may cast a light on the exploration of joint mixing in future MLLM research. Code is released at https://github.com/Alpha-VLLM/LLaMA2-Accessory.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SPHINX, a multi-modal large language model that performs joint mixing of LLM weights (unfreezing and integrating parameters from real-world and synthetic data LLMs), a variety of visual instruction tuning tasks (including VQA, region-level understanding, caption grounding, document layout detection, and pose estimation) with task-specific instructions, and visual embeddings extracted from diverse network architectures, pre-training paradigms, and granularities. It further proposes an efficient high-resolution strategy mixing scales and sub-images, claiming superior multi-modal understanding and visual parsing on benchmarks.
Significance. If the empirical gains hold after proper controls, the joint mixing framework could offer a practical route to versatile MLLMs that combine robustness and diversity without separate expert modules; the release of code is a positive contribution for reproducibility.
major comments (3)
- [weight mix strategy section] The central claim that 'directly integrating the weights from two domains' efficiently incorporates diverse semantics with favorable robustness (abstract and weight-mix description) is load-bearing for the superiority argument, yet the manuscript provides no explicit definition of the mixing operator (simple average, task-vector addition, or learned gate), no analysis of activation/gradient conflicts, and no ablation isolating the weight-mix step from task and embedding mixing.
- [joint visual instruction tuning section] The assertion that task-specific instructions alone suffice to avoid inter-task conflict during joint visual instruction tuning is presented without quantitative evidence of interference (e.g., performance drop when mixing all tasks vs. sequential) or comparison to standard multi-task baselines; this undermines the 'mutual enhancement' claim across scenarios such as region-level understanding and pose estimation.
- [high-resolution strategy section] The high-resolution strategy of mixing different scales and sub-images is claimed to attain exceptional visual parsing, but the manuscript lacks a controlled comparison showing that the gains exceed those from simply increasing input resolution or using standard multi-scale patching, making it unclear whether the mixing itself is the decisive factor.
minor comments (2)
- [visual embeddings section] Notation for the visual embedding extraction (various architectures and granularities) is introduced without a compact diagram or table summarizing the sources and dimensions, which would aid clarity; a rough sketch of such a multi-encoder mix follows these comments.
- [abstract and experiments] The abstract states 'superior multi-modal understanding capabilities on a wide range of applications' but the manuscript should explicitly list the exact benchmarks and metrics used for each claim rather than referring generically to 'existing evaluation benchmarks'.
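For the kind of summary the first minor comment asks for, a minimal sketch of mixing embeddings from several visual encoders (the encoder names, token shapes, and projection scheme are assumptions for illustration, not the paper's exact design):

```python
# Sketch: concatenate visual tokens from several encoders (different architectures
# and pre-training regimes), each projected to a common LLM embedding width.
import torch
import torch.nn as nn


class MixedVisualEmbedding(nn.Module):
    """Concatenate per-encoder visual tokens, projected to a shared LLM width."""

    def __init__(self, encoders: dict, llm_dim: int = 4096):
        super().__init__()
        self.encoders = nn.ModuleDict(encoders)
        # One lazily-initialized projection per encoder, mapping its feature width to llm_dim.
        self.projections = nn.ModuleDict({name: nn.LazyLinear(llm_dim) for name in encoders})

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        tokens = []
        for name, encoder in self.encoders.items():
            feats = encoder(image)                      # assumed shape: (batch, num_tokens, feat_dim)
            tokens.append(self.projections[name](feats))
        return torch.cat(tokens, dim=1)                 # concatenate along the token axis


# Usage with hypothetical encoders that each return (batch, tokens, dim) features:
# mixer = MixedVisualEmbedding({"clip_vit": clip_vit, "dinov2": dinov2, "convnext": convnext})
# visual_tokens = mixer(images)  # prepended to the text tokens fed into the LLM
```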
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We have carefully reviewed each major comment and provide point-by-point responses below, including clarifications and planned revisions to address the concerns raised.
Point-by-point responses
Referee: [weight mix strategy section] The central claim that 'directly integrating the weights from two domains' efficiently incorporates diverse semantics with favorable robustness (abstract and weight-mix description) is load-bearing for the superiority argument, yet the manuscript provides no explicit definition of the mixing operator (simple average, task-vector addition, or learned gate), no analysis of activation/gradient conflicts, and no ablation isolating the weight-mix step from task and embedding mixing.
Authors: We agree that greater precision is needed here. The weight mixing is implemented as a direct parameter-wise average between the real-world and synthetic-data LLMs after a short alignment phase, as introduced in Section 3.2. We acknowledge the absence of an explicit formula, conflict analysis, and isolating ablation. In the revised manuscript we will add the mathematical definition of the operator, a short discussion of activation/gradient behavior, and a new ablation that holds task mixing and embedding mixing fixed while toggling only the weight-mix step. These additions will better substantiate the contribution of this component. revision: yes
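A minimal sketch of such a parameter-wise average, assuming matching architectures and a single interpolation coefficient beta applied to every tensor (the function name, default beta, and checkpoint paths are illustrative, not the released implementation):

```python
# Sketch of a parameter-wise weight mix between two LLM checkpoints.
# Assumes both state dicts come from the same architecture, so keys and shapes match.
import torch


def mix_state_dicts(real_sd: dict, synth_sd: dict, beta: float = 0.5) -> dict:
    """Return theta_mix = beta * theta_real + (1 - beta) * theta_syn, tensor by tensor."""
    assert real_sd.keys() == synth_sd.keys(), "checkpoints must share the same parameter names"
    mixed = {}
    for name, real_param in real_sd.items():
        synth_param = synth_sd[name]
        assert real_param.shape == synth_param.shape, f"shape mismatch for {name}"
        mixed[name] = beta * real_param + (1.0 - beta) * synth_param
    return mixed


# Usage (hypothetical checkpoint paths):
# real_sd = torch.load("llm_real_world.pt", map_location="cpu")
# synth_sd = torch.load("llm_synthetic.pt", map_location="cpu")
# model.load_state_dict(mix_state_dicts(real_sd, synth_sd, beta=0.5))
```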
Referee: [joint visual instruction tuning section] The assertion that task-specific instructions alone suffice to avoid inter-task conflict during joint visual instruction tuning is presented without quantitative evidence of interference (e.g., performance drop when mixing all tasks vs. sequential) or comparison to standard multi-task baselines; this undermines the 'mutual enhancement' claim across scenarios such as region-level understanding and pose estimation.
Authors: We appreciate this point. Task-specific instructions are used to condition the model on each task during joint training, as described in Section 4. However, we did not report a direct joint-versus-sequential comparison or a standard multi-task baseline. In the revision we will include new experiments that measure performance when all tasks are trained jointly with the proposed instructions versus sequential training and versus a vanilla multi-task baseline without task-specific prompts. These results will quantify interference (or its absence) and support the mutual-enhancement claim for tasks such as region-level understanding and pose estimation. revision: yes
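For illustration, a mixed-task batch with task-specific instruction prefixes could be assembled roughly as follows (the templates and task names are placeholders, not the paper's exact instruction set):

```python
# Sketch: wrap samples from different tasks with task-specific instructions so a
# single mixed batch can be trained jointly without ambiguous supervision.
TASK_TEMPLATES = {
    "vqa": "Answer the question about the image. Question: {question}",
    "region": "Describe the content inside the region {bbox} of the image.",
    "pose": "Detect the key points of the person in the image and list their coordinates.",
    "layout": "Detect the layout elements (title, paragraph, table, figure) in the document image.",
}


def build_sample(task: str, fields: dict, target: str) -> dict:
    """Attach a task-conditioned instruction to one training example."""
    instruction = TASK_TEMPLATES[task].format(**fields)
    return {"task": task, "instruction": instruction, "target": target}


mixed_batch = [
    build_sample("vqa", {"question": "What color is the car?"}, "red"),
    build_sample("region", {"bbox": "[0.2, 0.1, 0.6, 0.5]"}, "a dog lying on a sofa"),
    build_sample("pose", {}, "nose (112, 80); left shoulder (90, 140); ..."),
]
```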
Referee: [high-resolution strategy section] The high-resolution strategy of mixing different scales and sub-images is claimed to attain exceptional visual parsing, but the manuscript lacks a controlled comparison showing that the gains exceed those from simply increasing input resolution or using standard multi-scale patching, making it unclear whether the mixing itself is the decisive factor.
Authors: We thank the referee for this observation. Our high-resolution approach mixes multi-scale inputs with selected high-resolution sub-images to balance detail and efficiency. We recognize that a controlled comparison against simply raising resolution or using conventional multi-scale patching is missing. In the revised version we will add such experiments, reporting performance when using our mixing strategy versus equivalent higher-resolution inputs and versus standard multi-scale feature extraction, thereby isolating the benefit of the proposed mixing procedure. revision: yes
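One plausible reading of the scale-and-sub-image mixing, written as a sketch (the base resolution, grid size, and helper names are assumptions for illustration, not the paper's exact procedure):

```python
# Sketch: represent a high-resolution image as one downsampled global view plus a
# grid of full-resolution sub-images, all fed to the same visual encoder.
from PIL import Image


def mix_scales(image: Image.Image, base: int = 224, grid: int = 2) -> list:
    """Return [global_view, sub_image_0, ..., sub_image_{grid*grid-1}], each base x base."""
    views = [image.resize((base, base))]           # low-resolution global context
    hi = image.resize((base * grid, base * grid))  # high-resolution canvas
    for row in range(grid):
        for col in range(grid):
            box = (col * base, row * base, (col + 1) * base, (row + 1) * base)
            views.append(hi.crop(box))             # full-resolution local detail
    return views


# Usage: views = mix_scales(Image.open("doc_page.png").convert("RGB"))
# Each view is encoded separately and the resulting embeddings are given to the LLM.
```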
Circularity Check
No circularity: empirical mixing of existing components with no derivations or self-referential claims
Full rationale
The paper describes SPHINX as an empirical construction: unfreezing the LLM and directly integrating weights from real-world and synthetic LLMs, mixing diverse tasks with task-specific instructions, and extracting visual embeddings from multiple architectures. No equations, derivations, predictions, or uniqueness theorems appear in the provided text. Claims of superior performance rest on experimental benchmarks rather than any reduction of outputs to fitted inputs or self-citations by construction. The approach is self-contained as a practical combination of prior components without load-bearing circular steps.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Weight mixing between LLMs trained on real-world and synthetic data produces a model with diverse semantics and robustness.
- domain assumption: Task-specific instructions prevent inter-task conflict during joint visual instruction tuning.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: "we introduce a weight mix strategy between LLMs trained by real-world and synthetic data... θ_mix = β · θ_real + (1 − β) · θ_syn"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 19 Pith papers
- MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
  MMMU provides 11.5K heterogeneous college-level multimodal questions that current models solve at 56-59% accuracy, establishing a new standard for expert multimodal evaluation.
- Aligned Multi-View Scripts for Universal Chart-to-Code Generation
  Introduces an aligned multi-language dataset and a language-conditioned low-rank adapter for generating executable plotting code in Python, R, and LaTeX from chart images.
- AnchorSeg: Language Grounded Query Banks for Reasoning Segmentation
  AnchorSeg uses ordered query banks of latent reasoning tokens plus a spatial anchor token and a Token-Mask Cycle Consistency loss to achieve 67.7% gIoU and 68.1% cIoU on the ReasonSeg benchmark.
- VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents
  VisRAG achieves 20-40% better end-to-end performance than text-based RAG by directly embedding and retrieving document images with VLMs.
- LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models
  LLaVA-NeXT-Interleave unifies multi-image, video, and 3D capabilities in large multimodal models via a new 1.18M-sample interleaved dataset and benchmark, achieving leading results across those tasks while preserving ...
- MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?
  MathVerse is a benchmark that tests multi-modal LLMs on visual math by providing each problem in six versions with progressively less diagram and text information to measure true visual understanding.
- LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
  LLaMA-Adapter turns frozen LLaMA 7B into a capable instruction follower using only 1.2M new parameters and zero-init attention, matching Alpaca while extending to image-conditioned reasoning on ScienceQA and COCO.
- Foveated Reasoning: Stateful, Action-based Visual Focusing for Vision-Language Models
  Foveated Reasoner integrates foveation as stateful actions inside the autoregressive decoding loop of vision-language models, trained via cold-start supervision then reinforcement learning to achieve higher accuracy a...
- SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics
  SmolVLA is a small efficient VLA model that achieves performance comparable to 10x larger models while training on one GPU and deploying on consumer hardware via community data and chunked asynchronous action prediction.
- SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation
  SEED-X is a unified multimodal foundation model that handles multi-granularity visual semantics for both comprehension and generation across arbitrary image sizes and ratios.
- MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
  MM1 models achieve state-of-the-art few-shot multimodal results by pre-training on a careful mix of image-caption, interleaved, and text-only data with optimized image encoders.
- TempCompass: Do Video LLMs Really Understand Videos?
  TempCompass benchmark reveals that state-of-the-art Video LLMs have poor ability to perceive temporal aspects such as speed, direction, and ordering in videos.
- CogVLM: Visual Expert for Pretrained Language Models
  CogVLM adds a trainable visual expert inside frozen language model layers for deep vision-language fusion and reports state-of-the-art results on ten cross-modal benchmarks while preserving NLP performance.
- MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
  MME is a manually annotated benchmark evaluating MLLMs on perception and cognition across 14 subtasks to avoid data leakage and support fair model comparisons.
- Cognitive Pivot Points and Visual Anchoring: Unveiling and Rectifying Hallucinations in Multimodal Reasoning Models
  Multimodal reasoning models hallucinate at high-entropy cognitive bifurcation points due to loss of visual semantic anchoring, and the V-STAR training paradigm with HVAR rewards and FRM reflection mitigates this by re...
- InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output
  InternLM-XComposer-2.5 is a 7B vision-language model supporting up to 96K context that reaches GPT-4V-level performance on image, video, and multi-turn tasks and adds LoRA-driven text-image composition capabilities.
- Hallucination of Multimodal Large Language Models: A Survey
  The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.
- How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
  InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.
- A Survey on Multimodal Large Language Models
  This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.