pith. machine review for the scientific record.

arxiv: 2603.00655 · v2 · submitted 2026-02-28 · 💻 cs.CV

Recognition: 1 theorem link · Lean Theorem

Mema: Memory-Augmented Adapter for Enhanced Vision-Language Understanding

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 17:49 UTC · model grok-4.3

classification 💻 cs.CV
keywords Memory-Augmented Adapter · Vision Encoder · Hierarchical Visual Features · Multimodal Large Language Models · Adapter Module · Cross-Layer Interaction · Fine-Grained Visual Cues

The pith

A stateful memory inside the vision encoder accumulates hierarchical visual features and selectively feeds them back to preserve fine-grained cues for multimodal reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Mema as a lightweight adapter added to pretrained vision encoders in multimodal large language models. Existing methods lose shallow-layer visual details because they use only final features or simple fusions without explicit cross-layer interaction. Mema maintains a memory that builds up representations layer by layer, evolves that memory using both the input query and current visual tokens, and injects a selected portion back into the token stream via a feedback loop. This design requires training only a small set of new parameters and leaves the original backbone unchanged. If the approach works as described, models gain better access to detailed visual information across the hierarchy at low extra cost.
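
A concrete way to read that description is as a small gated module threaded between frozen encoder blocks. The PyTorch sketch below is a minimal illustration under assumptions, not the paper's released implementation: the class name `MemoryAdapter`, the GRU-style gated update, the mean-pooled step summary, and the `inject_ratio` scaling are all invented here to make the accumulate / evolve / inject loop explicit.

```python
import torch
import torch.nn as nn


class MemoryAdapter(nn.Module):
    """Gated, stateful memory threaded between frozen vision-encoder blocks."""

    def __init__(self, dim: int, query_dim: int, inject_ratio: float = 0.25):
        super().__init__()
        self.query_proj = nn.Linear(query_dim, dim)   # external projection of the query
        self.update_gate = nn.Linear(3 * dim, dim)    # conditions on memory, tokens, query
        self.candidate = nn.Linear(3 * dim, dim)
        self.inject_gate = nn.Linear(2 * dim, dim)    # decides how much memory to feed back
        self.inject_ratio = inject_ratio              # "portion" of memory injected (assumed scalar)

    def evolve(self, memory, tokens, query):
        # Accumulate the current layer's visual evidence into the memory state,
        # conditioned on both the query and a pooled step-wise feature summary.
        pooled = tokens.mean(dim=1)                   # (B, D) summary of this layer's tokens
        q = self.query_proj(query)                    # (B, D)
        ctx = torch.cat([memory, pooled, q], dim=-1)
        z = torch.sigmoid(self.update_gate(ctx))      # GRU-style gate (an assumption)
        cand = torch.tanh(self.candidate(ctx))
        return (1 - z) * memory + z * cand

    def inject(self, memory, tokens):
        # Feedback loop: selectively add a gated portion of the memory to each token.
        mem = memory.unsqueeze(1).expand_as(tokens)   # (B, N, D)
        gate = torch.sigmoid(self.inject_gate(torch.cat([tokens, mem], dim=-1)))
        return tokens + self.inject_ratio * gate * memory.unsqueeze(1)


def forward_with_memory(blocks, adapter, tokens, query):
    """Run frozen encoder blocks unchanged, evolving and injecting memory between them."""
    memory = tokens.new_zeros(tokens.size(0), tokens.size(-1))
    for block in blocks:                              # pretrained blocks, weights frozen
        tokens = block(tokens)
        memory = adapter.evolve(memory, tokens, query)
        tokens = adapter.inject(memory, tokens)
    return tokens, memory
```

The point of the sketch is the control flow: the backbone blocks run unmodified, the adapter alone carries state from layer to layer, and a gate decides how much of that state is fed back into the token stream.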

Core claim

Mema maintains a stateful memory that accumulates hierarchical visual representations across layers of the vision encoder. Memory evolution is conditioned on both query embeddings and step-wise visual features, and a portion of the memory is selectively injected into token representations through a feedback mechanism. This process mitigates attenuation of fine-grained cues from shallow layers while requiring no changes to the vanilla backbone architecture and only minimal additional trainable parameters.
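
The "no backbone changes, minimal trainable parameters" half of the claim can likewise be sketched, reusing the hypothetical `MemoryAdapter` and `forward_with_memory` above. The stand-in encoder (`nn.TransformerEncoderLayer` blocks), the dimensions, and the resulting parameter ratio are illustrative assumptions, not the paper's configuration, so the printed fraction will not match any figure the authors report.

```python
# Hypothetical integration sketch reusing MemoryAdapter / forward_with_memory from
# the sketch above. Stand-in blocks, dimensions, and parameter ratio are illustrative.
import torch
import torch.nn as nn

dim = 768
vision_blocks = nn.ModuleList(
    [nn.TransformerEncoderLayer(d_model=dim, nhead=12, batch_first=True)
     for _ in range(12)]                       # stand-in for a pretrained ViT backbone
)
for p in vision_blocks.parameters():
    p.requires_grad_(False)                    # backbone frozen, architecture untouched

adapter = MemoryAdapter(dim=dim, query_dim=4096)   # only these parameters would train
added = sum(p.numel() for p in adapter.parameters())
frozen = sum(p.numel() for p in vision_blocks.parameters())
print(f"adapter adds {added:,} trainable parameters ({added / frozen:.1%} of the stand-in backbone)")

# Plug-and-play use: the same frozen blocks, now threaded through the adapter.
tokens = torch.randn(2, 196, dim)              # visual tokens (illustrative)
query = torch.randn(2, 4096)                   # query embedding from the language side (illustrative)
out, memory = forward_with_memory(vision_blocks, adapter, tokens, query)
```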

What carries the argument

The stateful memory that accumulates layer-wise visual representations, evolves under query and feature conditioning, and supplies selective feedback injection into token streams.

If this is right

  • Vision encoders can retain more hierarchical detail without retraining the entire backbone.
  • Training cost stays low because only the adapter parameters are updated.
  • The same memory mechanism can be dropped into existing MLLM pipelines as a plug-in.
  • Complex reasoning tasks that rely on fine visual distinctions show measurable gains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same memory pattern could be tested in other hierarchical encoders such as audio or 3D models.
  • If the selective injection proves robust, future designs may replace multi-layer fusion modules entirely with memory-based feedback.
  • The approach suggests that explicit state across layers matters more than simply adding more fusion parameters.

Load-bearing premise

That query-conditioned memory evolution plus selective injection recovers shallow-layer fine details without introducing noise or requiring any backbone changes.

What would settle it

A controlled ablation on a standard multimodal benchmark where replacing Mema with a plain adapter or removing the memory component produces equal or higher scores than the full Mema model.
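
In code terms, and again using the hypothetical sketches above, that ablation is a switch: keep everything fixed and disable the memory feedback. The toy harness below only demonstrates the mechanical toggle on random tensors; variant behavior on an actual benchmark is what the paper would need to report.

```python
# Toy harness for the ablation switch, reusing the hypothetical sketches above.
# Random tensors and tiny dimensions stand in for real data; nothing here is a
# benchmark result.
import torch
import torch.nn as nn

torch.manual_seed(0)
dim, qdim = 64, 32
blocks = nn.ModuleList(
    [nn.Sequential(nn.Linear(dim, dim), nn.GELU()) for _ in range(4)]
)                                          # tiny stand-in for frozen encoder blocks
adapter = MemoryAdapter(dim=dim, query_dim=qdim)

tokens = torch.randn(2, 16, dim)
query = torch.randn(2, qdim)

full_out, _ = forward_with_memory(blocks, adapter, tokens, query)

adapter.inject_ratio = 0.0                 # ablation: memory still evolves, never fed back
ablated_out, _ = forward_with_memory(blocks, adapter, tokens, query)

# On a real benchmark the comparison would be between task scores, not feature norms.
print("feature shift due to memory feedback:", (full_out - ablated_out).abs().mean().item())
```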

Figures

Figures reproduced from arXiv: 2603.00655 by Kean Shi, Liyuan Pan, Ying Liu, Yudong Han.

Figure 2: Layer-wise visualization of attention patterns and feature …
Figure 3: Overview of Mema. Mema maintains a stateful memory that accumulates hierarchical visual representations via HME, and …
Figure 4: Illustration of (a) HME and (b) MVA.
Figure 5: Visualization analysis of behavior of our method.
Figure 6: Effect of additional training data on the LLaVA-v1.5-7B.
Figure 7: Qualitative comparison between LLaVA-v1.5-7B and …
read the original abstract

Multimodal Large Language Models (MLLMs) have achieved remarkable performance by aligning pretrained visual representations with the linguistic knowledge embedded in Large Language Models (LLMs). However, existing approaches typically rely on final-layer visual features or learnable multi-layer fusion, which often fail to sufficiently exploit hierarchical visual cues without explicit cross-layer interaction design. In this work, we propose a Memory-Augmented Adapter (Mema) within the vision encoder. Specifically, Mema maintains a stateful memory that accumulates hierarchical visual representations across layers, with its evolution conditioned on both query embeddings and step-wise visual features. A portion of this memory is selectively injected into token representations via a feedback mechanism, thereby mitigating the attenuation of fine-grained visual cues from shallow layers. Designed as a lightweight and plug-and-play module, Mema integrates seamlessly into pretrained vision encoders without modifying the vanilla backbone architecture. Only a minimal set of additional parameters requires training, enabling adaptive visual feature refinement while reducing training overhead. Extensive experiments across multiple benchmarks demonstrate that Mema consistently improves performance, validating its effectiveness in complex multimodal reasoning tasks. The code have been released at https://github.com/Sisiliu312/Mema.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes Mema, a lightweight Memory-Augmented Adapter inserted into the vision encoder of MLLMs. It maintains a stateful memory module that accumulates hierarchical visual representations across layers; memory evolution is conditioned on both query embeddings and step-wise visual features, with selective injection of memory portions back into token representations via a feedback mechanism. The design is presented as plug-and-play with no modifications to the vanilla backbone and only minimal additional parameters to train, and the abstract claims that extensive experiments show consistent performance improvements on multiple benchmarks for complex multimodal reasoning tasks.

Significance. If the central claims hold, Mema would provide an efficient, low-overhead mechanism for preserving fine-grained hierarchical visual cues in MLLM vision encoders, potentially benefiting tasks that require detailed visual reasoning without requiring full backbone retraining or heavy fusion modules.

major comments (2)
  1. [Abstract and §3 (method)] Abstract and method description: the claim of seamless 'plug-and-play' integration 'without modifying the vanilla backbone architecture' is undermined by the requirement that memory evolution be conditioned on query embeddings. Standard MLLM pipelines extract all visual features before text tokens are processed, so this conditioning implies either an unstated early text injection interface or an external query projection into every vision layer; neither the implementation nor the resulting parameter/compute cost is quantified, making the 'minimal additional parameters' and 'no backbone modification' assertions unverifiable.
  2. [Abstract] Abstract: the central claim that 'Mema consistently improves performance' and 'validat[es] its effectiveness' rests on 'extensive experiments across multiple benchmarks,' yet the provided text supplies no quantitative results, ablation studies, error bars, memory-size specifications, or training details. Without these, the soundness of the performance gains cannot be assessed.
minor comments (1)
  1. [Abstract] Abstract: grammatical error in 'The code have been released' (should be 'has').

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below with clarifications on the Mema design and experimental reporting, and we will revise the paper to improve verifiability while preserving the core claims.

read point-by-point responses
  1. Referee: [Abstract and §3 (method)] Abstract and method description: the claim of seamless 'plug-and-play' integration 'without modifying the vanilla backbone architecture' is undermined by the requirement that memory evolution be conditioned on query embeddings. Standard MLLM pipelines extract all visual features before text tokens are processed, so this conditioning implies either an unstated early text injection interface or an external query projection into every vision layer; neither the implementation nor the resulting parameter/compute cost is quantified, making the 'minimal additional parameters' and 'no backbone modification' assertions unverifiable.

    Authors: The query embeddings are obtained from the text input and projected via a lightweight linear projection layer that is part of the Mema adapter itself; this projection is applied externally to the vision encoder layers without altering any internal weights, layers, or forward pass of the vanilla backbone. The adapter is inserted as a plug-in module around the existing vision transformer blocks. We will revise §3 to explicitly diagram and describe this external query projection step, and we will add a table quantifying the exact additional parameter count (under 1% of the backbone) and compute overhead to substantiate the 'minimal' and 'no backbone modification' claims. revision: yes

  2. Referee: [Abstract] Abstract: the central claim that 'Mema consistently improves performance' and 'validat[es] its effectiveness' rests on 'extensive experiments across multiple benchmarks,' yet the provided text supplies no quantitative results, ablation studies, error bars, memory-size specifications, or training details. Without these, the soundness of the performance gains cannot be assessed.

    Authors: The abstract is a high-level summary; the full manuscript contains all requested details in the Experiments section, including benchmark scores with comparisons, ablation tables on memory size and injection ratios, standard deviations from multiple runs, and full training hyperparameters. To address the concern directly, we will expand the abstract with a concise statement of key quantitative gains (e.g., average improvement across benchmarks) while keeping it within length limits. revision: yes

Circularity Check

0 steps flagged

No circularity: architectural description with no equations or self-referential reductions

full rationale

The paper presents Mema as a lightweight plug-and-play adapter module inserted into the vision encoder. It describes a stateful memory that accumulates hierarchical features with evolution conditioned on query embeddings and step-wise visual features, followed by selective injection. No equations, derivations, or quantitative predictions appear in the text that reduce claimed performance gains to quantities defined by fitted parameters or prior self-citations within the paper. Improvements are asserted via experimental results on benchmarks rather than by construction from the method's own inputs. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked. The derivation chain is therefore self-contained as an independent architectural proposal.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the unproven effectiveness of the new memory mechanism for preserving hierarchical cues; the abstract provides no independent evidence or formal derivation for this effectiveness.

free parameters (1)
  • memory injection portion
    The abstract refers to 'a portion of this memory' being selectively injected, implying at least one tunable design choice whose value is not specified.
axioms (1)
  • domain assumption: Hierarchical visual representations across layers contain complementary fine-grained cues that are attenuated in standard final-layer or simple fusion processing.
    This premise is invoked to motivate the memory design and is treated as background for multimodal models.
invented entities (1)
  • Stateful memory module inside Mema · no independent evidence
    purpose: To accumulate hierarchical visual representations across layers and enable selective feedback injection conditioned on queries and features.
    This is a newly introduced component whose independent validation is not provided in the abstract.

pith-pipeline@v0.9.0 · 5509 in / 1341 out tokens · 44183 ms · 2026-05-15T17:49:14.064715+00:00 · methodology


Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Visual Enhanced Depth Scaling for Multimodal Latent Reasoning

    cs.CV 2026-04 unverdicted novelty 6.0

    Visual replay module and adaptive depth scaling improve multimodal latent reasoning, reaching SOTA benchmarks with faster inference than explicit chain-of-thought methods.

  2. Visual Enhanced Depth Scaling for Multimodal Latent Reasoning

    cs.CV 2026-04 unverdicted novelty 5.0

    Visual replay and depth scaling in latent reasoning produce state-of-the-art multimodal results with faster inference than explicit CoT.

  3. Visual Enhanced Depth Scaling for Multimodal Latent Reasoning

    cs.CV 2026-04 unverdicted novelty 5.0

    A visual replay module combined with adaptive depth scaling improves multimodal latent reasoning, delivering state-of-the-art benchmark results and faster inference than explicit chain-of-thought methods.
