pith. machine review for the scientific record.

arxiv: 2603.00655 · v2 · submitted 2026-02-28 · 💻 cs.CV

Recognition: 1 theorem link · Lean Theorem

Mema: Memory-Augmented Adapter for Enhanced Vision-Language Understanding

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 17:49 UTC · model grok-4.3

classification 💻 cs.CV
keywords Memory-Augmented Adapter · Vision Encoder · Hierarchical Visual Features · Multimodal Large Language Models · Adapter Module · Cross-Layer Interaction · Fine-Grained Visual Cues

The pith

A stateful memory inside the vision encoder accumulates hierarchical visual features and selectively feeds them back to preserve fine-grained cues for multimodal reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Mema as a lightweight adapter added to pretrained vision encoders in multimodal large language models. Existing methods lose shallow-layer visual details because they use only final features or simple fusions without explicit cross-layer interaction. Mema maintains a memory that builds up representations layer by layer, evolves that memory using both the input query and current visual tokens, and injects a selected portion back into the token stream via a feedback loop. This design requires training only a small set of new parameters and leaves the original backbone unchanged. If the approach works as described, models gain better access to detailed visual information across the hierarchy at low extra cost.
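
A concrete way to read that description is as a small gated module threaded between frozen encoder blocks. The PyTorch sketch below is a minimal illustration under assumptions, not the paper's released implementation: the class name `MemoryAdapter`, the GRU-style gated update, the mean-pooled step summary, and the `inject_ratio` scaling are all invented here to make the accumulate / evolve / inject loop explicit.

```python
import torch
import torch.nn as nn


class MemoryAdapter(nn.Module):
    """Gated, stateful memory threaded between frozen vision-encoder blocks."""

    def __init__(self, dim: int, query_dim: int, inject_ratio: float = 0.25):
        super().__init__()
        self.query_proj = nn.Linear(query_dim, dim)   # external projection of the query
        self.update_gate = nn.Linear(3 * dim, dim)    # conditions on memory, tokens, query
        self.candidate = nn.Linear(3 * dim, dim)
        self.inject_gate = nn.Linear(2 * dim, dim)    # decides how much memory to feed back
        self.inject_ratio = inject_ratio              # "portion" of memory injected (assumed scalar)

    def evolve(self, memory, tokens, query):
        # Accumulate the current layer's visual evidence into the memory state,
        # conditioned on both the query and a pooled step-wise feature summary.
        pooled = tokens.mean(dim=1)                   # (B, D) summary of this layer's tokens
        q = self.query_proj(query)                    # (B, D)
        ctx = torch.cat([memory, pooled, q], dim=-1)
        z = torch.sigmoid(self.update_gate(ctx))      # GRU-style gate (an assumption)
        cand = torch.tanh(self.candidate(ctx))
        return (1 - z) * memory + z * cand

    def inject(self, memory, tokens):
        # Feedback loop: selectively add a gated portion of the memory to each token.
        mem = memory.unsqueeze(1).expand_as(tokens)   # (B, N, D)
        gate = torch.sigmoid(self.inject_gate(torch.cat([tokens, mem], dim=-1)))
        return tokens + self.inject_ratio * gate * memory.unsqueeze(1)


def forward_with_memory(blocks, adapter, tokens, query):
    """Run frozen encoder blocks unchanged, evolving and injecting memory between them."""
    memory = tokens.new_zeros(tokens.size(0), tokens.size(-1))
    for block in blocks:                              # pretrained blocks, weights frozen
        tokens = block(tokens)
        memory = adapter.evolve(memory, tokens, query)
        tokens = adapter.inject(memory, tokens)
    return tokens, memory
```

The point of the sketch is the control flow: the backbone blocks run unmodified, the adapter alone carries state from layer to layer, and a gate decides how much of that state is fed back into the token stream.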

Core claim

Mema maintains a stateful memory that accumulates hierarchical visual representations across layers of the vision encoder. Memory evolution is conditioned on both query embeddings and step-wise visual features, and a portion of the memory is selectively injected into token representations through a feedback mechanism. This process mitigates attenuation of fine-grained cues from shallow layers while requiring no changes to the vanilla backbone architecture and only minimal additional trainable parameters.
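
The "no backbone changes, minimal trainable parameters" half of the claim can likewise be sketched, reusing the hypothetical `MemoryAdapter` and `forward_with_memory` above. The stand-in encoder (`nn.TransformerEncoderLayer` blocks), the dimensions, and the resulting parameter ratio are illustrative assumptions, not the paper's configuration, so the printed fraction will not match any figure the authors report.

```python
# Hypothetical integration sketch reusing MemoryAdapter / forward_with_memory from
# the sketch above. Stand-in blocks, dimensions, and parameter ratio are illustrative.
import torch
import torch.nn as nn

dim = 768
vision_blocks = nn.ModuleList(
    [nn.TransformerEncoderLayer(d_model=dim, nhead=12, batch_first=True)
     for _ in range(12)]                       # stand-in for a pretrained ViT backbone
)
for p in vision_blocks.parameters():
    p.requires_grad_(False)                    # backbone frozen, architecture untouched

adapter = MemoryAdapter(dim=dim, query_dim=4096)   # only these parameters would train
added = sum(p.numel() for p in adapter.parameters())
frozen = sum(p.numel() for p in vision_blocks.parameters())
print(f"adapter adds {added:,} trainable parameters ({added / frozen:.1%} of the stand-in backbone)")

# Plug-and-play use: the same frozen blocks, now threaded through the adapter.
tokens = torch.randn(2, 196, dim)              # visual tokens (illustrative)
query = torch.randn(2, 4096)                   # query embedding from the language side (illustrative)
out, memory = forward_with_memory(vision_blocks, adapter, tokens, query)
```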

What carries the argument

The stateful memory that accumulates layer-wise visual representations, evolves under query and feature conditioning, and supplies selective feedback injection into token streams.

If this is right

  • Vision encoders can retain more hierarchical detail without retraining the entire backbone.
  • Training cost stays low because only the adapter parameters are updated.
  • The same memory mechanism can be dropped into existing MLLM pipelines as a plug-in.
  • Complex reasoning tasks that rely on fine visual distinctions show measurable gains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same memory pattern could be tested in other hierarchical encoders such as audio or 3D models.
  • If the selective injection proves robust, future designs may replace multi-layer fusion modules entirely with memory-based feedback.
  • The approach suggests that explicit state across layers matters more than simply adding more fusion parameters.

Load-bearing premise

That query-conditioned memory evolution plus selective injection recovers shallow-layer fine details without introducing noise or requiring any backbone changes.

What would settle it

A controlled ablation on a standard multimodal benchmark where replacing Mema with a plain adapter or removing the memory component produces equal or higher scores than the full Mema model.
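
In code terms, and again using the hypothetical sketches above, that ablation is a switch: keep everything fixed and disable the memory feedback. The toy harness below only demonstrates the mechanical toggle on random tensors; variant behavior on an actual benchmark is what the paper would need to report.

```python
# Toy harness for the ablation switch, reusing the hypothetical sketches above.
# Random tensors and tiny dimensions stand in for real data; nothing here is a
# benchmark result.
import torch
import torch.nn as nn

torch.manual_seed(0)
dim, qdim = 64, 32
blocks = nn.ModuleList(
    [nn.Sequential(nn.Linear(dim, dim), nn.GELU()) for _ in range(4)]
)                                          # tiny stand-in for frozen encoder blocks
adapter = MemoryAdapter(dim=dim, query_dim=qdim)

tokens = torch.randn(2, 16, dim)
query = torch.randn(2, qdim)

full_out, _ = forward_with_memory(blocks, adapter, tokens, query)

adapter.inject_ratio = 0.0                 # ablation: memory still evolves, never fed back
ablated_out, _ = forward_with_memory(blocks, adapter, tokens, query)

# On a real benchmark the comparison would be between task scores, not feature norms.
print("feature shift due to memory feedback:", (full_out - ablated_out).abs().mean().item())
```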

Figures

Figures reproduced from arXiv: 2603.00655 by Kean Shi, Liyuan Pan, Ying Liu, Yudong Han.

Figure 2: Layer-wise visualization of attention patterns and feature …
Figure 3: Overview of Mema. Mema maintains a stateful memory that accumulates hierarchical visual representations via HME, and …
Figure 4: Illustration of (a) HME and (b) MVA.
Figure 5: Visualization analysis of behavior of our method.
Figure 6: Effect of additional training data on the LLaVA-v1.5-7B.
Figure 7: Qualitative comparison between LLaVA-v1.5-7B and …
read the original abstract

Multimodal Large Language Models (MLLMs) have achieved remarkable performance by aligning pretrained visual representations with the linguistic knowledge embedded in Large Language Models (LLMs). However, existing approaches typically rely on final-layer visual features or learnable multi-layer fusion, which often fail to sufficiently exploit hierarchical visual cues without explicit cross-layer interaction design. In this work, we propose a Memory-Augmented Adapter (Mema) within the vision encoder. Specifically, Mema maintains a stateful memory that accumulates hierarchical visual representations across layers, with its evolution conditioned on both query embeddings and step-wise visual features. A portion of this memory is selectively injected into token representations via a feedback mechanism, thereby mitigating the attenuation of fine-grained visual cues from shallow layers. Designed as a lightweight and plug-and-play module, Mema integrates seamlessly into pretrained vision encoders without modifying the vanilla backbone architecture. Only a minimal set of additional parameters requires training, enabling adaptive visual feature refinement while reducing training overhead. Extensive experiments across multiple benchmarks demonstrate that Mema consistently improves performance, validating its effectiveness in complex multimodal reasoning tasks. The code have been released at https://github.com/Sisiliu312/Mema.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes Mema, a lightweight Memory-Augmented Adapter inserted into the vision encoder of MLLMs. It maintains a stateful memory module that accumulates hierarchical visual representations across layers; memory evolution is conditioned on both query embeddings and step-wise visual features, with selective injection of memory portions back into token representations via a feedback mechanism. The design is presented as plug-and-play with no modifications to the vanilla backbone and only minimal additional parameters to train, and the abstract claims that extensive experiments show consistent performance improvements on multiple benchmarks for complex multimodal reasoning tasks.

Significance. If the central claims hold, Mema would provide an efficient, low-overhead mechanism for preserving fine-grained hierarchical visual cues in MLLM vision encoders, potentially benefiting tasks that require detailed visual reasoning without requiring full backbone retraining or heavy fusion modules.

major comments (2)
  1. [Abstract and §3 (method)] Abstract and method description: the claim of seamless 'plug-and-play' integration 'without modifying the vanilla backbone architecture' is undermined by the requirement that memory evolution be conditioned on query embeddings. Standard MLLM pipelines extract all visual features before text tokens are processed, so this conditioning implies either an unstated early text injection interface or an external query projection into every vision layer; neither the implementation nor the resulting parameter/compute cost is quantified, making the 'minimal additional parameters' and 'no backbone modification' assertions unverifiable.
  2. [Abstract] Abstract: the central claim that 'Mema consistently improves performance' and 'validat[es] its effectiveness' rests on 'extensive experiments across multiple benchmarks,' yet the provided text supplies no quantitative results, ablation studies, error bars, memory-size specifications, or training details. Without these, the soundness of the performance gains cannot be assessed.
minor comments (1)
  1. [Abstract] Abstract: grammatical error in 'The code have been released' (should be 'has').

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below with clarifications on the Mema design and experimental reporting, and we will revise the paper to improve verifiability while preserving the core claims.

read point-by-point responses
  1. Referee: [Abstract and §3 (method)] Abstract and method description: the claim of seamless 'plug-and-play' integration 'without modifying the vanilla backbone architecture' is undermined by the requirement that memory evolution be conditioned on query embeddings. Standard MLLM pipelines extract all visual features before text tokens are processed, so this conditioning implies either an unstated early text injection interface or an external query projection into every vision layer; neither the implementation nor the resulting parameter/compute cost is quantified, making the 'minimal additional parameters' and 'no backbone modification' assertions unverifiable.

    Authors: The query embeddings are obtained from the text input and projected via a lightweight linear projection layer that is part of the Mema adapter itself; this projection is applied externally to the vision encoder layers without altering any internal weights, layers, or forward pass of the vanilla backbone. The adapter is inserted as a plug-in module around the existing vision transformer blocks. We will revise §3 to explicitly diagram and describe this external query projection step, and we will add a table quantifying the exact additional parameter count (under 1% of the backbone) and compute overhead to substantiate the 'minimal' and 'no backbone modification' claims. revision: yes

  2. Referee: [Abstract] Abstract: the central claim that 'Mema consistently improves performance' and 'validat[es] its effectiveness' rests on 'extensive experiments across multiple benchmarks,' yet the provided text supplies no quantitative results, ablation studies, error bars, memory-size specifications, or training details. Without these, the soundness of the performance gains cannot be assessed.

    Authors: The abstract is a high-level summary; the full manuscript contains all requested details in the Experiments section, including benchmark scores with comparisons, ablation tables on memory size and injection ratios, standard deviations from multiple runs, and full training hyperparameters. To address the concern directly, we will expand the abstract with a concise statement of key quantitative gains (e.g., average improvement across benchmarks) while keeping it within length limits. revision: yes

Circularity Check

0 steps flagged

No circularity: architectural description with no equations or self-referential reductions

full rationale

The paper presents Mema as a lightweight plug-and-play adapter module inserted into the vision encoder. It describes a stateful memory that accumulates hierarchical features with evolution conditioned on query embeddings and step-wise visual features, followed by selective injection. No equations, derivations, or quantitative predictions appear in the text that reduce claimed performance gains to quantities defined by fitted parameters or prior self-citations within the paper. Improvements are asserted via experimental results on benchmarks rather than by construction from the method's own inputs. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked. The derivation chain is therefore self-contained as an independent architectural proposal.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the unproven effectiveness of the new memory mechanism for preserving hierarchical cues; the abstract provides no independent evidence or formal derivation for this effectiveness.

free parameters (1)
  • memory injection portion
    The abstract refers to 'a portion of this memory' being selectively injected, implying at least one tunable design choice whose value is not specified.
axioms (1)
  • domain assumption: Hierarchical visual representations across layers contain complementary fine-grained cues that are attenuated in standard final-layer or simple fusion processing.
    This premise is invoked to motivate the memory design and is treated as background for multimodal models.
invented entities (1)
  • Stateful memory module inside Mema · no independent evidence
    purpose: To accumulate hierarchical visual representations across layers and enable selective feedback injection conditioned on queries and features.
    This is a newly introduced component whose independent validation is not provided in the abstract.

pith-pipeline@v0.9.0 · 5509 in / 1341 out tokens · 44183 ms · 2026-05-15T17:49:14.064715+00:00 · methodology


Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Visual Enhanced Depth Scaling for Multimodal Latent Reasoning

    cs.CV 2026-04 unverdicted novelty 6.0

    Visual replay module and adaptive depth scaling improve multimodal latent reasoning, reaching SOTA benchmarks with faster inference than explicit chain-of-thought methods.

  2. Visual Enhanced Depth Scaling for Multimodal Latent Reasoning

    cs.CV 2026-04 unverdicted novelty 5.0

    Visual replay and depth scaling in latent reasoning produce state-of-the-art multimodal results with faster inference than explicit CoT.

  3. Visual Enhanced Depth Scaling for Multimodal Latent Reasoning

    cs.CV 2026-04 unverdicted novelty 5.0

    A visual replay module combined with adaptive depth scaling improves multimodal latent reasoning, delivering state-of-the-art benchmark results and faster inference than explicit chain-of-thought methods.
