Mema: Memory-Augmented Adapter for Enhanced Vision-Language Understanding
Pith reviewed 2026-05-15 17:49 UTC · model grok-4.3
The pith
A stateful memory inside the vision encoder accumulates hierarchical visual features and selectively feeds them back to preserve fine-grained cues for multimodal reasoning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Mema maintains a stateful memory that accumulates hierarchical visual representations across layers of the vision encoder. Memory evolution is conditioned on both query embeddings and step-wise visual features, and a portion of the memory is selectively injected into token representations through a feedback mechanism. This process mitigates attenuation of fine-grained cues from shallow layers while requiring no changes to the vanilla backbone architecture and only minimal additional trainable parameters.
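The abstract states this mechanism but gives no equations, so the following PyTorch sketch is only one plausible reading of the loop described above: a GRU-style state carried across frozen encoder blocks, updated from pooled step-wise features plus an externally projected query embedding, with a sigmoid gate selecting the injected portion. Every module name and design choice here is an assumption, not the paper's released code.

```python
import torch
import torch.nn as nn

class MemoryAugmentedAdapter(nn.Module):
    """Hypothetical reading of Mema: one state evolved across encoder layers."""

    def __init__(self, dim: int, query_dim: int):
        super().__init__()
        self.query_proj = nn.Linear(query_dim, dim)  # external query projection
        self.update = nn.GRUCell(2 * dim, dim)       # memory evolution step
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
        self.inject = nn.Linear(dim, dim)

    def forward(self, tokens, memory, query):
        # Condition the memory update on pooled step-wise visual features
        # and the projected query embedding.
        step = tokens.mean(dim=1)                              # (B, dim)
        cond = torch.cat([step, self.query_proj(query)], dim=-1)
        memory = self.update(cond, memory)                     # accumulate across layers
        # Selectively inject a gated portion of the memory into every token.
        portion = self.gate(memory).unsqueeze(1)               # the "injection portion"
        tokens = tokens + portion * self.inject(memory).unsqueeze(1)
        return tokens, memory

def encode(blocks, tokens, query, adapter):
    """Run frozen encoder blocks with the adapter interleaved between them;
    the blocks' own weights and forward passes are untouched."""
    memory = tokens.new_zeros(tokens.size(0), tokens.size(-1))
    for block in blocks:
        tokens = block(tokens)
        tokens, memory = adapter(tokens, memory, query)
    return tokens

# Smoke test with stand-in blocks: 2 images, 196 tokens, dim 1024, query dim 4096.
adapter = MemoryAugmentedAdapter(1024, 4096)
out = encode([nn.Identity() for _ in range(4)],
             torch.randn(2, 196, 1024), torch.randn(2, 4096), adapter)
```

A GRU cell is used here only because it is the simplest stateful update; the paper's actual evolution rule could be attention-based or something else entirely.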
What carries the argument
The stateful memory that accumulates layer-wise visual representations, evolves under query and feature conditioning, and supplies selective feedback injection into token streams.
If this is right
- Vision encoders can retain more hierarchical detail without retraining the entire backbone.
- Training cost stays low because only the adapter parameters are updated.
- The same memory mechanism can be dropped into existing MLLM pipelines as a plug-in.
- Complex reasoning tasks that rely on fine visual distinctions show measurable gains.
Where Pith is reading between the lines
- The same memory pattern could be tested in other hierarchical encoders such as audio or 3D models.
- If the selective injection proves robust, future designs may replace multi-layer fusion modules entirely with memory-based feedback.
- The approach suggests that explicit state across layers matters more than simply adding more fusion parameters.
Load-bearing premise
That query-conditioned memory evolution plus selective injection recovers shallow-layer fine details without introducing noise or requiring any backbone changes.
What would settle it
A controlled ablation on a standard multimodal benchmark where replacing Mema with a plain adapter or removing the memory component produces equal or higher scores than the full Mema model.
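As a purely illustrative harness for that test, the sketch below scores the full model against the two ablations named above; `build_variant` and `evaluate` are hypothetical stand-ins for the paper's training and benchmark pipeline, not anything from the released code.

```python
# Hedged sketch: the claim survives only if the full model strictly beats
# both ablations on the same benchmark under the same training budget.
def mema_ablation(build_variant, evaluate, benchmark):
    variants = ["full_mema", "plain_adapter", "memory_removed"]
    scores = {v: evaluate(build_variant(v), benchmark) for v in variants}
    claim_holds = all(scores["full_mema"] > scores[v] for v in variants[1:])
    return scores, claim_holds
```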
read the original abstract
Multimodal Large Language Models (MLLMs) have achieved remarkable performance by aligning pretrained visual representations with the linguistic knowledge embedded in Large Language Models (LLMs). However, existing approaches typically rely on final-layer visual features or learnable multi-layer fusion, which often fail to sufficiently exploit hierarchical visual cues without explicit cross-layer interaction design. In this work, we propose a Memory-Augmented Adapter (Mema) within the vision encoder. Specifically, Mema maintains a stateful memory that accumulates hierarchical visual representations across layers, with its evolution conditioned on both query embeddings and step-wise visual features. A portion of this memory is selectively injected into token representations via a feedback mechanism, thereby mitigating the attenuation of fine-grained visual cues from shallow layers. Designed as a lightweight and plug-and-play module, Mema integrates seamlessly into pretrained vision encoders without modifying the vanilla backbone architecture. Only a minimal set of additional parameters requires training, enabling adaptive visual feature refinement while reducing training overhead. Extensive experiments across multiple benchmarks demonstrate that Mema consistently improves performance, validating its effectiveness in complex multimodal reasoning tasks. The code have been released at https://github.com/Sisiliu312/Mema.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Mema, a lightweight Memory-Augmented Adapter inserted into the vision encoder of MLLMs. It maintains a stateful memory module that accumulates hierarchical visual representations across layers; memory evolution is conditioned on both query embeddings and step-wise visual features, with selective injection of memory portions back into token representations via a feedback mechanism. The design is presented as plug-and-play with no modifications to the vanilla backbone and only minimal additional parameters to train, and the abstract claims that extensive experiments show consistent performance improvements on multiple benchmarks for complex multimodal reasoning tasks.
Significance. If the central claims hold, Mema would provide an efficient, low-overhead mechanism for preserving fine-grained hierarchical visual cues in MLLM vision encoders, potentially benefiting tasks that require detailed visual reasoning without requiring full backbone retraining or heavy fusion modules.
major comments (2)
- [Abstract and §3 (method)] Abstract and method description: the claim of seamless 'plug-and-play' integration 'without modifying the vanilla backbone architecture' is undermined by the requirement that memory evolution be conditioned on query embeddings. Standard MLLM pipelines extract all visual features before text tokens are processed, so this conditioning implies either an unstated early text injection interface or an external query projection into every vision layer; neither the implementation nor the resulting parameter/compute cost is quantified, making the 'minimal additional parameters' and 'no backbone modification' assertions unverifiable.
- [Abstract] Abstract: the central claim that 'Mema consistently improves performance' and 'validat[es] its effectiveness' rests on 'extensive experiments across multiple benchmarks,' yet the provided text supplies no quantitative results, ablation studies, error bars, memory-size specifications, or training details. Without these, the soundness of the performance gains cannot be assessed.
minor comments (1)
- [Abstract] Abstract: grammatical error in 'The code have been released' (should be 'has').
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below with clarifications on the Mema design and experimental reporting, and we will revise the paper to improve verifiability while preserving the core claims.
read point-by-point responses
-
Referee: [Abstract and §3 (method)] Abstract and method description: the claim of seamless 'plug-and-play' integration 'without modifying the vanilla backbone architecture' is undermined by the requirement that memory evolution be conditioned on query embeddings. Standard MLLM pipelines extract all visual features before text tokens are processed, so this conditioning implies either an unstated early text injection interface or an external query projection into every vision layer; neither the implementation nor the resulting parameter/compute cost is quantified, making the 'minimal additional parameters' and 'no backbone modification' assertions unverifiable.
Authors: The query embeddings are obtained from the text input and projected via a lightweight linear projection layer that is part of the Mema adapter itself; this projection is applied externally to the vision encoder layers without altering any internal weights, layers, or forward pass of the vanilla backbone. The adapter is inserted as a plug-in module around the existing vision transformer blocks. We will revise §3 to explicitly diagram and describe this external query projection step, and we will add a table quantifying the exact additional parameter count (under 1% of the backbone) and compute overhead to substantiate the 'minimal' and 'no backbone modification' claims. revision: yes
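To make the parameter claim in this response checkable, here is a runnable back-of-envelope count under assumed dimensions (a ViT-L-sized backbone and a bottlenecked, layer-shared adapter of rank 64); neither the architecture nor the counts are given in the text, so the numbers are illustrative only.

```python
import torch.nn as nn

vision_dim, text_dim, r, depth = 1024, 4096, 64, 24

# Rough parameter count of a frozen ViT-L-sized backbone:
# roughly 12 * dim^2 per block (attention + MLP), 24 blocks ≈ 302M.
backbone_params = depth * 12 * vision_dim**2

# Assumed adapter: everything trainable lives here, shared across layers.
adapter = nn.ModuleDict({
    "query_proj": nn.Linear(text_dim, r),    # external text-query projection
    "down":       nn.Linear(vision_dim, r),  # token summary -> memory space
    "update":     nn.GRUCell(2 * r, r),      # memory evolution
    "gate":       nn.Linear(r, r),           # selects the injected portion
    "up":         nn.Linear(r, vision_dim),  # feedback into token stream
})
adapter_params = sum(p.numel() for p in adapter.parameters())

# With these assumptions the share is ≈ 0.14%, consistent with "under 1%";
# only the promised table can settle it for the real model.
print(f"adapter share of backbone: {adapter_params / backbone_params:.2%}")
```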
-
Referee: [Abstract] Abstract: the central claim that 'Mema consistently improves performance' and 'validat[es] its effectiveness' rests on 'extensive experiments across multiple benchmarks,' yet the provided text supplies no quantitative results, ablation studies, error bars, memory-size specifications, or training details. Without these, the soundness of the performance gains cannot be assessed.
Authors: The abstract is a high-level summary; the full manuscript contains all requested details in the Experiments section, including benchmark scores with comparisons, ablation tables on memory size and injection ratios, standard deviations from multiple runs, and full training hyperparameters. To address the concern directly, we will expand the abstract with a concise statement of key quantitative gains (e.g., average improvement across benchmarks) while keeping it within length limits. revision: yes
Circularity Check
No circularity: architectural description with no equations or self-referential reductions
full rationale
The paper presents Mema as a lightweight plug-and-play adapter module inserted into the vision encoder. It describes a stateful memory that accumulates hierarchical features with evolution conditioned on query embeddings and step-wise visual features, followed by selective injection. No equations, derivations, or quantitative predictions appear in the text that reduce claimed performance gains to quantities defined by fitted parameters or prior self-citations within the paper. Improvements are asserted via experimental results on benchmarks rather than by construction from the method's own inputs. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked. The derivation chain is therefore self-contained as an independent architectural proposal.
Axiom & Free-Parameter Ledger
free parameters (1)
- memory injection portion
axioms (1)
- domain assumption: Hierarchical visual representations across layers contain complementary fine-grained cues that are attenuated in standard final-layer or simple-fusion processing.
invented entities (1)
- Stateful memory module inside Mema (no independent evidence)
Forward citations
Cited by 3 Pith papers
- Visual Enhanced Depth Scaling for Multimodal Latent Reasoning: Visual replay module and adaptive depth scaling improve multimodal latent reasoning, reaching SOTA benchmarks with faster inference than explicit chain-of-thought methods.
- Visual Enhanced Depth Scaling for Multimodal Latent Reasoning: Visual replay and depth scaling in latent reasoning produce state-of-the-art multimodal results with faster inference than explicit CoT.
- Visual Enhanced Depth Scaling for Multimodal Latent Reasoning: A visual replay module combined with adaptive depth scaling improves multimodal latent reasoning, delivering state-of-the-art benchmark results and faster inference than explicit chain-of-thought methods.
Reference graph
Works this paper leans on
- [1] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022.
- [2] Mozhgan Nasr Azadani, James Riddell, Sean Sedwards, and Krzysztof Czarnecki. LEO: Boosting mixture of vision encoders for multimodal large language models. arXiv preprint arXiv:2501.06986, 2025.
- [3] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.
- [4] Yue Cao, Yangzhou Liu, Zhe Chen, Guangchen Shi, Wenhai Wang, Danhuai Zhao, and Tong Lu. MMFuser: Multimodal multi-layer feature fuser for fine-grained vision-language understanding. arXiv preprint arXiv:2410.11829, 2024.
- [5] Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. ShareGPT4V: Improving large multi-modal models with better captions. In European Conference on Computer Vision, pages 370–387. Springer, 2024.
- [6] Xi Chen, Xiao Wang, Soravit Changpinyo, Anthony J Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, et al. PaLI: A jointly-scaled multilingual language-image model. arXiv preprint arXiv:2209.06794, 2022.
- [7] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185–24198, 2024.
- [8] Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade Gordon, Christoph Schuhmann, Ludwig Schmidt, and Jenia Jitsev. Reproducible scaling laws for contrastive language-image learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2818–2829, 2023.
- [9] Wei-Lin Chiang, Zhuohan Li, Ziqing Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality. See https://vicuna.lmsys.org (accessed 14 April 2023), 2(3):6, 2023.
- [10] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. InstructBLIP: Towards general-purpose vision-language models with instruction tuning. Advances in Neural Information Processing Systems, 36:49250–49267, 2023.
- [11] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
- [12] Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. PaLM-E: An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023.
- [13] Yanwen Fang, Yuxi Cai, Jintai Chen, Jingyu Zhao, Guangjian Tian, and Guodong Li. Cross-layer retrospective retrieving via layer attention, 2023.
- [14] Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, et al. MME: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023.
- [15] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6904–6913, 2017.
- [16] Alex Graves. Long short-term memory. Supervised Sequence Labelling with Recurrent Neural Networks, pages 37–45, 2012.
- [17] Yudong Han, Yangyang Guo, Jianhua Yin, Meng Liu, Yupeng Hu, and Liqiang Nie. Focal and composed vision-semantic modeling for visual question answering. In MM '21: ACM Multimedia Conference, Virtual Event, China, October 20-24, 2021, pages 4528–4536. ACM, 2021.
- [18] Yudong Han, Jianhua Yin, Jianlong Wu, Yinwei Wei, and Liqiang Nie. Semantic-aware modular capsule routing for visual question answering. IEEE Trans. Image Process., 32:5537–5549, 2023.
- [19] Yudong Han, Yupeng Hu, Xuemeng Song, Haoyu Tang, Mingzhu Xu, and Liqiang Nie. Exploiting the social-like prior in transformer for visual reasoning. In Thirty-Eighth AAAI Conference on Artificial Intelligence, AAAI 2024, Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence, IAAI 2024, Fourteenth Symposium on Educational Advances i..., 2024.
- [20] Yudong Han, Qingpei Guo, Liyuan Pan, Liu Liu, Yu Guan, and Ming Yang. DynFocus: Dynamic cooperative network empowers LLMs with video understanding. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025, Nashville, TN, USA, June 11-15, 2025, pages 8512–8522. Computer Vision Foundation / IEEE, 2025.
- [21] Zhongzhan Huang, Senwei Liang, Mingfu Liang, and Haizhao Yang. DIANet: Dense-and-implicit attention network, 2019.
- [22] Drew A Hudson and Christopher D Manning. GQA: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6700–6709, 2019.
- [23] Mojan Javaheripi, Sébastien Bubeck, Marah Abdin, Jyoti Aneja, Sebastien Bubeck, Caio César Teodoro Mendes, Weizhu Chen, Allie Del Giorno, Ronen Eldan, Sivakanth Gopi, et al. Phi-2: The surprising power of small language models. Microsoft Research Blog, 1(3):3, 2023.
- [24] Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, and Ying Shan. SEED-Bench: Benchmarking multimodal LLMs with generative comprehension. arXiv preprint arXiv:2307.16125, 2023.
- [25] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning, pages 19730–19742. PMLR, 2023.
- [26] Xu Li, Yi Zheng, Haotian Chen, Xiaolei Chen, Yuxuan Liang, Chenghang Lai, Bin Li, and Xiangyang Xue. Instruction-guided fusion of multi-layer visual features in large vision-language models. Pattern Recognition, 170:111932, 2026.
- [27] Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar, and Yin Tat Lee. Textbooks are all you need II: phi-1.5 technical report. arXiv preprint arXiv:2309.05463, 2023.
- [28] Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 292–305, 2023.
- [29] Yanwei Li, Yuechen Zhang, Chengyao Wang, Zhisheng Zhong, Yixin Chen, Ruihang Chu, Shaoteng Liu, and Jiaya Jia. Mini-Gemini: Mining the potential of multi-modality vision language models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.
- [30] Chenchen Lin, Sanbao Su, Rachel Luo, Yuxiao Chen, Yan Wang, Marco Pavone, and Fei Miao. Text-guided layer fusion mitigates hallucination in multimodal LLMs. arXiv preprint arXiv:2601.03100, 2026.
- [31] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36:34892–34916, 2023.
- [32] Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. MMBench: Is your multi-modal model an all-around player? In European Conference on Computer Vision, pages 216–233. Springer, 2024.
- [33] Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A ConvNet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11976–11986, 2022.
- [34] Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems, 35:2507–2521, 2022.
- [35] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
- [36] Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824, 2023.
- [37] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
- [38] Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards VQA models that can read. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8317–8326, 2019.
- [39] Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, and Saining Xie. Eyes wide shut? Exploring the visual shortcomings of multimodal LLMs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9568–9578, 2024.
- [40] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
- [41] Kaishen Wang, Xun Xia, Jian Liu, Zhang Yi, and Tao He. Strengthening layer interaction via dynamic layer attention. arXiv preprint arXiv:2406.13392, 2024.
- [42] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024.
- [43] Huanjin Yao, Wenhao Wu, Taojiannan Yang, YuXin Song, Mengxi Zhang, Haocheng Feng, Yifan Sun, Zhiheng Li, Wanli Ouyang, and Jingdong Wang. Dense connector for MLLMs. Advances in Neural Information Processing Systems, 37:33108–33140, 2024.
- [44] Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. MM-Vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490, 2023.
- [45] Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9556–9567, 2024.
- [46] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11975–11986, 2023.
- [47] Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.
- [48] Minjie Zhu, Yichen Zhu, Xin Liu, Ning Liu, Zhiyuan Xu, Chaomin Shen, Yaxin Peng, Zhicai Ou, Feifei Feng, and Jian Tang. Mipha: A comprehensive overhaul of multimodal assistant with small language models. arXiv preprint arXiv:2403.06199, 2024.