LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts
Recognition: 2 theorem links · Lean Theorem
Pith reviewed 2026-05-16 02:42 UTC · model grok-4.3
The pith
LogicVista provides a benchmark of 448 visual questions to evaluate logical reasoning in multimodal LLMs across five tasks and nine capabilities.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LogicVista assesses the integrated logical reasoning capabilities of MLLMs in visual contexts across 5 logical reasoning tasks encompassing 9 different capabilities using a sample of 448 multiple-choice questions. Each question is annotated with the correct answer and the human-written reasoning behind the selection, enabling both open-ended and multiple-choice evaluation.
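As a concrete illustration of the dual evaluation modes, a minimal sketch of one benchmark item and the two scoring paths follows; the field names and the judge callback are assumptions made for illustration, not the released LogicVista schema or evaluation code.

from dataclasses import dataclass

@dataclass
class LogicVistaItem:
    image_path: str     # the visual context
    question: str       # e.g. "Which option completes the pattern?"
    choices: list[str]  # multiple-choice options
    answer: str         # annotated correct choice, e.g. "B"
    rationale: str      # human-written reasoning behind the selection
    task: str           # one of the 5 logical reasoning tasks
    capability: str     # one of the 9 capabilities

def multiple_choice_accuracy(items, predicted_choices):
    # Strict choice matching against the annotated answer.
    hits = sum(pred.strip().upper() == item.answer.upper()
               for item, pred in zip(items, predicted_choices))
    return hits / len(items)

def open_ended_score(items, free_form_answers, judge):
    # Open-ended evaluation: a judge (e.g. an LLM grader) scores each
    # free-form answer against the annotated answer and rationale (0 or 1).
    scores = [judge(it.question, it.answer, it.rationale, ans)
              for it, ans in zip(items, free_form_answers)]
    return sum(scores) / len(scores)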
What carries the argument
The LogicVista benchmark, a set of 448 image-based multiple-choice questions with human reasoning annotations that test logical cognition in visual settings.
Load-bearing premise
The 448 questions and their human-written reasoning annotations accurately and comprehensively capture general logical cognition abilities in visual contexts without significant selection bias or coverage gaps.
What would settle it
A demonstration that high-scoring models on LogicVista fail at comparable logical tasks with new images or real-world visual scenarios would show the benchmark does not measure general visual-logic abilities.
Original abstract
We propose LogicVista, an evaluation benchmark that assesses the integrated logical reasoning capabilities of multimodal large language models (MLLMs) in Visual contexts. Recent advancements in MLLMs have demonstrated various fascinating abilities, from crafting poetry based on an image to performing mathematical reasoning. However, there is still a lack of systematic evaluation of MLLMs' proficiency in logical reasoning tasks, which are essential for activities like navigation and puzzle-solving. Thus we evaluate general logical cognition abilities across 5 logical reasoning tasks encompassing 9 different capabilities, using a sample of 448 multiple-choice questions. Each question is annotated with the correct answer and the human-written reasoning behind the selection, enabling both open-ended and multiple-choice evaluation. A total of 8 MLLMs are comprehensively evaluated using LogicVista. Code and Data Available at https://github.com/Yijia-Xiao/LogicVista.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces LogicVista, a new benchmark for assessing the integrated logical reasoning capabilities of multimodal large language models (MLLMs) in visual contexts. It consists of 448 multiple-choice questions spanning 5 logical reasoning tasks that together cover 9 distinct capabilities. Each question includes the correct answer and human-written reasoning annotations to support both multiple-choice and open-ended evaluation. The authors evaluate 8 MLLMs on the benchmark and release the code and data.
Significance. If the questions are shown to be representative and free of major selection bias, LogicVista would fill a clear gap by providing a systematic visual-context benchmark for logical reasoning, an area where current MLLM evaluations remain limited. The release of code, data, and human reasoning annotations is a clear strength that supports reproducibility and further research.
major comments (2)
- [Benchmark Construction] The central claim that the 448 questions comprehensively cover the 9 capabilities across 5 tasks without significant selection bias or gaps is not supported by sufficient methodological detail. The manuscript provides no quantitative breakdown (e.g., number of questions per capability or task), sampling strategy, visual diversity metrics, or validation against external logical-reasoning taxonomies in the benchmark-construction section.
- [Annotation Process] The human-written reasoning annotations are presented as enabling reliable open-ended evaluation, yet no inter-annotator agreement statistics or validation procedure for these annotations are reported, which is load-bearing for claims about the benchmark's utility beyond multiple-choice accuracy.
minor comments (2)
- [Abstract and §3] The abstract states 'a sample of 448 multiple-choice questions' but does not clarify whether this is the full benchmark size or a subset; this should be stated explicitly in the main text.
- [Figures and Tables] Figure captions and table headers should explicitly list the 5 tasks and 9 capabilities to improve readability.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on LogicVista. We agree that additional methodological details are needed to support the claims of comprehensive coverage and annotation reliability. We address each major comment below and will revise the manuscript accordingly.
Point-by-point responses
-
Referee: [Benchmark Construction] The central claim that the 448 questions comprehensively cover the 9 capabilities across 5 tasks without significant selection bias or gaps is not supported by sufficient methodological detail. The manuscript provides no quantitative breakdown (e.g., number of questions per capability or task), sampling strategy, visual diversity metrics, or validation against external logical-reasoning taxonomies in the benchmark-construction section.
Authors: We acknowledge that the current manuscript lacks these details in the benchmark-construction section. In the revised version, we will expand this section to include: a table reporting the exact number of questions per task and per capability (totaling 448), a description of the sampling strategy (stratified selection to ensure balanced coverage of the 9 capabilities without over-representation), quantitative visual diversity metrics (e.g., distribution across image sources, types, and complexity levels), and explicit mapping/validation against established logical-reasoning taxonomies from cognitive science to demonstrate coverage and minimize gaps or bias. revision: yes
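A hedged sketch of the promised per-task and per-capability breakdown, assuming each item carries task and capability labels as in the illustrative schema above; the actual metadata fields may differ.

from collections import Counter

def coverage_breakdown(items):
    # Number of questions per task and per capability; for the full
    # benchmark each set of counts should sum to 448.
    per_task = Counter(item.task for item in items)
    per_capability = Counter(item.capability for item in items)
    return per_task, per_capability

def report(items):
    per_task, per_capability = coverage_breakdown(items)
    for name, count in sorted(per_task.items()):
        print(f"task {name}: {count} questions")
    for name, count in sorted(per_capability.items()):
        print(f"capability {name}: {count} questions")

A table of this form would also make any over-representation of particular capabilities visible at a glance.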
-
Referee: [Annotation Process] The human-written reasoning annotations are presented as enabling reliable open-ended evaluation, yet no inter-annotator agreement statistics or validation procedure for these annotations are reported, which is load-bearing for claims about the benchmark's utility beyond multiple-choice accuracy.
Authors: We agree that inter-annotator agreement statistics and validation details are necessary to substantiate the reliability of the human-written reasoning annotations for open-ended evaluation. In the revision, we will add a dedicated subsection describing the annotation process (including annotator qualifications and guidelines), report inter-annotator agreement metrics (e.g., Fleiss' kappa across reasoning steps), and outline the validation procedure (e.g., review rounds for consistency and accuracy). This will directly support the benchmark's utility claims. revision: yes
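For concreteness, a minimal Fleiss' kappa computation over categorical annotation labels is sketched below; this is the generic formula rather than the authors' validation procedure, and it assumes every item is labeled by the same number of annotators.

def fleiss_kappa(ratings):
    # ratings[i][j] = number of annotators who assigned category j to item i.
    n_items = len(ratings)
    n_raters = sum(ratings[0])   # constant per item by assumption
    n_cats = len(ratings[0])

    # Proportion of all assignments falling in each category.
    p_j = [sum(row[j] for row in ratings) / (n_items * n_raters)
           for j in range(n_cats)]
    # Observed agreement per item.
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in ratings]

    p_bar = sum(p_i) / n_items        # mean observed agreement
    p_e = sum(p * p for p in p_j)     # expected agreement by chance
    return (p_bar - p_e) / (1 - p_e)

# Toy example: 3 annotators judge 4 reasoning annotations as
# valid / invalid / unclear; yields a kappa of about 0.27.
toy = [[3, 0, 0], [2, 1, 0], [0, 3, 0], [1, 1, 1]]
print(fleiss_kappa(toy))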
Circularity Check
No circularity: direct benchmark construction and evaluation with no derivations or self-referential steps
Full rationale
The paper introduces LogicVista as a new benchmark dataset of 448 multiple-choice questions with human annotations, then directly evaluates 8 MLLMs on it across 5 tasks and 9 capabilities. No equations, parameter fitting, predictions derived from inputs, or load-bearing self-citations appear in the provided text. The central claim rests on the explicit creation and application of the dataset rather than any reduction of results to prior fitted values or self-defined constructs. This is a standard empirical benchmark paper with no mathematical derivation chain to inspect for circularity.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: The five logical reasoning tasks and nine capabilities adequately represent general logical cognition in visual contexts.
Forward citations
Cited by 24 Pith papers
-
Reflection Anchors for Propagation-Aware Visual Retention in Long-Chain Multimodal Reasoning
RAPO uses an information-theoretic lower bound on visual gain to select high-entropy reflection anchors and optimizes a chain-masked KL surrogate, delivering gains over baselines on reasoning benchmarks across LVLM backbones.
-
Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding
Visual token pruning in MLLMs fails on complex reasoning due to Relevant Visual Information Shift during decoding, but the DSTP framework fixes it training-free across models.
-
DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning
DeepEyes uses reinforcement learning to teach vision-language models active perception and image-based thinking, yielding gains on perception, reasoning, grounding, and hallucination benchmarks.
-
Reinforcing Multimodal Reasoning Against Visual Degradation
ROMA improves MLLM robustness to seen and unseen visual corruptions by +2.3-2.4% over GRPO on seven reasoning benchmarks while matching clean accuracy.
-
Anisotropic Modality Align
Modality representations share dominant semantic geometry but have an anisotropic residual gap; AnisoAlign corrects source representations boundedly using target geometry for unpaired alignment.
-
Uni-OPD: Unifying On-Policy Distillation with a Dual-Perspective Recipe
Uni-OPD unifies on-policy distillation across LLMs and MLLMs with dual-perspective strategies that promote student exploration and enforce order-consistent teacher supervision based on outcome rewards.
-
Segment-Aligned Policy Optimization for Multi-Modal Reasoning
SAPO introduces segment-level policy optimization using a step-wise MDP abstraction to better align RL updates with reasoning structure in multi-modal LLM tasks.
-
Evian: Towards Explainable Visual Instruction-tuning Data Auditing
EVian decomposes vision-language model responses into three cognitive components and audits them along consistency, coherence, and accuracy axes, showing that a small curated subset outperforms much larger training sets.
-
Visually-Guided Policy Optimization for Multimodal Reasoning
VGPO introduces visual attention compensation and dual-grained advantage re-weighting to reinforce visual focus in VLMs, yielding better activation and performance on multimodal reasoning tasks.
-
Not All Tokens See Equally: Perception-Grounded Policy Optimization for Large Vision-Language Models
PGPO uses KL divergence to quantify token visual dependency and reshapes advantages in RLVR to amplify signals for visually grounded tokens, yielding 18.7% average gains on seven benchmarks.
-
MapTab: Are MLLMs Ready for Multi-Criteria Route Planning in Heterogeneous Graphs?
MapTab benchmark shows current MLLMs struggle with multi-criteria multimodal route planning and that combining vision and language frequently underperforms single-modality approaches.
-
Modality Gap-Driven Subspace Alignment Training Paradigm For Multimodal Large Language Models
ReAlign corrects the modality gap in unpaired data to let MLLMs learn visual distributions from text alone before instruction tuning, reducing dependence on expensive paired corpora.
-
ChartVerse: Scaling Chart Reasoning via Reliable Programmatic Synthesis from Scratch
ChartVerse uses Rollout Posterior Entropy and truth-anchored inverse QA synthesis to produce 640K high-quality chart reasoning samples, training an 8B model that surpasses its 30B teacher.
-
DeepEyesV2: Toward Agentic Multimodal Model
DeepEyesV2 uses a two-stage cold-start plus reinforcement learning pipeline to produce an agentic multimodal model that adaptively invokes tools and outperforms direct RL on real-world reasoning benchmarks.
-
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...
-
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.
-
Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization
Mixed Preference Optimization with the MMPR dataset boosts multimodal CoT reasoning, lifting InternVL2-8B to 67.0 accuracy on MathVista (+8.7 points) and matching the 76B model.
-
GLM-5V-Turbo: Toward a Native Foundation Model for Multimodal Agents
GLM-5V-Turbo integrates multimodal perception as a core part of reasoning and execution for agentic tasks, reporting strong results in visual tool use and multimodal coding while keeping text-only performance competitive.
-
Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models
HDPO reframes tool efficiency as a conditional objective within accurate trajectories, enabling Metis to reduce tool invocations by orders of magnitude while raising reasoning accuracy.
-
Seed1.8 Model Card: Towards Generalized Real-World Agency
Seed1.8 is a new foundation model that adds unified agentic capabilities for search, code execution, and GUI interaction to existing LLM and vision strengths.
-
MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe
An 8B MLLM reaches state-of-the-art efficiency and performance under 30B by combining a unified 3D resampler, joint document-text training, and hybrid RL for reasoning modes.
-
GLM-5V-Turbo: Toward a Native Foundation Model for Multimodal Agents
GLM-5V-Turbo integrates multimodal perception directly into reasoning, planning, tool use, and execution for agents, yielding strong results in multimodal coding and framework-based tasks while keeping text coding com...
-
GLM-5V-Turbo: Toward a Native Foundation Model for Multimodal Agents
GLM-5V-Turbo integrates multimodal perception directly into reasoning and agent workflows, reporting strong results on visual tool use, multimodal coding, and framework-based agent tasks while keeping text coding competitive.
-
Efficient Inference for Large Vision-Language Models: Bottlenecks, Techniques, and Prospects
A survey that taxonomizes efficiency methods for LVLMs across the full inference pipeline, decouples the problem into information density, long-context attention, and memory limits, and outlines four future research f...
Reference graph
Works this paper leans on
-
[1]
OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao, Mohammad Bavarian, Jeff Belgum, Irwan Bello, Jake Berdine, Gabriel Bernadett-Shapiro, Christopher Berner...
work page 2024
-
[2]
Flamingo: a visual language model for few-shot learning, 2022
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski,...
work page 2022
-
[4]
Minigpt-4: Enhancing vision-language understanding with advanced large language models, 2023
Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models, 2023
work page 2023
-
[5]
A survey on multimodal large language models, 2023
Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. A survey on multimodal large language models, 2023
work page 2023
-
[6]
Mme: A comprehensive evaluation benchmark for multimodal large language models, 2023
Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, Yunsheng Wu, and Rongrong Ji. Mme: A comprehensive evaluation benchmark for multimodal large language models, 2023
work page 2023
-
[7]
Pmc-vqa: Visual instruction tuning for medical visual question answering, 2023
Xiaoman Zhang, Chaoyi Wu, Ziheng Zhao, Weixiong Lin, Ya Zhang, Yanfeng Wang, and Weidi Xie. Pmc-vqa: Visual instruction tuning for medical visual question answering, 2023
work page 2023
-
[8]
Vqa: Visual question answering, 2015
Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), December 2015
work page 2015
-
[10]
Mm-vet: Evaluating large multimodal models for integrated capabilities, 2023
Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities, 2023
work page 2023
- [11]
-
[12]
Early developments in logical reasoning: Considering alternative possibilities, 1988
Catherine Sophian and Susan C. Somerville. Early developments in logical reasoning: Considering alternative possibilities. Cognitive Development, 3(2):183–222, 1988
work page 1988
-
[13]
Logical reasoning in formal and everyday reasoning tasks, 2019
Hugo Bronkhorst, Gerrit Roorda, Cor Suhre, and Martin Goedhart. Logical reasoning in formal and everyday reasoning tasks. International Journal of Science and Mathematics Education, Dec 2019
work page 2019
-
[14]
Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts, 2024
Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts, 2024
work page 2024
-
[15]
Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering
Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering. In Conference on Computer Vision and Pattern Recognition (CVPR) , 2017
work page 2017
-
[16]
Microsoft COCO: Common Objects in Context, 2014
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common Objects in Context, page 740–755. Springer International Publishing, 2014
work page 2014
-
[17]
Textcaps: a dataset for image captioning with reading comprehension, 2020
Oleksii Sidorov, Ronghang Hu, Marcus Rohrbach, and Amanpreet Singh. Textcaps: a dataset for image captioning with reading comprehension, 2020
work page 2020
-
[18]
Contextual: Evaluating context-sensitive text-rich visual reasoning in large multimodal models, 2024
Rohan Wadhawan, Hritik Bansal, Kai-Wei Chang, and Nanyun Peng. Contextual: Evaluating context-sensitive text-rich visual reasoning in large multimodal models, 2024
work page 2024
-
[19]
Visit-bench: A benchmark for vision-language instruction following inspired by real-world use, 2023
Yonatan Bitton, Hritik Bansal, Jack Hessel, Rulin Shao, Wanrong Zhu, Anas Awadalla, Josh Gardner, Rohan Taori, and Ludwig Schmidt. Visit-bench: A benchmark for vision-language instruction following inspired by real-world use, 2023
work page 2023
-
[20]
Microsoft coco captions: Data collection and evaluation server, 2015
Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollar, and C. Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server, 2015
work page 2015
-
[21]
Making the v in vqa matter: Elevating the role of image understanding in visual question answering, 2017
Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering, 2017
work page 2017
-
[22]
Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks, 2019
Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks, 2019
work page 2019
-
[23]
Uniter: Universal image-text representation learning, 2020
Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. Uniter: Universal image-text representation learning, 2020
work page 2020
-
[24]
Oscar: Object-semantics aligned pre-training for vision-language tasks, 2020
Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, Yejin Choi, and Jianfeng Gao. Oscar: Object-semantics aligned pre-training for vision-language tasks, 2020
work page 2020
-
[25]
Vilt: Vision-and-language transformer without convolution or region supervision, 2021
Wonjae Kim, Bokyung Son, and Ildoo Kim. Vilt: Vision-and-language transformer without convolution or region supervision, 2021
work page 2021
-
[26]
Simvlm: Simple visual language model pretraining with weak supervision, 2022
Zirui Wang, Jiahui Yu, Adams Wei Yu, Zihang Dai, Yulia Tsvetkov, and Yuan Cao. Simvlm: Simple visual language model pretraining with weak supervision, 2022
work page 2022
-
[27]
Git: A generative image-to-text transformer for vision and language, 2022
Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, and Lijuan Wang. Git: A generative image-to-text transformer for vision and language, 2022
work page 2022
-
[28]
Unitab: Unifying text and box outputs for grounded vision-language modeling, 2022
Zhengyuan Yang, Zhe Gan, Jianfeng Wang, Xiaowei Hu, Faisal Ahmed, Zicheng Liu, Yumao Lu, and Lijuan Wang. Unitab: Unifying text and box outputs for grounded vision-language modeling, 2022
work page 2022
-
[29]
Vision-language pre-training: Basics, recent advances, and future trends, 2022
Zhe Gan, Linjie Li, Chunyuan Li, Lijuan Wang, Zicheng Liu, and Jianfeng Gao. Vision-language pre-training: Basics, recent advances, and future trends, 2022
work page 2022
-
[30]
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwi...
work page 2020
-
[31]
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradb...
work page 2022
-
[32]
Llama: Open and efficient foundation language models, 2023
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models, 2023
work page 2023
-
[33]
Multimodal few-shot learning with frozen language models, 2021
Maria Tsimpoukelli, Jacob Menick, Serkan Cabi, S. M. Ali Eslami, Oriol Vinyals, and Felix Hill. Multimodal few-shot learning with frozen language models, 2021
work page 2021
-
[34]
Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duckworth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, and Pete Florence. Palm-e: An embodied ...
work page 2023
-
[35]
Opt: Open pre-trained transformer language models, 2022
Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. Opt: Open pre-trained transformer language models, 2022
work page 2022
-
[36]
Instruction tuning with gpt-4, 2023
Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao. Instruction tuning with gpt-4, 2023
work page 2023
-
[37]
Openflamingo: An open-source framework for training large autoregressive vision-language models, 2023
Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, Jenia Jitsev, Simon Kornblith, Pang Wei Koh, Gabriel Ilharco, Mitchell Wortsman, and Ludwig Schmidt. Openflamingo: An open-source framework for training large autoregressive vision-language models, 2023
work page 2023
-
[38]
Visual instruction tuning, 2023
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning, 2023
work page 2023
-
[39]
Otter: A multi-modal model with in-context instruction tuning, 2023
Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, and Ziwei Liu. Otter: A multi-modal model with in-context instruction tuning, 2023
work page 2023
-
[40]
Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023
Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023
work page 2023
-
[41]
Multimodal-gpt: A vision and language model for dialogue with humans, 2023
Tao Gong, Chengqi Lyu, Shilong Zhang, Yudong Wang, Miao Zheng, Qian Zhao, Kuikun Liu, Wenwei Zhang, Ping Luo, and Kai Chen. Multimodal-gpt: A vision and language model for dialogue with humans, 2023
work page 2023
-
[42]
mplug-owl: Modularization empowers large language models with multimodality, 2023
Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, Chenliang Li, Yuanhong Xu, Hehong Chen, Junfeng Tian, Qian Qi, Ji Zhang, and Fei Huang. mplug-owl: Modularization empowers large language models with multimodality, 2023
work page 2023
-
[43]
Mm-react: Prompting chatgpt for multimodal reasoning and action, 2023
Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Ehsan Azarnasab, Faisal Ahmed, Zicheng Liu, Ce Liu, Michael Zeng, and Lijuan Wang. Mm-react: Prompting chatgpt for multimodal reasoning and action, 2023
work page 2023
-
[44]
Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face, 2023
Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face, 2023
work page 2023
-
[45]
Assistgpt: A general multi-modal assistant that can plan, execute, inspect, and learn, 2023
Difei Gao, Lei Ji, Luowei Zhou, Kevin Qinghong Lin, Joya Chen, Zihan Fan, and Mike Zheng Shou. Assistgpt: A general multi-modal assistant that can plan, execute, inspect, and learn, 2023
work page 2023
-
[46]
nocaps: novel object captioning at scale
Harsh Agrawal, Karan Desai, Yufei Wang, Xinlei Chen, Rishabh Jain, Mark Johnson, Dhruv Batra, Devi Parikh, Stefan Lee, and Peter Anderson. nocaps: novel object captioning at scale. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV) . IEEE, October 2019
work page 2019
-
[47]
Towards vqa models that can read, 2019
Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read, 2019
work page 2019
-
[48]
Tap: Text-aware pre-training for text-vqa and text-caption, 2020
Zhengyuan Yang, Yijuan Lu, Jianfeng Wang, Xi Yin, Dinei Florencio, Lijuan Wang, Cha Zhang, Lei Zhang, and Jiebo Luo. Tap: Text-aware pre-training for text-vqa and text-caption, 2020
work page 2020
-
[49]
From recognition to cognition: Visual commonsense reasoning, 2019
Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi. From recognition to cognition: Visual commonsense reasoning, 2019
work page 2019
-
[50]
Ok-vqa: A visual question answering benchmark requiring external knowledge, 2019
Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge, 2019
work page 2019
-
[51]
Mmbench: Is your multi-modal model an all-around player?, 2023
Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, Kai Chen, and Dahua Lin. Mmbench: Is your multi-modal model an all-around player?, 2023
work page 2023
-
[52]
Can large language models be an alternative to human evaluations?, 2023
Cheng-Han Chiang and Hung-yi Lee. Can large language models be an alternative to human evaluations?, 2023
work page 2023
-
[53]
G-eval: Nlg evaluation using gpt-4 with better human alignment, 2023
Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-eval: Nlg evaluation using gpt-4 with better human alignment, 2023
work page 2023
-
[54]
Gptscore: Evaluate as you desire, 2023
Jinlan Fu, See-Kiong Ng, Zhengbao Jiang, and Pengfei Liu. Gptscore: Evaluate as you desire, 2023
work page 2023
-
[55]
Mm-soc: Benchmarking multimodal large language models in social media platforms
Yiqiao Jin, Minje Choi, Gaurav Verma, Jindong Wang, and Srijan Kumar. Mm-soc: Benchmarking multimodal large language models in social media platforms. In ACL, 2024
work page 2024
-
[56]
Coauthor: Designing a human-ai collaborative writing dataset for exploring language model capabilities, 2022
Mina Lee, Percy Liang, and Qian Yang. Coauthor: Designing a human-ai collaborative writing dataset for exploring language model capabilities. In CHI Conference on Human Factors in Computing Systems, CHI ’22. ACM, April 2022
work page 2022
-
[57]
Llm is like a box of chocolates: the non-determinism of chatgpt in code generation, 2023
Shuyin Ouyang, Jie M. Zhang, Mark Harman, and Meng Wang. Llm is like a box of chocolates: the non-determinism of chatgpt in code generation, 2023
work page 2023
-
[58]
Llava-next: Improved reasoning, ocr, and world knowledge, January 2024
Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, January 2024
work page 2024
-
[59]
Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models, 2023
Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models, 2023
work page 2023
-
[60]
Pix2struct: Screenshot parsing as pretraining for visual language understanding, 2023
Kenton Lee, Mandar Joshi, Iulia Turc, Hexiang Hu, Fangyu Liu, Julian Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, and Kristina Toutanova. Pix2struct: Screenshot parsing as pretraining for visual language understanding, 2023
work page 2023