LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts
Recognition: 2 theorem links · Lean Theorem
Pith reviewed 2026-05-16 02:42 UTC · model grok-4.3
The pith
LogicVista provides a benchmark of 448 visual questions to evaluate logical reasoning in multimodal LLMs across five tasks and nine capabilities.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LogicVista assesses the integrated logical reasoning capabilities of MLLMs in visual contexts across 5 logical reasoning tasks encompassing 9 different capabilities using a sample of 448 multiple-choice questions. Each question is annotated with the correct answer and the human-written reasoning behind the selection, enabling both open-ended and multiple-choice evaluation.
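As a concrete illustration of the dual evaluation modes, a minimal sketch of one benchmark item and the two scoring paths follows; the field names and the judge callback are assumptions made for illustration, not the released LogicVista schema or evaluation code.

from dataclasses import dataclass

@dataclass
class LogicVistaItem:
    image_path: str     # the visual context
    question: str       # e.g. "Which option completes the pattern?"
    choices: list[str]  # multiple-choice options
    answer: str         # annotated correct choice, e.g. "B"
    rationale: str      # human-written reasoning behind the selection
    task: str           # one of the 5 logical reasoning tasks
    capability: str     # one of the 9 capabilities

def multiple_choice_accuracy(items, predicted_choices):
    # Strict choice matching against the annotated answer.
    hits = sum(pred.strip().upper() == item.answer.upper()
               for item, pred in zip(items, predicted_choices))
    return hits / len(items)

def open_ended_score(items, free_form_answers, judge):
    # Open-ended evaluation: a judge (e.g. an LLM grader) scores each
    # free-form answer against the annotated answer and rationale (0 or 1).
    scores = [judge(it.question, it.answer, it.rationale, ans)
              for it, ans in zip(items, free_form_answers)]
    return sum(scores) / len(scores)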
What carries the argument
The LogicVista benchmark, a set of 448 image-based multiple-choice questions with human reasoning annotations that test logical cognition in visual settings.
Load-bearing premise
The 448 questions and their human-written reasoning annotations accurately and comprehensively capture general logical cognition abilities in visual contexts without significant selection bias or coverage gaps.
What would settle it
A demonstration that high-scoring models on LogicVista fail at comparable logical tasks with new images or real-world visual scenarios would show the benchmark does not measure general visual-logic abilities.
Original abstract
We propose LogicVista, an evaluation benchmark that assesses the integrated logical reasoning capabilities of multimodal large language models (MLLMs) in Visual contexts. Recent advancements in MLLMs have demonstrated various fascinating abilities, from crafting poetry based on an image to performing mathematical reasoning. However, there is still a lack of systematic evaluation of MLLMs' proficiency in logical reasoning tasks, which are essential for activities like navigation and puzzle-solving. Thus we evaluate general logical cognition abilities across 5 logical reasoning tasks encompassing 9 different capabilities, using a sample of 448 multiple-choice questions. Each question is annotated with the correct answer and the human-written reasoning behind the selection, enabling both open-ended and multiple-choice evaluation. A total of 8 MLLMs are comprehensively evaluated using LogicVista. Code and Data Available at https://github.com/Yijia-Xiao/LogicVista.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces LogicVista, a new benchmark for assessing the integrated logical reasoning capabilities of multimodal large language models (MLLMs) in visual contexts. It consists of 448 multiple-choice questions spanning 5 logical reasoning tasks that together cover 9 distinct capabilities. Each question includes the correct answer and human-written reasoning annotations to support both multiple-choice and open-ended evaluation. The authors evaluate 8 MLLMs on the benchmark and release the code and data.
Significance. If the questions are shown to be representative and free of major selection bias, LogicVista would fill a clear gap by providing a systematic visual-context benchmark for logical reasoning, an area where current MLLM evaluations remain limited. The release of code, data, and human reasoning annotations is a clear strength that supports reproducibility and further research.
major comments (2)
- [Benchmark Construction] The central claim that the 448 questions comprehensively cover the 9 capabilities across 5 tasks without significant selection bias or gaps is not supported by sufficient methodological detail. The manuscript provides no quantitative breakdown (e.g., number of questions per capability or task), sampling strategy, visual diversity metrics, or validation against external logical-reasoning taxonomies in the benchmark-construction section.
- [Annotation Process] The human-written reasoning annotations are presented as enabling reliable open-ended evaluation, yet no inter-annotator agreement statistics or validation procedure for these annotations are reported, which is load-bearing for claims about the benchmark's utility beyond multiple-choice accuracy.
minor comments (2)
- [Abstract and §3] The abstract states 'a sample of 448 multiple-choice questions' but does not clarify whether this is the full benchmark size or a subset; this should be stated explicitly in the main text.
- [Figures and Tables] Figure captions and table headers should explicitly list the 5 tasks and 9 capabilities to improve readability.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on LogicVista. We agree that additional methodological details are needed to support the claims of comprehensive coverage and annotation reliability. We address each major comment below and will revise the manuscript accordingly.
Point-by-point responses
-
Referee: [Benchmark Construction] The central claim that the 448 questions comprehensively cover the 9 capabilities across 5 tasks without significant selection bias or gaps is not supported by sufficient methodological detail. The manuscript provides no quantitative breakdown (e.g., number of questions per capability or task), sampling strategy, visual diversity metrics, or validation against external logical-reasoning taxonomies in the benchmark-construction section.
Authors: We acknowledge that the current manuscript lacks these details in the benchmark-construction section. In the revised version, we will expand this section to include: a table reporting the exact number of questions per task and per capability (totaling 448), a description of the sampling strategy (stratified selection to ensure balanced coverage of the 9 capabilities without over-representation), quantitative visual diversity metrics (e.g., distribution across image sources, types, and complexity levels), and explicit mapping/validation against established logical-reasoning taxonomies from cognitive science to demonstrate coverage and minimize gaps or bias. revision: yes
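A hedged sketch of the promised per-task and per-capability breakdown, assuming each item carries task and capability labels as in the illustrative schema above; the actual metadata fields may differ.

from collections import Counter

def coverage_breakdown(items):
    # Number of questions per task and per capability; for the full
    # benchmark each set of counts should sum to 448.
    per_task = Counter(item.task for item in items)
    per_capability = Counter(item.capability for item in items)
    return per_task, per_capability

def report(items):
    per_task, per_capability = coverage_breakdown(items)
    for name, count in sorted(per_task.items()):
        print(f"task {name}: {count} questions")
    for name, count in sorted(per_capability.items()):
        print(f"capability {name}: {count} questions")

A table of this form would also make any over-representation of particular capabilities visible at a glance.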
-
Referee: [Annotation Process] The human-written reasoning annotations are presented as enabling reliable open-ended evaluation, yet no inter-annotator agreement statistics or validation procedure for these annotations are reported, which is load-bearing for claims about the benchmark's utility beyond multiple-choice accuracy.
Authors: We agree that inter-annotator agreement statistics and validation details are necessary to substantiate the reliability of the human-written reasoning annotations for open-ended evaluation. In the revision, we will add a dedicated subsection describing the annotation process (including annotator qualifications and guidelines), report inter-annotator agreement metrics (e.g., Fleiss' kappa across reasoning steps), and outline the validation procedure (e.g., review rounds for consistency and accuracy). This will directly support the benchmark's utility claims. revision: yes
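For concreteness, a minimal Fleiss' kappa computation over categorical annotation labels is sketched below; this is the generic formula rather than the authors' validation procedure, and it assumes every item is labeled by the same number of annotators.

def fleiss_kappa(ratings):
    # ratings[i][j] = number of annotators who assigned category j to item i.
    n_items = len(ratings)
    n_raters = sum(ratings[0])   # constant per item by assumption
    n_cats = len(ratings[0])

    # Proportion of all assignments falling in each category.
    p_j = [sum(row[j] for row in ratings) / (n_items * n_raters)
           for j in range(n_cats)]
    # Observed agreement per item.
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in ratings]

    p_bar = sum(p_i) / n_items        # mean observed agreement
    p_e = sum(p * p for p in p_j)     # expected agreement by chance
    return (p_bar - p_e) / (1 - p_e)

# Toy example: 3 annotators judge 4 reasoning annotations as
# valid / invalid / unclear; yields a kappa of about 0.27.
toy = [[3, 0, 0], [2, 1, 0], [0, 3, 0], [1, 1, 1]]
print(fleiss_kappa(toy))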
Circularity Check
No circularity: direct benchmark construction and evaluation with no derivations or self-referential steps
Full rationale
The paper introduces LogicVista as a new benchmark dataset of 448 multiple-choice questions with human annotations, then directly evaluates 8 MLLMs on it across 5 tasks and 9 capabilities. No equations, parameter fitting, predictions derived from inputs, or load-bearing self-citations appear in the provided text. The central claim rests on the explicit creation and application of the dataset rather than any reduction of results to prior fitted values or self-defined constructs. This is a standard empirical benchmark paper with no mathematical derivation chain to inspect for circularity.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: The five logical reasoning tasks and nine capabilities adequately represent general logical cognition in visual contexts.
Forward citations
Cited by 24 Pith papers
-
Reflection Anchors for Propagation-Aware Visual Retention in Long-Chain Multimodal Reasoning
RAPO uses an information-theoretic lower bound on visual gain to select high-entropy reflection anchors and optimizes a chain-masked KL surrogate, delivering gains over baselines on reasoning benchmarks across LVLM backbones.
-
Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding
Visual token pruning in MLLMs fails on complex reasoning due to Relevant Visual Information Shift during decoding, but the DSTP framework fixes it training-free across models.
-
DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning
DeepEyes uses reinforcement learning to teach vision-language models active perception and image-based thinking, yielding gains on perception, reasoning, grounding, and hallucination benchmarks.
-
Reinforcing Multimodal Reasoning Against Visual Degradation
ROMA improves MLLM robustness to seen and unseen visual corruptions by +2.3-2.4% over GRPO on seven reasoning benchmarks while matching clean accuracy.
-
Anisotropic Modality Align
Modality representations share dominant semantic geometry but have an anisotropic residual gap; AnisoAlign corrects source representations boundedly using target geometry for unpaired alignment.
-
Uni-OPD: Unifying On-Policy Distillation with a Dual-Perspective Recipe
Uni-OPD unifies on-policy distillation across LLMs and MLLMs with dual-perspective strategies that promote student exploration and enforce order-consistent teacher supervision based on outcome rewards.
-
Segment-Aligned Policy Optimization for Multi-Modal Reasoning
SAPO introduces segment-level policy optimization using a step-wise MDP abstraction to better align RL updates with reasoning structure in multi-modal LLM tasks.
-
Evian: Towards Explainable Visual Instruction-tuning Data Auditing
EVian decomposes vision-language model responses into three cognitive components and audits them along consistency, coherence, and accuracy axes, showing that a small curated subset outperforms much larger training sets.
-
Visually-Guided Policy Optimization for Multimodal Reasoning
VGPO introduces visual attention compensation and dual-grained advantage re-weighting to reinforce visual focus in VLMs, yielding better activation and performance on multimodal reasoning tasks.
-
Not All Tokens See Equally: Perception-Grounded Policy Optimization for Large Vision-Language Models
PGPO uses KL divergence to quantify token visual dependency and reshapes advantages in RLVR to amplify signals for visually grounded tokens, yielding 18.7% average gains on seven benchmarks.
-
MapTab: Are MLLMs Ready for Multi-Criteria Route Planning in Heterogeneous Graphs?
MapTab benchmark shows current MLLMs struggle with multi-criteria multimodal route planning and that combining vision and language frequently underperforms single-modality approaches.
-
Modality Gap-Driven Subspace Alignment Training Paradigm For Multimodal Large Language Models
ReAlign corrects the modality gap in unpaired data to let MLLMs learn visual distributions from text alone before instruction tuning, reducing dependence on expensive paired corpora.
-
ChartVerse: Scaling Chart Reasoning via Reliable Programmatic Synthesis from Scratch
ChartVerse uses Rollout Posterior Entropy and truth-anchored inverse QA synthesis to produce 640K high-quality chart reasoning samples, training an 8B model that surpasses its 30B teacher.
-
DeepEyesV2: Toward Agentic Multimodal Model
DeepEyesV2 uses a two-stage cold-start plus reinforcement learning pipeline to produce an agentic multimodal model that adaptively invokes tools and outperforms direct RL on real-world reasoning benchmarks.
-
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...
-
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.
-
Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization
Mixed Preference Optimization with the MMPR dataset boosts multimodal CoT reasoning, lifting InternVL2-8B to 67.0 accuracy on MathVista (+8.7 points) and matching the 76B model.
-
GLM-5V-Turbo: Toward a Native Foundation Model for Multimodal Agents
GLM-5V-Turbo integrates multimodal perception as a core part of reasoning and execution for agentic tasks, reporting strong results in visual tool use and multimodal coding while keeping text-only performance competitive.
-
Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models
HDPO reframes tool efficiency as a conditional objective within accurate trajectories, enabling Metis to reduce tool invocations by orders of magnitude while raising reasoning accuracy.
-
Seed1.8 Model Card: Towards Generalized Real-World Agency
Seed1.8 is a new foundation model that adds unified agentic capabilities for search, code execution, and GUI interaction to existing LLM and vision strengths.
-
MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe
An 8B MLLM reaches state-of-the-art efficiency and performance under 30B by combining a unified 3D resampler, joint document-text training, and hybrid RL for reasoning modes.
-
GLM-5V-Turbo: Toward a Native Foundation Model for Multimodal Agents
GLM-5V-Turbo integrates multimodal perception directly into reasoning, planning, tool use, and execution for agents, yielding strong results in multimodal coding and framework-based tasks while keeping text coding com...
-
GLM-5V-Turbo: Toward a Native Foundation Model for Multimodal Agents
GLM-5V-Turbo integrates multimodal perception directly into reasoning and agent workflows, reporting strong results on visual tool use, multimodal coding, and framework-based agent tasks while keeping text coding competitive.
-
Efficient Inference for Large Vision-Language Models: Bottlenecks, Techniques, and Prospects
A survey that taxonomizes efficiency methods for LVLMs across the full inference pipeline, decouples the problem into information density, long-context attention, and memory limits, and outlines four future research f...
Reference graph
Works this paper leans on
-
[1]
OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao, Mohammad Bavarian, Jeff Belgum, Irwan Bello, Jake Berdine, Gabriel Bernadett-Shapiro, Christopher Berner...
work page 2024
-
[2]
Flamingo: a visual language model for few-shot learning, 2022
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski,...
work page 2022
-
[4]
Minigpt-4: Enhancing vision-language understanding with advanced large language models, 2023
Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models, 2023
work page 2023
-
[5]
A survey on multimodal large language models, 2023
Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. A survey on multimodal large language models, 2023
work page 2023
-
[6]
Mme: A comprehensive evaluation benchmark for multimodal large language models, 2023
Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, Yunsheng Wu, and Rongrong Ji. Mme: A comprehensive evaluation benchmark for multimodal large language models, 2023
work page 2023
-
[7]
Pmc-vqa: Visual instruction tuning for medical visual question answering, 2023
Xiaoman Zhang, Chaoyi Wu, Ziheng Zhao, Weixiong Lin, Ya Zhang, Yanfeng Wang, and Weidi Xie. Pmc-vqa: Visual instruction tuning for medical visual question answering, 2023
work page 2023
-
[8]
Vqa: Visual question answering, 2015
Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), December 2015
work page 2015
-
[10]
Mm-vet: Evaluating large multimodal models for integrated capabilities, 2023
Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities, 2023
work page 2023
- [11]
-
[12]
Early developments in logical reasoning: Considering alternative possibilities, 1988
Catherine Sophian and Susan C. Somerville. Early developments in logical reasoning: Considering alternative possibilities. Cognitive Development, 3(2):183–222, 1988
work page 1988
-
[13]
Logical reasoning in formal and everyday reasoning tasks, 2019
Hugo Bronkhorst, Gerrit Roorda, Cor Suhre, and Martin Goedhart. Logical reasoning in formal and everyday reasoning tasks. International Journal of Science and Mathematics Education, Dec 2019
work page 2019
-
[14]
Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts, 2024
Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts, 2024
work page 2024
-
[15]
Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering
Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering. In Conference on Computer Vision and Pattern Recognition (CVPR) , 2017
work page 2017
-
[16]
Microsoft COCO: Common Objects in Context, 2014
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common Objects in Context, page 740–755. Springer International Publishing, 2014
work page 2014
-
[17]
Textcaps: a dataset for image captioning with reading comprehension, 2020
Oleksii Sidorov, Ronghang Hu, Marcus Rohrbach, and Amanpreet Singh. Textcaps: a dataset for image captioning with reading comprehension, 2020
work page 2020
-
[18]
Contextual: Evaluating context-sensitive text-rich visual reasoning in large multimodal models, 2024
Rohan Wadhawan, Hritik Bansal, Kai-Wei Chang, and Nanyun Peng. Contextual: Evaluating context-sensitive text-rich visual reasoning in large multimodal models, 2024
work page 2024
-
[19]
Visit-bench: A benchmark for vision-language instruction following inspired by real-world use, 2023
Yonatan Bitton, Hritik Bansal, Jack Hessel, Rulin Shao, Wanrong Zhu, Anas Awadalla, Josh Gardner, Rohan Taori, and Ludwig Schmidt. Visit-bench: A benchmark for vision-language instruction following inspired by real-world use, 2023
work page 2023
-
[20]
Microsoft coco captions: Data collection and evaluation server, 2015
Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollar, and C. Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server, 2015
work page 2015
-
[21]
Making the v in vqa matter: Elevating the role of image understanding in visual question answering, 2017
Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering, 2017
work page 2017
-
[22]
Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks, 2019
Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks, 2019
work page 2019
-
[23]
Uniter: Universal image-text representation learning, 2020
Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. Uniter: Universal image-text representation learning, 2020
work page 2020
-
[24]
Oscar: Object-semantics aligned pre-training for vision-language tasks, 2020
Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, Yejin Choi, and Jianfeng Gao. Oscar: Object-semantics aligned pre-training for vision-language tasks, 2020
work page 2020
-
[25]
Vilt: Vision-and-language transformer without convolution or region supervision, 2021
Wonjae Kim, Bokyung Son, and Ildoo Kim. Vilt: Vision-and-language transformer without convolution or region supervision, 2021
work page 2021
-
[26]
Simvlm: Simple visual language model pretraining with weak supervision, 2022
Zirui Wang, Jiahui Yu, Adams Wei Yu, Zihang Dai, Yulia Tsvetkov, and Yuan Cao. Simvlm: Simple visual language model pretraining with weak supervision, 2022
work page 2022
-
[27]
Git: A generative image-to-text transformer for vision and language, 2022
Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, and Lijuan Wang. Git: A generative image-to-text transformer for vision and language, 2022
work page 2022
-
[28]
Unitab: Unifying text and box outputs for grounded vision-language modeling, 2022
Zhengyuan Yang, Zhe Gan, Jianfeng Wang, Xiaowei Hu, Faisal Ahmed, Zicheng Liu, Yumao Lu, and Lijuan Wang. Unitab: Unifying text and box outputs for grounded vision-language modeling, 2022
work page 2022
-
[29]
Vision-language pre-training: Basics, recent advances, and future trends, 2022
Zhe Gan, Linjie Li, Chunyuan Li, Lijuan Wang, Zicheng Liu, and Jianfeng Gao. Vision-language pre-training: Basics, recent advances, and future trends, 2022
work page 2022
-
[30]
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwi...
work page 2020
-
[31]
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradb...
work page 2022
-
[32]
Llama: Open and efficient foundation language models, 2023
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models, 2023
work page 2023
-
[33]
Multimodal few-shot learning with frozen language models, 2021
Maria Tsimpoukelli, Jacob Menick, Serkan Cabi, S. M. Ali Eslami, Oriol Vinyals, and Felix Hill. Multimodal few-shot learning with frozen language models, 2021
work page 2021
-
[34]
Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duckworth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, and Pete Florence. Palm-e: An embodied ...
work page 2023
-
[35]
Opt: Open pre-trained transformer language models, 2022
Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. Opt: Open pre-trained transformer language models, 2022
work page 2022
-
[36]
Instruction tuning with gpt-4, 2023
Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao. Instruction tuning with gpt-4, 2023
work page 2023
-
[37]
Openflamingo: An open-source framework for training large autoregressive vision-language models, 2023
Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, Jenia Jitsev, Simon Kornblith, Pang Wei Koh, Gabriel Ilharco, Mitchell Wortsman, and Ludwig Schmidt. Openflamingo: An open-source framework for training large autoregressive vision-language models, 2023
work page 2023
-
[38]
Visual instruction tuning, 2023
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning, 2023
work page 2023
-
[39]
Otter: A multi-modal model with in-context instruction tuning, 2023
Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, and Ziwei Liu. Otter: A multi-modal model with in-context instruction tuning, 2023
work page 2023
-
[40]
Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023
Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023
work page 2023
-
[41]
Multimodal-gpt: A vision and language model for dialogue with humans, 2023
Tao Gong, Chengqi Lyu, Shilong Zhang, Yudong Wang, Miao Zheng, Qian Zhao, Kuikun Liu, Wenwei Zhang, Ping Luo, and Kai Chen. Multimodal-gpt: A vision and language model for dialogue with humans, 2023
work page 2023
-
[42]
mplug-owl: Modularization empowers large language models with multimodality, 2023
Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, Chenliang Li, Yuanhong Xu, Hehong Chen, Junfeng Tian, Qian Qi, Ji Zhang, and Fei Huang. mplug-owl: Modularization empowers large language models with multimodality, 2023
work page 2023
-
[43]
Mm-react: Prompting chatgpt for multimodal reasoning and action, 2023
Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Ehsan Azarnasab, Faisal Ahmed, Zicheng Liu, Ce Liu, Michael Zeng, and Lijuan Wang. Mm-react: Prompting chatgpt for multimodal reasoning and action, 2023
work page 2023
-
[44]
Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face, 2023
Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face, 2023
work page 2023
-
[45]
Assistgpt: A general multi-modal assistant that can plan, execute, inspect, and learn, 2023
Difei Gao, Lei Ji, Luowei Zhou, Kevin Qinghong Lin, Joya Chen, Zihan Fan, and Mike Zheng Shou. Assistgpt: A general multi-modal assistant that can plan, execute, inspect, and learn, 2023
work page 2023
-
[46]
nocaps: novel object captioning at scale
Harsh Agrawal, Karan Desai, Yufei Wang, Xinlei Chen, Rishabh Jain, Mark Johnson, Dhruv Batra, Devi Parikh, Stefan Lee, and Peter Anderson. nocaps: novel object captioning at scale. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV) . IEEE, October 2019
work page 2019
-
[47]
Towards vqa models that can read, 2019
Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read, 2019
work page 2019
-
[48]
Tap: Text-aware pre-training for text-vqa and text-caption, 2020
Zhengyuan Yang, Yijuan Lu, Jianfeng Wang, Xi Yin, Dinei Florencio, Lijuan Wang, Cha Zhang, Lei Zhang, and Jiebo Luo. Tap: Text-aware pre-training for text-vqa and text-caption, 2020
work page 2020
-
[49]
From recognition to cognition: Visual commonsense reasoning, 2019
Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi. From recognition to cognition: Visual commonsense reasoning, 2019
work page 2019
-
[50]
Ok-vqa: A visual question answering benchmark requiring external knowledge, 2019
Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge, 2019
work page 2019
-
[51]
Mmbench: Is your multi-modal model an all-around player?, 2023
Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, Kai Chen, and Dahua Lin. Mmbench: Is your multi-modal model an all-around player?, 2023
work page 2023
-
[52]
Can large language models be an alternative to human evaluations?, 2023
Cheng-Han Chiang and Hung-yi Lee. Can large language models be an alternative to human evaluations?, 2023
work page 2023
-
[53]
G-eval: Nlg evaluation using gpt-4 with better human alignment, 2023
Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-eval: Nlg evaluation using gpt-4 with better human alignment, 2023
work page 2023
-
[54]
Gptscore: Evaluate as you desire, 2023
Jinlan Fu, See-Kiong Ng, Zhengbao Jiang, and Pengfei Liu. Gptscore: Evaluate as you desire, 2023
work page 2023
-
[55]
Mm-soc: Benchmarking multimodal large language models in social media platforms
Yiqiao Jin, Minje Choi, Gaurav Verma, Jindong Wang, and Srijan Kumar. Mm-soc: Benchmarking multimodal large language models in social media platforms. In ACL, 2024
work page 2024
-
[56]
Coauthor: Designing a human-ai collaborative writing dataset for exploring language model capabilities, 2022
Mina Lee, Percy Liang, and Qian Yang. Coauthor: Designing a human-ai collaborative writing dataset for exploring language model capabilities. In CHI Conference on Human Factors in Computing Systems, CHI ’22. ACM, April 2022
work page 2022
-
[57]
Llm is like a box of chocolates: the non-determinism of chatgpt in code generation, 2023
Shuyin Ouyang, Jie M. Zhang, Mark Harman, and Meng Wang. Llm is like a box of chocolates: the non-determinism of chatgpt in code generation, 2023
work page 2023
-
[58]
Llava-next: Improved reasoning, ocr, and world knowledge, January 2024
Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, January 2024
work page 2024
-
[59]
Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models, 2023
Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models, 2023
work page 2023
-
[60]
Pix2struct: Screenshot parsing as pretraining for visual language understanding, 2023
Kenton Lee, Mandar Joshi, Iulia Turc, Hexiang Hu, Fangyu Liu, Julian Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, and Kristina Toutanova. Pix2struct: Screenshot parsing as pretraining for visual language understanding, 2023
work page 2023