hub Canonical reference

Lisa: Reasoning seg- mentation via large language model

Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, Jiaya Jia · 2023 · arXiv 2308.00692

Canonical reference. 75% of citing Pith papers cite this work as background.

16 Pith papers citing it

Background 75% of classified citations

read on arXiv browse 16 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 6 dataset 1 method 1

citation-polarity summary

background 6 support 1 use dataset 1

representative citing papers

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

cs.CV · 2026-01-15 · unverdicted · novelty 8.0

Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.

Vision Harnessing Agent for Open Ad-hoc Segmentation

cs.CV · 2026-05-19 · unverdicted · novelty 7.0

VASA is a vision-guided agent for open ad-hoc segmentation that creates and validates masks through planning, tool use, and error recovery, outperforming baselines on the new PARS benchmark and RefCOCOm.

WildDet3D: Scaling Promptable 3D Detection in the Wild

cs.CV · 2026-04-09 · unverdicted · novelty 7.0

WildDet3D is a promptable 3D detector paired with a new 1M-image dataset across 13.5K categories that sets SOTA on open-world and zero-shot 3D detection benchmarks.

Q-Align: Teaching LMMs for Visual Scoring via Discrete Text-Defined Levels

cs.CV · 2023-12-28 · conditional · novelty 7.0

Q-Align trains LMMs on discrete text-defined levels for visual scoring, achieving SOTA on IQA, IAA, and VQA while unifying the tasks in OneAlign.

Moondream Segmentation: From Words to Masks

cs.CV · 2026-04-03 · unverdicted · novelty 6.0

Moondream Segmentation achieves 80.2% cIoU on RefCOCO by autoregressively decoding paths from referring expressions and using RL to refine masks, plus releases a cleaned RefCOCO-M dataset.

Chat-Scene++: Exploiting Context-Rich Object Identification for 3D LLM

cs.CV · 2026-03-29 · unverdicted · novelty 6.0

Chat-Scene++ improves 3D scene understanding in multimodal LLMs by representing scenes as context-rich object sequences with identifier tokens and grounded chain-of-thought reasoning, reaching state-of-the-art on five benchmarks using pre-trained encoders.

MINGLE: VLMs for Semantically Complex Region Detection in Urban Scenes

cs.CV · 2025-09-16 · unverdicted · novelty 6.0

MINGLE is a modular pipeline that combines off-the-shelf detection tools with VLM reasoning to localize socially connected groups in urban scenes and is supported by a new 100K-image dataset.

MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training

cs.CV · 2024-03-14 · unverdicted · novelty 6.0

MM1 models achieve state-of-the-art few-shot multimodal results by pre-training on a careful mix of image-caption, interleaved, and text-only data with optimized image encoders.

MoE-LLaVA: Mixture of Experts for Large Vision-Language Models

cs.CV · 2024-01-29 · conditional · novelty 6.0

MoE-LLaVA applies mixture-of-experts sparsity to LVLMs via MoE-Tuning, delivering LLaVA-1.5-7B level visual understanding and better hallucination resistance with only ~3B active parameters.

Hallucination of Multimodal Large Language Models: A Survey

cs.CV · 2024-04-29 · accept · novelty 5.0

The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.

Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models

cs.CV · 2024-03-27 · unverdicted · novelty 5.0

Mini-Gemini enhances VLMs via high-resolution visual refinement, curated reasoning data, and self-guided generation to reach leading zero-shot benchmark results across 2B-34B LLMs.

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

cs.CV · 2023-12-21 · unverdicted · novelty 5.0

InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.

How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

cs.CV · 2024-04-25 · unverdicted · novelty 4.0

InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.

MobileVLM V2: Faster and Stronger Baseline for Vision Language Model

cs.CV · 2024-02-06 · unverdicted · novelty 4.0

MobileVLM V2 shows that 1.7B and 3B parameter vision-language models can reach or exceed the performance of 3B and 7B+ models on common VLM benchmarks via targeted design and data improvements.

Improved Baselines with Visual Instruction Tuning

cs.CV · 2023-10-05 · conditional · novelty 4.0

Simple changes to LLaVA using CLIP-ViT-L-336px, an MLP connector, and academic VQA data yield state-of-the-art results on 11 benchmarks with only 1.2M public examples and one-day training on 8 A100 GPUs.

A Survey on Multimodal Large Language Models

cs.CV · 2023-06-23 · accept · novelty 3.0

This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.

citing papers explorer

Showing 16 of 16 citing papers.

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding cs.CV · 2026-01-15 · unverdicted · none · ref 70
Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.
Vision Harnessing Agent for Open Ad-hoc Segmentation cs.CV · 2026-05-19 · unverdicted · none · ref 9
VASA is a vision-guided agent for open ad-hoc segmentation that creates and validates masks through planning, tool use, and error recovery, outperforming baselines on the new PARS benchmark and RefCOCOm.
WildDet3D: Scaling Promptable 3D Detection in the Wild cs.CV · 2026-04-09 · unverdicted · none · ref 25
WildDet3D is a promptable 3D detector paired with a new 1M-image dataset across 13.5K categories that sets SOTA on open-world and zero-shot 3D detection benchmarks.
Q-Align: Teaching LMMs for Visual Scoring via Discrete Text-Defined Levels cs.CV · 2023-12-28 · conditional · none · ref 220
Q-Align trains LMMs on discrete text-defined levels for visual scoring, achieving SOTA on IQA, IAA, and VQA while unifying the tasks in OneAlign.
Moondream Segmentation: From Words to Masks cs.CV · 2026-04-03 · unverdicted · none · ref 10
Moondream Segmentation achieves 80.2% cIoU on RefCOCO by autoregressively decoding paths from referring expressions and using RL to refine masks, plus releases a cleaned RefCOCO-M dataset.
Chat-Scene++: Exploiting Context-Rich Object Identification for 3D LLM cs.CV · 2026-03-29 · unverdicted · none · ref 12
Chat-Scene++ improves 3D scene understanding in multimodal LLMs by representing scenes as context-rich object sequences with identifier tokens and grounded chain-of-thought reasoning, reaching state-of-the-art on five benchmarks using pre-trained encoders.
MINGLE: VLMs for Semantically Complex Region Detection in Urban Scenes cs.CV · 2025-09-16 · unverdicted · none · ref 20
MINGLE is a modular pipeline that combines off-the-shelf detection tools with VLM reasoning to localize socially connected groups in urban scenes and is supported by a new 100K-image dataset.
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training cs.CV · 2024-03-14 · unverdicted · none · ref 56
MM1 models achieve state-of-the-art few-shot multimodal results by pre-training on a careful mix of image-caption, interleaved, and text-only data with optimized image encoders.
MoE-LLaVA: Mixture of Experts for Large Vision-Language Models cs.CV · 2024-01-29 · conditional · none · ref 17
MoE-LLaVA applies mixture-of-experts sparsity to LVLMs via MoE-Tuning, delivering LLaVA-1.5-7B level visual understanding and better hallucination resistance with only ~3B active parameters.
Hallucination of Multimodal Large Language Models: A Survey cs.CV · 2024-04-29 · accept · none · ref 96
The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.
Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models cs.CV · 2024-03-27 · unverdicted · none · ref 39
Mini-Gemini enhances VLMs via high-resolution visual refinement, curated reasoning data, and self-guided generation to reach leading zero-shot benchmark results across 2B-34B LLMs.
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks cs.CV · 2023-12-21 · unverdicted · none · ref 76
InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites cs.CV · 2024-04-25 · unverdicted · none · ref 43
InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.
MobileVLM V2: Faster and Stronger Baseline for Vision Language Model cs.CV · 2024-02-06 · unverdicted · none · ref 33
MobileVLM V2 shows that 1.7B and 3B parameter vision-language models can reach or exceed the performance of 3B and 7B+ models on common VLM benchmarks via targeted design and data improvements.
Improved Baselines with Visual Instruction Tuning cs.CV · 2023-10-05 · conditional · none · ref 26
Simple changes to LLaVA using CLIP-ViT-L-336px, an MLP connector, and academic VQA data yield state-of-the-art results on 11 benchmarks with only 1.2M public examples and one-day training on 8 A100 GPUs.
A Survey on Multimodal Large Language Models cs.CV · 2023-06-23 · accept · none · ref 144
This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.

Lisa: Reasoning seg- mentation via large language model

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer