hub Mixed citations

CogVLM: Visual Expert for Pretrained Language Models

Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang · 2023 · cs.CV · arXiv 2311.03079

Mixed citation behavior. Most common role is background (65%).

45 Pith papers citing it

Background 65% of classified citations

open full Pith review browse 45 citing papers arXiv PDF

abstract

We introduce CogVLM, a powerful open-source visual language foundation model. Different from the popular shallow alignment method which maps image features into the input space of language model, CogVLM bridges the gap between the frozen pretrained language model and image encoder by a trainable visual expert module in the attention and FFN layers. As a result, CogVLM enables deep fusion of vision language features without sacrificing any performance on NLP tasks. CogVLM-17B achieves state-of-the-art performance on 10 classic cross-modal benchmarks, including NoCaps, Flicker30k captioning, RefCOCO, RefCOCO+, RefCOCOg, Visual7W, GQA, ScienceQA, VizWiz VQA and TDIUC, and ranks the 2nd on VQAv2, OKVQA, TextVQA, COCO captioning, etc., surpassing or matching PaLI-X 55B. Codes and checkpoints are available at https://github.com/THUDM/CogVLM.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 11 baseline 3 method 3

citation-polarity summary

background 11 baseline 3 use method 3

representative citing papers

ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection

cs.CL · 2024-10-06 · unverdicted · novelty 8.0

ErrorRadar is a new benchmark of 2,500 multimodal K-12 math problems for MLLM error step identification and categorization, where GPT-4o trails human experts by ~10%.

MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

cs.CL · 2023-11-27 · unverdicted · novelty 8.0

MMMU provides 11.5K heterogeneous college-level multimodal questions that current models solve at 56-59% accuracy, establishing a new standard for expert multimodal evaluation.

Single-Sample Black-Box Membership Inference Attack against Vision-Language Models via Cross-modal Semantic Alignment

cs.CV · 2026-05-17 · unverdicted · novelty 7.0

A cross-modal alignment attack achieves AUC 0.821 for single-sample black-box membership inference on VLMs such as LLaVA-1.5 by quantifying image-generated caption similarity.

AnchorSeg: Language Grounded Query Banks for Reasoning Segmentation

cs.CV · 2026-04-20 · unverdicted · novelty 7.0

AnchorSeg uses ordered query banks of latent reasoning tokens plus a spatial anchor token and a Token-Mask Cycle Consistency loss to achieve 67.7% gIoU and 68.1% cIoU on the ReasonSeg benchmark.

MMR-AD: A Large-Scale Multimodal Dataset for Benchmarking General Anomaly Detection with Multimodal Large Language Models

cs.CV · 2026-04-13 · unverdicted · novelty 7.0

MMR-AD is a new benchmark dataset showing that current generalist MLLMs lag industrial needs for anomaly detection, with Anomaly-R1 delivering better results through reasoning and RL.

UIPress: Bringing Optical Token Compression to UI-to-Code Generation

cs.CL · 2026-04-10 · unverdicted · novelty 7.0

UIPress is the first encoder-side learned optical compression method for UI-to-Code that compresses visual tokens to 256, outperforming the uncompressed baseline by 7.5% CLIP score and the best inference-time baseline by 4.6% while delivering 9.1x TTFT speedup.

From Street View to Visual Network: Mapping the Visibility of Urban Landmarks with Vision-Language Models

cs.CV · 2025-05-17 · unverdicted · novelty 7.0

Reformulates landmark visibility assessment as VLM-based detection on street view images and builds a heterogeneous visibility graph to map visual connections among landmarks and urban spaces.

AdaMMS: Model Merging for Heterogeneous Multimodal Large Language Models with Unsupervised Coefficient Optimization

cs.CL · 2025-03-31 · unverdicted · novelty 7.0

AdaMMS merges heterogeneous MLLMs via architecture mapping, linear weight interpolation, and unsupervised hyper-parameter search, outperforming prior methods on vision-language benchmarks as the first such approach without labeled data.

OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning

cs.CV · 2024-12-31 · accept · novelty 7.0

OCRBench v2 is a new benchmark with four times more tasks than prior versions that reveals most large multimodal models score below 50 out of 100 on visual text tasks and share five specific weaknesses.

HumanVBench: Probing Human-Centric Video Understanding in MLLMs with Automatically Synthesized Benchmarks

cs.CV · 2024-12-23 · unverdicted · novelty 7.0

HumanVBench provides a 16-task benchmark for human-centric video understanding in MLLMs, created through automated annotation and distractor synthesis pipelines, and shows top models lag human performance on emotion perception and cross-modal alignment.

Perturb and Recover: Fine-tuning for Effective Backdoor Removal from CLIP

cs.LG · 2024-12-01 · conditional · novelty 7.0

PAR fine-tunes CLIP to remove backdoors from structured triggers while preserving standard performance, and works even with only synthetic image-text pairs.

VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents

cs.IR · 2024-10-14 · conditional · novelty 7.0

VisRAG achieves 20-40% better end-to-end performance than text-based RAG by directly embedding and retrieving document images with VLMs.

LVBench: An Extreme Long Video Understanding Benchmark

cs.CV · 2024-06-12 · accept · novelty 7.0

LVBench is a new benchmark for extreme long video understanding that evaluates multimodal large language models on hour-scale videos using tasks designed to probe extended memory and comprehension.

Cracking the Code of Juxtaposition: Can AI Models Understand the Humorous Contradictions

cs.CL · 2024-05-29 · unverdicted · novelty 7.0

Introduces YesBut benchmark showing state-of-the-art multimodal models lag humans on interpreting humorous contradictions in comics.

State Beyond Appearance: Diagnosing and Improving State Consistency in Dial-Based Measurement Reading

cs.CV · 2026-04-29 · unverdicted · novelty 6.0

MLLMs ignore dial state geometry and cluster by appearance, causing inconsistency under variations; TriSCA's state-distance alignment, metadata supervision, and objective alignment improve robustness on clock and gauge benchmarks.

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

cs.CV · 2025-08-25 · unverdicted · novelty 6.0

InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and agentic tasks.

GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

cs.CV · 2025-07-01 · unverdicted · novelty 6.0

GLM-4.5V reaches state-of-the-art results on 42 multimodal benchmarks among open-source models of similar size by applying reinforcement learning with curriculum sampling to a strong vision foundation model.

InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners

cs.AI · 2025-04-19 · unverdicted · novelty 6.0

InfiGUI-R1 uses Reasoning Injection via spatial distillation followed by Deliberation Enhancement via RL to evolve GUI agents from reactive actors to deliberative reasoners, reporting strong performance on grounding and trajectory tasks.

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

cs.CV · 2025-04-14 · conditional · novelty 6.0

InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.

When 'YES' Meets 'BUT': Can Large Models Comprehend Contradictory Humor Through Comparative Reasoning?

cs.CV · 2025-03-29 · unverdicted · novelty 6.0

Presents YesBut (V2) benchmark and shows state-of-the-art VLMs significantly underperform humans on tasks requiring comparative reasoning for contradictory humor in comics.

MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models

cs.CV · 2025-01-06 · unverdicted · novelty 6.0

MotionBench is a new benchmark showing poor fine-grained motion understanding in VLMs and proposes TE Fusion to improve performance with higher frame rates.

VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation

cs.CV · 2024-12-30 · unverdicted · novelty 6.0

VisionReward learns multi-dimensional human preferences for image and video generation via hierarchical assessment and linear weighting, outperforming VideoScore by 17.2% in prediction accuracy and yielding 31.6% higher win rates in text-to-video models.

MM-MoralBench: A MultiModal Moral Evaluation Benchmark for Large Vision-Language Models

cs.CV · 2024-12-30 · unverdicted · novelty 6.0

MM-MoralBench is a new multimodal benchmark that evaluates over 20 LVLMs on moral judgment, classification, and response tasks, finding pronounced divergence from human consensus and limited benefits from scaling or chain-of-thought reasoning.

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

cs.CV · 2024-12-06 · unverdicted · novelty 6.0

InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.

citing papers explorer

Showing 45 of 45 citing papers.

ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection cs.CL · 2024-10-06 · unverdicted · none · ref 62 · internal anchor
ErrorRadar is a new benchmark of 2,500 multimodal K-12 math problems for MLLM error step identification and categorization, where GPT-4o trails human experts by ~10%.
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI cs.CL · 2023-11-27 · unverdicted · none · ref 78 · internal anchor
MMMU provides 11.5K heterogeneous college-level multimodal questions that current models solve at 56-59% accuracy, establishing a new standard for expert multimodal evaluation.
Single-Sample Black-Box Membership Inference Attack against Vision-Language Models via Cross-modal Semantic Alignment cs.CV · 2026-05-17 · unverdicted · none · ref 56 · internal anchor
A cross-modal alignment attack achieves AUC 0.821 for single-sample black-box membership inference on VLMs such as LLaVA-1.5 by quantifying image-generated caption similarity.
AnchorSeg: Language Grounded Query Banks for Reasoning Segmentation cs.CV · 2026-04-20 · unverdicted · none · ref 206 · internal anchor
AnchorSeg uses ordered query banks of latent reasoning tokens plus a spatial anchor token and a Token-Mask Cycle Consistency loss to achieve 67.7% gIoU and 68.1% cIoU on the ReasonSeg benchmark.
MMR-AD: A Large-Scale Multimodal Dataset for Benchmarking General Anomaly Detection with Multimodal Large Language Models cs.CV · 2026-04-13 · unverdicted · none · ref 47 · internal anchor
MMR-AD is a new benchmark dataset showing that current generalist MLLMs lag industrial needs for anomaly detection, with Anomaly-R1 delivering better results through reasoning and RL.
UIPress: Bringing Optical Token Compression to UI-to-Code Generation cs.CL · 2026-04-10 · unverdicted · none · ref 53 · internal anchor
UIPress is the first encoder-side learned optical compression method for UI-to-Code that compresses visual tokens to 256, outperforming the uncompressed baseline by 7.5% CLIP score and the best inference-time baseline by 4.6% while delivering 9.1x TTFT speedup.
From Street View to Visual Network: Mapping the Visibility of Urban Landmarks with Vision-Language Models cs.CV · 2025-05-17 · unverdicted · none · ref 13 · internal anchor
Reformulates landmark visibility assessment as VLM-based detection on street view images and builds a heterogeneous visibility graph to map visual connections among landmarks and urban spaces.
AdaMMS: Model Merging for Heterogeneous Multimodal Large Language Models with Unsupervised Coefficient Optimization cs.CL · 2025-03-31 · unverdicted · none · ref 25 · internal anchor
AdaMMS merges heterogeneous MLLMs via architecture mapping, linear weight interpolation, and unsupervised hyper-parameter search, outperforming prior methods on vision-language benchmarks as the first such approach without labeled data.
OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning cs.CV · 2024-12-31 · accept · none · ref 144 · internal anchor
OCRBench v2 is a new benchmark with four times more tasks than prior versions that reveals most large multimodal models score below 50 out of 100 on visual text tasks and share five specific weaknesses.
HumanVBench: Probing Human-Centric Video Understanding in MLLMs with Automatically Synthesized Benchmarks cs.CV · 2024-12-23 · unverdicted · none · ref 50 · internal anchor
HumanVBench provides a 16-task benchmark for human-centric video understanding in MLLMs, created through automated annotation and distractor synthesis pipelines, and shows top models lag human performance on emotion perception and cross-modal alignment.
Perturb and Recover: Fine-tuning for Effective Backdoor Removal from CLIP cs.LG · 2024-12-01 · conditional · none · ref 43 · internal anchor
PAR fine-tunes CLIP to remove backdoors from structured triggers while preserving standard performance, and works even with only synthetic image-text pairs.
VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents cs.IR · 2024-10-14 · conditional · none · ref 24 · internal anchor
VisRAG achieves 20-40% better end-to-end performance than text-based RAG by directly embedding and retrieving document images with VLMs.
LVBench: An Extreme Long Video Understanding Benchmark cs.CV · 2024-06-12 · accept · none · ref 38 · internal anchor
LVBench is a new benchmark for extreme long video understanding that evaluates multimodal large language models on hour-scale videos using tasks designed to probe extended memory and comprehension.
Cracking the Code of Juxtaposition: Can AI Models Understand the Humorous Contradictions cs.CL · 2024-05-29 · unverdicted · none · ref 55 · internal anchor
Introduces YesBut benchmark showing state-of-the-art multimodal models lag humans on interpreting humorous contradictions in comics.
State Beyond Appearance: Diagnosing and Improving State Consistency in Dial-Based Measurement Reading cs.CV · 2026-04-29 · unverdicted · none · ref 39 · internal anchor
MLLMs ignore dial state geometry and cluster by appearance, causing inconsistency under variations; TriSCA's state-distance alignment, metadata supervision, and objective alignment improve robustness on clock and gauge benchmarks.
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency cs.CV · 2025-08-25 · unverdicted · none · ref 142 · internal anchor
InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and agentic tasks.
GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning cs.CV · 2025-07-01 · unverdicted · none · ref 59 · internal anchor
GLM-4.5V reaches state-of-the-art results on 42 multimodal benchmarks among open-source models of similar size by applying reinforcement learning with curriculum sampling to a strong vision foundation model.
InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners cs.AI · 2025-04-19 · unverdicted · none · ref 53 · internal anchor
InfiGUI-R1 uses Reasoning Injection via spatial distillation followed by Deliberation Enhancement via RL to evolve GUI agents from reactive actors to deliberative reasoners, reporting strong performance on grounding and trajectory tasks.
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models cs.CV · 2025-04-14 · conditional · none · ref 124 · internal anchor
InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.
When 'YES' Meets 'BUT': Can Large Models Comprehend Contradictory Humor Through Comparative Reasoning? cs.CV · 2025-03-29 · unverdicted · none · ref 56 · internal anchor
Presents YesBut (V2) benchmark and shows state-of-the-art VLMs significantly underperform humans on tasks requiring comparative reasoning for contradictory humor in comics.
MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models cs.CV · 2025-01-06 · unverdicted · none · ref 38 · internal anchor
MotionBench is a new benchmark showing poor fine-grained motion understanding in VLMs and proposes TE Fusion to improve performance with higher frame rates.
VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation cs.CV · 2024-12-30 · unverdicted · none · ref 36 · internal anchor
VisionReward learns multi-dimensional human preferences for image and video generation via hierarchical assessment and linear weighting, outperforming VideoScore by 17.2% in prediction accuracy and yielding 31.6% higher win rates in text-to-video models.
MM-MoralBench: A MultiModal Moral Evaluation Benchmark for Large Vision-Language Models cs.CV · 2024-12-30 · unverdicted · none · ref 20 · internal anchor
MM-MoralBench is a new multimodal benchmark that evaluates over 20 LVLMs on moral judgment, classification, and response tasks, finding pronounced divergence from human consensus and limited benefits from scaling or chain-of-thought reasoning.
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling cs.CV · 2024-12-06 · unverdicted · none · ref 249 · internal anchor
InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization cs.CL · 2024-11-15 · conditional · none · ref 100 · internal anchor
Mixed Preference Optimization with the MMPR dataset boosts multimodal CoT reasoning, lifting InternVL2-8B to 67.0 accuracy on MathVista (+8.7 points) and matching the 76B model.
Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models cs.CL · 2024-11-07 · conditional · none · ref 37 · internal anchor
MoT decouples non-embedding parameters by modality in transformers to match dense multi-modal performance with roughly one-third to one-half the FLOPs.
Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving cs.CV · 2024-10-29 · conditional · none · ref 15 · internal anchor
Senna decouples language-based high-level planning from an LVLM with low-level trajectory prediction from an E2E model, reporting 27% lower planning error and 33% lower collisions after pre-training on DriveX and fine-tuning on nuScenes.
LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding cs.CV · 2024-10-22 · unverdicted · none · ref 25 · internal anchor
LongVU adaptively compresses long video tokens using DINOv2-based frame deduplication, text-guided cross-modal selection, and temporal spatial reduction to improve video-language understanding in MLLMs with minimal detail loss.
SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference cs.CV · 2024-10-06 · accept · none · ref 74 · internal anchor
SparseVLM uses text-guided attention to prune and recycle visual tokens in VLMs, delivering 54% FLOPs reduction and 37% lower latency with 97% accuracy retention on LLaVA.
CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer cs.CV · 2024-08-12 · unverdicted · none · ref 103 · internal anchor
CogVideoX generates coherent 10-second text-to-video outputs at high resolution using a 3D VAE, expert adaptive LayerNorm transformer, progressive training, and a custom data pipeline, claiming state-of-the-art results.
Are We on the Right Way for Evaluating Large Vision-Language Models? cs.CV · 2024-03-29 · conditional · none · ref 45 · internal anchor
Current LVLM benchmarks overestimate capabilities because many questions can be answered without images due to design flaws or data leakage; MMStar is a human-curated set of 1,500 vision-indispensable samples across 6 capabilities and 18 axes with new metrics for leakage and true multi-modal gain.
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training cs.CV · 2024-03-14 · unverdicted · none · ref 114 · internal anchor
MM1 models achieve state-of-the-art few-shot multimodal results by pre-training on a careful mix of image-caption, interleaved, and text-only data with optimized image encoders.
Scaling Rectified Flow Transformers for High-Resolution Image Synthesis cs.CV · 2024-03-05 · conditional · none · ref 192 · internal anchor
Biased noise sampling for rectified flows combined with a bidirectional text-image transformer architecture yields state-of-the-art high-resolution text-to-image results that scale predictably with model size.
DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models cs.CV · 2024-02-19 · unverdicted · none · ref 21 · internal anchor
DriveVLM adds vision-language models with scene description, analysis, and hierarchical planning modules to autonomous driving, paired with a hybrid DriveVLM-Dual system tested on nuScenes and SUP-AD datasets and deployed on a production vehicle.
SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents cs.HC · 2024-01-17 · unverdicted · none · ref 34 · internal anchor
SeeClick improves visual GUI agents via GUI grounding pre-training on automatically curated data and introduces the ScreenSpot benchmark, with results indicating that stronger grounding boosts downstream task performance.
MMBench: Is Your Multi-modal Model an All-around Player? cs.CV · 2023-07-12 · accept · none · ref 47 · internal anchor
MMBench is a new bilingual benchmark that uses curated questions, CircularEval, and LLM-assisted answer conversion to provide objective, fine-grained evaluation of vision-language models.
NoisyGRPO: Incentivizing Multimodal CoT Reasoning via Noise Injection and Bayesian Estimation cs.CV · 2025-10-24 · unverdicted · none · ref 52 · internal anchor
NoisyGRPO is an RL framework that perturbs visual inputs with Gaussian noise for exploration and computes trajectory advantages via Bayesian posterior fusion of noise prior and reward likelihood to improve multimodal CoT generalization.
CogVLM2: Visual Language Models for Image and Video Understanding cs.CV · 2024-08-29 · conditional · none · ref 78 · internal anchor
CogVLM2 family achieves state-of-the-art results on image and video understanding benchmarks through improved visual expert architecture, higher resolution inputs, and automated temporal grounding for videos.
mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models cs.CV · 2024-08-09 · unverdicted · none · ref 250 · internal anchor
mPLUG-Owl3 introduces hyper attention blocks to integrate vision and language for long image-sequence understanding and reports SOTA results on single-image, multi-image, and video benchmarks.
MiniCPM-V: A GPT-4V Level MLLM on Your Phone cs.CV · 2024-08-03 · conditional · none · ref 102 · internal anchor
MiniCPM-Llama3-V 2.5 delivers GPT-4V-level multimodal performance on phones through architecture, pretraining, and alignment optimizations.
Hallucination of Multimodal Large Language Models: A Survey cs.CV · 2024-04-29 · accept · none · ref 169 · internal anchor
The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.
Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models cs.CV · 2024-03-27 · unverdicted · none · ref 66 · internal anchor
Mini-Gemini enhances VLMs via high-resolution visual refinement, curated reasoning data, and self-guided generation to reach leading zero-shot benchmark results across 2B-34B LLMs.
PaliGemma: A versatile 3B VLM for transfer cs.CV · 2024-07-10 · unverdicted · none · ref 143 · internal anchor
PaliGemma is an open 3B VLM based on SigLIP and Gemma that achieves strong performance on nearly 40 diverse open-world tasks including benchmarks, remote-sensing, and segmentation.
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites cs.CV · 2024-04-25 · unverdicted · none · ref 117 · internal anchor
InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.
A Survey on Multimodal Large Language Models cs.CV · 2023-06-23 · accept · none · ref 77 · internal anchor
This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.

CogVLM: Visual Expert for Pretrained Language Models

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer