Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.
hub Mixed citations
MobileVLM : A Fast, Strong and Open Vision Language Assistant for Mobile Devices
Mixed citation behavior. Most common role is background (64%).
abstract
We present MobileVLM, a competent multimodal vision language model (MMVLM) targeted to run on mobile devices. It is an amalgamation of a myriad of architectural designs and techniques that are mobile-oriented, which comprises a set of language models at the scale of 1.4B and 2.7B parameters, trained from scratch, a multimodal vision model that is pre-trained in the CLIP fashion, cross-modality interaction via an efficient projector. We evaluate MobileVLM on several typical VLM benchmarks. Our models demonstrate on par performance compared with a few much larger models. More importantly, we measure the inference speed on both a Qualcomm Snapdragon 888 CPU and an NVIDIA Jeston Orin GPU, and we obtain state-of-the-art performance of 21.5 tokens and 65.3 tokens per second, respectively. Our code will be made available at: https://github.com/Meituan-AutoML/MobileVLM.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
Derives exact guidance transition rates for discrete flow matching models that require only one model evaluation per sampling step and unify prior approximation-based methods.
Janus decouples visual encoding into task-specific pathways inside a single autoregressive transformer to unify multimodal understanding and generation while outperforming earlier unified models.
HPP decouples perception from reasoning in long-video VLMs by having an LLM run iterative programmatic probes on hierarchically segmented video, reporting gains on LongVideoBench, EgoSchema, VideoMME, and MLVU.
A cascaded knowledge distillation method with intermediate teachers improves efficiency of vision-language models like LLaVA while achieving state-of-the-art results on seven VQA benchmarks.
Distills 3D spatial reasoning from a 7B teacher VLM to a 2.29B student using VGGT encoder, multi-task losses, and Hidden CoT latent tokens, yielding 8.7x lower latency with 54-72% performance retention on ScanNet and 3D-FRONT.
A 0.5B student VLM distills from a 3B teacher using visual-switch distillation and DBiLD loss to gain 3.6 points on average across 10 multimodal benchmarks without architecture changes.
VaLR generates vision-aligned latent tokens before each reasoning step to preserve perceptual cues, improving VSI-Bench accuracy from 33.0% to 52.9%.
SEED-X is a unified multimodal foundation model that handles multi-granularity visual semantics for both comprehension and generation across arbitrary image sizes and ratios.
MM1 models achieve state-of-the-art few-shot multimodal results by pre-training on a careful mix of image-caption, interleaved, and text-only data with optimized image encoders.
DriveVLM adds vision-language models with scene description, analysis, and hierarchical planning modules to autonomous driving, paired with a hybrid DriveVLM-Dual system tested on nuScenes and SUP-AD datasets and deployed on a production vehicle.
MoE-LLaVA applies mixture-of-experts sparsity to LVLMs via MoE-Tuning, delivering LLaVA-1.5-7B level visual understanding and better hallucination resistance with only ~3B active parameters.
Hardware-in-the-loop tests on Snapdragon 8 Elite show NPU phase-dependent speedups (1.64x prefill, 1.18x decode) and 2.52x energy reduction for FastVLM-0.5B plus a graph rewrite enabling unsupported encoders.
Pilot evaluation of language-specific versus multilingual LoRA adapters on Qwen2.5-VL-3B for curator-guided BLV art descriptions in three languages.
Widthwise pruning of LVLM language backbones combined with supervised finetuning and hidden-state distillation recovers over 95% performance using just 5% of data across 3B-7B models.
RATNet applies analogical reasoning via a cyclic pre-training strategy to outperform prior foundation models in GI endoscopy diagnosis across diagnosis, few-shot, zero-shot, robustness, adaptation, and federated scenarios.
ClickAIXR combines controller-based object selection in XR with on-device VLM inference to enable private, precise multimodal queries about real objects.
Efficient3D prunes visual tokens in 3D MLLMs via DVTIE and ATR modules, reporting better performance than unpruned baselines on Scan2Cap and other benchmarks.
AutoDrive-R² adds four-step CoT reasoning with self-reflection to VLA models via SFT on nuScenesR²-6K and GRPO RL under spatial, dynamic, and smoothness rewards, reporting SOTA results on nuScenes and Waymo.
MiniCPM-Llama3-V 2.5 delivers GPT-4V-level multimodal performance on phones through architecture, pretraining, and alignment optimizations.
Mini-Gemini enhances VLMs via high-resolution visual refinement, curated reasoning data, and self-guided generation to reach leading zero-shot benchmark results across 2B-34B LLMs.
MobileVLM V2 shows that 1.7B and 3B parameter vision-language models can reach or exceed the performance of 3B and 7B+ models on common VLM benchmarks via targeted design and data improvements.
The survey identifies a key tension in multilingual vision-language models between language neutrality via contrastive learning and cultural awareness via diverse data, with most benchmarks relying on translation-based evaluation.
Scaling data, model size, and training optimization on the Janus architecture yields better multimodal understanding and more stable, instruction-following text-to-image generation.
citing papers explorer
No citing papers match the current filters.