DataComp-VLM benchmark shows instruction-heavy data mixing outperforms filtering for VLM training, with DCVLM-Baseline achieving 63.6% on 33 tasks for 8B models (+5.4pp over FineVision).
super hub Mixed citations
PaliGemma: A versatile 3B VLM for transfer
Mixed citation behavior. Most common role is background (59%).
abstract
PaliGemma is an open Vision-Language Model (VLM) that is based on the SigLIP-So400m vision encoder and the Gemma-2B language model. It is trained to be a versatile and broadly knowledgeable base model that is effective to transfer. It achieves strong performance on a wide variety of open-world tasks. We evaluate PaliGemma on almost 40 diverse tasks including standard VLM benchmarks, but also more specialized tasks such as remote-sensing and segmentation.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract PaliGemma is an open Vision-Language Model (VLM) that is based on the SigLIP-So400m vision encoder and the Gemma-2B language model. It is trained to be a versatile and broadly knowledgeable base model that is effective to transfer. It achieves strong performance on a wide variety of open-world tasks. We evaluate PaliGemma on almost 40 diverse tasks including standard VLM benchmarks, but also more specialized tasks such as remote-sensing and segmentation.
authors
co-cited works
representative citing papers
RoboLab is a new simulation benchmark with 120 tasks across visual, procedural, and relational axes that quantifies generalization gaps and perturbation sensitivity in task-generalist robotic policies.
Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.
Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.
VLA models from VLM adaptation can be pruned 12-30% via multi-module joint scheme based on divergence signals while keeping ~90% performance on LIBERO without post-pruning recovery, unlike standard criteria that collapse.
VLA language backbones show high redundancy on manipulation benchmarks, with half the LLM blocks removable and even two blocks sufficient to recover baseline performance after fine-tuning, unlike vision and action pathways.
Koshur Pixel is the first large-scale synthetic OCR dataset for Kashmiri with 613,078 image-text pairs generated via SynthOCR-Gen from the KS-PRET-5M corpus across multiple fonts and granularities with 25+ augmentations.
NAC adapts multi-scale RVQGAN audio codecs with kinematic-specific losses to produce ordered action tokens that yield lower reconstruction error and higher task success than prior tokenizers in VLA models.
ThinkingVLA is a Mixture-of-Transformers VLA model that performs interleaved forward CoT for subgoal and image prediction followed by inverse CoT grounded on the predicted image to generate actions.
FTP-1 is the first foundation tactile policy pretrained on ~3000 hours of data from 26 sources across 21 sensors that improves performance on seen setups by 17.2% and transfers to unseen sensors with 31% success rate gain.
VLMs across families and scales show anchoring to discrete slant angles in zero-shot and prompted settings rather than human-like graded texture-based slant perception.
HapTile introduces a visuotactile dataset with haptic-informed teleoperation for language-conditioned contact-rich manipulation tasks and provides baseline policy benchmarks.
Argus achieves the highest reported NDCG scores among open late-interaction models on ViDoRe V1 and combined V1+V2 by introducing query-dependent document representations via a region-aware MoE on Qwen3.5-VL, trained on 9% of public data with a 1024-dim head.
DeepLatent introduces a parallel latent visual reasoning framework with learnable 2D tokens and continuous RL, trained via distillation then RL, plus a new 180K dataset, claiming SOTA benchmark results.
OmniGF adapts VLMs via dual-branch decoding and head embeddings to unify precise multi-person gaze localization with semantic and social reasoning, claiming new SOTA on benchmarks.
SELECT-LLM is the first active model selection framework for LLMs that uses expected information gain from pairwise output similarities to minimize required annotations, reporting up to 84.78% cost reduction across 23 datasets and 156 models.
GesVLA encodes gesture features directly into the latent space of VLA models using a dual-VLM architecture and a rendering-based data pipeline, yielding improved target grounding in real robotic tasks.
A3 reframes dynamic action chunk commitment in VLA models as self-speculative prefix verification, accepting the longest continuous sequence of actions that satisfies consensus-ordered conditional invariance and prefix-closed sequential consistency.
RIO introduces a lightweight open-source framework that abstracts real-time robot I/O to support easy switching between embodiments and platforms for collecting data and deploying VLAs.
Reducing visual input to one token per frame in VLA world models maintains or improves long-horizon performance on MetaWorld, LIBERO, and real-robot tasks.
AT-VLA proposes adaptive tactile injection and a dual-stream tactile reaction mechanism to enhance VLA models for contact-rich robotic manipulation with real-time responses.
Introduces ISS and NMR as interventional metrics to diagnose causal misalignment in VLA policies and link it to generalization performance.
MoSS augments VLAs with decoupled modality streams for multiple physical signals, achieving synergistic gains in real-world robot tasks via joint attention and auxiliary future-signal prediction.
SketchVLM lets VLMs generate non-destructive SVG annotations on input images to visually explain answers, raising visual reasoning accuracy by up to 28.5 points and annotation quality by 1.48x over baselines.
citing papers explorer
-
DataComp-VLM: Improved Open Datasets for Vision-Language Models
DataComp-VLM benchmark shows instruction-heavy data mixing outperforms filtering for VLM training, with DCVLM-Baseline achieving 63.6% on 33 tasks for 8B models (+5.4pp over FineVision).
-
RoboLab: A High-Fidelity Simulation Benchmark for Analysis of Task Generalist Policies
RoboLab is a new simulation benchmark with 120 tasks across visual, procedural, and relational axes that quantifies generalization gaps and perturbation sensitivity in task-generalist robotic policies.
-
Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding
Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.
-
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models
Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.
-
Revisiting Parameter Redundancy in Vision-Language-Action Models: Insights from VLM-to-VLA Adaptation
VLA models from VLM adaptation can be pruned 12-30% via multi-module joint scheme based on divergence signals while keeping ~90% performance on LIBERO without post-pruning recovery, unlike standard criteria that collapse.
-
Drop-Then-Recovery: How Redundant Are Vision-Language-Action Models?
VLA language backbones show high redundancy on manipulation benchmarks, with half the LLM blocks removable and even two blocks sufficient to recover baseline performance after fine-tuning, unlike vision and action pathways.
-
Koshur Pixel: a large-scale synthetic ocr dataset for kashmiri
Koshur Pixel is the first large-scale synthetic OCR dataset for Kashmiri with 613,078 image-text pairs generated via SynthOCR-Gen from the KS-PRET-5M corpus across multiple fonts and granularities with 25+ augmentations.
-
NAC: Neural Action Codec for Vision-Language-Action Models
NAC adapts multi-scale RVQGAN audio codecs with kinematic-specific losses to produce ordered action tokens that yield lower reconstruction error and higher task success than prior tokenizers in VLA models.
-
ThinkingVLA: Interleaved Vision and Language Reasoning for Robotic Manipulation
ThinkingVLA is a Mixture-of-Transformers VLA model that performs interleaved forward CoT for subgoal and image prediction followed by inverse CoT grounded on the predicted image to generate actions.
-
FTP-1: A Generalist Foundation Tactile Policy Across Tactile Sensors for Contact-Rich Manipulation
FTP-1 is the first foundation tactile policy pretrained on ~3000 hours of data from 26 sources across 21 sensors that improves performance on seen setups by 17.2% and transfers to unseen sensors with 31% success rate gain.
-
Anchored, Not Graded: Vision-Language Models Fail at Slant-from-Texture Perception
VLMs across families and scales show anchoring to discrete slant angles in zero-shot and prompted settings rather than human-like graded texture-based slant perception.
-
HapTile: A Haptic-Informed Vision-Tactile-Language-Action Dataset for Contact-Rich Imitation Learning
HapTile introduces a visuotactile dataset with haptic-informed teleoperation for language-conditioned contact-rich manipulation tasks and provides baseline policy benchmarks.
-
Argus-Retriever: Vision-LLM Late-Interaction Retrieval with Region-Aware Query-Conditioned MoE for Visual Document Retrieval
Argus achieves the highest reported NDCG scores among open late-interaction models on ViDoRe V1 and combined V1+V2 by introducing query-dependent document representations via a region-aware MoE on Qwen3.5-VL, trained on 9% of public data with a 1024-dim head.
-
DeepLatent: Think with Images via Parallel Latent Visual Reasoning
DeepLatent introduces a parallel latent visual reasoning framework with learnable 2D tokens and continuous RL, trained via distillation then RL, plus a new 180K dataset, claiming SOTA benchmark results.
-
OmniGF: A Dual-Branch Vision-Language Framework for Unified Gaze Following
OmniGF adapts VLMs via dual-branch decoding and head embeddings to unify precise multi-person gaze localization with semantic and social reasoning, claiming new SOTA on benchmarks.
-
Large Language Model Selection with Limited Annotations
SELECT-LLM is the first active model selection framework for LLMs that uses expected information gain from pairwise output similarities to minimize required annotations, reporting up to 84.78% cost reduction across 23 datasets and 156 models.
-
GesVLA: Gesture-Aware Vision-Language-Action Model Embedded Representations
GesVLA encodes gesture features directly into the latent space of VLA models using a dual-VLM architecture and a rendering-based data pipeline, yielding improved target grounding in real robotic tasks.
-
Dynamic Execution Commitment of Vision-Language-Action Models
A3 reframes dynamic action chunk commitment in VLA models as self-speculative prefix verification, accepting the longest continuous sequence of actions that satisfies consensus-ordered conditional invariance and prefix-closed sequential consistency.
-
RIO: Flexible Real-Time Robot I/O for Cross-Embodiment Robot Learning
RIO introduces a lightweight open-source framework that abstracts real-time robot I/O to support easy switching between embodiments and platforms for collecting data and deploying VLAs.
-
One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy
Reducing visual input to one token per frame in VLA world models maintains or improves long-horizon performance on MetaWorld, LIBERO, and real-robot tasks.
-
AT-VLA: Adaptive Tactile Injection for Enhanced Feedback Reaction in Vision-Language-Action Models
AT-VLA proposes adaptive tactile injection and a dual-stream tactile reaction mechanism to enhance VLA models for contact-rich robotic manipulation with real-time responses.
-
Embodied Interpretability: Linking Causal Understanding to Generalization in Vision-Language-Action Models
Introduces ISS and NMR as interventional metrics to diagnose causal misalignment in VLA policies and link it to generalization performance.
-
Modular Sensory Stream for Integrating Physical Feedback in Vision-Language-Action Models
MoSS augments VLAs with decoupled modality streams for multiple physical signals, achieving synergistic gains in real-world robot tasks via joint attention and auxiliary future-signal prediction.
-
SketchVLM: Vision language models can annotate images to explain thoughts and guide users
SketchVLM lets VLMs generate non-destructive SVG annotations on input images to visually explain answers, raising visual reasoning accuracy by up to 28.5 points and annotation quality by 1.48x over baselines.
-
Gaslight, Gatekeep, V1-V3: Early Visual Cortex Alignment Shields Vision-Language Models from Sycophantic Manipulation
Alignment of vision-language models with human V1-V3 early visual cortex negatively predicts resistance to sycophantic gaslighting attacks.
-
DSCA: Dynamic Subspace Concept Alignment for Lifelong VLM Editing
DSCA turns concept isolation into an architectural property by dynamically creating orthogonal subspaces for non-interfering lifelong edits in vision-language models, sustaining over 95% success after 1000 sequential edits.
-
MixAtlas: Uncertainty-aware Data Mixture Optimization for Multimodal LLM Midtraining
MixAtlas uses CLIP-based decomposition and Gaussian process optimization on small proxies to discover data mixtures that improve multimodal benchmark performance by up to 17.6% and transfer to larger models with faster convergence.
-
Generative Control as Optimization: Time Unconditional Flow Matching for Adaptive and Robust Robotic Control
GeCO replaces time-dependent flow matching with time-unconditional optimization, enabling adaptive inference and intrinsic OOD detection for robotic imitation learning.
-
AR-VLA: True Autoregressive Action Expert for Vision-Language-Action Models
AR-VLA introduces a standalone autoregressive action expert with long-lived memory that generates context-aware continuous actions for VLAs, replacing chunk-based heads with smoother trajectories and maintained task success.
-
SAM 3: Segment Anything with Concepts
SAM 3 introduces promptable concept segmentation that doubles accuracy of prior systems on images and videos while improving standard SAM segmentation performance.
-
Guided Query Refinement: Multimodal Hybrid Retrieval with Test-Time Optimization
GQR is a test-time optimization technique that refines primary retriever query embeddings using complementary retriever scores to achieve high performance with smaller representations in multimodal visual document retrieval.
-
Social Human Robot Embodied Conversation (SHREC) Dataset: Benchmarking Foundational Models' Social Reasoning
SHREC is a new benchmark dataset of embodied human-robot conversations that shows substantial performance gaps in state-of-the-art foundation models on tasks involving social error detection and rationale generation.
-
Domain Arithmetic: One-Shot VLA Adaptation under Environmental Shifts
DART adapts VLA models to environmental shifts with one demonstration using subspace-aligned weight vector arithmetic.
-
3D HAMSTER: Bridging Planning and Control in Hierarchical Vision Language Action Models through 3D Trajectory Guidance
3D HAMSTER adds depth encoding and reconstruction to VLMs to produce 3D waypoint sequences that feed directly into pointcloud policies, claiming better generalization than 2D baselines under shifts.
-
When transformers learn "impossible" languages, what do they learn?
Transformers on impossible-language variants show gradual grammatical sensitivity loss but sharp long-sentence generation failures, supporting generative deficiency as a link to non-attestation.
-
The Speedup Paradox: Rethinking Inference Speed-Quality Trade-off in Embodied Tasks
TISED decomposes inference optimization effects on embodied tasks and identifies paradoxical outcomes where faster per-step inference can increase task completion time on static tasks or raise success rates on dynamic tasks.
-
LightSTAR: Efficient Visual Document Retrieval via Lightweight Selection with Vision-Adaptive Refinement
LightSTAR achieves state-of-the-art accuracy in visual document retrieval by decomposing the task into LLM-free high-recall candidate selection and vision-adaptive semantic refinement on candidates, cutting end-to-end latency several-fold.
-
FlowDPG: Deterministic Policy Gradient on Flow Matching Policies for Real-World Manipulation
FlowDPG distills critic gradients into flow matching velocity fields to enable BPTT-free DDPG-style policy improvement and reports 92% success on a real-world dual-arm AirPods assembly task.
-
OpenHLM: An Empirical Recipe for Whole-Body Humanoid Loco-Manipulation
OpenHLM is an empirical recipe yielding a whole-body humanoid VLA model that outperforms GR00T N1.6 and Ψ0 baselines on long-horizon tasks using less than half the demonstration time.
-
Wh0: Generative World Models as Scalable Sources of Egocentric Human Hand Manipulation Data
Wh0 generates scalable egocentric human manipulation videos with world models and converts them to boost pretrained VLA models' zero-shot dexterous task success from 8.3% to 38.9% on 18 real-world tasks.
-
Robot Critics that Sweat the Small Stuff
Fine-tuning VLMs with pairwise progress supervision from policy rollouts improves fine-grained failure detection and boosts robot manipulation success by 11% real-world and 5.9% in simulation.
-
Decoupling the Declarative from the Procedural in Vision-Language-Action Models
w²VLA restructures VLA information flow to decouple declarative semantics from procedural skills, enabling zero-shot transfer to novel objects.
-
VLA-FAIL: Efficient Task Failure Detection for Finetuned Vision-Language-Action Models
VLA-FAIL introduces last-layer Mahalanobis distance and action chunk consistency detectors that together enable early, reliable failure detection in finetuned VLAs without failure data or expensive sampling.
-
NRITYAM: Language Models Meet Art and Heritage of Dance
NRITYAM creates the largest multilingual benchmark for evaluating language models' understanding of dance traditions through expert-curated QA pairs.
-
Language-Instructed Vision Embeddings for Controllable and Generalizable Perception
LIVE uses language to generate task-centric vision embeddings at inference, reducing hallucinations by 34 points on MMVP, outperforming larger VLMs on VQA, and generalizing to unseen tasks.
-
WeaveLA: Event Driven Cross-Subtask Latent Memory Weaving for Repetitive Robot Manipulation
WeaveLA improves VLA policies for repetitive robot manipulation by event-triggered cross-subtask latent memory weaving, raising success on the hardest repetition tasks from 0% to 47.8% while leaving single-execution performance unchanged.
-
Visuals Lie, Consistency Speaks: Disentangling Spatial Attention from Reliability in Vision-Language Models
Spatial attention metrics in VLMs correlate near zero (R≈0.001) with accuracy while self-consistency predicts truth at R=0.429; reliability stems from generation dynamics rather than visual grounding.
-
APT: Action Expert Pretraining Improves Instruction Generalization of Vision-Language-Action Policies
APT pretrains the action expert as a vision-action prior on frozen VLM features then adds language through gated fusion to improve OOD instruction generalization in continuous-action VLA policies.
-
VeriSpace: Spatially Grounded Action Verification for Vision-Language-Action Models
VeriSpace is a 3D-aware action verifier that improves test-time action selection in VLA models by encoding scenes with visual and geometric information and reasoning over spatial relations and goal progress.
-
Hierarchical Policies from Verbal and Egocentric Human Signals for Natural Human-Robot Interaction
EDITH combines egocentric vision and gaze from smart glasses with language in a hierarchical policy to let robots interpret brief nonverbal human intent and reduce user effort in interactive tasks.