RoboLab is a new simulation benchmark with 120 tasks across visual, procedural, and relational axes that quantifies generalization gaps and perturbation sensitivity in task-generalist robotic policies.
hub Mixed citations
PaliGemma: A versatile 3B VLM for transfer
Mixed citation behavior. Most common role is background (59%).
abstract
PaliGemma is an open Vision-Language Model (VLM) that is based on the SigLIP-So400m vision encoder and the Gemma-2B language model. It is trained to be a versatile and broadly knowledgeable base model that is effective to transfer. It achieves strong performance on a wide variety of open-world tasks. We evaluate PaliGemma on almost 40 diverse tasks including standard VLM benchmarks, but also more specialized tasks such as remote-sensing and segmentation.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract PaliGemma is an open Vision-Language Model (VLM) that is based on the SigLIP-So400m vision encoder and the Gemma-2B language model. It is trained to be a versatile and broadly knowledgeable base model that is effective to transfer. It achieves strong performance on a wide variety of open-world tasks. We evaluate PaliGemma on almost 40 diverse tasks including standard VLM benchmarks, but also more specialized tasks such as remote-sensing and segmentation.
co-cited works
representative citing papers
Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.
Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.
Koshur Pixel is the first large-scale synthetic OCR dataset for Kashmiri with 613,078 image-text pairs generated via SynthOCR-Gen from the KS-PRET-5M corpus across multiple fonts and granularities with 25+ augmentations.
DeepLatent introduces a parallel latent visual reasoning framework with learnable 2D tokens and continuous RL, trained via distillation then RL, plus a new 180K dataset, claiming SOTA benchmark results.
GesVLA encodes gesture features directly into the latent space of VLA models using a dual-VLM architecture and a rendering-based data pipeline, yielding improved target grounding in real robotic tasks.
RIO introduces a lightweight open-source framework that abstracts real-time robot I/O to support easy switching between embodiments and platforms for collecting data and deploying VLAs.
Reducing visual input to one token per frame in VLA world models maintains or improves long-horizon performance on MetaWorld, LIBERO, and real-robot tasks.
AT-VLA proposes adaptive tactile injection and a dual-stream tactile reaction mechanism to enhance VLA models for contact-rich robotic manipulation with real-time responses.
MoSS augments VLAs with decoupled modality streams for multiple physical signals, achieving synergistic gains in real-world robot tasks via joint attention and auxiliary future-signal prediction.
SketchVLM lets VLMs generate non-destructive SVG annotations on input images to visually explain answers, raising visual reasoning accuracy by up to 28.5 points and annotation quality by 1.48x over baselines.
Alignment of vision-language models with human V1-V3 early visual cortex negatively predicts resistance to sycophantic gaslighting attacks.
DSCA turns concept isolation into an architectural property by dynamically creating orthogonal subspaces for non-interfering lifelong edits in vision-language models, sustaining over 95% success after 1000 sequential edits.
MixAtlas uses CLIP-based decomposition and Gaussian process optimization on small proxies to discover data mixtures that improve multimodal benchmark performance by up to 17.6% and transfer to larger models with faster convergence.
GeCO replaces time-dependent flow matching with time-unconditional optimization, enabling adaptive inference and intrinsic OOD detection for robotic imitation learning.
AR-VLA introduces a standalone autoregressive action expert with long-lived memory that generates context-aware continuous actions for VLAs, replacing chunk-based heads with smoother trajectories and maintained task success.
SAM 3 introduces promptable concept segmentation that doubles accuracy of prior systems on images and videos while improving standard SAM segmentation performance.
GQR is a test-time optimization technique that refines primary retriever query embeddings using complementary retriever scores to achieve high performance with smaller representations in multimodal visual document retrieval.
SHREC is a new benchmark dataset of embodied human-robot conversations that shows substantial performance gaps in state-of-the-art foundation models on tasks involving social error detection and rationale generation.
VLMs fail at visual counting extrapolation because they cannot project visual magnitudes onto symbolic tokens, despite intact perceptual representations, supporting a fractured magnitude hypothesis.
PARCEL is a new visual tokenization architecture combining pool-anchored resampling with conditioned elastic queries to enhance performance-efficiency tradeoffs in LVLMs over prior matryoshka methods.
Inverse dynamics prediction is added as an auxiliary task to reduce state aliasing in VLA models by directly supervising the vision encoder on action-relevant visual distinctions using only standard observation-action pairs.
TextTeacher uses frozen text embeddings from captions as semantic anchors to guide vision model training, improving ImageNet accuracy by up to 2.7 p.p. and transfer performance by 1.0 p.p. on average.
Reweighting training emphasis toward image-negative tokens and filtering hallucinated data reduces object hallucination in LVLMs across three model variants.
citing papers explorer
-
Breaking Lock-In: Preserving Steerability under Low-Data VLA Post-Training
DeLock mitigates lock-in in low-data VLA post-training via visual grounding preservation and test-time contrastive prompt guidance, outperforming baselines across eight evaluations while matching data-heavy generalist policies.
-
ST-$\pi$: Structured SpatioTemporal VLA for Robotic Manipulation
ST-π structures VLA models by having a spatiotemporal VLM produce causally ordered chunk-level prompts that guide a dual-generator action expert to jointly handle spatial and temporal control in robotic manipulation.
-
UAV-Track VLA: Embodied Aerial Tracking via Vision-Language-Action Models
UAV-Track VLA modifies the π0.5 VLA architecture with temporal compression and dual-branch decoding to reach 61.76% success and 269.65 average frames in long-distance pedestrian tracking on a new 890K-frame UAV dataset, while cutting inference latency by 33.4%.
-
FAST: Efficient Action Tokenization for Vision-Language-Action Models
FAST applies discrete cosine transform to robot action sequences for efficient tokenization, enabling autoregressive VLAs to succeed on high-frequency dexterous tasks and scale to 10k hours of data while matching diffusion VLA performance with up to 5x faster training.
-
$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control
π₀ is a vision-language-action flow model trained on diverse multi-platform robot data that supports zero-shot task performance, language instruction following, and efficient fine-tuning for dexterous tasks.
-
ZAYA1-VL-8B Technical Report
ZAYA1-VL-8B is a new MoE vision-language model with vision-specific LoRA adapters and bidirectional image attention that reports competitive performance against several 3B-4B models on image, reasoning, and counting benchmarks.