super hub Mixed citations

PaliGemma: A versatile 3B VLM for transfer

Alexander Kolesnikov, Andreas Steiner, Daniel Salz, Lucas Beyer, Xiao Wang · 2024 · cs.CV · arXiv 2407.07726

Mixed citation behavior. Most common role is background (59%).

155 Pith papers citing it

Background 59% of classified citations

open full Pith review browse 155 citing papers more from Alexander Kolesnikov arXiv PDF

abstract

PaliGemma is an open Vision-Language Model (VLM) that is based on the SigLIP-So400m vision encoder and the Gemma-2B language model. It is trained to be a versatile and broadly knowledgeable base model that is effective to transfer. It achieves strong performance on a wide variety of open-world tasks. We evaluate PaliGemma on almost 40 diverse tasks including standard VLM benchmarks, but also more specialized tasks such as remote-sensing and segmentation.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 19 method 6 baseline 5 dataset 2

citation-polarity summary

background 19 use method 6 baseline 4 use dataset 2 unclear 1

claims ledger

abstract PaliGemma is an open Vision-Language Model (VLM) that is based on the SigLIP-So400m vision encoder and the Gemma-2B language model. It is trained to be a versatile and broadly knowledgeable base model that is effective to transfer. It achieves strong performance on a wide variety of open-world tasks. We evaluate PaliGemma on almost 40 diverse tasks including standard VLM benchmarks, but also more specialized tasks such as remote-sensing and segmentation.

authors

Alexander Kolesnikov Andreas Steiner Andr\'e Susano Pinto Daniel Salz Lucas Beyer Xiao Wang

co-cited works

representative citing papers

DataComp-VLM: Improved Open Datasets for Vision-Language Models

cs.CV · 2026-06-26 · conditional · novelty 8.0 · 2 refs

DataComp-VLM benchmark shows instruction-heavy data mixing outperforms filtering for VLM training, with DCVLM-Baseline achieving 63.6% on 33 tasks for 8B models (+5.4pp over FineVision).

RoboLab: A High-Fidelity Simulation Benchmark for Analysis of Task Generalist Policies

cs.RO · 2026-04-10 · unverdicted · novelty 8.0 · 2 refs

RoboLab is a new simulation benchmark with 120 tasks across visual, procedural, and relational axes that quantifies generalization gaps and perturbation sensitivity in task-generalist robotic policies.

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

cs.CV · 2026-01-15 · unverdicted · novelty 8.0

Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

cs.CV · 2024-09-25 · accept · novelty 8.0

Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.

Revisiting Parameter Redundancy in Vision-Language-Action Models: Insights from VLM-to-VLA Adaptation

cs.RO · 2026-06-30 · unverdicted · novelty 7.0

VLA models from VLM adaptation can be pruned 12-30% via multi-module joint scheme based on divergence signals while keeping ~90% performance on LIBERO without post-pruning recovery, unlike standard criteria that collapse.

Drop-Then-Recovery: How Redundant Are Vision-Language-Action Models?

cs.RO · 2026-06-26 · accept · novelty 7.0

VLA language backbones show high redundancy on manipulation benchmarks, with half the LLM blocks removable and even two blocks sufficient to recover baseline performance after fine-tuning, unlike vision and action pathways.

Koshur Pixel: a large-scale synthetic ocr dataset for kashmiri

cs.CV · 2026-06-22 · unverdicted · novelty 7.0

Koshur Pixel is the first large-scale synthetic OCR dataset for Kashmiri with 613,078 image-text pairs generated via SynthOCR-Gen from the KS-PRET-5M corpus across multiple fonts and granularities with 25+ augmentations.

ThinkingVLA: Interleaved Vision and Language Reasoning for Robotic Manipulation

cs.RO · 2026-06-16 · unverdicted · novelty 7.0

ThinkingVLA is a Mixture-of-Transformers VLA model that performs interleaved forward CoT for subgoal and image prediction followed by inverse CoT grounded on the predicted image to generate actions.

FTP-1: A Generalist Foundation Tactile Policy Across Tactile Sensors for Contact-Rich Manipulation

cs.RO · 2026-06-11 · unverdicted · novelty 7.0

FTP-1 is the first foundation tactile policy pretrained on ~3000 hours of data from 26 sources across 21 sensors that improves performance on seen setups by 17.2% and transfers to unseen sensors with 31% success rate gain.

Anchored, Not Graded: Vision-Language Models Fail at Slant-from-Texture Perception

cs.CV · 2026-06-04 · unverdicted · novelty 7.0

VLMs across families and scales show anchoring to discrete slant angles in zero-shot and prompted settings rather than human-like graded texture-based slant perception.

HapTile: A Haptic-Informed Vision-Tactile-Language-Action Dataset for Contact-Rich Imitation Learning

cs.RO · 2026-06-03 · unverdicted · novelty 7.0

HapTile introduces a visuotactile dataset with haptic-informed teleoperation for language-conditioned contact-rich manipulation tasks and provides baseline policy benchmarks.

Argus-Retriever: Vision-LLM Late-Interaction Retrieval with Region-Aware Query-Conditioned MoE for Visual Document Retrieval

cs.IR · 2026-06-03 · unverdicted · novelty 7.0

Argus achieves the highest reported NDCG scores among open late-interaction models on ViDoRe V1 and combined V1+V2 by introducing query-dependent document representations via a region-aware MoE on Qwen3.5-VL, trained on 9% of public data with a 1024-dim head.

DeepLatent: Think with Images via Parallel Latent Visual Reasoning

cs.CV · 2026-05-30 · unverdicted · novelty 7.0

DeepLatent introduces a parallel latent visual reasoning framework with learnable 2D tokens and continuous RL, trained via distillation then RL, plus a new 180K dataset, claiming SOTA benchmark results.

OmniGF: A Dual-Branch Vision-Language Framework for Unified Gaze Following

cs.CV · 2026-05-26 · unverdicted · novelty 7.0

OmniGF adapts VLMs via dual-branch decoding and head embeddings to unify precise multi-person gaze localization with semantic and social reasoning, claiming new SOTA on benchmarks.

Large Language Model Selection with Limited Annotations

cs.CL · 2026-05-24 · unverdicted · novelty 7.0

SELECT-LLM is the first active model selection framework for LLMs that uses expected information gain from pairwise output similarities to minimize required annotations, reporting up to 84.78% cost reduction across 23 datasets and 156 models.

GesVLA: Gesture-Aware Vision-Language-Action Model Embedded Representations

cs.RO · 2026-05-21 · unverdicted · novelty 7.0

GesVLA encodes gesture features directly into the latent space of VLA models using a dual-VLM architecture and a rendering-based data pipeline, yielding improved target grounding in real robotic tasks.

Dynamic Execution Commitment of Vision-Language-Action Models

cs.CV · 2026-05-12 · unverdicted · novelty 7.0 · 3 refs

A3 reframes dynamic action chunk commitment in VLA models as self-speculative prefix verification, accepting the longest continuous sequence of actions that satisfies consensus-ordered conditional invariance and prefix-closed sequential consistency.

RIO: Flexible Real-Time Robot I/O for Cross-Embodiment Robot Learning

cs.RO · 2026-05-12 · unverdicted · novelty 7.0

RIO introduces a lightweight open-source framework that abstracts real-time robot I/O to support easy switching between embodiments and platforms for collecting data and deploying VLAs.

One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy

cs.CV · 2026-05-08 · conditional · novelty 7.0 · 3 refs

Reducing visual input to one token per frame in VLA world models maintains or improves long-horizon performance on MetaWorld, LIBERO, and real-robot tasks.

AT-VLA: Adaptive Tactile Injection for Enhanced Feedback Reaction in Vision-Language-Action Models

cs.RO · 2026-05-08 · unverdicted · novelty 7.0 · 2 refs

AT-VLA proposes adaptive tactile injection and a dual-stream tactile reaction mechanism to enhance VLA models for contact-rich robotic manipulation with real-time responses.

Embodied Interpretability: Linking Causal Understanding to Generalization in Vision-Language-Action Models

cs.RO · 2026-05-01 · unverdicted · novelty 7.0

Introduces ISS and NMR as interventional metrics to diagnose causal misalignment in VLA policies and link it to generalization performance.

Modular Sensory Stream for Integrating Physical Feedback in Vision-Language-Action Models

cs.RO · 2026-04-25 · unverdicted · novelty 7.0

MoSS augments VLAs with decoupled modality streams for multiple physical signals, achieving synergistic gains in real-world robot tasks via joint attention and auxiliary future-signal prediction.

SketchVLM: Vision language models can annotate images to explain thoughts and guide users

cs.CV · 2026-04-23 · unverdicted · novelty 7.0

SketchVLM lets VLMs generate non-destructive SVG annotations on input images to visually explain answers, raising visual reasoning accuracy by up to 28.5 points and annotation quality by 1.48x over baselines.

Gaslight, Gatekeep, V1-V3: Early Visual Cortex Alignment Shields Vision-Language Models from Sycophantic Manipulation

cs.CV · 2026-04-15 · conditional · novelty 7.0

Alignment of vision-language models with human V1-V3 early visual cortex negatively predicts resistance to sycophantic gaslighting attacks.

citing papers explorer

Showing 50 of 155 citing papers.

DataComp-VLM: Improved Open Datasets for Vision-Language Models cs.CV · 2026-06-26 · conditional · none · ref 21 · 2 links · internal anchor
DataComp-VLM benchmark shows instruction-heavy data mixing outperforms filtering for VLM training, with DCVLM-Baseline achieving 63.6% on 33 tasks for 8B models (+5.4pp over FineVision).
RoboLab: A High-Fidelity Simulation Benchmark for Analysis of Task Generalist Policies cs.RO · 2026-04-10 · unverdicted · none · ref 4 · 2 links · internal anchor
RoboLab is a new simulation benchmark with 120 tasks across visual, procedural, and relational axes that quantifies generalization gaps and perturbation sensitivity in task-generalist robotic policies.
Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding cs.CV · 2026-01-15 · unverdicted · none · ref 13 · internal anchor
Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models cs.CV · 2024-09-25 · accept · none · ref 10 · internal anchor
Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.
Revisiting Parameter Redundancy in Vision-Language-Action Models: Insights from VLM-to-VLA Adaptation cs.RO · 2026-06-30 · unverdicted · none · ref 2 · internal anchor
VLA models from VLM adaptation can be pruned 12-30% via multi-module joint scheme based on divergence signals while keeping ~90% performance on LIBERO without post-pruning recovery, unlike standard criteria that collapse.
Drop-Then-Recovery: How Redundant Are Vision-Language-Action Models? cs.RO · 2026-06-26 · accept · none · ref 2 · internal anchor
VLA language backbones show high redundancy on manipulation benchmarks, with half the LLM blocks removable and even two blocks sufficient to recover baseline performance after fine-tuning, unlike vision and action pathways.
Koshur Pixel: a large-scale synthetic ocr dataset for kashmiri cs.CV · 2026-06-22 · unverdicted · none · ref 34 · internal anchor
Koshur Pixel is the first large-scale synthetic OCR dataset for Kashmiri with 613,078 image-text pairs generated via SynthOCR-Gen from the KS-PRET-5M corpus across multiple fonts and granularities with 25+ augmentations.
ThinkingVLA: Interleaved Vision and Language Reasoning for Robotic Manipulation cs.RO · 2026-06-16 · unverdicted · none · ref 3 · internal anchor
ThinkingVLA is a Mixture-of-Transformers VLA model that performs interleaved forward CoT for subgoal and image prediction followed by inverse CoT grounded on the predicted image to generate actions.
FTP-1: A Generalist Foundation Tactile Policy Across Tactile Sensors for Contact-Rich Manipulation cs.RO · 2026-06-11 · unverdicted · none · ref 22 · internal anchor
FTP-1 is the first foundation tactile policy pretrained on ~3000 hours of data from 26 sources across 21 sensors that improves performance on seen setups by 17.2% and transfers to unseen sensors with 31% success rate gain.
Anchored, Not Graded: Vision-Language Models Fail at Slant-from-Texture Perception cs.CV · 2026-06-04 · unverdicted · none · ref 4 · internal anchor
VLMs across families and scales show anchoring to discrete slant angles in zero-shot and prompted settings rather than human-like graded texture-based slant perception.
HapTile: A Haptic-Informed Vision-Tactile-Language-Action Dataset for Contact-Rich Imitation Learning cs.RO · 2026-06-03 · unverdicted · none · ref 6 · internal anchor
HapTile introduces a visuotactile dataset with haptic-informed teleoperation for language-conditioned contact-rich manipulation tasks and provides baseline policy benchmarks.
Argus-Retriever: Vision-LLM Late-Interaction Retrieval with Region-Aware Query-Conditioned MoE for Visual Document Retrieval cs.IR · 2026-06-03 · unverdicted · none · ref 8 · internal anchor
Argus achieves the highest reported NDCG scores among open late-interaction models on ViDoRe V1 and combined V1+V2 by introducing query-dependent document representations via a region-aware MoE on Qwen3.5-VL, trained on 9% of public data with a 1024-dim head.
DeepLatent: Think with Images via Parallel Latent Visual Reasoning cs.CV · 2026-05-30 · unverdicted · none · ref 28 · internal anchor
DeepLatent introduces a parallel latent visual reasoning framework with learnable 2D tokens and continuous RL, trained via distillation then RL, plus a new 180K dataset, claiming SOTA benchmark results.
OmniGF: A Dual-Branch Vision-Language Framework for Unified Gaze Following cs.CV · 2026-05-26 · unverdicted · none · ref 18 · internal anchor
OmniGF adapts VLMs via dual-branch decoding and head embeddings to unify precise multi-person gaze localization with semantic and social reasoning, claiming new SOTA on benchmarks.
Large Language Model Selection with Limited Annotations cs.CL · 2026-05-24 · unverdicted · none · ref 14 · internal anchor
SELECT-LLM is the first active model selection framework for LLMs that uses expected information gain from pairwise output similarities to minimize required annotations, reporting up to 84.78% cost reduction across 23 datasets and 156 models.
GesVLA: Gesture-Aware Vision-Language-Action Model Embedded Representations cs.RO · 2026-05-21 · unverdicted · none · ref 4 · internal anchor
GesVLA encodes gesture features directly into the latent space of VLA models using a dual-VLM architecture and a rendering-based data pipeline, yielding improved target grounding in real robotic tasks.
Dynamic Execution Commitment of Vision-Language-Action Models cs.CV · 2026-05-12 · unverdicted · none · ref 22 · 3 links · internal anchor
A3 reframes dynamic action chunk commitment in VLA models as self-speculative prefix verification, accepting the longest continuous sequence of actions that satisfies consensus-ordered conditional invariance and prefix-closed sequential consistency.
RIO: Flexible Real-Time Robot I/O for Cross-Embodiment Robot Learning cs.RO · 2026-05-12 · unverdicted · none · ref 3 · internal anchor
RIO introduces a lightweight open-source framework that abstracts real-time robot I/O to support easy switching between embodiments and platforms for collecting data and deploying VLAs.
One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy cs.CV · 2026-05-08 · conditional · none · ref 4 · 3 links · internal anchor
Reducing visual input to one token per frame in VLA world models maintains or improves long-horizon performance on MetaWorld, LIBERO, and real-robot tasks.
AT-VLA: Adaptive Tactile Injection for Enhanced Feedback Reaction in Vision-Language-Action Models cs.RO · 2026-05-08 · unverdicted · none · ref 3 · 2 links · internal anchor
AT-VLA proposes adaptive tactile injection and a dual-stream tactile reaction mechanism to enhance VLA models for contact-rich robotic manipulation with real-time responses.
Embodied Interpretability: Linking Causal Understanding to Generalization in Vision-Language-Action Models cs.RO · 2026-05-01 · unverdicted · none · ref 53 · internal anchor
Introduces ISS and NMR as interventional metrics to diagnose causal misalignment in VLA policies and link it to generalization performance.
Modular Sensory Stream for Integrating Physical Feedback in Vision-Language-Action Models cs.RO · 2026-04-25 · unverdicted · none · ref 3 · internal anchor
MoSS augments VLAs with decoupled modality streams for multiple physical signals, achieving synergistic gains in real-world robot tasks via joint attention and auxiliary future-signal prediction.
SketchVLM: Vision language models can annotate images to explain thoughts and guide users cs.CV · 2026-04-23 · unverdicted · none · ref 5 · internal anchor
SketchVLM lets VLMs generate non-destructive SVG annotations on input images to visually explain answers, raising visual reasoning accuracy by up to 28.5 points and annotation quality by 1.48x over baselines.
Gaslight, Gatekeep, V1-V3: Early Visual Cortex Alignment Shields Vision-Language Models from Sycophantic Manipulation cs.CV · 2026-04-15 · conditional · none · ref 57 · internal anchor
Alignment of vision-language models with human V1-V3 early visual cortex negatively predicts resistance to sycophantic gaslighting attacks.
DSCA: Dynamic Subspace Concept Alignment for Lifelong VLM Editing cs.CV · 2026-04-09 · unverdicted · none · ref 1 · internal anchor
DSCA turns concept isolation into an architectural property by dynamically creating orthogonal subspaces for non-interfering lifelong edits in vision-language models, sustaining over 95% success after 1000 sequential edits.
MixAtlas: Uncertainty-aware Data Mixture Optimization for Multimodal LLM Midtraining cs.LG · 2026-04-03 · unverdicted · none · ref 4 · internal anchor
MixAtlas uses CLIP-based decomposition and Gaussian process optimization on small proxies to discover data mixtures that improve multimodal benchmark performance by up to 17.6% and transfer to larger models with faster convergence.
Generative Control as Optimization: Time Unconditional Flow Matching for Adaptive and Robust Robotic Control cs.RO · 2026-03-18 · conditional · none · ref 3 · internal anchor
GeCO replaces time-dependent flow matching with time-unconditional optimization, enabling adaptive inference and intrinsic OOD detection for robotic imitation learning.
AR-VLA: True Autoregressive Action Expert for Vision-Language-Action Models cs.RO · 2026-03-10 · unverdicted · none · ref 3 · internal anchor
AR-VLA introduces a standalone autoregressive action expert with long-lived memory that generates context-aware continuous actions for VLAs, replacing chunk-based heads with smoother trajectories and maintained task success.
SAM 3: Segment Anything with Concepts cs.CV · 2025-11-20 · unverdicted · none · ref 9 · internal anchor
SAM 3 introduces promptable concept segmentation that doubles accuracy of prior systems on images and videos while improving standard SAM segmentation performance.
Guided Query Refinement: Multimodal Hybrid Retrieval with Test-Time Optimization cs.CL · 2025-10-06 · unverdicted · none · ref 3 · internal anchor
GQR is a test-time optimization technique that refines primary retriever query embeddings using complementary retriever scores to achieve high performance with smaller representations in multimodal visual document retrieval.
Social Human Robot Embodied Conversation (SHREC) Dataset: Benchmarking Foundational Models' Social Reasoning cs.HC · 2025-04-07 · unverdicted · none · ref 4 · internal anchor
SHREC is a new benchmark dataset of embodied human-robot conversations that shows substantial performance gaps in state-of-the-art foundation models on tasks involving social error detection and rationale generation.
Domain Arithmetic: One-Shot VLA Adaptation under Environmental Shifts cs.RO · 2026-07-01 · unverdicted · none · ref 2 · internal anchor
DART adapts VLA models to environmental shifts with one demonstration using subspace-aligned weight vector arithmetic.
3D HAMSTER: Bridging Planning and Control in Hierarchical Vision Language Action Models through 3D Trajectory Guidance cs.RO · 2026-06-30 · unverdicted · none · ref 1 · 2 links · internal anchor
3D HAMSTER adds depth encoding and reconstruction to VLMs to produce 3D waypoint sequences that feed directly into pointcloud policies, claiming better generalization than 2D baselines under shifts.
When transformers learn "impossible" languages, what do they learn? cs.CL · 2026-06-29 · unverdicted · none · ref 145 · internal anchor
Transformers on impossible-language variants show gradual grammatical sensitivity loss but sharp long-sentence generation failures, supporting generative deficiency as a link to non-attestation.
The Speedup Paradox: Rethinking Inference Speed-Quality Trade-off in Embodied Tasks cs.RO · 2026-06-26 · unverdicted · none · ref 14 · 2 links · internal anchor
TISED decomposes inference optimization effects on embodied tasks and identifies paradoxical outcomes where faster per-step inference can increase task completion time on static tasks or raise success rates on dynamic tasks.
NRITYAM: Language Models Meet Art and Heritage of Dance cs.CL · 2026-06-18 · unverdicted · none · ref 10 · internal anchor
NRITYAM creates the largest multilingual benchmark for evaluating language models' understanding of dance traditions through expert-curated QA pairs.
Language-Instructed Vision Embeddings for Controllable and Generalizable Perception cs.CV · 2026-06-17 · unverdicted · none · ref 2 · internal anchor
LIVE uses language to generate task-centric vision embeddings at inference, reducing hallucinations by 34 points on MMVP, outperforming larger VLMs on VQA, and generalizing to unseen tasks.
WeaveLA: Event Driven Cross-Subtask Latent Memory Weaving for Repetitive Robot Manipulation cs.CV · 2026-06-16 · unverdicted · none · ref 3 · internal anchor
WeaveLA improves VLA policies for repetitive robot manipulation by event-triggered cross-subtask latent memory weaving, raising success on the hardest repetition tasks from 0% to 47.8% while leaving single-execution performance unchanged.
Visuals Lie, Consistency Speaks: Disentangling Spatial Attention from Reliability in Vision-Language Models cs.CV · 2026-06-16 · unverdicted · none · ref 1 · internal anchor
Spatial attention metrics in VLMs correlate near zero (R≈0.001) with accuracy while self-consistency predicts truth at R=0.429; reliability stems from generation dynamics rather than visual grounding.
APT: Action Expert Pretraining Improves Instruction Generalization of Vision-Language-Action Policies cs.RO · 2026-06-10 · unverdicted · none · ref 33 · internal anchor
APT pretrains the action expert as a vision-action prior on frozen VLM features then adds language through gated fusion to improve OOD instruction generalization in continuous-action VLA policies.
VeriSpace: Spatially Grounded Action Verification for Vision-Language-Action Models cs.RO · 2026-06-09 · unverdicted · none · ref 2 · internal anchor
VeriSpace is a 3D-aware action verifier that improves test-time action selection in VLA models by encoding scenes with visual and geometric information and reasoning over spatial relations and goal progress.
Hierarchical Policies from Verbal and Egocentric Human Signals for Natural Human-Robot Interaction cs.RO · 2026-06-09 · unverdicted · none · ref 7 · internal anchor
EDITH combines egocentric vision and gaze from smart glasses with language in a hierarchical policy to let robots interpret brief nonverbal human intent and reduce user effort in interactive tasks.
What Matters in Orchestrating Robot Policies: A Systematic Study of Hierarchical VLA Agents cs.RO · 2026-06-09 · unverdicted · none · ref 39 · internal anchor
A systematic study of hierarchical VLA agents identifies design principles that improve robot manipulation performance over flat and naive hierarchical baselines in simulation and real-world experiments.
Your Model Already Knows: Attention-Guided Safety Filter for Vision-Language-Action Models cs.RO · 2026-06-08 · unverdicted · none · ref 20 · internal anchor
Internal attention heads in VLA policies localize targets for a CBF safety filter that enables real-time collision avoidance with dynamic obstacles and outperforms init-time oracle identification by 43% on average.
MotionVLA: Injecting Geometric Motion into Vision-Language-Action Model cs.RO · 2026-06-06 · unverdicted · none · ref 17 · internal anchor
MotionVLA converts short past video windows into compact trajectory-field tokens to supply motion-consistent evidence for vision-language-action robot policies, improving long-horizon manipulation.
TempoVLA: Learning Speed-Controllable Vision-Language-Action Policies cs.RO · 2026-06-04 · unverdicted · none · ref 60 · internal anchor
TempoVLA learns a single VLA policy with controllable execution speed via variable-speed trajectory augmentation and explicit speed conditioning.
ActiveMimic: Egocentric Video Pretraining with Active Perception cs.RO · 2026-06-04 · unverdicted · none · ref 45 · internal anchor
ActiveMimic pretrains on egocentric human video by recovering and modeling active camera motion as viewpoint actions, matching robot-data pretraining performance on real-world tasks.
DRIFT: A Residual Flow Adapter for Decoding Continuous Outputs in Vision-Language Models cs.CV · 2026-06-04 · unverdicted · none · ref 4 · internal anchor
DRIFT adapts pretrained VLMs to continuous decoding via a base predictor plus residual flow matching, outperforming regression and generative baselines on grounding and robotic control tasks.
MM-Matryoshka: Towards Budget-Elastic Visual Document Retrieval via a 2D Multimodal Matryoshka Training Framework cs.CV · 2026-06-03 · unverdicted · none · ref 61 · internal anchor
MM-Matryoshka is a 2D Matryoshka training framework enabling budget-elastic ColPali-style multi-vector visual document retrieval along dimension and layer without separate models per budget.
NewtPhys: Do Foundation Models Understand Newtonian Physics? cs.CV · 2026-06-02 · unverdicted · none · ref 5 · internal anchor
NewtPhys dataset with real-scene physics annotations reveals limitations in low-level Newtonian reasoning across 56 VLMs and 10 VFMs.

PaliGemma: A versatile 3B VLM for transfer

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer