PaliGemma: A versatile 3B VLM for transfer

Alexander Kolesnikov, Andreas Steiner, Daniel Salz, Lucas Beyer, Xiao Wang · 2024 · cs.CV · arXiv 2407.07726

Mixed citation behavior. Most common role is background (59%).

165 Pith papers citing it

Background 59% of classified citations

abstract

PaliGemma is an open Vision-Language Model (VLM) that is based on the SigLIP-So400m vision encoder and the Gemma-2B language model. It is trained to be a versatile and broadly knowledgeable base model that is effective to transfer. It achieves strong performance on a wide variety of open-world tasks. We evaluate PaliGemma on almost 40 diverse tasks including standard VLM benchmarks, but also more specialized tasks such as remote-sensing and segmentation.

citation-role summary

background 19 · method 6 · baseline 5 · dataset 2

citation-polarity summary

background 19 · use method 6 · baseline 4 · use dataset 2 · unclear 1

authors

Alexander Kolesnikov, Andreas Steiner, André Susano Pinto, Daniel Salz, Lucas Beyer, Xiao Wang

representative citing papers

DataComp-VLM: Improved Open Datasets for Vision-Language Models

cs.CV · 2026-06-26 · conditional · novelty 8.0 · 2 refs

DataComp-VLM benchmark shows instruction-heavy data mixing outperforms filtering for VLM training, with DCVLM-Baseline achieving 63.6% on 33 tasks for 8B models (+5.4pp over FineVision).

RoboLab: A High-Fidelity Simulation Benchmark for Analysis of Task Generalist Policies

cs.RO · 2026-04-10 · unverdicted · novelty 8.0 · 2 refs

RoboLab is a new simulation benchmark with 120 tasks across visual, procedural, and relational axes that quantifies generalization gaps and perturbation sensitivity in task-generalist robotic policies.

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

cs.CV · 2026-01-15 · unverdicted · novelty 8.0

Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

cs.CV · 2024-09-25 · accept · novelty 8.0

Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.

Revisiting Parameter Redundancy in Vision-Language-Action Models: Insights from VLM-to-VLA Adaptation

cs.RO · 2026-06-30 · unverdicted · novelty 7.0

VLA models from VLM adaptation can be pruned 12-30% via multi-module joint scheme based on divergence signals while keeping ~90% performance on LIBERO without post-pruning recovery, unlike standard criteria that collapse.

Drop-Then-Recovery: How Redundant Are Vision-Language-Action Models?

cs.RO · 2026-06-26 · accept · novelty 7.0

VLA language backbones show high redundancy on manipulation benchmarks, with half the LLM blocks removable and even two blocks sufficient to recover baseline performance after fine-tuning, unlike vision and action pathways.

Koshur Pixel: a large-scale synthetic ocr dataset for kashmiri

cs.CV · 2026-06-22 · unverdicted · novelty 7.0 · 2 refs

Koshur Pixel is the first large-scale synthetic OCR dataset for Kashmiri with 613,078 image-text pairs generated via SynthOCR-Gen from the KS-PRET-5M corpus across multiple fonts and granularities with 25+ augmentations.

NAC: Neural Action Codec for Vision-Language-Action Models

cs.RO · 2026-06-19 · unverdicted · novelty 7.0

NAC adapts multi-scale RVQGAN audio codecs with kinematic-specific losses to produce ordered action tokens that yield lower reconstruction error and higher task success than prior tokenizers in VLA models.

ThinkingVLA: Interleaved Vision and Language Reasoning for Robotic Manipulation

cs.RO · 2026-06-16 · unverdicted · novelty 7.0

ThinkingVLA is a Mixture-of-Transformers VLA model that performs interleaved forward CoT for subgoal and image prediction followed by inverse CoT grounded on the predicted image to generate actions.

FTP-1: A Generalist Foundation Tactile Policy Across Tactile Sensors for Contact-Rich Manipulation

cs.RO · 2026-06-11 · unverdicted · novelty 7.0

FTP-1 is the first foundation tactile policy pretrained on ~3000 hours of data from 26 sources across 21 sensors that improves performance on seen setups by 17.2% and transfers to unseen sensors with 31% success rate gain.

Anchored, Not Graded: Vision-Language Models Fail at Slant-from-Texture Perception

cs.CV · 2026-06-04 · unverdicted · novelty 7.0

VLMs across families and scales show anchoring to discrete slant angles in zero-shot and prompted settings rather than human-like graded texture-based slant perception.

HapTile: A Haptic-Informed Vision-Tactile-Language-Action Dataset for Contact-Rich Imitation Learning

cs.RO · 2026-06-03 · unverdicted · novelty 7.0

HapTile introduces a visuotactile dataset with haptic-informed teleoperation for language-conditioned contact-rich manipulation tasks and provides baseline policy benchmarks.

Argus-Retriever: Vision-LLM Late-Interaction Retrieval with Region-Aware Query-Conditioned MoE for Visual Document Retrieval

cs.IR · 2026-06-03 · unverdicted · novelty 7.0

Argus achieves the highest reported NDCG scores among open late-interaction models on ViDoRe V1 and combined V1+V2 by introducing query-dependent document representations via a region-aware MoE on Qwen3.5-VL, trained on 9% of public data with a 1024-dim head.

DeepLatent: Think with Images via Parallel Latent Visual Reasoning

cs.CV · 2026-05-30 · unverdicted · novelty 7.0

DeepLatent introduces a parallel latent visual reasoning framework with learnable 2D tokens and continuous RL, trained via distillation then RL, plus a new 180K dataset, claiming SOTA benchmark results.

OmniGF: A Dual-Branch Vision-Language Framework for Unified Gaze Following

cs.CV · 2026-05-26 · unverdicted · novelty 7.0

OmniGF adapts VLMs via dual-branch decoding and head embeddings to unify precise multi-person gaze localization with semantic and social reasoning, claiming new SOTA on benchmarks.

Large Language Model Selection with Limited Annotations

cs.CL · 2026-05-24 · unverdicted · novelty 7.0

SELECT-LLM is the first active model selection framework for LLMs that uses expected information gain from pairwise output similarities to minimize required annotations, reporting up to 84.78% cost reduction across 23 datasets and 156 models.

GesVLA: Gesture-Aware Vision-Language-Action Model Embedded Representations

cs.RO · 2026-05-21 · unverdicted · novelty 7.0

GesVLA encodes gesture features directly into the latent space of VLA models using a dual-VLM architecture and a rendering-based data pipeline, yielding improved target grounding in real robotic tasks.

Dynamic Execution Commitment of Vision-Language-Action Models

cs.CV · 2026-05-12 · unverdicted · novelty 7.0 · 3 refs

A3 reframes dynamic action chunk commitment in VLA models as self-speculative prefix verification, accepting the longest continuous sequence of actions that satisfies consensus-ordered conditional invariance and prefix-closed sequential consistency.

RIO: Flexible Real-Time Robot I/O for Cross-Embodiment Robot Learning

cs.RO · 2026-05-12 · unverdicted · novelty 7.0

RIO introduces a lightweight open-source framework that abstracts real-time robot I/O to support easy switching between embodiments and platforms for collecting data and deploying VLAs.

One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy

cs.CV · 2026-05-08 · conditional · novelty 7.0 · 3 refs

Reducing visual input to one token per frame in VLA world models maintains or improves long-horizon performance on MetaWorld, LIBERO, and real-robot tasks.

AT-VLA: Adaptive Tactile Injection for Enhanced Feedback Reaction in Vision-Language-Action Models

cs.RO · 2026-05-08 · unverdicted · novelty 7.0 · 2 refs

AT-VLA proposes adaptive tactile injection and a dual-stream tactile reaction mechanism to enhance VLA models for contact-rich robotic manipulation with real-time responses.

Embodied Interpretability: Linking Causal Understanding to Generalization in Vision-Language-Action Models

cs.RO · 2026-05-01 · unverdicted · novelty 7.0

Introduces ISS and NMR as interventional metrics to diagnose causal misalignment in VLA policies and link it to generalization performance.

Modular Sensory Stream for Integrating Physical Feedback in Vision-Language-Action Models

cs.RO · 2026-04-25 · unverdicted · novelty 7.0

MoSS augments VLAs with decoupled modality streams for multiple physical signals, achieving synergistic gains in real-world robot tasks via joint attention and auxiliary future-signal prediction.

SketchVLM: Vision language models can annotate images to explain thoughts and guide users

cs.CV · 2026-04-23 · unverdicted · novelty 7.0

SketchVLM lets VLMs generate non-destructive SVG annotations on input images to visually explain answers, raising visual reasoning accuracy by up to 28.5 points and annotation quality by 1.48x over baselines.

citing papers explorer

Showing 50 of 165 citing papers.

DataComp-VLM: Improved Open Datasets for Vision-Language Models cs.CV · 2026-06-26 · conditional · none · ref 21 · 2 links · internal anchor
DataComp-VLM benchmark shows instruction-heavy data mixing outperforms filtering for VLM training, with DCVLM-Baseline achieving 63.6% on 33 tasks for 8B models (+5.4pp over FineVision).
RoboLab: A High-Fidelity Simulation Benchmark for Analysis of Task Generalist Policies cs.RO · 2026-04-10 · unverdicted · none · ref 4 · 2 links · internal anchor
RoboLab is a new simulation benchmark with 120 tasks across visual, procedural, and relational axes that quantifies generalization gaps and perturbation sensitivity in task-generalist robotic policies.
Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding cs.CV · 2026-01-15 · unverdicted · none · ref 13 · internal anchor
Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models cs.CV · 2024-09-25 · accept · none · ref 10 · internal anchor
Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.
Revisiting Parameter Redundancy in Vision-Language-Action Models: Insights from VLM-to-VLA Adaptation cs.RO · 2026-06-30 · unverdicted · none · ref 2 · internal anchor
VLA models from VLM adaptation can be pruned 12-30% via multi-module joint scheme based on divergence signals while keeping ~90% performance on LIBERO without post-pruning recovery, unlike standard criteria that collapse.
Drop-Then-Recovery: How Redundant Are Vision-Language-Action Models? cs.RO · 2026-06-26 · accept · none · ref 2 · internal anchor
VLA language backbones show high redundancy on manipulation benchmarks, with half the LLM blocks removable and even two blocks sufficient to recover baseline performance after fine-tuning, unlike vision and action pathways.
Koshur Pixel: a large-scale synthetic ocr dataset for kashmiri cs.CV · 2026-06-22 · unverdicted · none · ref 33 · 2 links · internal anchor
Koshur Pixel is the first large-scale synthetic OCR dataset for Kashmiri with 613,078 image-text pairs generated via SynthOCR-Gen from the KS-PRET-5M corpus across multiple fonts and granularities with 25+ augmentations.
NAC: Neural Action Codec for Vision-Language-Action Models cs.RO · 2026-06-19 · unverdicted · none · ref 19 · internal anchor
NAC adapts multi-scale RVQGAN audio codecs with kinematic-specific losses to produce ordered action tokens that yield lower reconstruction error and higher task success than prior tokenizers in VLA models.
ThinkingVLA: Interleaved Vision and Language Reasoning for Robotic Manipulation cs.RO · 2026-06-16 · unverdicted · none · ref 3 · internal anchor
ThinkingVLA is a Mixture-of-Transformers VLA model that performs interleaved forward CoT for subgoal and image prediction followed by inverse CoT grounded on the predicted image to generate actions.
FTP-1: A Generalist Foundation Tactile Policy Across Tactile Sensors for Contact-Rich Manipulation cs.RO · 2026-06-11 · unverdicted · none · ref 22 · internal anchor
FTP-1 is the first foundation tactile policy pretrained on ~3000 hours of data from 26 sources across 21 sensors that improves performance on seen setups by 17.2% and transfers to unseen sensors with 31% success rate gain.
Anchored, Not Graded: Vision-Language Models Fail at Slant-from-Texture Perception cs.CV · 2026-06-04 · unverdicted · none · ref 4 · internal anchor
VLMs across families and scales show anchoring to discrete slant angles in zero-shot and prompted settings rather than human-like graded texture-based slant perception.
HapTile: A Haptic-Informed Vision-Tactile-Language-Action Dataset for Contact-Rich Imitation Learning cs.RO · 2026-06-03 · unverdicted · none · ref 6 · internal anchor
HapTile introduces a visuotactile dataset with haptic-informed teleoperation for language-conditioned contact-rich manipulation tasks and provides baseline policy benchmarks.
Argus-Retriever: Vision-LLM Late-Interaction Retrieval with Region-Aware Query-Conditioned MoE for Visual Document Retrieval cs.IR · 2026-06-03 · unverdicted · none · ref 8 · internal anchor
Argus achieves the highest reported NDCG scores among open late-interaction models on ViDoRe V1 and combined V1+V2 by introducing query-dependent document representations via a region-aware MoE on Qwen3.5-VL, trained on 9% of public data with a 1024-dim head.
DeepLatent: Think with Images via Parallel Latent Visual Reasoning cs.CV · 2026-05-30 · unverdicted · none · ref 28 · internal anchor
DeepLatent introduces a parallel latent visual reasoning framework with learnable 2D tokens and continuous RL, trained via distillation then RL, plus a new 180K dataset, claiming SOTA benchmark results.
OmniGF: A Dual-Branch Vision-Language Framework for Unified Gaze Following cs.CV · 2026-05-26 · unverdicted · none · ref 18 · internal anchor
OmniGF adapts VLMs via dual-branch decoding and head embeddings to unify precise multi-person gaze localization with semantic and social reasoning, claiming new SOTA on benchmarks.
Large Language Model Selection with Limited Annotations cs.CL · 2026-05-24 · unverdicted · none · ref 14 · internal anchor
SELECT-LLM is the first active model selection framework for LLMs that uses expected information gain from pairwise output similarities to minimize required annotations, reporting up to 84.78% cost reduction across 23 datasets and 156 models.
GesVLA: Gesture-Aware Vision-Language-Action Model Embedded Representations cs.RO · 2026-05-21 · unverdicted · none · ref 4 · internal anchor
GesVLA encodes gesture features directly into the latent space of VLA models using a dual-VLM architecture and a rendering-based data pipeline, yielding improved target grounding in real robotic tasks.
Dynamic Execution Commitment of Vision-Language-Action Models cs.CV · 2026-05-12 · unverdicted · none · ref 22 · 3 links · internal anchor
A3 reframes dynamic action chunk commitment in VLA models as self-speculative prefix verification, accepting the longest continuous sequence of actions that satisfies consensus-ordered conditional invariance and prefix-closed sequential consistency.
RIO: Flexible Real-Time Robot I/O for Cross-Embodiment Robot Learning cs.RO · 2026-05-12 · unverdicted · none · ref 3 · internal anchor
RIO introduces a lightweight open-source framework that abstracts real-time robot I/O to support easy switching between embodiments and platforms for collecting data and deploying VLAs.
One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy cs.CV · 2026-05-08 · conditional · none · ref 4 · 3 links · internal anchor
Reducing visual input to one token per frame in VLA world models maintains or improves long-horizon performance on MetaWorld, LIBERO, and real-robot tasks.
AT-VLA: Adaptive Tactile Injection for Enhanced Feedback Reaction in Vision-Language-Action Models cs.RO · 2026-05-08 · unverdicted · none · ref 3 · 2 links · internal anchor
AT-VLA proposes adaptive tactile injection and a dual-stream tactile reaction mechanism to enhance VLA models for contact-rich robotic manipulation with real-time responses.
Embodied Interpretability: Linking Causal Understanding to Generalization in Vision-Language-Action Models cs.RO · 2026-05-01 · unverdicted · none · ref 53 · internal anchor
Introduces ISS and NMR as interventional metrics to diagnose causal misalignment in VLA policies and link it to generalization performance.
Modular Sensory Stream for Integrating Physical Feedback in Vision-Language-Action Models cs.RO · 2026-04-25 · unverdicted · none · ref 3 · internal anchor
MoSS augments VLAs with decoupled modality streams for multiple physical signals, achieving synergistic gains in real-world robot tasks via joint attention and auxiliary future-signal prediction.
SketchVLM: Vision language models can annotate images to explain thoughts and guide users cs.CV · 2026-04-23 · unverdicted · none · ref 5 · internal anchor
SketchVLM lets VLMs generate non-destructive SVG annotations on input images to visually explain answers, raising visual reasoning accuracy by up to 28.5 points and annotation quality by 1.48x over baselines.
Gaslight, Gatekeep, V1-V3: Early Visual Cortex Alignment Shields Vision-Language Models from Sycophantic Manipulation cs.CV · 2026-04-15 · conditional · none · ref 57 · internal anchor
Alignment of vision-language models with human V1-V3 early visual cortex negatively predicts resistance to sycophantic gaslighting attacks.
DSCA: Dynamic Subspace Concept Alignment for Lifelong VLM Editing cs.CV · 2026-04-09 · unverdicted · none · ref 1 · internal anchor
DSCA turns concept isolation into an architectural property by dynamically creating orthogonal subspaces for non-interfering lifelong edits in vision-language models, sustaining over 95% success after 1000 sequential edits.
MixAtlas: Uncertainty-aware Data Mixture Optimization for Multimodal LLM Midtraining cs.LG · 2026-04-03 · unverdicted · none · ref 4 · internal anchor
MixAtlas uses CLIP-based decomposition and Gaussian process optimization on small proxies to discover data mixtures that improve multimodal benchmark performance by up to 17.6% and transfer to larger models with faster convergence.
Generative Control as Optimization: Time Unconditional Flow Matching for Adaptive and Robust Robotic Control cs.RO · 2026-03-18 · conditional · none · ref 3 · internal anchor
GeCO replaces time-dependent flow matching with time-unconditional optimization, enabling adaptive inference and intrinsic OOD detection for robotic imitation learning.
AR-VLA: True Autoregressive Action Expert for Vision-Language-Action Models cs.RO · 2026-03-10 · unverdicted · none · ref 3 · internal anchor
AR-VLA introduces a standalone autoregressive action expert with long-lived memory that generates context-aware continuous actions for VLAs, replacing chunk-based heads with smoother trajectories and maintained task success.
SAM 3: Segment Anything with Concepts cs.CV · 2025-11-20 · unverdicted · none · ref 9 · internal anchor
SAM 3 introduces promptable concept segmentation that doubles accuracy of prior systems on images and videos while improving standard SAM segmentation performance.
Guided Query Refinement: Multimodal Hybrid Retrieval with Test-Time Optimization cs.CL · 2025-10-06 · unverdicted · none · ref 3 · internal anchor
GQR is a test-time optimization technique that refines primary retriever query embeddings using complementary retriever scores to achieve high performance with smaller representations in multimodal visual document retrieval.
Social Human Robot Embodied Conversation (SHREC) Dataset: Benchmarking Foundational Models' Social Reasoning cs.HC · 2025-04-07 · unverdicted · none · ref 4 · internal anchor
SHREC is a new benchmark dataset of embodied human-robot conversations that shows substantial performance gaps in state-of-the-art foundation models on tasks involving social error detection and rationale generation.
Domain Arithmetic: One-Shot VLA Adaptation under Environmental Shifts cs.RO · 2026-07-01 · unverdicted · none · ref 2 · internal anchor
DART adapts VLA models to environmental shifts with one demonstration using subspace-aligned weight vector arithmetic.
3D HAMSTER: Bridging Planning and Control in Hierarchical Vision Language Action Models through 3D Trajectory Guidance cs.RO · 2026-06-30 · unverdicted · none · ref 1 · 2 links · internal anchor
3D HAMSTER adds depth encoding and reconstruction to VLMs to produce 3D waypoint sequences that feed directly into pointcloud policies, claiming better generalization than 2D baselines under shifts.
When transformers learn "impossible" languages, what do they learn? cs.CL · 2026-06-29 · unverdicted · none · ref 145 · internal anchor
Transformers on impossible-language variants show gradual grammatical sensitivity loss but sharp long-sentence generation failures, supporting generative deficiency as a link to non-attestation.
The Speedup Paradox: Rethinking Inference Speed-Quality Trade-off in Embodied Tasks cs.RO · 2026-06-26 · unverdicted · none · ref 14 · 2 links · internal anchor
TISED decomposes inference optimization effects on embodied tasks and identifies paradoxical outcomes where faster per-step inference can increase task completion time on static tasks or raise success rates on dynamic tasks.
LightSTAR: Efficient Visual Document Retrieval via Lightweight Selection with Vision-Adaptive Refinement cs.CV · 2026-06-22 · unverdicted · none · ref 3 · internal anchor
LightSTAR achieves state-of-the-art accuracy in visual document retrieval by decomposing the task into LLM-free high-recall candidate selection and vision-adaptive semantic refinement on candidates, cutting end-to-end latency several-fold.
FlowDPG: Deterministic Policy Gradient on Flow Matching Policies for Real-World Manipulation cs.RO · 2026-06-21 · unverdicted · none · ref 22 · internal anchor
FlowDPG distills critic gradients into flow matching velocity fields to enable BPTT-free DDPG-style policy improvement and reports 92% success on a real-world dual-arm AirPods assembly task.
OpenHLM: An Empirical Recipe for Whole-Body Humanoid Loco-Manipulation cs.RO · 2026-06-20 · unverdicted · none · ref 23 · internal anchor
OpenHLM is an empirical recipe yielding a whole-body humanoid VLA model that outperforms GR00T N1.6 and Ψ0 baselines on long-horizon tasks using less than half the demonstration time.
Wh0: Generative World Models as Scalable Sources of Egocentric Human Hand Manipulation Data cs.RO · 2026-06-20 · unverdicted · none · ref 59 · internal anchor
Wh0 generates scalable egocentric human manipulation videos with world models and converts them to boost pretrained VLA models' zero-shot dexterous task success from 8.3% to 38.9% on 18 real-world tasks.
Robot Critics that Sweat the Small Stuff cs.RO · 2026-06-19 · unverdicted · none · ref 21 · internal anchor
Fine-tuning VLMs with pairwise progress supervision from policy rollouts improves fine-grained failure detection and boosts robot manipulation success by 11% real-world and 5.9% in simulation.
Decoupling the Declarative from the Procedural in Vision-Language-Action Models cs.RO · 2026-06-19 · unverdicted · none · ref 1 · internal anchor
w²VLA restructures VLA information flow to decouple declarative semantics from procedural skills, enabling zero-shot transfer to novel objects.
VLA-FAIL: Efficient Task Failure Detection for Finetuned Vision-Language-Action Models cs.LG · 2026-06-19 · unverdicted · none · ref 33 · internal anchor
VLA-FAIL introduces last-layer Mahalanobis distance and action chunk consistency detectors that together enable early, reliable failure detection in finetuned VLAs without failure data or expensive sampling.
NRITYAM: Language Models Meet Art and Heritage of Dance cs.CL · 2026-06-18 · unverdicted · none · ref 10 · internal anchor
NRITYAM creates the largest multilingual benchmark for evaluating language models' understanding of dance traditions through expert-curated QA pairs.
Language-Instructed Vision Embeddings for Controllable and Generalizable Perception cs.CV · 2026-06-17 · unverdicted · none · ref 2 · internal anchor
LIVE uses language to generate task-centric vision embeddings at inference, reducing hallucinations by 34 points on MMVP, outperforming larger VLMs on VQA, and generalizing to unseen tasks.
WeaveLA: Event Driven Cross-Subtask Latent Memory Weaving for Repetitive Robot Manipulation cs.CV · 2026-06-16 · unverdicted · none · ref 3 · internal anchor
WeaveLA improves VLA policies for repetitive robot manipulation by event-triggered cross-subtask latent memory weaving, raising success on the hardest repetition tasks from 0% to 47.8% while leaving single-execution performance unchanged.
Visuals Lie, Consistency Speaks: Disentangling Spatial Attention from Reliability in Vision-Language Models cs.CV · 2026-06-16 · unverdicted · none · ref 1 · internal anchor
Spatial attention metrics in VLMs correlate near zero (R≈0.001) with accuracy while self-consistency predicts truth at R=0.429; reliability stems from generation dynamics rather than visual grounding.
APT: Action Expert Pretraining Improves Instruction Generalization of Vision-Language-Action Policies cs.RO · 2026-06-10 · unverdicted · none · ref 33 · internal anchor
APT pretrains the action expert as a vision-action prior on frozen VLM features then adds language through gated fusion to improve OOD instruction generalization in continuous-action VLA policies.
VeriSpace: Spatially Grounded Action Verification for Vision-Language-Action Models cs.RO · 2026-06-09 · unverdicted · none · ref 2 · internal anchor
VeriSpace is a 3D-aware action verifier that improves test-time action selection in VLA models by encoding scenes with visual and geometric information and reasoning over spatial relations and goal progress.
Hierarchical Policies from Verbal and Egocentric Human Signals for Natural Human-Robot Interaction cs.RO · 2026-06-09 · unverdicted · none · ref 7 · internal anchor
EDITH combines egocentric vision and gaze from smart glasses with language in a hierarchical policy to let robots interpret brief nonverbal human intent and reduce user effort in interactive tasks.
