SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics
Pith reviewed 2026-05-11 21:17 UTC · model grok-4.3
The pith
SmolVLA is a compact vision-language-action model that matches the performance of models ten times larger while running on consumer hardware.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SmolVLA adapts a compact VLM into a VLA that retains competitive performance on robotic tasks despite being roughly one-tenth the size of prior VLAs. It achieves this by training on community data collected with low-cost platforms, keeping training within a single GPU, and using an asynchronous inference design that decouples perception and action prediction from action execution, supporting higher control frequencies through chunked action generation.
What carries the argument
The SmolVLA model architecture: a small pretrained VLM adapted for action prediction, combined with an asynchronous inference stack that decouples perception and action prediction from execution so that actions can be generated in chunks at higher control rates.
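The mechanics of that decoupling are easy to picture with a short sketch. The following is a minimal illustration, not the paper's released implementation: a background thread refills a queue of chunked actions while the control loop keeps consuming them, falling back to a blocking prediction only when the queue empties. The predict_chunk method, the refill threshold, and the threading layout are assumptions made for this example.

```python
# Minimal sketch of asynchronous, chunked inference (illustrative only).
# policy.predict_chunk and the refill threshold are assumed names/values.
import threading
from collections import deque

class AsyncChunkedController:
    def __init__(self, policy, chunk_refill_threshold=10):
        self.policy = policy                      # maps observation -> list of actions (a chunk)
        self.queue = deque()                      # pending actions awaiting execution
        self.refill_threshold = chunk_refill_threshold
        self.lock = threading.Lock()
        self.pending = None                       # in-flight prediction thread, if any

    def _predict(self, observation):
        chunk = self.policy.predict_chunk(observation)   # slow: VLM backbone + action head
        with self.lock:
            self.queue.extend(chunk)

    def step(self, observation):
        """Called at the control rate; blocks on the policy only when the queue is empty."""
        with self.lock:
            n_pending = len(self.queue)
        if n_pending == 0:
            # Cold start or queue exhausted: predict synchronously before acting.
            self._predict(observation)
        elif n_pending <= self.refill_threshold and (
            self.pending is None or not self.pending.is_alive()
        ):
            # Queue is running low: refill in the background while execution continues.
            self.pending = threading.Thread(target=self._predict, args=(observation,))
            self.pending.start()
        with self.lock:
            return self.queue.popleft()
```

In a setup like this, the refill threshold trades compute for reactivity: a higher threshold triggers re-prediction earlier, so executed actions reflect fresher observations at the cost of more policy calls.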
If this is right
- Robotic policies become trainable and deployable without industrial-scale compute clusters.
- Natural-language-driven control becomes feasible on consumer-grade GPUs and even CPUs.
- Higher-frequency action execution is possible through decoupled asynchronous inference and chunked predictions.
- Community data sources can substitute for large curated datasets in VLA training.
- All components including models and data are made openly available for immediate reuse and extension.
Where Pith is reading between the lines
- The design could extend to other low-cost robot platforms by swapping in their community datasets while keeping the same small backbone.
- Edge deployment on embedded devices becomes more realistic once the CPU-compatible inference path is further optimized.
- The asynchronous stack may generalize to other VLA or VLM-based systems that need to maintain high control rates under variable perception latency.
Load-bearing premise
Community-collected data from affordable robotic platforms is sufficiently diverse and high-quality to support competitive performance on real-world tasks without needing the scale of academic or industrial datasets.
What would settle it
A direct head-to-head evaluation on the paper's real-world benchmarks in which SmolVLA, trained only on the released community data, falls substantially below the performance of a 10x-larger VLA baseline under identical task conditions and evaluation protocols.
Original abstract
Vision-language models (VLMs) pretrained on large-scale multimodal datasets encode rich visual and linguistic knowledge, making them a strong foundation for robotics. Rather than training robotic policies from scratch, recent approaches adapt VLMs into vision-language-action (VLA) models that enable natural language-driven perception and control. However, existing VLAs are typically massive--often with billions of parameters--leading to high training costs and limited real-world deployability. Moreover, they rely on academic and industrial datasets, overlooking the growing availability of community-collected data from affordable robotic platforms. In this work, we present SmolVLA, a small, efficient, and community-driven VLA that drastically reduces both training and inference costs, while retaining competitive performance. SmolVLA is designed to be trained on a single GPU and deployed on consumer-grade GPUs or even CPUs. To further improve responsiveness, we introduce an asynchronous inference stack decoupling perception and action prediction from action execution, allowing higher control rates with chunked action generation. Despite its compact size, SmolVLA achieves performance comparable to VLAs that are 10x larger. We evaluate SmolVLA on a range of both simulated as well as real-world robotic benchmarks and release all code, pretrained models, and training data.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces SmolVLA, a compact vision-language-action model for robotics that is trained on a single GPU using community-collected data from affordable platforms. It includes an asynchronous inference stack that decouples perception/action prediction from execution to enable higher control rates with chunked actions. The central claim is that this small model achieves performance comparable to VLAs that are 10x larger on both simulated and real-world robotic tasks, with all code, pretrained weights, and training data released for reproducibility.
Significance. If the empirical results hold, the work is significant for demonstrating that competitive VLA performance is achievable with modest compute and open community data rather than large-scale academic or industrial datasets. The explicit release of reproducible code, models, and data is a clear strength that allows independent verification and lowers barriers for further research in efficient robotics.
major comments (1)
- [§4] §4 (Experiments): The claim of performance comparable to 10x larger VLAs is central but requires explicit side-by-side quantitative metrics (success rates, task scores) with named baselines, number of trials, and error bars or variance; without these details the comparability cannot be rigorously assessed from the reported evaluations.
minor comments (3)
- [Abstract and §1] The abstract and introduction repeat the limitations of existing large VLAs without adding new information; condensing this would improve readability.
- [§2] Notation for model sizes (parameter counts) and hardware requirements should be introduced consistently in §2 or §3 with a table for clarity.
- [§3] Figure captions for the architecture and inference stack diagrams could include more detail on data flow and timing to aid understanding of the asynchronous design.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive recommendation for minor revision. We address the single major comment point by point below.
Point-by-point responses
-
Referee: [§4] §4 (Experiments): The claim of performance comparable to 10x larger VLAs is central but requires explicit side-by-side quantitative metrics (success rates, task scores) with named baselines, number of trials, and error bars or variance; without these details the comparability cannot be rigorously assessed from the reported evaluations.
Authors: We agree that the central claim requires more explicit quantitative support for rigorous assessment. In the revised manuscript, we will expand the experiments section to include a dedicated comparison table listing success rates and task scores for SmolVLA alongside named larger VLA baselines. The table will report the number of evaluation trials per task and include error bars or standard deviations to quantify variance. This addition will directly address the concern while preserving the existing results.
Revision: yes
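As a concrete illustration of the per-task statistics such a comparison table would carry, the snippet below computes a success rate with a binomial standard error from binary trial outcomes. The task names and outcomes are invented for the example and are not results from the paper.

```python
# Illustrative only: mean success rate and standard error over N binary trials.
import math

def success_stats(outcomes):
    """Return (success rate, binomial standard error, number of trials)."""
    n = len(outcomes)
    p = sum(outcomes) / n
    se = math.sqrt(p * (1 - p) / n)
    return p, se, n

# Hypothetical trial outcomes (1 = success, 0 = failure), not the paper's data.
trials = {
    "pick-place (SmolVLA)":      [1, 1, 0, 1, 1, 1, 0, 1, 1, 1],
    "pick-place (10x baseline)": [1, 1, 1, 0, 1, 1, 1, 1, 0, 1],
}
for name, outcomes in trials.items():
    p, se, n = success_stats(outcomes)
    print(f"{name}: {p:.0%} +/- {se:.0%} (n={n})")
```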
Circularity Check
No significant circularity; empirical model release with independent benchmarks
Full rationale
The manuscript introduces SmolVLA as a compact VLA architecture trained on community-collected data, evaluated on simulated and real-world robotic tasks, and released with code/weights/data. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-referential definitions appear. Central performance claims rest on external benchmark comparisons and released artifacts that permit independent verification, satisfying the criteria for a self-contained empirical contribution with no load-bearing reductions to internal inputs.
Forward citations
Cited by 57 Pith papers
-
Realtime-VLA FLASH: Speculative Inference Framework for Diffusion-based VLAs
A new speculative inference system speeds up diffusion VLAs to 19.1 ms average latency (3.04x faster) on LIBERO by replacing most full 58 ms inferences with 7.8 ms draft rounds while preserving task performance.
-
Test-time Sparsity for Extreme Fast Action Diffusion
Test-time sparsity with a parallel pipeline and omnidirectional feature reuse accelerates action diffusion by 5x to 47.5 Hz while cutting FLOPs 92% with no performance loss.
-
From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation
MoLA infers a mixture of latent actions from generated future videos via modality-aware inverse dynamics models to improve robot manipulation policies.
-
Premover: Fast Vision-Language-Action Control by Acting Before Instructions Are Complete
Premover enables VLA policies to act on partial instructions by precomputing focus maps from intermediate backbone layers, reducing wall-clock time 13.6 percent on LIBERO while preserving 95 percent success rate.
-
Overcoming Dynamics-Blindness: Training-Free Pace-and-Path Correction for VLA Models
Pace-and-Path Correction decomposes a quadratic cost minimization into orthogonal pace and path channels to correct chunked actions in VLA models, raising success rates by up to 28.8% in dynamic settings.
-
Do Vision-Language-Models show human-like logical problem-solving capability in point and click puzzle games?
VLATIM benchmark reveals large VLMs excel at high-level planning in physics puzzles but struggle with precise visual grounding and mouse control, so they lack human-like problem-solving capabilities.
-
ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models
ALAM creates algebraically consistent latent action transitions from videos to act as auxiliary generative targets, raising robot policy success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.
-
LoopVLA: Learning Sufficiency in Recurrent Refinement for Vision-Language-Action Models
LoopVLA adds recurrent refinement and learned sufficiency estimation to VLA models, cutting parameters 45% and raising throughput 1.7x while matching baseline task success on LIBERO and VLA-Arena.
-
Towards Backdoor-Based Ownership Verification for Vision-Language-Action Models
GuardVLA embeds a stealthy backdoor watermark in VLAs via secret messages in visual data and uses a swap-and-detect mechanism for post-release ownership verification that preserves task performance.
-
One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy
Reducing visual input to one token per frame in VLA world models maintains or improves long-horizon performance on MetaWorld, LIBERO, and real-robot tasks.
-
NoiseGate: Learning Per-Latent Timestep Schedules as Information Gating in World Action Models
NoiseGate learns per-latent timestep schedules as an information-gating policy in diffusion-based world action models, yielding consistent gains on RoboTwin manipulation tasks.
-
OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation
OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.
-
MolmoAct2: Action Reasoning Models for Real-world Deployment
MolmoAct2 delivers an open VLA model with new specialized components, datasets, and techniques that outperforms baselines on benchmarks while releasing all weights, code, and data for real-world robot use.
-
Characterizing Vision-Language-Action Models across XPUs: Constraints and Acceleration for On-Robot Deployment
VLA models exhibit a compute-bound VLM phase followed by a memory-bound action phase on edge hardware; DP-Cache and V-AEFusion reduce redundancy and enable pipeline parallelism for up to 6x speedup on NPUs with margin...
-
Generative Control as Optimization: Time Unconditional Flow Matching for Adaptive and Robust Robotic Control
GeCO replaces time-dependent flow matching with time-unconditional optimization, enabling adaptive inference and intrinsic OOD detection for robotic imitation learning.
-
You've Got a Golden Ticket: Improving Generative Robot Policies With A Single Noise Vector
Optimizing a single constant initial noise vector for frozen generative robot policies improves success rates on 38 of 43 tasks by up to 58% relative improvement.
-
See What Matters: Differentiable Grid Sample Pruning for Generalizable Vision-Language-Action Model
GridS reduces visual tokens in VLA models to under 10% of the original count via task-aware differentiable resampling, delivering 76% lower FLOPs with no drop in task success rate on benchmarks and real robots.
-
Overcoming Dynamics-Blindness: Training-Free Pace-and-Path Correction for VLA Models
Pace-and-Path Correction is a closed-form inference-time operator that decomposes a quadratic cost minimization into orthogonal pace compression and path offset channels to correct dynamics-blindness in chunked-action...
-
PriorVLA: Prior-Preserving Adaptation for Vision-Language-Action Models
PriorVLA preserves pretrained priors in VLA models through a frozen Prior Expert and trained Adaptation Expert, delivering better robot manipulation performance than full fine-tuning with only 25% of the parameter updates.
-
Unified Noise Steering for Efficient Human-Guided VLA Adaptation
UniSteer unifies human corrective actions and noise-space RL for VLA adaptation by inverting actions to noise targets, raising success rates from 20% to 90% in 66 minutes across four real-world manipulation tasks.
-
ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models
ALAM introduces algebraic consistency regularization on latent action transitions from videos, raising VLA success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.
-
One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy
Reducing visual input to one token per frame in world models for vision-language-action policies maintains long-horizon performance while improving success rates on MetaWorld, LIBERO, and real-robot tasks.
-
One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy
Reducing visual input to one token per frame via adaptive attention pooling and a unified flow-matching objective improves long-horizon performance in VLA policies on MetaWorld, LIBERO, and real-robot tasks.
-
Escaping the Diversity Trap in Robotic Manipulation via Anchor-Centric Adaptation
Anchor-Centric Adaptation escapes the diversity trap by prioritizing repeated demonstrations at core anchors over broad coverage, yielding higher success rates under fixed data budgets in robotic manipulation.
-
When to Trust Imagination: Adaptive Action Execution for World Action Models
A verifier called Future Forward Dynamics Causal Attention enables adaptive action execution in World Action Models, reducing model inferences by 69% and improving success rates in robotic tasks.
-
When to Trust Imagination: Adaptive Action Execution for World Action Models
Future Forward Dynamics Causal Attention (FFDC) enables World Action Models to adaptively choose action chunk lengths based on prediction-observation consistency, cutting model inferences by 69% and improving real-wor...
-
ConsisVLA-4D: Advancing Spatiotemporal Consistency in Efficient 3D-Perception and 4D-Reasoning for Robotic Manipulation
ConsisVLA-4D adds cross-view semantic alignment, cross-object geometric fusion, and cross-scene dynamic reasoning to VLA models, delivering 21.6% and 41.5% gains plus 2.3x and 2.4x speedups on LIBERO and real-world tasks.
-
MolmoAct2: Action Reasoning Models for Real-world Deployment
MolmoAct2 is an open VLA model that outperforms baselines like Pi-05 on 7 benchmarks and whose backbone surpasses GPT-5 on 13 embodied-reasoning tasks through new datasets, specialized training, and architecture chang...
-
VLA-ATTC: Adaptive Test-Time Compute for VLA Models with Relative Action Critic Model
VLA-ATTC equips VLA models with adaptive test-time compute via an uncertainty clutch and relative action critic, cutting failure rates by over 50% on LIBERO-LONG.
-
MSACT: Multistage Spatial Alignment for Stable Low-Latency Fine Manipulation
MSACT improves localization stability and task success rates in limited-data bimanual manipulation by extracting stable 2D attention points and aligning predicted attention sequences across frames without keypoint labels.
-
Stereo Multistage Spatial Attention for Real-Time Mobile Manipulation Under Visual Scale Variation and Disturbances
A stereo multistage spatial attention deep predictive learning system improves robustness and success rates for real-time mobile manipulation under visual scale variation and disturbances.
-
$M^2$-VLA: Boosting Vision-Language Models for Generalizable Manipulation via Layer Mixture and Meta-Skills
M²-VLA shows that generalized VLMs can serve as direct backbones for robotic manipulation by selectively extracting task-critical features via Mixture of Layers and adding Meta Skill Modules for efficient trajectory learning.
-
AsyncShield: A Plug-and-Play Edge Adapter for Asynchronous Cloud-based VLA Navigation
AsyncShield restores VLA geometric intent from latency via kinematic pose mapping and uses PPO-Lagrangian to balance tracking with LiDAR safety constraints in a plug-and-play module.
-
LeHome: A Simulation Environment for Deformable Object Manipulation in Household Scenarios
LeHome is a simulation platform offering high-fidelity dynamics for robotic manipulation of varied deformable objects in household settings, with support for multiple robot embodiments including low-cost hardware.
-
CorridorVLA: Explicit Spatial Constraints for Generative Action Heads via Sparse Anchors
CorridorVLA improves VLA models by using predicted sparse anchors to impose explicit spatial corridors on action trajectories, yielding 3.4-12.4% success rate gains on LIBERO-Plus with GR00T-Corr reaching 83.21%.
-
LongBench: Evaluating Robotic Manipulation Policies on Real-World Long-Horizon Tasks
LongBench is a new real-world benchmark that separates execution robustness from context-dependent reasoning in long-horizon robotic manipulation and shows these are distinct challenges not uniformly solved by memory-...
-
From Seeing to Simulating: Generative High-Fidelity Simulation with Digital Cousins for Generalizable Robot Learning and Evaluation
Digital Cousins is a generative real-to-sim method that creates diverse high-fidelity simulation scenes from real panoramas to improve generalization in robot learning and evaluation.
-
Long-Term Memory for VLA-based Agents in Open-World Task Execution
ChemBot adds dual-layer memory and future-state asynchronous inference to VLA models, enabling better long-horizon success in chemical lab automation on collaborative robots.
-
Grounded World Model for Semantically Generalizable Planning
A vision-language-aligned world model turns visuomotor MPC into a language-following planner that reaches 87% success on 288 unseen semantic tasks where standard VLAs drop to 22%.
-
Device-Conditioned Neural Architecture Search for Efficient Robotic Manipulation
DC-QFA trains one supernet over architectures and bit-widths, then runs a fast per-device search plus multi-step distillation to deliver 2-3x faster robotic policies across hardware with negligible success-rate drop.
-
VAG: Dual-Stream Video-Action Generation for Embodied Data Synthesis
VAG is a synchronized dual-stream flow-matching framework that generates aligned video-action pairs for synthetic embodied data synthesis and policy pretraining.
-
A1: A Fully Transparent Open-Source, Adaptive and Efficient Truncated Vision-Language-Action Model
A1 is a transparent VLA framework achieving state-of-the-art robot manipulation success with up to 72% lower latency via adaptive layer truncation and inter-layer flow matching.
-
SnapFlow: One-Step Action Generation for Flow-Matching VLAs via Progressive Self-Distillation
SnapFlow compresses multi-step denoising in flow-matching VLAs into one step via progressive self-distillation using two-step Euler shortcuts from marginal velocities, matching 10-step teacher success rates with 9.6x ...
-
Grounding Hierarchical Vision-Language-Action Models Through Explicit Language-Action Alignment
A contrastive alignment model plus offline preference learning explicitly grounds hierarchical VLA language descriptions to actions and visuals on LanguageTable, achieving performance comparable to fully supervised fi...
-
E-VLA: Event-Augmented Vision-Language-Action Model for Dark and Blurred Scenes
E-VLA integrates event streams directly into VLA models via lightweight fusion, raising Pick-Place success from 0% to 60-90% at 20 lux and from 0% to 20-25% under severe motion blur.
-
Adaptive Action Chunking at Inference-time for Vision-Language-Action Models
Adaptive Action Chunking uses action entropy to dynamically adjust chunk sizes in VLA models, improving performance on simulated and real robotic manipulation tasks.
-
FASTER: Rethinking Real-Time Flow VLAs
FASTER uses a horizon-aware flow sampling schedule to compress immediate-action denoising to one step, slashing effective reaction latency in real-robot VLA deployments.
-
Fast-WAM: Do World Action Models Need Test-time Future Imagination?
Fast-WAM shows that explicit future imagination at test time is not required for strong WAM performance; video modeling during training provides the main benefit.
-
InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy
InternVLA-M1 uses spatially guided pre-training on 2.3M examples followed by action post-training to deliver up to 17% gains on robot manipulation benchmarks and 20.6% on unseen objects.
-
AttenA+: Rectifying Action Inequality in Robotic Foundation Models
AttenA+ applies velocity-driven action attention to reweight training objectives toward kinematically critical low-velocity segments, yielding small benchmark gains on Libero and RoboTwin without added parameters.
-
Nautilus: From One Prompt to Plug-and-Play Robot Learning
NAUTILUS is a prompt-driven harness that automates plug-and-play adapters, typed contracts, and validation for policies, benchmarks, and robots in learning research.
-
BioProVLA-Agent: An Affordable, Protocol-Driven, Vision-Enhanced VLA-Enabled Embodied Multi-Agent System with Closed-Loop-Capable Reasoning for Biological Laboratory Manipulation
BioProVLA-Agent integrates protocol parsing, visual state verification, and VLA-based execution in a closed-loop multi-agent framework with AugSmolVLA augmentation to improve robustness for biological lab tasks like t...
-
Understanding Asynchronous Inference Methods for Vision-Language-Action Models
Controlled benchmarks show per-step residual correction (A2C2) as most effective for VLA asynchronous inference up to d=8 delays on Kinetix with over 90% solve rate, outperforming inpainting and conditioning while tra...
-
PokeVLA: Empowering Pocket-Sized Vision-Language-Action Model with Comprehensive World Knowledge Guidance
PokeVLA is a lightweight VLA model pre-trained on 2.4M samples for spatial grounding and reasoning, then adapted via multi-view semantics and geometry alignment to achieve state-of-the-art robot manipulation performance.
-
VLA Foundry: A Unified Framework for Training Vision-Language-Action Models
VLA Foundry provides a single training stack for VLA models and releases open models that match prior closed-source performance or outperform baselines on multi-task manipulation in simulation.
-
Causal World Modeling for Robot Control
LingBot-VA combines video world modeling with policy learning via Mixture-of-Transformers, closed-loop rollouts, and asynchronous inference to improve robot manipulation in simulation and real settings.
-
SEVO: Semantic-Enhanced Virtual Observation for Robust VLA Manipulation via Active Illumination and Data-Centric Collection
SEVO raises ACT and SmolVLA pick-and-place success from 30-35% to 75-85% in novel environments by using active illumination, semantic cues, and diversified teleoperation data.