pith. sign in

hub Mixed citations

F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions

Mixed citation behavior. Most common role is background (60%).

22 Pith papers citing it
Background 60% of classified citations
abstract

Executing language-conditioned tasks in dynamic visual environments remains a central challenge in embodied AI. Existing Vision-Language-Action (VLA) models predominantly adopt reactive state-to-action mappings, often leading to short-sighted behaviors and poor robustness in dynamic scenes. In this paper, we introduce F1, a pretrained VLA framework which integrates the visual foresight generation into decision-making pipeline. F1 adopts a Mixture-of-Transformer architecture with dedicated modules for perception, foresight generation, and control, thereby bridging understanding, generation, and actions. At its core, F1 employs a next-scale prediction mechanism to synthesize goal-conditioned visual foresight as explicit planning targets. By forecasting plausible future visual states, F1 reformulates action generation as a foresight-guided inverse dynamics problem, enabling actions that implicitly achieve visual goals. To endow F1 with robust and generalizable capabilities, we propose a three-stage training recipe on an extensive dataset comprising over 330k trajectories across 136 diverse tasks. This training scheme enhances modular reasoning and equips the model with transferable visual foresight, which is critical for complex and dynamic environments. Extensive evaluations on real-world tasks and simulation benchmarks demonstrate F1 consistently outperforms existing approaches, achieving substantial gains in both task success rate and generalization ability.

hub tools

citation-role summary

background 6 baseline 3 method 1

citation-polarity summary

years

2026 19 2025 3

representative citing papers

UAM: A Dual-Stream Perspective on Forgetting in VLA Training

cs.CV · 2026-05-15 · unverdicted · novelty 6.0

UAM adds a Dorsal Expert initialized from a generative model and trained on visual dynamics prediction to preserve over 95% of VLM multimodal ability in VLA training while achieving top success rates on manipulation tasks including OOD cases.

VLANeXt: Recipes for Building Strong VLA Models

cs.CV · 2026-02-20 · conditional · novelty 6.0

VLANeXt distills 12 design insights from a unified VLA study into a model that outperforms prior methods on LIBERO benchmarks while releasing code for further exploration.

PhysBrain 1.0 Technical Report

cs.RO · 2026-05-14 · unverdicted · novelty 5.0

PhysBrain 1.0 extracts scene elements, spatial dynamics, actions and depth relations from human egocentric video to create QA supervision for VLMs, then transfers the resulting physical priors to VLA policies via capability-preserving adaptation.

Cortex 2.0: Grounding World Models in Real-World Industrial Deployment

cs.RO · 2026-04-22 · unverdicted · novelty 5.0

Cortex 2.0 introduces world-model-based planning that generates and scores future trajectories to outperform reactive vision-language-action baselines on industrial robotic tasks including pick-and-place, sorting, and unpacking.

Motus: A Unified Latent Action World Model

cs.CV · 2025-12-15 · unverdicted · novelty 5.0

Motus unifies understanding, video generation, and action in one latent world model via MoT experts and optical-flow latent actions, reporting gains over prior methods in simulation and real robots.

World Action Models: The Next Frontier in Embodied AI

cs.RO · 2026-05-12 · unverdicted · novelty 4.0

The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.

World Model for Robot Learning: A Comprehensive Survey

cs.RO · 2026-04-30 · unverdicted · novelty 3.0

A comprehensive survey that organizes the literature on world models in robot learning, their roles in policy learning, planning, simulation, and video-based generation, with connections to navigation, driving, datasets, and benchmarks.

citing papers explorer

Showing 22 of 22 citing papers.