EvoDriveVLA: Evolving Driving VLA Models via Collaborative Perception-Planning Distillation
2 Pith papers cite this work. Polarity classification is still indexing.
abstract
Vision-Language-Action models have shown great promise for autonomous driving, yet they suffer from degraded perception after unfreezing the visual encoder and struggle with accumulated instability in long-term planning. To address these challenges, we propose EvoDriveVLA, a novel collaborative perception-planning distillation framework that integrates self-anchored perceptual constraints and future-informed trajectory optimization. Specifically, self-anchored visual distillation leverages a self-anchor teacher to deliver visual anchoring constraints, regularizing student representations via trajectory-guided key-region awareness. In parallel, future-informed trajectory distillation employs a future-aware oracle teacher with coarse-to-fine trajectory refinement and Monte Carlo dropout sampling to synthesize reasoning trajectories that model future evolutions, enabling the student model to internalize the future-aware insights of the teacher. EvoDriveVLA achieves SOTA performance in nuScenes open-loop evaluation and significantly enhances performance in NAVSIM closed-loop evaluation. Our code is available at: https://github.com/hey-cjj/EvoDriveVLA.
years
2026: 2
verdicts
UNVERDICTED: 2
representative citing papers
SparseStreet applies node-based learnable pruning followed by static background compression to 3D Gaussian Splatting, reporting up to 80% reduction in primitives with minimal quality loss on Waymo and nuScenes street scene data.
citing papers explorer
- GraspFoM: Towards Reconstruction-Driven Robotic Grasping with 3D Foundation Priors
GraspFoM creates a shared 3D latent from SAM3D priors, adds an anchor-initialized diffuser for multimodal grasps, and uses reconstruction-aware scoring plus residual updates to jointly achieve SOTA reconstruction and grasping with few extra parameters.