By adding future visual state prediction and a dedicated inverse kinematics diffusion network that uses only visual boundary conditions, a 0.5B driving VLA recovers visual grounding and matches 7-8B models on NAVSIM-v2 and nuScenes.
Title resolution pending
4 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 4roles
method 1polarities
use method 1representative citing papers
GibbsTTS combines a training-free kinetic-optimal scheduler with finite-step moment correction in MI-DFM to deliver top naturalness and strong speaker similarity in zero-shot TTS.
Sparse Kernels turn kernel ridge regression into end-to-end differentiable PyTorch layers that support training-free transfer, nonlinear probing, and hybrid models while matching or augmenting neural readouts in some settings.
DualStreamHybrid assigns ViT-Tiny to RGB and MobileNetV2 to 20-channel flow, projects features to common space, and finds cross-attention best on UCF11 (98.12%) while weighted fusion is most consistent on UCF50 (96.86%).
citing papers explorer
-
Grounding Driving VLA via Inverse Kinematics
By adding future visual state prediction and a dedicated inverse kinematics diffusion network that uses only visual boundary conditions, a 0.5B driving VLA recovers visual grounding and matches 7-8B models on NAVSIM-v2 and nuScenes.
-
Kinetic-Optimal Scheduling with Moment Correction for Metric-Induced Discrete Flow Matching in Zero-Shot Text-to-Speech
GibbsTTS combines a training-free kinetic-optimal scheduler with finite-step moment correction in MI-DFM to deliver top naturalness and strong speaker similarity in zero-shot TTS.
-
Differentiable Kernel Ridge Regression for Deep Learning Pipelines
Sparse Kernels turn kernel ridge regression into end-to-end differentiable PyTorch layers that support training-free transfer, nonlinear probing, and hybrid models while matching or augmenting neural readouts in some settings.
-
A Heterogeneous Two-Stream Framework for Video Action Recognition with Comparative Fusion Analysis
DualStreamHybrid assigns ViT-Tiny to RGB and MobileNetV2 to 20-channel flow, projects features to common space, and finds cross-attention best on UCF11 (98.12%) while weighted fusion is most consistent on UCF50 (96.86%).