VLA-World improves autonomous driving by using action-guided future image generation followed by reflective reasoning over the imagined scene to refine trajectories.
Reasoning in computer vision: Taxonomy, models, tasks, and methodologies.arXiv preprint arXiv:2508.10523, 2025
8 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
method 1polarities
use method 1representative citing papers
Introduces MTRS task, MTRefSeg-21K benchmark of 21K image-text-mask triplets, and MTRefSeg-R1 LVLM baseline that outperforms standard models via two-stage change-aware training.
FSDrive uses a generated future scene frame as visual spatio-temporal CoT to improve VLA models for safer autonomous driving trajectory prediction.
AtmoFuseNet fuses multi-view sky cameras, millimeter-wave radar, and ceilometer data via hierarchical cross-attention, variational refinement, and motion estimation to produce 4D cloud microphysical fields and wind with reported MAEs of 0.026 g m^{-3} LWC and 1.18 m s^{-1} wind speed.
Gradient boosting with conformal prediction and mutual-information stability selection yields NAFLD risk predictions with 91.3% empirical coverage at 90% nominal level and AUROC 0.91 on multicenter Chinese data.
Audio language models are benchmarked on five semantic and paralinguistic reasoning tasks to reveal limitations in handling spoken audio evidence, accent variation, and domain shifts.
VOTE-RAG applies retrieval voting across diverse queries and response voting across independent generations to mitigate hallucination-on-hallucination in RAG, matching or exceeding complex baselines on six benchmarks with a parallelizable design.
citing papers explorer
-
FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving
FSDrive uses a generated future scene frame as visual spatio-temporal CoT to improve VLA models for safer autonomous driving trajectory prediction.