hub Canonical reference

arXiv preprint arXiv:2312.14150 , year=

Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Ping Luo, Andreas Geiger, Hongyang Li · 2023 · arXiv 2312.14150

Canonical reference. 100% of citing Pith papers cite this work as background.

16 Pith papers citing it

Background 100% of classified citations

read on arXiv browse 16 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 5

citation-polarity summary

background 5

representative citing papers

CCTVBench: Contrastive Consistency Traffic VideoQA Benchmark for Multimodal LLMs

cs.CV · 2026-04-22 · unverdicted · novelty 8.0

CCTVBench exposes a large gap between standard QA accuracy and contrastive consistency in traffic video reasoning for multimodal LLMs and introduces C-TCD to narrow that gap.

MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?

cs.CV · 2024-08-23 · conditional · novelty 8.0

MME-RealWorld is the largest manually annotated high-resolution benchmark for MLLMs, where even the best models achieve less than 60% accuracy on challenging real-world tasks.

TRIP-Evaluate: An Open Multimodal Benchmark for Evaluating Large Models in Transportation

cs.CV · 2026-04-29 · accept · novelty 7.0

TRIP-Evaluate is a new open multimodal benchmark with 837 text, image, and point-cloud items organized by a role-task-knowledge taxonomy to evaluate large models on transportation workflows.

AlphaDrive: Unleashing the Power of VLMs in Autonomous Driving via Reinforcement Learning and Reasoning

cs.CV · 2025-03-10 · unverdicted · novelty 7.0

AlphaDrive uses GRPO-based RL rewards and two-stage SFT+RL training on VLMs to improve autonomous driving planning performance and efficiency while producing emergent multimodal capabilities.

Visual Adversarial Attack on Vision-Language Models for Autonomous Driving

cs.CV · 2024-11-27 · unverdicted · novelty 7.0

ADvLM is the first visual adversarial attack framework for VLMs in autonomous driving, using semantic-invariant induction via LLM-generated prompt libraries and scenario-associated attention-based enhancement to achieve SOTA attack effectiveness across benchmarks and real-world tests.

ChainFlow-VLA: Causal Flow Planning with Vision-Language Models

cs.CV · 2026-05-22 · unverdicted · novelty 6.0

ChainFlow-VLA unifies autoregressive causal trajectory modes with VLM-conditioned diffusion refinement to reach 94.85 on NAVSIM v1, matching human performance.

EgoDyn-Bench: Evaluating Ego-Motion Understanding in Vision-Centric Foundation Models for Autonomous Driving

cs.CV · 2026-04-22 · unverdicted · novelty 6.0

EgoDyn-Bench reveals a perception bottleneck in vision-centric foundation models: ego-motion logic derives from language while visual input adds negligible signal, with explicit trajectories restoring consistency.

OneDrive: Unified Multi-Paradigm Driving with Vision-Language-Action Models

cs.CV · 2026-04-20 · unverdicted · novelty 6.0

OneDrive unifies heterogeneous decoding in a single VLM transformer decoder for end-to-end driving, achieving 0.28 L2 error and 0.18 collision rate on nuScenes plus 86.8 PDMS on NAVSIM.

Beyond the Beep: Scalable Collision Anticipation and Real-Time Explainability with BADAS-2.0

cs.CV · 2026-04-07 · unverdicted · novelty 6.0

BADAS-2.0 scales collision anticipation with a 178k-video long-tail benchmark built via active oracle selection, 7-12x faster distilled edge models, and object-centric attention heatmaps plus VLM-based textual reasoning.

OmniDrive-R1: Reinforcement-driven Interleaved Multi-modal Chain-of-Thought for Trustworthy Vision-Language Autonomous Driving

cs.CV · 2025-12-16 · unverdicted · novelty 6.0

OmniDrive-R1 boosts VLM reasoning score from 51.77% to 80.35% and answer accuracy from 37.81% to 73.62% on DriveLMM-o1 via reinforcement-driven interleaved multi-modal chain-of-thought with annotation-free grounding.

B4DL: A Benchmark for 4D LiDAR LLM in Spatio-Temporal Understanding

cs.CV · 2025-08-07 · unverdicted · novelty 6.0

B4DL provides a new benchmark, scalable data generation pipeline, and MLLM architecture for direct spatio-temporal reasoning on raw 4D LiDAR data.

DriveMoE: Mixture-of-Experts for Vision-Language-Action Model in End-to-End Autonomous Driving

cs.CV · 2025-05-22 · unverdicted · novelty 6.0

DriveMoE applies scene-specialized Vision MoE and skill-specialized Action MoE to a VLA baseline to achieve SOTA closed-loop performance on Bench2Drive.

Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving

cs.CV · 2024-10-29 · conditional · novelty 6.0

Senna decouples language-based high-level planning from an LVLM with low-level trajectory prediction from an E2E model, reporting 27% lower planning error and 33% lower collisions after pre-training on DriveX and fine-tuning on nuScenes.

DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models

cs.CV · 2024-02-19 · unverdicted · novelty 6.0

DriveVLM adds vision-language models with scene description, analysis, and hierarchical planning modules to autonomous driving, paired with a hybrid DriveVLM-Dual system tested on nuScenes and SUP-AD datasets and deployed on a production vehicle.

EponaV2: Driving World Model with Comprehensive Future Reasoning

cs.CV · 2026-05-14 · unverdicted · novelty 5.0

EponaV2 advances perception-free driving world models by forecasting comprehensive future 3D geometry and semantic representations, achieving SOTA planning performance on NAVSIM benchmarks.

Structured Labeling Enables Faster Vision-Language Models for End-to-End Autonomous Driving

cs.CV · 2025-06-05 · unverdicted · novelty 4.0

Introduces structured NuScenes-S dataset and 0.9B FastDrive VLM claiming 20% higher decision accuracy and over 10x inference speedup versus larger unstructured VLMs.

citing papers explorer

Showing 16 of 16 citing papers.

CCTVBench: Contrastive Consistency Traffic VideoQA Benchmark for Multimodal LLMs cs.CV · 2026-04-22 · unverdicted · none · ref 20
CCTVBench exposes a large gap between standard QA accuracy and contrastive consistency in traffic video reasoning for multimodal LLMs and introduces C-TCD to narrow that gap.
MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans? cs.CV · 2024-08-23 · conditional · none · ref 55
MME-RealWorld is the largest manually annotated high-resolution benchmark for MLLMs, where even the best models achieve less than 60% accuracy on challenging real-world tasks.
TRIP-Evaluate: An Open Multimodal Benchmark for Evaluating Large Models in Transportation cs.CV · 2026-04-29 · accept · none · ref 25
TRIP-Evaluate is a new open multimodal benchmark with 837 text, image, and point-cloud items organized by a role-task-knowledge taxonomy to evaluate large models on transportation workflows.
AlphaDrive: Unleashing the Power of VLMs in Autonomous Driving via Reinforcement Learning and Reasoning cs.CV · 2025-03-10 · unverdicted · none · ref 34
AlphaDrive uses GRPO-based RL rewards and two-stage SFT+RL training on VLMs to improve autonomous driving planning performance and efficiency while producing emergent multimodal capabilities.
Visual Adversarial Attack on Vision-Language Models for Autonomous Driving cs.CV · 2024-11-27 · unverdicted · none · ref 46
ADvLM is the first visual adversarial attack framework for VLMs in autonomous driving, using semantic-invariant induction via LLM-generated prompt libraries and scenario-associated attention-based enhancement to achieve SOTA attack effectiveness across benchmarks and real-world tests.
ChainFlow-VLA: Causal Flow Planning with Vision-Language Models cs.CV · 2026-05-22 · unverdicted · none · ref 15
ChainFlow-VLA unifies autoregressive causal trajectory modes with VLM-conditioned diffusion refinement to reach 94.85 on NAVSIM v1, matching human performance.
EgoDyn-Bench: Evaluating Ego-Motion Understanding in Vision-Centric Foundation Models for Autonomous Driving cs.CV · 2026-04-22 · unverdicted · none · ref 30
EgoDyn-Bench reveals a perception bottleneck in vision-centric foundation models: ego-motion logic derives from language while visual input adds negligible signal, with explicit trajectories restoring consistency.
OneDrive: Unified Multi-Paradigm Driving with Vision-Language-Action Models cs.CV · 2026-04-20 · unverdicted · none · ref 38
OneDrive unifies heterogeneous decoding in a single VLM transformer decoder for end-to-end driving, achieving 0.28 L2 error and 0.18 collision rate on nuScenes plus 86.8 PDMS on NAVSIM.
Beyond the Beep: Scalable Collision Anticipation and Real-Time Explainability with BADAS-2.0 cs.CV · 2026-04-07 · unverdicted · none · ref 15
BADAS-2.0 scales collision anticipation with a 178k-video long-tail benchmark built via active oracle selection, 7-12x faster distilled edge models, and object-centric attention heatmaps plus VLM-based textual reasoning.
OmniDrive-R1: Reinforcement-driven Interleaved Multi-modal Chain-of-Thought for Trustworthy Vision-Language Autonomous Driving cs.CV · 2025-12-16 · unverdicted · none · ref 34
OmniDrive-R1 boosts VLM reasoning score from 51.77% to 80.35% and answer accuracy from 37.81% to 73.62% on DriveLMM-o1 via reinforcement-driven interleaved multi-modal chain-of-thought with annotation-free grounding.
B4DL: A Benchmark for 4D LiDAR LLM in Spatio-Temporal Understanding cs.CV · 2025-08-07 · unverdicted · none · ref 27
B4DL provides a new benchmark, scalable data generation pipeline, and MLLM architecture for direct spatio-temporal reasoning on raw 4D LiDAR data.
DriveMoE: Mixture-of-Experts for Vision-Language-Action Model in End-to-End Autonomous Driving cs.CV · 2025-05-22 · unverdicted · none · ref 14
DriveMoE applies scene-specialized Vision MoE and skill-specialized Action MoE to a VLA baseline to achieve SOTA closed-loop performance on Bench2Drive.
Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving cs.CV · 2024-10-29 · conditional · none · ref 31
Senna decouples language-based high-level planning from an LVLM with low-level trajectory prediction from an E2E model, reporting 27% lower planning error and 33% lower collisions after pre-training on DriveX and fine-tuning on nuScenes.
DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models cs.CV · 2024-02-19 · unverdicted · none · ref 60
DriveVLM adds vision-language models with scene description, analysis, and hierarchical planning modules to autonomous driving, paired with a hybrid DriveVLM-Dual system tested on nuScenes and SUP-AD datasets and deployed on a production vehicle.
EponaV2: Driving World Model with Comprehensive Future Reasoning cs.CV · 2026-05-14 · unverdicted · none · ref 60
EponaV2 advances perception-free driving world models by forecasting comprehensive future 3D geometry and semantic representations, achieving SOTA planning performance on NAVSIM benchmarks.
Structured Labeling Enables Faster Vision-Language Models for End-to-End Autonomous Driving cs.CV · 2025-06-05 · unverdicted · none · ref 5
Introduces structured NuScenes-S dataset and 0.9B FastDrive VLM claiming 20% higher decision accuracy and over 10x inference speedup versus larger unstructured VLMs.

arXiv preprint arXiv:2312.14150 , year=

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer