VideoLLaMA [36] is an instruction-tuned multimodal model that integrates visual and auditory information using a vision-language and audio-language branch

is designed for fine-grained temporal understanding, employing boundary-aware training to improve event boundary detection in videos · 2025

1 Pith paper cites this work. Polarity classification is still indexing.

representative citing papers

cs.CV · 2025-08-07 · unverdicted · novelty 6.0

B4DL provides a new benchmark, scalable data generation pipeline, and MLLM architecture for direct spatio-temporal reasoning on raw 4D LiDAR data.

B4DL: A Benchmark for 4D LiDAR LLM in Spatio-Temporal Understanding cs.CV · 2025-08-07 · unverdicted · none · ref 46