Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
4 Pith papers cite this work.
Citing papers by year: 2026 (4)
Citing papers
- Response-G1: Explicit Scene Graph Modeling for Proactive Streaming Video Understanding
  Response-G1 uses query-guided scene graphs, memory retrieval, and augmented prompting to improve when Video-LLMs decide to respond during streaming videos.
- AFMRL: Attribute-Enhanced Fine-Grained Multi-Modal Representation Learning in E-commerce
  AFMRL uses MLLM-generated attributes in attribute-guided contrastive learning and retrieval-aware reinforcement to achieve SOTA fine-grained multimodal retrieval on e-commerce datasets.
- Think before Go: Hierarchical Reasoning for Image-goal Navigation
  HRNav decomposes image-goal navigation into VLM-based short-horizon planning and RL-based execution with a wandering suppression penalty to improve performance in complex unseen settings.
- Dual-Anchoring: Addressing State Drift in Vision-Language Navigation