SugarCrepe: Fixing hackable benchmarks for vision-language compositionality. Advances in Neural Information Processing Systems, 36:31096–31116,
3 Pith papers cite this work. Polarity classification is still indexing.
Verdicts: UNVERDICTED (3)
citing papers explorer
- Minerva-Ego: Spatiotemporal Hints for Egocentric Video Understanding
  Minerva-Ego is a new benchmark for egocentric visual reasoning with dense human-annotated traces and masks, showing that spatiotemporal hints substantially improve frontier model performance.
- Multimodal Reinforcement Learning with Adaptive Verifier for AI Agents
  Argos is an agentic verifier that adaptively picks scoring functions to evaluate accuracy, localization, and reasoning quality, enabling stronger multimodal RL training for AI agents.
- Long Story Short: Disentangling Compositionality and Long-Caption Understanding in Contrastive VLMs
  Empirical study shows bidirectional but sensitive relationship between compositionality and long-caption understanding in VLMs, promoted by high-quality grounded data and affected by architectural choices like frozen positional embeddings.