Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology (UIST)
3 Pith papers cite this work. Polarity classification is still indexing.
years
2026: 3
verdicts
UNVERDICTED: 3
citing papers explorer
- EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding
  EgoMemReason is a new benchmark showing that even the best multimodal models achieve only 39.6% accuracy on reasoning tasks that require integrating sparse evidence across days in egocentric video.
- CoCoDA: Co-evolving Compositional DAG for Tool-Augmented Agents
  CoCoDA co-evolves a typed compositional DAG of primitive and composite tools with the agent planner, using signature-based retrieval and a size-based reward to scale libraries efficiently and let an 8B model match or beat a 32B model on math and code benchmarks.
- SCENE: Recognizing Social Norms and Sanctioning in Group Chats
  SCENE is a new benchmark for testing LLMs on recognizing implicit social norms and adapting to sanctions in multi-party group chats.