MGPO elicits grounding in LMMs via multi-turn RL with binary rewards, yielding 5.4% and 5.2% gains on MME-Realworld and V* Bench and surpassing GPT-4o on the latter after training on 21K samples.
EgoLife: Towards egocentric life assistant
6 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
representative citing papers
Introduces Personal VCL formalization and benchmark revealing LMM context gaps, plus an Agentic Context Bank baseline that boosts personalized visual reasoning.
RoSHI is a hybrid wearable that combines sparse IMUs and egocentric SLAM to capture accurate full-body 3D pose and shape data in natural environments for robot learning.
The paper surveys human memory categories, maps them to LLM memory, and proposes a new three-dimension (object, form, time) categorization into eight quadrants to organize existing work and highlight open problems.
An edge-deployed multimodal LLM pipeline for online episodic memory QA reaches 51.76% accuracy on an 8 GB GPU and 54.40% on a local server, within 4-5 points of a 56% cloud baseline.
A survey that deconstructs LLM agent systems via a methodology-centered taxonomy linking design principles to emergent behaviors, applications, and challenges.
citing papers explorer
-
High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning
MGPO elicits grounding in LMMs via multi-turn RL with binary rewards, yielding 5.4% and 5.2% gains on MME-Realworld and V* Bench and surpassing GPT-4o on the latter after training on 21K samples.
-
Personal Visual Context Learning in Large Multimodal Models
Introduces Personal VCL formalization and benchmark revealing LMM context gaps, plus an Agentic Context Bank baseline that boosts personalized visual reasoning.
-
RoSHI: A Versatile Robot-oriented Suit for Human Data In-the-Wild
RoSHI is a hybrid wearable that combines sparse IMUs and egocentric SLAM to capture accurate full-body 3D pose and shape data in natural environments for robot learning.
-
From Human Memory to AI Memory: A Survey on Memory Mechanisms in the Era of LLMs
The paper surveys human memory categories, maps them to LLM memory, and proposes a new three-dimension (object, form, time) categorization into eight quadrants to organize existing work and highlight open problems.
-
Exploring Multimodal LMMs for Online Episodic Memory Question Answering on the Edge
An edge-deployed multimodal LLM pipeline for online episodic memory QA reaches 51.76% accuracy on an 8 GB GPU and 54.40% on a local server, within 4-5 points of a 56% cloud baseline.
-
Large Language Model Agent: A Survey on Methodology, Applications and Challenges
A survey that deconstructs LLM agent systems via a methodology-centered taxonomy linking design principles to emergent behaviors, applications, and challenges.