EgoPro-Bench: Benchmarking Personalized Proactive Interaction in Egocentric Video Streams

Chenxu Guo; Dongchuan Ran; Hewei Guo; Kaibing Wang; Lewei Lu; Linyu Ou; Wenwen Tong; Xueheng Li

arxiv: 2605.07299 · v1 · submitted 2026-05-08 · 💻 cs.CV · cs.AI

Dongchuan Ran, Linyu Ou, Xueheng Li, Wenwen Tong, Chenxu Guo, Hewei Guo, Kaibing Wang, Lewei Lu

Pith reviewed 2026-05-11 00:52 UTC · model grok-4.3

keywords egocentric video · proactive interaction · multimodal large language models · benchmark · human-machine interaction · intention understanding · streaming video · personalized agents

The pith

EgoPro-Bench uses simulated user profiles to generate personalized intentions and precise interaction timings from streaming egocentric videos across twelve domains.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces EgoPro-Bench to move multimodal large language models from reactive responses to proactive assistance based on continuous first-person video. It builds a dataset of 2,400 evaluation videos and over 12,000 training videos by using simulated profiles to create diverse user intentions and high-fidelity human-machine interaction points. A new evaluation protocol measures how well models recognize when to act and what to do. The authors also test a limited-token reasoning approach called short thinking to keep interactions low-latency on live streams. If the benchmark works as intended, models can learn to offer timely help without waiting for explicit user commands.

Core claim

EgoPro-Bench comprises 2,400 videos in the evaluation set and over 12,000 in the training set, constructed via simulated user profiles to produce diverse intentions and high-fidelity HMI data across 12 domains. It supplies a specialized evaluation protocol and metrics, trains models for efficient reasoning on streaming video, and introduces the interaction principle of allocating a limited token budget before intent recognition. Experiments show that training on this data improves intention understanding in MLLMs and enables accurate identification of appropriate HMI timings.
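
The token-budget idea in the core claim can be pictured with a toy sketch. This is not the paper's implementation: the function `short_think`, the `Decision` type, the labels `"speak"` and `"silent"`, and the keyword-based `toy_decide` are all hypothetical stand-ins, and a real system would cap the tokens a model generates rather than slice a prepared list.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Decision:
    thought: List[str]  # reasoning tokens kept within the budget
    action: str         # hypothetical labels: "speak" or "silent"


def short_think(
    reasoning_tokens: Iterable[str],
    budget: int,
    decide: Callable[[List[str]], str],
) -> Decision:
    """Keep at most `budget` reasoning tokens, then commit to an action."""
    kept: List[str] = []
    for token in reasoning_tokens:
        if len(kept) >= budget:
            break  # budget exhausted: stop thinking and decide
        kept.append(token)
    return Decision(thought=kept, action=decide(kept))


def toy_decide(thought: List[str]) -> str:
    # Stand-in for intent recognition: speak only if a hazard was noticed.
    return "speak" if "obstacle" in thought else "silent"


stream = ["user", "walking", "obstacle", "ahead", "on", "left"]
tight = short_think(stream, budget=2, decide=toy_decide)
roomy = short_think(stream, budget=4, decide=toy_decide)
print(tight.action, roomy.action)
```

The sketch only shows where the budget sits in the flow (think briefly, then decide). Whether a small budget helps or hurts interaction quality is the empirical question the paper's experiments address, and this toy does not speak to it.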

What carries the argument

EgoPro-Bench, the benchmark that converts simulated user profiles into diverse personalized intentions and precise HMI timing labels within streaming egocentric videos.

If this is right

Models trained on the benchmark show improved ability to understand user intentions from ongoing video input.
Models gain the capacity to identify suitable moments for initiating human-machine interactions.
The short-thinking principle with constrained token budgets supports efficient low-latency responses on live streams.
The dataset and protocol provide a standardized foundation for developing user-centric proactive agents.
Training on the 12-domain collection enables models suited to real-time streaming scenarios.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The simulation method could reduce reliance on large-scale real-user data collection for testing proactive behaviors in personal assistance or monitoring settings.
The benchmark's structure might integrate with other egocentric video tasks to create combined training pipelines for agents that both perceive and act.
If the generated data generalizes, the evaluation protocol could serve as a reusable test for proactivity in additional multimodal systems.
Extensions could test whether the same limited-token principle transfers to non-video modalities such as audio or sensor streams.

Load-bearing premise

Simulated user profiles produce intentions and interaction timings that accurately reflect real personalized human contexts across the twelve domains.

What would settle it

A side-by-side comparison of the benchmark's generated HMI timings and intentions against timings and intentions collected from real users performing the same tasks in the same domains.
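
One way such a comparison might be summarized, assuming each clip has a single generated HMI time and a single time collected from a real user: report the mean absolute offset and the share of clips that agree within a tolerance. The function name `timing_agreement`, the clip identifiers, and the one-second default tolerance are illustrative assumptions, not details from the paper.

```python
from statistics import mean
from typing import Dict


def timing_agreement(
    generated: Dict[str, float],
    observed: Dict[str, float],
    tolerance_s: float = 1.0,
) -> Dict[str, float]:
    """Compare per-clip interaction times (seconds) on clips present in both sets."""
    shared = sorted(set(generated) & set(observed))
    if not shared:
        raise ValueError("no clips in common")
    offsets = [abs(generated[clip] - observed[clip]) for clip in shared]
    return {
        "clips": len(shared),
        "mean_abs_offset_s": mean(offsets),
        # fraction of shared clips whose two times differ by at most the tolerance
        "within_tolerance": sum(o <= tolerance_s for o in offsets) / len(shared),
    }


# Hypothetical data: clip "c" has no real-user time and "d" has no generated time.
generated_times = {"a": 3.0, "b": 10.0, "c": 7.5}
real_user_times = {"a": 3.5, "b": 13.0, "d": 1.0}
report = timing_agreement(generated_times, real_user_times)
print(report)
```

A summary like this covers timing only; agreement on the intentions themselves would need a separate comparison.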

Figures

Figures reproduced from arXiv: 2605.07299 by Chenxu Guo, Dongchuan Ran, Hewei Guo, Kaibing Wang, Lewei Lu, Linyu Ou, Wenwen Tong, Xueheng Li.

**Figure 1.** Examples and data distribution of EgoPro-Bench. The benchmark consists of two main categories (event-driven and intent-driven) and covers 12 distinct domains for personalized proactive interaction.

**Figure 2.** … proactive interaction necessitates continuous monitoring of streaming inputs and autonomous response timing grounded in visual and user contexts (Deng et al., 2025). The passive nature of existing MLLMs thus hinders their effective deployment in complex real-world settings. Benchmarks are pivotal for tracking the rapid advancements in this domain. Extensive research (Tang et al., 2025; Kumar, 2025) has…

**Figure 3.** Data synthesis pipeline for event-driven and intent-driven proactive interaction. The event-driven branch divides tasks into “object” and “action”, focusing on visual and temporal precision. The intent-driven branch synthesizes personalized user intents by injecting persona profiles into diverse domain scenarios. Strict data filtering and quality checks are applied throughout the pipeline.

Abstract

Existing Multimodal Large Language Models (MLLMs) remain primarily reactive, failing to continuously perceive environments or proactively assist users. While emerging benchmarks address proactivity, they are largely confined to alert scenarios, neglect personalized context, and fail to evaluate the precise timing of human-machine interactions (HMI). In this paper, we introduce EgoPro-Bench, a novel benchmark for training and evaluating proactive interaction capabilities based on streaming egocentric videos; it comprises 2,400 videos in the evaluation set and over 12,000 videos in the training set. Unlike previous works, EgoPro-Bench leverages simulated user profiles to generate diverse user intentions and to construct high-fidelity HMI data across 12 distinct domains. Subsequently, we propose a specialized evaluation protocol and metrics, train proactive interaction models designed for efficient reasoning and low-latency interaction on streaming video data, and conduct comprehensive evaluations. Furthermore, we introduce an interaction principle termed "short thinking, better interaction", which allocates a limited token budget prior to intent recognition, thereby enhancing interaction performance. The experiments demonstrate that EgoPro-Bench substantially enhances the intention understanding capabilities of MLLMs and enables accurate identification of appropriate timings for HMI, thereby laying a solid foundation for next-generation user-centric proactive interactive agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

EgoPro-Bench adds a concrete benchmark for proactive personalized MLLM timing in egocentric streams but its results stand or fall on unvalidated simulated user profiles.

EgoPro-Bench is a new benchmark for training and evaluating proactive, personalized interactions in streaming egocentric video. It uses simulated profiles to create intentions and HMI timings across 12 domains, adds a short-thinking principle to limit token use before acting, and reports gains in intention understanding and timing accuracy on 2,400 eval videos plus a larger training set. That combination of streaming first-person input, personalization, and explicit timing metrics is new relative to earlier reactive or alert-only benchmarks.

Referee Report

2 major / 1 minor

Summary. The paper introduces EgoPro-Bench, a benchmark for personalized proactive interaction in streaming egocentric videos comprising 2,400 evaluation videos and over 12,000 training videos across 12 domains. It generates data via simulated user profiles to create diverse intentions and high-fidelity HMI timings, proposes an evaluation protocol with metrics, trains proactive MLLM-based models emphasizing efficient reasoning and low latency, and introduces the 'short thinking, better interaction' principle that limits token budget before intent recognition. Experiments are reported to show that the benchmark substantially improves MLLM intention understanding and enables accurate HMI timing identification.

Significance. If the simulated profiles prove to be high-fidelity representations of real personalized human contexts, the benchmark and associated models could fill a notable gap in proactive AI research by moving beyond reactive or alert-only scenarios to support user-centric, timing-aware agents in egocentric settings. The emphasis on low-latency inference and the proposed interaction principle may offer practical pathways for deploying such systems on resource-constrained devices.

major comments (2)

[Abstract and Benchmark Construction] Benchmark construction (abstract and methods): The central claim that EgoPro-Bench 'substantially enhances' intention understanding and enables 'accurate identification' of HMI timings rests on the assertion that simulated user profiles produce 'high-fidelity' personalized data. No grounding in real egocentric recordings, human-subject validation studies, or inter-rater agreement metrics for intention realism and timing appropriateness is described; without this, measured gains on the benchmark do not necessarily transfer to the claimed 'next-generation user-centric proactive interactive agents.'
[Abstract and Experiments] Experiments and evaluation (abstract): The claim of enhancement is presented without reported baselines, concrete metrics for intention understanding and timing accuracy, error bars, data splits, or details on post-hoc analysis choices. This under-specification prevents verification of the soundness of the reported improvements and is load-bearing for the paper's primary empirical contribution.
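
The inter-rater agreement that the first major comment asks for is often reported as Cohen's kappa. The sketch below is a minimal illustration, assuming two human raters each assign one categorical label per generated intention; the labels `"realistic"` and `"unrealistic"` and the sample ratings are invented for the example and do not come from the paper.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(rater_a: Sequence[str], rater_b: Sequence[str]) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if each rater labeled independently at their own rates.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical ratings of four generated intentions.
ratings_a = ["realistic", "realistic", "unrealistic", "unrealistic"]
ratings_b = ["realistic", "unrealistic", "unrealistic", "unrealistic"]
kappa = cohens_kappa(ratings_a, ratings_b)
print(round(kappa, 3))
```

Kappa near 1 indicates agreement well beyond chance, and values near 0 indicate agreement no better than chance. With more than two raters, a multi-rater statistic would be needed instead.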

minor comments (1)

[Abstract] The abstract would benefit from a concise statement of the exact metrics and baseline models used in the experiments to allow readers to immediately gauge the scale of reported gains.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each major comment point by point below, providing honest responses and indicating the revisions we will make to the manuscript.

Referee: [Abstract and Benchmark Construction] Benchmark construction (abstract and methods): The central claim that EgoPro-Bench 'substantially enhances' intention understanding and enables 'accurate identification' of HMI timings rests on the assertion that simulated user profiles produce 'high-fidelity' personalized data. No grounding in real egocentric recordings, human-subject validation studies, or inter-rater agreement metrics for intention realism and timing appropriateness is described; without this, measured gains on the benchmark do not necessarily transfer to the claimed 'next-generation user-centric proactive interactive agents.'

Authors: We appreciate the referee highlighting this important aspect of our claims. The simulated profiles were designed to scale the generation of diverse, personalized intentions and precise HMI timings across 12 domains in a manner that would be prohibitively expensive and logistically challenging with real human subjects. The simulation draws on domain expertise and structured rules to model user contexts. That said, we acknowledge that the manuscript does not include direct human-subject validation or inter-rater agreement metrics. In the revised version, we will add a detailed subsection in the Methods describing the profile simulation process and its grounding in expert guidelines. We will also insert an explicit Limitations section that discusses the reliance on simulation, the absence of real egocentric human validation, and the resulting scope of our claims. These changes will provide greater transparency without overstating generalizability. revision: partial
Referee: [Abstract and Experiments] Experiments and evaluation (abstract): The claim of enhancement is presented without reported baselines, concrete metrics for intention understanding and timing accuracy, error bars, data splits, or details on post-hoc analysis choices. This under-specification prevents verification of the soundness of the reported improvements and is load-bearing for the paper's primary empirical contribution.

Authors: We agree that the abstract, due to its brevity, does not include these experimental details. The full manuscript does report baseline comparisons against standard MLLMs, concrete metrics (intention understanding accuracy and HMI timing F1/precision), results from multiple runs with error bars, the 12,000/2,400 train/evaluation split, and analysis protocols. To address the concern directly, we will revise the abstract to include key quantitative results and ensure all experimental specifications are clearly summarized in the main text and highlighted in tables. This will make the empirical contribution immediately verifiable. revision: yes
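
The timing metrics named in the response above (precision and F1 for HMI timing) could be computed along the following lines. The matching rule is not stated in this review, so the sketch assumes one: a predicted interaction time counts as correct if it falls within a tolerance window of a ground-truth time that has not been matched yet. The function `timing_prf`, the greedy nearest-match strategy, and the one-second default tolerance are assumptions for illustration.

```python
from typing import List, Tuple


def timing_prf(
    predicted: List[float],
    truth: List[float],
    tolerance_s: float = 1.0,
) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for predicted interaction times (seconds)."""
    unmatched = sorted(truth)
    hits = 0
    for p in sorted(predicted):
        # Greedily pick the nearest ground-truth time that is still unmatched.
        best = min(unmatched, key=lambda t: abs(t - p), default=None)
        if best is not None and abs(best - p) <= tolerance_s:
            unmatched.remove(best)  # each ground-truth time matches at most once
            hits += 1
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(truth) if truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Hypothetical stream: the model speaks three times, two moments were annotated.
precision, recall, f1 = timing_prf([2.1, 5.0, 9.4], [2.0, 9.0])
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Greedy matching is simple but not guaranteed to find the best pairing when events are densely packed; an optimal assignment would be a reasonable alternative in that case.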

Circularity Check

0 steps flagged

No circularity detected; benchmark construction and models are independent contributions

The paper introduces EgoPro-Bench via simulated user profiles to generate intentions and HMI data, proposes specialized models for streaming video, and introduces the 'short thinking, better interaction' principle. No equations, derivations, or load-bearing steps are present that reduce by construction to fitted inputs, self-definitions, or self-citation chains. Claims of enhancement rest on empirical evaluation of the new benchmark rather than renaming known results or smuggling ansatzes. The derivation chain is self-contained as a benchmark proposal without any of the enumerated circular patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The central claim rests on the assumption that simulated profiles produce realistic intentions without external validation; no free parameters or formal axioms are explicitly listed in the abstract, but the benchmark itself introduces the simulated profiles as a core mechanism.

invented entities (1)

simulated user profiles · no independent evidence
purpose: generate diverse user intentions and high-fidelity HMI data across 12 domains
Core to constructing the benchmark data; no independent evidence or real-user validation is mentioned in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem.

EgoPro-Bench leverages simulated user profiles to generate diverse user intentions and to construct high-fidelity HMI data across 12 distinct domains... 'short thinking, better interaction' paradigm... ProAct-Stream model

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem.

We propose EgoPro-Bench... 2,400 videos... training set... RL with Group Sequence Policy Optimization... length reward... semantic consistency reward

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
