Bridging Modalities, Spanning Time: Structured Memory for Ultra-Long Agentic Video Reasoning

arxiv: 2605.08271 · v1 · submitted 2026-05-08 · 💻 cs.CV · cs.AI

Bridging Modalities, Spanning Time: Structured Memory for Ultra-Long Agentic Video Reasoning

Jiazheng Li , Chi-Hao Wu , Yunze Liu , Kaize Ding , Jundong Li , Chuxu Zhang This is my paper

Pith reviewed 2026-05-12 01:26 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords multimodal memory graphultra-long video reasoningagentic videonarrative chaincross-modal retrievalegocentric videotraining-free framework

0 comments p. Extension

The pith

A training-free multimodal memory graph with typed edges and a narrative chain lets agents reason over videos spanning days or weeks without discarding most evidence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current multimodal models cannot handle ultra-long videos because even large context windows cover only short segments and most frames are discarded early. The paper shows that building a graph to connect episodic, semantic, and visual information through specific edge types, plus a chain of distilled narrative summaries, creates a unified structure for retrieval. An agentic loop then interleaves graph lookups with narrative facts to cover both modality and time dimensions in one pipeline. If this holds, systems could maintain coherent understanding across continuous recordings like personal lifelogs or extended surveillance without losing long-range connections.

Core claim

The central claim is that a multimodal memory graph using six typed edges to link episodic, semantic, and visual content, combined with an interleaved narrative chain that distills entity biographies and recurring activity events, supports reliable cross-modal retrieval in a training-free agentic loop, allowing effective reasoning over ultra-long videos where prior approaches lose coherence across days or weeks.

What carries the argument

The multimodal memory graph with six typed edges plus interleaved narrative chain, which unifies content types for cross-modal retrieval while compressing long-horizon stories into injectable facts.

If this is right

Agents retrieve relevant information spanning both modalities and extended time periods in a single pipeline.
Performance on long-video question answering improves over general-purpose and prior agentic baselines without any additional training.
Narrative fact injection maintains coherence for recurring activities and entity histories across the full video duration.
The structure reduces the need to fit entire videos into context windows by distilling key elements into the memory.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same graph-plus-chain structure could apply to other long sequential inputs such as audio transcripts or sensor logs.
Adding dynamic edge weighting based on query type might further improve retrieval precision on diverse tasks.
The approach implies that explicit memory organization can complement or reduce reliance on ever-larger context windows for very long inputs.

Load-bearing premise

The load-bearing premise is that a training-free graph with six typed edges and a narrative chain can reliably unify episodic, semantic, and visual content to support accurate retrieval across days or weeks of video.

What would settle it

Run the system on a new set of videos longer than one week and check whether cross-modal retrieval accuracy drops below that of simple frame-sampling baselines or whether narrative summaries fail to connect events separated by multiple days.

Figures

Figures reproduced from arXiv: 2605.08271 by Chi-Hao Wu, Chuxu Zhang, Jiazheng Li, Jundong Li, Kaize Ding, Yunze Liu.

**Figure 1.** Figure 1: Ultra-long video reasoning on EgoLifeQA ID=244. (a) Long-context MLLMs sub-sample frames to fit a million-token budget, dropping the required clip. (b) Memory-based retrieval fails in two ways: cross-modal fragmentation (each modality queried separately) and missing cross-time episodes (bottom-up aggregations miss detailed facts). (c) MAGIC-Video fixes both via a Multimodal Memory Graph (cross-modal PPR) … view at source ↗

**Figure 2.** Figure 2: MAGIC-Video pipeline. Offline (left): preprocessing produces multi-granularity captions, named entities, semantic triples, and visual embeddings, from which we build the Multimodal Memory Graph (four node types connected by six typed edges) and the Narrative Memory Chain (topic chains + event chains). Online (right): for each question, an agentic loop seeds cross-modal Personalized PageRank over the graph,… view at source ↗

**Figure 3.** Figure 3: Cumulative ablation of stacking MMG and then NMC on top of an independent-retrieval (IR) baseline. The third bar is the full MAGIC-Video system. EgoLifeQA / Ego-R1 use MC accuracy; MM-Lifelong uses GPT-5-judged answer accuracy (same 0–100 scale). 5.2 Impact of the Multimodal Memory Graph We now drill into where MMG’s +8.4-point EgoLifeQA lift concentrates, by breaking it down across all five subtask categ… view at source ↗

read the original abstract

Understanding ultra-long videos such as egocentric recordings, live streams, or surveillance footage spanning days to weeks, remains a challenge. For current multimodal LLMs: even with million-token context windows, frame budgets cover only tens of minutes of densely sampled video, and most evidence is discarded before inference begins. Memory-augmented and agentic approaches help with scale, but their retrieval remains fragmented across modalities and lacks long-range narrative summaries that span days or weeks. We propose \textbf{MAGIC-Video}, a training-free framework built around a multimodal memory graph with interleaved narrative chain: the graph unifies episodic, semantic, and visual content through six typed edges and supports cross-modal retrieval, while the chain distils long-horizon entity biographies and recurring activity events. At inference time, an agentic loop interleaves graph retrieval with narrative fact injection, covering both the modality and time dimensions of ultra-long video in a single retrieval pipeline. On EgoLifeQA, Ego-R1 and MM-Lifelong, MAGIC-Video consistently outperforms strong general-purpose, long-video, and agentic baselines, with gains of 10.1, 7.4, and 5.9 points over the prior best agentic system on each benchmark. Code is available at https://github.com/lijiazheng0917/MAGIC-video.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MAGIC-Video gives a training-free graph-plus-narrative memory setup that beats several agentic baselines on long-video QA by 6-10 points, with code released, but the construction and retrieval details need close checking to confirm the gains hold up.

read the letter

The paper's main contribution is a concrete architecture for handling days-long videos in multimodal LLMs. It builds a graph with six typed edges to connect episodic, semantic, and visual elements, then adds an interleaved narrative chain that tracks entity biographies and recurring events. At inference an agentic loop pulls from the graph and injects facts from the chain in one pipeline. This combination is not in the baselines they cite, and the training-free design plus public code at the GitHub link is a practical plus for anyone who wants to experiment quickly. The reported numbers on EgoLifeQA, Ego-R1, and MM-Lifelong show steady gains over prior agentic systems, which matters for applications like egocentric recording or surveillance where context windows still fall short. What works here is the focus on both modality bridging and long-range narrative without adding parameters or fine-tuning. The benchmarks are relevant and the improvements are large enough to notice. The soft spots are in the methods section. The abstract and high-level description do not spell out the exact rules for creating the six edge types or the retrieval scoring, so it is hard to tell how much manual tuning went into the graph or whether the same rules would transfer to new video domains. There is also no mention of statistical significance or error analysis in the summary, which leaves open the chance that the gains are sensitive to the particular test sets. If the full paper includes ablations on edge types and retrieval variants, that would address the main concern. This work is aimed at people building agentic video systems or memory modules for long-horizon multimodal tasks. A reader who needs ideas for scaling beyond minute-scale clips would find the code and architecture worth pulling down and testing on their own data. It is coherent on its own terms and engages the right prior literature, so it deserves a serious referee to check the implementation details and run additional controls. I would send it to peer review rather than desk reject.

Referee Report

3 major / 3 minor

Summary. The paper proposes MAGIC-Video, a training-free framework centered on a multimodal memory graph with six typed edges and an interleaved narrative chain. The graph unifies episodic, semantic, and visual content from ultra-long videos (days to weeks) to enable cross-modal retrieval, while the chain provides long-horizon entity biographies and activity summaries. An agentic inference loop interleaves graph retrieval with narrative injection. The work reports consistent outperformance over general-purpose, long-video, and agentic baselines, with absolute gains of 10.1, 7.4, and 5.9 points on EgoLifeQA, Ego-R1, and MM-Lifelong respectively, and releases public code.

Significance. If the empirical gains and retrieval mechanism hold, the contribution would be significant for scaling multimodal LLMs to ultra-long video without prohibitive context costs or task-specific training. The structured memory design that explicitly bridges modalities and time horizons addresses a clear gap in current agentic video systems. The explicit code release strengthens the work by supporting direct reproducibility and follow-on research.

major comments (3)

[§3.2] §3.2: The construction rules for the six typed edges (and how they enable reliable cross-modal unification of episodic/semantic/visual nodes) are described at a high level but lack the precise algorithmic specification or pseudocode needed to verify that retrieval remains effective across multi-day horizons; this is load-bearing for the central performance claims.
[§4.3] §4.3 and Table 3: The ablation isolating the interleaved narrative chain reports only modest additional gains over the graph alone; this weakens the assertion that the chain is essential for distilling long-range biographies and recurring events.
[§5.1] §5.1: No error bars, standard deviations, or statistical significance tests (e.g., paired t-tests) accompany the reported benchmark deltas of 10.1/7.4/5.9 points; without these, it is impossible to assess whether the outperformance over the prior best agentic system is robust.

minor comments (3)

[Figure 2] Figure 2: The diagram of the agentic loop would benefit from explicit labels on the retrieval and injection steps to clarify the interleaving process.
[§2] §2: A few recent long-video memory papers are cited but the related-work discussion does not explicitly contrast the six-edge graph design against prior graph-based or hierarchical memory approaches.
Notation: The symbols for node types and edge relations are introduced without a consolidated table, making cross-references in later sections harder to follow.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, providing clarifications and committing to revisions where appropriate to strengthen the manuscript.

read point-by-point responses

Referee: [§3.2] §3.2: The construction rules for the six typed edges (and how they enable reliable cross-modal unification of episodic/semantic/visual nodes) are described at a high level but lack the precise algorithmic specification or pseudocode needed to verify that retrieval remains effective across multi-day horizons; this is load-bearing for the central performance claims.

Authors: We appreciate this observation and agree that greater precision would enhance verifiability. The six typed edges are defined by explicit rules based on temporal adjacency, semantic similarity via embeddings, visual feature matching, entity co-occurrence, activity chaining, and cross-modal linking. In the revised manuscript we will add a dedicated algorithmic subsection with pseudocode detailing the construction process, edge typing logic, and how these enable reliable cross-modal retrieval over multi-day spans. This directly addresses the load-bearing concern for the performance claims. revision: yes
Referee: [§4.3] §4.3 and Table 3: The ablation isolating the interleaved narrative chain reports only modest additional gains over the graph alone; this weakens the assertion that the chain is essential for distilling long-range biographies and recurring events.

Authors: We acknowledge that the ablation in Table 3 shows only modest incremental gains from the narrative chain. While the chain is not the sole driver of performance, it provides complementary long-horizon distillation of entity biographies and recurring events that the graph alone does not fully capture, as illustrated in our qualitative case studies. We will revise the text in §4.3 and the abstract to describe the chain as a complementary component for long-range coherence rather than claiming it is strictly essential, and we will expand the discussion to highlight scenarios where its contribution is most pronounced. revision: partial
Referee: [§5.1] §5.1: No error bars, standard deviations, or statistical significance tests (e.g., paired t-tests) accompany the reported benchmark deltas of 10.1/7.4/5.9 points; without these, it is impossible to assess whether the outperformance over the prior best agentic system is robust.

Authors: This is a fair critique. The reported numbers reflect single deterministic runs under fixed seeds and retrieval configurations. In the revision we will add a limitations paragraph noting the absence of variance estimates and will include results from additional runs (where computationally feasible) to report standard deviations. We will also emphasize the consistency of gains across three distinct benchmarks as supporting evidence of robustness, while avoiding unsubstantiated statistical claims. revision: partial

Circularity Check

0 steps flagged

No significant circularity; empirical engineering framework with no derivation chain

full rationale

The paper introduces MAGIC-Video as a training-free multimodal memory graph plus narrative chain for long video reasoning. No equations, fitted parameters, predictions, or first-principles derivations are claimed or present in the abstract or described architecture. Performance gains are reported as empirical results on public benchmarks with linked code, making the contribution self-contained against external evaluation rather than internally forced by definition or self-citation. No load-bearing step reduces to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the framework description implies design choices for edge typing and narrative distillation but does not enumerate them.

pith-pipeline@v0.9.0 · 5559 in / 1138 out tokens · 51921 ms · 2026-05-12T01:26:57.828743+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

multimodal memory graph ... six typed edges ... cross-modal Personalized PageRank ... Narrative Memory Chain ... topic chains ... event chains
IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

six typed edges: MENTIONED_IN, CO_CLIP, APPEARED_IN, HAS_PROPERTY, TEMPORAL_NEXT, CONTAINS

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages

[1]

Combine both sources into one fluent paragraph

work page
[2]

Keep the visual description as the primary narrative

work page
[3]

The narrator explains ’...’ while the camera shows

Integrate speech content naturally (e.g., "The narrator explains ’...’ while the camera shows")

work page
[4]

If transcript is empty, return the visual description as-is

work page
[5]

If visual description is empty but transcript exists, describe the speech

work page
[6]

# Output Output ONLY the merged caption text

Be concise: 2-4 sentences. # Output Output ONLY the merged caption text. No JSON, no explanations. M.2 Multi-granularity caption aggregation (30 s, to 3 min / 10 min / 1 h) Used to roll up the ﬁne 30-s captions into coarser Episode captions at 3-min, 10-min, and 1-h granularities. "You will be provided with some descriptions. Merge events into one single ...

work page
[7]

named_entities

"named_entities": A list of named entities found in the paragraph. 37

work page
[8]

I" instead of replacing with terms like

"triples": A list of RDF triples, each as a 3-element list [subject, predicate, object]. Pay attention to the following requirements: - Each triple should contain at least one, but preferably two, of the named entities. - When resolving pronouns, if the pronoun refers to the first-person (e.g., I, me, my), keep it as "I" instead of replacing with terms li...

work page
[9]

**Which existing triples to remove/pop** - those that should be merged with the new triple or conflict with it

work page
[10]

**How to update the new triple** - to reflect the consolidation, merge information, or resolve conflicts # Consolidation Rules:

work page
[11]

**Merge Similar Information**: If existing triples express very similar information to the new triple, remove them and update the new triple to capture the most complete/accurate representation

work page
[12]

**Resolve Conflicts**: If the new triple conflicts with existing ones, decide which is more accurate/recent and remove the outdated ones

work page
[13]

**Update with Context**: Use information from existing triples to make the new triple more specific or accurate

work page
[14]

{entity}

**Preserve Unique Information**: Only remove existing triples if they are redundant or conflicting # Output Format: Return ONLY a JSON object with the following two keys: - ‘updated_triple‘ (List[str]): The new triple, possibly updated [subject, predicate, object] - ‘triples_to_remove‘ (List[int]): Indices of existing triples to remove (empty list if none...

work page
[15]

{entity}

**Who & When** (EntityLog): Who used/touched/moved "{entity}" first? Whose "{entity}" is this? Where was "{entity}" before?

work page
[16]

{entity}

**What happened** (EventRecall): When was "{entity}" first/last mentioned or discussed? What happened with "{entity}" last time?

work page
[17]

{entity}

**Relationships** (RelationMap): Who helped whom with "{entity}"? Who gave/passed "{entity}" to whom? Who was together when using "{entity}"? 38

work page
[18]

{entity}

**Plans & Decisions** (TaskMaster): Who suggested/planned to buy/make/use "{entity}"? What was decided about "{entity}"?

work page
[19]

{entity}

**Habits & Preferences** (HabitInsight): Who always/usually/often uses "{entity}"? Who likes/dislikes "{entity}"? What does someone typically do with "{entity}"? Rules: - Always use specific person names (A1_JAKE, Shure, Katrina, Tasha, Lucia, Alice, Nicous, Choiszt, Jack). NEVER use "I", "he", "she", "they", "we". - Include the WHERE (which room, locatio...

work page
[20]

Who handed the black marker to Shure?

**search**: Retrieve memory to begin, continue, or extend progress toward the answer. - Write the search query as a **natural-language sentence or question** (NOT a list of keywords). Good: "Who handed the black marker to Shure?" Bad: "black marker Shure hand location" - The retrieval system uses semantic embedding similarity, so natural sentences work mu...

work page
[21]

[No new results]

**answer**: Stop searching because the accumulated results are sufficient. - If 2+ consecutive rounds returned "[No new results]", you MUST answer with what you have. # Context Inputs: - Current Query - Round History: Log of past retrieval rounds. Each round is written in this format: ### Round N Decision: <search|answer> Search Query: <query text> Retrie...

work page 2026
[22]

**search**: Retrieve memory to begin, continue, or extend progress toward the answer - Choose one memory type and form a keyword(phrase)-style search query

work page
[23]

- No memory type selection is needed

**answer**: Stop searching because the accumulated results are sufficient. - No memory type selection is needed. # Memory Types:

work page
[24]

Query by EVENT/ACTION

Episodic: Specific events/actions. Query by EVENT/ACTION

work page
[25]

Query by ENTITY/CONCEPT

Semantic: Entities/relationships. Query by ENTITY/CONCEPT

work page
[26]

search" or

Visual: Scene/setting snapshots. Query by SCENE/SETTING or TIMESTAMP RANGE. # STRICT OUTPUT RULES: - Always decide **first**: "search" or "answer". - If decision = "search": Must include "selected_memory" with exactly one memory type and one query. - If decision = "answer": Do NOT include "selected_memory". - Always output in valid JSON only, no extra com...

work page

[1] [1]

Combine both sources into one fluent paragraph

work page

[2] [2]

Keep the visual description as the primary narrative

work page

[3] [3]

The narrator explains ’...’ while the camera shows

Integrate speech content naturally (e.g., "The narrator explains ’...’ while the camera shows")

work page

[4] [4]

If transcript is empty, return the visual description as-is

work page

[5] [5]

If visual description is empty but transcript exists, describe the speech

work page

[6] [6]

# Output Output ONLY the merged caption text

Be concise: 2-4 sentences. # Output Output ONLY the merged caption text. No JSON, no explanations. M.2 Multi-granularity caption aggregation (30 s, to 3 min / 10 min / 1 h) Used to roll up the ﬁne 30-s captions into coarser Episode captions at 3-min, 10-min, and 1-h granularities. "You will be provided with some descriptions. Merge events into one single ...

work page

[7] [7]

named_entities

"named_entities": A list of named entities found in the paragraph. 37

work page

[8] [8]

I" instead of replacing with terms like

"triples": A list of RDF triples, each as a 3-element list [subject, predicate, object]. Pay attention to the following requirements: - Each triple should contain at least one, but preferably two, of the named entities. - When resolving pronouns, if the pronoun refers to the first-person (e.g., I, me, my), keep it as "I" instead of replacing with terms li...

work page

[9] [9]

**Which existing triples to remove/pop** - those that should be merged with the new triple or conflict with it

work page

[10] [10]

**How to update the new triple** - to reflect the consolidation, merge information, or resolve conflicts # Consolidation Rules:

work page

[11] [11]

**Merge Similar Information**: If existing triples express very similar information to the new triple, remove them and update the new triple to capture the most complete/accurate representation

work page

[12] [12]

**Resolve Conflicts**: If the new triple conflicts with existing ones, decide which is more accurate/recent and remove the outdated ones

work page

[13] [13]

**Update with Context**: Use information from existing triples to make the new triple more specific or accurate

work page

[14] [14]

{entity}

**Preserve Unique Information**: Only remove existing triples if they are redundant or conflicting # Output Format: Return ONLY a JSON object with the following two keys: - ‘updated_triple‘ (List[str]): The new triple, possibly updated [subject, predicate, object] - ‘triples_to_remove‘ (List[int]): Indices of existing triples to remove (empty list if none...

work page

[15] [15]

{entity}

**Who & When** (EntityLog): Who used/touched/moved "{entity}" first? Whose "{entity}" is this? Where was "{entity}" before?

work page

[16] [16]

{entity}

**What happened** (EventRecall): When was "{entity}" first/last mentioned or discussed? What happened with "{entity}" last time?

work page

[17] [17]

{entity}

**Relationships** (RelationMap): Who helped whom with "{entity}"? Who gave/passed "{entity}" to whom? Who was together when using "{entity}"? 38

work page

[18] [18]

{entity}

**Plans & Decisions** (TaskMaster): Who suggested/planned to buy/make/use "{entity}"? What was decided about "{entity}"?

work page

[19] [19]

{entity}

**Habits & Preferences** (HabitInsight): Who always/usually/often uses "{entity}"? Who likes/dislikes "{entity}"? What does someone typically do with "{entity}"? Rules: - Always use specific person names (A1_JAKE, Shure, Katrina, Tasha, Lucia, Alice, Nicous, Choiszt, Jack). NEVER use "I", "he", "she", "they", "we". - Include the WHERE (which room, locatio...

work page

[20] [20]

Who handed the black marker to Shure?

**search**: Retrieve memory to begin, continue, or extend progress toward the answer. - Write the search query as a **natural-language sentence or question** (NOT a list of keywords). Good: "Who handed the black marker to Shure?" Bad: "black marker Shure hand location" - The retrieval system uses semantic embedding similarity, so natural sentences work mu...

work page

[21] [21]

[No new results]

**answer**: Stop searching because the accumulated results are sufficient. - If 2+ consecutive rounds returned "[No new results]", you MUST answer with what you have. # Context Inputs: - Current Query - Round History: Log of past retrieval rounds. Each round is written in this format: ### Round N Decision: <search|answer> Search Query: <query text> Retrie...

work page 2026

[22] [22]

**search**: Retrieve memory to begin, continue, or extend progress toward the answer - Choose one memory type and form a keyword(phrase)-style search query

work page

[23] [23]

- No memory type selection is needed

**answer**: Stop searching because the accumulated results are sufficient. - No memory type selection is needed. # Memory Types:

work page

[24] [24]

Query by EVENT/ACTION

Episodic: Specific events/actions. Query by EVENT/ACTION

work page

[25] [25]

Query by ENTITY/CONCEPT

Semantic: Entities/relationships. Query by ENTITY/CONCEPT

work page

[26] [26]

search" or

Visual: Scene/setting snapshots. Query by SCENE/SETTING or TIMESTAMP RANGE. # STRICT OUTPUT RULES: - Always decide **first**: "search" or "answer". - If decision = "search": Must include "selected_memory" with exactly one memory type and one query. - If decision = "answer": Do NOT include "selected_memory". - Always output in valid JSON only, no extra com...

work page