{"total":12,"items":[{"citing_arxiv_id":"2606.27677","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DIM-WAM: World-Action Modeling with Diverse Historical Event Memory","primary_cat":"cs.RO","submitted_at":"2026-06-26T03:17:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DiM-WAM is a memory-augmented world-action model that integrates multi-scale historical events and global task progress to improve long-horizon robot manipulation performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21472","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Stream3D: Sequential Multi-View 3D Generation via Evidential Memory","primary_cat":"cs.CV","submitted_at":"2026-05-20T17:55:16+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17543","ref_index":62,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"HL-OutPaint: Coarse-to-Fine Video Outpainting for High-Resolution Long-Range Videos","primary_cat":"cs.CV","submitted_at":"2026-05-17T16:52:38+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14487","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Head Forcing: Long Autoregressive Video Generation via Head Heterogeneity","primary_cat":"cs.CV","submitted_at":"2026-05-14T07:27:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Head Forcing assigns 
tailored KV cache strategies to local, anchor, and memory attention heads plus head-wise RoPE re-encoding to extend autoregressive video generation from seconds to minutes without training.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06512","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DCR: Counterfactual Attractor Guidance for Rare Compositional Generation","primary_cat":"cs.CV","submitted_at":"2026-05-07T16:22:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DCR uses a counterfactual attractor and projection-based repulsion to suppress default completion bias in diffusion models, improving fidelity for rare compositional prompts while preserving quality.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[13] Rohit Girdhar, Mannat Singh, Andrew Brown, Quentin Duval, Samaneh Azadi, Sai Saketh Rambhatla, Akbar Shah, Xi Yin, Devi Parikh, and Ishan Misra. Emu video: Factorizing text-to-video generation by explicit image conditioning. arXiv preprint arXiv:2311.10709, 2023. [14] Yingqing He, Tianyu Yang, Yong Zhang, Ying Shan, and Qifeng Chen. Latent video diffusion models for high-fidelity long video generation. arXiv preprint arXiv:2211.13221, 2022. [15] Roberto Henschel, Levon Khachatryan, Daniil Hayrapetyan, Hayk Poghosyan, Vahram Tadevosyan, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Streamingt2v: Consistent, dynamic, and extendable long video generation from text. arXiv preprint arXiv:2403.14773, 2024. 
[16] Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P"},{"citing_arxiv_id":"2602.07775","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Rolling Sink: Bridging Limited-Horizon Training and Open-Ended Testing in Autoregressive Video Diffusion","primary_cat":"cs.CV","submitted_at":"2026-02-08T02:16:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Rolling Sink is a training-free cache adjustment technique that maintains visual consistency in autoregressive video diffusion models for ultra-long open-ended generation beyond training horizons.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"The surfer's expression is one of exhilaration and focus. A mid-shot from a low-angle perspective capturing the surfer's motion and the wave's power. B Related Works Video Diffusion Models. Video generation is of great benefit in neural simulators [2,3,9] and world models [4,5,23,37,46]. Synthesizing photorealistic videos using video diffusion models [7,8,14,20,28,29,33,34,36,49,67,73,82,90,100,108,110] has become the community standard, following the substantial success of image diffusion models [30-32,35,51-53,56,58,59,61,64,78,83,92]. Thanks to the strong scaling abilities of video diffusion models and the internet-scale data, the industries have presented many powerful video generators [21,47,70,79,91]. 
Autoregressive Video Diffusion Models. Video diffusion models typically"},{"citing_arxiv_id":"2602.02958","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Quant VideoGen: Auto-Regressive Long Video Generation via 2-Bit KV-Cache Quantization","primary_cat":"cs.LG","submitted_at":"2026-02-03T00:54:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Quant VideoGen reduces KV cache memory by up to 7 times in autoregressive video diffusion models via semantic aware smoothing and progressive residual quantization, achieving better quality than baselines with under 4% latency overhead.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2508.09547","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"GoViG: Goal-Conditioned Visual Navigation Instruction Generation via Multimodal Reasoning","primary_cat":"cs.CV","submitted_at":"2025-08-13T07:05:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GoViG decomposes goal-conditioned navigation instruction generation into visual state prediction and instruction synthesis using an autoregressive multimodal LLM with one-pass and interleaved reasoning, showing gains on a new R2R-Goal dataset.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.21996","ref_index":39,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VRAG: Learning World Models for Interactive Video 
Generation","primary_cat":"cs.CV","submitted_at":"2025-05-28T05:55:44+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.16819","ref_index":26,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Character-Centered Dialogue Generation from Scene-Level Prompts","primary_cat":"cs.CV","submitted_at":"2025-05-22T15:54:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A training-free framework generates expressive, character-grounded dialogue and speech from scene prompts using vision-language encoders, LLMs, and a recursive narrative memory bank for cross-scene consistency.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2503.12605","ref_index":211,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey","primary_cat":"cs.CV","submitted_at":"2025-03-16T18:39:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"The paper provides the first comprehensive survey of multimodal chain-of-thought reasoning, including foundational concepts, a taxonomy of methodologies, application analyses, challenges, and future directions.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Parallel to these works about multimodal understanding, research also explored multimodal content generation. In image generation, models such as Kosmos-2 [201], GILL [202], Emu [203], and MiniGPT-5 [204] have achieved breakthroughs. 
Audio generation has seen advancements with SpeechGPT [205, 206] and AudioPaLM [207], while video generation research, including CogVideo [208], VideoPoet [209], Video-Lavit [210], and StreamingT2V [211], has laid the groundwork for multimodal content creation. The recent introduction of GPT-4o [212], capable of both understanding and generating images and audio, has shifted attention toward \"any-to-any\" paradigm models. [Figure: Common architectures of MLLMs]"},{"citing_arxiv_id":"2503.06310","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Scene-Action Prompt Fusion for Coherent Text-to-Video Storytelling","primary_cat":"cs.CV","submitted_at":"2025-03-08T19:04:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"A prompt fusion approach combines bidirectional time-weighted latent blending, dynamics-informed prompt weighting via CLIP, and semantic action representations to produce temporally consistent long videos from text without retraining.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}