{"total":20,"items":[{"citing_arxiv_id":"2606.24876","ref_index":70,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"FLAT: Feedforward Latent Triangle Splatting for Geometrically Accurate Scene Generation","primary_cat":"cs.CV","submitted_at":"2026-06-23T17:53:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"FLAT maps compressed video diffusion latents to explicit triangle splats via ray-centered rotation parameterization and a product window function, reporting better geometric accuracy than 3D Gaussian baselines under identical training.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.17800","ref_index":60,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MaineCoon: Pursuing A Real-Time Audio-Visual Social World Model","primary_cat":"cs.CV","submitted_at":"2026-06-16T11:25:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MaineCoon is presented as the first 22B-parameter real-time streaming audio-visual autoregressive model optimized for social-interactive applications, using novel training techniques and an agentic inference framework.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.17730","ref_index":46,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ActWorld: From Explorable to Interactive World Model via Action-Aware Memory","primary_cat":"cs.CV","submitted_at":"2026-06-16T09:47:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ActWorld extends navigation-centric world models to support mid-rollout object interactions via chunk-autoregressive generation, action-aware memory routing, and a persistent memory bank, 
backed by a 100K annotated interaction dataset.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.11751","ref_index":30,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AnchorEdit: Maintaining Temporal Consistency in Multi-turn Image Editing via Causal Memory","primary_cat":"cs.CV","submitted_at":"2026-06-10T07:26:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AnchorEdit is the first autoregressive diffusion framework for causal multi-turn image editing, achieving claimed SOTA consistency over 10+ rounds via three-stage training and a memory mechanism.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.09639","ref_index":90,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CineDance: Towards Next-Generation Multi-Shot Long-Form Cinematic Audio-Video Generation","primary_cat":"cs.CV","submitted_at":"2026-06-08T15:35:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Introduces CineDance-1M dataset for multi-shot long-form text-to-audio-video generation along with CineBench and a model adaptation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.08091","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VideoWeaver: Evaluating and Evolving Skills for Agentic Long Video Generation","primary_cat":"cs.CV","submitted_at":"2026-06-06T10:35:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Introduces VideoWeaver benchmark (16 categories, 285 cases) plus agent-as-judge and skill-evolution algorithm to assess and improve agentic long video generation across 
frameworks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00793","ref_index":95,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MBench: A Comprehensive Benchmark on Memory Capability for Video World Models","primary_cat":"cs.CV","submitted_at":"2026-05-30T16:17:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MBench is a new benchmark that quantifies long-term memory in video world models via three hierarchical consistency dimensions evaluated on curated real videos.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30083","ref_index":54,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Future Forcing: Future-aware Training-free KV Cache Policy for Autoregressive Video Generation","primary_cat":"cs.CV","submitted_at":"2026-05-28T15:30:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Future Forcing constructs a future query proxy from historical pre-RoPE statistics to score and merge KV tokens, improving subject consistency by up to 1.49 on VBench-Long for 60s AR video generation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.26244","ref_index":29,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LongAV-Compass: Towards Unified Evaluation of Minute-Scale Audio-Visual Generation Across T2AV, I2AV, and V2AV","primary_cat":"cs.CV","submitted_at":"2026-05-25T18:12:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LongAV-Compass is a new benchmark and evaluation framework for minute-scale audio-visual generation across T2AV, I2AV, and V2AV with 
multi-dimensional assessment.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.24892","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"X-Foresight: A Joint Vision-Action Causal Forecasting Network via Predictive World Modeling","primary_cat":"cs.CV","submitted_at":"2026-05-24T06:37:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"X-Foresight adds a long-horizon chunk-wise auto-regressive world model with temporal importance sampling and curriculum learning to VLA architectures for improved planning and generative fidelity.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20183","ref_index":75,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MSAVBench: Towards Comprehensive and Reliable Evaluation of Multi-Shot Audio-Video Generation","primary_cat":"cs.CV","submitted_at":"2026-05-19T17:59:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MSAVBench is the first comprehensive benchmark for multi-shot audio-video generation featuring four dimensions, challenging scenarios, and an adaptive hybrid evaluation framework that achieves 91.5% Spearman correlation with human judgments.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18739","ref_index":71,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation","primary_cat":"cs.CV","submitted_at":"2026-05-18T17:57:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LongLive-2.0 delivers an NVFP4 parallel infrastructure that enables direct 
training of long multi-shot autoregressive diffusion video models and achieves up to 2.15x training and 1.84x inference speedups on Blackwell and other GPUs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18346","ref_index":52,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Focused Forcing: Content-Aware Per-Frame KV Selection for Efficient Autoregressive Video Diffusion","primary_cat":"cs.CV","submitted_at":"2026-05-18T12:58:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Focused Forcing is a training-free per-frame KV selection method that combines attention scores with diversity metrics and head-importance estimation to accelerate autoregressive video diffusion up to 1.48x while improving quality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15190","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"RAVEN: Real-time Autoregressive Video Extrapolation with Consistency-model GRPO","primary_cat":"cs.CV","submitted_at":"2026-05-14T17:59:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RAVEN aligns training and inference for causal autoregressive video diffusion via interleaved rollout repacking and introduces CM-GRPO for direct RL on consistency-model kernels, claiming better quality than recent baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15182","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Warp-as-History: Generalizable Camera-Controlled Video Generation from One Training 
Video","primary_cat":"cs.CV","submitted_at":"2026-05-14T17:58:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Warp-as-History enables zero-shot camera trajectory following in frozen video models by supplying camera-warped pseudo-history, with single-video LoRA fine-tuning improving generalization to unseen videos.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15042","ref_index":50,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"EverAnimate: Minute-Scale Human Animation via Latent Flow Restoration","primary_cat":"cs.CV","submitted_at":"2026-05-14T16:36:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"EverAnimate restores drifted latent flow trajectories in chunked video generation via persistent latent propagation and restorative flow matching, achieving measurable gains in PSNR, SSIM, LPIPS, and FID over prior long-animation methods with only LoRA tuning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11596","ref_index":27,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"HorizonDrive: Self-Corrective Autoregressive World Model for Long-horizon Driving Simulation","primary_cat":"cs.CV","submitted_at":"2026-05-12T06:22:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HorizonDrive is a new anti-drifting autoregressive training and distillation method that enables minute-scale stable driving video rollouts by making the teacher model rollout-capable via scheduled rollout recovery and teacher rollout 
DMD.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09681","ref_index":31,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Forcing-KV: Hybrid KV Cache Compression for Efficient Autoregressive Video Diffusion Models","primary_cat":"cs.CV","submitted_at":"2026-05-10T17:59:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Forcing-KV applies head-specific static and dynamic pruning to KV caches in AR video diffusion models, achieving over 29 fps, 30% memory reduction, and up to 2.82x speedup at maintained quality.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"Zhang, N. Ma, H. Chen, M. Agrawala, L. Guibas, G. Wetzstein et al., \"Mode Seeking meets Mean Seeking for Fast Long Video Generation,\" arXiv preprint arXiv:2602.24289, 2026. [30] M. L. Team, X. Cai, Q. Huang, Z. Kang, H. Li, S. Liang, L. Ma, S. Ren, X. Wei, R. Xie, and T. Zhang, \"LongCat-Video Technical Report,\" arXiv preprint arXiv:2510.22200, 2025. [31] S. Yuan, Y. Yin, Z. Li, X. Huang, X. Yang, and L. Yuan, \"Helios: Real Real-Time Long Video Generation Model,\" arXiv preprint arXiv:2603.04379, 2026. [32] H. Xi, S. Yang, Y. Zhao, C. Xu, M. Li, X. Li, Y. Lin, H. Cai, J. Zhang, D. 
Li et al., \"Sparse Video-Gen: Accelerating Video Diffusion Transformers with Spatial-Temporal Sparsity,\" in Forty-second International Conference on Machine Learning, 2025."},{"citing_arxiv_id":"2604.18564","ref_index":66,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MultiWorld: Scalable Multi-Agent Multi-View Video World Models","primary_cat":"cs.CV","submitted_at":"2026-04-20T17:52:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MultiWorld is a scalable framework for multi-agent multi-view video world models that improves controllability and consistency over single-agent baselines in game and robot tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"els incorporate various signals like camera controls [13,36,52,65] and action controls [7,10,12,35] to simulate future states. Recent studies have explored several essential properties [16] of interactive video world models, such as physical consistency [37,47,49,72], and long-horizon coherence [53,56,62], alongside efficient real-time generation [17,61,66,75] to enable practical deployment. With these properties, world models can serve as powerful simulators for downstream tasks like game generation [39,55], embodied AI [6,27], and autonomous driving [33,58]. Game video world models [44,64] control the environment and simulate player observations based on provided actions. 
Robotic video world models"},{"citing_arxiv_id":"2604.16592","ref_index":209,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Human Cognition in Machines: A Unified Perspective of World Models","primary_cat":"cs.RO","submitted_at":"2026-04-17T17:51:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and proposes Epistemic World Models as a new category for scientific discovery agents.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[78] 2026 2D ✗ ✓ ✗ ✓ ✗ ✗ ✗ Iteratively denoise-and-re-noise video latents at inference time to self-correct physics/motion artifacts. V-JEPA [13] 2024 2D ✓ ✓ ✗ ✗ ✗ ✗ ✗ Self-supervised video learning by predicting masked spatio-temporal regions in latent space. VideoWeave [46] 2026 2D ✓ ✗ ✗ ✗ ✗ ✗ ✗ Splice short captioned videos into synthetic long videos to cheaply train better video-language models. Helios [209] 2026 2D ✓ ✗ ✗ ✓ ✗ ✗ ✗ 14B video generation model running real-time on one H100 via context compression and drift-aware training. Marble World Model [169, 98] 2025 3D ✗ ✓ ✗ ✓ ✗ ✗ ✗ Multimodal 3D world generator. Garrido et al. [52] 2026 2D ✗ ✓ ✗ ✗ ✗ ✗ ✗ Learn action-conditioned World Models from unlabeled in-the-wild videos by inferring continuous latent"}],"limit":50,"offset":0}