StreamingT2V: Consistent, dynamic, and extendable long video generation from text

“StreamingT2V: Consistent, dynamic, and extendable long video generation from text” · 2024 · arXiv 2403.14773

12 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 3

citation-polarity summary

background 3

representative citing papers

DCR: Counterfactual Attractor Guidance for Rare Compositional Generation

cs.CV · 2026-05-07 · unverdicted · novelty 7.0

DCR uses a counterfactual attractor and projection-based repulsion to suppress default completion bias in diffusion models, improving fidelity for rare compositional prompts while preserving quality.

DIM-WAM: World-Action Modeling with Diverse Historical Event Memory

cs.RO · 2026-06-26 · unverdicted · novelty 6.0

DiM-WAM is a memory-augmented world-action model that integrates multi-scale historical events and global task progress to improve long-horizon robot manipulation performance.

Head Forcing: Long Autoregressive Video Generation via Head Heterogeneity

cs.CV · 2026-05-14 · unverdicted · novelty 6.0

Head Forcing assigns tailored KV cache strategies to local, anchor, and memory attention heads plus head-wise RoPE re-encoding to extend autoregressive video generation from seconds to minutes without training.

Rolling Sink: Bridging Limited-Horizon Training and Open-Ended Testing in Autoregressive Video Diffusion

cs.CV · 2026-02-08 · unverdicted · novelty 6.0

Rolling Sink is a training-free cache adjustment technique that maintains visual consistency in autoregressive video diffusion models for ultra-long open-ended generation beyond training horizons.

Quant VideoGen: Auto-Regressive Long Video Generation via 2-Bit KV-Cache Quantization

cs.LG · 2026-02-03 · unverdicted · novelty 6.0

Quant VideoGen reduces KV cache memory by up to 7 times in autoregressive video diffusion models via semantic aware smoothing and progressive residual quantization, achieving better quality than baselines with under 4% latency overhead.

GoViG: Goal-Conditioned Visual Navigation Instruction Generation via Multimodal Reasoning

cs.CV · 2025-08-13 · unverdicted · novelty 6.0

GoViG decomposes goal-conditioned navigation instruction generation into visual state prediction and instruction synthesis using an autoregressive multimodal LLM with one-pass and interleaved reasoning, showing gains on a new R2R-Goal dataset.

Character-Centered Dialogue Generation from Scene-Level Prompts

cs.CV · 2025-05-22 · unverdicted · novelty 4.0

A training-free framework generates expressive, character-grounded dialogue and speech from scene prompts using vision-language encoders, LLMs, and a recursive narrative memory bank for cross-scene consistency.

Scene-Action Prompt Fusion for Coherent Text-to-Video Storytelling

cs.CV · 2025-03-08 · unverdicted · novelty 3.0

A prompt fusion approach combines bidirectional time-weighted latent blending, dynamics-informed prompt weighting via CLIP, and semantic action representations to produce temporally consistent long videos from text without retraining.

Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey

cs.CV · 2025-03-16 · unverdicted · novelty 2.0

The paper provides the first comprehensive survey of multimodal chain-of-thought reasoning, including foundational concepts, a taxonomy of methodologies, application analyses, challenges, and future directions.

Stream3D: Sequential Multi-View 3D Generation via Evidential Memory

cs.CV · 2026-05-20

HL-OutPaint: Coarse-to-Fine Video Outpainting for High-Resolution Long-Range Videos

cs.CV · 2026-05-17

VRAG: Learning World Models for Interactive Video Generation

cs.CV · 2025-05-28

citing papers explorer

Showing 12 of 12 citing papers.

DCR: Counterfactual Attractor Guidance for Rare Compositional Generation · cs.CV · 2026-05-07 · unverdicted · none · ref 15
DIM-WAM: World-Action Modeling with Diverse Historical Event Memory · cs.RO · 2026-06-26 · unverdicted · none · ref 33
Head Forcing: Long Autoregressive Video Generation via Head Heterogeneity · cs.CV · 2026-05-14 · unverdicted · none · ref 20
Rolling Sink: Bridging Limited-Horizon Training and Open-Ended Testing in Autoregressive Video Diffusion · cs.CV · 2026-02-08 · unverdicted · none · ref 33
Quant VideoGen: Auto-Regressive Long Video Generation via 2-Bit KV-Cache Quantization · cs.LG · 2026-02-03 · unverdicted · none · ref 4
GoViG: Goal-Conditioned Visual Navigation Instruction Generation via Multimodal Reasoning · cs.CV · 2025-08-13 · unverdicted · none · ref 10
Character-Centered Dialogue Generation from Scene-Level Prompts · cs.CV · 2025-05-22 · unverdicted · none · ref 26
Scene-Action Prompt Fusion for Coherent Text-to-Video Storytelling · cs.CV · 2025-03-08 · unverdicted · none · ref 15
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey · cs.CV · 2025-03-16 · unverdicted · none · ref 211
Stream3D: Sequential Multi-View 3D Generation via Evidential Memory · cs.CV · 2026-05-20 · unreviewed · ref 18
HL-OutPaint: Coarse-to-Fine Video Outpainting for High-Resolution Long-Range Videos · cs.CV · 2026-05-17 · unreviewed · ref 62
VRAG: Learning World Models for Interactive Video Generation · cs.CV · 2025-05-28 · unreviewed · ref 39