Coarse-to-Real: Generative Rendering for Populated Dynamic Scenes

Chongjian Ge; Dan Casas; Gonzalo Gomez-Nogales; Marc Comino-Trinidad; Peiye Zhuang; Yicong Hong; Yi Zhou

arxiv: 2601.22301 · v3 · submitted 2026-01-29 · 💻 cs.CV

Gonzalo Gomez-Nogales, Yicong Hong, Chongjian Ge, Peiye Zhuang, Marc Comino-Trinidad, Dan Casas, Yi Zhou

Pith reviewed 2026-05-16 09:26 UTC · model grok-4.3

classification 💻 cs.CV

keywords: generative rendering, urban crowd videos, coarse-to-fine control, neural renderer, domain adaptation, video synthesis, 3D simulation

The pith

Coarse 3D simulations of urban crowds can be turned into realistic, controllable videos using a neural generative renderer trained in two stages.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a framework called C2R that takes minimal coarse 3D inputs to produce realistic urban scene videos with people. Traditional methods need detailed assets and lots of computation, but this approach uses learned models to add appearance and fine dynamics. It first builds a prior from real videos, then uses limited paired data to tie the coarse controls to the realistic output. This allows text prompts to guide the realism while keeping the scene layout and motions from the 3D input intact. The result is a system that works on various game-like inputs and maintains consistency over time.

Core claim

C2R is a generative rendering framework that synthesizes real-style urban crowd videos from coarse 3D simulations. Coarse 3D renderings explicitly control scene layout, camera motion, and human trajectories, while a learned neural renderer generates realistic appearance, lighting, and fine-scale dynamics guided by text prompts. A two-stage synthetic-real domain-hedging strategy first learns a strong generative prior from large-scale real footage and then introduces controllability with a small amount of paired synthetic coarse-to-fine data to anchor shared implicit spatio-temporal features.

What carries the argument

The two-stage domain-hedging strategy, which learns a generative prior from real footage and anchors it with limited paired coarse-to-fine data to enable control from coarse inputs.

If this is right

Supports coarse-to-fine control over scene elements.
Generalizes to diverse CG and game inputs.
Produces temporally consistent videos.
Generates realistic outputs from minimal 3D input.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Such methods could reduce the cost of creating populated scenes for films or games by minimizing the need for high-detail modeling.
Extending the approach might allow real-time control in interactive applications if inference speed improves.
This bridging of coarse simulation and real appearance could inspire similar techniques for other dynamic scene types like traffic or nature.

Load-bearing premise

A small amount of paired synthetic coarse-to-fine data suffices to anchor shared implicit spatio-temporal features across domains after learning a generative prior from real footage.

What would settle it

Generate videos from a new set of coarse 3D inputs and compare them with real urban footage. Widespread temporal inconsistencies, or a consistent failure to match the realism of real footage, would falsify the claim.
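
One crude way to screen generated clips for temporal inconsistency is to measure how much pixel values change between consecutive frames. The sketch below is only an illustration, not the paper's evaluation protocol: the function name `mean_frame_difference` and the flat-list frame representation are assumptions made here, and real evaluations typically rely on stronger measures such as flow-warped error.

```python
from typing import Sequence


def mean_frame_difference(frames: Sequence[Sequence[float]]) -> float:
    """Average absolute per-pixel change between consecutive frames.

    Each frame is a flat sequence of pixel intensities in [0, 1].
    Hypothetical proxy only: a high value may indicate flicker, but it
    also rises with legitimate motion, so it cannot stand alone.
    """
    if len(frames) < 2:
        return 0.0  # a single frame has no temporal change to measure
    per_pair = []
    for prev, curr in zip(frames, frames[1:]):
        if not prev or len(prev) != len(curr):
            raise ValueError("frames must be non-empty and equally sized")
        diff = sum(abs(a - b) for a, b in zip(prev, curr)) / len(prev)
        per_pair.append(diff)
    return sum(per_pair) / len(per_pair)


# Toy clips: one perfectly steady, one flickering between black and white.
steady = [[0.5] * 4 for _ in range(3)]
flicker = [[0.0] * 4, [1.0] * 4, [0.0] * 4]

steady_score = mean_frame_difference(steady)
flicker_score = mean_frame_difference(flicker)
```

Because real motion also raises this score, it would only be meaningful when comparing outputs rendered from the same coarse input, for example one model variant against another.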

Original abstract

Traditional rendering pipelines rely on complex assets, accurate materials and lighting, and substantial computational resources to produce realistic imagery, yet they still face challenges in scalability and realism for populated dynamic scenes. We present C2R (Coarse-to-Real), a generative rendering framework that synthesizes real-style urban crowd videos from coarse 3D simulations. Our approach uses coarse 3D renderings to explicitly control scene layout, camera motion, and human trajectories, while a learned neural renderer generates realistic appearance, lighting, and fine-scale dynamics guided by text prompts. To overcome the lack of paired training data between coarse simulations and real videos, we adopt a two-stage synthetic-real domain-hedging strategy that first learns a strong generative prior from large-scale real footage, and then introduces controllability by using a small amount of paired synthetic coarse-to-fine data to anchor shared implicit spatio-temporal features across domains. The resulting system supports coarse-to-fine control, generalizes across diverse CG and game inputs, and produces temporally consistent, controllable, and realistic urban scene videos from minimal 3D input. We will release the model and project webpage at https://gonzalognogales.github.io/coarse2real/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The two-stage domain-hedging method for controllable video synthesis from coarse 3D inputs is a reasonable practical idea, but the abstract supplies no metrics or ablations showing that a small amount of paired data is actually sufficient.

The main takeaway is a two-stage pipeline that first trains a generative prior on large-scale real urban footage, then fine-tunes it with a small amount of paired coarse-to-fine synthetic data to add explicit control over scene layout, camera paths, and human trajectories while using text prompts for appearance. This lets the system take minimal 3D simulation input and output temporally consistent realistic videos without building full assets.
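
The ordering of the two stages can be sketched as a data schedule. This is a minimal illustration under stated assumptions, not the authors' training code: `TrainStep`, `build_schedule`, the clip names, and the step counts are all hypothetical, and the abstract does not say what losses are used or whether real footage is mixed back in during the second stage.

```python
from dataclasses import dataclass
from itertools import cycle, islice
from typing import List, Optional, Sequence, Tuple


@dataclass
class TrainStep:
    stage: int               # 1 = learn the prior, 2 = anchor the control
    target: str              # clip the model should learn to generate
    control: Optional[str]   # coarse 3D rendering, or None when unpaired
    prompt: str              # text prompt describing appearance


def build_schedule(
    real_clips: Sequence[str],
    paired_clips: Sequence[Tuple[str, str]],
    stage1_steps: int,
    stage2_steps: int,
) -> List[TrainStep]:
    """Hypothetical schedule: real footage first, then paired synthetic data."""
    steps: List[TrainStep] = []
    # Stage 1: large-scale real footage; no coarse control signal exists.
    for clip in islice(cycle(real_clips), stage1_steps):
        steps.append(TrainStep(1, clip, None, f"caption of {clip}"))
    # Stage 2: a small paired set ties each coarse input to a fine output.
    for coarse, fine in islice(cycle(paired_clips), stage2_steps):
        steps.append(TrainStep(2, fine, coarse, f"caption of {fine}"))
    return steps


schedule = build_schedule(
    real_clips=["real_000", "real_001", "real_002"],
    paired_clips=[("coarse_a", "fine_a")],
    stage1_steps=6,
    stage2_steps=2,
)
```

The point of the sketch is the asymmetry: stage 1 steps carry no control input and draw on many clips, while stage 2 steps always carry a coarse control but cycle through a much smaller paired set.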

Referee Report

2 major / 2 minor

Summary. The paper proposes C2R, a generative rendering framework for synthesizing realistic urban crowd videos from coarse 3D simulations. It employs a two-stage synthetic-real domain-hedging strategy: first learning a generative prior from large-scale real footage, followed by fine-tuning with a small amount of paired synthetic coarse-to-fine data to anchor shared implicit spatio-temporal features. The system is claimed to provide coarse-to-fine control, generalize across diverse CG and game inputs, and produce temporally consistent, controllable, and realistic videos from minimal 3D input.

Significance. If validated, this could advance generative rendering for populated dynamic scenes by reducing reliance on detailed assets while retaining explicit control over layout, camera, and trajectories. The two-stage hedging strategy for bridging real and synthetic domains addresses a key data scarcity issue and, if shown to work with minimal paired examples, would be a practical contribution to controllable video synthesis.

major comments (2)

[Abstract] The central claims that the method 'produces temporally consistent, controllable, and realistic urban scene videos' and 'generalizes across diverse CG and game inputs' are stated without any quantitative metrics (e.g., FID, temporal consistency scores), ablation studies, baseline comparisons, or failure-case analysis. This absence is load-bearing because the soundness of the two-stage anchoring assumption cannot be evaluated from the provided description alone.
[Method, two-stage strategy] The assumption that a small amount of paired synthetic coarse-to-fine data suffices to anchor shared implicit spatio-temporal features across domains without catastrophic forgetting or domain-shift artifacts is presented at a high level. No specifics are given on paired-data volume, alignment losses, or empirical tests of feature correspondence, leaving the weakest assumption untested.

minor comments (2)

[Abstract] The statement that the model and webpage will be released is welcome, but the manuscript provides no details on code, training hyperparameters, or dataset construction that would allow reproduction.
[Introduction] Terminology such as 'coarse-to-fine control' and 'implicit spatio-temporal features' is used without explicit definitions or pointers to the relevant prior neural-rendering literature.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, indicating where revisions have been made to improve clarity and support for the claims.

Referee: [Abstract] The central claims that the method 'produces temporally consistent, controllable, and realistic urban scene videos' and 'generalizes across diverse CG and game inputs' are stated without any quantitative metrics (e.g., FID, temporal consistency scores), ablation studies, baseline comparisons, or failure-case analysis. This absence is load-bearing because the soundness of the two-stage anchoring assumption cannot be evaluated from the provided description alone.

Authors: The abstract is written to be concise while summarizing the core contributions. The full manuscript provides the requested quantitative support in the Experiments section, including FID scores for realism, temporal consistency metrics, ablation studies on the two-stage approach, baseline comparisons, and failure-case analysis. To directly address the concern, we have revised the abstract to include a brief reference to these key quantitative results from the paper.

revision: yes

Referee: [Method, two-stage strategy] The assumption that a small amount of paired synthetic coarse-to-fine data suffices to anchor shared implicit spatio-temporal features across domains without catastrophic forgetting or domain-shift artifacts is presented at a high level. No specifics are given on paired-data volume, alignment losses, or empirical tests of feature correspondence, leaving the weakest assumption untested.

Authors: We agree that greater specificity strengthens the presentation of the two-stage strategy. In the revised manuscript, we have expanded the Method section to detail the paired-data volume, the specific alignment losses used to anchor features across domains, and empirical tests (including visualizations and metrics) confirming feature correspondence without catastrophic forgetting or significant domain shift.

revision: yes

Circularity Check

0 steps flagged

No circularity: method is a descriptive two-stage training procedure without definitional reduction

The paper presents C2R as a generative framework that first trains a prior on real footage then anchors features with small paired synthetic data. No equations, fitted parameters, or self-citations are described that would make any prediction or output equivalent to the input by construction. The central claims rest on the empirical effectiveness of the neural renderer and domain-hedging strategy, which are external to any self-referential loop in the provided text. This is the expected non-finding for a methods paper whose derivation chain is a training recipe rather than a closed mathematical derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities are specified in the provided text. The framework implicitly assumes neural networks can learn cross-domain spatio-temporal features from the described training procedure.

pith-pipeline@v0.9.0 · 5531 in / 1128 out tokens · 29488 ms · 2026-05-16T09:26:19.470583+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "two-stage synthetic-real domain-hedging strategy that first learns a strong generative prior from large-scale real footage, and then introduces controllability by using a small amount of paired synthetic coarse-to-fine data to anchor shared implicit spatio-temporal features"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "implicit spatio-temporal features extracted from both real and synthetic inputs"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.