VideoNeuMat: Neural Material Extraction from Generative Video Models

Bowen Xue; Fabrice Rousselle; Milos Hasan; Saeed Hadadan; Zahra Montazeri; Zheng Zeng

arxiv: 2602.07272 · v2 · pith:W4XHI77Hnew · submitted 2026-02-06 · 💻 cs.CV · cs.GR

Pith reviewed 2026-05-21 13:57 UTC · model grok-4.3

keywords: neural material extraction · video diffusion models · virtual gonioreflectometer · large reconstruction model · reusable neural assets · material transfer · photorealistic rendering · generative video to 3D

The pith

Generative video models hold extractable material knowledge that converts into reusable neural 3D assets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates a pipeline that pulls realistic material properties from large video diffusion models trained on broad internet data. It first adapts the video model to output short clips of flat material samples viewed along fixed camera and lighting paths. A second model then converts each 17-frame clip into a compact neural material description in a single forward pass. These descriptions render correctly under new viewpoints and new lights and display more variety and realism than materials trained only on synthetic examples. If the transfer works as described, high-quality material libraries become available without the need to assemble fresh synthetic training sets or rely on manual artist input.

Core claim

Finetuning a large video diffusion model produces controlled videos of material appearances that function as measurements from a virtual gonioreflectometer; a separate large reconstruction model then inverts any such 17-frame video into neural material parameters that generalize to novel views and lights while exceeding the visual quality of the original synthetic training distribution.

What carries the argument

Virtual gonioreflectometer formed by finetuning the video model to generate structured material videos, paired with single-pass inference inside the large reconstruction model to recover neural material parameters.
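
To make the "structured material videos" idea concrete, here is a minimal sketch of what a fixed measurement schedule over 17 frames might look like. Only the frame count comes from the paper. The specific trajectory (camera sweeping in polar angle, light orbiting in azimuth), the angle limits, and the names `hemisphere_dir` and `measurement_schedule` are assumptions made for illustration; the paper's actual camera and lighting paths are not described on this page.

```python
import math

NUM_FRAMES = 17  # frame count stated in the paper; everything else is assumed


def hemisphere_dir(theta, phi):
    """Unit vector at polar angle theta (measured from the normal +z) and azimuth phi."""
    return (math.sin(theta) * math.cos(phi),
            math.sin(theta) * math.sin(phi),
            math.cos(theta))


def measurement_schedule(num_frames=NUM_FRAMES, max_theta=math.radians(60)):
    """Hypothetical schedule: one (camera, light) direction pair per frame."""
    schedule = []
    for i in range(num_frames):
        t = i / (num_frames - 1)  # runs from 0.0 to 1.0 across the clip
        camera = hemisphere_dir(t * max_theta, 0.0)                 # tilts away from the normal
        light = hemisphere_dir(math.radians(45), t * 2 * math.pi)   # orbits the sample
        schedule.append({"frame": i, "camera": camera, "light": light})
    return schedule


if __name__ == "__main__":
    for entry in measurement_schedule():
        print(entry)
```

Because the same schedule is reused for every material, a downstream model can treat frame index as a proxy for known view and light directions, which is what makes the clip behave like a measurement rather than an arbitrary video.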

If this is right

Neural materials produced this way generalize to novel viewing directions and lighting conditions.
The extracted materials display greater realism and diversity than those obtained from limited synthetic training sets.
Material knowledge moves from internet-scale video models into compact, reusable neural 3D assets.
Single-pass inference from 17 frames replaces multi-view optimization for material recovery.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same pipeline could supply appearance parameters for complete 3D objects once geometry is obtained from separate extraction methods.
Extending the video generation stage to capture moving or deforming surfaces might yield materials with time-varying effects such as weathering.
The approach suggests that other appearance properties now entangled in video models could be isolated by analogous structured generation and inversion steps.

Load-bearing premise

Finetuning the video model successfully isolates material appearance from geometry and lighting so that the generated videos supply clean measurements for later reconstruction.

What would settle it

Extracted neural materials that visibly mismatch real photographs or ground-truth renderings when placed under previously unseen light directions or camera angles would show the transfer has not occurred.
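
A check of this kind could be scored with a simple image-error metric. The sketch below computes PSNR between a rendering of the extracted material and a reference image, and reports the worst case over a set of held-out conditions. The flat pixel-list representation, the `peak` default of 1.0, and the function names are assumptions for illustration; the paper does not specify an evaluation protocol on this page.

```python
import math


def psnr(rendered, reference, peak=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if len(rendered) != len(reference):
        raise ValueError("images must have the same number of pixel values")
    mse = sum((a - b) ** 2 for a, b in zip(rendered, reference)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak * peak / mse)


def worst_case_psnr(pairs):
    """Lowest PSNR over (rendered, reference) pairs from held-out views and lights."""
    return min(psnr(rendered, reference) for rendered, reference in pairs)
```

Reporting the minimum rather than the mean is a deliberate choice here: a material that fails at a single grazing angle would be hidden by an average but exposed by the worst case.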

Original abstract

Creating photorealistic materials for 3D rendering requires exceptional artistic skill. Generative models for materials could help, but are currently limited by the lack of high-quality training data. While recent video generative models effortlessly produce realistic material appearances, this knowledge remains entangled with geometry and lighting. We present VideoNeuMat, a two-stage pipeline that extracts reusable neural material assets from video diffusion models. First, we finetune a large video model (Wan 2.1 14B) to generate material sample videos under controlled camera and lighting trajectories, effectively creating a "virtual gonioreflectometer" that preserves the model's material realism while learning a structured measurement pattern. Second, we reconstruct compact neural materials from these videos through a Large Reconstruction Model (LRM) finetuned from a smaller Wan 1.3B video backbone. From 17 generated video frames, our LRM performs single-pass inference to predict neural material parameters that generalize to novel viewing and lighting conditions. The resulting materials exhibit realism and diversity far exceeding the limited synthetic training data, demonstrating that material knowledge can be successfully transferred from internet-scale video models into standalone, reusable neural 3D assets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

VideoNeuMat repurposes a finetuned video diffusion model as a virtual gonioreflectometer, then feeds 17 frames into an LRM for single-pass neural material extraction, but the evidence for clean disentanglement and generalization is still thin.

The main takeaway is that this paper gives a concrete two-stage recipe for turning knowledge inside large video models into reusable neural materials without building new capture datasets from scratch. They finetune Wan 2.1 14B on controlled trajectories to produce structured videos, then run a smaller Wan-based LRM to output neural material parameters in one shot. That pipeline direction feels fresh relative to earlier material capture or generative work I have seen.

Referee Report

2 major / 2 minor

Summary. The paper proposes VideoNeuMat, a two-stage pipeline for extracting reusable neural material assets from generative video models. It first finetunes Wan 2.1 14B to generate 17-frame videos of material samples under controlled camera and lighting trajectories (creating a virtual gonioreflectometer), then finetunes a Large Reconstruction Model (LRM) from a Wan 1.3B backbone to infer neural material parameters in a single pass. The central claim is that these parameters generalize to novel views and lighting, yielding materials with greater realism and diversity than those from limited synthetic training data, thereby transferring implicit material knowledge from internet-scale video models into standalone 3D assets.

Significance. If the generalization and disentanglement claims hold with supporting quantitative evidence, the work would offer a practical route to photorealistic neural materials without exhaustive synthetic data collection, leveraging the material appearance priors already present in large video diffusion models. This could impact downstream applications in 3D rendering, virtual production, and material design by producing compact, reusable assets that outperform current synthetic-data baselines.

major comments (2)

[Abstract / §3] Abstract and §3 (virtual gonioreflectometer description): the finetuning procedure is presented as imposing structured camera/lighting trajectories while preserving material realism, yet no explicit loss term, architectural constraint (e.g., material-constancy regularizer or geometry-free latent), or ablation is referenced to suppress residual geometry/lighting correlations inherited from pretraining. Without such a mechanism, the 17-frame videos may not supply measurements whose appearance variations are attributable solely to material parameters, undermining the downstream claim that the LRM recovers view- and light-independent BRDFs.
[Abstract / Evaluation] Abstract and evaluation sections: the claims of 'realism and diversity far exceeding the limited synthetic training data' and successful generalization rest on qualitative statements; no error metrics, comparison baselines, or quantitative protocol for measuring generalization to novel viewing/lighting conditions are described. This absence makes it impossible to assess whether the LRM outputs are load-bearing improvements or merely reproductions of training-video cues.
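
The disentanglement concern in the first major comment can be stated as a simple image-formation model. The equation below is a sketch under stated assumptions, not the paper's formulation: a flat sample with normal $\mathbf{n}$, a single directional light of radiance $L$, no interreflection, and frame-wise light and view directions $\omega_i^{(k)}$ and $\omega_o^{(k)}$ fixed by the measurement schedule.

```latex
I_k(\mathbf{x}) \;=\; f\big(\mathbf{x},\, \omega_i^{(k)},\, \omega_o^{(k)}\big)\; L \;\max\!\big(0,\; \mathbf{n}\cdot\omega_i^{(k)}\big),
\qquad k = 1, \dots, 17
```

Here $I_k(\mathbf{x})$ is the pixel value at surface point $\mathbf{x}$ in frame $k$ and $f$ is the spatially varying reflectance. The generated clip is a clean measurement only if one shared $f$ explains all 17 frames; if the video model lets the material drift between frames, no single $f$ fits and the reconstruction absorbs the inconsistency.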

minor comments (2)

[Method] Clarify the precise parameterization of the neural material (e.g., which BRDF lobes or latent dimensions are predicted by the LRM) and how it is rendered for novel conditions.
[Abstract / §4] The abstract states 'single-pass inference' from 17 frames; confirm whether this is strictly feed-forward or involves any test-time optimization, and add a diagram of the LRM architecture.
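
For readers unfamiliar with the term raised in the first minor comment, a neural material is commonly represented as a per-texel latent code plus a small decoder network that maps the code and the light and view directions to reflectance. The sketch below shows that generic formulation only. The latent size, hidden width, random weights, softplus output, and the names `make_decoder` and `evaluate` are all assumptions for illustration; whether VideoNeuMat uses this parameterization is exactly what the referee asks the authors to clarify.

```python
import math
import random

LATENT_DIM = 8   # assumed size of the per-texel latent code
HIDDEN = 16      # assumed hidden width of the decoder


def make_decoder(seed=0):
    """Random weights for a two-layer MLP: (latent + wi + wo) -> RGB."""
    rng = random.Random(seed)
    in_dim = LATENT_DIM + 6  # latent code plus two 3D directions
    w1 = [[rng.gauss(0, 0.3) for _ in range(in_dim)] for _ in range(HIDDEN)]
    w2 = [[rng.gauss(0, 0.3) for _ in range(HIDDEN)] for _ in range(3)]
    return w1, w2


def evaluate(decoder, latent, wi, wo):
    """Reflectance (RGB) for one texel under light direction wi and view direction wo."""
    w1, w2 = decoder
    x = list(latent) + list(wi) + list(wo)
    # Hidden layer with ReLU activation
    hidden = [max(0.0, sum(w * v for w, v in zip(row, x))) for row in w1]
    # Softplus keeps the predicted reflectance non-negative
    return [math.log1p(math.exp(sum(w * h for w, h in zip(row, hidden)))) for row in w2]


if __name__ == "__main__":
    decoder = make_decoder()
    texel = [0.1] * LATENT_DIM
    print(evaluate(decoder, texel, (0.0, 0.0, 1.0), (0.0, 0.0, 1.0)))
```

In a representation like this, rendering under a novel view or light amounts to calling `evaluate` with new directions, so generalization depends entirely on how well the latent codes and decoder were fitted from the 17 input frames.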

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We have carefully reviewed the major comments and provide point-by-point responses below, including clarifications on our current approach and commitments to revisions that strengthen the presentation of the finetuning mechanism and evaluation.

Referee: [Abstract / §3] Abstract and §3 (virtual gonioreflectometer description): the finetuning procedure is presented as imposing structured camera/lighting trajectories while preserving material realism, yet no explicit loss term, architectural constraint (e.g., material-constancy regularizer or geometry-free latent), or ablation is referenced to suppress residual geometry/lighting correlations inherited from pretraining. Without such a mechanism, the 17-frame videos may not supply measurements whose appearance variations are attributable solely to material parameters, undermining the downstream claim that the LRM recovers view- and light-independent BRDFs.

Authors: We appreciate the referee's observation on this critical aspect of disentanglement. The manuscript's finetuning strategy for the video model (Wan 2.1 14B) centers on generating videos from fixed material samples under explicitly controlled camera and lighting trajectories, which is intended to shift the model's focus toward material appearance variations while inheriting realism from pretraining. This data-driven structure serves as an implicit mechanism to reduce geometry and lighting correlations. However, we acknowledge that an explicit regularizer would provide stronger guarantees and address potential residual effects. In the revised manuscript, we will introduce a material-constancy loss term during finetuning that enforces consistency of inferred material parameters across frames of the same sample. We will also add an ablation study quantifying the impact of this term on view- and light-independence, to be included in §3 and the supplementary material.
revision: yes
Referee: [Abstract / Evaluation] Abstract and evaluation sections: the claims of 'realism and diversity far exceeding the limited synthetic training data' and successful generalization rest on qualitative statements; no error metrics, comparison baselines, or quantitative protocol for measuring generalization to novel viewing/lighting conditions are described. This absence makes it impossible to assess whether the LRM outputs are load-bearing improvements or merely reproductions of training-video cues.

Authors: We agree that quantitative metrics and a clear protocol are essential to rigorously support the claims of superior realism, diversity, and generalization to novel conditions. The original manuscript emphasizes qualitative visual comparisons to highlight the benefits of transferring knowledge from large-scale video models over purely synthetic training data. To directly address this limitation, the revised version will incorporate a quantitative evaluation section. This will define a protocol for testing generalization (using held-out camera/lighting trajectories and materials), report error metrics such as PSNR, SSIM, and LPIPS on novel-view and novel-light renderings, and include direct comparisons against LRM baselines trained on limited synthetic datasets. These additions will appear in the evaluation section along with corresponding tables and figures.
revision: yes
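
The material-constancy term mentioned in the first response is not specified in the simulated rebuttal. One plausible form, sketched here purely as an assumption, penalizes the variance of per-frame material parameter estimates across the frames of one sample, so the loss is zero exactly when every frame implies the same material. The function name and the list-of-lists input format are hypothetical.

```python
def material_constancy_loss(per_frame_params):
    """Mean variance across frames of each material parameter.

    per_frame_params: list of frames, each a list of floats of equal length.
    Returns 0.0 when every frame yields the same parameter vector.
    """
    num_frames = len(per_frame_params)
    dim = len(per_frame_params[0])
    if any(len(frame) != dim for frame in per_frame_params):
        raise ValueError("all frames must have the same number of parameters")
    total = 0.0
    for d in range(dim):
        values = [frame[d] for frame in per_frame_params]
        mean = sum(values) / num_frames
        total += sum((v - mean) ** 2 for v in values) / num_frames
    return total / dim
```

A variance penalty is only one option; it treats all parameters as equally important and says nothing about whether the shared material is realistic, so it would complement rather than replace the reconstruction objective.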

Circularity Check

0 steps flagged

No circularity: derivation relies on external video models and independent LRM training

The paper describes a two-stage pipeline that first finetunes an external large video model (Wan 2.1 14B) on controlled trajectories to produce videos, then trains a separate LRM (from Wan 1.3B backbone) to invert those videos into neural material parameters. No equation or claim reduces the output parameters to a direct fit or redefinition of the input videos by construction; the generalization claim to novel views/lighting is presented as an empirical outcome of the LRM inference rather than a tautology. The abstract and pipeline description contain no self-citations that bear the central load, no ansatz smuggled via prior work, and no renaming of known results as new derivations. The process is self-contained against the external generative model and does not exhibit any of the enumerated circular patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The approach implicitly assumes that video diffusion models contain disentangleable material knowledge and that the LRM can invert the generation process without additional supervision.
