MAOAM: Unified Object and Material Selection with Vision-Language Models

Iliyan Georgiev; Jaden Park; Jason Kuen; Kangning Liu; Krishna Kumar Singh; Michael Fischer; Valentin Deschaintre; Yong Jae Lee

arxiv: 2606.04880 · v1 · pith:6CCEHK5Znew · submitted 2026-06-02 · 💻 cs.CV

Jaden Park, Valentin Deschaintre, Jason Kuen, Kangning Liu, Iliyan Georgiev, Krishna Kumar Singh, Yong Jae Lee, Michael Fischer

Pith reviewed 2026-06-28 11:03 UTC · model grok-4.3

classification 💻 cs.CV

keywords vision-language models · image segmentation · material selection · object selection · interactive image editing · multi-task learning · click and text prompts

The pith

MAOAM uses one vision-language model to select either objects or materials from text or click prompts and output pixel masks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MAOAM as a single framework that interprets user intent to select objects or materials in an image and produces accurate masks, whether the input is text, a click, or both. Prior VLM selection methods handled only objects and one input type, which restricts editing tasks such as changing all surfaces of one material. MAOAM pairs a VLM that encodes entities, attributes, and relations with a segmentation head that decodes to masks. Because no suitable material datasets existed, the authors built a pipeline that gathers images with material masks and uses VLMs to write rich descriptions, then trains the model on click-based selection, text-based selection, and an auxiliary visual question-answering task. The resulting model shows accurate results on varied scenes and gains further accuracy when text and clicks are supplied together at inference time even though training used only single-modality prompts.

Core claim

MAOAM is a unified selection framework built from a vision-language model and a segmentation head. The VLM interprets a user's object-level or material-level intent from text or click prompts and encodes the relevant visual entities, attributes, and spatial relations; the segmentation head converts the VLM's output token into a pixel-accurate mask. The model is trained with a multi-task objective on click and text selection plus an auxiliary VQA task. The VQA task is derived from VLM-generated material descriptions, which come from a scalable pipeline that collects real and synthetic images carrying material masks.
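
A multi-task objective of this kind is commonly written as a weighted sum of per-task losses. The form below is an illustrative assumption, not a formula taken from the paper: the symbols and the weights $\lambda$ are hypothetical, and the source does not state how the three terms are balanced.

```latex
\mathcal{L}_{\text{total}}
  = \lambda_{\text{click}}\,\mathcal{L}_{\text{click}}
  + \lambda_{\text{text}}\,\mathcal{L}_{\text{text}}
  + \lambda_{\text{vqa}}\,\mathcal{L}_{\text{vqa}}
```

Under this reading, $\mathcal{L}_{\text{click}}$ and $\mathcal{L}_{\text{text}}$ would be mask losses for click-prompted and text-prompted selection, and $\mathcal{L}_{\text{vqa}}$ a language-modelling loss on the auxiliary question-answering task.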

What carries the argument

A VLM that interprets selection intent for objects or materials, paired with a segmentation head that decodes the VLM's output token into a mask.

If this is right

The same model supports both object and material selection without separate networks.
Text-plus-click prompts at inference improve mask quality even though training used only uni-modal examples.
Material selection enables direct re-texturing or instance editing of all surfaces sharing one material.
Training with the auxiliary VQA task strengthens material understanding beyond pure segmentation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Embedding the model in photo-editing tools would let users change every instance of a chosen material in one operation rather than clicking each surface separately.
The same data-pipeline idea could be reused to add other selection criteria such as color families or lighting conditions if matching image collections exist.
Running the model on video frames or 3-D rendered scenes would test whether material selections remain coherent across time or viewpoint changes.

Load-bearing premise

The material descriptions produced by the VLM data-generation pipeline are accurate and semantically rich enough to train effective material-level selection.

What would settle it

Collect a held-out set of images with human-verified material region annotations and measure MAOAM's material selections against them. The claim fails if the selections match those regions at rates no higher than an object-only baseline or random guessing.
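
The comparison described here could be scored with intersection-over-union (IoU) between each predicted mask and the human-verified mask. The sketch below is a minimal illustration under stated assumptions: masks are binary nested lists, the function names are hypothetical, and the toy masks are invented for demonstration rather than taken from the paper.

```python
def mask_iou(pred, target):
    """Intersection-over-union of two binary masks given as nested lists."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Two empty masks are treated as perfect agreement by convention.
    return intersection / union if union else 1.0


def mean_iou(preds, targets):
    """Average IoU over paired lists of predicted and reference masks."""
    scores = [mask_iou(p, t) for p, t in zip(preds, targets)]
    return sum(scores) / len(scores)


# Toy example (invented data): one material spans two separate surfaces.
human_mask = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
]
# Hypothetical material-level prediction: covers both surfaces, misses one pixel.
material_pred = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
]
# Hypothetical object-only baseline: selects just the clicked object.
object_pred = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
]

material_score = mean_iou([material_pred], [human_mask])
baseline_score = mean_iou([object_pred], [human_mask])
print(f"material-level IoU: {material_score:.3f}")
print(f"object-only IoU:    {baseline_score:.3f}")
```

On a real held-out set, the test would be whether the material-level score exceeds the baseline score by a margin larger than the variation across images.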

Figures

Figures reproduced from arXiv: 2606.04880 by Iliyan Georgiev, Jaden Park, Jason Kuen, Kangning Liu, Krishna Kumar Singh, Michael Fischer, Valentin Deschaintre, Yong Jae Lee.

**Figure 1.** Our method, MAOAM, enables click- and text-based selection of both materials and objects via a single model with a unified interface. Given an input … [PITH_FULL_IMAGE:figures/full_fig_p001_1.png]

**Figure 2.** MAOAM architecture overview. Given an input image, MAOAM takes a task prompt specifying the selection criteria (i.e. objects or materials) alongside … [PITH_FULL_IMAGE:figures/full_fig_p003_2.png]

**Figure 3.** Example annotations. We show the images’ dense, per-pixel material annotations overlaid in the second row. We additionally show two versions of our … [PITH_FULL_IMAGE:figures/full_fig_p005_3.png]

**Figure 4.** Example of VQA generation with hard negative mining. The VLM … [PITH_FULL_IMAGE:figures/full_fig_p006_4.png]

**Figure 5.** Material selection. We compare our method against baselines on a material selection task, both click- and text-based (first two and last two rows, … [PITH_FULL_IMAGE:figures/full_fig_p007_5.png]

**Figure 6.** Object selection. We compare our method against several baselines on an object selection task. Materialistic neither supports text-based queries, nor … [PITH_FULL_IMAGE:figures/full_fig_p007_6.png]

**Figure 9.** Varying text prompts. The same visual prompt (star marker) is inter… [PITH_FULL_IMAGE:figures/full_fig_p008_9.png]

**Figure 10.** Disambiguation. Although both objects are yellow, MAOAM infers … [PITH_FULL_IMAGE:figures/full_fig_p008_10.png]

**Figure 11.** Mask quality. Our model performs well on intricate selection targets … [PITH_FULL_IMAGE:figures/full_fig_p008_11.png]

**Figure 12.** Editing. The selection output (masks displayed as insets) for click … [PITH_FULL_IMAGE:figures/full_fig_p008_12.png]

**Figure 14.** Joint object-material reasoning. MAOAM can switch between se… [PITH_FULL_IMAGE:figures/full_fig_p010_14.png]

**Figure 15.** Limitations. For the first image, the model fails to distinguish the … [PITH_FULL_IMAGE:figures/full_fig_p010_15.png]

Original abstract

Selection is a core operation in interactive image editing. To be practical, a user should be able to specify and disambiguate the desired selection region through either text or click-based interactions, and the system should support selecting not only objects but also other criteria, such as materials. Material-based selection is valuable for tasks like re-texturing surfaces or editing instances of a specific material. However, existing vision-language-model (VLM) based selection methods are object-centric and typically support a single interaction modality, limiting their applicability. In this work, we thus present Mask Any Object And Material (MAOAM), a unified selection framework that enables precise object and material-level selection across both text- and click-based interactions. MAOAM leverages a VLM with a segmentation head to produce pixel-accurate masks from user prompts: the VLM interprets the user's selection intent (object or material-level) and encodes visual entities, attributes, and spatial relations, while the segmentation head decodes the output token into a mask. A key challenge is the lack of material selection datasets with text annotations. We propose a scalable data generation pipeline: we collect real and synthetic images with material masks, and leverage VLMs to generate material descriptions with rich visual-semantics. We train MAOAM with a multi-task objective over click and text-based selection, along with an auxiliary VQA task derived from the material descriptions to facilitate deeper material understanding. Despite being trained with uni-modal prompts, our model exhibits an emergent improvement in selection when combining text and clicks at inference, enabling flexible image editing workflows. Experiments demonstrate accurate and coherent selections across diverse objects, materials, and interaction scenarios, highlighting robustness in practice.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

MAOAM unifies object and material selection in one VLM-plus-segmentation model with text and click support, but the material description pipeline has no reported accuracy checks.

The paper's main advance is a single framework that lets users select either objects or materials via text, clicks, or both, using a VLM backbone with a segmentation head. The authors build a data pipeline that takes images with material masks and has VLMs write descriptions, then train on click-based selection, text-based selection, and an auxiliary VQA task on those descriptions. The model shows an unexpected gain when text and clicks are used together at test time even though training was uni-modal.

This fills a real gap: earlier VLM selection work stayed object-centric and single-modality. The multi-task objective and the scalable data route are sensible engineering choices that make the unification feasible.

The soft spot is the one flagged in the stress-test. The material descriptions come from VLMs with no quantitative accuracy numbers, human agreement scores, or error analysis provided. If those texts contain hallucinations or miss visual detail, the material selection training and the claimed robustness rest on shaky ground. The abstract asserts accurate results across scenarios, but without the actual metrics, baselines, or ablation tables it is hard to gauge how much the method moves the needle.

This is for people building interactive editing tools or working on VLM segmentation. A reader focused on practical multi-modal selection would find the unification and the emergent behavior worth seeing. It should go to peer review because the core idea is new and the setup is reproducible in principle, even if the data validation needs tightening.

Referee Report

2 major / 2 minor

Summary. The paper presents MAOAM, a VLM-based framework with a segmentation head for unified object- and material-level selection in images. It supports both text and click prompts, introduces a scalable pipeline that uses VLMs to generate material descriptions from masked images, trains via a multi-task objective (selection + auxiliary VQA), and reports an emergent benefit from combined text+click prompts at inference along with accurate performance across diverse scenarios.

Significance. If the central claims hold, the work would be significant for interactive image editing by extending VLM selection beyond object-centric, single-modality methods to include material-level disambiguation with flexible interactions. The scalable data-generation pipeline and multi-task training that yields emergent cross-modal behavior are concrete strengths that could influence downstream editing tools.

major comments (2)

[Data Generation Pipeline] Data Generation Pipeline section: the central claim that the pipeline produces 'rich visual-semantics' sufficient for material-level selection and the auxiliary VQA task rests on unvalidated VLM-generated descriptions. No quantitative accuracy metric, human agreement study, or error analysis on the generated text is reported; if hallucinations or omissions are present, both the multi-task objective and the claimed material selection accuracy would be undermined.
[Experiments] Experiments section: the claim of 'accurate and coherent selections' and 'emergent improvement' when combining text and clicks is presented without reported ablations isolating the contribution of the auxiliary VQA task or the material-description quality, making it difficult to assess whether the multi-task objective is load-bearing for the reported robustness.

minor comments (2)

[Method] Notation for the segmentation head output token and its decoding into masks could be clarified with an explicit equation or diagram reference.
[Experiments] The abstract states results 'across diverse objects, materials, and interaction scenarios' but the experiments section would benefit from a table summarizing quantitative metrics (e.g., IoU, precision) against explicit baselines for both object and material tasks.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments below and commit to revisions that directly strengthen the validation and analysis sections.

Referee: [Data Generation Pipeline] Data Generation Pipeline section: the central claim that the pipeline produces 'rich visual-semantics' sufficient for material-level selection and the auxiliary VQA task rests on unvalidated VLM-generated descriptions. No quantitative accuracy metric, human agreement study, or error analysis on the generated text is reported; if hallucinations or omissions are present, both the multi-task objective and the claimed material selection accuracy would be undermined.

Authors: We agree that the current manuscript lacks direct quantitative validation of the VLM-generated material descriptions. While end-to-end selection performance provides indirect support, this does not substitute for explicit checks on description quality. In the revised version we will add a human agreement study on a sampled subset of generated descriptions together with an error analysis categorizing hallucinations, omissions, and attribute accuracy. This will be reported with inter-annotator agreement metrics. revision: yes
Referee: [Experiments] Experiments section: the claim of 'accurate and coherent selections' and 'emergent improvement' when combining text and clicks is presented without reported ablations isolating the contribution of the auxiliary VQA task or the material-description quality, making it difficult to assess whether the multi-task objective is load-bearing for the reported robustness.

Authors: We acknowledge the absence of targeted ablations on the auxiliary VQA task and on material-description quality. The reported results demonstrate overall performance and the emergent text+click behavior, yet do not isolate the contribution of each component. In revision we will add (i) an ablation removing the VQA loss and (ii) a controlled comparison using lower-quality or template-based descriptions, reporting the resulting changes in selection metrics and emergent behavior. revision: yes
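
The inter-annotator agreement promised in the first response could be reported with a chance-corrected statistic such as Cohen's kappa. The sketch below is one possible way to compute it, not the authors' stated protocol: the function name, the label set (`correct`, `hallucination`, `omission`), and the two annotators' labels are all hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if both annotators labelled independently
    # according to their own label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Invented judgements on six generated material descriptions.
annotator_a = ["correct", "correct", "hallucination",
               "omission", "correct", "hallucination"]
annotator_b = ["correct", "correct", "hallucination",
               "correct", "correct", "omission"]

kappa = cohens_kappa(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.3f}")
```

Kappa is used here instead of raw percent agreement because a dominant `correct` label would otherwise inflate the apparent agreement between annotators.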

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

The paper describes a training pipeline that uses external VLMs to generate material descriptions from images with masks, then trains a VLM+segmentation model on a multi-task objective (click/text selection + auxiliary VQA). No equations, fitted parameters, or predictions are shown to reduce to the inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The data-generation step relies on off-the-shelf VLMs rather than the model's own outputs, and experimental claims are presented as empirical results rather than derived identities. This matches the default case of an independent modeling effort.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based on abstract only; no explicit free parameters or invented entities are described. The framework assumes VLMs can generate usable material annotations and interpret selection intent correctly.

axioms (1)

domain assumption: VLMs can accurately interpret and encode selection intent for objects, materials, attributes, and spatial relations from text or click prompts
The core mechanism relies on the VLM to produce output tokens that the segmentation head decodes into masks.

[1] [1]

1994 , publisher=

An introduction to the conjugate gradient method without the agonizing pain , author=. 1994 , publisher=

1994

[2] [2]

International Conference on Computer Vision (ICCV) , month =

Qi, Lu and Kuen, Jason and Shen, Tiancheng and Gu, Jiuxiang and Guo, Weidong and Jia, Jiaya and Lin, Zhe and Yang, Ming-Hsuan , title =. International Conference on Computer Vision (ICCV) , month =

[3] [3]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Glamm: Pixel grounding large multimodal model , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

[4] [4]

and Chai, Yuning and Park, Dennis and Lee, Yong Jae , title =

Cai, Mu and Liu, Haotian and Mustikovela, Siva Karthik and Meyer, Gregory P. and Chai, Yuning and Park, Dennis and Lee, Yong Jae , title =. IEEE Conference on Computer Vision and Pattern Recognition , year =

[5] [5]

International conference on machine learning , pages=

Learning transferable visual models from natural language supervision , author=. International conference on machine learning , pages=. 2021 , organization=

2021

[6] [6]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , month =

Xia, Zhuofan and Han, Dongchen and Han, Yizeng and Pan, Xuran and Song, Shiji and Huang, Gao , title =. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , month =. 2024 , pages =

2024

[7] [7]

European Conference on Computer Vision , pages=

Psalm: Pixelwise segmentation with large multi-modal model , author=. European Conference on Computer Vision , pages=. 2025 , organization=

2025

[8] [8]

arXiv preprint arXiv:2503.06520 , year =

Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement , author =. arXiv preprint arXiv:2503.06520 , year =

Pith/arXiv arXiv

[9] [9]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Lisa: Reasoning segmentation via large language model , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

[10] [10]

ACM i3D , year=

Transforming a Non-Differentiable Rasterizer into a Differentiable One with Stochastic Gradient Estimation , author=. ACM i3D , year=

[11] [11]

Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) , month =

Hu, Rui and Zhu, Lianghui and Zhang, Yuxuan and Cheng, Tianheng and Liu, Lei and Liu, Heng and Ran, Longjin and Chen, Xiaoxin and Liu, Wenyu and Wang, Xinggang , title =. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) , month =. 2025 , pages =

2025

[12] [12]

2025 , eprint=

SAM 3: Segment Anything with Concepts , author=. 2025 , eprint=

2025

[13] [13]

arXiv:2304.02643 , year=

Segment Anything , author=. arXiv:2304.02643 , year=

Pith/arXiv arXiv

[14] [14]

arXiv preprint arXiv:2408.00714 , url=

SAM 2: Segment Anything in Images and Videos , author=. arXiv preprint arXiv:2408.00714 , url=

Pith/arXiv arXiv

[15] [15]

SoftwareX , volume=

ACORNS: An easy-to-use code generator for gradients and Hessians , author=. SoftwareX , volume=. 2022 , publisher=

2022

[16] [16]

1999 , publisher=

Numerical optimization , author=. 1999 , publisher=

1999

[17] [17]

Neural computation , volume=

Fast exact multiplication by the Hessian , author=. Neural computation , volume=. 1994 , publisher=

1994

[18]

Smooth interpretation. ACM Sigplan Notices, 2010.

[19]

Solving transcendental equations: the Chebyshev polynomial proxy and other numerical rootfinders, perturbation series, and oracles. 2014.

[20]

Backpropagation: Past and future. IEEE 1988 Int Conf Neural Networks, 1988.

[21]

The rendering equation. Proc. SIGGRAPH.

[22]

Learning with differentiable perturbed optimizers. NeurIPS.

[23]

Perturbation techniques in online learning and optimization. Perturbations, Optimization, and Statistics, 2016.

[24]

Caustics mapping: An image-space technique for real-time caustics. IEEE Trans. Vis. and Comp. Graph.

[25]

A versatile scene model with differentiable visibility applied to generative pose estimation. Proc. ICCV.

[26]

Soft rasterizer: A differentiable renderer for image-based 3d reasoning. Proc. ICCV.

[27]

OpenDR: An approximate differentiable renderer. Proc. ECCV.

[28]

Differentiable rendering: A survey. arXiv preprint arXiv:2006.12057.

[29]

Reparameterizing discontinuous integrands for differentiable rendering. ACM Trans. Graph.

[30]

Neural 3d mesh renderer. Proc. CVPR.

[31]

Li, Tzu-Mao and Aittala, Miika and Durand, Fr. Differentiable. ACM Trans. Graph.

[32]

Im2vec: Synthesizing vector graphics without vector supervision. Proc. CVPR.

[33]

Differentiable signed distance function rendering. ACM Trans. Graph. (Proc. SIGGRAPH).

[34]

Differentiable rendering with perturbed optimizers. NeurIPS.

[35]

Node Graph Optimization Using Differentiable Proxies. ACM SIGGRAPH.

[36]

Metappearance: Meta-Learning for Visual Appearance Reproduction. ACM Trans. Graph. (Proc. SIGGRAPH Asia).

[37]

Learning to Learn and Sample BRDFs. arXiv preprint arXiv:2210.03510.

[38]

Safe and effective importance sampling. J of the American Statistical Association.

[39]

Petersen, Felix and Bermano, Amit H. and Deussen, Oliver and Cohen-Or, Daniel.

[40]

GenDR: A Generalized Differentiable Renderer. Proc. CVPR.

[41]

General principles of antithetic variates. Mathematical proceedings of the Cambridge philosophical society.

[42]

Unbiased warped-area sampling for differentiable rendering. ACM Trans. Graph.

[43]

Belhe, Yash and Xu, Bing and Bangaru, Sai Praveen and Ramamoorthi, Ravi and Li, Tzu-Mao. Importance sampling. 2024.

[44]

Zhang, Cheng and Dong, Zhao and Doggett, Michael and Zhao, Shuang. Antithetic sampling for

[45]

A nonlinear primal-dual method for total variation-based image restoration. SIAM Scientific Computing, 1999.

[46]

Fast image deconvolution using hyper-Laplacian priors. Proc. NeurIPS.

[47]

Texture mapping progressive meshes. Proc. SIGGRAPH.

[48]

Geometric modeling in shape space. SIGGRAPH.

[49]

Zach, Christopher and Pock, Thomas and Bischof, Horst. A duality based approach for realtime

[50]

Anisotropic Huber-L1 Optical Flow. BMVC, 2009.

[51]

Learning recurrent neural networks with hessian-free optimization. ICML.

[52]

Revisiting natural gradient for deep networks. arXiv preprint arXiv:1301.3584.

[53]

Adahessian: An adaptive second order optimizer for machine learning. Proc. AAAI.

[54]

Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125.

[55]

Inverse global illumination: Recovering reflectance models of real scenes from photographs. Proc. SIGGRAPH.

[56]

Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

[57]

An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747.

[58]

Projective sampling for differentiable rendering of geometry. ACM Trans. Graph.

[59]

Optimally combining sampling techniques for Monte Carlo rendering. Proc. SIGGRAPH.

[60]

Robust Monte Carlo methods for light transport simulation. 1998.

[61]

Mitsuba 2: A retargetable forward and inverse renderer. ACM Trans. Graph.

[62]

Deep-learning the Latent Space of Light Transport. Comp. Graph. Forum.

[63]

A rapidly convergent descent method for minimization. The computer journal, 1963.

[64]

Learning Neural Light Transport. arXiv preprint arXiv:2006.03427.

[65]

Learning to predict 3d objects with an interpolation-based differentiable renderer. NeurIPS.

[66]

Unsupervised learning of shape and pose with differentiable point clouds. NeurIPS.

[67]

Kazemzadeh, Sahar and Ordonez, Vicente and Matten, Mark and Berg, Tamara. ReferItGame: Referring to Objects in Photographs of Natural Scenes. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.

[68]

Dist: Rendering deep implicit signed distance function with differentiable sphere tracing. Proc. CVPR.

[69]

V-net: Fully convolutional neural networks for volumetric medical image segmentation. 3DV.

[70]

Inverse rendering for complex indoor scenes: Shape, spatially-varying lighting and svbrdf from a single image. Proc. CVPR.

[71]

Polak, Elijah and Ribiere, Gerard. Note sur la convergence de m. 1969.

[72]

Eikonal Fields for Refractive Novel-View Synthesis. ACM SIGGRAPH.

[73]

Sdfdiff: Differentiable rendering of signed distance fields for 3d shape optimization. Proc. CVPR.

[74]

Escaping plato's cave using adversarial training: 3d shape from unstructured 2d image collections. Proc. ICCV.

[75]

Multi-view supervision for single-view reconstruction via differentiable ray consistency. Proc. CVPR.

[76]

Nerf: Representing scenes as neural radiance fields for view synthesis. Comm ACM.

[77]

Xing, Jiankai and Luan, Fujun and Yan, Ling-Qi and Hu, Xuejun and Qian, Houde and Xu, Kun. Differentiable Rendering using

[78]

Variational optimization. arXiv preprint arXiv:1212.4507.

[79]

Optimization by Variational Bounding. ESANN.

[80]

Do differentiable simulators give better policy gradients? ICML, 2022.