Recognition: 2 Lean theorem links
DailyArt: Discovering Articulation from Single Static Images via Latent Dynamics
Pith reviewed 2026-05-10 18:13 UTC · model grok-4.3
The pith
DailyArt estimates all joint parameters of articulated objects from a single closed-state image by synthesizing an opened state under the same view and comparing the two.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DailyArt formulates articulated joint estimation from a single static image as a synthesis-mediated reasoning problem. Instead of directly regressing joints from a heavily occluded observation, the method first synthesizes a maximally articulated opened state under the same camera view to expose articulation cues, then estimates the full set of joint parameters from the discrepancy between the observed and synthesized states using a set-prediction formulation that recovers all joints simultaneously without object-specific templates, multi-view inputs, or explicit part annotations at test time.
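A minimal sketch of the two-stage pipeline as stated in the claim, assuming a pretrained latent dynamics model for the synthesis stage and a query-based set-prediction head for the estimation stage; the module interfaces, the channel-concatenation discrepancy, and the nine-number joint encoding (axis, origin, limits, existence) are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of synthesis-mediated joint estimation (not the paper's code).
import torch
import torch.nn.functional as F

def estimate_joints(closed_image: torch.Tensor,
                    dynamics_model: torch.nn.Module,
                    joint_head: torch.nn.Module):
    """closed_image: (3, H, W) RGB observation of the object in its closed state."""
    # Stage 1: synthesize the maximally articulated opened state under the same view.
    opened_image = dynamics_model(closed_image.unsqueeze(0))            # (1, 3, H, W)

    # Stage 2: a discrepancy signal between observed and synthesized states;
    # plain channel concatenation here, though a feature- or flow-based form
    # is equally plausible.
    pair = torch.cat([closed_image.unsqueeze(0), opened_image], dim=1)  # (1, 6, H, W)

    # Set prediction: a fixed budget of joint queries, each decoded into joint
    # parameters plus an existence score, so all joints come out at once with
    # no canonical ordering and no object-specific template.
    queries = joint_head(pair)                          # (1, K, 9) for K query slots
    axis      = F.normalize(queries[..., 0:3], dim=-1)  # unit axis direction
    origin    = queries[..., 3:6]                       # a point on the axis
    limits    = queries[..., 6:8]                       # motion range [lo, hi]
    existence = queries[..., 8].sigmoid()               # per-query confidence
    return axis, origin, limits, existence
```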
What carries the argument
A synthesis-mediated reasoning step that generates a maximally articulated opened state from the input closed-state image to expose hidden articulation cues, followed by set prediction of the complete joint set from the discrepancy between the two states.
Load-bearing premise
A maximally articulated opened state can be reliably synthesized from the single closed-state image under the same camera view, and the discrepancy between observed and synthesized states directly yields accurate joint parameters.
What would settle it
Compare estimated joints against ground-truth annotations on a dataset of articulated objects by checking whether applying the joints to the closed image produces a synthesized open state that matches independent multi-view captures or real opened photographs of the same instances.
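A minimal sketch of the forward check this describes, assuming the standard revolute/prismatic joint parameterization (axis direction, a point on the axis, a scalar state); the function name and a Chamfer-style comparison against opened-state captures are illustrative choices, not the paper's evaluation protocol.

```python
# Hypothetical forward kinematic check: apply an estimated joint to its moving
# part and compare the implied opened configuration with an independent capture.
import numpy as np

def apply_joint(points, joint_type, axis, origin, value):
    """points: (N, 3) part points in the closed state; value: radians or metres."""
    axis = axis / np.linalg.norm(axis)
    if joint_type == "prismatic":
        return points + value * axis
    # Revolute: Rodrigues' rotation of the part about the axis through `origin`.
    p = points - origin
    c, s = np.cos(value), np.sin(value)
    rotated = p * c + np.cross(axis, p) * s + axis * (p @ axis)[:, None] * (1.0 - c)
    return rotated + origin

# The estimated joint is supported if apply_joint(...) lands the part close
# (e.g. in Chamfer distance) to a point cloud recovered from multi-view or
# real opened-state photographs of the same instance.
```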
Original abstract
Articulated objects are essential for embodied AI and world models, yet inferring their kinematics from a single closed-state image remains challenging because crucial motion cues are often occluded. Existing methods either require multi-state observations or rely on explicit part priors, retrieval, or other auxiliary inputs that partially expose the structure to be inferred. In this work, we present DailyArt, which formulates articulated joint estimation from a single static image as a synthesis-mediated reasoning problem. Instead of directly regressing joints from a heavily occluded observation, DailyArt first synthesizes a maximally articulated opened state under the same camera view to expose articulation cues, and then estimates the full set of joint parameters from the discrepancy between the observed and synthesized states. Using a set-prediction formulation, DailyArt recovers all joints simultaneously without requiring object-specific templates, multi-view inputs, or explicit part annotations at test time. Taking estimated joints as conditions, the framework further supports part-level novel state synthesis as a downstream capability. Extensive experiments show that DailyArt achieves strong performance in articulated joint estimation and supports part-level novel state synthesis conditioned on joints. Project page is available at https://rangooo123.github.io/DaliyArt.github.io/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces DailyArt, a method for inferring all articulation joint parameters (axes, limits, states) from a single static image of an object in a closed state. It first uses a latent dynamics model to synthesize a maximally articulated open state under the identical camera view, then recovers the full set of joints simultaneously from the image discrepancy via a set-prediction network. The approach claims to operate without object-specific templates, multi-view inputs, or explicit part annotations at test time, and extends to part-level novel state synthesis conditioned on the recovered joints. Experiments report strong performance on joint estimation and synthesis tasks.
Significance. If the synthesized open states are geometrically valid and the discrepancy signal reliably encodes true kinematic parameters rather than dataset biases, the synthesis-mediated formulation offers a template-free route to single-image articulation discovery that could scale better than multi-view or retrieval-based alternatives for embodied AI and world modeling. The set-prediction treatment of variable joint sets is a technically clean choice that avoids ordering assumptions common in prior regression approaches.
major comments (3)
- [§3.2] §3.2 (Latent Dynamics Synthesis): The central claim that the synthesized open state exposes accurate articulation cues rests on the assumption that the dynamics model produces physically plausible configurations; however, the manuscript provides no explicit kinematic constraints, cycle-consistency losses, or physical plausibility regularizers on the synthesis step, leaving open the possibility that the model learns category-typical appearance changes instead of true joint dynamics. This directly undermines the downstream discrepancy-to-joint inversion.
- [§4.3] §4.3 (Ablation Studies): The ablation on synthesis quality (e.g., replacing the dynamics model with a simple image translation baseline) is absent; without it, it is impossible to isolate whether performance gains come from the discrepancy signal or from implicit training-time supervision on articulated data distributions, which is load-bearing for the no-test-time-annotation claim.
- [Table 2] Table 2 (Joint Estimation Metrics): The reported metrics do not include per-joint axis-angle error or limit accuracy breakdowns stratified by occlusion level; aggregate mAP alone does not confirm that the set-prediction head inverts the discrepancy into geometrically correct parameters rather than fitting appearance correlations.
minor comments (3)
- [Figure 4] Figure 4: The qualitative examples of synthesized open states would be clearer with overlaid joint axes and ground-truth open images for direct visual comparison of geometric fidelity.
- [§2] §2 (Related Work): The discussion of prior single-image articulation methods omits recent works on implicit kinematic priors in NeRF-style representations; adding these would better situate the synthesis-mediated contribution.
- [§3.1] Notation in §3.1: The definition of the discrepancy map D(I_closed, I_open) should explicitly state whether it is a pixel-wise difference, a feature-space difference, or an optical-flow field, as this choice determines the input to the set-prediction head; the three candidates are sketched below.
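For concreteness, the three candidate definitions the comment distinguishes could look as follows; the helper names (encoder, flow_net) and the cosine-distance feature variant are assumptions, since the manuscript does not say which form it uses.

```python
# Hypothetical discrepancy-map variants; the paper's actual choice is unspecified.
import torch
import torch.nn.functional as F

def discrepancy(closed, opened, mode="feature", encoder=None, flow_net=None):
    """closed, opened: (1, 3, H, W) closed-state and synthesized opened-state images."""
    if mode == "pixel":
        return (opened - closed).abs()                  # per-pixel intensity change
    if mode == "feature":
        f0, f1 = encoder(closed), encoder(opened)       # (1, C, h, w) frozen backbone features
        return 1.0 - F.cosine_similarity(f0, f1, dim=1).unsqueeze(1)
    if mode == "flow":
        return flow_net(closed, opened)                 # (1, 2, H, W) displacement field
    raise ValueError(f"unknown mode: {mode}")
```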
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We address each major comment below, indicating where revisions will be made to strengthen the manuscript.
Point-by-point responses
-
Referee: [§3.2] §3.2 (Latent Dynamics Synthesis): The central claim that the synthesized open state exposes accurate articulation cues rests on the assumption that the dynamics model produces physically plausible configurations; however, the manuscript provides no explicit kinematic constraints, cycle-consistency losses, or physical plausibility regularizers on the synthesis step, leaving open the possibility that the model learns category-typical appearance changes instead of true joint dynamics. This directly undermines the downstream discrepancy-to-joint inversion.
Authors: We agree that the absence of explicit kinematic constraints leaves room for the model to potentially capture appearance correlations rather than pure dynamics. The latent dynamics model is trained end-to-end with reconstruction and adversarial objectives on paired closed-open image data, which provides implicit supervision for plausible articulations. To address the concern directly, we will revise §3.2 to clarify the training objectives and incorporate a cycle-consistency loss between synthesized states and the original input to better enforce kinematic consistency. revision: yes
-
Referee: [§4.3] §4.3 (Ablation Studies): The ablation on synthesis quality (e.g., replacing the dynamics model with a simple image translation baseline) is absent; without it, it is impossible to isolate whether performance gains come from the discrepancy signal or from implicit training-time supervision on articulated data distributions, which is load-bearing for the no-test-time-annotation claim.
Authors: We acknowledge that this ablation is missing and would help isolate the contribution of the dynamics model versus general image translation. We will add the requested ablation in §4.3, replacing the latent dynamics synthesis with a standard image-to-image translation baseline (e.g., pix2pix-style) and reporting the resulting joint estimation performance to demonstrate that the discrepancy signal from dynamics synthesis is key. revision: yes
-
Referee: [Table 2] Table 2 (Joint Estimation Metrics): The reported metrics do not include per-joint axis-angle error or limit accuracy breakdowns stratified by occlusion level; aggregate mAP alone does not confirm that the set-prediction head inverts the discrepancy into geometrically correct parameters rather than fitting appearance correlations.
Authors: The set-prediction formulation makes per-joint stratification by occlusion non-trivial, as joints are predicted as an unordered set without fixed ordering or per-instance occlusion labels. We will add axis-angle error and limit accuracy metrics to Table 2 in the revision. However, full occlusion-stratified breakdowns would require additional annotations not present in the dataset; we will instead report errors on subsets with varying visibility where feasible and discuss this limitation. revision: partial
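A minimal sketch of the per-joint metrics promised for Table 2, under assumed definitions: a sign-insensitive angular error between axis directions and a tolerance-based hit rate on motion limits. Matching the unordered set-prediction output to ground-truth joints would additionally need an assignment step (e.g. Hungarian matching), which is omitted here.

```python
# Hypothetical per-joint metrics; predicted and ground-truth joints are assumed
# to be already matched one-to-one.
import numpy as np

def axis_angle_error_deg(d_pred, d_gt):
    """Angular error between two axis directions, ignoring sign flips."""
    d_pred = d_pred / np.linalg.norm(d_pred)
    d_gt = d_gt / np.linalg.norm(d_gt)
    return np.degrees(np.arccos(np.clip(abs(d_pred @ d_gt), 0.0, 1.0)))

def limit_accuracy(limits_pred, limits_gt, tol=0.1):
    """limits_*: (J, 2) arrays of [lo, hi]; a joint counts as correct when both
    endpoints fall within `tol` (radians or metres) of ground truth."""
    ok = np.all(np.abs(limits_pred - limits_gt) <= tol, axis=1)
    return float(ok.mean())
```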
Circularity Check
No circularity: synthesis-mediated estimation remains an independent intermediate step
full rationale
The paper's core chain, synthesizing a maximally articulated opened state from a single closed image via latent dynamics and then recovering joints from the observed-synthesized discrepancy using set prediction, does not reduce to self-definition, to fitted inputs renamed as predictions, or to load-bearing self-citation. The abstract and available description position synthesis as an external reasoning aid that exposes cues, with no equations defining joint parameters in terms of the synthesis output and no construction in which the estimator inverts its own training signals. No uniqueness theorems or ansatzes are imported via self-citation, and the no-template/no-annotation claim at test time is presented as a consequence of the two-stage formulation rather than a renaming of known patterns. The derivation is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: A maximally articulated opened state can be synthesized from a single closed-state image under the same camera view to expose articulation cues.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (unclear)
Relation between the paper passage and the cited Recognition theorem is unclear.
DailyArt first synthesizes a maximally articulated opened state under the same camera view to expose articulation cues, and then estimates the full set of joint parameters from the discrepancy between the observed and synthesized states. Using a set-prediction formulation...
-
IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking (unclear)
Relation between the paper passage and the cited Recognition theorem is unclear.
We lift the synthesized image pair (I0, Î1) into dense 3D point-maps P0, P1 ... using VGGT ... motion seeds ... transformer-based estimator
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.