{"work":{"id":"e7fa8cee-3041-47a6-96ad-2f903f261b47","openalex_id":null,"doi":"10.1109/cvpr52688.2022.01042","arxiv_id":null,"raw_key":"raw:b18a13318603c4b1d9dec1f0","title":"In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition","authors":[{"given":"Robin","family":"Rombach","sequence":"first","affiliation":[{"name":"Ludwig Maximilian University of Munich &#x0026; IWR, Heidelberg University,Germany"}]},{"given":"Andreas","family":"Blattmann","sequence":"additional","affiliation":[{"name":"Ludwig Maximilian University of Munich &#x0026; IWR, Heidelberg University,Germany"}]},{"given":"Dominik","family":"Lorenz","sequence":"additional","affiliation":[{"name":"Ludwig Maximilian University of Munich &#x0026; IWR, Heidelberg University,Germany"}]},{"given":"Patrick","family":"Esser","sequence":"additional","affiliation":[{"name":"Runway ML"}]},{"given":"Bjorn","family":"Ommer","sequence":"additional","affiliation":[{"name":"Ludwig Maximilian University of Munich &#x0026; IWR, Heidelberg University,Germany"}]}],"authors_text":"Rombach, R","year":2022,"venue":"2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","abstract":null,"external_url":"https://doi.org/10.1109/cvpr52688.2022.01042","cited_by_count":13832,"metadata_source":"raw_reference","metadata_fetched_at":"2026-05-27T08:28:50.434004+00:00","pith_arxiv_id":null,"created_at":"2026-05-13T14:57:53.832888+00:00","updated_at":"2026-06-05T21:23:00.469572+00:00","title_quality_ok":false,"display_title":"In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition","render_title":"In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition"},"hub":{"state":{"work_id":"e7fa8cee-3041-47a6-96ad-2f903f261b47","tier":"super_hub","tier_reason":"100+ Pith inbound or 10,000+ external citations","pith_inbound_count":46,"external_cited_by_count":13832,"distinct_field_count":3,"first_pith_cited_at":"2024-01-15T07:50:18+00:00","last_pith_cited_at":"2026-05-20T17:52:10+00:00","author_build_status":"needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-06-10T16:56:49.011367+00:00","tier_text":"super_hub"},"tier":"super_hub","role_counts":[{"context_role":"background","n":9},{"context_role":"method","n":2}],"polarity_counts":[{"context_polarity":"background","n":9},{"context_polarity":"use_method","n":2}],"runs":{"ask_index":{"job_type":"ask_index","status":"succeeded","result":{"title":"In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition","claims":[{"claim_text":"globally consistent 3D scenes that remain stable under large viewpoint changes. fundamentally an ill-posed problem: Simple texts or image inputs fail to provide a comprehensive representation of the entire 3D space. Consequently, inferring massive amounts of missing information for unseen areas while maintaining ge- ometric consistency remains a significant challenge. Deep generative models, particularly diffusion models [13,17,34,35,37], ad- dress this by leveraging strong 2D visual priors. How","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"language model to guide the evolution process. Importantly, it works on black-box generation models by requiring only image outputs. Finally, we evaluate PromptEvolver across multiple prompt inversion benchmarks and show that it consistently outperforms competing methods. Keywords:Prompt inversion·Text to image generation 1 Introduction Text-to-image (T2I) diffusion models [21,35,48] have transformed visual con- tent creation, enabling users to generate photorealistic images from natural- langua","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"Thisphysics-basedreference ˆImv v guaranteesglobalilluminationconsistencyacross views but lacks photorealistic high-frequency details (e.g. specularities, sky tex- tures), so we use it as a structural guidance signal for the generative stage. Generative Refinement via IC-Light.We refineˆImv v with IC-Light [48], a re- lighting diffusion model adapted from Stable Diffusion [25]. While IC-Light pro- duces photorealistic lighting effects, applying it independently per view breaks multi-view consist","claim_type":"background","confidence":0.85,"evidence_strength":"citation_context"},{"claim_text":"Subsequent works further enhance controllability and semantic alignment, including Prompt-to-Prompt [11], DiffEdit [7], Imagic [18], Plug-and-Play Diffusion Features [43], and ControlNet [59]. More recent approaches explore richer instruction interfaces and multimodal reasoning, such as MGIE [9] and GenArtist [46], while subject-driven and compositional editing are studied in DreamBooth [35], Blended Diffusion [1], SDEdit [25], and image translation methods such as Detail Fusion GAN [ 20]. Comme","claim_type":"background","confidence":0.85,"evidence_strength":"citation_context"},{"claim_text":"To validate the effectiveness of our proposed Neural Simulation in recovering real-world data distributions from simulation, we consider the following set of diverse comparative approaches: 1) Classical Simulation(Sim), denoting the canonical raw simulation pipeline without neural-driven refinement; 2) Baseline, a video-to-video generation model built on Stable Diffusion 1.5 [39] with temporal continuity post-processing [54]; 3) Zero-Shot, referring to the backbone model deployed without any sim","claim_type":"background","confidence":0.8,"evidence_strength":"citation_context"},{"claim_text":"Several methods explicitly incorporate inpainting modules to hallucinate missing details in saturated re- gions [23,60,111]. However, when using limited-capacity generative models, the synthesized content often lacks realism or fine details. 2.3 Generative HDR Advancesingenerativemodeling,includingGANs[4,9,10,22,40,48-50,79,83,106] and diffusion models [3,16,31,34,39,67,74,88-90,96,102,105,107,108,112,113], have shown strong priors for image and video generation. Some approaches learn themapping","claim_type":"background","confidence":0.8,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (7 contexts).","role_counts":[{"n":7,"context_role":"background"},{"n":2,"context_role":"method"}]},"error":null,"updated_at":"2026-05-19T03:21:20.363226+00:00"},"author_expand":{"job_type":"author_expand","status":"succeeded","result":{"authors_linked":[{"id":"ff5f9003-34f5-4bb5-976d-2194e328b0d7","orcid":null,"display_name":"Robin Rombach"},{"id":"af305c1f-915b-4035-8a0b-28a18a2bc39f","orcid":null,"display_name":"Andreas Blattmann"},{"id":"2b89671d-89cd-442f-a731-9155a466b451","orcid":null,"display_name":"Dominik Lorenz"},{"id":"a2c4ddca-df43-4987-86eb-5c5593bdc4b2","orcid":null,"display_name":"Patrick Esser"},{"id":"4dda3e9b-8a12-48c8-97d3-a669c9e24799","orcid":null,"display_name":"Bjorn Ommer"}]},"error":null,"updated_at":"2026-05-19T03:21:29.307933+00:00"},"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-19T03:21:27.592298+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"Advances in neural information processing systems33, 6840–6851 (2020)","work_id":"effc8173-611c-4e0c-8d0b-066e3ae07f56","shared_citers":11},{"title":"In: International conference on machine learning","work_id":"69077c18-2906-43b3-9114-9a7539e22548","shared_citers":11},{"title":"Wan: Open and Advanced Large-Scale Video Generative Models","work_id":"ad3ebc3b-4224-46c9-b61d-bcf135da0a7c","shared_citers":9},{"title":"Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets","work_id":"4f68eada-27e3-437a-a2fe-6e4ca524d0d3","shared_citers":8},{"title":null,"work_id":"1b7735a2-669d-484f-a863-738e74836eb0","shared_citers":8},{"title":"Denoising Diffusion Implicit Models","work_id":"8fa2128b-d18c-405c-ac92-0e669cf89ac0","shared_citers":7},{"title":"Flow Matching for Generative Modeling","work_id":"6edb71c4-5d64-40af-a394-9757ea051a36","shared_citers":7},{"title":"CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer","work_id":"f38fc088-12aa-4bf4-9ecd-08d3e797ccb7","shared_citers":6},{"title":"In: Proceedings of the IEEE conference on computer vision and pattern recognition","work_id":"41330798-83e6-449d-917b-756edfcd6ba6","shared_citers":6},{"title":"In: Proceedings of the IEEE/CVF international conference on computer vision","work_id":"f45abe73-9b82-4c3d-a051-6138a362f6a0","shared_citers":6},{"title":"SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis","work_id":"8034c587-fba6-4941-87ba-c98f2ac962cb","shared_citers":6},{"title":"An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion","work_id":"ca618c21-3ba6-448e-bd86-bcecff3cdeb5","shared_citers":5},{"title":"Score-Based Generative Modeling through Stochastic Differential Equations","work_id":"d9110e53-a5d4-4794-a4c5-a575e91c31ad","shared_citers":5},{"title":"Auto-Encoding Variational Bayes","work_id":"97d95295-30e1-42b4-bbf6-85f0fa4edb44","shared_citers":4},{"title":"Commu- nications of the ACM65(1), 99–106 (2021)","work_id":"5ea2b2ea-f18b-4e34-b4af-69773a45258d","shared_citers":4},{"title":"Decoupled Weight Decay Regularization","work_id":"07ef7360-d385-4033-83f7-8384a6325204","shared_citers":4},{"title":"Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow","work_id":"a1989e1b-d66d-4533-be3a-fb9c5fd62290","shared_citers":4},{"title":"FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space","work_id":"5dfe19d5-3541-4803-8fe9-3c8b9e29b281","shared_citers":4},{"title":"HunyuanVideo: A Systematic Framework For Large Video Generative Models","work_id":"881efa7e-7e73-4c66-9cc3-2803e551061c","shared_citers":4},{"title":"Prompt-to-Prompt Image Editing with Cross Attention Control","work_id":"196f7eef-d65a-47e4-b815-9a188f6aedcf","shared_citers":4},{"title":"Qwen3-VL Technical Report","work_id":"1fe243aa-e3c0-4da6-b391-4cbcfc88d5c0","shared_citers":4},{"title":"Advances in neural information processing systems30(2017)","work_id":"2012eadf-82a4-46f1-937b-ac51258b0e43","shared_citers":3},{"title":"Advances in neural information processing systems34, 8780–8794 (2021)","work_id":"decd43f6-b6f0-482e-b0ae-677e50fdb68d","shared_citers":3},{"title":"Advances in neural information processing systems35, 8633– 8646 (2022)","work_id":"50554743-f5c9-4353-ade9-01a25c7a031b","shared_citers":3}],"time_series":[{"n":1,"year":2024},{"n":28,"year":2026}],"dependency_candidates":[{"n":1,"role":"method","polarity":"use_method","paper_title":"ReplicateAnyScene: Zero-Shot Video-to-3D Composition via Textual-Visual-Spatial Alignment","primary_cat":"cs.CV","context_text":"We then independently condition a diverse suite of generation models on multiple individual input views for the final selection of high-quality 3D assets. Finally, professional modelers manually place these assets against reference scene meshes to ensure accurate spatial layouts. occlusion, we apply generative completion [19] and super-resolution models [43] to restore missing details. We then manually select the highest-quality 3D as- set from these diverse candidates. Finally, professional 3D modelers manually align and place these selected assets against the reference scene meshes. This meticulous manual placement ensures physically accurate spatial locations and structural consistency in the final compositional layout.","citing_arxiv_id":"2604.10789"},{"n":1,"role":"method","polarity":"use_method","paper_title":"HairOrbit: Multi-view Aware 3D Hair Modeling from Single Portraits","primary_cat":"cs.CV","context_text":"minimizing the forgetting of its pre-learned real-world hair priors, we fine-tune the model with LoRA applied to the Q, K, V, O projections and the first and last linear layers (FFN.0 and FFN.2) of each transformer block. We prepare multi-view renderings from carefully selected typical 3D hairstyles, covering a wide range of variations in length, curliness, and partition. Following diffusion- based [23] video generation models, we optimize the LoRA parameters using a standard noise prediction loss. Given a clean latentx0 and a timestept∼ U(1, T), a noisy latentxt is obtained by the forward diffusion process: xt = √¯αt x0 + √ 1−¯αt ϵ,ϵ∼ N(0,I). (1) where¯αt denotes the cumulative noise schedule. The modelϵθ(xt, t,c)predicts the noiseϵconditioned onc, which in our case corresponds to the latent repre-","citing_arxiv_id":"2604.02867"}]},"error":null,"updated_at":"2026-05-19T03:21:29.338066+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"items":[{"title":"Qwen3 Technical Report","outcome":"unchanged","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"counts":{"fixed":0,"merged":0,"unchanged":1,"quarantined":0,"needs_external_resolution":0},"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-19T03:21:24.820876+00:00"},"role_polarity":{"job_type":"role_polarity","status":"succeeded","result":{"title":"In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition","claims":[{"claim_text":"globally consistent 3D scenes that remain stable under large viewpoint changes. fundamentally an ill-posed problem: Simple texts or image inputs fail to provide a comprehensive representation of the entire 3D space. Consequently, inferring massive amounts of missing information for unseen areas while maintaining ge- ometric consistency remains a significant challenge. Deep generative models, particularly diffusion models [13,17,34,35,37], ad- dress this by leveraging strong 2D visual priors. How","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"language model to guide the evolution process. Importantly, it works on black-box generation models by requiring only image outputs. Finally, we evaluate PromptEvolver across multiple prompt inversion benchmarks and show that it consistently outperforms competing methods. Keywords:Prompt inversion·Text to image generation 1 Introduction Text-to-image (T2I) diffusion models [21,35,48] have transformed visual con- tent creation, enabling users to generate photorealistic images from natural- langua","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"Thisphysics-basedreference ˆImv v guaranteesglobalilluminationconsistencyacross views but lacks photorealistic high-frequency details (e.g. specularities, sky tex- tures), so we use it as a structural guidance signal for the generative stage. Generative Refinement via IC-Light.We refineˆImv v with IC-Light [48], a re- lighting diffusion model adapted from Stable Diffusion [25]. While IC-Light pro- duces photorealistic lighting effects, applying it independently per view breaks multi-view consist","claim_type":"background","confidence":0.85,"evidence_strength":"citation_context"},{"claim_text":"Subsequent works further enhance controllability and semantic alignment, including Prompt-to-Prompt [11], DiffEdit [7], Imagic [18], Plug-and-Play Diffusion Features [43], and ControlNet [59]. More recent approaches explore richer instruction interfaces and multimodal reasoning, such as MGIE [9] and GenArtist [46], while subject-driven and compositional editing are studied in DreamBooth [35], Blended Diffusion [1], SDEdit [25], and image translation methods such as Detail Fusion GAN [ 20]. Comme","claim_type":"background","confidence":0.85,"evidence_strength":"citation_context"},{"claim_text":"To validate the effectiveness of our proposed Neural Simulation in recovering real-world data distributions from simulation, we consider the following set of diverse comparative approaches: 1) Classical Simulation(Sim), denoting the canonical raw simulation pipeline without neural-driven refinement; 2) Baseline, a video-to-video generation model built on Stable Diffusion 1.5 [39] with temporal continuity post-processing [54]; 3) Zero-Shot, referring to the backbone model deployed without any sim","claim_type":"background","confidence":0.8,"evidence_strength":"citation_context"},{"claim_text":"Several methods explicitly incorporate inpainting modules to hallucinate missing details in saturated re- gions [23,60,111]. However, when using limited-capacity generative models, the synthesized content often lacks realism or fine details. 2.3 Generative HDR Advancesingenerativemodeling,includingGANs[4,9,10,22,40,48-50,79,83,106] and diffusion models [3,16,31,34,39,67,74,88-90,96,102,105,107,108,112,113], have shown strong priors for image and video generation. Some approaches learn themapping","claim_type":"background","confidence":0.8,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (7 contexts).","role_counts":[{"n":7,"context_role":"background"},{"n":2,"context_role":"method"}]},"error":null,"updated_at":"2026-05-19T03:21:29.343513+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition","claims":[{"claim_text":"globally consistent 3D scenes that remain stable under large viewpoint changes. fundamentally an ill-posed problem: Simple texts or image inputs fail to provide a comprehensive representation of the entire 3D space. Consequently, inferring massive amounts of missing information for unseen areas while maintaining ge- ometric consistency remains a significant challenge. Deep generative models, particularly diffusion models [13,17,34,35,37], ad- dress this by leveraging strong 2D visual priors. How","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"language model to guide the evolution process. Importantly, it works on black-box generation models by requiring only image outputs. Finally, we evaluate PromptEvolver across multiple prompt inversion benchmarks and show that it consistently outperforms competing methods. Keywords:Prompt inversion·Text to image generation 1 Introduction Text-to-image (T2I) diffusion models [21,35,48] have transformed visual con- tent creation, enabling users to generate photorealistic images from natural- langua","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"Thisphysics-basedreference ˆImv v guaranteesglobalilluminationconsistencyacross views but lacks photorealistic high-frequency details (e.g. specularities, sky tex- tures), so we use it as a structural guidance signal for the generative stage. Generative Refinement via IC-Light.We refineˆImv v with IC-Light [48], a re- lighting diffusion model adapted from Stable Diffusion [25]. While IC-Light pro- duces photorealistic lighting effects, applying it independently per view breaks multi-view consist","claim_type":"background","confidence":0.85,"evidence_strength":"citation_context"},{"claim_text":"Subsequent works further enhance controllability and semantic alignment, including Prompt-to-Prompt [11], DiffEdit [7], Imagic [18], Plug-and-Play Diffusion Features [43], and ControlNet [59]. More recent approaches explore richer instruction interfaces and multimodal reasoning, such as MGIE [9] and GenArtist [46], while subject-driven and compositional editing are studied in DreamBooth [35], Blended Diffusion [1], SDEdit [25], and image translation methods such as Detail Fusion GAN [ 20]. Comme","claim_type":"background","confidence":0.85,"evidence_strength":"citation_context"},{"claim_text":"To validate the effectiveness of our proposed Neural Simulation in recovering real-world data distributions from simulation, we consider the following set of diverse comparative approaches: 1) Classical Simulation(Sim), denoting the canonical raw simulation pipeline without neural-driven refinement; 2) Baseline, a video-to-video generation model built on Stable Diffusion 1.5 [39] with temporal continuity post-processing [54]; 3) Zero-Shot, referring to the backbone model deployed without any sim","claim_type":"background","confidence":0.8,"evidence_strength":"citation_context"},{"claim_text":"Several methods explicitly incorporate inpainting modules to hallucinate missing details in saturated re- gions [23,60,111]. However, when using limited-capacity generative models, the synthesized content often lacks realism or fine details. 2.3 Generative HDR Advancesingenerativemodeling,includingGANs[4,9,10,22,40,48-50,79,83,106] and diffusion models [3,16,31,34,39,67,74,88-90,96,102,105,107,108,112,113], have shown strong priors for image and video generation. Some approaches learn themapping","claim_type":"background","confidence":0.8,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (7 contexts).","role_counts":[{"n":7,"context_role":"background"},{"n":2,"context_role":"method"}]},"error":null,"updated_at":"2026-05-19T03:21:27.595856+00:00"}},"summary":{"title":"In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition","claims":[{"claim_text":"globally consistent 3D scenes that remain stable under large viewpoint changes. fundamentally an ill-posed problem: Simple texts or image inputs fail to provide a comprehensive representation of the entire 3D space. Consequently, inferring massive amounts of missing information for unseen areas while maintaining ge- ometric consistency remains a significant challenge. Deep generative models, particularly diffusion models [13,17,34,35,37], ad- dress this by leveraging strong 2D visual priors. How","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"language model to guide the evolution process. Importantly, it works on black-box generation models by requiring only image outputs. Finally, we evaluate PromptEvolver across multiple prompt inversion benchmarks and show that it consistently outperforms competing methods. Keywords:Prompt inversion·Text to image generation 1 Introduction Text-to-image (T2I) diffusion models [21,35,48] have transformed visual con- tent creation, enabling users to generate photorealistic images from natural- langua","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"Thisphysics-basedreference ˆImv v guaranteesglobalilluminationconsistencyacross views but lacks photorealistic high-frequency details (e.g. specularities, sky tex- tures), so we use it as a structural guidance signal for the generative stage. Generative Refinement via IC-Light.We refineˆImv v with IC-Light [48], a re- lighting diffusion model adapted from Stable Diffusion [25]. While IC-Light pro- duces photorealistic lighting effects, applying it independently per view breaks multi-view consist","claim_type":"background","confidence":0.85,"evidence_strength":"citation_context"},{"claim_text":"Subsequent works further enhance controllability and semantic alignment, including Prompt-to-Prompt [11], DiffEdit [7], Imagic [18], Plug-and-Play Diffusion Features [43], and ControlNet [59]. More recent approaches explore richer instruction interfaces and multimodal reasoning, such as MGIE [9] and GenArtist [46], while subject-driven and compositional editing are studied in DreamBooth [35], Blended Diffusion [1], SDEdit [25], and image translation methods such as Detail Fusion GAN [ 20]. Comme","claim_type":"background","confidence":0.85,"evidence_strength":"citation_context"},{"claim_text":"To validate the effectiveness of our proposed Neural Simulation in recovering real-world data distributions from simulation, we consider the following set of diverse comparative approaches: 1) Classical Simulation(Sim), denoting the canonical raw simulation pipeline without neural-driven refinement; 2) Baseline, a video-to-video generation model built on Stable Diffusion 1.5 [39] with temporal continuity post-processing [54]; 3) Zero-Shot, referring to the backbone model deployed without any sim","claim_type":"background","confidence":0.8,"evidence_strength":"citation_context"},{"claim_text":"Several methods explicitly incorporate inpainting modules to hallucinate missing details in saturated re- gions [23,60,111]. However, when using limited-capacity generative models, the synthesized content often lacks realism or fine details. 2.3 Generative HDR Advancesingenerativemodeling,includingGANs[4,9,10,22,40,48-50,79,83,106] and diffusion models [3,16,31,34,39,67,74,88-90,96,102,105,107,108,112,113], have shown strong priors for image and video generation. Some approaches learn themapping","claim_type":"background","confidence":0.8,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (7 contexts).","role_counts":[{"n":7,"context_role":"background"},{"n":2,"context_role":"method"}]},"graph":{"co_cited":[{"title":"Advances in neural information processing systems33, 6840–6851 (2020)","work_id":"effc8173-611c-4e0c-8d0b-066e3ae07f56","shared_citers":11},{"title":"In: International conference on machine learning","work_id":"69077c18-2906-43b3-9114-9a7539e22548","shared_citers":11},{"title":"Wan: Open and Advanced Large-Scale Video Generative Models","work_id":"ad3ebc3b-4224-46c9-b61d-bcf135da0a7c","shared_citers":9},{"title":"Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets","work_id":"4f68eada-27e3-437a-a2fe-6e4ca524d0d3","shared_citers":8},{"title":null,"work_id":"1b7735a2-669d-484f-a863-738e74836eb0","shared_citers":8},{"title":"Denoising Diffusion Implicit Models","work_id":"8fa2128b-d18c-405c-ac92-0e669cf89ac0","shared_citers":7},{"title":"Flow Matching for Generative Modeling","work_id":"6edb71c4-5d64-40af-a394-9757ea051a36","shared_citers":7},{"title":"CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer","work_id":"f38fc088-12aa-4bf4-9ecd-08d3e797ccb7","shared_citers":6},{"title":"In: Proceedings of the IEEE conference on computer vision and pattern recognition","work_id":"41330798-83e6-449d-917b-756edfcd6ba6","shared_citers":6},{"title":"In: Proceedings of the IEEE/CVF international conference on computer vision","work_id":"f45abe73-9b82-4c3d-a051-6138a362f6a0","shared_citers":6},{"title":"SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis","work_id":"8034c587-fba6-4941-87ba-c98f2ac962cb","shared_citers":6},{"title":"An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion","work_id":"ca618c21-3ba6-448e-bd86-bcecff3cdeb5","shared_citers":5},{"title":"Score-Based Generative Modeling through Stochastic Differential Equations","work_id":"d9110e53-a5d4-4794-a4c5-a575e91c31ad","shared_citers":5},{"title":"Auto-Encoding Variational Bayes","work_id":"97d95295-30e1-42b4-bbf6-85f0fa4edb44","shared_citers":4},{"title":"Commu- nications of the ACM65(1), 99–106 (2021)","work_id":"5ea2b2ea-f18b-4e34-b4af-69773a45258d","shared_citers":4},{"title":"Decoupled Weight Decay Regularization","work_id":"07ef7360-d385-4033-83f7-8384a6325204","shared_citers":4},{"title":"Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow","work_id":"a1989e1b-d66d-4533-be3a-fb9c5fd62290","shared_citers":4},{"title":"FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space","work_id":"5dfe19d5-3541-4803-8fe9-3c8b9e29b281","shared_citers":4},{"title":"HunyuanVideo: A Systematic Framework For Large Video Generative Models","work_id":"881efa7e-7e73-4c66-9cc3-2803e551061c","shared_citers":4},{"title":"Prompt-to-Prompt Image Editing with Cross Attention Control","work_id":"196f7eef-d65a-47e4-b815-9a188f6aedcf","shared_citers":4},{"title":"Qwen3-VL Technical Report","work_id":"1fe243aa-e3c0-4da6-b391-4cbcfc88d5c0","shared_citers":4},{"title":"Advances in neural information processing systems30(2017)","work_id":"2012eadf-82a4-46f1-937b-ac51258b0e43","shared_citers":3},{"title":"Advances in neural information processing systems34, 8780–8794 (2021)","work_id":"decd43f6-b6f0-482e-b0ae-677e50fdb68d","shared_citers":3},{"title":"Advances in neural information processing systems35, 8633– 8646 (2022)","work_id":"50554743-f5c9-4353-ade9-01a25c7a031b","shared_citers":3}],"time_series":[{"n":1,"year":2024},{"n":28,"year":2026}],"dependency_candidates":[{"n":1,"role":"method","polarity":"use_method","paper_title":"ReplicateAnyScene: Zero-Shot Video-to-3D Composition via Textual-Visual-Spatial Alignment","primary_cat":"cs.CV","context_text":"We then independently condition a diverse suite of generation models on multiple individual input views for the final selection of high-quality 3D assets. Finally, professional modelers manually place these assets against reference scene meshes to ensure accurate spatial layouts. occlusion, we apply generative completion [19] and super-resolution models [43] to restore missing details. We then manually select the highest-quality 3D as- set from these diverse candidates. Finally, professional 3D modelers manually align and place these selected assets against the reference scene meshes. This meticulous manual placement ensures physically accurate spatial locations and structural consistency in the final compositional layout.","citing_arxiv_id":"2604.10789"},{"n":1,"role":"method","polarity":"use_method","paper_title":"HairOrbit: Multi-view Aware 3D Hair Modeling from Single Portraits","primary_cat":"cs.CV","context_text":"minimizing the forgetting of its pre-learned real-world hair priors, we fine-tune the model with LoRA applied to the Q, K, V, O projections and the first and last linear layers (FFN.0 and FFN.2) of each transformer block. We prepare multi-view renderings from carefully selected typical 3D hairstyles, covering a wide range of variations in length, curliness, and partition. Following diffusion- based [23] video generation models, we optimize the LoRA parameters using a standard noise prediction loss. Given a clean latentx0 and a timestept∼ U(1, T), a noisy latentxt is obtained by the forward diffusion process: xt = √¯αt x0 + √ 1−¯αt ϵ,ϵ∼ N(0,I). (1) where¯αt denotes the cumulative noise schedule. The modelϵθ(xt, t,c)predicts the noiseϵconditioned onc, which in our case corresponds to the latent repre-","citing_arxiv_id":"2604.02867"}]},"authors":[{"id":"af305c1f-915b-4035-8a0b-28a18a2bc39f","orcid":null,"display_name":"Andreas Blattmann","source":"manual","import_confidence":0.72},{"id":"4dda3e9b-8a12-48c8-97d3-a669c9e24799","orcid":null,"display_name":"Bjorn Ommer","source":"manual","import_confidence":0.72},{"id":"2b89671d-89cd-442f-a731-9155a466b451","orcid":null,"display_name":"Dominik Lorenz","source":"manual","import_confidence":0.72},{"id":"a2c4ddca-df43-4987-86eb-5c5593bdc4b2","orcid":null,"display_name":"Patrick Esser","source":"manual","import_confidence":0.72},{"id":"ff5f9003-34f5-4bb5-976d-2194e328b0d7","orcid":null,"display_name":"Robin Rombach","source":"manual","import_confidence":0.72}]}}