Not all noises are created equally: Diffusion noise selection and optimization. CoRR, abs/2407.14041
8 Pith papers cite this work. Polarity classification is still indexing.
citation-role and citation-polarity summary
verdicts: UNVERDICTED 8
roles: background 4
polarities: background 4
citing papers explorer
- TripVVT: A Large-Scale Triplet Dataset and a Coarse-Mask Baseline for In-the-Wild Video Virtual Try-On
A new large-scale triplet dataset and diffusion transformer model using coarse human masks deliver improved video virtual try-on quality and generalization in challenging real-world conditions.
- EyeMulator: Improving Code Language Models by Mimicking Human Visual Attention
EyeMulator augments CodeLLM fine-tuning loss with token weights derived from human eye-tracking scan paths, producing large gains on code translation and summarization across StarCoder, Llama-3.2 and DeepSeek-Coder.
- Generalised Symmetries and Swampland-Type Constraints from Charge Quantisation via Rational Homotopy Theory
Refines charge quantization via homotopy type A whose homotopy groups classify brane charges and homology groups classify higher-form symmetries, deriving swampland-like constraints that rule out noncompact gauge groups and non-nilpotent Lie algebras for field strengths.
- FASTER: Value-Guided Sampling for Fast RL
FASTER models multi-candidate denoising as an MDP and trains a value function to filter actions early, delivering the performance of full sampling at lower cost in diffusion RL policies.
- LABBench2: An Improved Benchmark for AI Systems Performing Biology Research
LABBench2 is a more challenging benchmark than LAB-Bench for assessing AI performance on biology research tasks, with frontier models showing accuracy drops of 26-46% across subtasks.
- GeneCS: Synthesizing Resource-Efficient Code Surgery for Arbitrary Quantum Stabilizer Codes
The GeneCS compiler reduces ancillary qubits and checks by over 85% on average for single- and cross-code logical operations on stabilizer codes while preserving error rates and scaling to over 10,000 qubits.
- Heterogeneous Scientific Foundation Model Collaboration
Eywa enables language-based agentic AI systems to collaborate with specialized scientific foundation models for improved performance on structured data tasks.
- CG-MLLM: Captioning and Generating 3D content via Multi-modal Large Language Models
CG-MLLM is a multimodal LLM using a Mixture-of-Transformer architecture with separate TokenAR and BlockAR components integrated with a pre-trained vision-language backbone and 3D VAE to enable 3D captioning and high-fidelity generation.