Neural Information Processing Systems
9 Pith papers cite this work.
citing papers explorer
- iTryOn: Mastering Interactive Video Virtual Try-On with Spatial-Semantic Guidance
  iTryOn is a video diffusion Transformer that injects spatial 3D hand guidance and semantic action captions to enable interactive garment replacement in videos.
- TCDA: Thread-Constrained Discourse-Aware Modeling for Conversational Sentiment Quadruple Analysis
  TCDA introduces TC-DAG to filter cross-thread noise while preserving temporal order and D-RoPE to align semantics across layers and reduce distance dilution, achieving state-of-the-art results on two DiaASQ benchmarks.
- Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts
  Loss-Free Balancing keeps expert loads balanced in MoE models by dynamically adjusting routing-score biases based on recent usage, avoiding auxiliary-loss interference and yielding better performance.
- Your Absorbing Discrete Diffusion Secretly Models the Conditional Distributions of Clean Data
  Absorbing discrete diffusion models the conditional distributions of clean data; reparameterizing yields a time-independent RADD that unifies with AO-ARMs and reaches SOTA perplexity among diffusion models on zero-shot language benchmarks.
- Attend Locally, Remember Linearly: Linear Attention as Cross-Frame Memory for Autoregressive Video Diffusion
  ARL2 replaces quadratic cross-frame attention in AR video diffusion with a fixed-size recurrent state, achieving linear-time scaling and constant memory while preserving quality.
- Self-Pruned Key-Value Attention: Learning When to Write by Predicting Future Utility
  SP-KV trains a utility predictor jointly with the LLM to dynamically prune low-utility KV cache entries, achieving 3-10x memory reduction during generation with negligible performance loss.
- Spectral structural distortion reveals redundant neurons in neural networks
  A graph-spectral importance score based on layer-wise structural distortion between pre- and post-activation neuron graphs identifies removable neurons for iterative pruning without intermediate updates, followed by recovery fine-tuning.
- LaplacianFormer: Rethinking Linear Attention with Laplacian Kernel
  LaplacianFormer uses a Laplacian kernel with an injective feature map and efficient approximations to achieve linear attention that preserves mid-range interactions better than Gaussian-based linear attention in vision transformers.
- Unified Pix Token And Word Token Generative Language Model
  A new model unifies per-pixel and word tokens in a generative language model with per-pixel embeddings, color folding, and unsupervised image pretraining, reporting good performance on small models with limited data.