VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conversational benchmarks.
hub
European conference on computer vision , pages=
16 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
years
2026 16roles
background 1polarities
unclear 1representative citing papers
Training on automatically generated hard negative captions improves vision-language models' zero-shot detection of fine-grained image-text mismatches and robustness to noisy inputs.
ProjLens shows that backdoor parameters in MLLMs are encoded in low-rank subspaces of the projector and that embeddings shift toward the target direction with magnitude linear in input norm, activating only on poisoned samples.
SafeDiffusion-R1 uses online GRPO with CLIP embedding steering to cut inappropriate content from 48.9% to 18.07% and nudity detections from 646 to 15 in diffusion models while raising GenEval scores from 42.08% to 47.83% and generalizing across seven harm categories without supervised pairs or extra
MambaPanoptic is a fully Mamba-based panoptic segmentation model that uses MambaFPN for multi-scale features and a QuadMamba kernel generator to outperform PanopticDeepLab and PanopticFCN on Cityscapes and COCO while using fewer parameters than Mask2Former.
VPD-100K is a large-scale fine-grained visual privacy dataset with 100k images and 33 classes, accompanied by a frequency-domain attention module that improves detection on image and video benchmarks.
Head Similarity extends identity recognition to structured whole-head similarity by capturing intra-identity appearance variations via hierarchical supervision on a weakly-labeled video benchmark.
Latent diffusion models exhibit geometric decoupling where curvature in out-of-distribution generation is misallocated to unstable semantic boundaries instead of image details, identifying geometric hotspots as the structural cause of editing instability.
STAR-IOD applies scale-decoupled topology alignment and K-Means-based pseudo-label refinement to reduce catastrophic forgetting in remote sensing incremental object detection, reporting 1.7% and 2.1% mAP gains on new DIOR-IOD and DOTA-IOD datasets.
Near-reversible Runge-Kutta diffusion ODE solvers with vector-field smoothing improve stability and edit fidelity for large changes in text-guided image editing compared to exactly reversible alternatives.
SSL clustering is derived as KL-divergence optimization where a teacher-distribution constraint normalizes via inverse cluster priors and simplifies to batch centering by Jensen's inequality.
APEX is an assumption-free image quality metric using Sliced Wasserstein Distance on CLIP and DINOv2 embeddings that claims superior robustness to degradations and cross-dataset stability.
Stochastic image enhancement methods are shown to be variants of a shared SDE differing in drift, diffusion, terminal distributions and boundary conditions, with controlled experiments revealing no single dominant family and a new modular library released.
Authors call for contamination-resistant LLM benchmarks that exploit Transformer training-inference asymmetry and require new mathematical methods for cross-architecture interoperability.
citing papers explorer
-
VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing
VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conversational benchmarks.
-
HNC: Leveraging Hard Negative Captions towards Models with Fine-Grained Visual-Linguistic Comprehension Capabilities
Training on automatically generated hard negative captions improves vision-language models' zero-shot detection of fine-grained image-text mismatches and robustness to noisy inputs.
-
ProjLens: Unveiling the Role of Projectors in Multimodal Model Safety
ProjLens shows that backdoor parameters in MLLMs are encoded in low-rank subspaces of the projector and that embeddings shift toward the target direction with magnitude linear in input norm, activating only on poisoned samples.
-
SafeDiffusion-R1: Online Reward Steering for Safe Diffusion Post-Training
SafeDiffusion-R1 uses online GRPO with CLIP embedding steering to cut inappropriate content from 48.9% to 18.07% and nudity detections from 646 to 15 in diffusion models while raising GenEval scores from 42.08% to 47.83% and generalizing across seven harm categories without supervised pairs or extra
-
MambaPanoptic: A Vision Mamba-based Structured State Space Framework for Panoptic Segmentation
MambaPanoptic is a fully Mamba-based panoptic segmentation model that uses MambaFPN for multi-scale features and a QuadMamba kernel generator to outperform PanopticDeepLab and PanopticFCN on Cityscapes and COCO while using fewer parameters than Mask2Former.
-
VPD-100K: Towards Generalizable and Fine-grained Visual Privacy Protection
VPD-100K is a large-scale fine-grained visual privacy dataset with 100k images and 33 classes, accompanied by a frequency-domain attention module that improves detection on image and video benchmarks.
-
Head Similarity: Modeling Structured Whole-Head Appearance Beyond Face Recognition
Head Similarity extends identity recognition to structured whole-head similarity by capturing intra-identity appearance variations via hierarchical supervision on a weakly-labeled video benchmark.
-
Geometric Decoupling: Diagnosing the Structural Instability of Latent
Latent diffusion models exhibit geometric decoupling where curvature in out-of-distribution generation is misallocated to unstable semantic boundaries instead of image details, identifying geometric hotspots as the structural cause of editing instability.
-
STAR-IOD: Scale-decoupled Topology Alignment with Pseudo-label Refinement for Remote Sensing Incremental Object Detection
STAR-IOD applies scale-decoupled topology alignment and K-Means-based pseudo-label refinement to reduce catastrophic forgetting in remote sensing incremental object detection, reporting 1.7% and 2.1% mAP gains on new DIOR-IOD and DOTA-IOD datasets.
-
Stable and Near-Reversible Diffusion ODE Solvers for Image Editing
Near-reversible Runge-Kutta diffusion ODE solvers with vector-field smoothing improve stability and edit fidelity for large changes in text-guided image editing compared to exactly reversible alternatives.
-
Information theoretic underpinning of self-supervised learning by clustering
SSL clustering is derived as KL-divergence optimization where a teacher-distribution constraint normalizes via inverse cluster priors and simplifies to batch centering by Jensen's inequality.
-
APEX: Assumption-free Projection-based Embedding eXamination Metric for Image Quality Assessment
APEX is an assumption-free image quality metric using Sliced Wasserstein Distance on CLIP and DINOv2 embeddings that claims superior robustness to degradations and cross-dataset stability.
-
Unifying Deep Stochastic Processes for Image Enhancement
Stochastic image enhancement methods are shown to be variants of a shared SDE differing in drift, diffusion, terminal distributions and boundary conditions, with controlled experiments revealing no single dominant family and a new modular library released.
-
LLM Benchmark Datasets Should Be Contamination-Resistant
Authors call for contamination-resistant LLM benchmarks that exploit Transformer training-inference asymmetry and require new mathematical methods for cross-architecture interoperability.
- Diagnosing and Correcting Concept Omission in Multimodal Diffusion Transformers
- VIDA: A dataset for Visually Dependent Ambiguity in Multimodal Machine Translation