OP2GS adds instance identities and dual opacities to 3D Gaussians so that visual rendering and object-mask rendering are handled by separate opacity channels, reducing label contamination while attaching semantics at the object level.
hub Mixed citations
Learning transferable visual models from natural language supervision
Mixed citation behavior. Most common role is background (57%).
hub tools
citation-role summary
citation-polarity summary
representative citing papers
SSDA uses spectral magnitude alignment and structural-guided low-rank adaptation to close frequency and adjacency gaps when large vision models process time series rendered as images.
A training-free adaptive subspace projection method mitigates semantic collapsing in generative personalization by isolating and adjusting drift in a low-dimensional subspace using the stable pre-trained embedding as anchor.
Unimodal model representations converge to a relational structure captured by the Indra representation via V-enriched Yoneda embedding, which is unique and structure-preserving and improves cross-model and cross-modal robustness when instantiated with angular distance.
Averaging and temporally interpolating text latents in VLAs enables 83% success on novel task combinations in the libero-ood benchmark where SOTA models achieve under 15%.
Contrastive pretraining on mammography atlas image-text pairs improves BI-RADS classification F1 by 1-14% especially in low-label regimes, outperforming equivalent numbers of direct labels in some settings.
RAE v2 reaches gFID 1.06 on ImageNet-256 in 80 epochs by combining multi-layer encoder sums, complementary REPA targets, and free guidance via output reparameterization.
DiRotQ uses PCA-based rotation-aware activation quantization combined with GPTQ to achieve better FID and PSNR in 4-bit diffusion transformers than prior methods like SVDQuant.
h-control augments hard-replacement guidance with block-conditional pseudo-Gibbs refinement on unobserved latent sites and adaptive 3D patch freezing to achieve superior FVD on RealEstate10K and DAVIS.
BAF reduces memorization in diffusion LoRAs by filtering spectral channels of the adaptation weights that show weak alignment with the base model's principal subspace.
The work introduces uncertainty-aware foundation models for clinical data by learning set-valued patient representations that enforce consistency across partial observations and integrate multimodal self-supervised objectives.
HiFi-Inpaint delivers state-of-the-art detail-preserving human-product images by adding Shared Enhancement Attention and Detail-Aware Loss to reference-based inpainting on a new 40K dataset.
OPAD enables reliable high-quality personalization of one-step diffusion models via multi-step teacher distillation combined with adversarial alignment losses.
ViGoRL introduces visually grounded RL that anchors reasoning steps to image coordinates and uses multi-turn zooming to outperform standard RL and supervised baselines on spatial and GUI reasoning benchmarks.
PMC-VQA dataset and MedVInT model achieve better generative performance on medical VQA benchmarks by visual instruction tuning on a newly constructed large-scale dataset.
MONET is an open 104.9M image-text pair dataset created via safety filtering, deduplication, and multi-VLM recaptioning from 2.9B raw pairs, validated by training a competitive 4B-parameter latent diffusion model.
CAST selects better multimodal coresets by fusing collapse-aware topologies across modalities and matching distributions at multiple scales in the diffusion wavelet domain.
A per-class loss reweighting scheme based on distributional robustness allows CLIP models to perform class-incremental and domain-incremental learning with minimal memory while limiting forgetting on CIFAR-100, ImageNet1K, and DomainNet.
Continuous diffusion spoken language models follow scaling laws for loss and phoneme divergence and generate emotive multi-speaker speech at 16B scale, though long-form coherence stays difficult.
Flow matching generative models preserve sample quality, diversity, and latent representations despite pruning 50% of the CelebA-HQ dataset or altering architecture and training configurations.
Random label bridge training aligns LLM parameters with vision tasks, and partial training of certain layers often suffices due to their foundational properties.
citing papers explorer
-
OP2GS: Object-Aware 3D Gaussian Splatting with Dual-Opacity Primitives
OP2GS adds instance identities and dual opacities to 3D Gaussians so that visual rendering and object-mask rendering are handled by separate opacity channels, reducing label contamination while attaching semantics at the object level.
-
SSDA: Bridging Spectral and Structural Gaps via Dual Adaptation for Vision-Based Time Series Forecasting
SSDA uses spectral magnitude alignment and structural-guided low-rank adaptation to close frequency and adjacency gaps when large vision models process time series rendered as images.
-
Adaptive Subspace Projection for Generative Personalization
A training-free adaptive subspace projection method mitigates semantic collapsing in generative personalization by isolating and adjusting drift in a low-dimensional subspace using the stable pre-trained embedding as anchor.
-
The Indra Representation Hypothesis for Multimodal Alignment
Unimodal model representations converge to a relational structure captured by the Indra representation via V-enriched Yoneda embedding, which is unique and structure-preserving and improves cross-model and cross-modal robustness when instantiated with angular distance.
-
VLAs are Confined yet Capable of Generalizing to Novel Instructions
Averaging and temporally interpolating text latents in VLAs enables 83% success on novel task combinations in the libero-ood benchmark where SOTA models achieve under 15%.
-
MAM-CLIP: Vision-Language Pretraining on Mammography Atlases for BI-RADS Classification
Contrastive pretraining on mammography atlas image-text pairs improves BI-RADS classification F1 by 1-14% especially in low-label regimes, outperforming equivalent numbers of direct labels in some settings.
-
Improved Baselines with Representation Autoencoders
RAE v2 reaches gFID 1.06 on ImageNet-256 in 80 epochs by combining multi-layer encoder sums, complementary REPA targets, and free guidance via output reparameterization.
-
DiRotQ: Rotation-Aware Quantization for 4-bit Diffusion Transformers
DiRotQ uses PCA-based rotation-aware activation quantization combined with GPTQ to achieve better FID and PSNR in 4-bit diffusion transformers than prior methods like SVDQuant.
-
$h$-control: Training-Free Camera Control via Block-Conditional Gibbs Refinement
h-control augments hard-replacement guidance with block-conditional pseudo-Gibbs refinement on unobserved latent sites and adaptive 3D patch freezing to achieve superior FVD on RealEstate10K and DAVIS.
-
Filtering Memorization from Parameter-Space in Diffusion Models
BAF reduces memorization in diffusion LoRAs by filtering spectral channels of the adaptation weights that show weak alignment with the base model's principal subspace.
-
Uncertainty-Aware Foundation Models for Clinical Data
The work introduces uncertainty-aware foundation models for clinical data by learning set-valued patient representations that enforce consistency across partial observations and integrate multimodal self-supervised objectives.
-
HiFi-Inpaint: Towards High-Fidelity Reference-Based Inpainting for Generating Detail-Preserving Human-Product Images
HiFi-Inpaint delivers state-of-the-art detail-preserving human-product images by adding Shared Enhancement Attention and Detail-Aware Loss to reference-based inpainting on a new 40K dataset.
-
Adversarial Concept Distillation for One-Step Diffusion Personalization
OPAD enables reliable high-quality personalization of one-step diffusion models via multi-step teacher distillation combined with adversarial alignment losses.
-
Grounded Reinforcement Learning for Visual Reasoning
ViGoRL introduces visually grounded RL that anchors reasoning steps to image coordinates and uses multi-turn zooming to outperform standard RL and supervised baselines on spatial and GUI reasoning benchmarks.
-
PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering
PMC-VQA dataset and MedVInT model achieve better generative performance on medical VQA benchmarks by visual instruction tuning on a newly constructed large-scale dataset.
-
MONET: A Massive, Open, Non-redundant and Enriched Text-to-image dataset
MONET is an open 104.9M image-text pair dataset created via safety filtering, deduplication, and multi-VLM recaptioning from 2.9B raw pairs, validated by training a competitive 4B-parameter latent diffusion model.
-
CAST: Collapse-Aware multi-Scale Topology Fusion for Multimodal Coreset Selection
CAST selects better multimodal coresets by fusing collapse-aware topologies across modalities and matching distributions at multiple scales in the diffusion wavelet domain.
-
Memory-Efficient Continual Learning with CLIP Models
A per-class loss reweighting scheme based on distributional robustness allows CLIP models to perform class-incremental and domain-incremental learning with minimal memory while limiting forgetting on CIFAR-100, ImageNet1K, and DomainNet.
-
Scaling Properties of Continuous Diffusion Spoken Language Models
Continuous diffusion spoken language models follow scaling laws for loss and phoneme divergence and generate emotive multi-speaker speech at 16B scale, though long-form coherence stays difficult.
-
The Amazing Stability of Flow Matching
Flow matching generative models preserve sample quality, diversity, and latent representations despite pruning 50% of the CelebA-HQ dataset or altering architecture and training configurations.
-
Language-Pretraining-Induced Bias: A Strong Foundation for General Vision Tasks
Random label bridge training aligns LLM parameters with vision tasks, and partial training of certain layers often suffices due to their foundational properties.
- Beyond Text Prompts: Visual-to-Visual Generation as A Unified Paradigm