RICA replaces ICA's global generative model with local Riemannian geometry, introducing a disentanglement tensor based on the Hessian of the log-likelihood and Ricci curvature to measure pointwise disentanglement, which recovers sources across manifolds in controlled tests.
hub Mixed citations
Learning transferable visual models from natural language supervision
Mixed citation behavior. Most common role is background (61%).
hub tools
citation-role summary
citation-polarity summary
co-cited works
representative citing papers
NeuroQA is a large-scale 3D brain MRI visual question answering benchmark with verified image-grounded QA pairs, multi-domain coverage, and baseline evaluations showing current models lag behind text-only performance.
WEBSHORTS dataset and SHORTS-CAST framework ground micro-video popularity prediction in structured open-web context collected at upload time and enable selective online adaptation using delayed labels.
UniTriGen uses unified diffusion in a shared latent space plus lightweight adapters and scene-balanced sampling to produce high-quality aligned VIS-IR-Label triplets from limited paired data, improving few-shot RGB-T semantic segmentation.
CreFlow combines LTL compositional rewards with credit-aware NFT and corrective reflow losses in online RL to improve embodied video diffusion models, raising downstream task success by 23.8 percentage points on eight bimanual manipulation tasks.
SPA unlocks patch-level features in CLIP for class-incremental learning via semantic-guided selection and optimal transport alignment with class descriptions, plus projectors and pseudo-feature replay to reduce forgetting.
SubPopMark embeds verifiable subpopulation biases into distilled datasets via CVM and USTM optimization stages, allowing provenance inference through comparison of model output signatures against a reference behavior bank.
VLMs show a resolution illusion on UHR Earth observation imagery where higher resolution does not improve micro-target perception; UHR-Micro benchmark and MAP-Agent address this via evidence-centered active inspection.
SAEParate disentangles sparse representations in diffusion models via contrastive clustering and nonlinear encoding to enable more precise concept unlearning with reduced side effects.
PoseBridge recovers semantic information lost during skeletonization by extracting pose-anchored cues from human pose estimation and transferring them via skeleton-conditioned bridging and semantic prototype adaptation, yielding 13.3-17.4 point gains on the Kinetics PURLS benchmark.
PhyGround is a new benchmark with curated prompts, a 13-law taxonomy, large-scale human annotations, and an open physics-specialized VLM judge for evaluating physical reasoning in generative video models.
GridProbe uses posterior probing on a KxK frame grid to adaptively select question-relevant frames, delivering up to 3.36x TFLOPs reduction with accuracy within 1.6 pp of the full-frame baseline on Video-MME-v2.
LoopVLA adds recurrent refinement and learned sufficiency estimation to VLA models, cutting parameters 45% and raising throughput 1.7x while matching baseline task success on LIBERO and VLA-Arena.
IPAD-CLIP adapts CLIP via artifact-aware text embeddings to detect multi-class local perceptual artifacts, backed by a new dataset of 3520 images with pixel-level masks.
SphereVAD performs training-free video anomaly detection by recasting anomaly discrimination as von Mises-Fisher likelihood-ratio geodesic inference on the unit hypersphere using intermediate MLLM features, with Frechet mean centering, holistic scene attention, and spherical geodesic pulling.
Excess risk decomposes into independent alignment (trace of inverse average Hessian times gradient covariance) and curvature terms, so both flatness and gradient alignment are required; SAGE achieves this and sets new SOTA on DomainBed.
PIQL integrates privileged information to accelerate convergence, lower loss, and improve generalization in tabular foundation models.
Optimal INR freeze depth matches highest weight stable rank layer; SAEs reveal SIREN atoms are localized while FFMLP atoms trace cohort contours with causal impact on PSNR.
S2M extracts structured text quadruples from change masks to provide noise-free multimodal supervision, achieving 17.80% Sek and 66.14% F_scd on the new Gaza-Change-v2 dataset and outperforming LLM-based multimodal methods.
TrajGANR learns continuous neural representations of trajectories to enable fine-grained alignment with street-view images and locations in a joint multimodal self-supervised objective, outperforming prior geospatial MSSL methods on urban mobility and road tasks.
FreeSpec uses SVD-based spectral reconstruction to fuse global low-rank and local high-rank features, reducing content drift and preserving temporal dynamics in long video generation.
ArenaPO infers Gaussian capability distributions from pairwise preferences and applies truncated-normal latent inference to derive fine-grained offline rewards for preference optimization of text-to-image diffusion models.
TrajShield is a training-free defense that reduces jailbreak success rates by 52.44% on average in text-to-video models by localizing and neutralizing risks through trajectory simulation and causal intervention.
RSRCC is a new 126k-question benchmark for fine-grained remote sensing change question-answering, constructed via a hierarchical semi-supervised pipeline with retrieval-augmented Best-of-N ranking.
citing papers explorer
-
LLaVA-UHD v4: What Makes Efficient Visual Encoding in MLLMs?
LLaVA-UHD v4 reduces visual-encoding FLOPs by 55.8% for high-resolution images in MLLMs via slice-based encoding plus intra-ViT early compression while matching or exceeding baseline performance on document, OCR, and VQA benchmarks.
-
Decoupling Endpoint and Semantic Transition Learning for Zero-Shot Composed Image Retrieval
DeCIR improves projection-based zero-shot composed image retrieval by decoupling endpoint and semantic transition alignment with separate low-rank adapters merged by LRDM, showing gains on CIRR, CIRCO, FashionIQ, and GeneCIS.
-
Anisotropic Modality Align
Modality representations share dominant semantic geometry but have an anisotropic residual gap; AnisoAlign corrects source representations boundedly using target geometry for unpaired alignment.
-
LithoBench: Benchmarking Large Multimodal Models for Remote-Sensing Lithology Interpretation
LithoBench is a new multi-level benchmark showing that existing large multimodal models have substantial limitations in geological semantic understanding for remote sensing lithology interpretation.
-
Hierarchical Dual-Subspace Decoupling for Continual Learning in Vision-Language Models
HDSD decouples parameter subspaces in vision-language models via a Feature Modulation Module, General Fusion Module with adaptive thresholds, and Hierarchical Learning Module with SVD scaling to minimize cross-task interference and achieve state-of-the-art class-incremental learning performance.
-
ModelLens: Finding the Best for Your Task from Myriads of Models
ModelLens learns a performance-aware latent space from 1.62M leaderboard records to rank unseen models on unseen datasets without forward passes on the target.
-
Taming Outlier Tokens in Diffusion Transformers
Outlier tokens in DiTs are addressed with Dual-Stage Registers, which reduce artifacts and improve image generation on ImageNet and text-to-image tasks.
-
DART: A Vision-Language Foundation Model for Comprehensive Rope Condition Monitoring
DART is a cross-modal foundation model that delivers rope damage classification, severity regression, and few-shot recognition from a single frozen representation trained on 4270 images across 14 damage classes.
-
FaithfulFaces: Pose-Faithful Facial Identity Preservation for Text-to-Video Generation
FaithfulFaces introduces a pose-faithful identity aligner with a shared dictionary and invariance constraint to maintain facial identity in text-to-video generation under large pose changes and occlusions.
-
Intermediate Representations are Strong AI-Generated Image Detectors
Intermediate layer embedding sensitivity to perturbations distinguishes AI-generated images from real ones, yielding higher AUROC on GenImage and Forensics Small benchmarks than prior methods.
-
ViewSAM: Learning View-aware Cross-modal Semantics for Weakly Supervised Cross-view Referring Multi-Object Tracking
ViewSAM achieves state-of-the-art weakly supervised performance on cross-view referring multi-object tracking by refining SAM tracklets via affinity-guided re-prompting and modeling view-induced variations as learnable conditions on SAM2.
-
Mitigating Multimodal LLMs Hallucinations via Relevance Propagation at Inference Time
LIME reduces hallucinations in multimodal LLMs by using LRP to boost perceptual modality contributions through inference-time KV updates.
-
UniBCI: Towards a Unified Pretrained Model for Invasive Brain-Computer Interfaces
UniBCI is a unified pretrained model for invasive neural spike data that uses CST tokenization, IAA attention, and self-supervised masked reconstruction to achieve SOTA downstream performance with better generalization and efficiency.
-
FingerViP: Learning Real-World Dexterous Manipulation with Fingertip Visual Perception
FingerViP equips each finger with a miniature camera and trains a multi-view diffusion policy that achieves 80.8% success on real-world dexterous tasks previously limited by wrist-camera occlusion.
-
Stealthy and Adjustable Text-Guided Backdoor Attacks on Multimodal Pretrained Models
Introduces a text-guided backdoor attack using common textual words as triggers and visual perturbations for stealthy, adjustable control on multimodal pretrained models.
-
Rapidly deploying on-device eye tracking by distilling visual foundation models
DistillGaze reduces median gaze error by 58.62% on a 2000+ participant dataset by distilling foundation models into a 256K-parameter on-device model using synthetic labeled data and unlabeled real data.
-
Towards Continual Expansion of Data Coverage: Automatic Text-guided Edge-case Synthesis
Automated LLM-based prompt engineering for text-to-image edge-case synthesis improves object detection robustness on the FishEye8K benchmark over naive augmentation and manual prompts.
-
ORCA: An Agentic Reasoning Framework for Hallucination and Adversarial Robustness in Vision-Language Models
ORCA is an agentic reasoning framework that enhances factual accuracy and adversarial robustness of pretrained LVLMs via an Observe-Reason-Critique-Act loop with small vision models, reporting accuracy gains of up to 40% on hallucination benchmarks and 20% under adversarial perturbations.
-
ASAP: Attention Sink Anchored Pruning
ASAP prunes tokens in ViTs by anchoring on attention sinks modeled as lazy random walks, using cumulative transition matrices and radial diffusion clustering to compress redundancy while preserving accuracy.
-
PhyWorld: Physics-Faithful World Model for Video Generation
PhyWorld improves temporal consistency and physical plausibility in video world models via flow matching fine-tuning followed by DPO on physics preference pairs, with reported gains on VBench and a custom physical-faithfulness benchmark.
-
AGC: Adaptive Geodesic Correction for Adversarial Robustness on Vision-Language Models
AGC is a training-free inference-time defense for CLIP that adaptively corrects features along geodesics to robust augmentations, claiming 44.4% higher average robust accuracy and 10x lower latency than prior baselines across eight datasets and three backbones.
-
A Composite Activation Function for Learning Stable Binary Representations
HTAF is a sigmoid-tanh composite that approximates the Heaviside function to allow stable gradient training of binary activation networks, yielding ICBMs with stable discretization and competitive performance on image tasks.
-
TINS: Test-time ID-prototype-separated Negative Semantics Learning for OOD Detection
TINS improves OOD detection by learning negative semantics at test time with ID-prototype separation, cutting average FPR95 from 14.04% to 6.72% on the Four-OOD benchmark with ImageNet-1K.
-
Empty SPACE: Cross-Attention Sparsity for Concept Erasure in Diffusion Models
SPACE induces sparsity in cross-attention parameters via closed-form iterative updates to erase target concepts more effectively than dense baselines in large diffusion models.
-
ST-Gen4D: Embedding 4D Spatiotemporal Cognition into World Model for 4D Generation
ST-Gen4D uses a world model that fuses global appearance and local dynamic graphs into a 4D cognition representation to guide consistent 4D Gaussian generation.
-
ShellfishNet: A Domain-Specific Benchmark for Visual Recognition of Marine Molluscs
ShellfishNet is a new benchmark of 8,691 images across 32 mollusc taxa for evaluating vision models on real-world underwater ecological monitoring tasks including robustness to degradation.
-
Pan-FM: A Pan-Organ Foundation Model with Saliency-Guided Masking for Missing Robustness
Pan-FM learns balanced representations across seven organs by adaptively masking dominant organs during pre-training, yielding stronger disease prediction and missing-organ robustness than single-organ or naive multimodal baselines on UK Biobank.
-
Diverse Sampling in Diffusion Models with Marginal Preserving Particle Guidance
EDDY adds diversity to diffusion-model samples by using kernel-based anti-symmetric pairwise drifts that preserve marginal distributions via Fokker-Planck symmetries, with practical approximations for expensive cases.
-
OpenGaFF: Open-Vocabulary Gaussian Feature Field with Codebook Attention
OpenGaFF adds a geometry-conditioned Gaussian Feature Field and codebook-guided attention to 3D Gaussian Splatting for spatially consistent open-vocabulary 3D semantic understanding.
-
Adding Thermal Awareness to Visual Systems in Real-Time via Distilled Diffusion Models
FusionProxy is a distilled diffusion-based fusion module that adds thermal awareness to RGB vision systems in real time as an independent plug-and-play component.
-
Perceptual Flow Network for Visually Grounded Reasoning
PFlowNet decouples perception from reasoning, integrates multi-dimensional rewards with vicinal geometric shaping via variational RL, and reports new SOTA results on V* Bench (90.6%) and MME-RealWorld-lite (67.0%).
-
Diffusion Templates: A Unified Plugin Framework for Controllable Diffusion
Diffusion Templates is a unified plugin framework that allows injecting various controllable capabilities into diffusion models through a standardized interface.
-
Bridging Coarse and Fine Recognition: A Hybrid Approach for Open-Ended Multi-Granularity Object Recognition in Interactive Educational Games
HyMOR combines MLLM for coarse open-ended recognition with CLIP for fine-grained domain objects, achieving near-CLIP fine performance and 2.5% better general recognition plus 23.2% overall SBert gain on a new TBO textbook dataset.
-
UniSemAlign: Text-Prototype Alignment with a Foundation Encoder for Semi-Supervised Histopathology Segmentation
UniSemAlign aligns text and prototype representations with visual features to generate better supervision signals for semi-supervised segmentation, reporting Dice gains of up to 8.6% on CRAG with 10% labels.
-
Beyond Surface Artifacts: Capturing Shared Latent Forgery Knowledge Across Modalities
Introduces MAF framework and DeepModal-Bench to capture universal cross-modal forgery traces for better generalization in multimodal deepfake detection.
-
A Survey on Vision-Language-Action Models: An Action Tokenization Perspective
The survey frames VLA models as pipelines that generate progressively grounded action tokens and classifies those tokens into eight types to guide future development.
-
Enhancing Text-to-Image Diffusion Transformer via Split-Text Conditioning
DiT-ST converts complete-text captions into split-text primitives via LLMs and injects them hierarchically across denoising stages to reduce semantic confusion in DiT-based text-to-image generation.
-
Vision-Language-Action Safety: Threats, Challenges, Evaluations, and Mechanisms
A literature survey that unifies fragmented work on attacks, defenses, evaluations, and deployment challenges for Vision-Language-Action models in robotics.
-
HY-World 2.0: A Multi-Modal World Model for Reconstructing, Generating, and Simulating 3D Worlds
HY-World 2.0 generates and reconstructs high-fidelity navigable 3D Gaussian Splatting worlds from text, images, or videos via upgraded panorama, planning, expansion, and composition modules, with released code claiming open-source SOTA performance.
-
Shaping Schema via Language Representation as the Next Frontier for LLM Intelligence Expanding
Advanced language representations shape LLMs' schemas to improve knowledge activation and problem-solving.
- STAMBRIDGE: Spectral-Temporal Amplitude-aware Mid-Feature Bridge for EEG Visual Decoding
- MSAVBench: Towards Comprehensive and Reliable Evaluation of Multi-Shot Audio-Video Generation
- MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI
- DVD: Discrete Voxel Diffusion for 3D Generation and Editing