{"total":94,"items":[{"citing_arxiv_id":"2605.23137","ref_index":9,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"STAMBRIDGE: Spectral-Temporal Amplitude-aware Mid-Feature Bridge for EEG Visual Decoding","primary_cat":"eess.IV","submitted_at":"2026-05-22T01:21:45+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22668","ref_index":42,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"SEGA: Spectral-Energy Guided Attention for Resolution Extrapolation in Diffusion Transformers","primary_cat":"cs.CV","submitted_at":"2026-05-21T16:09:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SEGA adaptively scales RoPE attention components using spectral-energy guidance from the latent to improve structural coherence and fine details in high-resolution DiT synthesis.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22531","ref_index":70,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Disentanglement Beyond Generative Models with Riemannian ICA","primary_cat":"cs.LG","submitted_at":"2026-05-21T14:22:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"RICA replaces ICA's global generative model with local Riemannian geometry, introducing a disentanglement tensor based on the Hessian of the log-likelihood and Ricci curvature to measure pointwise disentanglement, which recovers sources across manifolds in controlled 
tests.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22372","ref_index":24,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"ASAP: Attention Sink Anchored Pruning","primary_cat":"cs.LG","submitted_at":"2026-05-21T12:04:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ASAP prunes tokens in ViTs by anchoring on attention sinks modeled as lazy random walks, using cumulative transition matrices and radial diffusion clustering to compress redundancy while preserving accuracy.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20942","ref_index":14,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Bridging Structure and Language: Graph-Based Visual Reasoning for Autonomous Road Understanding","primary_cat":"cs.CV","submitted_at":"2026-05-20T09:28:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A graph-grounded Combined Road Substrate framework generates traceable QA pairs from road maps to improve small VLMs on compositional road reasoning tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20624","ref_index":66,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Accelerating Video Inverse Problem Solvers with Autoregressive Diffusion Models","primary_cat":"cs.CV","submitted_at":"2026-05-20T02:16:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AVIS applies autoregressive diffusion models to video inverse problems by streaming restoration with measurement-consistent initialization, reducing latency from 114s to 4s and raising throughput to 1.18 FPS (or 5.91 
FPS in the Flash variant).","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20525","ref_index":33,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"NeuroQA: A Large-Scale Image-Grounded Benchmark for 3D Brain MRI Understanding","primary_cat":"cs.CV","submitted_at":"2026-05-19T21:54:12+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":8.0,"formal_verification":"none","one_line_summary":"NeuroQA is a large-scale 3D brain MRI visual question answering benchmark with verified image-grounded QA pairs, multi-domain coverage, and baseline evaluations showing current models lag behind text-only performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20183","ref_index":46,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"MSAVBench: Towards Comprehensive and Reliable Evaluation of Multi-Shot Audio-Video Generation","primary_cat":"cs.CV","submitted_at":"2026-05-19T17:59:33+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19242","ref_index":55,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"PhyWorld: Physics-Faithful World Model for Video Generation","primary_cat":"cs.CV","submitted_at":"2026-05-19T01:28:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"PhyWorld improves temporal consistency and physical plausibility in video world models via flow matching fine-tuning followed by DPO on physics preference pairs, with reported gains on VBench and a custom physical-faithfulness 
benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18653","ref_index":24,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Will It Go Viral? Grounding Micro-Video Popularity Prediction on the Open Web","primary_cat":"cs.MM","submitted_at":"2026-05-18T17:00:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"WEBSHORTS dataset and SHORTS-CAST framework ground micro-video popularity prediction in structured open-web context collected at upload time and enable selective online adaptation using delayed labels.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18579","ref_index":24,"ref_count":2,"confidence":0.55,"is_internal_anchor":false,"paper_title":"S2Aligner: Pair-Efficient and Transferable Pre-Training for Sparse Text-Attributed Graphs","primary_cat":"cs.LG","submitted_at":"2026-05-18T15:56:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"S2Aligner decouples semantic and structural components in LLM-as-Aligner pre-training for sparse TAGs and uses structure-oriented reconstruction plus domain risk balancing to improve transferability and reduce generalization gaps.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17949","ref_index":3,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"SkyNative: A Native Multimodal Framework for Remote Sensing Visual Evidence Reasoning","primary_cat":"cs.CV","submitted_at":"2026-05-18T07:06:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SkyNative introduces an encoder-free architecture using raw patch tokens and modality-specific parameters 
in a unified autoregressive model to improve image-grounded reasoning in remote sensing vision-language tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17766","ref_index":32,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"LatentUMM: Dual Latent Alignment for Unified Multimodal Models","primary_cat":"cs.CV","submitted_at":"2026-05-18T02:35:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LatentUMM proposes dual latent alignment at modality and capacity levels plus latent dynamics stabilization to reduce semantic drift and improve consistency in unified multimodal models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17310","ref_index":30,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Attention Hijacking: Response Manipulation Across Queries in Vision-Language Models","primary_cat":"cs.CV","submitted_at":"2026-05-17T08:02:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Attention Hijacking is a new attack that improves cross-query transferability in VLMs by explicitly steering internal attention to a persistent image-dominant pattern.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16949","ref_index":30,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Beyond Point-Wise Matching: Structural Representation Alignment for Accelerating Diffusion Transformers","primary_cat":"cs.CV","submitted_at":"2026-05-16T12:01:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"sREPA enforces structural consistency in relational geometry of pre-trained vision 
features to accelerate DiT training and improve generation quality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15961","ref_index":25,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Sparse Autoencoders enable Robust and Interpretable Fine-tuning of CLIP models","primary_cat":"cs.CV","submitted_at":"2026-05-15T13:54:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SAE-FT uses a sparse autoencoder on pre-trained CLIP visual representations to regularize fine-tuning by penalizing changes to semantically meaningful features, aiming for robust performance on ImageNet and distribution shifts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15923","ref_index":1,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Invaria: Learning Scale and Density Invariance in Point Clouds via Next-Resolution Prediction","primary_cat":"cs.CV","submitted_at":"2026-05-15T13:06:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Invaria trains point cloud encoders with next-resolution prediction to learn scale and density invariant features, yielding higher mIoU on ScanNet under lower resolution and scaled objects while using a smaller model.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18868","ref_index":46,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"DarkLLM: Learning Language-Driven Adversarial Attacks with Large Language Models","primary_cat":"cs.CR","submitted_at":"2026-05-15T12:28:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DarkLLM trains an LLM to generate 
language-driven adversarial perturbations that unify targeted, untargeted, segmentation, and multi-model attacks on foundation models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15584","ref_index":1,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"AGC: Adaptive Geodesic Correction for Adversarial Robustness on Vision-Language Models","primary_cat":"cs.CV","submitted_at":"2026-05-15T03:48:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"AGC is a training-free inference-time defense for CLIP that adaptively corrects features along geodesics to robust augmentations, claiming 44.4% higher average robust accuracy and 10x lower latency than prior baselines across eight datasets and three backbones.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16423","ref_index":5,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Nonlinear Bipolar Compensation: Handling Outliers in Post-Training Quantization","primary_cat":"cs.CV","submitted_at":"2026-05-14T14:55:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Nonlinear Bipolar Compensation with Bipolar Logarithmic Transformation reduces outlier effects in post-training quantization by performing compensation in a compressed transformed space.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14626","ref_index":15,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"UniTriGen: Unified Triplet Generation of Aligned Visible-Infrared-Label for Few-Shot RGB-T Semantic 
Segmentation","primary_cat":"cs.CV","submitted_at":"2026-05-14T09:39:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"UniTriGen uses unified diffusion in a shared latent space plus lightweight adapters and scene-balanced sampling to produce high-quality aligned VIS-IR-Label triplets from limited paired data, improving few-shot RGB-T semantic segmentation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14486","ref_index":44,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Reduce the Artifacts Bias for More Generalizable AI-Generated Image Detection","primary_cat":"cs.CV","submitted_at":"2026-05-14T07:26:36+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SEF introduces GAN upsampling for diverse artifacts and expert fusion to reduce domain interference, yielding stronger generalization on 13 benchmarks for AI-generated image detection.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14382","ref_index":47,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Delta Forcing: Trust Region Steering for Interactive Autoregressive Video Generation","primary_cat":"cs.CV","submitted_at":"2026-05-14T05:06:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Delta Forcing improves temporal coherence in interactive autoregressive video generation by estimating transition consistency from teacher-generator latent deltas and balancing it against a monotonic continuity 
objective.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14274","ref_index":30,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"CreFlow: Corrective Reflow for Sparse-Reward Embodied Video Diffusion RL","primary_cat":"cs.CV","submitted_at":"2026-05-14T02:18:58+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CreFlow combines LTL compositional rewards with credit-aware NFT and corrective reflow losses in online RL to improve embodied video diffusion models, raising downstream task success by 23.8 percentage points on eight bimanual manipulation tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13835","ref_index":39,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Unlocking Patch-Level Features for CLIP-Based Class-Incremental Learning","primary_cat":"cs.CV","submitted_at":"2026-05-13T17:56:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SPA unlocks patch-level features in CLIP for class-incremental learning via semantic-guided selection and optimal transport alignment with class descriptions, plus projectors and pseudo-feature replay to reduce forgetting.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13122","ref_index":27,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Early Semantic Grounding in Image Editing Models for Zero-Shot Referring Image Segmentation","primary_cat":"cs.CV","submitted_at":"2026-05-13T07:48:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Pretrained instruction-based image editing models exhibit early 
foreground-background separability that enables a training-free framework for zero-shot referring image segmentation using a single denoising step.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12942","ref_index":30,"ref_count":2,"confidence":0.55,"is_internal_anchor":false,"paper_title":"From Compression to Accountability: Harmless Copyright Protection for Dataset Distillation","primary_cat":"cs.CR","submitted_at":"2026-05-13T03:23:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SubPopMark embeds verifiable subpopulation biases into distilled datasets via CVM and USTM optimization stages, allowing provenance inference through comparison of model output signatures against a reference behavior bank.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12678","ref_index":56,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"No One Knows the State of the Art in Geospatial Foundation Models","primary_cat":"cs.CV","submitted_at":"2026-05-12T19:29:51+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"An audit of 152 papers reveals that geospatial foundation models lack standardized evaluations, training controls, and weight releases, so no one knows the state of the art.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12237","ref_index":25,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"UHR-Micro: Diagnosing and Mitigating the Resolution Illusion in Earth Observation VLMs","primary_cat":"cs.CV","submitted_at":"2026-05-12T15:07:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VLMs show a 
resolution illusion on UHR Earth observation imagery where higher resolution does not improve micro-target perception; UHR-Micro benchmark and MAP-Agent address this via evidence-centered active inspection.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Existing benchmarks often rely on downsampling, cropped patches, or macro-level semantics even for large scenes [32], bypassing the severe scale disparity of native UHR imagery. UHR-Micro fills this gap by evaluating micro-level perception directly in UHR scenes. Resolution Constraints and Perceptual Limitations. Fixed-resolution encoders such as CLIP [25] inevitably erase micro-features when resizing ultra-high-resolution imagery. Although recent models adopt dynamic strategies, including adaptive tiling in LLaVA-NeXT [13] and InternVL3.5 [37], or dynamic resolution encoding in Qwen3-VL [4], they still face fundamental perceptual bottlenecks. Minute object signals may be lost during token compression or overwhelmed by background noise,"},{"citing_arxiv_id":"2605.12179","ref_index":32,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"SyncDPO: Enhancing Temporal Synchronization in Video-Audio Joint Generation via Preference Learning","primary_cat":"cs.CV","submitted_at":"2026-05-12T14:22:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SyncDPO improves temporal synchronization in video-audio joint generation using DPO with efficient on-the-fly negative sample construction and curriculum learning.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[30] Chetwin Low, Weimin Wang, and Calder Katyal. Ovi: Twin backbone cross-modal fusion for audio-video generation. arXiv preprint arXiv:2510.01284, 2025. 
[31] Andrew Owens, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H Adelson, and William T Freeman. Visually indicated sounds. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2405-2413, 2016. [32] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748-8763. PMLR, 2021. [33] Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya"},{"citing_arxiv_id":"2605.12122","ref_index":30,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Disentangled Sparse Representations for Concept-Separated Diffusion Unlearning","primary_cat":"cs.LG","submitted_at":"2026-05-12T13:39:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SAEParate disentangles sparse representations in diffusion models via contrastive clustering and nonlinear encoding to enable more precise concept unlearning with reduced side effects.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Contrastive Learning. The core idea of contrastive learning is to shape the representation space based on relative similarity, drawing samples that share semantic content closer while pushing unrelated samples apart. This objective has proven highly effective for self-supervised visual representation learning [6, 16] as well as vision-language alignment [30]. When label supervision is available, supervised contrastive learning [21] treats same class samples as positives, producing tighter intra-class clusters and clearer inter-class separation. 
Our work introduces a contrastive objective in the SAE latent space to explicitly enforce concept-level separation, extending contrastive learning beyond instance-level similarity to structured concept separation for diffusion unlearning."},{"citing_arxiv_id":"2605.11558","ref_index":54,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"A Composite Activation Function for Learning Stable Binary Representations","primary_cat":"cs.LG","submitted_at":"2026-05-12T05:41:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"HTAF is a sigmoid-tanh composite that approximates the Heaviside function to allow stable gradient training of binary activation networks, yielding ICBMs with stable discretization and competitive performance on image tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"predicts a set of predefined concepts and then uses them for class prediction, enabling transparency and intervention at the concept level. However, CBMs often suffer from reduced prediction performance and limited practicality due to the requirement of concept annotations, which are costly to obtain. To address this, subsequent works have proposed label-free variants using pretrained models such as Contrastive Language-Image Pre-training (CLIP, [54]) and Large Language Models (LLMs, [5]) to automatically generate concept labels for each image ([50, 81, 37]). Despite these advances, such approaches still achieve lower prediction performance than standard image models. 
Moreover, concept annotations produced by multimodal models such as CLIP can be unreliable and may fail to accurately capture underlying semantic concepts of individual images"},{"citing_arxiv_id":"2605.11497","ref_index":22,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"PoseBridge: Bridging the Skeletonization Gap for Zero-Shot Skeleton-Based Action Recognition","primary_cat":"cs.CV","submitted_at":"2026-05-12T04:15:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PoseBridge recovers semantic information lost during skeletonization by extracting pose-anchored cues from human pose estimation and transferring them via skeleton-conditioned bridging and semantic prototype adaptation, yielding 13.3-17.4 point gains on the Kinetics PURLS benchmark.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"trained with the original pose estimation objective and our action-semantic alignment objective on the MS COCO dataset [17], and is then frozen to extract both estimated 2D skeleton sequences and pose-anchored semantics used in the ZSSAR stage. Following common ZSSAR evaluation protocols for fair comparison, we adopt Shift-GCN [4] as the skeleton encoder and use the pretrained CLIP text encoder [22] to obtain semantic prototypes from action descriptions. All models are trained only on seen classes and evaluated on disjoint unseen classes. Detailed implementation settings are provided in Appendix A, and hyperparameter/design choice analyses are presented in Appendix D. 
4.2 Comparison with State-of-the-Art Evaluation on standard split benchmark."},{"citing_arxiv_id":"2605.11477","ref_index":23,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"LDDR: Linear-DPP-Based Dynamic-Resolution Frame Sampling for Video MLLMs","primary_cat":"cs.CV","submitted_at":"2026-05-12T03:45:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LDDR proposes a linear DPP-based dynamic-resolution frame sampler that achieves 3x speedup and up to 2.5-point gains on video MLLM benchmarks by selecting non-redundant frames and allocating tokens accordingly.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"In International conference on machine learning, pages 8748-8763. PMLR, 2021. [22] Kele Shao, Keda TAO, Can Qin, Haoxuan You, Yang Sui, and Huan Wang. Holitom: Holistic token merging for fast video large language models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview.net/forum?id=6hvaQTKkpF. [23] Min Shi, Shihao Wang, Chieh-Yun Chen, Jitesh Jain, Kai Wang, Junjun Xiong, Guilin Liu, Zhiding Yu, and Humphrey Shi. Slow-fast architecture for video multi-modal large language models. arXiv preprint arXiv:2504.01328, 2025. 
[24] Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, et al."},{"citing_arxiv_id":"2605.11462","ref_index":42,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"SpatialForge: Bootstrapping 3D-Aware Spatial Reasoning from Open-World 2D Images","primary_cat":"cs.CV","submitted_at":"2026-05-12T03:20:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SpatialForge synthesizes 10 million spatial QA pairs from in-the-wild 2D images to train VLMs for better depth ordering, layout, and viewpoint-dependent reasoning.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"3 SpatialForge Pipeline 3.3.1 Image Filtering As shown in Figure 1 step 1, we first filter the raw image pool to ensure both visual quality and physical realism. At the visual level, we remove low-quality images such as those that are blurred, poorly exposed, or severely distorted, as these can degrade geometric consistency. At the semantic level, we use CLIP [42] to distinguish real-world scenes from synthetic or non-physical content. Specifically, we compare image embeddings with a small set of textual anchors (e.g., \"natural scene\" vs. \"GUI interface\") and discard images that are more similar to non-physical categories, such as screenshots or text-heavy documents. 
This filtering step ensures that the remaining data provides"},{"citing_arxiv_id":"2605.10937","ref_index":25,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping","primary_cat":"cs.CV","submitted_at":"2026-05-11T17:59:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Super-Linear Advantage Shaping (SLAS) introduces a non-linear geometric policy update for RL post-training of text-to-image models that reshapes the local policy space via advantage-dependent Fisher-Rao weighting to reduce reward hacking and improve performance over GRPO baselines.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Implementation Details: Throughout our experiments, we adopt DanceGRPO [7] as the baseline for comparison. Two models with different parameter scales, Stable Diffusion v1.4 (0.9B) [11] and FLUX.1 Dev (12B) [3], are selected as backbones to validate the effectiveness of our method. The training reward is defined as a combination of scores from CLIP [25] and HPS-v2.1 [12], consistent with the configuration used in DanceGRPO 1. Specifically, the HPS-v2.1 and CLIP reward components are weighted at a ratio of 1:1 for SD1.4, and 0.7:1.4 for FLUX.1 Dev. 
Regarding training data, the original DanceGRPO paper uses a carefully curated internal prompt set; in contrast, to ensure a fair comparison and to assess the robustness of our method, we employ the publicly available"},{"citing_arxiv_id":"2605.10817","ref_index":19,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"CLEF: EEG Foundation Model for Learning Clinical Semantics","primary_cat":"cs.AI","submitted_at":"2026-05-11T16:34:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CLEF, a long-context EEG foundation model using 3D multitaper spectrograms and contrastive alignment with reports and EHR, beats prior models on 229 of 234 clinical tasks and raises mean AUROC from 0.65 to 0.74.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"Long-context modeling alone cannot recover important clinical semantics. Whether a patient carries a diagnosis of Alzheimer's disease or is being treated with morphine produces signatures in the EEG, but recovering them requires grounding in the clinical modalities that encode this information. CLEF aligns the recording-level embedding with two such modalities through symmetric contrastive objectives [19]: free-text neurologist reports, preprocessed by an LLM summarizer to preserve electrographic content, and structured electronic health record (EHR) data, which summarizes the patient's broader clinical profile through demographics, active medications, and diagnoses encoded as a variable-length set of learned code embeddings. 
Together, reports and EHRs ensure EEG"},{"citing_arxiv_id":"2605.10806","ref_index":32,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"PhyGround: Benchmarking Physical Reasoning in Generative World Models","primary_cat":"cs.CV","submitted_at":"2026-05-11T16:30:51+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PhyGround is a new benchmark with curated prompts, a 13-law taxonomy, large-scale human annotations, and an open physics-specialized VLM judge for evaluating physical reasoning in generative video models.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"annotators must distinguish videos that merely look plausible from those that actually satisfy the relevant physical constraints. A small annotator pool further amplifies this problem, as aggregate scores may be driven by individual biases rather than stable model performance. Third, automatic evaluation remains insufficiently physics-aware and auditable. Standard video metrics such as FVD[40], SSIM, PSNR[ 17], and CLIP-based similarity[ 32] are useful for measuring distributional similarity, pixel-level fidelity, or semantic alignment, but they are not designed to detect violations of physical laws. Recent VLM-as-judge approaches provide a more semantic alternative, but many rely on closed-source models [4, 18, 22]. 
Because their architectures, weights, training data, model versions, and API configurations evolve over time and"},{"citing_arxiv_id":"2605.10762","ref_index":8,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"GridProbe: Posterior-Probing for Adaptive Test-Time Compute in Long-Video VLMs","primary_cat":"cs.CV","submitted_at":"2026-05-11T15:57:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"GridProbe uses posterior probing on a KxK frame grid to adaptively select question-relevant frames, delivering up to 3.36x TFLOPs reduction with accuracy within 1.6 pp of the full-frame baseline on Video-MME-v2.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10756","ref_index":40,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"TINS: Test-time ID-prototype-separated Negative Semantics Learning for OOD Detection","primary_cat":"cs.CV","submitted_at":"2026-05-11T15:54:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"TINS improves OOD detection by learning negative semantics at test time with ID-prototype separation, cutting average FPR95 from 14.04% to 6.72% on the Four-OOD benchmark with ImageNet-1K.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"often forced into one of the known classes, which can lead to unreliable decisions. OOD detection addresses this issue by retaining the C-way ID classifier while introducing an OOD score function S [29-31] to separate ID and OOD inputs: G_γ(x) = ID if S(x) ≥ γ, and OOD otherwise, (1) where G_γ is the OOD detector with threshold γ ∈ R, and a larger value of S indicates stronger evidence that x belongs to ID. 
CLIP and NegLabel. CLIP [40] consists of a text encoder T(·) using the Transformer [47] architecture and an image encoder I(·) using the ViT [11] or ResNet [16] architecture. Given a test image x, we obtain an image feature v = I(x) ∈ R^d and text features t_{y_i} = T(E(prompt(y_i))) ∈ R^d for labels y_i ∈ Y^+, where prompt(·) represents the prompt template applied to an input label ( e."},{"citing_arxiv_id":"2605.10198","ref_index":66,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Empty SPACE: Cross-Attention Sparsity for Concept Erasure in Diffusion Models","primary_cat":"cs.LG","submitted_at":"2026-05-11T08:46:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SPACE induces sparsity in cross-attention parameters via closed-form iterative updates to erase target concepts more effectively than dense baselines in large diffusion models.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"2), nudity erasure (Sec. 4.3), and an analysis on the effect of sparsity on the erasure performance and on storage requirements (Sec. 4.4). We defer to Appendix C for additional experimental results, including celebrity memorization (Appendix C.5). 4.1 Implementation Details We measure how successfully concepts have been erased using CLIP Score [65] (CS), CLIP Accuracy [66] (CA), and Kernel Inception Distance [67] (KID). Lower CS and CA for the concept to erase imply better erasure. Higher CS and CA for the concepts to preserve indicate better knowledge preservation. The KID score is used to address the change in the generative distribution of the models. 
For non-target concepts, a smaller KID implies good preservation"},{"citing_arxiv_id":"2605.09948","ref_index":5,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"LoopVLA: Learning Sufficiency in Recurrent Refinement for Vision-Language-Action Models","primary_cat":"cs.AI","submitted_at":"2026-05-11T03:51:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LoopVLA adds recurrent refinement and learned sufficiency estimation to VLA models, cutting parameters 45% and raising throughput 1.7x while matching baseline task success on LIBERO and VLA-Arena.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"ulation, using large-scale multimodal pretraining to bridge perceptual understanding and physical interaction [1, 2, 3, 4]. A typical VLA system builds upon a vision-language model (VLM) to encode visual observations and language instructions into latent representations, and then employs an action head that maps these representations to continuous control outputs [5, 6, 7, 8, 9]. Most existing designs follow a late-output paradigm, where only the final-layer representation of the VLM is used for action prediction. Despite its simplicity, the late-output paradigm implicitly assumes that the deepest, most abstract representation is universally suitable for all action decisions. 
This assumption, however, is not well"},{"citing_arxiv_id":"2605.09271","ref_index":99,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Shaping Schema via Language Representation as the Next Frontier for LLM Intelligence Expanding","primary_cat":"cs.AI","submitted_at":"2026-05-10T02:42:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"Advanced language representations shape LLMs' schemas to improve knowledge activation and problem-solving.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"language such as syntactic features and topological attributes dictate the activation efficiency and organizational stability of the model's internal schema and performance. 5 Alternative Views The scaling law [ 98] demonstrates that model performance systematically improves as both the number of parameters and the amount of training data increase [99, 100, 101, 102, 103]. Empirical studies have shown a near power-law relationship between scale and performance, suggesting that larger models yield predictable improvements in generalization and reasoning ability [104, 8, 105]. 
More intriguingly, recent findings [55] indicate that LLMs exhibit emergent abilities, qualitative capabilities that arise abruptly once a model surpasses a certain scale threshold [106, 5]."},{"citing_arxiv_id":"2605.09262","ref_index":26,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Reinforcing Multimodal Reasoning Against Visual Degradation","primary_cat":"cs.CV","submitted_at":"2026-05-10T02:17:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ROMA improves MLLM robustness to seen and unseen visual corruptions by +2.3-2.4% over GRPO on seven reasoning benchmarks while matching clean accuracy.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Lmm-r1: Empowering 3b lmms with strong reasoning abilities through two-stage rule-based rl. arXiv preprint arXiv:2503.07536, 2025. [25] Runqi Qiao, Qiuna Tan, Guanting Dong, Minhui Wu, Chong Sun, Xiaoshuai Song, Zhuoma GongQue, Shanglin Lei, Zhe Wei, Miaoxuan Zhang, et al. We-math: Does your large multimodal model achieve human-like mathematical reasoning? arXiv preprint arXiv:2407.01284, 2024. [26] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748-8763. PMLR, 2021. 
[27] Roberta Raileanu, Max Goldstein, Denis Yarats, Ilya Kostrikov, and Rob Fergus."},{"citing_arxiv_id":"2605.08985","ref_index":35,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"LLaVA-UHD v4: What Makes Efficient Visual Encoding in MLLMs?","primary_cat":"cs.CV","submitted_at":"2026-05-09T15:10:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLaVA-UHD v4 reduces visual-encoding FLOPs by 55.8% for high-resolution images in MLLMs via slice-based encoding plus intra-ViT early compression while matching or exceeding baseline performance on document, OCR, and VQA benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Omnidocbench: Benchmarking diverse pdf document parsing with comprehensive annotations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24838-24848, 2025. [34] Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824, 2023. [35] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748-8763. PMLR, 2021. 
[36] Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, and Cho-Jui Hsieh."},{"citing_arxiv_id":"2605.08678","ref_index":78,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI","primary_cat":"cs.LG","submitted_at":"2026-05-09T04:29:46+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"or submitted artifact restricted to the problem-relevant part under frozen evaluation. Benchmark Setting Count # Scope New Method Generalize Scalability Control Reference ML-Bench [92] ML code 169 18 repos✗ ✗△ △Pass@K MLAgentBench [34] ML experimentation 13 4 categories✗ ✗ ✗ ✗baselines MLE-bench [10] ML engineering 75 15 categories✗ ✗ ✓ ✓medals MLE-Dojo [76] ML engineering 200+ 4 domains✗ ✗ ✓△H-Rank PostTrainBench [78] LLM post-training 28 7 evals✗△✓ ✓instruct AutoResearch [44] LLM training 1 1 setup△✗△ △val_bpb MLGym [66] ML experiments 13 4 domains△✗△ △baselines PaperBench [88] Paper replication 20 1 AI✗ ✗ ✓△rubric MLR-Bench [11] Research workflow 201 9 topics✓ ✗ ✓ ✗judge DiscoveryBench [58] Data discovery 1167 6 domains✗ ✗ ✓ ✓facets ScienceAgentBench [18] Data workflow 102 4 fields✗ ✗△✓papers"},{"citing_arxiv_id":"2605.08664","ref_index":16,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"IPAD-CLIP: Teaching CLIP to Detect Image Local Perceptual Artifacts","primary_cat":"cs.CV","submitted_at":"2026-05-09T04:04:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"IPAD-CLIP adapts CLIP via artifact-aware text embeddings to detect multi-class local perceptual artifacts, backed by a new dataset of 3520 images with pixel-level 
masks.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"Second, these artifacts are localized, subtle, and exhibit diverse appearances. Adopting a vision-only supervised learning paradigm risks learning spurious correlations from limited visual patterns without understanding the concept of artifacts, making them prone to being missed detection. To overcome this, we introduce IPAD-CLIP, a novel framework built upon CLIP [16] that enhances artifact discrimination in both textual and visual spaces while preserving generalization capabilities. Our key insight is that these local artifacts are closely related to specific semantics. For instance, ghosting often appears at the edges of subjects, lens flare frequently occurs around light sources, and moiré commonly arises on"},{"citing_arxiv_id":"2605.08389","ref_index":33,"ref_count":2,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Decoupling Endpoint and Semantic Transition Learning for Zero-Shot Composed Image Retrieval","primary_cat":"cs.CV","submitted_at":"2026-05-08T18:55:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DeCIR improves projection-based zero-shot composed image retrieval by decoupling endpoint and semantic transition alignment with separate low-rank adapters merged by LRDM, showing gains on CIRR, CIRCO, FashionIQ, and GeneCIS.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"In this paper, we propose LRDM to merge low-rank coefficients while keeping the retrieval-oriented output basis fixed. 3 Method 3.1 Problem Formulation and Model Overview Composed image retrieval aims to retrieve a target image from a reference image Iref and a text modification t. DeCIR uses a projection-based CIR model based on the pretrained Pic2Word [33]. 
Pic2Word uses a CLIP [32] visual encoder EV, a CLIP text encoder ET, and a lightweight mapping network fϕ consisting of a three-layer MLP. Given Iref, the visual encoder first extracts an image feature, and the mapping network projects it into a pseudo-word token in the CLIP text embedding space, which is denoted as sref = fϕ(EV(Iref)). Subsequently, this token is inserted into a text"},{"citing_arxiv_id":"2605.08003","ref_index":28,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"SphereVAD: Training-Free Video Anomaly Detection via Geodesic Inference on the Unit Hypersphere","primary_cat":"cs.CV","submitted_at":"2026-05-08T16:57:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SphereVAD performs training-free video anomaly detection by recasting anomaly discrimination as von Mises-Fisher likelihood-ratio geodesic inference on the unit hypersphere using intermediate MLLM features, with Frechet mean centering, holistic scene attention, and spherical geodesic pulling.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Existing approaches, whether weakly supervised [32, 35], unsupervised [34], or few-shot paradigms [22, 26], all rely on some form of training procedure and therefore face bottlenecks including high computational cost, strong dependence on target-domain data, and the necessity of retraining when transferring across scenes. Recent advances in multimodal large language models (MLLMs) [28, 18] have opened new possibilities for training-free VAD. Several methods attempt to leverage the language reasoning capability of MLLMs to judge anomalies directly [39, 43], yet a significant performance gap with respect to supervised counterparts persists. 
We find that MLLM intermediate-layer features already encode rich discriminative information about anomalies."},{"citing_arxiv_id":"2605.07971","ref_index":47,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"DVD: Discrete Voxel Diffusion for 3D Generation and Editing","primary_cat":"cs.CV","submitted_at":"2026-05-08T16:32:17+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"tasks to Direct3D-S2 [15] and TripoSR [9] as broader baselines. Metrics. To evaluate the generation ability, we compute FID [44] of the rendering images under the feature space of DINOv2 [45], noted as FID_D. We also compute the point cloud FID of the finalized mesh and voxels under the feature space of PointNet++ [46], denoted as FID_PC and FID_V, respectively. We compute the CLIP score [47] between the rendered images of the generated assets and the condition. For the image-conditioned task, we compute Chamfer Distance between the generated and GT mesh as a metric of reconstruction, denoted as CD. We also report the Chamfer Distance of the cubified mesh of the GT voxel and the generated voxel as CD_V. For certain experiments, negative"}],"limit":50,"offset":0}