{"total":42,"items":[{"citing_arxiv_id":"2606.19088","ref_index":18,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ReSiReg: Towards Spatially Consistent Semantics in Language-Conditioned Robotic Tasks","primary_cat":"cs.RO","submitted_at":"2026-06-17T13:58:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ReSiReg clusters VLM intermediates into prototypes, derives language descriptors, and reconstructs patches as mixtures to improve spatial consistency in dense language-grounded retrieval for robotics.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.29691","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Unsupervised Semantic Segmentation Facilitates Model Understanding","primary_cat":"cs.CV","submitted_at":"2026-05-28T09:52:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A visualization protocol based on unsupervised semantic segmentation reveals positional biases, scaling behaviors, and boundary artifacts across self-supervised vision transformer models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23556","ref_index":173,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Is Dimensionality a Barrier for Retrieval Models?","primary_cat":"cs.LG","submitted_at":"2026-05-22T12:22:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Dimension d = O(m^{-2} log n) nearly achieves the optimal margin m^rd(+∞, A) for retrieval embeddings, with matching lower bounds showing d = O(k log(n/k)) suffices and is necessary for m = Θ(k^{-1/2}) on k-sparse query matrices.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23033","ref_index":39,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Uncovering the Latent Potential of Deep Intermediate Representations","primary_cat":"cs.LG","submitted_at":"2026-05-21T20:58:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Introduces LOES, a constructive spectral method to select task-discriminative subspaces from intermediate layer embeddings, and GeoReg for enforcing simplicial class geometry during fine-tuning, with reported gains increasing with model depth across modalities.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23028","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"RADAR: Relative Angular Divergence Across Representations","primary_cat":"cs.LG","submitted_at":"2026-05-21T20:51:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RADAR is a geometrically grounded metric that predicts cross-domain transferability by comparing layer-wise representation trajectory distributions in foundation models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21028","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DySink: Dynamic Frame Sinks for Autoregressive Long Video Generation","primary_cat":"cs.CV","submitted_at":"2026-05-20T11:01:01+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20085","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Spatially Prompted Visual Trajectory Prediction for Egocentric Manipulation","primary_cat":"cs.CV","submitted_at":"2026-05-19T16:39:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"The paper introduces SP-VTP as a new setting for egocentric manipulation, releases the EgoSPT dataset with first-frame spatial annotations, and proposes the SPOT model that outperforms non-prompted baselines on cross-scene trajectory prediction.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18324","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Improved Baselines with Representation Autoencoders","primary_cat":"cs.CV","submitted_at":"2026-05-18T12:42:34+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RAE v2 reaches gFID 1.06 on ImageNet-256 in 80 epochs by combining multi-layer encoder sums, complementary REPA targets, and free guidance via output reparameterization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17633","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SparseSAM: Structured Sparsification of Activations in Segment Anything Models","primary_cat":"cs.CV","submitted_at":"2026-05-17T19:54:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SparseSAM achieves 2x faster inference and 2.8x memory reduction in SAM with only 0.004 mIoU loss at 0.4 density via Stripe-Sort Attention and Residual-Consistency MLP.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17630","ref_index":28,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SegRAG: Training-Free Retrieval-Augmented Semantic Segmentation","primary_cat":"cs.CV","submitted_at":"2026-05-17T19:51:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SegRAG is a training-free retrieval-augmented framework that extracts class-specific point prompts from a filtered DINOv3 feature bank to boost SAM3 semantic segmentation performance on standard and agricultural benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17283","ref_index":98,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"OProver: A Unified Framework for Agentic Formal Theorem Proving","primary_cat":"cs.CL","submitted_at":"2026-05-17T06:39:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OProver-32B achieves top Pass@32 scores on MiniF2F, ProverBench, and PutnamBench by combining continued pretraining with iterative agentic proving, retrieval, SFT on repairs, and RL on unresolved cases using a 6.86M-proof dataset.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13565","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Qwen-Image-VAE-2.0 Technical Report","primary_cat":"cs.CV","submitted_at":"2026-05-13T14:04:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Qwen-Image-VAE-2.0 achieves state-of-the-art high-compression image reconstruction and superior diffusability for diffusion models, with a new text-rich document benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10404","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Position: Life-Logging Video Streams Make the Privacy-Utility Trade-off Inevitable","primary_cat":"cs.CV","submitted_at":"2026-05-11T11:42:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Life-logging video streams create an inevitable privacy-utility trade-off that is a foundational challenge for always-on AI systems.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"Moreover, facial (micro-)expressions and body languages can be exploited to infer individuals'emotional states, psychological characteristics, and personality traitsfrom raw video data. Large pretrained VLMs are superior and mature in such tasks [38] with products already introduced to the market and users [71, 72]. Combining with audio 3 (a) (b) Fig. 3 Human face inversion attack based on embeddings from PE [7]. Facial features are restored. modality further aids emotional analysis [73] from video. If the marketing or human resource management team intentionally accesses relevant information, they can be more guiding or aggressive, affecting or misleading clients' real need. Last but not least, VLMs are even better than humans in analyzinggeographical locationfrom videos, inferring"},{"citing_arxiv_id":"2605.08298","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"What Cohort INRs Encode and Where to Freeze Them","primary_cat":"cs.LG","submitted_at":"2026-05-08T11:09:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Optimal INR freeze depth matches highest weight stable rank layer; SAEs reveal SIREN atoms are localized while FFMLP atoms trace cohort contours with causal impact on PSNR.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"structure and test-time fitting cannot be separated, leaving per-layer transferability unmeasured. This raises two questions:whichlayers carry transferable representations, andwhatdo those representa- tions encode? Drawing inspiration from the pretrained vision encoder literature, where frozen-feature analysis has localized transferable representations to intermediate rather than output layers [4, 17, 53], we sweep the test-time freeze boundary across all encoder layers of a cohort-trained INR. Whether this pattern holds for cohort-trained INRs has not been studied. The two dominant backbones, SIREN [60] and Fourier-feature MLPs (FFMLP) [63], may not localize transferable structure at the same depth since they encode coordinate position through fundamentally different mechanisms (sinu-"},{"citing_arxiv_id":"2605.01517","ref_index":239,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"VAnim: Rendering-Aware Sparse State Modeling for Structure-Preserving Vector Animation","primary_cat":"cs.CV","submitted_at":"2026-05-02T16:10:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VAnim creates open-domain text-to-SVG animations via sparse state updates on a persistent DOM tree, identification-first planning, and rendering-aware RL with a new 134k-example benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.25184","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Enabling High Error Tolerance in Satellite Video Transmissions by Generative Semantic Communication","primary_cat":"eess.SP","submitted_at":"2026-04-28T03:43:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A generative semantic communication method for satellite video achieves 2.5 dB higher PSNR than conventional semantic comms at 45% error rate and remains functional above 80% error by combining semantic encoding with generative reconstruction.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.21681","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Sapiens2","primary_cat":"cs.CV","submitted_at":"2026-04-23T13:45:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Sapiens2 improves pretraining, data scale, and architecture over its predecessor to set new state-of-the-art results on human pose estimation, body-part segmentation, normal estimation, and new tasks like pointmap and albedo estimation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.19648","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"CoCo-SAM3: Harnessing Concept Conflict in Open-Vocabulary Semantic Segmentation","primary_cat":"cs.CV","submitted_at":"2026-04-21T16:37:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"CoCo-SAM3 improves SAM3 by aligning evidence from synonymous prompts for concept consistency and then running inter-class competition on a unified scale to reduce mask overlaps.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.16879","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Adaptive Forensic Feature Refinement via Intrinsic Importance Perception","primary_cat":"cs.CV","submitted_at":"2026-04-18T07:07:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"I2P adaptively selects the most discriminative layers from visual foundation models for synthetic image detection and constrains task updates to low-sensitivity parameter subspaces to improve specificity without harming generalization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.13313","ref_index":46,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Concrete Jungle: Towards Concreteness Paved Contrastive Negative Mining for Compositional Understanding","primary_cat":"cs.LG","submitted_at":"2026-04-14T21:28:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Using lexical concreteness to guide contrastive negative mining and a new margin-based Cement loss, the Slipform framework reaches state-of-the-art on compositional benchmarks for vision-language models.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"8907 CLIP-L 0.9197 0.9279 0.9029 DINOScore 0.6514 0.6707 0.6225 CLIP-V 0.7168 0.7270 0.7000 # Unique↑11050 1103411190 Table 1 | Statistics summary of datasets by sampling bias. Detailed investigations of the ConcreteBatch are performed regarding sampling bias, hardness, and intra-modal similari- ties. CLIPScores are measured with a PE-Core-L-14-336 [46] pretrained on the MetaCLIP-5.4B dataset [47] (total 58B sam- ples seen), DINOv2 [48] is employed for DINOScores, and BERTScores [49] is calculated using the default RoBERTa- large [50] with baseline rescaling. For notational convenience, let Dℎ𝑐, D𝑙𝑐 and D𝑤𝑜 denote datasets generated by perturbing keywords with high, low, and randomly sampled concreteness,"},{"citing_arxiv_id":"2604.12551","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Cross-Attentive Multiview Fusion of Vision-Language Embeddings","primary_cat":"cs.CV","submitted_at":"2026-04-14T10:25:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CAMFusion fuses multiview 2D vision-language embeddings via cross-attention and multiview consistency self-supervision to produce better 3D semantic and instance representations, outperforming averaging and reaching SOTA on benchmarks including zero-shot out-of-domain cases.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.12145","ref_index":53,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Why Your Tokenizer Fails in Information Fusion: A Timing-Aware Pre-Quantization Fusion for Video-Enhanced Audio Tokenization","primary_cat":"eess.AS","submitted_at":"2026-04-13T23:49:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A timing-aware pre-quantization fusion approach integrates visual cues into audio tokenizers along the temporal axis, maintaining reconstruction quality while outperforming audio-only and prior multimodal baselines on downstream tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.12012","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"TIPSv2: Advancing Vision-Language Pretraining with Enhanced Patch-Text Alignment","primary_cat":"cs.CV","submitted_at":"2026-04-13T20:00:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TIPSv2 improves dense patch-text alignment in vision-language pretraining through distillation and iBOT++ modifications, yielding models on par with or better than recent baselines on 9 tasks across 20 datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.11751","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Grounded World Model for Semantically Generalizable Planning","primary_cat":"cs.RO","submitted_at":"2026-04-13T17:25:41+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A vision-language-aligned world model turns visuomotor MPC into a language-following planner that reaches 87% success on 288 unseen semantic tasks where standard VLAs drop to 22%.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[4] Daniel Bolya, Po-Yao Huang, Peize Sun, Jang Hyun Cho, Andrea Madotto, Chen Wei, Tengyu Ma, Jiale Zhi, Jathushan Rajasegaran, Hanoona Rasheed, Junke Wang, Marco Monteiro, Hu Xu, Shiyu Dong, Nikhila Ravi, Daniel Li, Piotr Dollár, and Christoph Feichtenhofer. Perception encoder: The best visual embeddings are not at the output of the network, 2025. URL https: //arxiv.org/abs/2504.13181. [5] Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choro- manski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, Pete Florence, Chuyuan Fu, Montse Gonzalez Arenas, Keerthana Gopalakrishnan, Kehang Han, Karol Hausman, Alexander Herzog, Jasmine Hsu, Brian Ichter, Alex Irpan, Nikhil Joshi, Ryan Julian, Dmitry Kalash-"},{"citing_arxiv_id":"2604.11496","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Revisiting Compositionality in Dual-Encoder Vision-Language Models: The Role of Inference","primary_cat":"cs.CV","submitted_at":"2026-04-13T14:03:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"MODERATE","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Dual-encoder VLMs gain robust compositional generalization by learning localized alignments from frozen patch and token embeddings instead of using global similarity.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.08522","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"UniversalVTG: A Universal and Lightweight Foundation Model for Video Temporal Grounding","primary_cat":"cs.CV","submitted_at":"2026-04-09T17:57:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"UniversalVTG is a lightweight foundation model for video temporal grounding that achieves state-of-the-art results across five benchmarks while being over 100 times smaller than recent MLLM-based methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.06912","ref_index":40,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models","primary_cat":"cs.CV","submitted_at":"2026-04-08T10:12:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Q-Zoom achieves up to 4.39x inference speedup in high-resolution MLLM scenarios via query-aware gating and region localization, matching or exceeding baseline accuracy on document and high-res benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Conversely, the LLaV A series [1], [14] adopts the dense, uncompressed token sequence directly from the vision encoder, projecting it straight into the LLM's feature space. This latter approach has gradually emerged as the mainstream paradigm in modern MLLMs due to its architectural simplicity PREPRINT 3 and empirical effectiveness. However, standard vision en- coders [17], [18], [39], [40] (e.g., CLIP ViT [18]) are typically pre-trained at low resolutions (e.g.,224×224or336×336). This training prior fundamentally bottlenecks their ability to directly encode high-resolution images, severely limiting the fine-grained perceptual capabilities of the resulting MLLMs. To tackle this limitation, recent literature has explored several distinct evolutionary pathways."},{"citing_arxiv_id":"2604.06332","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Telescope: Learnable Hyperbolic Foveation for Ultra-Long-Range Object Detection","primary_cat":"cs.CV","submitted_at":"2026-04-07T18:13:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Telescope uses learnable hyperbolic foveation to deliver a 76% relative mAP gain (0.185 to 0.326) for objects beyond 250 meters while keeping overhead low.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.05418","ref_index":24,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"VideoStir: Understanding Long Videos via Spatio-Temporally Structured and Intent-Aware RAG","primary_cat":"cs.CV","submitted_at":"2026-04-07T04:26:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VideoStir introduces a spatio-temporal graph-based structure and intent-aware retrieval for long-video RAG, achieving competitive performance with SOTA methods via a new IR-600K dataset.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"hour-scale video reasoning via uniform sampling. However, this paradigm faces a major bottleneck: limited context windows necessitate sparse sam- pling across the timeline, which can yield semanti- cally redundant frames while simultaneously risk- ing the loss of fleeting, query-critical cues. In par- allel, contrastive video-language models [21] (e.g., Video-CLIP [22], X-CLIP [23], and PE [ 24]) ex- tend CLIP-style contrastive embeddings to video, enabling query-conditioned retrieval of salient clips and frames. Nevertheless, contrastive objectives primarily optimize for semantic similarity and may miss cues that are implicitly relevant to the query intent but lack an explicit semantic match. 2.2 Agentic Long-Video Understanding To alleviate the contextual intractability inherent in"},{"citing_arxiv_id":"2604.04133","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Learning Robust Visual Features in Computed Tomography Enables Efficient Transfer Learning for Clinical Tasks","primary_cat":"cs.CV","submitted_at":"2026-04-05T14:29:35+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VoxelFM learns robust 3D CT visual features via DINO self-distillation that transfer effectively to seven clinical task categories using frozen backbones and lightweight heads, outperforming prior CT foundation models even on report generation.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"2025.3564382. [14] Weiyun Wang et al.InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency. Aug. 2025.doi:10.48550/arXiv.2508.18265. [15] Shuai Bai et al.Qwen3-VL Technical Report. Nov. 2025.doi:10.48550/arXiv.2511.21631. [16][2103.00020] Learning Transferable Visual Models From Natural Language Supervision. [17] Michael Tschannen et al.SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Under- standing, Localization, and Dense Features. Feb. 2025.doi:10.48550/arXiv.2502.14786. 17 [18] Daniel Bolya et al.Perception Encoder: The best visual embeddings are not at the output of the network. Apr. 2025.doi:10.48550/arXiv.2504.13181. [19] David Fan et al."},{"citing_arxiv_id":"2604.02320","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Large-scale Codec Avatars: The Unreasonable Effectiveness of Large-scale Avatar Pretraining","primary_cat":"cs.CV","submitted_at":"2026-04-02T17:58:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Pretraining on 1M wild videos followed by post-training on curated data yields high-fidelity feedforward 3D avatars that generalize across identities, clothing, and lighting with emergent relightability and loose-garment support.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"line trains on diverse corpora [25, 50, 51] fromin-the-wild data. These models generalize more broadly in a feedfor- ward manner but often produce distortions from unobserved views, blur in body parts, and limited expressivity. Recently, large-scale pre/post-training has achieved re- markable success in resolving the aforementioned trade-off in language modeling [1, 60, 62], vision models [6, 32, 56] and video generation [4, 34, 63]. Pretraining learns broad priors for generalization from million-to-billion-scale train- ing data, and the model is post-trained with high-quality curated data to align the learned representation with a tar- get task. Inspired by the success in adjacent domains, we present Large-scale Codec Avatars (LCA), a pre/post-train"},{"citing_arxiv_id":"2603.03577","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"From Local Matches to Global Masks: Template-Guided Instance Detection and Segmentation in Open-World Scenes","primary_cat":"cs.CV","submitted_at":"2026-03-03T23:11:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"L2G-Det detects and segments novel object instances in open scenes by using local template patch matches to generate points that prompt an augmented SAM for global masks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.01738","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Simplicity Prevails: The Emergence of Generalizable AIGI Detection in Visual Foundation Models","primary_cat":"cs.CV","submitted_at":"2026-02-02T07:20:02+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Frozen features from vision foundation models enable a linear probe to outperform specialized AIGI detectors by over 30% on in-the-wild data due to emergent forgery knowledge from pre-training.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.17817","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Chorus: Multi-Teacher Pretraining for Holistic 3D Gaussian Scene Encoding","primary_cat":"cs.CV","submitted_at":"2025-12-19T17:22:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Chorus pretrains a shared 3D Gaussian scene encoder via multi-teacher distillation to capture holistic features from high-level semantics to fine-grained structure, with strong transfer on segmentation and point-cloud tasks using far fewer scenes.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.13511","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Adapting MLLMs for Nuanced Video Retrieval","primary_cat":"cs.CV","submitted_at":"2025-12-15T16:38:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Text-only contrastive fine-tuning of an MLLM with hard negatives produces embeddings that handle temporal, negation, and multimodal nuances in video retrieval and achieves SOTA performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.08730","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SegEarth-OV3: Exploring SAM 3 for Open-Vocabulary Semantic Segmentation in Remote Sensing Images","primary_cat":"cs.CV","submitted_at":"2025-12-09T15:42:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SAM 3 can be applied training-free to remote sensing open-vocabulary segmentation and change detection by fusing its semantic and instance heads and filtering with presence scores.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.16719","ref_index":12,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SAM 3: Segment Anything with Concepts","primary_cat":"cs.CV","submitted_at":"2025-11-20T18:59:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SAM 3 introduces promptable concept segmentation that doubles accuracy of prior systems on images and videos while improving standard SAM segmentation performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.18457","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"VFM-VAE: Vision Foundation Models Can Be Good Tokenizers for Latent Diffusion Models","primary_cat":"cs.CV","submitted_at":"2025-10-21T09:30:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VFM-VAE uses a frozen VFM directly as LDM tokenizer via a custom decoder, reaching gFID 2.22 in 80 epochs and 1.62 after 640 epochs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.20899","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Concepts in Motion: Temporal Concept Bottleneck Model for Interpretable Video Classification","primary_cat":"cs.CV","submitted_at":"2025-09-25T08:35:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MoTIF adds temporal self-attention and automatic VLM-based concept discovery to concept bottleneck models for interpretable video classification, showing gains over prior global CBMs on benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2508.06248","ref_index":8,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Deepfake Detection that Generalizes Across Benchmarks","primary_cat":"cs.CV","submitted_at":"2025-08-08T12:03:56+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GenD achieves state-of-the-art average cross-dataset AUROC in deepfake detection by parameter-efficient adaptation of a foundational vision encoder with hyperspherical manifold enforcement via L2 normalization and metric learning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.09985","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning","primary_cat":"cs.AI","submitted_at":"2025-06-11T17:57:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"V-JEPA 2 pre-trained on massive unlabeled video achieves strong results on motion understanding and action anticipation, SOTA video QA at 8B scale, and enables zero-shot robotic planning on Franka arms using only 62 hours of unlabeled robot video.","context_count":1,"top_context_role":"method","top_context_polarity":"unclear","context_text":"Frames per Second 4.0 4.0 Crop Size 256 [256, 384, 512] Random Resize Aspect Ratio [0.75 1.35] [0.75, 1.35] Random Resize Scale [0.3, 1.0] [0.3, 1.0] Steps Variable 12000 Warmup Steps 12000 N/A Batch Size (global) 3072 3072 Starting Learning Rate 1e-4 5.25e-4 Final Learning Rate 5.25e-4 1e-6 Weight Decay 0.04 0.04 EMA 0.99925 0.99925 Spatial Mask Scale [0.15, 0.7] [0.15, 0.7] Temporal Mask Scale [1.0, 1.0] [1.0, 1.0] Mask Aspect Ratio [0.75 1.5] [0.75, 1.5] Tubelet Size 2 2 Patch Size 16 16 Training in the first phase began with a learning rate warmup for 12,000 steps followed by a constant learning rate for the rest of the phase. We checked evaluations every 60,000 steps. The cooldown phase began with a learning rate at 5."},{"citing_arxiv_id":"2506.01844","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics","primary_cat":"cs.LG","submitted_at":"2025-06-02T16:30:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SmolVLA is a small efficient VLA model that achieves performance comparable to 10x larger models while training on one GPU and deploying on consumer hardware via community data and chunked asynchronous action prediction.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}