{"total":22,"items":[{"citing_arxiv_id":"2605.18491","ref_index":26,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Benchmarking transferability of SSL pretraining to same and different modality segmentation tasks","primary_cat":"cs.CV","submitted_at":"2026-05-18T14:41:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SMIT, which combines masked image modeling with self-distillation, delivers the highest segmentation accuracy, fastest convergence, and best few-shot performance across nine CT and MRI tasks compared to contrastive and rotation-based SSL methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17373","ref_index":18,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"FML-bench: A Controlled Study of AI Research Agent Strategies from the Perspective of Search Dynamics","primary_cat":"cs.LG","submitted_at":"2026-05-17T10:30:38+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16836","ref_index":48,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"HYVINT: Intensity-Driven Hypergraph Generation with Variational Representations","primary_cat":"stat.ML","submitted_at":"2026-05-16T06:38:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"HYVINT introduces an intensity-driven incidence mechanism and tractable variational estimator for hypergraph generation, with error bounds and empirical gains in fidelity, novelty, and diversity.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15571","ref_index":11,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"MaxSketch: Robust Distinct Counting in Streams via Random Projections","primary_cat":"stat.ML","submitted_at":"2026-05-15T03:29:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MaxSketch achieves O~(log n / ε²) memory for (1+ε)-approximate distinct counting in streams with geometric structure via max-linear random projections.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12969","ref_index":31,"ref_count":2,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Revisiting Reinforcement Learning with Verifiable Rewards from a Contrastive Perspective","primary_cat":"cs.LG","submitted_at":"2026-05-13T04:02:36+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12451","ref_index":34,"ref_count":2,"confidence":0.55,"is_internal_anchor":false,"paper_title":"FuTCR: Future-Targeted Contrast and Repulsion for Continual Panoptic Segmentation","primary_cat":"cs.CV","submitted_at":"2026-05-12T17:41:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FuTCR improves new-class panoptic quality by up to 28% in continual panoptic segmentation by discovering future-like regions in background areas and applying targeted contrast and repulsion to restructure representations.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"for contrastive learning of visual representations. InInternational conference on machine learning, pages 1597-1607. PmLR, 2020. [33] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9729-9738, 2020. [34] Viktor Olsson, Wilhelm Tranheden, Juliano Pinto, and Lennart Svensson. Classmix: Segmentation-based data augmentation for semi-supervised learning. InProceedings of the IEEE/CVF winter conference on applications of computer vision, pages 1369-1378, 2021. [35] Zihan Lin, Zilei Wang, and Yixin Zhang. Preparing the future for continual semantic segmenta-"},{"citing_arxiv_id":"2605.12122","ref_index":16,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Disentangled Sparse Representations for Concept-Separated Diffusion Unlearning","primary_cat":"cs.LG","submitted_at":"2026-05-12T13:39:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SAEParate disentangles sparse representations in diffusion models via contrastive clustering and nonlinear encoding to enable more precise concept unlearning with reduced side effects.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Scaling and evaluating sparse autoencoders, 2024. URL https://arxiv.org/abs/2406.04093. [15] Wes Gurnee, Neel Nanda, Matthew Pauly, Katherine Harvey, Dmitrii Troitskii, and Dimitris Bertsimas. Finding neurons in a haystack: Case studies with sparse probing.Transactions on Machine Learning Research, 2023. ISSN 2835-8856. URL https://openreview.net/ forum?id=JYs1R9IMJr. [16] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9729-9738, 2020. [17] Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus).arXiv preprint arXiv:1606.08415, 2016."},{"citing_arxiv_id":"2605.08802","ref_index":8,"ref_count":2,"confidence":0.55,"is_internal_anchor":false,"paper_title":"CoLVR: Enhancing Exploratory Latent Visual Reasoning via Contrastive Optimization","primary_cat":"cs.CV","submitted_at":"2026-05-09T08:47:00+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CoLVR uses latent contrastive objectives with angle-based perturbation and RL trajectory rewards to increase exploratory visual reasoning in MLLMs, delivering 5-8% gains on VSP, Jigsaw, and MMStar benchmarks.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"with it to form a misleading negative feature: z= (cosθ neg ·∆ norm + sinθ neg ·η norm)· ||∆||, Sneg=SIhint+z=SI+∆+z. This procedure produces a set of positive and negative samples distributed around the true reasoning trajectory. During training, we repeat the above process Nneg times and optimize the latent represen- tations using the InfoNCE loss [8], which encourages closeness to positive samples and separation from the structured negative samples: Lcontras=− 1 K KX i log esim(hi,SIhint )/τ esim(hi,SIhint )/τ+PNneg j esim(hi,Sj neg)/τ ! , where τ represents the temperature. In addition, we introduce a cross-entropy loss on the answer to preserve the model's text generation capability: LcontrasSF T =λ2Lcontras+ℓCE ."},{"citing_arxiv_id":"2605.07338","ref_index":38,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"ShellfishNet: A Domain-Specific Benchmark for Visual Recognition of Marine Molluscs","primary_cat":"cs.CV","submitted_at":"2026-05-08T06:42:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ShellfishNet is a new benchmark of 8,691 images across 32 mollusc taxa for evaluating vision models on real-world underwater ecological monitoring tasks including robustness to degradation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07178","ref_index":12,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Masks Can Talk: Extracting Structured Text Information from Single-Modal Images for Remote Sensing Change Detection","primary_cat":"cs.CV","submitted_at":"2026-05-08T03:16:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"S2M extracts structured text quadruples from change masks to provide noise-free multimodal supervision, achieving 17.80% Sek and 66.14% F_scd on the new Gaza-Change-v2 dataset and outperforming LLM-based multimodal methods.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"boundary refinement, and multimodal alignment. For the segmentation task, we adopt a hybrid seg- mentation loss combiningFocal Loss[ 16],Dice Loss[ 25], andLovász Loss[ 2]. The segmentation loss is defined as: Lseg =αL focal +βL dice +γL lovász.(5) Bidirectional vision-text contrastive loss.To align the visual and textual modalities, we employ a bidirectional infoNCE loss based on [12]. Let Ri I be the global visual feature of sample i, and Ri T the corresponding text embedding. A similarity matrix Sij = Ri I ·Rj T/τ is computed with temperature τ. The vision-to-text (V→T) and text-to-vision (T→V) losses are: LV→T=− 1 B BX i=1 log exp(Sii)P j exp(Sij) ,L T→V=− 1 B BX i=1 log exp(Sii)P j exp(Sji) .(6) The bidirectional contrastive loss is their average:"},{"citing_arxiv_id":"2605.07055","ref_index":24,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Pan-FM: A Pan-Organ Foundation Model with Saliency-Guided Masking for Missing Robustness","primary_cat":"cs.CV","submitted_at":"2026-05-08T00:04:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Pan-FM learns balanced representations across seven organs by adaptively masking dominant organs during pre-training, yielding stronger disease prediction and missing-organ robustness than single-organ or naive multimodal baselines on UK Biobank.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"An image is worth 16x16 words: Transformers for image recognition at scale.arXiv preprint arXiv:2010.11929, 2020. [23] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent-a new approach to self-supervised learning. volume 33, pages 21271-21284, 2020. [24] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9729-9738, 2020. [25] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners."},{"citing_arxiv_id":"2605.06990","ref_index":19,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"TRAJGANR: Trajectory-Centric Urban Multimodal Learning via Geospatially Aligned Neural Representations","primary_cat":"cs.CV","submitted_at":"2026-05-07T22:10:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"TrajGANR learns continuous neural representations of trajectories to enable fine-grained alignment with street-view images and locations in a joint multimodal self-supervised objective, outperforming prior geospatial MSSL methods on urban mobility and road tasks.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"The first term aligns each street-view image embedding h(I) k with its corresponding SVI location embedding h(x) k , which is very similar to many previous location-aware geospatial MSSL frame- works [21, 31, 5, 13]. The second term in Eq. (7) aligns each localized trajectory embedding ei[x(I) k ] with the same location embedding. For these two losses, negativesNloc are drawn from a MoCo-style queue [19] of random and trajectory locations, providing a large, diverse set of negatives. The third term aligns each localized trajectory embedding with its corresponding street-view image embedding, using embeddings of other street-view images in the minibatch as negatives Nimg. The second and third loss terms make TRAJGANR distinct from all existing geospatial MSSL frameworks."},{"citing_arxiv_id":"2605.04943","ref_index":20,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"DART: A Vision-Language Foundation Model for Comprehensive Rope Condition Monitoring","primary_cat":"cs.CV","submitted_at":"2026-05-06T14:12:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DART is a cross-modal foundation model that delivers rope damage classification, severity regression, and few-shot recognition from a single frozen representation trained on 4270 images across 14 damage classes.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.02589","ref_index":49,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Representation learning from OCT images","primary_cat":"cs.CV","submitted_at":"2026-05-04T13:37:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"A structured survey of representation learning methods for retinal OCT image analysis, covering supervised, self-supervised, generative, multimodal, and foundation model approaches along with datasets and open problems.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"learning approaches train models end-to-end using manually annotated labels, relying on CNN-based architectures such as ResNet [45] and DenseNet [46] or encoder-decoder designs such as U-Net [47] variants to learn discriminative features directly from labeled B-scans. Self-supervised learning methods instead construct supervisory signals from the data itself, through contrastive objectives such as SimCLR [48], MoCo [49], and BYOL [50], reconstruction-based objectives such as autoencoders, variational autoencoders, and masked image modeling, or pretext tasks such as rotation prediction, patch ordering, and inpainting. Generative approaches model the underlying data distribution explicitly, using variational autoencoders, generative adversarial networks, or the more recently introduced diffusion models, to learn"},{"citing_arxiv_id":"2605.02292","ref_index":15,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Momentum-Anchored Multi-Scale Fusion Model for Long-Tailed Chest X-Ray Classification","primary_cat":"cs.CV","submitted_at":"2026-05-04T07:28:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A new neural network stabilizes features for rare chest X-ray diseases via momentum anchoring and multi-scale fusion on EfficientNet, achieving 0.8682 AUC on ChestX-ray14.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.16801","ref_index":47,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Continuous Limits of Coupled Flows in Representation Learning","primary_cat":"cs.LG","submitted_at":"2026-04-18T03:19:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Discrete decentralized learning dynamics on manifolds converge uniformly to an overdamped Langevin SDE whose stationary states produce orthogonally disentangled, linearly separable features.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Stochastic Analysis and Applications, 16(3):547-566, 1998. [45] Brian C Hall.Lie groups, Lie algebras, and representations: an elementary introduction, volume 222. Springer, 2015. [46] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770-778, 2016. [47] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9729-9738, 2020. [48] Matthias Hein, Jean-Yves Audibert, and Ulrike V on Luxburg. From graphs to manifolds-weak and strong pointwise consistency of graph laplacians."},{"citing_arxiv_id":"2604.12805","ref_index":35,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Image-to-Image Translation Framework Embedded with Rotation Symmetry Priors","primary_cat":"cs.CV","submitted_at":"2026-04-14T14:35:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Rotation-equivariant convolutions and adaptive TL-Conv layers are added to I2I networks to preserve rotation symmetry and improve translation quality across domains.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.11389","ref_index":40,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"ConvFormer3D-TAP: Phase/Uncertainty-Aware Front-End Fusion for Cine CMR View Classification Pipelines","primary_cat":"cs.CV","submitted_at":"2026-04-13T12:29:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ConvFormer3D-TAP classifies six cine CMR views at 96% accuracy using 3D conv tokenization, multiscale attention, and uncertainty-aware multi-clip fusion on 150k sequences.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.09537","ref_index":26,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Case-Grounded Evidence Verification: A Framework for Constructing Evidence-Sensitive Supervision","primary_cat":"cs.CL","submitted_at":"2026-04-10T17:55:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A supervision construction procedure generates explicit support and controlled non-support examples (counterfactual and topic-related negatives) without manual annotation, producing verifiers that demonstrate genuine evidence dependence in radiology tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.04496","ref_index":23,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"The Indra Representation Hypothesis for Multimodal Alignment","primary_cat":"cs.CV","submitted_at":"2026-04-06T07:46:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Unimodal model representations converge to a relational structure captured by the Indra representation via V-enriched Yoneda embedding, which is unique and structure-preserving and improves cross-model and cross-modal robustness when instantiated with angular distance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.02509","ref_index":32,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Rapidly deploying on-device eye tracking by distilling visual foundation models","primary_cat":"cs.CV","submitted_at":"2026-04-02T21:07:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DistillGaze reduces median gaze error by 58.62% on a 2000+ participant dataset by distilling foundation models into a 256K-parameter on-device model using synthetic labeled data and unlabeled real data.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Adrien Bardes, Gregoire Mialon, Yuandong Tian, et al. A cookbook of self-supervised learning.arXiv preprint arXiv:2304.12210, 2023. [31] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. InInternational conference on machine learning, pages 1597-1607. PmLR, 2020. [32] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9729-9738, 2020. [33] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl"},{"citing_arxiv_id":"2604.01833","ref_index":19,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Language-Pretraining-Induced Bias: A Strong Foundation for General Vision Tasks","primary_cat":"cs.CV","submitted_at":"2026-04-02T09:46:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Random label bridge training aligns LLM parameters with vision tasks, and partial training of certain layers often suffices due to their foundational properties.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Linear probing experiments use a batch size of 128 for 100 epochs on an L20 GPU. And all the sacle up experiments for Qwen3-8B, Qwen3-14B and LLaVA series are all conducted on 8×A100 GPUs. All experiments use a consistent training setup with 224×224 input resolution, a learning rate of 1e-3 (Adam optimizer, cosine annealing), weight decay of 0.05, drop path regularization [19] of 0.1, and gradient clipping with max norm 1.0. Hyperparameter selection and robustness.We clarify how hyperparameters are chosen and evaluate sensitivity to optimization choices. We acknowledge that learning rate and weight decay can affect absolute performance; thus, we perform a small learning-rate×weight-decay sweep. To avoid unintentionally favoring"}],"limit":50,"offset":0}