{"total":15,"items":[{"citing_arxiv_id":"2605.22248","ref_index":75,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"No Epoch Like the Present: Robust Climate Emulation Requires Out-of-Distribution Generalisation","primary_cat":"cs.LG","submitted_at":"2026-05-21T09:54:57+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ML climate emulators degrade under seasonal distribution shifts that proxy long-term climate change, but physically motivated compositional decompositions improve out-of-distribution performance with modest in-distribution trade-offs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21324","ref_index":25,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Stimulus symmetries can confound representational similarity analyses","primary_cat":"q-bio.NC","submitted_at":"2026-05-20T15:51:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Stimulus symmetries render many neural representations functionally equivalent yet produce qualitatively different RSMs, including drifting ones from SGD or regularization in image-encoding networks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20549","ref_index":42,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"MAPS: A Synthetic Dataset for Probing Vision Models in a Controlled 3D Scene Space","primary_cat":"cs.CV","submitted_at":"2026-05-19T22:51:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MAPS provides 2618 validated 3D meshes and a controllable rendering pipeline to attribute vision model recognition failures to specific scene parameters, finding camera distance and 
elevation as the dominant failure factors across 20 tested models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19359","ref_index":17,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"MAM-CLIP: Vision-Language Pretraining on Mammography Atlases for BI-RADS Classification","primary_cat":"cs.CV","submitted_at":"2026-05-19T04:42:05+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Contrastive pretraining on mammography atlas image-text pairs improves BI-RADS classification F1 by 1-14% especially in low-label regimes, outperforming equivalent numbers of direct labels in some settings.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18735","ref_index":31,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"PIXLRelight: Controllable Relighting via Intrinsic Conditioning","primary_cat":"cs.CV","submitted_at":"2026-05-18T17:55:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A transformer-based neural renderer that transfers arbitrary PBR lighting to single images via shared intrinsic conditioning extracted from both multi-illumination photos and path-traced coarse 3D renders.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11300","ref_index":37,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Can Graphs Help Vision SSMs See Better?","primary_cat":"cs.CV","submitted_at":"2026-05-11T22:40:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"GraphScan replaces geometric or coordinate-based scanning in Vision SSMs with learned local semantic graph routing, yielding SOTA 
results among such models on classification and segmentation tasks.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"neighborhood size, relative positional bias, and pre-SSM insertion, and visualizations confirm that GraphScan induces interpretable displacement fields over the token lattice. Together, these results suggest a new design principle: scanning should not be treated as geometric serialization but as learned semantic routing before global state-space modeling. 2 Related Work 2.1 Vision Backbones and SSMs Vision backbones span convolutional [37, 65, 74], Transformer [10, 54, 36, 17, 77], MLP-mixing [53], and graph-based [16, 66] families; we refer to App. B for a fuller treatment. State-space sequence models provide efficient long-range mixing with near-linear scaling [15, 14, 48, 11, 42], and Mamba introduced selective state spaces with input-dependent recurrence parameters [13], later extended"},{"citing_arxiv_id":"2605.11203","ref_index":27,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"FeatMap: Understanding image manipulation in the feature space and its implications for feature space geometry","primary_cat":"cs.LG","submitted_at":"2026-05-11T20:12:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Linear mappings in feature space can reconstruct a wide range of image manipulations including semantic edits, suggesting that feature representations are approximately linearly organized.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"to represent two prominent and complementary design paradigms, convolutional and transformer-based, allowing us to assess whether findings generalize across architectural designs. 
Both models are strong, general-purpose backbones that have demonstrated state-of-the-art performance on a wide range of vision tasks [26]. Assessing the semantic retainment. Recent discussions in the field of self-supervised learning [27] emphasized that features suitable for reconstruction do not necessarily coincide with features suitable for semantic tasks. We therefore also assess the semantic quality of the mapped features as compared to the original features by means of a finetuned downstream classifier. To evaluate the semantic quality of the learned mappings using a downstream classifier, we finetune both backbones on Stanford"},{"citing_arxiv_id":"2605.07338","ref_index":64,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"ShellfishNet: A Domain-Specific Benchmark for Visual Recognition of Marine Molluscs","primary_cat":"cs.CV","submitted_at":"2026-05-08T06:42:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ShellfishNet is a new benchmark of 8,691 images across 32 mollusc taxa for evaluating vision models on real-world underwater ecological monitoring tasks including robustness to degradation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08276","ref_index":30,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Beyond ViT Tokens: Masked-Diffusion Pretrained Convolutional Pathology Foundation Model for Cell-Level Dense Prediction","primary_cat":"cs.CV","submitted_at":"2026-05-08T04:34:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A masked-diffusion pretrained convolutional model outperforms ViT pathology foundation models on cell-level dense prediction tasks in 
histology.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"pretraining [3], while diffusion models learn visual structure through denoising objectives. Masked diffusion [29] further views diffusion as time-conditioned reconstruction, suggesting that the corruption process can be designed for representation learning rather than image synthesis alone. In parallel, modern convolutional architectures such as ConvNeXt [30] and ConvNeXt V2 [31] provide strong locality bias and efficient multi-scale feature extraction, making convolutional pretraining an important alternative to token-based visual representation learning."},{"citing_arxiv_id":"2605.06274","ref_index":9,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"When Labels Have Structure: Improving Image Classification with Hierarchy-Aware Cross-Entropy","primary_cat":"cs.LG","submitted_at":"2026-05-07T13:49:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Hierarchy-Aware Cross-Entropy improves image classification by incorporating class hierarchies into the loss through prediction aggregation and ancestral label smoothing, achieving mean accuracy gains of 4.66% in end-to-end training and 2.18% in linear probing.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.01711","ref_index":24,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Linear-Time Global Visual Modeling without Explicit Attention","primary_cat":"cs.CV","submitted_at":"2026-05-03T04:51:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Dynamic parameterization of standard layers can replace explicit attention for linear-time global visual 
modeling.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.15555","ref_index":26,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"CXR-LT 2026 Challenge: Multi-Center Long-Tailed and Zero Shot Chest X-ray Classification","primary_cat":"cs.CV","submitted_at":"2026-04-16T22:10:09+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CXR-LT 2026 introduces a radiologist-annotated multi-center dataset of 145k+ CXRs to benchmark robust multi-label classification on known classes and open-world generalization to unseen rare diseases.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.04552","ref_index":18,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"StableTTA: Improving Vision Model Performance by Training-free Test-Time Adaptation Methods","primary_cat":"cs.CV","submitted_at":"2026-04-06T09:21:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"StableTTA improves ImageNet-1K accuracy across 71 vision models by stabilizing logit aggregation under coherent-batch inference and enabling efficient single-forward-pass adaptation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.09138","ref_index":6,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Rotation Equivariant Mamba for Vision Tasks","primary_cat":"cs.CV","submitted_at":"2026-03-10T03:22:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"EQ-VMamba adds rotation-equivariant cross-scan and group Mamba blocks to enforce end-to-end rotation equivariance, yielding better rotation robustness, competitive accuracy, and roughly 50% 
fewer parameters than non-equivariant baselines across classification, segmentation, and super-resolution.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.22813","ref_index":5,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"TRUST: Test-Time Refinement using Uncertainty-Guided SSM Traverses","primary_cat":"cs.CV","submitted_at":"2025-09-26T18:19:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TRUST is a test-time adaptation method for SSM vision models that uses uncertainty-guided traversal permutations to refine Mamba parameters via pseudo-labels and weight averaging, improving robustness on distribution shifts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}