{"total":53,"items":[{"citing_arxiv_id":"2605.31108","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Remembering by Reconstructing: Domain Incremental Learning With Test-Time Training on Video Streams","primary_cat":"cs.CV","submitted_at":"2026-05-29T10:17:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Domain-incremental video learning that permits forgetting through per-domain LoRA adapters and recovers the matching adapter at inference via test-time training on a self-supervised MAE reconstruction head.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.31068","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"HQ-JEPA: Hybrid Quantum Joint-Embedding Predictive Architecture for Cross-Modal Remote Sensing Representation Learning","primary_cat":"cs.CV","submitted_at":"2026-05-29T09:37:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"HQ-JEPA combines JEPA-style predictive self-supervision with cross-modal alignment and a SWAP-test-based quantum fidelity loss for learning representations from paired remote sensing imagery, reporting competitive results on GeoBench tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.28940","ref_index":17,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Neural Scaling Laws for Jet Generation","primary_cat":"hep-ph","submitted_at":"2026-05-27T18:00:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Scaling laws hold logarithmically for model size in autoregressive jet generation, with next-token loss correlating to physical metrics via sliced Wasserstein distance, 
but show weaker scaling for dataset size and compute due to rapid saturation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.27893","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SIGMA: Bridging Structural and Distributional Gaps for Vision Foundation Model Adaptation","primary_cat":"cs.CV","submitted_at":"2026-05-27T03:17:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SIGMA proposes a lightweight PEFT adapter consisting of scale-adaptive fusion and semantic modulation to bridge structural and distributional gaps when adapting vision foundation models to dense tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20892","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"FruitEnsemble: MLLM-Guided Arbitration for Heterogeneous ensemble in Fine-Grained Fruit Recognition","primary_cat":"cs.CV","submitted_at":"2026-05-20T08:31:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"FruitEnsemble uses a weighted ensemble of backbones for top-3 candidates followed by MLLM arbitration on low-confidence samples to reach 70.49% accuracy on a new 306-class fruit dataset.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17837","ref_index":117,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Temporal Aware Pruning for Efficient Diffusion-based Video Generation","primary_cat":"cs.CV","submitted_at":"2026-05-18T04:18:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"TAPE applies temporal-aware token pruning with smoothing, reselection, and 
timestep scheduling to speed up video diffusion models while preserving visual fidelity and coherence.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12430","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"AOI-SSL: Self-Supervised Framework for Efficient Segmentation of Wire-bonded Semiconductors In Optical Inspection","primary_cat":"cs.CV","submitted_at":"2026-05-12T17:27:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AOI-SSL combines small-domain self-supervised pre-training of vision transformers with in-context patch retrieval to reduce labeled data needs and enable fast adaptation for semiconductor wire-bond segmentation.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"study. References [1] John E Ayers. Digital Integrated Circuits: Analysis and Design, page 32. CRC Press, Boca Raton, FL, 2003. 7 [2] Ivana Balažević, David Steiner, Nikhil Parthasarathy, Relja Arandjelović, and Olivier Henaff. Towards in-context scene understanding. Advances in Neural Information Processing Systems, 36:63758-63778, 2023. 2, 3, 5 [3] Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei. BEiT: BERT pre-training of image transformers. arXiv preprint arXiv:2106.08254, 2021. 1 [4] Paul Bergmann, Michael Fauser, David Sattlegger, and Carsten Steger. MVTec AD - A comprehensive real-world dataset for unsupervised anomaly detection. 
In Proceedings of the IEEE/CVF conference on computer vision and pattern"},{"citing_arxiv_id":"2605.11563","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"TCP-SSM: Efficient Vision State Space Models with Token-Conditioned Poles","primary_cat":"cs.CV","submitted_at":"2026-05-12T05:49:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"TCP-SSM conditions stable poles on visual tokens to explicitly control memory decay and oscillation in SSMs, cutting computation up to 44% while matching or exceeding accuracy on classification, segmentation, and detection.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"group size, and low-rank dimension, which may require deployment-specific tuning. While the pole parameterization ensures stable local recurrence dynamics, it does not directly address other sources of inefficiency, such as scan ordering and memory access patterns. Future work can explore native two-dimensional pole-structured recurrences, adaptive scan, and hardware-aware implementations. 9 References [1] Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei. BEiT: BERT pre-training of image transformers. arXiv preprint arXiv:2106.08254, 2021. [2] Amanda M Beck, Mingjian He, Rodrigo G Gutierrez, Gladia C Hotan, and Patrick L Purdon. An iterative search algorithm to identify oscillatory dynamics in neurophysiological time series. 
bioRxiv, pages 2022-10, 2022."},{"citing_arxiv_id":"2605.07338","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ShellfishNet: A Domain-Specific Benchmark for Visual Recognition of Marine Molluscs","primary_cat":"cs.CV","submitted_at":"2026-05-08T06:42:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ShellfishNet is a new benchmark of 8,691 images across 32 mollusc taxa for evaluating vision models on real-world underwater ecological monitoring tasks including robustness to degradation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05646","ref_index":86,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"MUSE: Resolving Manifold Misalignment in Visual Tokenization via Topological Orthogonality","primary_cat":"cs.CV","submitted_at":"2026-05-07T03:53:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MUSE decouples reconstruction and semantic learning in visual tokenization via topological orthogonality, yielding SOTA generation quality and improved semantic performance over its teacher model.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.01741","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Adaptive Texture-aware Masking for Self-Supervised Learning in 3D Dental CBCT Analysis","primary_cat":"cs.CV","submitted_at":"2026-05-03T06:42:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ATMask adaptively masks high inter-slice texture variation regions in 3D CBCT volumes during self-supervised pretraining, enabling more data-efficient learning than random masking on dental tasks with a contributed 
6314-scan dataset.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.00242","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"MAEPose: Self-Supervised Spatiotemporal Learning for Human Pose Estimation on mmWave Video","primary_cat":"cs.CV","submitted_at":"2026-04-30T21:23:03+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.27277","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"BrainDINO: A Brain MRI Foundation Model for Generalizable Clinical Representation Learning","primary_cat":"cs.LG","submitted_at":"2026-04-30T00:21:36+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.00915","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Rethink MAE with Linear Time-Invariant Dynamics","primary_cat":"cs.CV","submitted_at":"2026-04-29T15:07:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Token order in frozen visual representations is exploitable via SSM-based LTI probes, revealing pre-training-dependent heterogeneity that fixed pooling misses.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.21681","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Sapiens2","primary_cat":"cs.CV","submitted_at":"2026-04-23T13:45:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Sapiens2 
improves pretraining, data scale, and architecture over its predecessor to set new state-of-the-art results on human pose estimation, body-part segmentation, normal estimation, and new tasks like pointmap and albedo estimation.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"heads, patch size 16, final norm with [CLS]) at 1024×768. Training uses a joint MAE and contrastive objective: an 8-layer MAE decoder (dim 512) with ℓ2 reconstruction, and a [CLS] projection head for contrastive learning. Loss weights are MAE: 1.0, CLS: 0.4, KoLeo: 0.04. We adopt multi-view training with 2 global and 4 local crops; global crops use random resize-crop in ratio [0.5, 1.0], local crops in [0.2, 0.7], with standard color/blur/solarize and horizontal flips. Inputs are normalized to ImageNet means/stds. Importantly, we do not use color augmentations on the global views - used for masked reconstruction objective. Optimization uses fused AdamW (lr 1×10^-4, (β1, β2) = (0.9, 0.95), wd 0.05) with zero-decay for norms, biases, positional and special tokens."},{"citing_arxiv_id":"2604.21502","ref_index":41,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"VFM$^{4}$SDG: Unveiling the Power of VFMs for Single-Domain Generalized Object Detection","primary_cat":"cs.CV","submitted_at":"2026-04-23T10:04:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VFM4SDG is a dual-prior framework that distills cross-domain stable relations from VFMs into DETR encoders and injects semantic-contextual priors into decoder queries to reduce missed detections in single-domain generalized object detection.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.20329","ref_index":1,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Image Generators are Generalist Vision 
Learners","primary_cat":"cs.CV","submitted_at":"2026-04-22T08:23:48+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"Furthermore, the broad world knowledge acquired during pretraining endows the model with stronger priors on object sizes and distances compared to narrowly targeted models. To enable Nano Banana Pro to estimate depth in metric units, we instruct the model to output a carefully constructed false-color visualization of depth values. To visualize depth maps as RGB images, we establish a mapping between unbounded depth values in [0, ∞) and bounded RGB values in [0, 1]^3. Because the utility of accurate metric depth for nearby image content is generally higher than that of distant content (e.g., graspable objects matter more for robotics tasks, stereo/monodepth benchmarks usually measure accuracy terms of disparity or relative/log-depth) we \"curve\" metric depth prior to RGB encoding. 
Specifically, this is achieved by"},{"citing_arxiv_id":"2604.22838","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Neural Network Optimization Reimagined: Decoupled Techniques for Scratch and Fine-Tuning","primary_cat":"cs.CV","submitted_at":"2026-04-21T06:27:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"DualOpt decouples optimization by using real-time layer-wise weight decay for scratch training and weight rollback for fine-tuning to improve convergence, generalization, and reduce knowledge forgetting.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.16086","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Stylistic-STORM (ST-STORM) : Perceiving the Semantic Nature of Appearance","primary_cat":"cs.CV","submitted_at":"2026-04-17T14:15:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ST-STORM introduces a dual-branch SSL framework that disentangles semantic content from stylistic appearance using gated latent streams, JEPA for content invariance, and adversarial constraints for style capture.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.11487","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"NTIRE 2026 Challenge on Robust AI-Generated Image Detection in the Wild","primary_cat":"cs.CV","submitted_at":"2026-04-13T13:53:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"The NTIRE 2026 challenge provides a dataset of over 294,000 real and AI-generated images with 36 transformations to benchmark robust detection 
models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.08649","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"PRAGMA: Revolut Foundation Model","primary_cat":"cs.LG","submitted_at":"2026-04-09T18:00:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"PRAGMA pre-trains a Transformer on heterogeneous banking events with a tailored self-supervised masked objective, yielding embeddings that support strong downstream performance on credit scoring, fraud detection, and lifetime value prediction using linear heads or light fine-tuning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.08461","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"OVS-DINO: Open-Vocabulary Segmentation via Structure-Aligned SAM-DINO with Language Guidance","primary_cat":"cs.CV","submitted_at":"2026-04-09T16:57:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"OVS-DINO structurally aligns DINO with SAM to revitalize attenuated boundary features, achieving SOTA gains of 2.1% average and 6.3% on Cityscapes in weakly-supervised open-vocabulary segmentation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.05482","ref_index":46,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Unifying VLM-Guided Flow Matching and Spectral Anomaly Detection for Interpretable Veterinary Diagnosis","primary_cat":"cs.CV","submitted_at":"2026-04-07T06:22:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"New public dataset and VLM-guided flow matching segmentation combined with random matrix 
theory anomaly detection for interpretable canine pneumothorax diagnosis.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.02778","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"When Modalities Remember: Continual Learning for Multimodal Knowledge Graphs","primary_cat":"cs.CL","submitted_at":"2026-04-03T06:40:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MRCKG combines a multimodal-structural curriculum, cross-modal preservation, and contrastive replay to let multimodal knowledge graphs learn new entities and relations over time without catastrophic forgetting.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"where ⊕ denotes sequence concatenation, [ENT] is a global aggregation token, s_e ∈ R^d is the learnable structural embedding of entity e, and v̂_{e,1:m} and ŵ_{e,1:n} are the visual and textual token representations after projection through learnable linear layers, respectively. Structural embeddings are obtained by training KGE models. On the visual side, we use a frozen BEiT [2] visual tokenizer and codebook embedding layer to extract discrete visual token representations. On the textual side, we use a frozen BERT [9] tokenizer and word embedding layer to obtain textual token representations. 
These are then mapped into the same d-dimensional space through two learnable linear projection layers, W_v ∈ R^{d_v × d} and W_w ∈ R^{d_w × d}."},{"citing_arxiv_id":"2604.02509","ref_index":38,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Rapidly deploying on-device eye tracking by distilling visual foundation models","primary_cat":"cs.CV","submitted_at":"2026-04-02T21:07:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DistillGaze reduces median gaze error by 58.62% on a 2000+ participant dataset by distilling foundation models into a 256K-parameter on-device model using synthetic labeled data and unlabeled real data.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"VICReg: Variance-invariance-covariance regularization for self-supervised learning. arXiv preprint arXiv:2105.04906, 2021. [37] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16000-16009, 2022. [38] Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei. BEiT: BERT pre-training of image transformers. arXiv preprint arXiv:2106.08254, 2021. [39] Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Cihang Xie, Alan Yuille, and Tao Kong. iBOT: Image BERT pre-training with online tokenizer. arXiv preprint arXiv:2111.07832, 2021. 
[40] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry,"},{"citing_arxiv_id":"2603.23032","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Generative Event Pretraining with Foundation Model Alignment","primary_cat":"cs.CV","submitted_at":"2026-03-24T10:10:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GEP transfers semantic knowledge from image foundation models to event data via alignment and generative pretraining on mixed sequences to create transferable event-based visual models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.03190","ref_index":69,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Expectation and Acoustic Neural Network Representations Enhance Music Identification from Brain Activity","primary_cat":"cs.AI","submitted_at":"2026-03-03T17:47:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Separating acoustic and expectation ANN representations as teacher targets improves EEG music identification beyond baselines and seed ensembles.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.02667","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Unifying Contrastive and Generative Objectives for Visual Understanding and Text-to-Image Generation","primary_cat":"cs.CV","submitted_at":"2026-03-03T06:54:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DREAM introduces Masking Warmup and Semantically Aligned Decoding to let a single encoder handle both contrastive alignment and masked generation, yielding gains over CLIP and FLUID on 
understanding and generation benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.01041","ref_index":61,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Generalizable Deepfake Detection Based on Forgery-aware Layer Masking and Multi-artifact Subspace Decomposition","primary_cat":"cs.CV","submitted_at":"2026-01-03T02:33:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"FMSD improves cross-dataset generalization in deepfake detection by using gradient-based layer masking to select forgery-sensitive weights and SVD to split them into preserved semantic and multiple learnable artifact subspaces with orthogonality constraints.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.13684","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Recurrent Video Masked Autoencoders","primary_cat":"cs.CV","submitted_at":"2025-12-15T18:59:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"RVM uses recurrent computation inside a masked autoencoder to learn video representations that match or exceed prior video and image models on classification, tracking, and dense spatial tasks with up to 30x better parameter efficiency.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.03637","ref_index":40,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"AaSP: Aliasing-aware Self-Supervised Pre-Training for Audio Spectrogram Transformers","primary_cat":"cs.SD","submitted_at":"2025-12-03T10:17:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AaSP learns aliasing-stable audio 
representations by augmenting patch tokens with adaptive subband features from alias-prone bands and using teacher-student masked modeling plus multi-mask contrastive regularization, reaching SOTA on AS-20K, ESC-50, and NSynth under fine-tuning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2508.09691","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PaCo-FR: Patch-Pixel Aligned End-to-End Codebook Learning for Facial Representation Pre-training","primary_cat":"cs.CV","submitted_at":"2025-08-13T10:37:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"PaCo-FR introduces a structured-masking and patch-codebook framework for unsupervised facial representation pre-training that claims state-of-the-art results on multiple facial tasks after training on only 2 million unlabeled images.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2508.06964","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Adversarial Video Promotion Against Text-to-Video Retrieval","primary_cat":"cs.CV","submitted_at":"2025-08-09T12:20:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Pioneers ViPro, the first attack to adversarially promote videos in text-to-video retrieval, using Modal Refinement to improve black-box transferability across multiple targets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.14854","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Frabjous: Deep Learning Fast Radio Burst 
Morphologies","primary_cat":"astro-ph.IM","submitted_at":"2025-07-20T07:46:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Frabjous applies deep learning to classify FRB morphologies into five classes at 55% accuracy by augmenting limited real data with simulations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2502.15315","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Tight Clusters Make Specialized Experts","primary_cat":"cs.LG","submitted_at":"2025-02-21T09:10:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Introduces Adaptive Clustering router for MoE models that scales features to identify tight expert clusters, yielding faster convergence, robustness to corruption, and performance gains.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2502.12524","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"YOLOv12: Attention-Centric Real-Time Object Detectors","primary_cat":"cs.CV","submitted_at":"2025-02-18T04:20:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"YOLOv12 is a new attention-based real-time object detector that reports higher accuracy than YOLOv10, YOLOv11, and RT-DETR variants at comparable or better speed and efficiency.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"ity, but also reduces both computational cost and parameter / memory usage. 3.4. Architectural Improvements In this section, we will introduce the overall architecture and some improvements over the vanilla attention mechanism. Some of them are not initially proposed by us. 
Many attention-centric vision transformers are designed with the plain-style architectures [1, 18, 19, 21, 25, 51], while we retain the hierarchical design of the previous YOLO systems [3, 24, 28, 29, 32, 45-47, 53, 57, 58] and will demonstrate the necessity of this. We remove the design of stacking three blocks in the last stage of the backbone, which is present in recent versions [24, 28, 53, 58]. Instead, we retain only a single R-ELAN block, reducing"},{"citing_arxiv_id":"2502.03387","ref_index":155,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LIMO: Less is More for Reasoning","primary_cat":"cs.CL","submitted_at":"2025-02-05T17:23:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LIMO achieves 63.3% on AIME24 and 95.6% on MATH500 via supervised fine-tuning on roughly 1% of the data used by prior models, supporting the claim that minimal strategic examples suffice when pre-training has already encoded domain knowledge.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2501.09411","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Towards Robust and Realistic Human Pose Estimation via WiFi Signals","primary_cat":"cs.CV","submitted_at":"2025-01-16T09:38:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"DT-Pose reformulates WiFi HPE as domain-consistent representation learning via temporal contrastive masked pretraining plus hybrid topology-constrained decoding to yield more accurate and realistic 2D/3D poses.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2412.14803","ref_index":79,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Video Prediction Policy: A Generalist Robot Policy with 
Predictive Visual Representations","primary_cat":"cs.CV","submitted_at":"2024-12-19T12:48:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Video Prediction Policy conditions robot action learning on future-frame predictions inside fine-tuned video diffusion models, yielding 18.6% relative gains on Calvin ABC-D and 31.6% higher real-world success rates.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2412.08110","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"The ART of Composition: Attention-Regularized Training for Compositional Visual Grounding","primary_cat":"cs.CV","submitted_at":"2024-12-11T05:36:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CompART adds a composition loss on decomposed captions to regularize attention sums and improves multi-object grounding plus VQA across four VLM types and six benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2411.04996","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models","primary_cat":"cs.CL","submitted_at":"2024-11-07T18:59:06+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MoT decouples non-embedding parameters by modality in transformers to match dense multi-modal performance with roughly one-third to one-half the FLOPs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2410.21169","ref_index":16,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Document Parsing Unveiled: Techniques, Challenges, 
and Prospects for Structured Information Extraction","primary_cat":"cs.MM","submitted_at":"2024-10-28T16:11:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"Survey proposing a taxonomy for document parsing into pipeline-based systems and VLM-driven unified models, reviewing components, metrics, benchmarks, and challenges.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"2 Transformer-based Methods. Recent advances in Transformer models have extended their application in DLA. BEiT (Bidirectional Encoder Representation from Image Transformers), inspired by BERT, employs self-supervised pretraining to learn robust image representations, excelling at extracting global document features such as titles, paragraphs, and tables [16]. The Document Image Transformer (DiT), with its Vision Transformer (ViT)-like architecture, splits document images into patches to enhance layout analysis. However, these models are computationally intensive and require extensive pretraining [115]. 
Recent work, such as [1, 15], also focuses on using transformers for classification tasks based on document visual features."},{"citing_arxiv_id":"2408.13471","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Disentangled Generative Graph Representation Learning","primary_cat":"cs.LG","submitted_at":"2024-08-24T05:13:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"DiGGR introduces a self-supervised graph representation learning framework that disentangles latent factors to guide mask modeling and improve representation quality on graph tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2405.09818","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Chameleon: Mixed-Modal Early-Fusion Foundation Models","primary_cat":"cs.CL","submitted_at":"2024-05-16T05:23:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Chameleon is an early-fusion token model that handles mixed image-text sequences for understanding and generation, achieving competitive or superior performance to larger models like Llama-2, Mixtral, and Gemini-Pro on captioning, VQA, text, and image tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2404.08471","ref_index":224,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Revisiting Feature Prediction for Learning Visual Representations from Video","primary_cat":"cs.CV","submitted_at":"2024-02-15T18:59:11+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"V-JEPA models trained only on feature prediction from 2 million public videos achieve 81.9% on Kinetics-400, 72.2% on Something-Something-v2, 
and 77.9% on ImageNet-1K using frozen ViT-H/16 backbones.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"3D convolution comprising d filters of size 2 × 16 × 16 with a temporal stride of 2 and a spatial stride of 16, resulting in a tensor of shape 8 × 14 × 14 × d. Next we add absolute 3D sin-cos positional embeddings to the spatio-temporal feature map and flatten it, resulting in a 1D token sequence of shape 1568 × d. This process is demonstrated in Figure 7. [16 x 224 x 224 x 3] 3D Conv [2 x 16 x 16 x d] [8 x 14 x 14 x d] 3D sin-cos absolute position embeddings [8 x 14 x 14 x d] [1568 x d] +16 video frames resolution 224 x 224 flatten Figure 7 V-JEPA training operates on a video clip flattened into a sequence of tokens. To convert a video clip of size 16 × 224 × 224 × 3 into a 1D token sequence, we apply a 3D convolution comprising d filters of size 2 × 16 × 16 with a temporal"},{"citing_arxiv_id":"2309.16671","ref_index":85,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Demystifying CLIP Data","primary_cat":"cs.CV","submitted_at":"2023-09-28T17:59:56+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MetaCLIP curates balanced 400M-pair subsets from CommonCrawl that outperform CLIP data, reaching 70.8% zero-shot ImageNet accuracy on ViT-B versus CLIP's 68.3%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2304.02643","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Segment Anything","primary_cat":"cs.CV","submitted_at":"2023-04-05T17:59:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A promptable model trained on 1B masks achieves competitive zero-shot segmentation performance across tasks and is released publicly with its 
dataset.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[2] Bogdan Alexe, Thomas Deselaers, and Vittorio Ferrari. What is an object? CVPR, 2010. 4, 10 [3] Pablo Arbeláez, Michael Maire, Charless Fowlkes, and Jitendra Malik. Contour detection and hierarchical image segmentation. TPAMI, 2010. 4, 10, 21, 28 [4] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv:1607.06450, 2016. 16 [5] Hangbo Bao, Li Dong, and Furu Wei. BEiT: BERT pre-training of image transformers. arXiv:2106.08254, 2021. 17 [6] Dina Bashkirova, Mohamed Abdelfattah, Ziliang Zhu, James Akl, Fadi Alladkani, Ping Hu, Vitaly Ablavsky, Berk Calli, Sarah Adel Bargal, and Kate Saenko. ZeroWaste dataset: Towards deformable object segmentation in cluttered scenes. CVPR, 2022."},{"citing_arxiv_id":"2303.15389","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"EVA-CLIP: Improved Training Techniques for CLIP at Scale","primary_cat":"cs.CV","submitted_at":"2023-03-27T17:02:21+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"EVA-CLIP delivers improved CLIP training recipes that yield 82.0% zero-shot ImageNet-1K accuracy for a 5B-parameter model after only 9 billion samples.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"batch size 131k / 131k / 61k samples seen 8B / 4B / 2B drop path [26] 0.0 random resized crop (0.9, 1) numerical precision DeepSpeed fp16 [43] ZeRO optimizer [42] stage 1 Table 8: EVA-CLIP-B and EVA-CLIP-L training setting. References [1] Reaching 80 zero-shot accuracy with openclip: Vit-g/14 trained on laion-2b. https://laion.ai/blog/giant-openclip/. 4 [2] Hangbo Bao, Li Dong, and Furu Wei. Beit: Bert pre-training of image transformers. arXiv preprint arXiv:2106.08254, 2021. 
5, 6 [3] Andrei Barbu, David Mayo, Julian Alverio, William Luo, Christopher Wang, Dan Gutfreund, Josh Tenenbaum, and Boris Katz. Objectnet: A large-scale bias-controlled dataset for pushing the limits of object recognition models."},{"citing_arxiv_id":"2302.12288","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth","primary_cat":"cs.CV","submitted_at":"2023-02-23T19:13:10+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ZoeDepth combines relative depth pre-training on many datasets with metric depth fine-tuning and automatic head routing to achieve strong zero-shot generalization while preserving metric scale.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2205.01917","ref_index":22,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"CoCa: Contrastive Captioners are Image-Text Foundation Models","primary_cat":"cs.CV","submitted_at":"2022-05-04T07:01:14+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CoCa unifies contrastive and generative pretraining in one image-text model to reach 86.3% zero-shot ImageNet accuracy and new state-of-the-art results on multiple downstream benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}