{"work":{"id":"b353bda2-591d-479a-9c8b-22dfcba12431","openalex_id":"https://openalex.org/W2194775991","doi":"10.1109/cvpr.2016.90","arxiv_id":null,"raw_key":null,"title":"In: IEEE Conference on Computer Vision and Pattern Recognition","authors":[{"given":"Kaiming","family":"He","sequence":"first","affiliation":[]},{"given":"Xiangyu","family":"Zhang","sequence":"additional","affiliation":[]},{"given":"Shaoqing","family":"Ren","sequence":"additional","affiliation":[]},{"given":"Jian","family":"Sun","sequence":"additional","affiliation":[]}],"authors_text":"He, K","year":2016,"venue":"2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)","abstract":null,"external_url":"https://doi.org/10.1109/cvpr.2016.90","cited_by_count":164175,"metadata_source":"doi_reference","metadata_fetched_at":"2026-07-01T06:35:28.601150+00:00","pith_arxiv_id":null,"created_at":"2026-05-08T18:23:55.272054+00:00","updated_at":"2026-07-01T06:35:28.601150+00:00","title_quality_ok":false,"display_title":"Deep residual learning for image recognition","render_title":"Deep residual learning for image recognition"},"hub":{"state":{"work_id":"b353bda2-591d-479a-9c8b-22dfcba12431","tier":"mega_hub","tier_reason":"1,000+ Pith inbound or 100,000+ external 
citations","pith_inbound_count":190,"external_cited_by_count":164175,"distinct_field_count":38,"first_pith_cited_at":"2019-06-20T20:30:39+00:00","last_pith_cited_at":"2026-06-30T09:08:35+00:00","author_build_status":"needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"needed","reader_status":"needed","recognition_status":"needed","updated_at":"2026-07-01T07:51:06.098375+00:00","tier_text":"mega_hub"},"tier":"mega_hub","role_counts":[{"context_role":"method","n":18},{"context_role":"background","n":14},{"context_role":"baseline","n":2},{"context_role":"dataset","n":1}],"polarity_counts":[{"context_polarity":"use_method","n":16},{"context_polarity":"background","n":14},{"context_polarity":"baseline","n":2},{"context_polarity":"unclear","n":2},{"context_polarity":"use_dataset","n":1}],"runs":{"ask_index":{"job_type":"ask_index","status":"succeeded","result":{"title":"Deep residual learning for image recognition","claims":[{"claim_text":"These channels are not independent signals but jointly represent a single complex-valued measurement, where the relationship between them encodes the local phase. Unlike magnitude-only approaches, where a single intensity channel is compressed, this coupling must be explicitly preserved. The architecture, loss function, and evaluation metrics described below are designed accordingly. The architecture is implemented as a ResNet-based [20] conditional variational autoencoder (CVAE) [21]. The encod","claim_type":"method","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"Together, these considerations make a scalable, high-speed, and robust reconstruction capable of operating at Monte Carlo scale essential for Hyper-Kamiokande. Machine-learning based reconstruction offers a promising path toward meeting these computational and topological chal- lenges. 
Convolutional neural networks [ 16], and in particular residual networks (ResNets) [17], are well suited to process the high-dimensional charge and time images recorded by the PMT array. At Super-Kamiokande, machi","claim_type":"method","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"Instead of binary classification, our model classifies into four states (LL,L,H,HH), and instead of training CNN feature extractors from scratch, we use pre-trained ResNet50 using transfer learning. The model architecture is shown in Figure 3. 3.6.1 Feature extraction.The first step is to extract features from each of the seven images. Here we apply transfer learning using ResNet50 [22], pre-trained on a large dataset. We extract information from the penultimate layer of ResNet50, compressing ea","claim_type":"method","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"historical video and recomputes attention upon query arrival. (2) ReKV [12] retrieves query-relevant KVCache at the token level. (3) LiveVLM [13] further combines token-level retrieval with KVCache compression to reduce memory usage. (4) StreamMem [14] also compresses KVCache, but under a TABLE II DATASET CONFIGURATIONS. Dataset Max Length Description MLVU [19] 703s multi-task long video LongVideoBench [20] 468s long-term multi-modal video VideoMME [21] 1,018s full-spectrum multi-modal video RVS","claim_type":"dataset","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"Training on such data could reinforce areas where AI systems are vulnerable [37, 796], enhancing their robustness in real-world applications. Adversarial examples can be constructed in various ways. One straightforward approach is to add small perturbations to inputs, which preserves their original labels while introducing adversarial characteristics [100, 260, 300, 504]. 
Another effective strategy is red teaming, which usually involves human teams systematically testing to find vulnerabilities ","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"histopathological images [2], [4], [5], [6]. CNN have been widely adopted for cancer detection due to their ability to capture local texture patterns and hierarchical spatial features. Residual learning has been introduced to alleviate the vanishing gradient problem, leading to significant improvements in deep feature representation, as exemplified by ResNet architectures [7]. Similarly, DenseNet and kernel architectures enhance feature reuse and gradient flow, while EfficientNet achieves state-","claim_type":"method","confidence":0.9,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks Deep residual learning for image recognition because it crossed a citation-hub threshold. Current citing contexts most often use it as method evidence (18 contexts).","role_counts":[{"n":18,"context_role":"method"},{"n":14,"context_role":"background"},{"n":2,"context_role":"baseline"},{"n":1,"context_role":"dataset"}]},"error":null,"updated_at":"2026-06-05T21:30:39.045798+00:00"},"author_expand":{"job_type":"author_expand","status":"succeeded","result":{"authors_linked":[{"id":"02ad2b6c-8a3d-4309-ba96-5181bc91718e","orcid":null,"display_name":"Kaiming He"},{"id":"90e0e192-6197-4ef4-b9c2-386dbfd79fad","orcid":null,"display_name":"Xiangyu Zhang"},{"id":"c49ca12d-98c9-4fc2-a95f-a30b92a41773","orcid":null,"display_name":"Shaoqing Ren"},{"id":"fa0012a3-358f-4383-8457-2763cccd76e7","orcid":null,"display_name":"Jian Sun"}]},"error":null,"updated_at":"2026-06-05T21:30:39.042991+00:00"},"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-06-05T21:30:27.082895+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"A ConvNet for 
the 2020s","work_id":"0a23d1b7-bd56-43cc-8a80-7c43ce994e1e","shared_citers":17},{"title":"author Dong, W","work_id":"effdb28b-742e-4840-b3ca-d89502a6cd4d","shared_citers":14},{"title":"In: Proceedings of the IEEE/CVF Conference on Computer 25 Vision and Pattern Recognition, pp","work_id":"9da51225-b7bd-4032-b7db-ca577971dafe","shared_citers":12},{"title":"Very Deep Convolutional Networks for Large-Scale Image Recognition","work_id":"1c4b4409-c14b-488b-a086-c57a5aab8a29","shared_citers":11},{"title":"Walk in the cloud: Learning curves for point clouds shape analysis, pp","work_id":"3820f598-11b0-45c3-8c99-0079181ac0a7","shared_citers":11},{"title":"Derf: Decomposed radiance fields","work_id":"7083a41e-5666-435b-ab26-c753f6490b9a","shared_citers":10},{"title":"In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV)","work_id":"b8a8bb9e-1d31-40e2-9cab-ae21e338dde6","shared_citers":10},{"title":"In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp","work_id":"b9701eca-d05e-4d2e-9045-6761df4ba175","shared_citers":10},{"title":"BERT : Pre-training of Deep Bidirectional Transformers for Language Understanding","work_id":"3e3c8ac8-b858-4b22-af32-393d98c883e0","shared_citers":9},{"title":"Deep learning","work_id":"f959cefa-9092-49df-9fb5-a4e6654500f1","shared_citers":9},{"title":"Densely connected convolutional networks","work_id":"2199d436-33c2-4b30-9d6f-ce9b8904101e","shared_citers":9},{"title":"Dickerson","work_id":"5c2060c6-427c-4321-be22-49ccae439d80","shared_citers":9},{"title":"Emogen: Emotional image content generation with text-to-image diffusion models","work_id":"7efbc2dd-b0f2-4f71-bb1c-d2fcf110d805","shared_citers":9},{"title":"Gradient-based learning applied to document recognition","work_id":"0a3595ca-57f9-43f8-8e2f-aface7154b99","shared_citers":8},{"title":"Layer Normalization","work_id":"20a2d720-0046-4c7c-bcd6-327ec8143f69","shared_citers":8},{"title":"Masset, 
R","work_id":"238df2e4-a3e5-46f3-860e-3ae2b0094b97","shared_citers":8},{"title":"Long short -term memory","work_id":"c3b0bfa7-6764-45f1-a40d-45baaee9d22c","shared_citers":7},{"title":"PoseNet: A convolutional network for real-time 6-dof camera relocalization","work_id":"135418b1-cafd-49fd-803d-1ca6433d4b1b","shared_citers":7},{"title":"2016, in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 779–788, doi: 10.1109/CVPR.2016.91","work_id":"37ab4f11-9f69-480d-aab9-e7d9826c586d","shared_citers":6},{"title":"Adam: A Method for Stochastic Optimization","work_id":"1910796d-9b52-4683-bf5c-de9632c1028b","shared_citers":6},{"title":"Decoupled Weight Decay Regularization","work_id":"07ef7360-d385-4033-83f7-8384a6325204","shared_citers":6},{"title":"MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications","work_id":"3870239a-c950-4625-bf33-c4f902d14175","shared_citers":6},{"title":"Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift","work_id":"05484516-8937-4cdf-9176-7f8329ef0221","shared_citers":5},{"title":"IEEE Access8, 199523–199538 (2020) https://doi.org/10.1109/ACCESS","work_id":"7cbffc3e-26d4-4a7c-a518-eafcd09cbecb","shared_citers":5}],"time_series":[{"n":5,"year":2019},{"n":1,"year":2021},{"n":2,"year":2023},{"n":7,"year":2024},{"n":16,"year":2025},{"n":95,"year":2026}],"dependency_candidates":[{"n":1,"role":"method","polarity":"background","paper_title":"A Systematic Survey on Deep Learning Architectures for Point Cloud Classification and Segmentation","primary_cat":"cs.CV","context_text":"3D convolutional filters with a stride for extracting features from the shape. They do not use any pooling, as it was observed that pooling introduced uncertainty to shape reconstruction. They pretrain the model first and then run fine-tuning. 
Pre-training is run layer-wise-convolution layers and RBM layer are trained with standard contrastive divergence [35] and AM-DBN layer is trained with fast persistent contrastive divergence [99]. For fine-tuning, they use a process similar to the wake-sleep algorithm from [36]. During wake, they propagate input voxel forward through the network and update the recognition weights. During sleep, they sample persistent latent variables from the network's generative distribution and propagate them backward through the","citing_arxiv_id":"2605.17131"},{"n":1,"role":"method","polarity":"use_method","paper_title":"Multi-Narrow Transformation as a Single-Model Ensemble: Boundary Conditions, Mechanisms, and Failure Modes","primary_cat":"cs.LG","context_text":"tion under a limited budget affects predictive performance. 4. Experiments WeevaluatetheMNtransformationfromthreeperspec- tives: the data-regime dependence of its effectiveness, the mechanismunderlyingitsgainsinlow-dataregimes,andits computational implications. 4.1. Experimental Setup Unless otherwise stated, the following setup was used throughout the experiments. We employed ResNet-18 [9] as the baseline architecture and applied the MN transfor- mation defined in Sec. 3.1. Following Easy Ensemble [8], theimplementationwasbasedongroupconvolution,andthe transformation strength was controlled by 𝑟∈ {1,2,4,8,16,32}. Here,𝑟= 1corresponds to the untransformed SW baseline, and𝑟= 32correspondstoamodelcontaining1,024internal paths. WeusedCIFAR-100astheprimarydataset.","citing_arxiv_id":"2605.11530"},{"n":1,"role":"baseline","polarity":"baseline","paper_title":"Urban-ImageNet: A Large-Scale Multi-Modal Dataset and Evaluation Framework for Urban Space Perception","primary_cat":"cs.CV","context_text":"benchmarking library providing modular data loaders, fine-tuning pipelines, evaluation scripts, and cross-dataset adapters for direct comparison with Places365, MS-COCO, and Cityscapes. 
4.1 Task 1: Urban Scene Semantic Classification Setup:Given an image, predict itsHUSIClabel (0-9). Fine-tuned on 80K training images; evaluated on 10K test split with five-fold cross-validation.Baselines:ResNet-{18/50/152} [14], EfficientNet- B4 [36], ViT-B/16 [8], DeiT-B [37], CLIP ViT-L/14 (zero-shot + fine-tuned) [33].Metrics:Top-1 Accuracy, Macro-F1, per-class P/R/F1. 4.2 Task 2: Cross-Modal Image-Text Retrieval Task 2 evaluates two sub-configurations reflecting the dataset's two textual modalities: T2-1 (Category-Level Retrieval). Text queries are the tenHUSICclass names, formatted as \" This is a photo of {class_name}\"","citing_arxiv_id":"2605.09936"},{"n":1,"role":"method","polarity":"use_method","paper_title":"LAMES: A Large-Scale and Artisanal Mining Environmental Segmentation Dataset","primary_cat":"cs.CV","context_text":"training, validation, and test data sets are split based on entire mining sites rather than individual patches, ensur- ing that all patches from a given site are confined to either the training or test set, see Fig 10. 5.1. Mining Sector Classification (HiRes Imagery) We selected the established U-Net architecture [37], in- corporating a ResNet-50 backbone [17] trained on Ima- geNet [11] as the network architecture. U-Net is a widely recognized semantic segmentation model, demonstrating robust performance in both computer vision and remote sensing applications. The mining sites were divided into 38 for training, 14 for validation, and 19 for testing. 
Each bounding box of the mining sites was divided into patches","citing_arxiv_id":"2605.07740"},{"n":1,"role":"method","polarity":"use_method","paper_title":"A Unified Framework for the Detection and Classification of Fatty Pancreas in Ultrasound Images","primary_cat":"cs.CV","context_text":"Our framework operates in three distinct stages, as follows: 1.Segmentation stage.We employ a TransUNet- based architecture [3, 19] combining a ResNet [31] encoder with transformer bottleneck layers to segment both the pancreas and the splenic vein from ultrasound images. The models are initial- ized via transfer learning from a liver segmen- tation task [32] and fine-tuned on our clinical dataset. 2.Anatomically-Guided Patch Extraction stage. Using the predicted segmentation masks, we ex- tract tissue patches from two anatomically rele- vant regions: the pancreatic parenchyma (exclud- ing the splenic vein) and the peri-venous fat re- gion immediately beneath the splenic vein con- tour. 3.Classification via Texture Comparison stage.","citing_arxiv_id":"2605.07466"},{"n":1,"role":"baseline","polarity":"baseline","paper_title":"XiYOLO: Energy-Aware Object Detection via Iterative Architecture Search and Scaling","primary_cat":"cs.CV","context_text":"The searched architectures for PascalVOC and COCO are obtained independently. We refer to the resulting detector family asXiYolo. Deployment platforms and baselines.We evaluate the searched models against YOLO baselines on the ModalAI Sentinel Development Drone, which contains a Qualcomm QRB5165 CPU, a Qualcomm Adreno 650 GPU, and a 15 TOPS NPU. We compare against YOLOv5 [17], YOLOv8 [18], YOLO11 [18], and YOLOv12 [32] at nano, small, and medium scales. All models are exported to FP16 TFLite, and we measure energy per inference, cumulative energy over time, latency, and detection accuracy. Power is monitored using the VOXL Power Module v3. 
We estimate inference energy by subtracting idle power from measured power to obtain active inference power, then","citing_arxiv_id":"2605.06927"},{"n":1,"role":"method","polarity":"use_method","paper_title":"Digital Image Forgery Detection Using Transfer Learning","primary_cat":"cs.CV","context_text":"tween manipulated and authentic regions. Unlike raw RGB inputs, this rep- resentation explicitly emphasizes subtle manipulation artifacts introduced dur- ing tampering, enabling CNN models to learn more discriminative features for forgery detection [25]. All images are resized to224×224pixels (and299×299for InceptionV3) to match the input requirements of pretrained models [9, 10, 11, 14]. 7 3.3 Pretrained CNN Architectures To evaluate the effectiveness of theproposed approach, multiple pretrained CNN architectures are utilized: •DenseNet121 [14] •ResNet50 [9] •VGG16 [10] •EfficientNetB0 [13] •MobileNet [15] •InceptionV3 [11] Each model is fine-tuned on the enhanced input representation combining RGB images with theFdif f features [7, 16].","citing_arxiv_id":"2605.08167"},{"n":1,"role":"method","polarity":"use_method","paper_title":"Rethinking the Need for Source Models: Source-Free Domain Adaptation from Scratch Guided by a Vision-Language Model","primary_cat":"cs.CV","context_text":"we report the results under the widely adopted closed-set protocol in Office-Home, VisDA, and DomainNet-126. Furthermore, since the VODA uses no source informa- tion, it can be regarded as open-set [40], so we also provide comparisons under the open-set protocol in Office-Home. FrameworksThe initial modelθ i is a standard convolutional network, serving as the starting point for adaptation, we use ResNet-50 [41] for Office-Home, and ResNet- 101 [41] for VisDA and DomainNet-126, keeping consistency with the competitors. 
We initialized the networks using a layer-wise strategy: fully connected layers with Xavier uniform initialization [42], convolutional layers with Kaiming normal initial- ization tailored for ReLU activations [41], and batch normalization layers with weights","citing_arxiv_id":"2605.02604"},{"n":1,"role":"method","polarity":"use_method","paper_title":"Empirical Insights of Test Selection Metrics under Multiple Testing Objectives and Distribution Shifts","primary_cat":"cs.SE","context_text":"the identified optimal FE method and DBSCAN for CA when ablating DR candidates. Finally, we fix the optimal FE and DR methods when ablating CA candidates. Based on the Silhouette and DBCV scores in Table 3, the optimal pipelines are(DeepDrebin, UMAP, DBSCAN)for the AndroZoo dataset and(RoBERTa, UMAP, DBSCAN)for IMDb. Note that for image datasets (MNIST and Udacity), we directly use the best pipeline(ResNet-50 [ 30], UMAP, DBSCAN)validated by [ 7], which achieves average scores of 0.69 and 0.47 for MNIST and Udacity, respectively. 5.1.2 The cluster-to-fault correspondence.To validate whether the resulting cluster from the best pipeline indeed represents a DNN fault, we conduct feature pattern inspection and cluster-specific retraining validation [1]. Figure 2 displays the heatmaps of a randomly selected cluster from the best","citing_arxiv_id":"2604.23342"},{"n":1,"role":"method","polarity":"use_method","paper_title":"H-SemiS: Hierarchical Fusion of Semi and Self-Supervised Learning for Knee Osteoarthritis Severity Grading","primary_cat":"cs.CV","context_text":"andm,(m≫n) are labeled and unlabeled samples;x i is a train- ing sample; andθ Te , θS t are model parameters. Following Mean Teacher framework (Tarvainen and Valpola, 2017), we update the teacher via EMA applied only to the student's base network, while keeping the QCN fixed to stabilize classical feature ex- traction and provide consistent inputs to the quantum module, as defined in Eq. (20). 
θ_Te^(t+1) ← µ·θ_Te^(t) + (1−µ)·θ_St^(t) (20) where t represents the training iteration and µ is the EMA smoothing coefficient, set to 0.99 following Tarvainen and Valpola (2017), which controls the update rate of the teacher parameters. This consistency strategy gradually transfers knowledge from the student to the teacher, aligns their predictions, and im-","citing_arxiv_id":"2604.23335"},{"n":1,"role":"method","polarity":"background","paper_title":"Autonomous Unmanned Aircraft Systems for Enhanced Search and Rescue of Drowning Swimmers: Image-Based Localization and Mission Simulation","primary_cat":"cs.CV","context_text":"Consequently, synthetic images were only employed during the labeling process by adding 2,000 synthetic images of the class \"prob. ok\" to the 400 original images. Data augmentation can improve the general representation of certain object features, so we employ some standard data augmentation techniques, manipulating the following image attributes [63]: 1. Brightness: Simulates varying lighting conditions 2. Contrast: Adjusts intensity differences 3. Noise: Mimics distance variations and sensor characteristics 4. Motion blur: Simulates camera movement 5. Rotation: Introduces variation in object orientation 6. Translation: Simulates different object positions Each transformation was applied once per image, expanding the 2,400 images by","citing_arxiv_id":"2604.18088"},{"n":1,"role":"method","polarity":"use_method","paper_title":"Training-inference input alignment outweighs framework choice in longitudinal retinal image prediction","primary_cat":"cs.CV","context_text":"The aggregated history features 𝐆(𝑠) are injected into the U-Net encoder at the corresponding scales via channel-wise concatenation followed by a 1 × 1 convolution acting as a pixel-wise temporal mixer, followed by GroupNorm with SiLU activation [23], providing normalization and non-linear refinement of the fused representation. 
The model predicts the residual change from the most recent history frame rather than the absolute target [24]: 𝐼̂∗ = 𝐼𝑁 + 𝑓𝜃(𝐼𝑁, ℋ, 𝛥𝑡∗) where 𝑓𝜃 denotes the U-Net output (prior to the residual addition). This residual formulation concentrates capacity on the disease-relevant change signal. The output layer is initialized near zero, so the initial prediction approximates the copy-last baseline 𝐼𝑁. 3.3 Training Configuration Our primary model, TRU, is trained to predict the ground-truth target 𝐼∗ via a per-pixel masked","citing_arxiv_id":"2604.16955"},{"n":1,"role":"method","polarity":"use_method","paper_title":"Weak-to-Strong Knowledge Distillation Accelerates Visual Learning","primary_cat":"cs.CV","context_text":"A speedup ratio greater than 1.0×means ours reaches the same target earlier with fewer epochs or steps. For higher-is-better metrics (Top-1, AP50), first@τis the first epoch with metric at or aboveτ. For lower-is-better metrics (FID), first@τis the first step at or belowτ. Gate and Hyperparameter Selection.For ImageNet classification [7], we useτ= 65for ResNet-50 [14] andτ= 50for ViT-S/16 [8]. For CIFAR early-stage classification [26], we use fixed dataset-level gates:τ= 75for CIFAR-10 and τ= 60for CIFAR-100. For object detection on the COCO dataset [28], we use a fixed task-level AP50 target (τ= 20%). For diffusion generation on the CIFAR-10 dataset [18,26,32], we use a fixed task-level FID target (τ= 60), selected from the teacher reference around 30k training steps, with a conservative","citing_arxiv_id":"2604.15451"},{"n":1,"role":"method","polarity":"use_method","paper_title":"Generative Modeling of Complex-Valued Brain MRI Data","primary_cat":"eess.IV","context_text":"These channels are not independent signals but jointly represent a single complex-valued measurement, where the relationship between them encodes the local phase. 
Unlike magnitude-only approaches, where a single intensity channel is compressed, this coupling must be explicitly preserved. The architecture, loss function, and evaluation metrics described below are designed accordingly. The architecture is implemented as a ResNet-based [20] conditional variational autoencoder (CVAE) [21]. The encoder maps each (2×96×96) input patch to a latent representation of size (2×48×48), yielding a compression factor of 4. This factor was chosen to retain fine diagnostic detail and the coupling between channels while providing a sufficiently compact input for the flow matching model. The decoder reconstructs the original patch from this compressed encoding.","citing_arxiv_id":"2604.14800"},{"n":1,"role":"method","polarity":"use_method","paper_title":"Enhancing Event Reconstruction in Hyper-Kamiokande with Machine Learning: A ResNet Implementation","primary_cat":"hep-ex","context_text":"Together, these considerations make a scalable, high-speed, and robust reconstruction capable of operating at Monte Carlo scale essential for Hyper-Kamiokande. Machine-learning based reconstruction offers a promising path toward meeting these computational and topological chal- lenges. Convolutional neural networks [ 16], and in particular residual networks (ResNets) [17], are well suited to process the high-dimensional charge and time images recorded by the PMT array. 
At Super-Kamiokande, machine-learning techniques are already employed for solar-neutrino classification [18] and for neutron-capture tagging [19], and they show encouraging results in neutrino-reconstruction studies for other experiments [20, 21, 22].","citing_arxiv_id":"2604.13503"},{"n":1,"role":"method","polarity":"use_method","paper_title":"Human Centered Non Intrusive Driver State Modeling Using Personalized Physiological Signals in Real World Automated Driving","primary_cat":"cs.HC","context_text":"Instead of binary classification, our model classifies into four states (LL,L,H,HH), and instead of training CNN feature extractors from scratch, we use pre-trained ResNet50 using transfer learning. The model architecture is shown in Figure 3. 3.6.1 Feature extraction. The first step is to extract features from each of the seven images. Here we apply transfer learning using ResNet50 [22], pre-trained on a large dataset. We extract information from the penultimate layer of ResNet50, compressing each image into a feature vector of size 2048 × 1. This process is repeated for all seven input images (one per signal), resulting in seven feature vectors. These feature vectors are then concatenated into a single vector of size 14336 × 1 (because 2048 × 7 = 14336.","citing_arxiv_id":"2604.11549"},{"n":1,"role":"dataset","polarity":"use_dataset","paper_title":"Mosaic: Cross-Modal Clustering for Efficient Video Understanding","primary_cat":"cs.PF","context_text":"historical video and recomputes attention upon query arrival. (2) ReKV [12] retrieves query-relevant KVCache at the token level. (3) LiveVLM [13] further combines token-level retrieval with KVCache compression to reduce memory usage. (4) StreamMem [14] also compresses KVCache, but under a TABLE II DATASET CONFIGURATIONS. 
Dataset Max Length Description MLVU [19] 703s multi-task long video LongVideoBench [20] 468s long-term multi-modal video VideoMME [21] 1,018s full-spectrum multi-modal video RVS-ego [22] 3,605s first-person ego-centric video RVS-movie [22] 1,671s third-person plot video query-agnostic memory budget. Following prior work [12], [13], we set the retrieved frames to 64 for all baselines. B. Overall Performance","citing_arxiv_id":"2604.10060"},{"n":1,"role":"method","polarity":"use_method","paper_title":"DSVTLA: Deep Swin Vision Transformer-Based Transfer Learning Architecture for Multi-Type Cancer Histopathological Cancer Image Classification","primary_cat":"eess.IV","context_text":"histopathological images [2], [4], [5], [6]. CNN have been widely adopted for cancer detection due to their ability to capture local texture patterns and hierarchical spatial features. Residual learning has been introduced to alleviate the vanishing gradient problem, leading to significant improvements in deep feature representation, as exemplified by ResNet architectures [7]. Similarly, DenseNet and kernel architectures enhance feature reuse and gradient flow, while EfficientNet achieves state-of-the-art performance through compound scaling with reduced computational cost [8], [9], [10] . More recently, transformer -based architectures have gained attention in medical imaging for their capability to model long -range","citing_arxiv_id":"2604.09468"},{"n":1,"role":"method","polarity":"use_method","paper_title":"Variational Feature Compression for Model-Specific Representations","primary_cat":"cs.CV","context_text":"and gradient-based saliency, the scores are normalized and combined into a uni- fied importance measure, and thresholding yields a binary mask that selects the dimensions retained for decoding. importance score. A threshold produces a binary mask, and the masked vector Zm is decoded intoX ′ for inference. 
3.4 Variational Latent Bottleneck Our encoder uses a ResNet-18 [11] backbone with the final fully connected and softmax layers removed. The 512-dimensional output from global average pooling is processed through two parallel linear layers to produce µ and log σ²; a sample Z is drawn via the reparameterization trick. Following the Variational Information Bottleneck (VIB) framework [2], minimizing I(Z;X) via KL divergence regularization reduces redundant information, while maximizing I(Z;Y)","citing_arxiv_id":"2604.06644"},{"n":1,"role":"method","polarity":"use_method","paper_title":"Lightweight True In-Pixel Encryption with FeFET Enabled Pixel Design for Secure Imaging","primary_cat":"cs.CV","context_text":"The PSNR values between encrypted and original images further confirm that the encrypted images diverge substantially from the originals, consistent with strong visual obfuscation. To further evaluate the robustness of SecurePix against machine-learning-based attacks, we performed an image-classification test using a ResNet-18 neural network [30]. In this experiment, the classifier is treated as an adversarial model attempting to recognize the encrypted images. For CIFAR-10 [31], the ResNet-18 network was first trained only on the unencrypted training images following standard supervised-learning procedures. 
After training, we encrypted 10,000 CIFAR-10 test images using SecurePix and fed these encrypted","citing_arxiv_id":"2604.05147"}]},"error":null,"updated_at":"2026-06-05T21:30:22.214786+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"items":[{"title":"Qwen3 Technical Report","outcome":"unchanged","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"counts":{"fixed":0,"merged":0,"unchanged":1,"quarantined":0,"needs_external_resolution":0},"errors":[],"attempted":1},"error":null,"updated_at":"2026-06-05T21:30:14.629465+00:00"},"reader_index":{"job_type":"reader_index","status":"succeeded","result":{"note":"annotated reader requires full-text/OA fetch; shell is wired for mega hubs","status":"reader queued"},"error":null,"updated_at":"2026-06-05T21:30:38.757310+00:00"},"recognition_alignment":{"job_type":"recognition_alignment","status":"succeeded","result":{"modules":["IndisputableMonolith.Foundation.RecognitionForcing","IndisputableMonolith.Chain","IndisputableMonolith.Engineering.CorticalNeuromodulationDevice","IndisputableMonolith.Engineering.PhantomCoupledGWAntennaSensitivity","IndisputableMonolith.Foundation.InitialCondition","IndisputableMonolith.Foundation.LedgerForcing","IndisputableMonolith.Foundation.ObserverForcing","IndisputableMonolith.Information.InformationIsLedger"],"query_chars":45},"error":null,"updated_at":"2026-06-05T21:30:16.286292+00:00"},"role_polarity":{"job_type":"role_polarity","status":"succeeded","result":{"title":"Deep residual learning for image recognition","claims":[{"claim_text":"These channels are not independent signals but jointly represent a single complex-valued measurement, where the relationship between them encodes the local phase. Unlike magnitude-only approaches, where a single intensity channel is compressed, this coupling must be explicitly preserved. 
The architecture, loss function, and evaluation metrics described below are designed accordingly. The architecture is implemented as a ResNet-based [20] conditional variational autoencoder (CVAE) [21]. The encod","claim_type":"method","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"Together, these considerations make a scalable, high-speed, and robust reconstruction capable of operating at Monte Carlo scale essential for Hyper-Kamiokande. Machine-learning based reconstruction offers a promising path toward meeting these computational and topological chal- lenges. Convolutional neural networks [ 16], and in particular residual networks (ResNets) [17], are well suited to process the high-dimensional charge and time images recorded by the PMT array. At Super-Kamiokande, machi","claim_type":"method","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"Instead of binary classification, our model classifies into four states (LL,L,H,HH), and instead of training CNN feature extractors from scratch, we use pre-trained ResNet50 using transfer learning. The model architecture is shown in Figure 3. 3.6.1 Feature extraction.The first step is to extract features from each of the seven images. Here we apply transfer learning using ResNet50 [22], pre-trained on a large dataset. We extract information from the penultimate layer of ResNet50, compressing ea","claim_type":"method","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"historical video and recomputes attention upon query arrival. (2) ReKV [12] retrieves query-relevant KVCache at the token level. (3) LiveVLM [13] further combines token-level retrieval with KVCache compression to reduce memory usage. (4) StreamMem [14] also compresses KVCache, but under a TABLE II DATASET CONFIGURATIONS. 
Dataset Max Length Description MLVU [19] 703s multi-task long video LongVideoBench [20] 468s long-term multi-modal video VideoMME [21] 1,018s full-spectrum multi-modal video RVS","claim_type":"dataset","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"Training on such data could reinforce areas where AI systems are vulnerable [37, 796], enhancing their robustness in real-world applications. Adversarial examples can be constructed in various ways. One straightforward approach is to add small perturbations to inputs, which preserves their original labels while introducing adversarial characteristics [100, 260, 300, 504]. Another effective strategy is red teaming, which usually involves human teams systematically testing to find vulnerabilities ","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"histopathological images [2], [4], [5], [6]. CNN have been widely adopted for cancer detection due to their ability to capture local texture patterns and hierarchical spatial features. Residual learning has been introduced to alleviate the vanishing gradient problem, leading to significant improvements in deep feature representation, as exemplified by ResNet architectures [7]. Similarly, DenseNet and kernel architectures enhance feature reuse and gradient flow, while EfficientNet achieves state-","claim_type":"method","confidence":0.9,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks Deep residual learning for image recognition because it crossed a citation-hub threshold. 
Current citing contexts most often use it as method evidence (18 contexts).","role_counts":[{"n":18,"context_role":"method"},{"n":14,"context_role":"background"},{"n":2,"context_role":"baseline"},{"n":1,"context_role":"dataset"}]},"error":null,"updated_at":"2026-06-05T21:29:54.148326+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"Deep residual learning for image recognition","claims":[{"claim_text":"These channels are not independent signals but jointly represent a single complex-valued measurement, where the relationship between them encodes the local phase. Unlike magnitude-only approaches, where a single intensity channel is compressed, this coupling must be explicitly preserved. The architecture, loss function, and evaluation metrics described below are designed accordingly. The architecture is implemented as a ResNet-based [20] conditional variational autoencoder (CVAE) [21]. The encod","claim_type":"method","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"Together, these considerations make a scalable, high-speed, and robust reconstruction capable of operating at Monte Carlo scale essential for Hyper-Kamiokande. Machine-learning based reconstruction offers a promising path toward meeting these computational and topological chal- lenges. Convolutional neural networks [ 16], and in particular residual networks (ResNets) [17], are well suited to process the high-dimensional charge and time images recorded by the PMT array. At Super-Kamiokande, machi","claim_type":"method","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"Instead of binary classification, our model classifies into four states (LL,L,H,HH), and instead of training CNN feature extractors from scratch, we use pre-trained ResNet50 using transfer learning. The model architecture is shown in Figure 3. 3.6.1 Feature extraction.The first step is to extract features from each of the seven images. 
Here we apply transfer learning using ResNet50 [22], pre-trained on a large dataset. We extract information from the penultimate layer of ResNet50, compressing ea","claim_type":"method","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"historical video and recomputes attention upon query arrival. (2) ReKV [12] retrieves query-relevant KVCache at the token level. (3) LiveVLM [13] further combines token-level retrieval with KVCache compression to reduce memory usage. (4) StreamMem [14] also compresses KVCache, but under a TABLE II DATASET CONFIGURATIONS. Dataset Max Length Description MLVU [19] 703s multi-task long video LongVideoBench [20] 468s long-term multi-modal video VideoMME [21] 1,018s full-spectrum multi-modal video RVS","claim_type":"dataset","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"Training on such data could reinforce areas where AI systems are vulnerable [37, 796], enhancing their robustness in real-world applications. Adversarial examples can be constructed in various ways. One straightforward approach is to add small perturbations to inputs, which preserves their original labels while introducing adversarial characteristics [100, 260, 300, 504]. Another effective strategy is red teaming, which usually involves human teams systematically testing to find vulnerabilities ","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"histopathological images [2], [4], [5], [6]. CNN have been widely adopted for cancer detection due to their ability to capture local texture patterns and hierarchical spatial features. Residual learning has been introduced to alleviate the vanishing gradient problem, leading to significant improvements in deep feature representation, as exemplified by ResNet architectures [7]. 
Similarly, DenseNet and kernel architectures enhance feature reuse and gradient flow, while EfficientNet achieves state-","claim_type":"method","confidence":0.9,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks Deep residual learning for image recognition because it crossed a citation-hub threshold. Current citing contexts most often use it as method evidence (18 contexts).","role_counts":[{"n":18,"context_role":"method"},{"n":14,"context_role":"background"},{"n":2,"context_role":"baseline"},{"n":1,"context_role":"dataset"}]},"error":null,"updated_at":"2026-06-05T21:30:14.634001+00:00"}},"summary":{"title":"Deep residual learning for image recognition","claims":[{"claim_text":"These channels are not independent signals but jointly represent a single complex-valued measurement, where the relationship between them encodes the local phase. Unlike magnitude-only approaches, where a single intensity channel is compressed, this coupling must be explicitly preserved. The architecture, loss function, and evaluation metrics described below are designed accordingly. The architecture is implemented as a ResNet-based [20] conditional variational autoencoder (CVAE) [21]. The encod","claim_type":"method","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"Together, these considerations make a scalable, high-speed, and robust reconstruction capable of operating at Monte Carlo scale essential for Hyper-Kamiokande. Machine-learning based reconstruction offers a promising path toward meeting these computational and topological chal- lenges. Convolutional neural networks [ 16], and in particular residual networks (ResNets) [17], are well suited to process the high-dimensional charge and time images recorded by the PMT array. 
At Super-Kamiokande, machi","claim_type":"method","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"Instead of binary classification, our model classifies into four states (LL,L,H,HH), and instead of training CNN feature extractors from scratch, we use pre-trained ResNet50 using transfer learning. The model architecture is shown in Figure 3. 3.6.1 Feature extraction.The first step is to extract features from each of the seven images. Here we apply transfer learning using ResNet50 [22], pre-trained on a large dataset. We extract information from the penultimate layer of ResNet50, compressing ea","claim_type":"method","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"historical video and recomputes attention upon query arrival. (2) ReKV [12] retrieves query-relevant KVCache at the token level. (3) LiveVLM [13] further combines token-level retrieval with KVCache compression to reduce memory usage. (4) StreamMem [14] also compresses KVCache, but under a TABLE II DATASET CONFIGURATIONS. Dataset Max Length Description MLVU [19] 703s multi-task long video LongVideoBench [20] 468s long-term multi-modal video VideoMME [21] 1,018s full-spectrum multi-modal video RVS","claim_type":"dataset","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"Training on such data could reinforce areas where AI systems are vulnerable [37, 796], enhancing their robustness in real-world applications. Adversarial examples can be constructed in various ways. One straightforward approach is to add small perturbations to inputs, which preserves their original labels while introducing adversarial characteristics [100, 260, 300, 504]. Another effective strategy is red teaming, which usually involves human teams systematically testing to find vulnerabilities ","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"histopathological images [2], [4], [5], [6]. 
CNN have been widely adopted for cancer detection due to their ability to capture local texture patterns and hierarchical spatial features. Residual learning has been introduced to alleviate the vanishing gradient problem, leading to significant improvements in deep feature representation, as exemplified by ResNet architectures [7]. Similarly, DenseNet and kernel architectures enhance feature reuse and gradient flow, while EfficientNet achieves state-","claim_type":"method","confidence":0.9,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks Deep residual learning for image recognition because it crossed a citation-hub threshold. Current citing contexts most often use it as method evidence (18 contexts).","role_counts":[{"n":18,"context_role":"method"},{"n":14,"context_role":"background"},{"n":2,"context_role":"baseline"},{"n":1,"context_role":"dataset"}]},"graph":{"co_cited":[{"title":"A ConvNet for the 2020s","work_id":"0a23d1b7-bd56-43cc-8a80-7c43ce994e1e","shared_citers":17},{"title":"author Dong, W","work_id":"effdb28b-742e-4840-b3ca-d89502a6cd4d","shared_citers":14},{"title":"In: Proceedings of the IEEE/CVF Conference on Computer 25 Vision and Pattern Recognition, pp","work_id":"9da51225-b7bd-4032-b7db-ca577971dafe","shared_citers":12},{"title":"Very Deep Convolutional Networks for Large-Scale Image Recognition","work_id":"1c4b4409-c14b-488b-a086-c57a5aab8a29","shared_citers":11},{"title":"Walk in the cloud: Learning curves for point clouds shape analysis, pp","work_id":"3820f598-11b0-45c3-8c99-0079181ac0a7","shared_citers":11},{"title":"Derf: Decomposed radiance fields","work_id":"7083a41e-5666-435b-ab26-c753f6490b9a","shared_citers":10},{"title":"In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV)","work_id":"b8a8bb9e-1d31-40e2-9cab-ae21e338dde6","shared_citers":10},{"title":"In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 
pp","work_id":"b9701eca-d05e-4d2e-9045-6761df4ba175","shared_citers":10},{"title":"BERT : Pre-training of Deep Bidirectional Transformers for Language Understanding","work_id":"3e3c8ac8-b858-4b22-af32-393d98c883e0","shared_citers":9},{"title":"Deep learning","work_id":"f959cefa-9092-49df-9fb5-a4e6654500f1","shared_citers":9},{"title":"Densely connected convolutional networks","work_id":"2199d436-33c2-4b30-9d6f-ce9b8904101e","shared_citers":9},{"title":"Dickerson","work_id":"5c2060c6-427c-4321-be22-49ccae439d80","shared_citers":9},{"title":"Emogen: Emotional image content generation with text-to-image diffusion models","work_id":"7efbc2dd-b0f2-4f71-bb1c-d2fcf110d805","shared_citers":9},{"title":"Gradient-based learning applied to document recognition","work_id":"0a3595ca-57f9-43f8-8e2f-aface7154b99","shared_citers":8},{"title":"Layer Normalization","work_id":"20a2d720-0046-4c7c-bcd6-327ec8143f69","shared_citers":8},{"title":"Masset, R","work_id":"238df2e4-a3e5-46f3-860e-3ae2b0094b97","shared_citers":8},{"title":"Long short -term memory","work_id":"c3b0bfa7-6764-45f1-a40d-45baaee9d22c","shared_citers":7},{"title":"PoseNet: A convolutional network for real-time 6-dof camera relocalization","work_id":"135418b1-cafd-49fd-803d-1ca6433d4b1b","shared_citers":7},{"title":"2016, in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 779–788, doi: 10.1109/CVPR.2016.91","work_id":"37ab4f11-9f69-480d-aab9-e7d9826c586d","shared_citers":6},{"title":"Adam: A Method for Stochastic Optimization","work_id":"1910796d-9b52-4683-bf5c-de9632c1028b","shared_citers":6},{"title":"Decoupled Weight Decay Regularization","work_id":"07ef7360-d385-4033-83f7-8384a6325204","shared_citers":6},{"title":"MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications","work_id":"3870239a-c950-4625-bf33-c4f902d14175","shared_citers":6},{"title":"Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate 
Shift","work_id":"05484516-8937-4cdf-9176-7f8329ef0221","shared_citers":5},{"title":"IEEE Access8, 199523–199538 (2020) https://doi.org/10.1109/ACCESS","work_id":"7cbffc3e-26d4-4a7c-a518-eafcd09cbecb","shared_citers":5}],"time_series":[{"n":5,"year":2019},{"n":1,"year":2021},{"n":2,"year":2023},{"n":7,"year":2024},{"n":16,"year":2025},{"n":95,"year":2026}],"dependency_candidates":[{"n":1,"role":"method","polarity":"background","paper_title":"A Systematic Survey on Deep Learning Architectures for Point Cloud Classification and Segmentation","primary_cat":"cs.CV","context_text":"3D convolutional filters with a stride for extracting features from the shape. They do not use any pooling, as it was observed that pooling introduced uncertainty to shape reconstruction. They pretrain the model first and then run fine-tuning. Pre-training is run layer-wise-convolution layers and RBM layer are trained with standard contrastive divergence [35] and AM-DBN layer is trained with fast persistent contrastive divergence [99]. For fine-tuning, they use a process similar to the wake-sleep algorithm from [36]. During wake, they propagate input voxel forward through the network and update the recognition weights. During sleep, they sample persistent latent variables from the network's generative distribution and propagate them backward through the","citing_arxiv_id":"2605.17131"},{"n":1,"role":"method","polarity":"use_method","paper_title":"Multi-Narrow Transformation as a Single-Model Ensemble: Boundary Conditions, Mechanisms, and Failure Modes","primary_cat":"cs.LG","context_text":"tion under a limited budget affects predictive performance. 4. Experiments We evaluate the MN transformation from three perspectives: the data-regime dependence of its effectiveness, the mechanism underlying its gains in low-data regimes, and its computational implications. 4.1. Experimental Setup Unless otherwise stated, the following setup was used throughout the experiments. 
We employed ResNet-18 [9] as the baseline architecture and applied the MN transformation defined in Sec. 3.1. Following Easy Ensemble [8], the implementation was based on group convolution, and the transformation strength was controlled by 𝑟 ∈ {1,2,4,8,16,32}. Here, 𝑟 = 1 corresponds to the untransformed SW baseline, and 𝑟 = 32 corresponds to a model containing 1,024 internal paths. We used CIFAR-100 as the primary dataset.","citing_arxiv_id":"2605.11530"},{"n":1,"role":"baseline","polarity":"baseline","paper_title":"Urban-ImageNet: A Large-Scale Multi-Modal Dataset and Evaluation Framework for Urban Space Perception","primary_cat":"cs.CV","context_text":"benchmarking library providing modular data loaders, fine-tuning pipelines, evaluation scripts, and cross-dataset adapters for direct comparison with Places365, MS-COCO, and Cityscapes. 4.1 Task 1: Urban Scene Semantic Classification Setup: Given an image, predict its HUSIC label (0-9). Fine-tuned on 80K training images; evaluated on 10K test split with five-fold cross-validation. Baselines: ResNet-{18/50/152} [14], EfficientNet-B4 [36], ViT-B/16 [8], DeiT-B [37], CLIP ViT-L/14 (zero-shot + fine-tuned) [33]. Metrics: Top-1 Accuracy, Macro-F1, per-class P/R/F1. 4.2 Task 2: Cross-Modal Image-Text Retrieval Task 2 evaluates two sub-configurations reflecting the dataset's two textual modalities: T2-1 (Category-Level Retrieval). Text queries are the ten HUSIC class names, formatted as \"This is a photo of {class_name}\"","citing_arxiv_id":"2605.09936"},{"n":1,"role":"method","polarity":"use_method","paper_title":"LAMES: A Large-Scale and Artisanal Mining Environmental Segmentation Dataset","primary_cat":"cs.CV","context_text":"training, validation, and test data sets are split based on entire mining sites rather than individual patches, ensuring that all patches from a given site are confined to either the training or test set, see Fig 10. 5.1. 
Mining Sector Classification (HiRes Imagery) We selected the established U-Net architecture [37], incorporating a ResNet-50 backbone [17] trained on ImageNet [11] as the network architecture. U-Net is a widely recognized semantic segmentation model, demonstrating robust performance in both computer vision and remote sensing applications. The mining sites were divided into 38 for training, 14 for validation, and 19 for testing. Each bounding box of the mining sites was divided into patches","citing_arxiv_id":"2605.07740"},{"n":1,"role":"method","polarity":"use_method","paper_title":"A Unified Framework for the Detection and Classification of Fatty Pancreas in Ultrasound Images","primary_cat":"cs.CV","context_text":"Our framework operates in three distinct stages, as follows: 1. Segmentation stage. We employ a TransUNet-based architecture [3, 19] combining a ResNet [31] encoder with transformer bottleneck layers to segment both the pancreas and the splenic vein from ultrasound images. The models are initialized via transfer learning from a liver segmentation task [32] and fine-tuned on our clinical dataset. 2. Anatomically-Guided Patch Extraction stage. Using the predicted segmentation masks, we extract tissue patches from two anatomically relevant regions: the pancreatic parenchyma (excluding the splenic vein) and the peri-venous fat region immediately beneath the splenic vein contour. 3. Classification via Texture Comparison stage.","citing_arxiv_id":"2605.07466"},{"n":1,"role":"baseline","polarity":"baseline","paper_title":"XiYOLO: Energy-Aware Object Detection via Iterative Architecture Search and Scaling","primary_cat":"cs.CV","context_text":"The searched architectures for PascalVOC and COCO are obtained independently. We refer to the resulting detector family as XiYolo. 
Deployment platforms and baselines. We evaluate the searched models against YOLO baselines on the ModalAI Sentinel Development Drone, which contains a Qualcomm QRB5165 CPU, a Qualcomm Adreno 650 GPU, and a 15 TOPS NPU. We compare against YOLOv5 [17], YOLOv8 [18], YOLO11 [18], and YOLOv12 [32] at nano, small, and medium scales. All models are exported to FP16 TFLite, and we measure energy per inference, cumulative energy over time, latency, and detection accuracy. Power is monitored using the VOXL Power Module v3. We estimate inference energy by subtracting idle power from measured power to obtain active inference power, then","citing_arxiv_id":"2605.06927"},{"n":1,"role":"method","polarity":"use_method","paper_title":"Digital Image Forgery Detection Using Transfer Learning","primary_cat":"cs.CV","context_text":"tween manipulated and authentic regions. Unlike raw RGB inputs, this representation explicitly emphasizes subtle manipulation artifacts introduced during tampering, enabling CNN models to learn more discriminative features for forgery detection [25]. All images are resized to 224×224 pixels (and 299×299 for InceptionV3) to match the input requirements of pretrained models [9, 10, 11, 14]. 3.3 Pretrained CNN Architectures To evaluate the effectiveness of the proposed approach, multiple pretrained CNN architectures are utilized: • DenseNet121 [14] • ResNet50 [9] • VGG16 [10] • EfficientNetB0 [13] • MobileNet [15] • InceptionV3 [11] Each model is fine-tuned on the enhanced input representation combining RGB images with the F_diff features [7, 16].","citing_arxiv_id":"2605.08167"},{"n":1,"role":"method","polarity":"use_method","paper_title":"Rethinking the Need for Source Models: Source-Free Domain Adaptation from Scratch Guided by a Vision-Language Model","primary_cat":"cs.CV","context_text":"we report the results under the widely adopted closed-set protocol in Office-Home, VisDA, and DomainNet-126. 
Furthermore, since the VODA uses no source information, it can be regarded as open-set [40], so we also provide comparisons under the open-set protocol in Office-Home. Frameworks. The initial model θ_i is a standard convolutional network, serving as the starting point for adaptation, we use ResNet-50 [41] for Office-Home, and ResNet-101 [41] for VisDA and DomainNet-126, keeping consistency with the competitors. We initialized the networks using a layer-wise strategy: fully connected layers with Xavier uniform initialization [42], convolutional layers with Kaiming normal initialization tailored for ReLU activations [41], and batch normalization layers with weights","citing_arxiv_id":"2605.02604"},{"n":1,"role":"method","polarity":"use_method","paper_title":"Empirical Insights of Test Selection Metrics under Multiple Testing Objectives and Distribution Shifts","primary_cat":"cs.SE","context_text":"the identified optimal FE method and DBSCAN for CA when ablating DR candidates. Finally, we fix the optimal FE and DR methods when ablating CA candidates. Based on the Silhouette and DBCV scores in Table 3, the optimal pipelines are (DeepDrebin, UMAP, DBSCAN) for the AndroZoo dataset and (RoBERTa, UMAP, DBSCAN) for IMDb. Note that for image datasets (MNIST and Udacity), we directly use the best pipeline (ResNet-50 [30], UMAP, DBSCAN) validated by [7], which achieves average scores of 0.69 and 0.47 for MNIST and Udacity, respectively. 5.1.2 The cluster-to-fault correspondence. To validate whether the resulting cluster from the best pipeline indeed represents a DNN fault, we conduct feature pattern inspection and cluster-specific retraining validation [1]. 
Figure 2 displays the heatmaps of a randomly selected cluster from the best","citing_arxiv_id":"2604.23342"},{"n":1,"role":"method","polarity":"use_method","paper_title":"H-SemiS: Hierarchical Fusion of Semi and Self-Supervised Learning for Knee Osteoarthritis Severity Grading","primary_cat":"cs.CV","context_text":"and m, (m≫n) are labeled and unlabeled samples; x_i is a training sample; and θ_Te, θ_St are model parameters. Following Mean Teacher framework (Tarvainen and Valpola, 2017), we update the teacher via EMA applied only to the student's base network, while keeping the QCN fixed to stabilize classical feature extraction and provide consistent inputs to the quantum module, as defined in Eq. (20). θ_Te^(t+1) ← µ·θ_Te^(t) + (1−µ)·θ_St^(t) (20) where t represents the training iteration and µ is the EMA smoothing coefficient, set to 0.99 following Tarvainen and Valpola (2017), which controls the update rate of the teacher parameters. This consistency strategy gradually transfers knowledge from the student to the teacher, aligns their predictions, and im-","citing_arxiv_id":"2604.23335"},{"n":1,"role":"method","polarity":"background","paper_title":"Autonomous Unmanned Aircraft Systems for Enhanced Search and Rescue of Drowning Swimmers: Image-Based Localization and Mission Simulation","primary_cat":"cs.CV","context_text":"Consequently, synthetic images were only employed during the labeling process by adding 2,000 synthetic images of the class \"prob. ok\" to the 400 original images. Data augmentation can improve the general representation of certain object features, so we employ some standard data augmentation techniques, manipulating the following image attributes [63]: 1. Brightness: Simulates varying lighting conditions 2. Contrast: Adjusts intensity differences 3. Noise: Mimics distance variations and sensor characteristics 4. Motion blur: Simulates camera movement 5. Rotation: Introduces variation in object orientation 6. 
Translation: Simulates different object positions Each transformation was applied once per image, expanding the 2,400 images by","citing_arxiv_id":"2604.18088"},{"n":1,"role":"method","polarity":"use_method","paper_title":"Training-inference input alignment outweighs framework choice in longitudinal retinal image prediction","primary_cat":"cs.CV","context_text":"The aggregated history features 𝐆(𝑠) are injected into the U-Net encoder at the corresponding scales via channel-wise concatenation followed by a 1 × 1 convolution acting as a pixel-wise temporal mixer, followed by GroupNorm with SiLU activation [23], providing normalization and non-linear refinement of the fused representation. The model predicts the residual change from the most recent history frame rather than the absolute target [24]: 𝐼̂∗ = 𝐼𝑁 + 𝑓𝜃(𝐼𝑁, ℋ, 𝛥𝑡∗) where 𝑓𝜃 denotes the U-Net output (prior to the residual addition). This residual formulation concentrates capacity on the disease-relevant change signal. The output layer is initialized near zero, so the initial prediction approximates the copy-last baseline 𝐼𝑁. 3.3 Training Configuration Our primary model, TRU, is trained to predict the ground-truth target 𝐼∗ via a per-pixel masked","citing_arxiv_id":"2604.16955"},{"n":1,"role":"method","polarity":"use_method","paper_title":"Weak-to-Strong Knowledge Distillation Accelerates Visual Learning","primary_cat":"cs.CV","context_text":"A speedup ratio greater than 1.0×means ours reaches the same target earlier with fewer epochs or steps. For higher-is-better metrics (Top-1, AP50), first@τis the first epoch with metric at or aboveτ. For lower-is-better metrics (FID), first@τis the first step at or belowτ. Gate and Hyperparameter Selection.For ImageNet classification [7], we useτ= 65for ResNet-50 [14] andτ= 50for ViT-S/16 [8]. For CIFAR early-stage classification [26], we use fixed dataset-level gates:τ= 75for CIFAR-10 and τ= 60for CIFAR-100. 
For object detection on the COCO dataset [28], we use a fixed task-level AP50 target (τ= 20%). For diffusion generation on the CIFAR-10 dataset [18,26,32], we use a fixed task-level FID target (τ= 60), selected from the teacher reference around 30k training steps, with a conservative","citing_arxiv_id":"2604.15451"},{"n":1,"role":"method","polarity":"use_method","paper_title":"Generative Modeling of Complex-Valued Brain MRI Data","primary_cat":"eess.IV","context_text":"These channels are not independent signals but jointly represent a single complex-valued measurement, where the relationship between them encodes the local phase. Unlike magnitude-only approaches, where a single intensity channel is compressed, this coupling must be explicitly preserved. The architecture, loss function, and evaluation metrics described below are designed accordingly. The architecture is implemented as a ResNet-based [20] conditional variational autoencoder (CVAE) [21]. The encoder maps each (2×96×96) input patch to a latent representation of size (2×48×48), yielding a compression factor of 4. This factor was chosen to retain fine diagnostic detail and the coupling between channels while providing a sufficiently compact input for the flow matching model. The decoder reconstructs the original patch from this compressed encoding.","citing_arxiv_id":"2604.14800"},{"n":1,"role":"method","polarity":"use_method","paper_title":"Enhancing Event Reconstruction in Hyper-Kamiokande with Machine Learning: A ResNet Implementation","primary_cat":"hep-ex","context_text":"Together, these considerations make a scalable, high-speed, and robust reconstruction capable of operating at Monte Carlo scale essential for Hyper-Kamiokande. Machine-learning based reconstruction offers a promising path toward meeting these computational and topological chal- lenges. 
Convolutional neural networks [ 16], and in particular residual networks (ResNets) [17], are well suited to process the high-dimensional charge and time images recorded by the PMT array. At Super-Kamiokande, machine-learning techniques are already employed for solar-neutrino classification [18] and for neutron-capture tagging [ 19], and they show encouraging re- sults in neutrino-reconstruction studies for other experiments [20, 21, 22].","citing_arxiv_id":"2604.13503"},{"n":1,"role":"method","polarity":"use_method","paper_title":"Human Centered Non Intrusive Driver State Modeling Using Personalized Physiological Signals in Real World Automated Driving","primary_cat":"cs.HC","context_text":"Instead of binary classification, our model classifies into four states (LL,L,H,HH), and instead of training CNN feature extractors from scratch, we use pre-trained ResNet50 using transfer learning. The model architecture is shown in Figure 3. 3.6.1 Feature extraction.The first step is to extract features from each of the seven images. Here we apply transfer learning using ResNet50 [22], pre-trained on a large dataset. We extract information from the penultimate layer of ResNet50, compressing each image into a feature vector of size2048 × 1. This process is repeated for all seven input images (one per signal), resulting in seven feature vectors. These feature vectors are then concatenated into a single vector of size14336 × 1(because2048 × 7 = 14336.","citing_arxiv_id":"2604.11549"},{"n":1,"role":"dataset","polarity":"use_dataset","paper_title":"Mosaic: Cross-Modal Clustering for Efficient Video Understanding","primary_cat":"cs.PF","context_text":"historical video and recomputes attention upon query arrival. (2) ReKV [12] retrieves query-relevant KVCache at the token level. (3) LiveVLM [13] further combines token-level retrieval with KVCache compression to reduce memory usage. (4) StreamMem [14] also compresses KVCache, but under a TABLE II DATASET CONFIGURATIONS. 
Dataset Max Length Description MLVU [19] 703s multi-task long video LongVideoBench [20] 468s long-term multi-modal video VideoMME [21] 1,018s full-spectrum multi-modal video RVS-ego [22] 3,605s first-person ego-centric video RVS-movie [22] 1,671s third-person plot video query-agnostic memory budget. Following prior work [12], [13], we set the retrieved frames to 64 for all baselines. B. Overall Performance","citing_arxiv_id":"2604.10060"},{"n":1,"role":"method","polarity":"use_method","paper_title":"DSVTLA: Deep Swin Vision Transformer-Based Transfer Learning Architecture for Multi-Type Cancer Histopathological Cancer Image Classification","primary_cat":"eess.IV","context_text":"histopathological images [2], [4], [5], [6]. CNN have been widely adopted for cancer detection due to their ability to capture local texture patterns and hierarchical spatial features. Residual learning has been introduced to alleviate the vanishing gradient problem, leading to significant improvements in deep feature representation, as exemplified by ResNet architectures [7]. Similarly, DenseNet and kernel architectures enhance feature reuse and gradient flow, while EfficientNet achieves state-of-the-art performance through compound scaling with reduced computational cost [8], [9], [10] . More recently, transformer -based architectures have gained attention in medical imaging for their capability to model long -range","citing_arxiv_id":"2604.09468"},{"n":1,"role":"method","polarity":"use_method","paper_title":"Variational Feature Compression for Model-Specific Representations","primary_cat":"cs.CV","context_text":"and gradient-based saliency, the scores are normalized and combined into a uni- fied importance measure, and thresholding yields a binary mask that selects the dimensions retained for decoding. importance score. A threshold produces a binary mask, and the masked vector Zm is decoded intoX ′ for inference. 
3.4 Variational Latent Bottleneck. Our encoder uses a ResNet-18 [11] backbone with the final fully connected and softmax layers removed. The 512-dimensional output from global average pooling is processed through two parallel linear layers to produce µ and log σ²; a sample Z is drawn via the reparameterization trick. Following the Variational Information Bottleneck (VIB) framework [2], minimizing I(Z;X) via KL divergence regularization reduces redundant information, while maximizing I(Z;Y)","citing_arxiv_id":"2604.06644"},{"n":1,"role":"method","polarity":"use_method","paper_title":"Lightweight True In-Pixel Encryption with FeFET Enabled Pixel Design for Secure Imaging","primary_cat":"cs.CV","context_text":"The PSNR values between encrypted and original images further confirm that the encrypted images diverge substantially from the originals, consistent with strong visual obfuscation. To further evaluate the robustness of SecurePix against machine-learning-based attacks, we performed an image-classification test using a ResNet-18 neural network [30]. In this experiment, the classifier is treated as an adversarial model attempting to recognize the encrypted images. For CIFAR-10 [31], the ResNet-18 network was first trained only on the unencrypted training images following standard supervised-learning procedures. After training, we encrypted 10,000 CIFAR-10 test images using SecurePix and fed these encrypted","citing_arxiv_id":"2604.05147"}]},"authors":[{"id":"fa0012a3-358f-4383-8457-2763cccd76e7","orcid":null,"display_name":"Jian Sun","source":"manual","import_confidence":0.72},{"id":"02ad2b6c-8a3d-4309-ba96-5181bc91718e","orcid":null,"display_name":"Kaiming He","source":"manual","import_confidence":0.72},{"id":"c49ca12d-98c9-4fc2-a95f-a30b92a41773","orcid":null,"display_name":"Shaoqing Ren","source":"manual","import_confidence":0.72},{"id":"90e0e192-6197-4ef4-b9c2-386dbfd79fad","orcid":null,"display_name":"Xiangyu Zhang","source":"manual","import_confidence":0.72}]}}