{"total":22,"items":[{"citing_arxiv_id":"2605.23304","ref_index":44,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"General Hazard Detection","primary_cat":"cs.CV","submitted_at":"2026-05-22T07:24:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Introduces CompliVision dataset and active learning framework for rule-based hazard compliance assessment using vision-language models grounded in safety standards.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19301","ref_index":24,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"iGSP:Implicit Gradient Subspace Projection for Efficient Continual Learning of Vision-Language Models","primary_cat":"cs.CV","submitted_at":"2026-05-19T03:22:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"iGSP uses implicit gradient subspace projection in two phases to enable efficient continual adaptation of vision-language models, claiming SOTA accuracy with 42.7% fewer trainable parameters and 86.9% less total parameter growth.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16026","ref_index":39,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"From Flat Language Labels to Typological Priors: Structured Language Conditioning for Multilingual Speech-to-Speech Translation","primary_cat":"cs.CL","submitted_at":"2026-05-15T15:01:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"S2ST-Omni 2 uses typology-informed hierarchical encoding, gated Dual-CTC, and typology-aware prompting to improve multilingual S2ST over flat-label baselines on CVSS-C, with gains in low-data 
regimes.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10345","ref_index":43,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"BGG: Bridging the Geometric Gap between Cross-View images by Vision Foundation Model Adaptation for Geo-Localization","primary_cat":"cs.CV","submitted_at":"2026-05-11T10:46:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"BGG adapts vision foundation models using multi-granularity dilated convolutions and frequency-domain patch aggregation to achieve state-of-the-art cross-view geo-localization on University-1652 and SUES-200 with low training cost.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"tions and out-of-distribution robustness. However, full fine-tuning of these massive models is not only computationally expensive but also prone to destroying the general knowledge structure acquired during pre-training. This issue is particularly evident for CVGL, where training data are often limited. To address this challenge, PETL originates from NLP (e.g., Adapters [25] and LoRA [43]). Its core idea is to freeze the pre-trained backbone and adapt to downstream tasks by optimizing only a minimal number of additional parameters. 
Although methods like AdaptFormer [44] and Mv-Adapter [45] have successfully transferred this paradigm to general visual tasks, their application in the CVGL domain remains in an exploratory stage [46]."},{"citing_arxiv_id":"2605.07861","ref_index":50,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"From Synthetic to Real: Toward Identity-Consistent Makeup Transfer with Synthetic and Real Data","primary_cat":"cs.CV","submitted_at":"2026-05-08T15:21:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"The work creates identity-consistent synthetic makeup data via ConsistentBeauty and adapts models to real images using reinforcement learning in RealBeauty, achieving better identity preservation and real-world performance than prior methods.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"Following previous works, we first employ several standard automatic metrics to evaluate the performance of makeup transfer methods from different perspectives. To assess makeup fidelity, we compute the cosine similarity between the generated and reference images within the CLIP [46] embedding space, referred to as the CLIP-I score. Following PhotoMaker [50], we utilize Face similarity metric to measure the ID consistency, which calculates the cosine similarity of face embedding [44] between source and transferred results. In addition, we use L2M to measure background preservation over non-facial regions defined by facial parsing [51]. 
Since makeup transfer requires both makeup fidelity and identity consistency, achieving high scores in one metric while"},{"citing_arxiv_id":"2605.07706","ref_index":1,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Bayesian Fine-tuning in Projected Subspaces","primary_cat":"cs.LG","submitted_at":"2026-05-08T13:14:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Bayesian fine-tuning of large models can be done efficiently by projecting uncertainties into low-dimensional subspaces, yielding improved calibration and generalization while keeping computational costs low.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"efficient training, calibration, uncertainty, subspace inference, Bayesian inference I. INTRODUCTION Parameter-efficient fine-tuning methods have become a practical alternative to full fine-tuning of large pretrained models, as they substantially reduce computational and memory requirements while preserving downstream performance. Among these methods, LoRA (Low-Rank Adaptation) [1] decomposes weight updates into low-rank matrices, enabling efficient adaptation to new tasks with only a small number of trainable parameters. Minimizing the number of trainable parameters reduces memory and storage requirements, making large-scale model adaptation feasible. 
Reducing computational overhead speeds up training time and makes adaptation pos-"},{"citing_arxiv_id":"2605.04769","ref_index":56,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Lightweight Cross-Spectral Face Recognition via Contrastive Alignment and Distillation","primary_cat":"cs.CV","submitted_at":"2026-05-06T11:16:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A lightweight hybrid CNN-Transformer framework for heterogeneous face recognition achieves competitive performance on cross-spectral benchmarks and standard RGB tasks using contrastive alignment and distillation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.04593","ref_index":53,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"DiCLIP: Diffusion Model Enhances CLIP's Dense Knowledge for Weakly Supervised Semantic Segmentation","primary_cat":"cs.CV","submitted_at":"2026-05-06T07:41:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DiCLIP uses diffusion-based visual correlation enhancement and text semantic augmentation to improve CLIP-generated class activation maps for weakly supervised semantic segmentation, outperforming prior methods on PASCAL VOC and MS COCO.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.26388","ref_index":16,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"SplitFT: An Adaptive Federated Split Learning System For LLMs Fine-Tuning","primary_cat":"cs.DC","submitted_at":"2026-04-29T07:58:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SplitFT adapts cut-layer selection and reduces LoRA rank per client in federated split learning to 
improve efficiency and performance when fine-tuning LLMs on heterogeneous devices and data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.26340","ref_index":5,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Adaptive and Fine-grained Module-wise Expert Pruning for Efficient LoRA-MoE Fine-Tuning","primary_cat":"cs.LG","submitted_at":"2026-04-29T06:45:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DMEP prunes experts module-by-module in LoRA-MoE and removes load balancing after pruning, cutting trainable parameters 35-43% and raising throughput ~10% while matching or exceeding uniform baselines on reasoning tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.19632","ref_index":28,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"CreatiParser: Generative Image Parsing of Raster Graphic Designs into Editable Layers","primary_cat":"cs.CV","submitted_at":"2026-04-21T16:20:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"CreatiParser decomposes raster graphic designs into editable text, background, and sticker layers via a hybrid VLM-diffusion model with ParserReward and GRPO optimization, reporting 23.7% average metric gains on Parser-40K and Crello datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.17949","ref_index":40,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"ZSG-IAD: A Multimodal Framework for Zero-Shot Grounded Industrial Anomaly 
Detection","primary_cat":"cs.CV","submitted_at":"2026-04-20T08:30:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ZSG-IAD is a zero-shot multimodal system that uses language-guided two-hop grounding and rule-based reinforcement learning to produce anomaly masks and explainable reports from industrial sensor data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.12686","ref_index":21,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"BID-LoRA: A Parameter-Efficient Framework for Continual Learning and Unlearning","primary_cat":"cs.LG","submitted_at":"2026-04-14T12:57:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"BID-LoRA uses bi-directional low-rank adapters with retain/new/unlearn pathways and escape unlearning to enable continual learning and unlearning while minimizing knowledge leakage and parameter updates.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.12610","ref_index":60,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Transforming External Knowledge into Triplets for Enhanced Retrieval in RAG of LLMs","primary_cat":"cs.CL","submitted_at":"2026-04-14T11:36:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Tri-RAG turns external knowledge into Condition-Proof-Conclusion triplets and retrieves via the Condition anchor to improve efficiency and quality in LLM RAG.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"we evaluate several parameter-efficient adaptation schemes for the extractor while keeping the backbone model frozen whenever applicable: •Pretrained: using the frozen backbone 
directly without any task-specific adaptation. • Prefix tuning [44]: prepending a small number of trainable continuous prefix vectors to the input sequence, while leaving all backbone parameters unchanged. • LoRA [60]: inserting low-rank adaptation modules into selected layers to enable efficient fine-tuning with a limited number of trainable parameters. • Tri-RAG (Ours): our extractor configuration, evaluated end-to-end against the above baselines on the same datasets to assess both the effectiveness and parameter efficiency of structured triplet extraction. 3) Experimental Details: To ensure fairness and repro-"},{"citing_arxiv_id":"2604.12575","ref_index":40,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"StructDiff: A Structure-Preserving and Spatially Controllable Diffusion Model for Single-Image Generation","primary_cat":"cs.CV","submitted_at":"2026-04-14T10:55:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"StructDiff adds adaptive receptive fields and 3D positional encoding to a single-scale diffusion model to preserve structure and enable spatial control in single-image generation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.06782","ref_index":19,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"EventFace: Event-Based Face Recognition via Structure-Driven Spatiotemporal Modeling","primary_cat":"cs.CV","submitted_at":"2026-04-08T07:52:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"EventFace achieves 94.19% Rank-1 accuracy and 5.35% EER on a new small event-based face dataset by transferring facial structure priors via LoRA and fusing them with temporal motion 
features.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.04086","ref_index":73,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"LAA-X: Unified Localized Artifact Attention for Quality-Agnostic and Generalizable Face Forgery Detection","primary_cat":"cs.CV","submitted_at":"2026-04-05T12:08:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LAA-X uses multi-task learning with explicit localized artifact attention and blending synthesis to build a deepfake detector that generalizes to high-quality and unseen manipulations after training only on real and pseudo-fake samples.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"More specifically, ground-truth heatmaps are generated by fitting an Unnormalized Gaussian Distribution for each pixel p_k = (p_k^x, p_k^y) ∈ P. The pixel p_k is considered as the center of the Gaussian Mask G_k. To take into account the neighborhood information of p_k, the standard deviation of G_k is adaptively computed. In particular, inspired by the work of [73], the standard deviation σ_k of p_k is computed based on the width and the height of the blending boundary mask B with respect to the point p_k. Similar to [73], a radius r_k is computed based on the size of the set of virtual objects that overlap the mask centered at p_k with an Intersection over Union (IoU) greater than a threshold t. 
In all our experiments, we set t to 0."},{"citing_arxiv_id":"2604.02828","ref_index":20,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"NavCrafter: Exploring 3D Scenes from a Single Image","primary_cat":"cs.CV","submitted_at":"2026-04-03T07:50:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"NavCrafter generates controllable novel-view videos from one image via video diffusion, geometry-aware expansion, and enhanced 3D Gaussian Splatting to achieve state-of-the-art synthesis under large viewpoint changes.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"clouds to improve multi-view consistency [11], [17], but their effectiveness depends on point cloud quality and remains limited to narrow-scoped scenes. B. Camera-Conditioned Video Diffusion Models Camera-conditioned video diffusion models have recently attracted growing attention [8], [9], [18]. Early works explored training-free conditioning strategies [19] or integrated LoRA modules [20] into diffusion pipelines for limited forms of camera control. Recent efforts, such as Gen3C [10], incorporated ControlNet-like conditioning with cross-attention mechanisms, but due to high computational costs, pose control was only applied at low-resolution stages in cascaded generators. 
Methods like DimensionX [21] achieved basic control via multiple LoRA modules but struggled"},{"citing_arxiv_id":"2601.10940","ref_index":22,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"HOSL: Hybrid-Order Split Learning for Memory-Constrained Edge Training","primary_cat":"cs.LG","submitted_at":"2026-01-16T01:54:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"HOSL reduces client memory up to 3.7x versus full first-order split learning while staying within 0.20-4.23% accuracy on OPT models by pairing client zeroth-order estimation with server first-order optimization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.09448","ref_index":33,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"One Prompt, Many Sounds: Modeling Listener Variability in LLM-Based Equalization","primary_cat":"cs.SD","submitted_at":"2026-01-14T12:51:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLMs using in-context learning and fine-tuning on listener experiment data generate equalization settings that align better with population preferences than random sampling or static presets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.19883","ref_index":64,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"CoMelSinger: Discrete Token-Based Zero-Shot Singing Synthesis With Structured Melody Control and Guidance","primary_cat":"cs.SD","submitted_at":"2025-09-24T08:34:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CoMelSinger introduces a discrete token-based zero-shot SVS framework on MaskGCT with coarse-to-fine contrastive learning and an SVT 
module to improve melody control and reduce prosody leakage.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.12089","ref_index":37,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"RadarPLM: Adapting Pre-trained Language Models for Marine Radar Target Detection by Selective Fine-tuning","primary_cat":"eess.SP","submitted_at":"2025-09-15T16:16:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"RadarPLM adapts PLMs for marine radar target detection with lightweight adaptation and selective fine-tuning based on online learning values, reporting at least 6.35% average detection gains in low SCR conditions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}