{"total":15,"items":[{"citing_arxiv_id":"2605.20788","ref_index":53,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"BioDefect: The First Dataset for Defect Detection in Bioinformatics Software","primary_cat":"cs.SE","submitted_at":"2026-05-20T06:34:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"BioDefect is a new dataset for defect detection in bioinformatics software that improves average F1-scores by 29.61% to 38.04% over existing datasets when evaluated on nine language models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10240","ref_index":31,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"MARGIN: Margin-Aware Regularized Geometry for Imbalanced Vulnerability Detection","primary_cat":"cs.SE","submitted_at":"2026-05-11T09:14:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MARGIN reduces geometric distortions in imbalanced vulnerability embeddings by dynamically regularizing margins with von Mises-Fisher concentration estimates and hyperspherical prototypes.","context_count":1,"top_context_role":"background","top_context_polarity":"support","context_text":"evaluates representations via aggregate performance metrics, with limited analysis of embedding-space geometry and class-wise distributional properties. B. Metric Learning Metric learning enhances representations by enforcing intra-class compactness and inter-class separability. Early pairwise or triplet losses are sensitive to sampling [30], while margin-based Softmax variants, such as ArcFace [31], introduce angular margins under hyperspherical constraints to improve open-set recognition [1], [32]. These methods apply fixed, sample-level margins to enlarge class separation. 
Moreover, it focus on open-set recognition and do not explicitly address class imbalance, whereas our work targets the closed-set, imbalanced scenario. Unlike margin-based methods, our approach does not treat"},{"citing_arxiv_id":"2605.08299","ref_index":17,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Do not copy and paste! Rewriting strategies for code retrieval","primary_cat":"cs.SE","submitted_at":"2026-05-08T11:31:26+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Full natural-language rewriting of code and queries boosts retrieval on code benchmarks while corpus-only rewriting often hurts, with token entropy difference serving as a cheap predictor of gains.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.03689","ref_index":10,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Deep Graph-Language Fusion for Structure-Aware Code Generation","primary_cat":"cs.SE","submitted_at":"2026-05-05T12:33:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CGFuse enables deep token-level fusion of graph-derived structural features into language models, yielding 10-16% BLEU and 6-11% CodeBLEU gains on code generation tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.02860","ref_index":12,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Standing on the Shoulders of Giants: Stabilized Knowledge Distillation for Cross--Language Code Clone Detection","primary_cat":"cs.AI","submitted_at":"2026-05-04T17:37:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Reasoning-oriented knowledge distillation from DeepSeek-R1 plus response 
stabilization improves reliability and often performance of compact models for cross-language code clone detection on pairs like Python-Java and Rust-Java.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"However, this study focuses on general code understanding rather than X-CCD. 7.3 Knowledge Distillation and Response Stabilization for X-CCD Although KD has been studied for code understanding, its use for CCD and especially X-CCD remains limited. Existing X-CCD methods mainly focus on representation learning, graph encoders, contrastive learning, or LLM prompting [12, 13, 33]. These works improve clone detection accuracy, but they do not explicitly study how knowledge from a large teacher model can be transferred to an efficient student model for cross-language clone detection. Our work is also related to response stabilization, which aims to reduce output variability across prompts, decoding paths, or repeated runs."},{"citing_arxiv_id":"2604.27714","ref_index":10,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"How Code Representation Shapes False-Positive Dynamics in Cross-Language LLM Vulnerability Detection","primary_cat":"cs.CR","submitted_at":"2026-04-30T11:01:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Text fine-tuning of 8B LLMs on C/C++ vulnerability data inflates cross-language false-positive rates through surface-cue memorization, which an AST inference probe can partially reverse while direct AST fine-tuning cannot.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.25903","ref_index":17,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Carbon-Taxed Transformers: A Green Compression Pipeline for Overgrown Language 
Models","primary_cat":"cs.SE","submitted_at":"2026-04-28T17:48:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"CTT is a compression pipeline for LLMs that achieves up to 49x memory reduction, 10x faster inference, 81% lower CO2 emissions, and retains 68-98% accuracy on code clone detection, summarization, and generation tasks.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":", functionally similar despite possible syntactic differences) or not. In our setup, the dataset has a training set of over 45,000 examples and separate validation and test sets of 4,000 examples each. Training and Evaluation Strategy. The architectural details for this task are detailed in Table 1. The student model is trained using logits generated by fine-tuned encoder based teacher model: UniXCoder [17]. The model learns only by mimicking the teacher's output. We evaluate model performance on the CCD task using precision and recall. Additionally, we report inference latency, memory usage and inference CO2 emission analysis. 
In this evaluation, CodeBERT [13], CodeT5 [62], CodeT5+-220M [61], Compressor [50] and Avatar [49] are included as comparative baselines."},{"citing_arxiv_id":"2604.20835","ref_index":37,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Parallel-SFT: Improving Zero-Shot Cross-Programming-Language Transfer for Code RL","primary_cat":"cs.CL","submitted_at":"2026-04-22T17:58:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Parallel-SFT mixes parallel programs across languages during SFT to produce more transferable RL initializations, yielding better zero-shot generalization to unseen programming languages.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.19031","ref_index":21,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"SAGE: Signal-Amplified Guided Embeddings for LLM-based Vulnerability Detection","primary_cat":"cs.CR","submitted_at":"2026-04-21T03:27:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SAGE uses sparse autoencoders to boost vulnerability signals in LLMs, raising internal SNR 12.7x and delivering up to 318% MCC gains on vulnerability detection benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":", MSIVD) to better align model training with realistic, evolving vulnerability landscapes. Reinforcement Learning-based Methods. To align models with non-differentiable security objectives, researchers have integrated Reinforcement Learning (RL). Methodologies range from optimizing prompts via policy gradients [42] to continuous prefix tuning for generating secure code [21, 47]. More recently, efforts have shifted towards internalizing reasoning capabilities. 
Inspired by advancements in reasoning models [13], approaches like those by Simoni et al. [49] utilize Group Relative Policy Optimization (GRPO) to enhance detection logic. Frameworks such as ReVD [56] further employ curriculum learning and synthetic reasoning traces to improve the"},{"citing_arxiv_id":"2604.05100","ref_index":9,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Edit, But Verify: An Empirical Audit of Instructed Code-Editing Benchmarks","primary_cat":"cs.SE","submitted_at":"2026-04-06T18:59:42+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"The two main benchmarks for LLM instructed code editing over-represent Python, miss common real-world domains and edit types, and have test coverage issues that limit what they measure.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.03959","ref_index":3,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"LoRA-MME: Multi-Model Ensemble of LoRA-Tuned Encoders for Code Comment Classification","primary_cat":"cs.SE","submitted_at":"2026-03-04T11:36:32+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"LoRA-MME ensembles LoRA-adapted UniXcoder, CodeBERT, GraphCodeBERT, and CodeBERTa with learned weights to reach 0.7906 weighted F1 and 0.6867 macro F1 on code comment classification.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2503.14852","ref_index":18,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"UntrustVul: An Automated Approach for Identifying Untrustworthy Alerts in Vulnerability Detection 
Models","primary_cat":"cs.SE","submitted_at":"2025-03-19T03:18:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"UntrustVul identifies untrustworthy vulnerability predictions by marking lines that neither match historical vulnerability patterns nor influence vulnerable lines through dependencies, reporting AUC 70-88% and F1 82-94% on 115K predictions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2501.16044","ref_index":4,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"MultiMend: Multilingual Program Repair with Context Augmentation and Multi-Hunk Patch Generation","primary_cat":"cs.SE","submitted_at":"2025-01-27T13:37:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"MultiMend augments buggy function context via retrieval and generates multi-hunk patches, fixing 2,227 of 5,501 bugs across six benchmarks in four languages.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2308.03281","ref_index":124,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Towards General Text Embeddings with Multi-stage Contrastive Learning","primary_cat":"cs.CL","submitted_at":"2023-08-07T03:52:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"GTE_base is a compact text embedding model using multi-stage contrastive learning on diverse data that outperforms OpenAI's API and 10x larger models on massive benchmarks and works for code as text.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2306.03091","ref_index":20,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"RepoBench: Benchmarking 
Repository-Level Code Auto-Completion Systems","primary_cat":"cs.CL","submitted_at":"2023-06-05T17:59:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"RepoBench is a new benchmark with retrieval, completion, and pipeline tasks to evaluate code auto-completion systems on entire repositories instead of single files.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}