{"total":15,"items":[{"citing_arxiv_id":"2606.31159","ref_index":45,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"An Empirical Study of Security Calibration in Large Language Models for Code","primary_cat":"cs.SE","submitted_at":"2026-06-30T05:37:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Empirical evaluation of three LLMs finds prevalent overconfidence in insecure code generation, with security calibration outperforming functional calibration but both degrading in repository-level settings.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.21397","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Evaluating LLMs for Real-World Web Vulnerability Detection","primary_cat":"cs.CR","submitted_at":"2026-06-19T13:02:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Frontier LLMs detect up to 63% of web vulnerabilities in WordPress plugins with scoped prompts outperforming open-ended ones, but all show low consistency across runs and miss some baseline issues.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.04739","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Revisiting Vul-RAG: Reproducibility and Replicability of RAG-based Vulnerability Detection with Open-Weight Models","primary_cat":"cs.SE","submitted_at":"2026-06-03T11:20:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"Reproducibility study of Vul-RAG confirms original findings in a fully local open-weights setting but identifies a persistent performance plateau at approximately 0.30 pairwise accuracy across diverse recent open-weight 
LLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30777","ref_index":75,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"What Breaks When LLMs Code? Characterizing Operational Safety Failures of Agentic Code Assistants","primary_cat":"cs.SE","submitted_at":"2026-05-29T03:09:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"An empirical study of 547 confirmed safety incidents from GitHub and literature derives a 33-type taxonomy showing constraint violations, destructive actions, and deception dominate in everyday coding-agent use.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13776","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"\"Like Taking the Path of Least Resistance\": Exploring the Impact of LLM Interaction on the Creative Process of Programming","primary_cat":"cs.HC","submitted_at":"2026-05-13T16:54:51+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LLM assistance shortens idea-generation periods and reduces creative moments during programming tasks while yielding solutions with comparable idea counts and greater functional correctness.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"In domains such as programming and design, LLMs assist users in exploring alternative solutions, automating repetitive scripting tasks, and lowering barriers to creative coding [55, 71, 80]). Impact of LLMs on Creativity. 
Empirical research indicates that LLMs can increase both the quantity and richness of ideas generated, particularly benefiting less experienced users, and can enhance the perceived creativity of outputs [4, 18, 30, 41]. However, concerns about homogenization remain: LLMs may inadvertently encourage convergence on similar solutions, reducing collective originality and diversity [5, 18, 47]. Importantly, the impact of LLMs on creative outcomes depends on interaction modalities (for example, whether the LLM functions as a \"ghostwriter\" or a sounding board) and on the presence of scaffolds or structured interaction [14, 30, 71]."},{"citing_arxiv_id":"2604.11398","ref_index":61,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Nix: A Solution With Problems","primary_cat":"cs.SE","submitted_at":"2026-04-13T12:41:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"A literature review of Nix's functional package management solutions to software deployment problems alongside the new and unsolved issues it introduces.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.08417","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Vulnerability Detection with Interprocedural Context in Multiple Languages: Assessing Effectiveness and Cost of Modern LLMs","primary_cat":"cs.SE","submitted_at":"2026-04-09T16:17:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Adding interprocedural context from callers or callees enables LLMs to detect vulnerabilities more effectively, with Gemini 3 Flash achieving F1 scores of at least 0.978 for C at low cost and Claude Haiku 4.5 excelling at 
explanations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.18740","ref_index":90,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Measuring and Exploiting Contextual Bias in LLM-Assisted Security Code Review","primary_cat":"cs.SE","submitted_at":"2026-03-19T10:40:27+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LLM-based security code review is vulnerable to framing bias, with a novel iterative refinement attack achieving 100% success in reintroducing vulnerabilities across real projects.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.08367","ref_index":103,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Methodological Analysis of Empirical Studies in Quantum Software Testing","primary_cat":"quant-ph","submitted_at":"2026-01-13T09:29:00+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A systematic analysis of 59 quantum software testing empirical studies reveals highly diverse designs, inconsistent reporting, and open methodological challenges, leading to recommendations for future work.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"The test oracle problem arises in almost every empirical study on QST and can substantially affect experimental evaluations. For example, oracle-induced false positives may inflate the number of reported faults, thereby confounding effectiveness-related metrics. We note that, while CST research has extensively studied oracle absence (e.g., oracle generation [27, 84] and metamorphic testing [20, 103]), empirical QST studies face additional challenges rooted in quantum semantics and measurement. 
Therefore, before directly transferring CST-style solutions, it is important to first address several QC-specific issues that determine whether an oracle is well-defined and"},{"citing_arxiv_id":"2511.02434","ref_index":54,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Who's Who? LLM-assisted Software Traceability with Architecture Entity Recognition","primary_cat":"cs.SE","submitted_at":"2025-11-04T10:06:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLM approaches ExArch and ArTEMiS reach F1 scores of 0.86 and 0.81 for architecture entity recognition and traceability, matching or approaching baselines that require manual models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.22202","ref_index":51,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Library Hallucinations in LLM-Generated Code: A Risk Analysis Grounded in Developer Queries","primary_cat":"cs.SE","submitted_at":"2025-09-26T11:14:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A study of seven LLMs finds that realistic prompt variations such as one-character misspellings trigger library hallucinations in up to 26% of cases, fabricated names in up to 99%, and time-based prompts in up to 85%, and introduces LibHalluBench for evaluation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.19625","ref_index":95,"ref_count":4,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Search-Based Software Engineering and AI Foundation Models: Current Landscape and Future 
Roadmap","primary_cat":"cs.SE","submitted_at":"2025-05-26T07:46:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A research roadmap analyzing the current state of search-based software engineering with foundation models, outlining challenges and directions across three integration aspects.","context_count":2,"top_context_role":"background","top_context_polarity":"background","context_text":"LLM-integrated approach. Proceedings of the ACM on Programming Languages, 8(OOPSLA1):474-499, 2024. doi: 10.1145/3649828. [94] Wenwu Li, Xiangfeng Wang, Wenhao Li, and Bo Jin. A survey of automatic prompt engineering: An optimization perspective. CoRR, abs/2502.11560, 2025. doi: 10.48550/ARXIV.2502.11560. URL https://doi.org/10.48550/arXiv.2502.11560. [95] Ziyang Li, Saikat Dutta, and Mayur Naik. IRIS: LLM-assisted static analysis for detecting security vulnerabilities. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025. URL https://openreview.net/forum?id=9LdJDU7E91. 
[96] Davide Li Calsi, Matias Duran, Thomas Laurent, Xiao-Yi Zhang, Paolo Arcaini, and Fuyuki Ishikawa."},{"citing_arxiv_id":"2503.17181","ref_index":66,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Study of LLMs' Preferences for Libraries and Programming Languages","primary_cat":"cs.SE","submitted_at":"2025-03-21T14:29:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Empirical study of eight LLMs finds overuse of popular libraries like NumPy in up to 45% of unnecessary cases and strong default preference for Python even when suboptimal.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2412.11194","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Direction for Detection: A Survey of Automated Vulnerability Detection and all of its Pain Points","primary_cat":"cs.SE","submitted_at":"2024-12-15T14:01:41+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ML4AVD research remains locked into binary function-level classification of C/C++ vulnerabilities because twelve pain points in the pipeline reinforce each other through feedback loops.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"sought to catalyze this change, pushing AVD from research prototypes to practical solutions ready for production. This work systematizes current approaches in AVD in response to growing complexity and fragmentation within the field. As AVD research has evolved swiftly from rule-based methods [3], [4] to sophisticated machine learning [5], [6] and language model-driven techniques [7], [8], [9], the field has become increasingly divided, with studies often focused on isolated tasks, limited language support, and inconsistent evaluation methodologies. 
For example, prior works have noted the variability in dataset quality [10] and poor open science practices [11]. Overall, this rapid progress, while impressive, has potentially led the field away from a unified"},{"citing_arxiv_id":"2310.11113","ref_index":77,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Revisiting Sentiment Analysis for Software Engineering in the Era of Large Language Models","primary_cat":"cs.SE","submitted_at":"2023-10-17T09:53:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"bLLMs achieve state-of-the-art results on limited and imbalanced SE sentiment datasets even in zero-shot settings, but fine-tuned sLLMs outperform when ample balanced training data is available.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}