{"total":16,"items":[{"citing_arxiv_id":"2606.21171","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"An Exploratory Case Study of LLM-Assisted Refactoring and Gameplay Feature Generation in an Endless Runner Game","primary_cat":"cs.SE","submitted_at":"2026-06-19T07:19:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"GPT-4o successfully completed all three refactoring tasks but only one of three gameplay feature generation tasks in the studied endless runner game.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.17981","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Planning to Hammer: Difficulty-Aware Decomposition for Automating Rocq Proofs","primary_cat":"cs.SE","submitted_at":"2026-06-16T14:33:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Quarry improves Rocq proof automation success rates by 7-13% under 10-minute budgets via LLM-planned decompositions ranked by a proof-state difficulty model for CoqHammer solvability.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.08588","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LLM vs. 
Human Unit Tests: Fault Detection on Real Python Bugs","primary_cat":"cs.SE","submitted_at":"2026-06-07T12:01:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LLM-generated unit tests with retrieval-augmented context detect faults in 69% of real Python bugs versus 17.2% for general-purpose human-written tests, with similar coverage levels.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.04704","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Extraction and Search in Rocq: Theorems, Definitions and Their dependencies","primary_cat":"cs.SE","submitted_at":"2026-06-03T10:33:17+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"TheoremExtr extracts 71,795 theorems with dependencies and 27,481 definitions from 32 Rocq projects and provides a cross-project similarity search website.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.29822","ref_index":26,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Inferring Code Correctness from Specification","primary_cat":"cs.SE","submitted_at":"2026-05-28T12:04:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TRAILS infers code correctness by aggregating LLM judgments on input-output pairs from category-partitioned specification tests, improving MCC by up to 39% over Zero-Shot COT on LiveCodeBench and CoCoClaNeL.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.26017","ref_index":46,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Trustworthy Software Project Generation : a Case Study with an Interactive Theorem 
Prover","primary_cat":"cs.SE","submitted_at":"2026-05-25T16:35:36+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"An LLM agent with Rocq backend automatically builds a verified RISC-V RV32I interpreter (1859 lines Rocq, 2848 lines extracted C++) that passes 265 tests and 12-hour fuzzing, while a Dafny backend fails.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20310","ref_index":101,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Combined Program Analysis Techniques: A Systematic Mapping Study","primary_cat":"cs.SE","submitted_at":"2026-05-19T16:44:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A systematic mapping study of 248 papers introduces a taxonomy of synergistic effects, inter-analysis workflows, and mapping functions to catalog patterns in combined program analysis techniques.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Hunger, R. Wood, S. Khurshid, and M. Tiwari. ACHyb: A hybrid analysis approach to detect kernel access control vulnerabilities. pages 316-327. Association for Computing Machinery, Inc, 2021. [100] Z. Huang, S. Ravi, and C. Wang. Discovering Likely Program Invariants for Persistent Memory. pages 1795-1807. Association for Computing Machinery, Inc, 2024. [101] Z. Hui. Utilization of Dependence and Weight to Improve Fault Localization Method of Regression Test Cases. International Journal of Software Engineering and Knowledge Engineering, 27(3):423-447, 2017. Publisher: World Scientific Publishing Co. Pte Ltd. [102] W. Hummer, O. Raz, O. Shehory, P. Leitner, and S. Dustdar. 
Testing of data-centric and event-based dynamic service compositions."},{"citing_arxiv_id":"2605.17914","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Guiding LLM-based Loop Invariant Synthesis via Feedback on Local Reasoning Errors","primary_cat":"cs.PL","submitted_at":"2026-05-18T06:23:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LORIS detects local reasoning errors in LLM-generated proofs for loop invariants by translating natural-language steps to first-order logic implications and using invalid implications to refine the invariants, achieving 93.1% success on 460 C programs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17242","ref_index":30,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"From Runnable to Shippable: Multi-Agent Test-Driven Development for Generating Full-Stack Web Applications from Requirements","primary_cat":"cs.SE","submitted_at":"2026-05-17T03:48:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"TDDev automates the full TDD loop for web app generation from requirements, delivering 34-48 percentage point quality gains and zero manual intervention in user studies.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13716","ref_index":36,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SkillOps: Managing LLM Agent Skill Libraries as Self-Maintaining Software Ecosystems","primary_cat":"cs.SE","submitted_at":"2026-05-13T16:02:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SkillOps maintains LLM skill libraries via Skill Contracts and ecosystem graphs, raising ALFWorld task success 
to 79.5% as a standalone agent and improving retrieval baselines by up to 2.9 points with near-zero library-time LLM cost.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08694","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Learning Method for Symbolic Systems Using Large Language Models","primary_cat":"cs.SE","submitted_at":"2026-05-09T05:05:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LLM2Ltac mines symbolic tactics from 11,725 Coq theorems using LLMs and integrates them into CoqHammer, improving proof rates by 23.87% on 6,199 theorems from four large verification projects.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"demonstrates the effectiveness of generalization testing. Without generalization testing, many ineffective tactics are present, preventing the retrieval of effective tactics. Answer to RQ6: The results demonstrate that the generalization testing is effective in improving the performance of LLM2Ltac. 6 Related Work Recent work uses LLMs to extract structured knowledge such as logic rules [28], mathematical equations [36], knowledge graphs [46], and lemmas [18]. These methods target specific rule types and data sources, and none of them mine reusable tactics from formal proofs. LLM2Ltac fills this gap by mining reasoning strategies from existing proofs and using the mined tactics to enhance symbolic provers. 
The most related work is Strat2Rocq [18], which also enhances"},{"citing_arxiv_id":"2605.07403","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Boosting Automatic Java-to-Cangjie Translation with Multi-Stage LLM Training and Error Repair","primary_cat":"cs.SE","submitted_at":"2026-05-08T07:58:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Multi-stage LLM training plus compiler-guided error repair boosts functional equivalence in Java-to-Cangjie translation by 6.06% over prior methods despite scarce parallel data.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"1 Challenges in Low-Resource Code Translation Translating widely used Java code into the emerging Cangjie language is important for cross-platform migration. However, Cangjie is still under development, and high-quality Java-to-Cangjie parallel corpora are scarce. This low-resource scenario limits the effectiveness of traditional code translation approaches [23]. This challenge is exacerbated by the knowledge imbalance in LLMs. Most LLMs are pre-trained mainly on high-resource programming languages such as Java and Python. 
As a result, they often lack sufficient knowledge of the core syntax, the usage of standard library API, and runtime behavior in low-resource programming languages, e.g., Cangjie [18]."},{"citing_arxiv_id":"2604.06755","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Babbling Suppression: Making LLMs Greener One Token at a Time","primary_cat":"cs.SE","submitted_at":"2026-04-08T07:21:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Babbling Suppression stops LLM code generation upon test passage to reduce token output and energy consumption by up to 65% across Python and Java benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"18 families of LLMs on typical software development tasks considering full-precision and quantized versions. Their results show that larger models with higher energy budgets do not always show substantially improved accuracy, and quantized versions of large models can often achieve better efficiency without compromising performance. Similarly, Mehditabar et al. [25] propose the BRACE framework to systematically benchmark code language models on functional correctness and energy efficiency. By evaluating 22 state-of-the-art models, their framework provides insights into the accuracy-energy trade-offs. 
Our method complements these works by preserving the same level of accuracy while further reducing the energy footprint, providing a practical approach to more sustainable"},{"citing_arxiv_id":"2602.13851","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Evaluating LLM-Generated ACSL Annotations for Formal Verification","primary_cat":"cs.SE","submitted_at":"2026-02-14T19:18:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Rule-based annotation generation for ACSL outperforms LLM-based methods in achieving successful formal verification of C programs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.06428","ref_index":41,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Walking the Tightrope of LLMs for Software Development: A Practitioners' Perspective","primary_cat":"cs.SE","submitted_at":"2025-11-09T15:49:55+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Qualitative interview study with 22 practitioners identifies multi-level benefits, challenges, and mitigation strategies for using LLMs in software development.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.19625","ref_index":196,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Search-Based Software Engineering and AI Foundation Models: Current Landscape and Future Roadmap","primary_cat":"cs.SE","submitted_at":"2025-05-26T07:46:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A research roadmap analyzing the current state of search-based software engineering with foundation models, outlining challenges and directions across three integration 
aspects.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"For example, in the case of test scenarios generated for cyber-physical systems like ADS (e.g., using genetic algorithms) [91, 176, 103, 137, 105, 27, 56, 28], FMs (leveraging their capabilities FM-S3 and FM-S4) can help determine whether the generated scenarios are realistic. To this end, some works already aim to assess the realism of generated test scenarios (e.g., [196]). However, these works remain preliminary and are very specific to test scenarios generated using a single testing technique; therefore, more general methods are required. FMs can also apply their strengths FM-S1 and FM-S2 to improve the quality of generated test scenarios from various perspectives. For instance, they can increase the diversity of scenarios and support the automated removal of redundant ones."}],"limit":50,"offset":0}