CASCADE finds code-documentation mismatches by running LLM-generated tests from docs and confirming failure only when documentation-derived code succeeds on the same test.
hub Canonical reference
Large Language Models for Software Engineering: Survey and Open Problems
Canonical reference. 100% of citing Pith papers cite this work as background.
hub tools
citation-role summary
citation-polarity summary
verdicts
UNVERDICTED 12roles
background 6polarities
background 6representative citing papers
TRACE reveals that LLMs detect documentation bugs and contradictions better than subtle implementation drift, with asymmetric sensitivity and poor confidence calibration across seven models on 22k traces.
A distribution-free abstention rule grounded in multiple hypothesis testing uses execution consistency to let code LLMs avoid hallucination-prone tasks with theoretical guarantees.
A systematic mapping study of 45 LLM-based RE papers identifies and characterizes 62 public datasets, revealing imbalances in open-science practices, elicitation support, and socio-technical diversity.
AgentReputation proposes separating AI agent task execution, reputation management, and secure record-keeping into distinct layers, with context-specific reputation cards and a risk-based policy engine to handle verification in decentralized settings.
RedShell fine-tunes LLMs on enhanced malicious PowerShell data to produce syntactically valid offensive code for pentesting, reporting over 90% validity, strong semantic match to references, and better edit-distance similarity than prior methods plus functional execution success.
Augmenting LLMs with bug references, few-shot learning, chain-of-thought, and RAG improves MPI error detection accuracy from 44% to 77% and generalizes across models.
Data-influence-score filtering using validation-set loss on downstream coding tasks improves Code-LLM performance, with the most beneficial training data varying significantly across different programming tasks.
A research roadmap analyzing the current state of search-based software engineering with foundation models, outlining challenges and directions across three integration aspects.
RedShell fine-tunes LLMs on a custom dataset of public code samples to generate syntactically valid PowerShell scripts with semantic similarity to references, reporting under 10% parse errors and over 50%/40% mean similarity on Edit Distance and METEOR.
Engineering choices for tools, safety guardrails, and human oversight determine whether an internal coding agent delivers value in practice more than the underlying model quality.
XGBoost models trained on ≤16-qubit data predict eigensolver hyperparameters and reduce error by 0.12% on 28-qubit systems.
citing papers explorer
-
CASCADE: Detecting Inconsistencies between Code and Documentation with Automatic Test Generation
CASCADE finds code-documentation mismatches by running LLM-generated tests from docs and confirming failure only when documentation-derived code succeeds on the same test.
-
Measuring LLM Trust Allocation Across Conflicting Software Artifacts
TRACE reveals that LLMs detect documentation bugs and contradictions better than subtle implementation drift, with asymmetric sensitivity and poor confidence calibration across seven models on 22k traces.
-
Task Abstention for Large Language Models in Code Generation
A distribution-free abstention rule grounded in multiple hypothesis testing uses execution consistency to let code LLMs avoid hallucination-prone tasks with theoretical guarantees.
-
Characterizing Datasets for LLM-based Requirements Engineering: A Systematic Mapping Study
A systematic mapping study of 45 LLM-based RE papers identifies and characterizes 62 public datasets, revealing imbalances in open-science practices, elicitation support, and socio-technical diversity.
-
AgentReputation: A Decentralized Agentic AI Reputation Framework
AgentReputation proposes separating AI agent task execution, reputation management, and secure record-keeping into distinct layers, with context-specific reputation cards and a risk-based policy engine to handle verification in decentralized settings.
-
Towards Automated Pentesting with Large Language Models
RedShell fine-tunes LLMs on enhanced malicious PowerShell data to produce syntactically valid offensive code for pentesting, reporting over 90% validity, strong semantic match to references, and better edit-distance similarity than prior methods plus functional execution success.
-
Improving MPI Error Detection and Repair with Large Language Models and Bug References
Augmenting LLMs with bug references, few-shot learning, chain-of-thought, and RAG improves MPI error detection accuracy from 44% to 77% and generalizes across models.
-
An Empirical Study on Influence-Based Pretraining Data Selection for Code Large Language Models
Data-influence-score filtering using validation-set loss on downstream coding tasks improves Code-LLM performance, with the most beneficial training data varying significantly across different programming tasks.
-
Search-Based Software Engineering and AI Foundation Models: Current Landscape and Future Roadmap
A research roadmap analyzing the current state of search-based software engineering with foundation models, outlining challenges and directions across three integration aspects.
-
RedShell: A Generative AI-Based Approach to Ethical Hacking
RedShell fine-tunes LLMs on a custom dataset of public code samples to generate syntactically valid PowerShell scripts with semantic similarity to references, reporting under 10% parse errors and over 50%/40% mean similarity on Edit Distance and METEOR.
-
Building an Internal Coding Agent at Zup: Lessons and Open Questions
Engineering choices for tools, safety guardrails, and human oversight determine whether an internal coding agent delivers value in practice more than the underlying model quality.
-
Accelerating Quantum Eigensolver Algorithms With Machine Learning
XGBoost models trained on ≤16-qubit data predict eigensolver hyperparameters and reduce error by 0.12% on 28-qubit systems.