TBE identifies 32.1% of 92,011 equivalent surviving quantum mutants (29,536) via OpenQASM comparison after transpilation, reporting 100% precision and 82% accuracy on 348,299 mutants.
Nguyen and Raymond Choo
Canonical reference. 79% of citing Pith papers cite this work as background.
citing papers explorer
-
Quantum Mutant Equivalence via Transpilation
TBE identifies 32.1% of 92,011 equivalent surviving quantum mutants (29,536) via OpenQASM comparison after transpilation, reporting 100% precision and 82% accuracy on 348,299 mutants.
-
LGMT: Logic-Grounded Metamorphic Testing for Evaluating the Reasoning Reliability of LLMs
LGMT is a logic-grounded metamorphic testing framework that detects hidden reasoning defects in LLMs by checking consistency on semantically invariant inputs derived from FOL equivalences.
-
ClozeMaster: Fuzzing Rust Compiler by Harnessing LLMs for Infilling Masked Real Programs
ClozeMaster masks bracketed structures in historical Rust bug code and uses LLMs to infill them, generating test programs that discovered 27 confirmed bugs in rustc and mrustc while outperforming existing fuzzers.
-
GraphQLify: Automated and Type Safety-Preserving GraphQL API Adoption
GraphQLify automates REST-to-GraphQL migration via static source code analysis, delivering 100% type-safe conversions on 834 APIs and 2-4x faster performance than REST for multi-call workflows.
-
A Methodological Analysis of Empirical Studies in Quantum Software Testing
A systematic analysis of 59 quantum software testing empirical studies reveals highly diverse designs, inconsistent reporting, and open methodological challenges, leading to recommendations for future work.
-
EyeMulator: Improving Code Language Models by Mimicking Human Visual Attention
EyeMulator augments CodeLLM fine-tuning loss with token weights derived from human eye-tracking scan paths, producing large gains on code translation and summarization across StarCoder, Llama-3.2 and DeepSeek-Coder.
-
Do Machines Struggle Where Humans Do? LLM and Human Comprehension of Obfuscated Code
Reasoning-tuned LLMs align with human comprehension failure patterns under code obfuscation using the Block Model, unlike instruction-tuned variants.
-
ATTAIN: Automated Exploit Failure Analysis through Trace-Driven Diff Analysis
ATTAIN is a three-module trace-driven framework that combines exploit execution, LLM-guided diff search, and evidence-based judgment to identify affected library versions for CVEs, reporting 93.24% F1 on 224 CVEs across 25,943 versions.
-
PeAR: A Static Binary Rewriting Framework for Binary-Only Fuzzing
PeAR shows static binary instrumentation can instrument 88% of FUZZBENCH targets with 4x throughput gains and coverage matching compiler-based methods.
-
QUTest: A Native Testing Framework for Quantum Programs
QUTest is a native OpenQASM testing framework that encodes Arrange/Act/Assert tests and 12 assertion types via pragma comments while remaining compatible with existing tools.
-
Robust Mutation Analysis of Quantum Programs Under Noise
Noise from quantum hardware simulators significantly alters mutant detection distances, making equivalent mutants harder to separate from faults, with output-distribution metrics reaching 73.03% accuracy and 74.89% F1-score under device-specific thresholds.
-
Code-Centric Detection of Vulnerability-Fixing Commits: A Unified Benchmark and Empirical Study
Code language models show no transferable security understanding from code diffs alone, rely on commit messages, miss over 93% of fixes at 0.5% false positive rate, and suffer large drops under group or temporal splits.
-
Quality-Driven Selective Mutation for Deep Learning
A dual-axis quality framework ranks DL mutation operators by statistical resistance and Jaccard-based realism to real faults, enabling up to 55.6% fewer mutants on held-out validation data without dropping baseline performance.
-
Ethics Testing: Proactive Identification of Generative AI System Harms
Ethics testing is introduced as a systematic approach to generate tests that identify software harms induced by unethical behavior in generative AI outputs.
-
QuanForge: A Mutation Testing Framework for Quantum Neural Networks
QuanForge introduces statistical mutation killing and nine post-training mutation operators for QNNs to distinguish test suites and localize vulnerable circuit regions.
-
Co-Located Tests, Better AI Code: How Test Syntax Structure Affects Foundation Model Code Generation
Co-locating tests with implementation code yields substantially higher preservation and correctness in foundation-model-generated programs than separated test syntax.
-
A Comparative Study of Semantic Log Representations for Software Log-based Anomaly Detection
QTyBERT matches or exceeds BERT-based log anomaly detection effectiveness while reducing embedding generation time to near static word embedding levels.
-
Does Pass Rate Tell the Whole Story? Evaluating Design Constraint Compliance in LLM-based Issue Resolution
LLM agents resolve fewer than half of issues in a way that also satisfies design constraints, even when their patches pass the tests, as shown by a benchmark of 495 issues and 1787 constraints from six repositories.
-
Knowledge-Graph-Driven Data Synthesis for Low-Resource Software Development: A HarmonyOS Case Study
APIKG4Syn synthesizes API-oriented training data via knowledge graphs and Monte Carlo search to fine-tune a 7B model that reaches 25% pass@1 on HarmonyOS code generation, beating untuned GPT-4o at 17.59%.
-
Multi-LLM Orchestration for High-Quality Code Generation: Exploiting Complementary Model Strengths
PerfOrch is a four-agent multi-LLM system that uses offline profiling to build language-and-category rankings for routing tasks, achieving 97.19% and 95.83% pass@1 on HumanEval-X and EffiBench-X with generalization across benchmarks.
-
Constrained Co-evolutionary Metamorphic Differential Testing for Autonomous Systems with an Interpretability Approach
CoCoMagic applies constrained cooperative co-evolution to metamorphic and differential testing to find up to 287% more distinct behavioral divergences in an end-to-end ADS than baseline search methods.
-
UntrustVul: An Automated Approach for Identifying Untrustworthy Alerts in Vulnerability Detection Models
UntrustVul identifies untrustworthy vulnerability predictions by marking lines that neither match historical vulnerability patterns nor influence vulnerable lines through dependencies, reporting AUC 70-88% and F1 82-94% on 115K predictions.
-
MR-Adopt: Automatic Deduction of Input Transformation Function for Metamorphic Testing
MR-Adopt deduces input transformations from hard-coded MR test cases using LLMs, data-flow refinement, and output-relation selection to enable reuse with new source inputs.
-
MR-Scout: Automated Synthesis of Metamorphic Relations from Existing Test Cases
MR-Scout extracts over 11,000 metamorphic-relation-encoded test cases from 701 OSS projects, codifies 97% of them as high-quality generators, and shows they raise line coverage by 13.52% and mutation score by 9.42% on programs that already have developer tests.
-
Context-Based Adversarial Attacks on AI Code Generators: Vulnerability Analysis and Implications
Context-based adversarial attacks raise vulnerable code generation in models like GPT-4 and CodeLlama from 3.5% to 37.4%, with 60-100% transferability, and a dual-layer defense reaches 89.1% detection at low false positives.
-
Boosting Automatic Java-to-Cangjie Translation with Multi-Stage LLM Training and Error Repair
Multi-stage LLM training plus compiler-guided error repair boosts functional equivalence in Java-to-Cangjie translation by 6.06% over prior methods despite scarce parallel data.
-
Accelerating Policy Synthesis in Large-Scale MDPs via Hierarchical Adaptive Refinement
Presents hierarchical adaptive refinement to accelerate near-optimal policy synthesis in MDPs up to 1M states with up to 2x speedup over PRISM and formal error bounds.
-
Context-Aware Unit Testing for Quantum Subroutines
Proposes a context-aware unit testing framework for quantum subroutines modeled as parametrized quantum channels, using probabilistic assertions and demonstrated on GHZ preparation and Shor's algorithm subroutines.
-
DeepFWI: Identifying Bug-Sensitive Warnings with Multi-Modal Code-Warning Semantics
DeepFWI is a multi-modal LSTM model with cross-attention that identifies bug-sensitive warnings at warning granularity, reaching 67.06% F1 on a 280k-warning dataset and surfacing 25 confirmed bugs in four open-source projects.
-
HYDRA: A Hybrid Heuristic-Guided Deep Representation Architecture for Predicting Latent Zero-Day Vulnerabilities in Patched Functions
HYDRA is a hybrid model that uses heuristics plus deep embeddings and a VAE to predict latent zero-day vulnerabilities in patched functions from Chrome, Android, and ImageMagick.
-
Search-Based Software Engineering and AI Foundation Models: Current Landscape and Future Roadmap
A research roadmap analyzing the current state of search-based software engineering with foundation models, outlining challenges and directions across three integration aspects.
-
Software Engineering for Self-Adaptive Robotics: A Research Agenda
This paper proposes a research agenda for software engineering of self-adaptive robotic systems along lifecycle stages and enabling technologies, identifying challenges and a roadmap to 2030.