Bridging research and practice in simulation-based testing of industrial robot navigation systems

2025 · arXiv 3991.2025

Canonical reference. 100% of citing Pith papers cite this work as background.

45 Pith papers citing it

Background 100% of classified citations

citation-role summary

background 15

citation-polarity summary

background 15

representative citing papers

Mind your key: An Empirical Study of LLM API Credential Leakage in iOS Apps

cs.SE · 2026-06-10 · unverdicted · novelty 8.0

Empirical analysis of 444 iOS apps using dynamic traffic interception found 282 leaking LLM API keys across ten providers, with only 28% remediation after three months.

AgentFlow: Building Agent Dependency Graphs for Static Analysis of Agent Programs

cs.SE · 2026-07-02 · unverdicted · novelty 7.0 · 2 refs

AgentFlow builds a framework-agnostic Agent Dependency Graph from agent program source code to support static analyses such as BOM generation and prompt-to-tool risk detection, evaluated on 5,399 real programs across five frameworks.

Falsification, Not Exposure: An Internally Preregistered Placebo-Controlled Decomposition of Self-Repair Feedback in Frozen Small Code Models

cs.SE · 2026-06-30 · unverdicted · novelty 7.0

Preregistered placebo-controlled decomposition shows external executable counterexamples drive self-repair gains in small code models more than re-exposure or self-critique.

Kops: Safely Extending the eBPF Compilation Pipeline with Native Operations

cs.OS · 2026-06-23 · unverdicted · novelty 7.0

Kops enables extension of the eBPF JIT with native operations using proof sequences checked by the existing verifier and native emits, validated by Lean 4 proofs, delivering up to 24% microbenchmark and 12% application speedups.

The Alignment Problem in Constrained Code Generation

cs.SE · 2026-06-19 · unverdicted · novelty 7.0

Incomplete constrainers in constrained decoding push LLMs into low-probability program regions, making unconstrained decoding outperform constrained decoding on functional correctness across seven models and three benchmarks.

KBSpec: LLM-driven Formal Specification Generation with Evolving Domain Knowledge Base

cs.SE · 2026-06-19 · unverdicted · novelty 7.0

KBSpec maintains an evolving knowledge base combining external docs and internal verifier feedback to improve LLM generation of verifiable JML specifications, achieving 10-25% higher verification pass rates.

Scaffold, Not Vocabulary? A Controlled, Two-Tier, Pre-Registered Study of a Popperian Code-Generation Skill

cs.SE · 2026-06-04 · conditional · novelty 7.0

Controlled ablation finds Popperian code-generation skill adds no separable correctness benefit over labels-only scaffold; gains track structure not content.

EvoRepair: Enhancing Vulnerability Repair Agents Through Experience-Based Self-Evolution

cs.SE · 2026-05-28 · unverdicted · novelty 7.0

EvoRepair is the first experience-based self-evolving agent framework for automated vulnerability repair, reporting 90.46% overall success on PATCHEVAL and SEC-bench benchmarks.

SCARA: A Semantics-Constrained Autonomous Remediation Agent for Opaque Industrial Software Vulnerabilities

cs.CR · 2026-05-19 · unverdicted · novelty 7.0

SCARA introduces a four-stage pipeline using state-aware verification and constrained synthesis to remediate vulnerabilities in source-unavailable industrial software, reporting 100% precision and 88.9% success on a 15-case benchmark.

From Runnable to Shippable: Multi-Agent Test-Driven Development for Generating Full-Stack Web Applications from Requirements

cs.SE · 2026-05-17 · unverdicted · novelty 7.0

TDDev automates the full TDD loop for web app generation from requirements, delivering 34-48 percentage point quality gains and zero manual intervention in user studies.

Stress-Testing Neural Network Verifiers with Provably Robust Instances

cs.LG · 2026-05-16 · conditional · novelty 7.0

A reusable framework generates verification instances with provably known robustness labels, revealing numeric tolerance issues and bugs in five verifiers while introducing difficulty profiles to diagnose failure modes.

Hydra: Efficient, Correct Code Generation via Checkpoint-and-Rollback Support

cs.SE · 2026-05-14 · unverdicted · novelty 7.0

Hydra enables asynchronous static error checking and targeted checkpoint-rollback repair during LLM code generation, cutting latency by up to 71% and token use by up to 70% versus post-hoc repair on C/C++ tasks.

MASPrism: Lightweight Failure Attribution for Multi-Agent Systems Using Prefill-Stage Signals

cs.SE · 2026-05-08 · unverdicted · novelty 7.0 · 2 refs

MASPrism attributes failures in multi-agent systems by ranking candidates from prefill-stage NLL and attention signals of a 0.6B SLM, beating baselines by up to 33.41% Top-1 accuracy and proprietary LLMs by up to 89.5% relative improvement while processing traces in 2.66 seconds.

Single-Language Evidence Is Insufficient for Automated Logging: A Multilingual Benchmark and Empirical Study with LLMs

cs.SE · 2026-04-19 · unverdicted · novelty 7.0

MultiLogBench shows that LLM performance on automated logging varies substantially across programming languages, demonstrating that single-language evidence is insufficient for general claims about model behavior or tool design.

Certified Program Synthesis with a Multi-Modal Verifier

cs.SE · 2026-04-17 · unverdicted · novelty 7.0

LeetProof achieves higher rates of fully certified program synthesis from natural language by using a multi-modal verifier in Lean to validate specifications via randomized testing and delegate proofs to AI tools, outperforming single-mode baselines on benchmarks while uncovering defects in prior references.

MR-Coupler: Automated Metamorphic Test Generation via Functional Coupling Analysis

cs.SE · 2026-04-11 · conditional · novelty 7.0 · 2 refs

MR-Coupler leverages functional coupling analysis and LLMs to generate valid metamorphic test cases for over 90% of tasks while detecting 44% of real bugs, outperforming baselines by 64.90% in validity and 36.56% in false-alarm reduction.

Measuring and Exploiting Contextual Bias in LLM-Assisted Security Code Review

cs.SE · 2026-03-19 · accept · novelty 7.0

LLM-based security code review is vulnerable to framing bias, with a novel iterative refinement attack achieving 100% success in reintroducing vulnerabilities across real projects.

Benchmarking Quantum Software Testing with Scalable Quantum Programs

cs.SE · 2026-07-02 · unverdicted · novelty 6.0 · 2 refs

Qolumbina curates 40 quantum programs into a benchmark with QST-oriented criteria for functionality, output behavior, and complexity to support scalable empirical studies of quantum software testing approaches.

Underspecification does not imply Incoherence: The Risks of Semantic Collapse in Coding Models

cs.SE · 2026-07-02 · unverdicted · novelty 6.0

Coding LLMs exhibit detrimental semantic collapse on underspecified prompts by producing consistent but incorrect code rather than incoherent variations, affecting 3-32% of tasks across MBPP, HumanEval, and LiveCodeBench.

Uncovering Similar but Different Packages in PyPI and Potential Security Threats

cs.SE · 2026-06-29 · unverdicted · novelty 6.0

Large-scale analysis of 200K PyPI packages identifies 1,361 replicated popular packages, 256 replicated vulnerable packages, and 7 new replicated malicious packages, showing replication as a security threat vector.

Agentic Persona Generation with Critique-Refinement: An Industrial Evaluation

cs.SE · 2026-06-08 · unverdicted · novelty 6.0

PerGent, an agentic critique-refinement system for persona generation, reaches 96.9% expert approval in an industrial evaluation at Kinaxis and reproduces more pre-LLM expert content than single-shot baselines.

Code Is More Than Text: Uncertainty Estimation for Code Generation

cs.CL · 2026-06-08 · unverdicted · novelty 6.0

Three code-specific uncertainty axes (lexical, algorithmic, functional) yield an ensemble that raises average AUROC from 0.696 to 0.776 across five code LLMs, with one single-pass signal matching multi-pass baselines at lower cost.

Are We Lost in the Woods? Detecting Silent Semantic Faults for Random Forest Classifiers with Data-informed Static Analysis

cs.SE · 2026-06-05 · unverdicted · novelty 6.0

dille detects silent semantic faults in random forest ML pipelines with 91% precision via data-informed static analysis on Kaggle notebooks, finding 12-18% of scripts affected.

Provably Secure Agent Guardrail

cs.AI · 2026-05-28 · unverdicted · novelty 6.0

Introduces ePCA framework using neural-symbolic isolation to force agents to formalize intentions as logical constraints, claiming zero attack success and false positive rates in tested scenarios.

citing papers explorer

Showing 45 of 45 citing papers.

Mind your key: An Empirical Study of LLM API Credential Leakage in iOS Apps cs.SE · 2026-06-10 · unverdicted · none · ref 25
Empirical analysis of 444 iOS apps using dynamic traffic interception found 282 leaking LLM API keys across ten providers, with only 28% remediation after three months.
AgentFlow: Building Agent Dependency Graphs for Static Analysis of Agent Programs cs.SE · 2026-07-02 · unverdicted · none · ref 48 · 2 links
AgentFlow builds a framework-agnostic Agent Dependency Graph from agent program source code to support static analyses such as BOM generation and prompt-to-tool risk detection, evaluated on 5,399 real programs across five frameworks.
Falsification, Not Exposure: An Internally Preregistered Placebo-Controlled Decomposition of Self-Repair Feedback in Frozen Small Code Models cs.SE · 2026-06-30 · unverdicted · none · ref 22
Preregistered placebo-controlled decomposition shows external executable counterexamples drive self-repair gains in small code models more than re-exposure or self-critique.
Kops: Safely Extending the eBPF Compilation Pipeline with Native Operations cs.OS · 2026-06-23 · unverdicted · partial · ref 37
Kops enables extension of the eBPF JIT with native operations using proof sequences checked by the existing verifier and native emits, validated by Lean 4 proofs, delivering up to 24% microbenchmark and 12% application speedups.
The Alignment Problem in Constrained Code Generation cs.SE · 2026-06-19 · unverdicted · none · ref 27
Incomplete constrainers in constrained decoding push LLMs into low-probability program regions, making unconstrained decoding outperform constrained decoding on functional correctness across seven models and three benchmarks.
KBSpec: LLM-driven Formal Specification Generation with Evolving Domain Knowledge Base cs.SE · 2026-06-19 · unverdicted · none · ref 16
KBSpec maintains an evolving knowledge base combining external docs and internal verifier feedback to improve LLM generation of verifiable JML specifications, achieving 10-25% higher verification pass rates.
Scaffold, Not Vocabulary? A Controlled, Two-Tier, Pre-Registered Study of a Popperian Code-Generation Skill cs.SE · 2026-06-04 · conditional · none · ref 10
Controlled ablation finds Popperian code-generation skill adds no separable correctness benefit over labels-only scaffold; gains track structure not content.
EvoRepair: Enhancing Vulnerability Repair Agents Through Experience-Based Self-Evolution cs.SE · 2026-05-28 · unverdicted · none · ref 63
EvoRepair is the first experience-based self-evolving agent framework for automated vulnerability repair, reporting 90.46% overall success on PATCHEVAL and SEC-bench benchmarks.
SCARA: A Semantics-Constrained Autonomous Remediation Agent for Opaque Industrial Software Vulnerabilities cs.CR · 2026-05-19 · unverdicted · none · ref 22
SCARA introduces a four-stage pipeline using state-aware verification and constrained synthesis to remediate vulnerabilities in source-unavailable industrial software, reporting 100% precision and 88.9% success on a 15-case benchmark.
From Runnable to Shippable: Multi-Agent Test-Driven Development for Generating Full-Stack Web Applications from Requirements cs.SE · 2026-05-17 · unverdicted · none · ref 11
TDDev automates the full TDD loop for web app generation from requirements, delivering 34-48 percentage point quality gains and zero manual intervention in user studies.
Stress-Testing Neural Network Verifiers with Provably Robust Instances cs.LG · 2026-05-16 · conditional · none · ref 16
A reusable framework generates verification instances with provably known robustness labels, revealing numeric tolerance issues and bugs in five verifiers while introducing difficulty profiles to diagnose failure modes.
Hydra: Efficient, Correct Code Generation via Checkpoint-and-Rollback Support cs.SE · 2026-05-14 · unverdicted · none · ref 39
Hydra enables asynchronous static error checking and targeted checkpoint-rollback repair during LLM code generation, cutting latency by up to 71% and token use by up to 70% versus post-hoc repair on C/C++ tasks.
MASPrism: Lightweight Failure Attribution for Multi-Agent Systems Using Prefill-Stage Signals cs.SE · 2026-05-08 · unverdicted · none · ref 5 · 2 links
MASPrism attributes failures in multi-agent systems by ranking candidates from prefill-stage NLL and attention signals of a 0.6B SLM, beating baselines by up to 33.41% Top-1 accuracy and proprietary LLMs by up to 89.5% relative improvement while processing traces in 2.66 seconds.
Single-Language Evidence Is Insufficient for Automated Logging: A Multilingual Benchmark and Empirical Study with LLMs cs.SE · 2026-04-19 · unverdicted · none · ref 36
MultiLogBench shows that LLM performance on automated logging varies substantially across programming languages, demonstrating that single-language evidence is insufficient for general claims about model behavior or tool design.
Certified Program Synthesis with a Multi-Modal Verifier cs.SE · 2026-04-17 · unverdicted · none · ref 47
LeetProof achieves higher rates of fully certified program synthesis from natural language by using a multi-modal verifier in Lean to validate specifications via randomized testing and delegate proofs to AI tools, outperforming single-mode baselines on benchmarks while uncovering defects in prior references.
MR-Coupler: Automated Metamorphic Test Generation via Functional Coupling Analysis cs.SE · 2026-04-11 · conditional · none · ref 14 · 2 links
MR-Coupler leverages functional coupling analysis and LLMs to generate valid metamorphic test cases for over 90% of tasks while detecting 44% of real bugs, outperforming baselines by 64.90% in validity and 36.56% in false-alarm reduction.
Measuring and Exploiting Contextual Bias in LLM-Assisted Security Code Review cs.SE · 2026-03-19 · accept · none · ref 34
LLM-based security code review is vulnerable to framing bias, with a novel iterative refinement attack achieving 100% success in reintroducing vulnerabilities across real projects.
Benchmarking Quantum Software Testing with Scalable Quantum Programs cs.SE · 2026-07-02 · unverdicted · none · ref 11 · 2 links
Qolumbina curates 40 quantum programs into a benchmark with QST-oriented criteria for functionality, output behavior, and complexity to support scalable empirical studies of quantum software testing approaches.
Underspecification does not imply Incoherence: The Risks of Semantic Collapse in Coding Models cs.SE · 2026-07-02 · unverdicted · none · ref 12
Coding LLMs exhibit detrimental semantic collapse on underspecified prompts by producing consistent but incorrect code rather than incoherent variations, affecting 3-32% of tasks across MBPP, HumanEval, and LiveCodeBench.
Uncovering Similar but Different Packages in PyPI and Potential Security Threats cs.SE · 2026-06-29 · unverdicted · none · ref 5
Large-scale analysis of 200K PyPI packages identifies 1,361 replicated popular packages, 256 replicated vulnerable packages, and 7 new replicated malicious packages, showing replication as a security threat vector.
Agentic Persona Generation with Critique-Refinement: An Industrial Evaluation cs.SE · 2026-06-08 · unverdicted · none · ref 25
PerGent, an agentic critique-refinement system for persona generation, reaches 96.9% expert approval in an industrial evaluation at Kinaxis and reproduces more pre-LLM expert content than single-shot baselines.
Code Is More Than Text: Uncertainty Estimation for Code Generation cs.CL · 2026-06-08 · unverdicted · none · ref 16
Three code-specific uncertainty axes (lexical, algorithmic, functional) yield an ensemble that raises average AUROC from 0.696 to 0.776 across five code LLMs, with one single-pass signal matching multi-pass baselines at lower cost.
Are We Lost in the Woods? Detecting Silent Semantic Faults for Random Forest Classifiers with Data-informed Static Analysis cs.SE · 2026-06-05 · unverdicted · none · ref 25
dille detects silent semantic faults in random forest ML pipelines with 91% precision via data-informed static analysis on Kaggle notebooks, finding 12-18% of scripts affected.
Provably Secure Agent Guardrail cs.AI · 2026-05-28 · unverdicted · none · ref 61
Introduces ePCA framework using neural-symbolic isolation to force agents to formalize intentions as logical constraints, claiming zero attack success and false positive rates in tested scenarios.
Converted, Not Equivalent: Benchmarking Codebase Conversion via Observational Equivalence cs.SE · 2026-05-27 · unverdicted · none · ref 4
T2J-Bench shows top coding agents achieve only 26.7-28.9% pass rate on codebase conversion under a three-stage observational equivalence check, with agents overestimating success by 66.6-97.8 points.
QUTest: A Native Testing Framework for Quantum Programs quant-ph · 2026-05-19 · unverdicted · none · ref 10
QUTest is a native OpenQASM testing framework that encodes Arrange/Act/Assert tests and 12 assertion types via pragma comments while remaining compatible with existing tools.
Debugging the Debuggers: Failure-Anchored Structured Recovery for Software Engineering Agents cs.SE · 2026-05-09 · unverdicted · none · ref 6 · 2 links
PROBE turns runtime telemetry from failed software engineering agent runs into evidence-grounded diagnoses and actionable recovery guidance, achieving 65.37% diagnosis accuracy and 21.79% recovery rate on 257 cases.
Characterizing and Mitigating False-Positive Bug Reports in the Linux Kernel cs.SE · 2026-05-08 · conditional · none · ref 60 · 2 links
False-positive bug reports in the Linux kernel consume effort comparable to real bugs and can be filtered by LLMs using retrieval-augmented generation at 88% F1.
Hallucination Inspector: A Fact-Checking Judge for API Migration cs.SE · 2026-04-22 · unverdicted · none · ref 9
Hallucination Inspector verifies symbols in LLM-generated API migration code against a documentation-derived knowledge base using AST extraction, identifying scaffolding hallucinations and cutting false positives versus standard metrics in preliminary Android tests.
TypeScript Repository Indexing for Code Agent Retrieval cs.SE · 2026-04-20 · unverdicted · none · ref 6 · 2 links
abcoder-ts-parser builds reliable function-level code indexes for large TypeScript repositories significantly faster by using the compiler's native AST and semantic resolution instead of per-symbol language server calls.
SelfHeal: Empirical Fix Pattern Analysis and Bug Repair in LLM Agents cs.SE · 2026-04-20 · unverdicted · none · ref 74
SelfHeal uses two ReAct agents and empirical fix patterns to repair bugs in LLM agents, outperforming baselines on a new 37-instance benchmark.
Structured Safety Auditing for Balancing Code Correctness and Content Safety in LLM-Generated Code cs.SE · 2026-04-13 · unverdicted · none · ref 29
Dual Reasoning with explicit safety audits improves the new SUDS metric by 1.32x to 3.42x over baselines on code generation benchmarks containing injected harmful keywords.
EditFlow: Benchmarking and Optimizing Code Edit Recommendation Systems via Reconstruction of Developer Flows cs.SE · 2026-02-25 · unverdicted · none · ref 38
EditFlow reconstructs temporal developer editing flows from code changes to benchmark and optimize AI code edit recommenders so they align with natural incremental reasoning rather than static snapshots.
Bash-Commenter: Leveraging Syntax-Aware Preference Optimization to Reinforce Large Language Model for Bash Code Comment Generation cs.SE · 2026-06-29 · unverdicted · none · ref 65
Bash-Commenter applies CPT, SFT, and Syntax-Aware Preference Optimization (SAPO) via AST atomic operations to LLaMA-3.1-8B, reporting higher BLEU-4/METEOR/ROUGE-L scores than baselines on single-line and multi-line Bash comment generation tasks.
From Assistance to Agency: Rethinking Autonomy and Control in CI/CD Pipelines cs.SE · 2026-05-08 · unverdicted · none · ref 39
The central challenge in AI-augmented CI/CD is designing authority transfer from humans to agents under constraints, as current systems remain limited to bounded data-plane autonomy backed by external governance.
Search-Based Multi-Trajectory Refinement for Safe C-to-Rust Translation with Large Language Models cs.PL · 2025-05-21 · unverdicted · none · ref 2
LAC2R uses MCTS to systematically explore multiple LLM refinement trajectories for C-to-Rust translation and reports superior safety and correctness on small-scale benchmarks.
MR-SLAM: Immersive Spatial Supervision for Multi-Robot Mapping via Mixed Reality cs.RO · 2026-05-14 · unverdicted · none · ref 51
MR-SLAM combines passthrough mixed reality with multi-robot SLAM on ROS 2 to let one operator supervise mapping in situ, reporting 8.83 Hz scans, 17.9 m² coverage, and 94.7% occupancy consistency in simulated sessions.
Human-Machine Co-Boosted Bug Report Identification with Mutualistic Neural Active Learning cs.SE · 2026-04-20 · unverdicted · none · ref 98
MNAL reduces human effort in bug report labeling by up to 95.8% for readability and 196% for identifiability while improving identification performance and working with various neural models.
Leveraging LLM-Based Agentic Systems to Generate Quantum Applications for Test Optimization cs.SE · 2026-07-01 · unverdicted · none · ref 14
QPipe deploys specialized LLM agents for parsing, formulation, code generation, review, execution and verification to produce quantum applications from 20 natural-language test-optimization requirements, reporting 100% compilation and 96.7% execution success with solutions that beat a genetic-algori
Software Engineering for Self-Adaptive Robotics: A Research Agenda cs.SE · 2025-05-26 · unverdicted · none · ref 84
This paper proposes a research agenda for software engineering of self-adaptive robotic systems along lifecycle stages and enabling technologies, identifying challenges and a roadmap to 2030.
Energy-Aware Computing in the Year 2026 cs.DC · 2026-05-23 · unverdicted · none · ref 64
The paper reviews energy-aware computing literature and constructs a taxonomy organized by hardware/software aspects, measurement, optimizations, scheduling, scaling, consolidation, federated learning, and cooling.
Vibe-driven model-based engineering cs.SE · 2026-04-12 · unreviewed · ref 1
Understanding Bugs in Modern Agentic Frameworks: A Study of Symptoms, Root Causes, and Triggering Conditions cs.SE · 2026-04-10 · unreviewed · ref 41 · 2 links
How Your Credentials Are Leaked by LLM Agent Skills: An Empirical Study cs.CR · 2026-04-03 · unreviewed · ref 28 · 3 links
A Large-Scale Comprehensive Measurement of AI-Generated Code in Real-World Repositories cs.SE · 2026-03-28 · unreviewed · ref 4
