Bridging research and practice in simulation-based testing of industrial robot navigation systems

doi:10 · 2025 · arXiv 3991.2025

Canonical reference. 100% of citing Pith papers cite this work as background.

39 Pith papers citing it

Background 100% of classified citations

citation-role summary

background 15

citation-polarity summary

background 15

representative citing papers

Mind your key: An Empirical Study of LLM API Credential Leakage in iOS Apps

cs.SE · 2026-06-10 · unverdicted · novelty 8.0

Empirical analysis of 444 iOS apps using dynamic traffic interception found 282 leaking LLM API keys across ten providers, with only 28% remediation after three months.

Kops: Safely Extending the eBPF Compilation Pipeline with Native Operations

cs.OS · 2026-06-23 · unverdicted · novelty 7.0

Kops enables extension of the eBPF JIT with native operations using proof sequences checked by the existing verifier and native emits, validated by Lean 4 proofs, delivering up to 24% microbenchmark and 12% application speedups.

The Alignment Problem in Constrained Code Generation

cs.SE · 2026-06-19 · unverdicted · novelty 7.0

Incomplete constrainers in constrained decoding push LLMs into low-probability program regions, making unconstrained decoding outperform constrained decoding on functional correctness across seven models and three benchmarks.

KBSpec: LLM-driven Formal Specification Generation with Evolving Domain Knowledge Base

cs.SE · 2026-06-19 · unverdicted · novelty 7.0

KBSpec maintains an evolving knowledge base combining external docs and internal verifier feedback to improve LLM generation of verifiable JML specifications, achieving 10-25% higher verification pass rates.

Scaffold, Not Vocabulary? A Controlled, Two-Tier, Pre-Registered Study of a Popperian Code-Generation Skill

cs.SE · 2026-06-04 · conditional · novelty 7.0

Controlled ablation finds Popperian code-generation skill adds no separable correctness benefit over labels-only scaffold; gains track structure not content.

EvoRepair: Enhancing Vulnerability Repair Agents Through Experience-Based Self-Evolution

cs.SE · 2026-05-28 · unverdicted · novelty 7.0

EvoRepair is the first experience-based self-evolving agent framework for automated vulnerability repair, reporting 90.46% overall success on PATCHEVAL and SEC-bench benchmarks.

SCARA: A Semantics-Constrained Autonomous Remediation Agent for Opaque Industrial Software Vulnerabilities

cs.CR · 2026-05-19 · unverdicted · novelty 7.0

SCARA introduces a four-stage pipeline using state-aware verification and constrained synthesis to remediate vulnerabilities in source-unavailable industrial software, reporting 100% precision and 88.9% success on a 15-case benchmark.

From Runnable to Shippable: Multi-Agent Test-Driven Development for Generating Full-Stack Web Applications from Requirements

cs.SE · 2026-05-17 · unverdicted · novelty 7.0

TDDev automates the full TDD loop for web app generation from requirements, delivering 34-48 percentage point quality gains and zero manual intervention in user studies.

Stress-Testing Neural Network Verifiers with Provably Robust Instances

cs.LG · 2026-05-16 · conditional · novelty 7.0

A reusable framework generates verification instances with provably known robustness labels, revealing numeric tolerance issues and bugs in five verifiers while introducing difficulty profiles to diagnose failure modes.

Hydra: Efficient, Correct Code Generation via Checkpoint-and-Rollback Support

cs.SE · 2026-05-14 · unverdicted · novelty 7.0

Hydra enables asynchronous static error checking and targeted checkpoint-rollback repair during LLM code generation, cutting latency by up to 71% and token use by up to 70% versus post-hoc repair on C/C++ tasks.

MASPrism: Lightweight Failure Attribution for Multi-Agent Systems Using Prefill-Stage Signals

cs.SE · 2026-05-08 · unverdicted · novelty 7.0 · 2 refs

MASPrism attributes failures in multi-agent systems by ranking candidates from prefill-stage NLL and attention signals of a 0.6B SLM, beating baselines by up to 33.41% Top-1 accuracy and proprietary LLMs by up to 89.5% relative improvement while processing traces in 2.66 seconds.

Single-Language Evidence Is Insufficient for Automated Logging: A Multilingual Benchmark and Empirical Study with LLMs

cs.SE · 2026-04-19 · unverdicted · novelty 7.0

MultiLogBench shows that LLM performance on automated logging varies substantially across programming languages, demonstrating that single-language evidence is insufficient for general claims about model behavior or tool design.

Certified Program Synthesis with a Multi-Modal Verifier

cs.SE · 2026-04-17 · unverdicted · novelty 7.0

LeetProof achieves higher rates of fully certified program synthesis from natural language by using a multi-modal verifier in Lean to validate specifications via randomized testing and delegate proofs to AI tools, outperforming single-mode baselines on benchmarks while uncovering defects in prior references.

MR-Coupler: Automated Metamorphic Test Generation via Functional Coupling Analysis

cs.SE · 2026-04-11 · conditional · novelty 7.0 · 2 refs

MR-Coupler leverages functional coupling analysis and LLMs to generate valid metamorphic test cases for over 90% of tasks while detecting 44% of real bugs, outperforming baselines by 64.90% in validity and 36.56% in false-alarm reduction.

A Large-Scale Empirical Study of AI-Generated Code in Real-World Repositories

cs.SE · 2026-03-28 · unverdicted · novelty 7.0

A large-scale study of real-world repositories finds that AI-generated code differs from human-written code in complexity, structural traits, defect indicators, and commit-level activity patterns.

Measuring and Exploiting Contextual Bias in LLM-Assisted Security Code Review

cs.SE · 2026-03-19 · accept · novelty 7.0

LLM-based security code review is vulnerable to framing bias, with a novel iterative refinement attack achieving 100% success in reintroducing vulnerabilities across real projects.

cs.SE · 2026-06-29 · unverdicted · novelty 6.0

Large-scale analysis of 200K PyPI packages identifies 1,361 replicated popular packages, 256 replicated vulnerable packages, and 7 new replicated malicious packages, showing replication as a security threat vector.

Agentic Persona Generation with Critique-Refinement: An Industrial Evaluation

cs.SE · 2026-06-08 · unverdicted · novelty 6.0

PerGent, an agentic critique-refinement system for persona generation, reaches 96.9% expert approval in an industrial evaluation at Kinaxis and reproduces more pre-LLM expert content than single-shot baselines.

Code Is More Than Text: Uncertainty Estimation for Code Generation

cs.CL · 2026-06-08 · unverdicted · novelty 6.0

Three code-specific uncertainty axes (lexical, algorithmic, functional) yield an ensemble that raises average AUROC from 0.696 to 0.776 across five code LLMs, with one single-pass signal matching multi-pass baselines at lower cost.

Are We Lost in the Woods? Detecting Silent Semantic Faults for Random Forest Classifiers with Data-informed Static Analysis

cs.SE · 2026-06-05 · unverdicted · novelty 6.0

dille detects silent semantic faults in random forest ML pipelines with 91% precision via data-informed static analysis on Kaggle notebooks, finding 12-18% of scripts affected.

Provably Secure Agent Guardrail

cs.AI · 2026-05-28 · unverdicted · novelty 6.0

Introduces ePCA framework using neural-symbolic isolation to force agents to formalize intentions as logical constraints, claiming zero attack success and false positive rates in tested scenarios.

Converted, Not Equivalent: Benchmarking Codebase Conversion via Observational Equivalence

cs.SE · 2026-05-27 · unverdicted · novelty 6.0

T2J-Bench shows top coding agents achieve only 26.7-28.9% pass rate on codebase conversion under a three-stage observational equivalence check, with agents overestimating success by 66.6-97.8 points.

QUTest: A Native Testing Framework for Quantum Programs

quant-ph · 2026-05-19 · unverdicted · novelty 6.0

QUTest is a native OpenQASM testing framework that encodes Arrange/Act/Assert tests and 12 assertion types via pragma comments while remaining compatible with existing tools.

Characterizing and Mitigating False-Positive Bug Reports in the Linux Kernel

cs.SE · 2026-05-08 · conditional · novelty 6.0 · 2 refs

False-positive bug reports in the Linux kernel consume effort comparable to real bugs and can be filtered by LLMs using retrieval-augmented generation at 88% F1.
