(eds.) Information Security and Cryptology – ICISC 2023, vol
2 Pith papers cite this work. Polarity classification is still indexing.
Citing papers by year: 2026 (2). Verdicts: unverdicted (2).
citing papers explorer
- Are Performance-Optimization Benchmarks Reliably Measuring Coding Agents?
Audit of GSO, SWE-Perf and SWE-fficiency reveals that reference patches satisfy validity rules across machines for only 39/102, 11/140 and 411/498 tasks respectively, public submissions beat references on 85.3% of replay-valid tasks, and scoring rules cause ranking disagreements.
- Bridging Theory and Practice: An Executable Taxonomy of Security Properties for ProVerif and Tamarin
Evidence-based taxonomy of security properties with first-order logic definitions and ProVerif/Tamarin executable examples derived from a 2022-2025 literature review of 53 studies.