Security Orchestration, Automation and Response (SOAR): A Comprehensive Guide. Palo Alto Networks Technical Report, 2020.
1 Pith paper cites this work. Polarity classification is still indexing.
fields: cs.CR (1)
years: 2026 (1)
verdicts: UNVERDICTED (1)
representative citing papers:
SIR-Bench: Evaluating Investigation Depth in Security Incident Response Agents
SIR-Bench supplies 794 test cases replayed from anonymized real incidents via the OUAT framework and uses an adversarial LLM judge to score agents on triage accuracy, novel evidence discovery, and tool use. As a baseline, it reports 97.1% true positive detection and 5.67 novel findings per case.