DiagnosticIQ benchmark shows frontier LLMs perform similarly on standard rule-to-action tasks but lose substantial accuracy under distractor expansion and condition inversion, pointing to calibration as the key deployment issue.
Time-mqa: Time series multi-task question answering with context enhancement
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.AI 1years
2026 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
DiagnosticIQ: A Benchmark for LLM-Based Industrial Maintenance Action Recommendation from Symbolic Rules
DiagnosticIQ benchmark shows frontier LLMs perform similarly on standard rule-to-action tasks but lose substantial accuracy under distractor expansion and condition inversion, pointing to calibration as the key deployment issue.