Nissist: An incident mitigation copilot based on troubleshooting guides
3 Pith papers cite this work.
citing papers explorer
- StepFly: Agentic Troubleshooting Guide Automation for Incident Diagnosis
  StepFly automates TSG execution via TSG Mentor, LLM-based DAG extraction with QPPs, and a DAG-guided parallel scheduler, reaching 94% success on GPT-4.1 with 32.9-70.4% time savings on parallelizable guides.
- TSGuard: Automated User-Centric Incident Diagnosis for AI Workloads in the Cloud
  TSGuard builds domain knowledge bases offline from historical incidents and applies online multi-agent structured reasoning to diagnose AI workload failures, delivering 19.8% higher accuracy and 63.4% lower verification time than baselines on Azure production data.
- An End-to-End Framework for Building Large Language Models for Software Operations
  OpsLLM is a domain-specific LLM for software ops QA and RCA built with human-curated data, SFT, and RL using a domain process reward model, showing accuracy gains of 0.2-5.7% on QA and 2.7-70.3% on RCA over general LLMs.