ExCyTIn-Bench: Evaluating LLM agents on Cyber Threat Investigation
Pith reviewed 2026-05-19 04:34 UTC · model grok-4.3
The pith
A new benchmark tests LLM agents on cyber threat investigation using questions from security log graphs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ExCyTIn-Bench is built by applying expert-crafted detection logic to security logs from Microsoft Sentinel and related services to form threat investigation graphs, then using LLMs to generate questions from node pairs on those graphs. Each question anchors the start node as background context and the end node as the verifiable answer, producing automatic and explainable ground truth while supporting 7542 questions across 57 log tables in a reusable pipeline.
What carries the argument
Threat investigation graphs formed from paired nodes in extracted security logs, where expert detection logic supplies the edges and LLM generation converts node pairs into questions with explicit start-context and end-answer structure.
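The node-pair construction can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy graph, the node names, and the `QuestionSeed` record are hypothetical and are not taken from the paper's code. It pairs each alert with every other reachable alert and stores the start node (context), the end node (answer), and the connecting path. The LLM step that turns a seed into a natural-language question is not reproduced.

```python
from collections import deque
from dataclasses import dataclass

# Hypothetical toy graph: alert nodes are linked through the entities they share.
GRAPH = {
    "alert:lsass_access": ["entity:host_a"],
    "entity:host_a": ["alert:lsass_access", "alert:mimikatz_detected"],
    "alert:mimikatz_detected": ["entity:host_a", "entity:account_x"],
    "entity:account_x": ["alert:mimikatz_detected", "alert:prt_access"],
    "alert:prt_access": ["entity:account_x"],
}


@dataclass(frozen=True)
class QuestionSeed:
    start: str   # supplies the background context of the question
    end: str     # supplies the ground-truth answer
    path: tuple  # nodes connecting start to end, kept for explainability


def shortest_path(graph, src, dst):
    """Breadth-first search; returns the node tuple from src to dst, or None."""
    queue = deque([(src,)])
    seen = {src}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == dst:
            return path
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + (nxt,))
    return None


def question_seeds(graph):
    """Pair every alert with every other reachable alert.

    Same-alert pairs are skipped here for brevity (a simplification).
    """
    alerts = sorted(n for n in graph if n.startswith("alert:"))
    seeds = []
    for start in alerts:
        for end in alerts:
            if start == end:
                continue
            path = shortest_path(graph, start, end)
            if path is not None:
                seeds.append(QuestionSeed(start, end, path))
    return seeds


SEEDS = question_seeds(GRAPH)
for seed in SEEDS[:2]:
    print(seed.start, "->", seed.end, "hops:", len(seed.path) - 1)
```

Keeping the path alongside each pair is what makes the ground truth explainable in this sketch: the answer can be traced back through explicit nodes and edges.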
If this is right
- The graph-based construction makes the benchmark reusable and readily extensible to new security logs or environments.
- Current LLM agents still face substantial difficulty on these multi-hop tasks, as shown by the highest reward of 0.606.
- Automatic ground truth from explicit node pairs enables scalable, explainable evaluation without manual answer curation.
- Improved performance on the benchmark would support development of agents that can handle heterogeneous logs in threat investigations.
Where Pith is reading between the lines
- The node-pair question generation method could transfer to other domains that require chaining evidence across structured data records.
- Practical use would need additional validation against noisier, less controlled log sources that real organizations encounter.
- Strong results here could shorten the initial evidence-gathering phase for analysts and free time for higher-level decisions.
Load-bearing premise
Expert-crafted detection logic plus LLM-generated questions from graph node pairs accurately capture the difficulty and structure of real-world multi-hop security log analysis by human analysts.
What would settle it
Run the same set of questions past a panel of experienced human security analysts. If their accuracy or reasoning patterns diverge sharply from what the benchmark assumes, the questions do not reflect actual investigation difficulty.
Original abstract
We present ExCyTIn-Bench, the first benchmark to Evaluate an LLM agent X on the task of Cyber Threat Investigation through security questions derived from investigation graphs. Real-world security analysts must sift through a large number of heterogeneous security logs, follow multi-hop chains of evidence to investigate threats. With the developments of LLMs, building LLM-based agents for automatic threat investigation is a promising direction. We construct a benchmark from a controlled Azure tenant including a SQL environment covering 57 log tables from Microsoft Sentinel and related services, and 7542 generated questions. We leverage security logs extracted with expert-crafted detection logic to build threat investigation graphs, and then generate questions with LLMs using paired nodes on the graph, taking the start node as background context and the end node as answer. Anchoring each question to these explicit nodes and edges not only provides automatic, explainable ground truth answers but also makes the pipeline reusable and readily extensible to new logs. Our comprehensive experiments on the test set with different models confirm the difficulty of the task: the best model so far can achieve a reward of 0.606, leaving much headroom for future research. The code is available at https://github.com/microsoft/SecRL
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents ExCyTIn-Bench, the first benchmark for evaluating LLM agents on cyber threat investigation. It builds investigation graphs from expert-crafted detection logic applied to 57 Azure/Sentinel log tables in a controlled tenant, then uses LLMs to generate 7542 questions by pairing start and end nodes on the graphs (start node supplies background context; end node supplies the answer). Experiments on the test set show the best model achieves a reward of 0.606, with the authors concluding substantial headroom remains for future work. The pipeline is positioned as reusable and extensible, with code released at https://github.com/microsoft/SecRL.
Significance. If the benchmark construction faithfully reproduces the difficulty of real multi-hop security log analysis, the work supplies a valuable, grounded resource for the security and AI communities. Strengths include the use of external Azure logs and expert detection rules (reducing circularity), automatic explainable ground truth via node pairs, and an open code release that supports reproducibility and extension to new logs. The reported performance gap could usefully guide development of LLM agents for practical threat investigation tasks.
major comments (2)
- [abstract and methods (graph construction and question generation)] Construction pipeline (abstract and methods section on graph-to-question generation): The central claim that node-pair questions measure genuine threat investigation capability rests on the unverified assumption that LLM-generated questions do not embed structural cues or reduced ambiguity absent from real analyst queries. No controls, human validation, or artifact analysis are described to test whether the generator supplies explicit relational structure that analysts must discover from raw logs.
- [experiments and results] Experiments and results (section reporting the 0.606 reward): The headline performance number and headroom conclusion are load-bearing for the paper's contribution, yet the manuscript provides no details on reward calculation, baseline selection, error analysis, or ablations that would rule out question-generation artifacts as an explanation for the observed scores.
minor comments (3)
- [abstract] Abstract: the phrasing 'Evaluate an LLM agent X' appears to contain a placeholder that should be clarified.
- [methods] Methods: more explicit description of the 57 log tables, the precise expert detection rules, and the LLM prompts used for question generation would improve reproducibility.
- [abstract and conclusion] The paper states code is available but does not enumerate what artifacts (graph construction scripts, prompts, evaluation harness) are included in the repository.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment below and indicate the revisions we will make to strengthen the manuscript.
Referee: [abstract and methods (graph construction and question generation)] Construction pipeline (abstract and methods section on graph-to-question generation): The central claim that node-pair questions measure genuine threat investigation capability rests on the unverified assumption that LLM-generated questions do not embed structural cues or reduced ambiguity absent from real analyst queries. No controls, human validation, or artifact analysis are described to test whether the generator supplies explicit relational structure that analysts must discover from raw logs.
Authors: We agree that the manuscript would benefit from explicit validation of the generated questions. While the node-pair construction supplies automatic, explainable ground truth anchored to expert detection logic, we did not include controls or human evaluation in the original submission. In the revision we will add a dedicated subsection describing a human validation study in which security analysts rate a sample of questions for realism, ambiguity, and presence of unintended structural cues. We will also report quantitative artifact analysis comparing relational explicitness in generated questions versus a small set of publicly documented analyst queries. These additions will directly address the concern. revision: yes
Referee: [experiments and results] Experiments and results (section reporting the 0.606 reward): The headline performance number and headroom conclusion are load-bearing for the paper's contribution, yet the manuscript provides no details on reward calculation, baseline selection, error analysis, or ablations that would rule out question-generation artifacts as an explanation for the observed scores.
Authors: We accept that the current experimental reporting is insufficiently detailed. The manuscript summarizes model performance but omits the precise reward formulation, full baseline specifications, error categorization, and ablations. In the revised version we will expand the experiments section to include the exact reward metric definition and computation, descriptions of all baselines and their prompting setups, a qualitative error analysis of representative failures, and ablation results on question-generation parameters (e.g., prompt variations and graph depth). These additions will better substantiate the reported 0.606 score and the conclusion that substantial headroom remains. revision: yes
Circularity Check
No significant circularity in benchmark construction
The paper constructs ExCyTIn-Bench from external Azure/Sentinel logs processed via expert-crafted detection logic into investigation graphs, then generates questions by prompting an LLM to use start-node context and end-node answers for automatic ground truth. This yields 7542 questions whose structure and difficulty are defined independently of any model performance numbers or fitted parameters. The reported best-model reward of 0.606 is a direct empirical measurement on the resulting test set and does not reduce to the construction pipeline by definition or self-reference. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked; the derivation remains self-contained against the external logs and rules.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Expert-crafted detection logic accurately captures threat patterns and multi-hop evidence chains in the 57 Microsoft Sentinel log tables.
- ad hoc to paper: Questions generated by LLMs from paired start/end nodes on the graphs yield automatic, explainable ground truth that measures genuine threat investigation capability.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, theorem reality_from_one_distinction
  Tag: unclear (relation between the paper passage and the cited Recognition theorem).
  Passage: We leverage security logs extracted with expert-crafted detection logic to build threat investigation graphs, and then generate questions with LLMs using paired nodes on the graph, taking the start node as background context and the end node as answer.
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel
  Tag: unclear (relation between the paper passage and the cited Recognition theorem).
  Passage: the best model so far can achieve a reward of 0.606
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 4 Pith papers
- Cyber Defense Benchmark: Agentic Threat Hunting Evaluation for LLMs in SecOps
  A new benchmark shows frontier LLMs achieve only 3.8% average recall identifying malicious events from raw logs and fail to meet 50% recall thresholds on most tactics.
- GenAI-Driven Threat Detection with Microsoft Security Copilot
  DTDA is an LLM-powered autonomous agent that investigates Microsoft Defender incidents via planner-executor loops and generates novel alerts, achieving 80.1% precision in 120-day production use and 0.78 F1 offline.
- GenAI-Driven Threat Detection with Microsoft Security Copilot
  DTDA is an LLM agent that produces novel security alerts at 80.1% customer-validated precision and 0.78 F1 on hidden activity while running at production scale inside Microsoft Defender.
- Pen-Strategist: A Reasoning Framework for Penetration Testing Strategy Formation and Analysis
  Pen-Strategist fine-tunes Qwen-3-14B with RL on a pentesting reasoning dataset and pairs it with a CNN step classifier, reporting 87% better strategy derivation, 47.5% more subtask completions than baselines, and gain...
A Limitations and Broader Impacts
Limitations. While ExCyTIn-Bench represents a significant step toward evaluating LLM agents on real...
Identification of PII Columns. Each table is scanned column by column. For every column we draw a random sample of five values and prompt an LLM to decide whether the column contains PII. Columns provisionally flagged in this first pass are examined once more with three focused prompts:
- Confirm whether the column indeed holds PII.
- Decide whether the column stores a dictionary/JSON structure.
- If it does, enumerate which keys inside the structure contain PII.
The union of both LLM passes is then reviewed by domain experts, yielding a curated list of PII-bearing columns that serves as ground truth for the remainder of the pipeline.
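The first screening pass described above might be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: `toy_classifier` is a hypothetical regex stand-in for the LLM prompt, and the function names and sample table are invented. Only the sample size of five comes from the text. The second pass with the three focused prompts and the expert review are not modelled.

```python
import random
import re

SAMPLE_SIZE = 5  # the text states five sampled values per column


def sample_values(values, k=SAMPLE_SIZE, seed=0):
    """Draw up to k random values from a column (seeded for repeatability)."""
    rng = random.Random(seed)
    values = list(values)
    return rng.sample(values, min(k, len(values)))


def toy_classifier(column_name, samples):
    """Hypothetical stand-in for the LLM call: flags e-mail-like or IPv4-like values."""
    pattern = re.compile(
        r"(^[^@\s]+@[^@\s]+\.[^@\s]+$)|(^\d{1,3}(\.\d{1,3}){3}$)"
    )
    return any(pattern.match(str(v)) for v in samples)


def flag_pii_columns(table, classifier=toy_classifier):
    """First pass only: return the columns provisionally flagged as PII."""
    return [
        name
        for name, values in table.items()
        if classifier(name, sample_values(values))
    ]


# Invented example table, keyed by column name.
TABLE = {
    "UserPrincipalName": ["alice@example.com", "bob@example.com"],
    "IPAddress": ["10.0.0.1", "10.0.0.2", "10.0.0.3"],
    "EventCount": [1, 2, 3, 4, 5, 6, 7],
}
FLAGGED = flag_pii_columns(TABLE)
print(FLAGGED)
```

Passing the classifier as a parameter keeps the sampling logic separate from whichever model or prompt makes the decision.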
Creation of PII Value Mappings. For every confirmed PII column we gather its set of unique values. If the column encodes a dictionary, only the keys identified in the previous stage are considered.
- Regex-based substitution. We manually go through the tables to recognize common PII patterns, and each candidate value is matched against them (IPv4/IPv6 add...
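A regex-driven value mapping of this kind could look like the sketch below. It is a hedged illustration that covers IPv4 only. The pattern, the function names, and the choice of surrogates from 192.0.2.0/24 (the RFC 5737 documentation range) are assumptions; the source text is truncated before it specifies its patterns or surrogate scheme.

```python
import ipaddress
import re

# Assumed pattern: a loose IPv4 matcher that does not validate octet ranges.
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def build_ip_mapping(unique_values):
    """Give each distinct IPv4 string exactly one surrogate address.

    Surrogates start at 192.0.2.1 (an assumption); a real pipeline would need
    a larger surrogate pool than this toy range provides.
    """
    mapping = {}
    base = ipaddress.IPv4Address("192.0.2.1")
    for value in sorted(unique_values):  # sorted for a deterministic mapping
        for ip in IPV4_RE.findall(value):
            if ip not in mapping:
                mapping[ip] = str(base + len(mapping))
    return mapping


def anonymize(text, mapping):
    """Replace every mapped IPv4 address in text with its surrogate."""
    return IPV4_RE.sub(lambda m: mapping.get(m.group(0), m.group(0)), text)


# Invented sample values from a hypothetical PII column.
VALUES = {"10.1.2.3", "src=10.1.2.3 dst=172.16.0.9"}
MAPPING = build_ip_mapping(VALUES)
print(anonymize("src=10.1.2.3 dst=172.16.0.9", MAPPING))
```

Because the mapping is built once from the unique values, the same source address always receives the same surrogate wherever it appears.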
Dataset-wide Replacement. In the final stage we stream every table in the dataset, globally replacing each source PII value with its surrogate. This guarantees referential consistency (queries that join on an anonymised IP address still succeed) and eliminates residual PII leakage while preserving analytical utility.
D Additional Question Generation Details...
Under these criteria and filtering after question generation, we collected a total of 589 questions as the test set (see Figure 4). We also created a sampling strategy to split the questions into training and test sets. Since we are building questions from the graph, and the train and test sets are all from one graph, we want the train samples to have less over...
This suggests that fine-tuning amplified the model’s bias toward the training incidents, degrading its ability to generalize. Given our small sample size, additional studies are needed to characterize the impact of fine-tuning more precisely. However, these initial results imply that naïve fine-tuning may be ill-suited to this task and motivate exploring ...
- The question should be natural and relevant to the context, and it should be clear and have a deterministic answer.
- But it should not leak the answer. If the start and end alert are the same, you should be more careful since the given entities may have overlapping information.
- The question should be specific of the answer you are looking for, and the answer should match the question.
- "answer": the answer to the question. You may be given one or more entities from the end alert, select the most meaningful entity and make sure it is not leaked in the context or question.
- "context": the context from the start alert. you should...
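The no-leak rule in the prompt above lends itself to a simple post-generation check, sketched here under stated assumptions. The "answer" and "context" keys appear in the prompt excerpt; the "question" key, the function name, and the example records are hypothetical. The substring test is deliberately crude and would miss paraphrased leaks.

```python
def is_valid_record(record):
    """Return True when the answer is non-empty and absent from question and context.

    A case-insensitive substring check: a rough proxy for the prompt's
    "should not leak the answer" rule, not a complete validator.
    """
    answer = str(record["answer"]).strip().lower()
    if not answer:
        return False
    visible = (str(record["question"]) + " " + str(record["context"])).lower()
    return answer not in visible


# Hypothetical generated records; the question wording is invented.
GOOD = {
    "context": "A suspicious LSASS access alert fired on host vnevado-win10v.",
    "question": "Which account ran the credential theft tool on that host?",
    "answer": "tgs2z",
}
LEAKY = dict(GOOD, question="Did account tgs2z run the credential theft tool?")

print(is_valid_record(GOOD), is_valid_record(LEAKY))
```

A filter like this could run after generation to discard records whose answer is visible to the agent before it investigates.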
- 2024-06-20 07:36 UTC – CredentialAccess: “Suspicious access to LSASS service” on vnevado-win10v via mimikatz.exe (Account: tgs2z)
- 2024-06-20 08:51 – CredentialAccess: “Possible attempt to access Primary Refresh Token (PRT)” on vnevado-win10v by get-userprttoken.ps1 (tgs2z)
- 2024-06-20 08:58 – Malware: “Mimikatz credential theft tool” detected on vnevado-win10v
- 2024-06-20 09:00 – CredentialAccess: “Malicious credential theft tool execution detected” on vnevado-win10v