{"work":{"id":"a37549b4-4c94-412d-acc4-4efeb08509be","openalex_id":null,"doi":null,"arxiv_id":"2308.03688","raw_key":null,"title":"AgentBench: Evaluating LLMs as Agents","authors":null,"authors_text":"Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai","year":2023,"venue":"cs.AI","abstract":"The potential of Large Language Model (LLM) as agents has been widely acknowledged recently. Thus, there is an urgent need to quantitatively \\textit{evaluate LLMs as agents} on challenging tasks in interactive environments. We present AgentBench, a multi-dimensional benchmark that consists of 8 distinct environments to assess LLM-as-Agent's reasoning and decision-making abilities. Our extensive test over \\num API-based and open-sourced (OSS) LLMs shows that, while top commercial LLMs present a strong ability of acting as agents in complex environments, there is a significant disparity in performance between them and many OSS competitors that are no larger than 70B. We identify the typical reasons of failures in environments and LLMs, showing that poor long-term reasoning, decision-making, and instruction following abilities are the main obstacles for developing usable LLM agents. Improving instruction following and training on high quality multi-round alignment data could improve agent performance. And different from existing assumptions, training on code present ambivalent impacts on different agent tasks. Datasets, environments, and an integrated evaluation package for AgentBench are released at https://github.com/THUDM/AgentBench.","external_url":"https://arxiv.org/abs/2308.03688","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-06-29T11:23:20.890512+00:00","pith_arxiv_id":"2308.03688","created_at":"2026-05-10T03:08:58.915293+00:00","updated_at":"2026-06-29T11:23:20.890512+00:00","title_quality_ok":true,"display_title":"AgentBench: Evaluating LLMs as Agents","render_title":"AgentBench: Evaluating LLMs as Agents"},"hub":{"state":{"work_id":"a37549b4-4c94-412d-acc4-4efeb08509be","tier":"super_hub","tier_reason":"100+ Pith inbound or 10,000+ external citations","pith_inbound_count":145,"external_cited_by_count":null,"distinct_field_count":17,"first_pith_cited_at":"2023-08-22T13:30:37+00:00","last_pith_cited_at":"2026-06-08T19:35:37+00:00","author_build_status":"needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-06-29T11:58:42.028495+00:00","tier_text":"super_hub"},"tier":"super_hub","role_counts":[{"context_role":"background","n":38},{"context_role":"dataset","n":5},{"context_role":"baseline","n":1}],"polarity_counts":[{"context_polarity":"background","n":38},{"context_polarity":"use_dataset","n":4},{"context_polarity":"baseline","n":1},{"context_polarity":"unclear","n":1}],"runs":{"ask_index":{"job_type":"ask_index","status":"succeeded","result":{"title":"AgentBench: Evaluating LLMs as Agents","claims":[{"claim_text":"The potential of Large Language Model (LLM) as agents has been widely acknowledged recently. Thus, there is an urgent need to quantitatively \\textit{evaluate LLMs as agents} on challenging tasks in interactive environments. We present AgentBench, a multi-dimensional benchmark that consists of 8 distinct environments to assess LLM-as-Agent's reasoning and decision-making abilities. Our extensive test over \\num API-based and open-sourced (OSS) LLMs shows that, while top commercial LLMs present a strong ability of acting as agents in complex environments, there is a significant disparity in perfo","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"prove how a single pipeline is programmed, but they generally do not make dependency materialization, replay iden- tity, or partial downstream invalidation the organizing abstraction of the whole system. 2.7 Benchmarks and Reproducible Agent Environments Another relevant literature studies how to evaluate agents in realistic yet reproducible environments. Benchmarks such as AgentBench [38], WebArena [39], VisualWebArena [40], WorkArena [41], AndroidWorld [42], OSWorld [43], AppWorld [44], GAIA [","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"How language models use long contexts.Transactions of the Association for Computational Linguistics, 2024. doi: 10.1162/tacl_a_00638. [24] X. Liu, H. Yu, et al. Agentbench: Evaluating llms as agents. InInternational Conference on Learning Representations, 2023. doi: 10.48550/arxiv.2308.03688. [25] LMSYS Org. Chatbot arena leaderboard, 2026. Accessed May 2026. [26] J. Luo and Y . Shao. Cayley graph optimization for scalable multi-agent communication topologies, 2026. [27] H. B. Mann and D. R. Whi","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"Reinforcement learning in robust markov decision processes.Advances in neural information processing systems, 26, 2013. 13 [66] Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, et al. Deepseek-v3. 2: Pushing the frontier of open large language models.arXiv preprint arXiv:2512.02556, 2025. [67] Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. Agentbench:","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"Yao, K. Pei, O. Press, and K. Narasimhan. Swe-bench: Can language models resolve real-world github issues? arXiv preprint arXiv:2310.06770, 2023. [13] M. Kim, Y . Jung, D. Lee, and S.-w. Hwang. Plm-based world models for text-based games. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing , pages 1324-1341, 2022. [14] X. Liu, H. Yu, H. Zhang, Y . Xu, X. Lei, H. Lai, Y . Gu, H. Ding, K. Men, K. Yang, et al. Agentbench: Evaluating llms as agents. arXiv prepri","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"oscillate, or even degrade beyond a certain tipping point [ 11, 6]. Despite substantial efforts in prompting and reasoning [ 64, 68], these failures remain stubborn, indicating that the problem is structural and systematic. These dynamics run deeper than current benchmarks capture. Existing metrics for LLM MAS: task success, final-answer accuracy, and cumulative reward [35], are largely centered on outcomes while overlooking the internal dynamics of collective reasoning. As a result, they cannot","claim_type":"dataset","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"ering both operational workflow execution and policy-based approval decisions with objective verification. We evaluate representative LLM agents on ENTCOLLABBENCHand identify key bottlenecks in delegation, parameter grounding, workflow closure, decision commitment, and coordination cost. 2 Related Works Single-Agent Enterprise Benchmarks.In recent years, agent evaluation benchmarks targeting enterprise scenarios have advanced rapidly. AgentBench [3] was the first to systematically evaluate LLMs ","claim_type":"background","confidence":0.85,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks AgentBench: Evaluating LLMs as Agents because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (13 contexts).","role_counts":[{"n":13,"context_role":"background"},{"n":3,"context_role":"dataset"},{"n":1,"context_role":"baseline"}]},"error":null,"updated_at":"2026-05-17T10:19:52.687639+00:00"},"author_expand":{"job_type":"author_expand","status":"succeeded","result":{"authors_linked":[{"id":"0bded415-e78b-4e1b-9ec5-5985843bdf1a","orcid":null,"display_name":"Xiao Liu"},{"id":"98e1668d-27ba-4c49-8301-c183afe2cb80","orcid":null,"display_name":"Hao Yu"},{"id":"a84400e0-e2ec-40d0-8f25-12d9e880956c","orcid":null,"display_name":"Hanchen Zhang"},{"id":"b5b67155-1cd4-42ed-b6cd-6c2dde72b1da","orcid":null,"display_name":"Yifan Xu"},{"id":"e29b2f87-90fb-41f8-a4d9-07f0e8b3157a","orcid":null,"display_name":"Xuanyu Lei"},{"id":"b7409f15-8632-4685-b54d-65f1b5c2b600","orcid":null,"display_name":"Hanyu Lai"}]},"error":null,"updated_at":"2026-05-17T10:19:52.639273+00:00"},"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-14T08:47:49.740949+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"WebArena: A Realistic Web Environment for Building Autonomous Agents","work_id":"7058ffd2-a339-4102-89eb-248eeb074652","shared_citers":28},{"title":"SWE-bench: Can Language Models Resolve Real-World GitHub Issues?","work_id":"d0effe15-a689-441a-8e3f-ea35f1c4e4b1","shared_citers":27},{"title":"ReAct: Synergizing Reasoning and Acting in Language Models","work_id":"407a2351-25f1-497d-b611-f77d0292a8e6","shared_citers":16},{"title":"$\\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains","work_id":"6a8d8dc4-0cc0-4052-8109-abbcdcd4a962","shared_citers":12},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":12},{"title":"ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs","work_id":"3c555b48-a4d9-42dd-9fdd-0f6018fbe9cb","shared_citers":12},{"title":"Voyager: An Open-Ended Embodied Agent with Large Language Models","work_id":"ffe0d207-86cf-4742-a100-e988ac8b9676","shared_citers":12},{"title":"Toolformer: Language Models Can Teach Themselves to Use Tools","work_id":"9bce40c8-cfd7-4983-80e0-c3bd4402322a","shared_citers":11},{"title":"AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation","work_id":"92b7eb9c-c3d8-4518-a376-06fa15dd895b","shared_citers":10},{"title":"Evaluating Large Language Models Trained on Code","work_id":"042493e9-b26f-4b4e-bbde-382072ca9b08","shared_citers":10},{"title":"Identifying the Risks of LM Agents with an LM-Emulated Sandbox","work_id":"3d4c3b66-d749-4939-b1bc-62b10b2ebbb6","shared_citers":10},{"title":"MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework","work_id":"891b9780-a800-4e3c-bba0-53597ab8dc98","shared_citers":8},{"title":"Workarena: How capable are web agents at solving common knowledge work tasks? arXiv preprint arXiv:2403.07718","work_id":"5ac27d9e-4522-46f8-985e-0e4f73130803","shared_citers":8},{"title":"Constitutional AI: Harmlessness from AI Feedback","work_id":"faaaa4e0-2676-4fac-a0b4-99aef10d2095","shared_citers":7},{"title":"Llama 2: Open Foundation and Fine-Tuned Chat Models","work_id":"68a5177f-d644-44c1-bd4f-4e5278c22f5d","shared_citers":7},{"title":"Reflexion: Language Agents with Verbal Reinforcement Learning","work_id":"778f739e-5f55-4961-8a2a-e4736a2757f4","shared_citers":7},{"title":"WebGPT: Browser-assisted question-answering with human feedback","work_id":"e25ef3e1-4848-4cb9-bf28-67a420591165","shared_citers":7},{"title":"Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models","work_id":"bb63abb3-0d50-4362-b97c-b5e725b03b39","shared_citers":6},{"title":"ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate","work_id":"eac74d79-d8d1-49dd-8565-53d713a84fff","shared_citers":6},{"title":"Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena","work_id":"d0c30cd7-81e1-4159-a87f-f6adca77ff08","shared_citers":6},{"title":"OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments","work_id":"793d9419-734d-45fe-9f51-d4c5a3a57cf8","shared_citers":6},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":6},{"title":"Self-Consistency Improves Chain of Thought Reasoning in Language Models","work_id":"8c6d5a6b-b5cc-4105-9c84-9c34bb9375bb","shared_citers":6},{"title":"Training Verifiers to Solve Math Word Problems","work_id":"acab1aa8-b4d6-40e0-a3ee-25341701dca2","shared_citers":6}],"time_series":[{"n":2,"year":2023},{"n":6,"year":2024},{"n":1,"year":2025},{"n":61,"year":2026}],"dependency_candidates":[]},"error":null,"updated_at":"2026-05-14T08:47:51.745679+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"items":[{"title":"Qwen3 Technical Report","outcome":"unchanged","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"counts":{"fixed":0,"merged":0,"unchanged":1,"quarantined":0,"needs_external_resolution":0},"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-14T08:47:56.359158+00:00"},"role_polarity":{"job_type":"role_polarity","status":"succeeded","result":{"title":"AgentBench: Evaluating LLMs as Agents","claims":[{"claim_text":"The potential of Large Language Model (LLM) as agents has been widely acknowledged recently. Thus, there is an urgent need to quantitatively \\textit{evaluate LLMs as agents} on challenging tasks in interactive environments. We present AgentBench, a multi-dimensional benchmark that consists of 8 distinct environments to assess LLM-as-Agent's reasoning and decision-making abilities. Our extensive test over \\num API-based and open-sourced (OSS) LLMs shows that, while top commercial LLMs present a strong ability of acting as agents in complex environments, there is a significant disparity in perfo","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"prove how a single pipeline is programmed, but they generally do not make dependency materialization, replay iden- tity, or partial downstream invalidation the organizing abstraction of the whole system. 2.7 Benchmarks and Reproducible Agent Environments Another relevant literature studies how to evaluate agents in realistic yet reproducible environments. Benchmarks such as AgentBench [38], WebArena [39], VisualWebArena [40], WorkArena [41], AndroidWorld [42], OSWorld [43], AppWorld [44], GAIA [","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"How language models use long contexts.Transactions of the Association for Computational Linguistics, 2024. doi: 10.1162/tacl_a_00638. [24] X. Liu, H. Yu, et al. Agentbench: Evaluating llms as agents. InInternational Conference on Learning Representations, 2023. doi: 10.48550/arxiv.2308.03688. [25] LMSYS Org. Chatbot arena leaderboard, 2026. Accessed May 2026. [26] J. Luo and Y . Shao. Cayley graph optimization for scalable multi-agent communication topologies, 2026. [27] H. B. Mann and D. R. Whi","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"Reinforcement learning in robust markov decision processes.Advances in neural information processing systems, 26, 2013. 13 [66] Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, et al. Deepseek-v3. 2: Pushing the frontier of open large language models.arXiv preprint arXiv:2512.02556, 2025. [67] Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. Agentbench:","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"Yao, K. Pei, O. Press, and K. Narasimhan. Swe-bench: Can language models resolve real-world github issues? arXiv preprint arXiv:2310.06770, 2023. [13] M. Kim, Y . Jung, D. Lee, and S.-w. Hwang. Plm-based world models for text-based games. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing , pages 1324-1341, 2022. [14] X. Liu, H. Yu, H. Zhang, Y . Xu, X. Lei, H. Lai, Y . Gu, H. Ding, K. Men, K. Yang, et al. Agentbench: Evaluating llms as agents. arXiv prepri","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"oscillate, or even degrade beyond a certain tipping point [ 11, 6]. Despite substantial efforts in prompting and reasoning [ 64, 68], these failures remain stubborn, indicating that the problem is structural and systematic. These dynamics run deeper than current benchmarks capture. Existing metrics for LLM MAS: task success, final-answer accuracy, and cumulative reward [35], are largely centered on outcomes while overlooking the internal dynamics of collective reasoning. As a result, they cannot","claim_type":"dataset","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"ering both operational workflow execution and policy-based approval decisions with objective verification. We evaluate representative LLM agents on ENTCOLLABBENCHand identify key bottlenecks in delegation, parameter grounding, workflow closure, decision commitment, and coordination cost. 2 Related Works Single-Agent Enterprise Benchmarks.In recent years, agent evaluation benchmarks targeting enterprise scenarios have advanced rapidly. AgentBench [3] was the first to systematically evaluate LLMs ","claim_type":"background","confidence":0.85,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks AgentBench: Evaluating LLMs as Agents because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (13 contexts).","role_counts":[{"n":13,"context_role":"background"},{"n":3,"context_role":"dataset"},{"n":1,"context_role":"baseline"}]},"error":null,"updated_at":"2026-05-17T10:19:52.685339+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"AgentBench: Evaluating LLMs as Agents","claims":[{"claim_text":"The potential of Large Language Model (LLM) as agents has been widely acknowledged recently. Thus, there is an urgent need to quantitatively \\textit{evaluate LLMs as agents} on challenging tasks in interactive environments. We present AgentBench, a multi-dimensional benchmark that consists of 8 distinct environments to assess LLM-as-Agent's reasoning and decision-making abilities. Our extensive test over \\num API-based and open-sourced (OSS) LLMs shows that, while top commercial LLMs present a strong ability of acting as agents in complex environments, there is a significant disparity in perfo","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks AgentBench: Evaluating LLMs as Agents because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-14T08:47:49.743516+00:00"}},"summary":{"title":"AgentBench: Evaluating LLMs as Agents","claims":[{"claim_text":"The potential of Large Language Model (LLM) as agents has been widely acknowledged recently. Thus, there is an urgent need to quantitatively \\textit{evaluate LLMs as agents} on challenging tasks in interactive environments. We present AgentBench, a multi-dimensional benchmark that consists of 8 distinct environments to assess LLM-as-Agent's reasoning and decision-making abilities. Our extensive test over \\num API-based and open-sourced (OSS) LLMs shows that, while top commercial LLMs present a strong ability of acting as agents in complex environments, there is a significant disparity in perfo","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks AgentBench: Evaluating LLMs as Agents because it crossed a citation-hub threshold.","role_counts":[]},"graph":{"co_cited":[{"title":"WebArena: A Realistic Web Environment for Building Autonomous Agents","work_id":"7058ffd2-a339-4102-89eb-248eeb074652","shared_citers":28},{"title":"SWE-bench: Can Language Models Resolve Real-World GitHub Issues?","work_id":"d0effe15-a689-441a-8e3f-ea35f1c4e4b1","shared_citers":27},{"title":"ReAct: Synergizing Reasoning and Acting in Language Models","work_id":"407a2351-25f1-497d-b611-f77d0292a8e6","shared_citers":16},{"title":"$\\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains","work_id":"6a8d8dc4-0cc0-4052-8109-abbcdcd4a962","shared_citers":12},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":12},{"title":"ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs","work_id":"3c555b48-a4d9-42dd-9fdd-0f6018fbe9cb","shared_citers":12},{"title":"Voyager: An Open-Ended Embodied Agent with Large Language Models","work_id":"ffe0d207-86cf-4742-a100-e988ac8b9676","shared_citers":12},{"title":"Toolformer: Language Models Can Teach Themselves to Use Tools","work_id":"9bce40c8-cfd7-4983-80e0-c3bd4402322a","shared_citers":11},{"title":"AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation","work_id":"92b7eb9c-c3d8-4518-a376-06fa15dd895b","shared_citers":10},{"title":"Evaluating Large Language Models Trained on Code","work_id":"042493e9-b26f-4b4e-bbde-382072ca9b08","shared_citers":10},{"title":"Identifying the Risks of LM Agents with an LM-Emulated Sandbox","work_id":"3d4c3b66-d749-4939-b1bc-62b10b2ebbb6","shared_citers":10},{"title":"MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework","work_id":"891b9780-a800-4e3c-bba0-53597ab8dc98","shared_citers":8},{"title":"Workarena: How capable are web agents at solving common knowledge work tasks? arXiv preprint arXiv:2403.07718","work_id":"5ac27d9e-4522-46f8-985e-0e4f73130803","shared_citers":8},{"title":"Constitutional AI: Harmlessness from AI Feedback","work_id":"faaaa4e0-2676-4fac-a0b4-99aef10d2095","shared_citers":7},{"title":"Llama 2: Open Foundation and Fine-Tuned Chat Models","work_id":"68a5177f-d644-44c1-bd4f-4e5278c22f5d","shared_citers":7},{"title":"Reflexion: Language Agents with Verbal Reinforcement Learning","work_id":"778f739e-5f55-4961-8a2a-e4736a2757f4","shared_citers":7},{"title":"WebGPT: Browser-assisted question-answering with human feedback","work_id":"e25ef3e1-4848-4cb9-bf28-67a420591165","shared_citers":7},{"title":"Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models","work_id":"bb63abb3-0d50-4362-b97c-b5e725b03b39","shared_citers":6},{"title":"ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate","work_id":"eac74d79-d8d1-49dd-8565-53d713a84fff","shared_citers":6},{"title":"Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena","work_id":"d0c30cd7-81e1-4159-a87f-f6adca77ff08","shared_citers":6},{"title":"OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments","work_id":"793d9419-734d-45fe-9f51-d4c5a3a57cf8","shared_citers":6},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":6},{"title":"Self-Consistency Improves Chain of Thought Reasoning in Language Models","work_id":"8c6d5a6b-b5cc-4105-9c84-9c34bb9375bb","shared_citers":6},{"title":"Training Verifiers to Solve Math Word Problems","work_id":"acab1aa8-b4d6-40e0-a3ee-25341701dca2","shared_citers":6}],"time_series":[{"n":2,"year":2023},{"n":6,"year":2024},{"n":1,"year":2025},{"n":61,"year":2026}],"dependency_candidates":[]},"authors":[{"id":"a84400e0-e2ec-40d0-8f25-12d9e880956c","orcid":null,"display_name":"Hanchen Zhang","source":"manual","import_confidence":0.72},{"id":"b7409f15-8632-4685-b54d-65f1b5c2b600","orcid":null,"display_name":"Hanyu Lai","source":"manual","import_confidence":0.72},{"id":"98e1668d-27ba-4c49-8301-c183afe2cb80","orcid":null,"display_name":"Hao Yu","source":"manual","import_confidence":0.72},{"id":"0bded415-e78b-4e1b-9ec5-5985843bdf1a","orcid":null,"display_name":"Xiao Liu","source":"manual","import_confidence":0.72},{"id":"e29b2f87-90fb-41f8-a4d9-07f0e8b3157a","orcid":null,"display_name":"Xuanyu Lei","source":"manual","import_confidence":0.72},{"id":"b5b67155-1cd4-42ed-b6cd-6c2dde72b1da","orcid":null,"display_name":"Yifan Xu","source":"manual","import_confidence":0.72}]}}