{"work":{"id":"6a8d8dc4-0cc0-4052-8109-abbcdcd4a962","openalex_id":null,"doi":null,"arxiv_id":"2406.12045","raw_key":null,"title":"$\\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains","authors":null,"authors_text":"Shunyu Yao, Noah Shinn, Pedram Razavi, Karthik Narasimhan","year":2024,"venue":"cs.AI","abstract":"Existing benchmarks do not test language agents on their interaction with human users or ability to follow domain-specific rules, both of which are vital for deploying them in real world applications. We propose $\\tau$-bench, a benchmark emulating dynamic conversations between a user (simulated by language models) and a language agent provided with domain-specific API tools and policy guidelines. We employ an efficient and faithful evaluation process that compares the database state at the end of a conversation with the annotated goal state. We also propose a new metric (pass^k) to evaluate the reliability of agent behavior over multiple trials. Our experiments show that even state-of-the-art function calling agents (like gpt-4o) succeed on <50% of the tasks, and are quite inconsistent (pass^8 <25% in retail). Our findings point to the need for methods that can improve the ability of agents to act consistently and follow rules reliably.","external_url":"https://arxiv.org/abs/2406.12045","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-25T06:06:43.099867+00:00","pith_arxiv_id":"2406.12045","created_at":"2026-05-09T05:55:29.709655+00:00","updated_at":"2026-05-25T06:06:43.099867+00:00","title_quality_ok":true,"display_title":"$\\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains","render_title":"$\\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains"},"hub":{"state":{"work_id":"6a8d8dc4-0cc0-4052-8109-abbcdcd4a962","tier":"super_hub","tier_reason":"100+ Pith inbound or 10,000+ external citations","pith_inbound_count":107,"external_cited_by_count":null,"distinct_field_count":12,"first_pith_cited_at":"2025-01-24T05:27:46+00:00","last_pith_cited_at":"2026-05-22T12:44:01+00:00","author_build_status":"needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-05-31T02:31:48.469339+00:00","tier_text":"super_hub"},"tier":"super_hub","role_counts":[{"context_role":"background","n":22},{"context_role":"dataset","n":11},{"context_role":"extension","n":1},{"context_role":"method","n":1}],"polarity_counts":[{"context_polarity":"background","n":21},{"context_polarity":"use_dataset","n":9},{"context_polarity":"unclear","n":3},{"context_polarity":"extend","n":1},{"context_polarity":"use_method","n":1}],"runs":{"ask_index":{"job_type":"ask_index","status":"succeeded","result":{"title":"$\\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains","claims":[{"claim_text":"Existing benchmarks do not test language agents on their interaction with human users or ability to follow domain-specific rules, both of which are vital for deploying them in real world applications. We propose $\\tau$-bench, a benchmark emulating dynamic conversations between a user (simulated by language models) and a language agent provided with domain-specific API tools and policy guidelines. We employ an efficient and faithful evaluation process that compares the database state at the end of a conversation with the annotated goal state. We also propose a new metric (pass^k) to evaluate th","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"evaluation pipelines have not internalized an adversarial mindset, and that proactive auditing could help close the security gap for the fast-paced benchmarking space. 1 Introduction The progress of AI is mostly tracked by a wide range of benchmarks. Hundreds of new benchmarks have been released in the past two years, spanning software engineering [25, 15, 13], web naviga- tion [61], desktop computing [55], general AI assistance [35], terminal operations [34], enterprise workflows [50], and tool","claim_type":"dataset","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"shift from no-user to dual-control, highlighting the challenges of guiding users. Overall, τ 2-bench provides a controlled testbed for agents that must both reason effectively and guide user actions.1 1 Introduction Existing benchmarks for conversational AI agents are designed to test their abilities to communicate effectively with a user and perform the right sequence of actions to solve tasks [26, 14, 23, 19]. These benchmarks are inherently single-control environments, where the AI agent is a","claim_type":"extension","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"MMLU-Pro [136] 81.44 77.30 82.50 84.04 80.50 85.30 84.89 MMLU-Redux [41] 87.61 88.80 91.10 89.44 90.90 93.30 91.91 C-Eval [55] 84.40 83.88 88.20 85.89 87.29 90.20 82.39 SuperGPQA [33] 49.67 51.20 58.20 59.71 56.40 63.40 61.88 Instruction Following IFEval [171] 91.13 83.20 91.50 92.39 81.70 91.90 - IFBench [164] 67.01 29.93 64.50 79.79 34.69 70.20 25.51 Agent Tau2 [153] 71.70 31.65 79.10 75.39 46.40 81.20 68.20 Claw eval [154] 58.90 21.70 65.40 58.50 22.10 36.50 60.60 Table 4 Quantitative evaluat","claim_type":"dataset","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"Generative Agents[89] Human Behavior Simulation 2023 arXiv:2304.03442 AgentVerse[90] Multi-Agent Problem Decomposition 2023 arXiv:2308.10848 MetaGPT[91] Joint Evolution (Product, QA, Engineer) 2023 arXiv:2308.00352 MindAgent[92] Multi-Agent Text Search & Defusal 2023 arXiv:2309.09971 Mobile-Agent-v2[93] Mobile GUI (Navigator & Interactor) 2024 arXiv:2404.14322 TAU-Bench (Airline/Telecom)[94] Multi-Agent Interacting Tool Use 2024 arXiv:2406.12045 ColBench (SweetRL)[95] LLM Collaborative Software ","claim_type":"dataset","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"is complementary to this line: its emphasis is not interface realism alone, but a workflow mixture derived from public demand signals and evaluated inside a reproducible release snapshot. Code and workspace agent benchmarks.Tool and code benchmarks provide the closest prece- dent for the workspace-repair side of Claw-Eval-Live. API-Bank [20], ToolBench/ToolLLM [33], Gorilla [31], MINT [40], τ-bench [48], and MCP-Bench [41] focus on API or tool manipulation. HumanEval [3], MBPP [ 2], DS-1000 [ 18","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"large capability gaps between commercial and open-source models. GAIA [11] uses 466 real-world questions requiring multi-modal reasoning and tool use, finding a 77-point gap between human performance (92%) and GPT-4 (15%). SWE-bench [2] tests resolution of real GitHub issues; We- bArena [12] and OSWorld [13] evaluate web and desktop task completion in realistic environments. τ -bench [14] highlights failures in policy compliance and behavioral consistency under dynamic user interaction. Despite ","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks $\\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (21 contexts).","role_counts":[{"n":21,"context_role":"background"},{"n":11,"context_role":"dataset"},{"n":1,"context_role":"extension"},{"n":1,"context_role":"method"}]},"error":null,"updated_at":"2026-05-22T05:43:33.128904+00:00"},"author_expand":{"job_type":"author_expand","status":"succeeded","result":{"authors_linked":[{"id":"55ffc739-fb29-4c94-90fd-a74183bd62cf","orcid":null,"display_name":"Shunyu Yao"},{"id":"2e4e9c2a-3fa8-4dbc-bfaf-de575a854051","orcid":null,"display_name":"Noah Shinn"},{"id":"47d0799e-9e4e-4efc-b5d2-bb89000879ea","orcid":null,"display_name":"Pedram Razavi"},{"id":"100cd29d-0db3-4c66-b69f-fd6ae13c609a","orcid":null,"display_name":"Karthik Narasimhan"}]},"error":null,"updated_at":"2026-05-22T05:43:33.680714+00:00"},"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-14T10:08:40.525214+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"SWE-bench: Can Language Models Resolve Real-World GitHub Issues?","work_id":"d0effe15-a689-441a-8e3f-ea35f1c4e4b1","shared_citers":20},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":16},{"title":"WebArena: A Realistic Web Environment for Building Autonomous Agents","work_id":"7058ffd2-a339-4102-89eb-248eeb074652","shared_citers":16},{"title":"AgentBench: Evaluating LLMs as Agents","work_id":"a37549b4-4c94-412d-acc4-4efeb08509be","shared_citers":12},{"title":"$\\tau^2$-Bench: Evaluating Conversational Agents in a Dual-Control Environment","work_id":"3a498b1a-455f-4667-b572-c5216c99a89c","shared_citers":11},{"title":"Evaluating Large Language Models Trained on Code","work_id":"042493e9-b26f-4b4e-bbde-382072ca9b08","shared_citers":11},{"title":"GAIA: a benchmark for General AI Assistants","work_id":"cf222b33-f7a3-4044-a570-ecfe25edb3f8","shared_citers":11},{"title":"ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs","work_id":"3c555b48-a4d9-42dd-9fdd-0f6018fbe9cb","shared_citers":9},{"title":"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models","work_id":"c5006563-f3ec-438a-9e35-b7b484f34828","shared_citers":8},{"title":"Instruction-Following Evaluation for Large Language Models","work_id":"3aa06177-125a-4f5a-8f4a-8070c5986c26","shared_citers":8},{"title":"Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces","work_id":"0624be05-1d97-4fd6-8300-b04b8a3ab04b","shared_citers":8},{"title":"Xu, Yufan Song, Boxuan Li, Yuxuan Tang, Kritanjali Jain, Mengxue Bao, Zora Z","work_id":"8ba3cce8-4fc7-4286-9bae-513243ed4e6e","shared_citers":8},{"title":"DeepSeek-V3 Technical Report","work_id":"57d2791d-2219-4c31-a077-afc04b12a75c","shared_citers":7},{"title":"Humanity's Last Exam","work_id":"59ea00d4-16a8-45e1-aafc-290a6f91d9f4","shared_citers":7},{"title":"Identifying the Risks of LM Agents with an LM-Emulated Sandbox","work_id":"3d4c3b66-d749-4939-b1bc-62b10b2ebbb6","shared_citers":7},{"title":"Kimi K2.5: Visual Agentic Intelligence","work_id":"d690be8f-5d53-49b0-b1e7-79668eb8fcdb","shared_citers":7},{"title":"Voyager: An Open-Ended Embodied Agent with Large Language Models","work_id":"ffe0d207-86cf-4742-a100-e988ac8b9676","shared_citers":7},{"title":"Workarena: How capable are web agents at solving common knowledge work tasks? arXiv preprint arXiv:2403.07718","work_id":"5ac27d9e-4522-46f8-985e-0e4f73130803","shared_citers":7},{"title":"BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents","work_id":"25adb508-d97c-49d6-ae43-7a70c2478a34","shared_citers":6},{"title":"LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code","work_id":"ea9e51ce-1e75-4182-92d8-4d25f70d2ee4","shared_citers":6},{"title":"Claw-Eval: Towards Trustworthy Evaluation of Autonomous Agents","work_id":"57acc3ec-f4c3-49ab-bd0f-5aab91002df9","shared_citers":5},{"title":"Constitutional AI: Harmlessness from AI Feedback","work_id":"faaaa4e0-2676-4fac-a0b4-99aef10d2095","shared_citers":5},{"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","shared_citers":5},{"title":"GLM-5: from Vibe Coding to Agentic Engineering","work_id":"ad29b1a2-bf77-46b3-9ead-fb62b1d2c6fe","shared_citers":5}],"time_series":[{"n":5,"year":2025},{"n":56,"year":2026}],"dependency_candidates":[]},"error":null,"updated_at":"2026-05-14T10:08:40.553335+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"items":[{"title":"Qwen3 Technical Report","outcome":"unchanged","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"counts":{"fixed":0,"merged":0,"unchanged":1,"quarantined":0,"needs_external_resolution":0},"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-14T10:08:51.390339+00:00"},"role_polarity":{"job_type":"role_polarity","status":"succeeded","result":{"title":"$\\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains","claims":[{"claim_text":"Existing benchmarks do not test language agents on their interaction with human users or ability to follow domain-specific rules, both of which are vital for deploying them in real world applications. We propose $\\tau$-bench, a benchmark emulating dynamic conversations between a user (simulated by language models) and a language agent provided with domain-specific API tools and policy guidelines. We employ an efficient and faithful evaluation process that compares the database state at the end of a conversation with the annotated goal state. We also propose a new metric (pass^k) to evaluate th","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"evaluation pipelines have not internalized an adversarial mindset, and that proactive auditing could help close the security gap for the fast-paced benchmarking space. 1 Introduction The progress of AI is mostly tracked by a wide range of benchmarks. Hundreds of new benchmarks have been released in the past two years, spanning software engineering [25, 15, 13], web naviga- tion [61], desktop computing [55], general AI assistance [35], terminal operations [34], enterprise workflows [50], and tool","claim_type":"dataset","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"shift from no-user to dual-control, highlighting the challenges of guiding users. Overall, τ 2-bench provides a controlled testbed for agents that must both reason effectively and guide user actions.1 1 Introduction Existing benchmarks for conversational AI agents are designed to test their abilities to communicate effectively with a user and perform the right sequence of actions to solve tasks [26, 14, 23, 19]. These benchmarks are inherently single-control environments, where the AI agent is a","claim_type":"extension","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"MMLU-Pro [136] 81.44 77.30 82.50 84.04 80.50 85.30 84.89 MMLU-Redux [41] 87.61 88.80 91.10 89.44 90.90 93.30 91.91 C-Eval [55] 84.40 83.88 88.20 85.89 87.29 90.20 82.39 SuperGPQA [33] 49.67 51.20 58.20 59.71 56.40 63.40 61.88 Instruction Following IFEval [171] 91.13 83.20 91.50 92.39 81.70 91.90 - IFBench [164] 67.01 29.93 64.50 79.79 34.69 70.20 25.51 Agent Tau2 [153] 71.70 31.65 79.10 75.39 46.40 81.20 68.20 Claw eval [154] 58.90 21.70 65.40 58.50 22.10 36.50 60.60 Table 4 Quantitative evaluat","claim_type":"dataset","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"Generative Agents[89] Human Behavior Simulation 2023 arXiv:2304.03442 AgentVerse[90] Multi-Agent Problem Decomposition 2023 arXiv:2308.10848 MetaGPT[91] Joint Evolution (Product, QA, Engineer) 2023 arXiv:2308.00352 MindAgent[92] Multi-Agent Text Search & Defusal 2023 arXiv:2309.09971 Mobile-Agent-v2[93] Mobile GUI (Navigator & Interactor) 2024 arXiv:2404.14322 TAU-Bench (Airline/Telecom)[94] Multi-Agent Interacting Tool Use 2024 arXiv:2406.12045 ColBench (SweetRL)[95] LLM Collaborative Software ","claim_type":"dataset","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"is complementary to this line: its emphasis is not interface realism alone, but a workflow mixture derived from public demand signals and evaluated inside a reproducible release snapshot. Code and workspace agent benchmarks.Tool and code benchmarks provide the closest prece- dent for the workspace-repair side of Claw-Eval-Live. API-Bank [20], ToolBench/ToolLLM [33], Gorilla [31], MINT [40], τ-bench [48], and MCP-Bench [41] focus on API or tool manipulation. HumanEval [3], MBPP [ 2], DS-1000 [ 18","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"large capability gaps between commercial and open-source models. GAIA [11] uses 466 real-world questions requiring multi-modal reasoning and tool use, finding a 77-point gap between human performance (92%) and GPT-4 (15%). SWE-bench [2] tests resolution of real GitHub issues; We- bArena [12] and OSWorld [13] evaluate web and desktop task completion in realistic environments. τ -bench [14] highlights failures in policy compliance and behavioral consistency under dynamic user interaction. Despite ","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks $\\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (21 contexts).","role_counts":[{"n":21,"context_role":"background"},{"n":11,"context_role":"dataset"},{"n":1,"context_role":"extension"},{"n":1,"context_role":"method"}]},"error":null,"updated_at":"2026-05-22T05:43:33.116766+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"$\\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains","claims":[{"claim_text":"Existing benchmarks do not test language agents on their interaction with human users or ability to follow domain-specific rules, both of which are vital for deploying them in real world applications. We propose $\\tau$-bench, a benchmark emulating dynamic conversations between a user (simulated by language models) and a language agent provided with domain-specific API tools and policy guidelines. We employ an efficient and faithful evaluation process that compares the database state at the end of a conversation with the annotated goal state. We also propose a new metric (pass^k) to evaluate th","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks $\\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-14T10:08:40.529511+00:00"}},"summary":{"title":"$\\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains","claims":[{"claim_text":"Existing benchmarks do not test language agents on their interaction with human users or ability to follow domain-specific rules, both of which are vital for deploying them in real world applications. We propose $\\tau$-bench, a benchmark emulating dynamic conversations between a user (simulated by language models) and a language agent provided with domain-specific API tools and policy guidelines. We employ an efficient and faithful evaluation process that compares the database state at the end of a conversation with the annotated goal state. We also propose a new metric (pass^k) to evaluate th","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks $\\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains because it crossed a citation-hub threshold.","role_counts":[]},"graph":{"co_cited":[{"title":"SWE-bench: Can Language Models Resolve Real-World GitHub Issues?","work_id":"d0effe15-a689-441a-8e3f-ea35f1c4e4b1","shared_citers":20},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":16},{"title":"WebArena: A Realistic Web Environment for Building Autonomous Agents","work_id":"7058ffd2-a339-4102-89eb-248eeb074652","shared_citers":16},{"title":"AgentBench: Evaluating LLMs as Agents","work_id":"a37549b4-4c94-412d-acc4-4efeb08509be","shared_citers":12},{"title":"$\\tau^2$-Bench: Evaluating Conversational Agents in a Dual-Control Environment","work_id":"3a498b1a-455f-4667-b572-c5216c99a89c","shared_citers":11},{"title":"Evaluating Large Language Models Trained on Code","work_id":"042493e9-b26f-4b4e-bbde-382072ca9b08","shared_citers":11},{"title":"GAIA: a benchmark for General AI Assistants","work_id":"cf222b33-f7a3-4044-a570-ecfe25edb3f8","shared_citers":11},{"title":"ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs","work_id":"3c555b48-a4d9-42dd-9fdd-0f6018fbe9cb","shared_citers":9},{"title":"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models","work_id":"c5006563-f3ec-438a-9e35-b7b484f34828","shared_citers":8},{"title":"Instruction-Following Evaluation for Large Language Models","work_id":"3aa06177-125a-4f5a-8f4a-8070c5986c26","shared_citers":8},{"title":"Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces","work_id":"0624be05-1d97-4fd6-8300-b04b8a3ab04b","shared_citers":8},{"title":"Xu, Yufan Song, Boxuan Li, Yuxuan Tang, Kritanjali Jain, Mengxue Bao, Zora Z","work_id":"8ba3cce8-4fc7-4286-9bae-513243ed4e6e","shared_citers":8},{"title":"DeepSeek-V3 Technical Report","work_id":"57d2791d-2219-4c31-a077-afc04b12a75c","shared_citers":7},{"title":"Humanity's Last Exam","work_id":"59ea00d4-16a8-45e1-aafc-290a6f91d9f4","shared_citers":7},{"title":"Identifying the Risks of LM Agents with an LM-Emulated Sandbox","work_id":"3d4c3b66-d749-4939-b1bc-62b10b2ebbb6","shared_citers":7},{"title":"Kimi K2.5: Visual Agentic Intelligence","work_id":"d690be8f-5d53-49b0-b1e7-79668eb8fcdb","shared_citers":7},{"title":"Voyager: An Open-Ended Embodied Agent with Large Language Models","work_id":"ffe0d207-86cf-4742-a100-e988ac8b9676","shared_citers":7},{"title":"Workarena: How capable are web agents at solving common knowledge work tasks? arXiv preprint arXiv:2403.07718","work_id":"5ac27d9e-4522-46f8-985e-0e4f73130803","shared_citers":7},{"title":"BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents","work_id":"25adb508-d97c-49d6-ae43-7a70c2478a34","shared_citers":6},{"title":"LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code","work_id":"ea9e51ce-1e75-4182-92d8-4d25f70d2ee4","shared_citers":6},{"title":"Claw-Eval: Towards Trustworthy Evaluation of Autonomous Agents","work_id":"57acc3ec-f4c3-49ab-bd0f-5aab91002df9","shared_citers":5},{"title":"Constitutional AI: Harmlessness from AI Feedback","work_id":"faaaa4e0-2676-4fac-a0b4-99aef10d2095","shared_citers":5},{"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","shared_citers":5},{"title":"GLM-5: from Vibe Coding to Agentic Engineering","work_id":"ad29b1a2-bf77-46b3-9ead-fb62b1d2c6fe","shared_citers":5}],"time_series":[{"n":5,"year":2025},{"n":56,"year":2026}],"dependency_candidates":[]},"authors":[{"id":"100cd29d-0db3-4c66-b69f-fd6ae13c609a","orcid":null,"display_name":"Karthik Narasimhan","source":"manual","import_confidence":0.72},{"id":"2e4e9c2a-3fa8-4dbc-bfaf-de575a854051","orcid":null,"display_name":"Noah Shinn","source":"manual","import_confidence":0.72},{"id":"47d0799e-9e4e-4efc-b5d2-bb89000879ea","orcid":null,"display_name":"Pedram Razavi","source":"manual","import_confidence":0.72},{"id":"55ffc739-fb29-4c94-90fd-a74183bd62cf","orcid":null,"display_name":"Shunyu Yao","source":"manual","import_confidence":0.72}]}}