AgentBench: Evaluating LLMs as Agents
Pith reviewed 2026-05-11 11:34 UTC · model grok-4.3
The pith
A benchmark of eight interactive environments reveals a large performance gap between top commercial LLMs and open-source models as agents.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We present AgentBench, a multi-dimensional benchmark that consists of 8 distinct environments to assess LLM-as-Agent's reasoning and decision-making abilities. Our extensive test over numerous API-based and open-sourced LLMs shows that, while top commercial LLMs present a strong ability to act as agents in complex environments, there is a significant disparity in performance between them and many OSS competitors that are no larger than 70B. We identify the typical reasons for failure in environments and LLMs, showing that poor long-term reasoning, decision-making, and instruction-following abilities are the main obstacles to developing usable LLM agents. Improving instruction following and training on high-quality multi-round alignment data could improve agent performance.
What carries the argument
AgentBench, the collection of eight interactive environments that directly tests an LLM's ability to sustain reasoning and choose actions over multiple steps.
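To make concrete what "sustain reasoning and choose actions over multiple steps" means operationally, here is a minimal sketch of an AgentBench-style interaction loop, assuming the "Think: ... Act: ..." turn format the paper's OS environment uses (with terminal actions like answer and finish). `call_llm`, `env`, and the result shape are hypothetical interfaces, not the released harness.

```python
import re

def run_episode(call_llm, env, max_turns=10):
    """Drive one multi-turn episode in an AgentBench-style environment.

    `call_llm` and `env` are hypothetical interfaces (not the released
    harness): the model must answer every turn in a fixed
    "Think: ... Act: ..." format, and a reply that fails to parse is an
    invalid action -- the instruction-following failure mode the paper
    highlights.
    """
    history = [{"role": "user", "content": env.instruction()}]
    for _ in range(max_turns):
        reply = call_llm(history)
        history.append({"role": "assistant", "content": reply})
        # Strict format: the chosen action must follow an "Act:" marker.
        match = re.search(r"Act:\s*(\w+)(.*)", reply, re.DOTALL)
        if match is None:
            return {"reward": 0.0, "error": "unparsable action"}
        action, payload = match.group(1), match.group(2).strip()
        if action in ("finish", "answer"):
            # Terminal action: env.score is assumed to return {"reward": ...}.
            return env.score(payload)
        # Non-terminal action (e.g. a bash command): execute it and feed the
        # observation back to the model as the next user turn.
        history.append({"role": "user", "content": env.step(action, payload)})
    return {"reward": 0.0, "error": "turn limit reached"}
```

Under such a loop, a model that reasons well but formats poorly scores zero, which is exactly the confound the editorial analysis below raises.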
If this is right
- Poor long-term reasoning, decision-making, and instruction following are the primary obstacles to usable LLM agents.
- Training focused on instruction following and high-quality multi-round alignment data can raise agent performance.
- Training on code produces mixed effects that vary by the specific agent task.
- A clear performance separation exists between leading commercial LLMs and open-source models at or below 70B parameters.
Where Pith is reading between the lines
- The gap suggests that model scale up to 70B alone does not guarantee strong agent behavior without changes in training data or objectives.
- The benchmark could be used to track whether future alignment methods narrow the commercial-to-open-source difference over successive model releases.
- Extending the environments to include more physical or long-horizon real-world tasks would test whether the identified failure modes persist outside the current set.
Load-bearing premise
The eight chosen environments and their metrics sufficiently represent the core challenges of real-world agent deployment, and the measured differences reflect actual reasoning gaps rather than prompt or setup artifacts.
What would settle it
Re-testing the same models with substantially altered prompts or additional environments and finding that open-source models under 70B close the gap to commercial ones would show the reported disparity depends on the specific evaluation choices.
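One way to run that settling experiment is an ablation harness that re-scores every model under several prompt templates and seeds, so the gap can be read against within-model spread rather than a single run. A sketch under assumed interfaces: `run_episode` is the loop sketched above, and `models`, `templates`, `tasks`, and `task.make_env` are hypothetical harness objects, not AgentBench's released package.

```python
import statistics

def prompt_sensitivity_ablation(models, templates, tasks, seeds=(0, 1, 2)):
    """Re-score every model under several prompt templates and random seeds.

    `models` maps model names to call_llm functions; each task's make_env
    builds a fresh environment for a given template and seed. All of these
    are assumptions for illustration.
    """
    report = {}
    for name, call_llm in models.items():
        scores = []
        for template in templates:
            for seed in seeds:
                rewards = [
                    run_episode(call_llm, task.make_env(template, seed))["reward"]
                    for task in tasks
                ]
                scores.append(statistics.mean(rewards))
        report[name] = {
            "mean": statistics.mean(scores),
            "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
            "range": (min(scores), max(scores)),
        }
    return report
```

If OSS models' means rise and the commercial-OSS gap shrinks within the reported spread, the disparity was partly an evaluation artifact; if the gap persists across templates and seeds, it reflects the models.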
read the original abstract
The potential of Large Language Model (LLM) as agents has been widely acknowledged recently. Thus, there is an urgent need to quantitatively evaluate LLMs as agents on challenging tasks in interactive environments. We present AgentBench, a multi-dimensional benchmark that consists of 8 distinct environments to assess LLM-as-Agent's reasoning and decision-making abilities. Our extensive test over numerous API-based and open-sourced (OSS) LLMs shows that, while top commercial LLMs present a strong ability of acting as agents in complex environments, there is a significant disparity in performance between them and many OSS competitors that are no larger than 70B. We identify the typical reasons of failures in environments and LLMs, showing that poor long-term reasoning, decision-making, and instruction following abilities are the main obstacles for developing usable LLM agents. Improving instruction following and training on high quality multi-round alignment data could improve agent performance. And different from existing assumptions, training on code present ambivalent impacts on different agent tasks. Datasets, environments, and an integrated evaluation package for AgentBench are released at https://github.com/THUDM/AgentBench.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces AgentBench, a multi-dimensional benchmark consisting of 8 distinct interactive environments to quantitatively evaluate LLMs as agents, focusing on their reasoning and decision-making abilities. Extensive testing across numerous API-based commercial and open-source LLMs reveals that top commercial models exhibit strong agentic performance in complex settings, while exposing a significant performance disparity with many OSS competitors no larger than 70B parameters. The authors analyze common failure modes, attributing them primarily to deficiencies in long-term reasoning, decision-making, and instruction following, and propose that improvements in instruction following and training on high-quality multi-round alignment data could help; they also observe ambivalent effects of code training across agent tasks. The benchmark, datasets, environments, and an integrated evaluation package are released publicly.
Significance. If the reported performance disparities prove robust, this benchmark would be a timely and useful contribution to the growing field of LLM agents by providing standardized, multi-environment evaluation that highlights gaps between commercial and open-source models. The public release of the full evaluation framework, datasets, and code is a clear strength that supports reproducibility and community follow-up work. The identification of specific failure modes (e.g., instruction following) offers actionable insights, though the overall significance is tempered by the empirical nature of the study and the need for stronger controls on evaluation artifacts.
major comments (2)
- [§4] §4 (Experiments/Evaluation Protocol): The evaluation applies a single, uniform prompting template and interaction protocol across all models without any ablation on prompt variations, model-specific few-shot examples, or relaxed output-format constraints. This is load-bearing for the central disparity claim because commercial models' documented advantage in instruction following (one of the primary failure modes identified) is likely amplified by RLHF, raising the possibility that a non-trivial portion of the gap versus OSS models ≤70B is an artifact of prompt sensitivity rather than intrinsic differences in reasoning or decision-making.
- [§3] §3 (Environments) and §4: Insufficient detail is provided on environment construction, exact metric definitions, controls for prompt sensitivity, environment stochasticity, and statistical significance testing (e.g., variance across multiple runs or seeds). Without these, the reliability of the performance numbers and the claim that the eight environments sufficiently isolate core agentic challenges cannot be fully assessed.
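For the significance point, a generic bootstrap over paired per-task scores would suffice to say whether the commercial-vs-OSS gap clears sampling noise. This is a standard statistical sketch, not the paper's own protocol; the inputs are assumed to be per-task success values in [0, 1] for one commercial and one OSS model on the same task list.

```python
import random
import statistics

def bootstrap_gap_ci(commercial_scores, oss_scores, n_boot=10_000, alpha=0.05):
    """Bootstrap a confidence interval for the mean success-rate gap.

    Resamples tasks with replacement; the gap is "significant" at the given
    level if the resulting interval excludes zero.
    """
    n = len(commercial_scores)
    assert n == len(oss_scores), "scores must be paired by task"
    rng = random.Random(0)  # fixed seed so the interval is reproducible
    gaps = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample task indices
        gaps.append(
            statistics.mean(commercial_scores[i] for i in idx)
            - statistics.mean(oss_scores[i] for i in idx)
        )
    gaps.sort()
    lo = gaps[int(alpha / 2 * n_boot)]
    hi = gaps[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```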
minor comments (2)
- [Abstract] The abstract and §5 (Analysis) could more explicitly quantify the performance gap (e.g., specific success rates or normalized scores for top commercial vs. OSS models) rather than describing it qualitatively as 'significant'.
- [Figures] Figure captions and axis labels in the results figures would benefit from greater clarity on what the plotted metrics precisely measure (e.g., success rate, partial credit, or normalized scores).
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, providing our strongest honest defense of the manuscript while committing to revisions that improve clarity and rigor without altering the core claims or experimental design.
read point-by-point responses
- Referee: [§4] §4 (Experiments/Evaluation Protocol): The evaluation applies a single, uniform prompting template and interaction protocol across all models without any ablation on prompt variations, model-specific few-shot examples, or relaxed output-format constraints. This is load-bearing for the central disparity claim because commercial models' documented advantage in instruction following (one of the primary failure modes identified) is likely amplified by RLHF, raising the possibility that a non-trivial portion of the gap versus OSS models ≤70B is an artifact of prompt sensitivity rather than intrinsic differences in reasoning or decision-making.
Authors: The uniform prompting template and interaction protocol were selected to create a standardized, reproducible benchmark that enables fair head-to-head comparison of models' agentic abilities without confounding factors from per-model prompt engineering or few-shot tuning. Allowing model-specific adaptations would undermine the benchmark's purpose of measuring intrinsic capabilities under consistent conditions, as is standard in many LLM evaluation suites. The manuscript already highlights instruction following as a primary failure mode through qualitative analysis of outputs, and the observed performance gaps are consistent with this; commercial models' RLHF advantages in this area are a legitimate component of their superior agent performance rather than an artifact to be removed. We disagree that this renders the disparity claim unreliable, but we will add an explicit discussion of prompt sensitivity limitations and the rationale for standardization in the revised manuscript. revision: partial
- Referee: [§3] §3 (Environments) and §4: Insufficient detail is provided on environment construction, exact metric definitions, controls for prompt sensitivity, environment stochasticity, and statistical significance testing (e.g., variance across multiple runs or seeds). Without these, the reliability of the performance numbers and the claim that the eight environments sufficiently isolate core agentic challenges cannot be fully assessed.
Authors: We will substantially expand Sections 3 and 4 in the revision to provide greater detail on environment construction processes, precise mathematical definitions of all metrics, any existing controls for prompt sensitivity and stochasticity, and results from multiple runs with variance reporting where environments permit. These additions will strengthen the justification that the eight environments collectively isolate core agentic challenges such as long-term reasoning and decision-making. revision: yes
Circularity Check
Empirical benchmark with no derivations or self-referential claims
full rationale
The paper is an empirical evaluation study that runs LLMs on 8 fixed environments and reports observed success rates, failure modes, and performance gaps. No equations, first-principles derivations, fitted parameters, or predictions appear in the abstract or described structure. Claims rest on direct experimental outcomes rather than any chain that reduces to its own inputs by construction. Self-citations, if present, are not load-bearing for any derivation because no derivation exists. The work is therefore self-contained against external benchmarks and receives the default non-circularity finding.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: LLMs can be effectively prompted to function as interactive agents in multi-step environments
Forward citations
Cited by 60 Pith papers
- WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation
  A new native-runtime benchmark reveals that current frontier AI agents succeed on at most 62 percent of realistic long-horizon CLI tasks.
- SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning
  SimWorld Studio uses a self-evolving coding agent to generate adaptive 3D environments that improve embodied agent performance, with reported gains of 18 points over fixed environments in navigation tasks.
- PhysicianBench: Evaluating LLM Agents in Real-World EHR Environments
  PhysicianBench is a new benchmark of 100 physician-reviewed, execution-grounded tasks in live EHR environments where the best LLM agent reaches only 46% success and open-source models reach 19%.
- AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents
  AgentDojo introduces an extensible evaluation framework populated with realistic agent tasks and security test cases to measure prompt injection robustness in tool-using LLM agents.
- OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
  OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.
- LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding
  LongBench is the first bilingual multi-task benchmark for long context understanding in LLMs, containing 21 datasets in 6 categories with average lengths of 6711 words (English) and 13386 characters (Chinese).
- ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents
  ClawForge supplies a generator that turns scenario templates into reproducible command-line tasks testing state conflict handling, where the strongest frontier model scores only 45.3 percent strict accuracy.
- RS-Claw: Progressive Active Tool Exploration via Hierarchical Skill Trees for Remote Sensing Agents
  RS-Claw enables remote sensing agents to actively explore tools via hierarchical skill trees, achieving up to 86% token compression and outperforming flat registration and RAG baselines on Earth-Bench.
- AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation
  10.7% of passing SWE-agent trajectories are Lucky Passes with chaotic behaviors, and a quality score based on process references changes model rankings across eight backends.
- Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack
  BenchJack audits 10 AI agent benchmarks, synthesizes exploits achieving near-perfect scores without task completion, surfaces 219 flaws, and reduces hackable-task ratios to under 10% on four benchmarks via iterative patching.
- Skill Drift Is Contract Violation: Proactive Maintenance for LLM Agent Skill Libraries
  SkillGuard extracts executable environment contracts from LLM skill documents to detect only relevant drifts, reporting zero false positives on 599 cases, 100% precision in known-drift tests, and raising one-round rep...
- Beyond the All-in-One Agent: Benchmarking Role-Specialized Multi-Agent Collaboration in Enterprise Workflows
  EntCollabBench shows that today's LLM agents still struggle with delegation, context transfer, parameter grounding, workflow closure, and decision commitment when tested in a simulated enterprise with 11 role-speciali...
- AgentEscapeBench: Evaluating Out-of-Domain Tool-Grounded Reasoning in LLM Agents
  AgentEscapeBench shows LLM agents' success rates drop from 90% to 60% as tool-dependency depth increases from 5 to 25 steps, while humans drop only from 98% to 80%.
- CyBiasBench: Benchmarking Bias in LLM Agents for Cyber-Attack Scenarios
  LLM agents exhibit persistent attack-selection biases as fixed traits independent of success rates, with a bias momentum effect that resists steering and yields no performance gain.
- The Moltbook Files: A Harmless Slopocalypse or Humanity's Last Experiment
  An AI-agent social platform generated mostly neutral content whose use in fine-tuning reduced model truthfulness comparably to human Reddit data, suggesting limited unique harm but flagging tail risks like secret leaks.
- Partial Evidence Bench: Benchmarking Authorization-Limited Evidence in Agentic Systems
  Partial Evidence Bench is a deterministic benchmark that measures agent correctness, completeness awareness, gap-report quality, and unsafe overclaiming in authorization-constrained evidence environments.
- NeuroState-Bench: A Human-Calibrated Benchmark for Commitment Integrity in LLM Agent Profiles
  NeuroState-Bench is a human-calibrated benchmark with 144 tasks and 306 side-query probes showing that commitment integrity in LLM agent profiles diverges from task success, with 31 of 32 profiles changing rank under ...
- Cutscene Agent: An LLM Agent Framework for Automated 3D Cutscene Generation
  Cutscene Agent uses a multi-agent LLM system and a new toolkit for game engine control to automate end-to-end 3D cutscene generation, evaluated on the introduced CutsceneBench.
- ClawMark: A Living-World Benchmark for Multi-Turn, Multi-Day, Multimodal Coworker Agents
  ClawMark is a new benchmark for multi-turn multi-day multimodal coworker agents in stateful evolving services, with deterministic Python checkers showing frontier models achieve only 20% strict task success.
- A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework
  A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
- Trust, Lies, and Long Memories: Emergent Social Dynamics and Reputation in Multi-Round Avalon with LLM Agents
  In 188 multi-round Avalon games, LLM agents with cross-game memory form reputations that boost high-reputation players' team inclusions by 46% and show more strategic deception (75% vs 36%) with higher reasoning effort.
- Rethinking Scale: Deployment Trade-offs of Small Language Models under Agent Paradigms
  Single-agent systems with tools provide the optimal performance-efficiency trade-off for small language models, outperforming base models and multi-agent setups.
- AI scientists produce results without reasoning scientifically
  LLM agents execute scientific tasks but fail to follow core scientific reasoning norms such as evidence consideration and belief revision based on refutations.
- WhatIf: Interactive Exploration of LLM-Powered Social Simulations for Policy Reasoning
  WhatIf provides an interactive platform for real-time exploration of LLM-driven social simulations, enabling policymakers to iteratively test plans, reflect on assumptions, and uncover vulnerabilities in emergency pre...
- SkillFlow: Benchmarking Lifelong Skill Discovery and Evolution for Autonomous Agents
  SkillFlow benchmark shows lifelong skill evolution yields modest gains for some models like Claude Opus 4.6 but limited or negative utility for others despite high skill usage.
- Evaluating Tool-Using Language Agents: Judge Reliability, Propagation Cascades, and Runtime Mitigation in AgentProp-Bench
  AgentProp-Bench shows substring judging agrees with humans at kappa=0.049, LLM ensemble at 0.432, bad-parameter injection propagates with ~0.62 probability, rejection and recovery are independent, and a runtime fix cu...
- ARGOS: Who, Where, and When in Agentic Multi-Camera Person Search
  ARGOS is the first benchmark reformulating multi-camera person search as an agentic interactive reasoning task grounded in a spatio-temporal topology graph, with 2691 tasks across three tracks where current LLMs achie...
- ClawVM: Harness-Managed Virtual Memory for Stateful Tool-Using LLM Agents
  ClawVM introduces a harness-managed virtual memory system for LLM agents that ensures deterministic residency and durability of state under token budgets by using typed pages and validated writeback.
- FinTrace: Holistic Trajectory-Level Evaluation of LLM Tool Calling for Long-Horizon Financial Tasks
  FinTrace supplies trajectory-level metrics for LLM financial tool calling, exposing gaps in information use and output quality, while its preference dataset enables DPO training that boosts intermediate metrics.
- SAGE: A Service Agent Graph-guided Evaluation Benchmark
  SAGE is a new multi-agent benchmark that formalizes service SOPs as dynamic dialogue graphs to measure LLM agents on logical compliance and path coverage, uncovering an execution gap and empathy resilience across 27 m...
- FrontierFinance: A Long-Horizon Computer-Use Benchmark of Real-World Financial Tasks
  FrontierFinance benchmark shows human financial experts outperform state-of-the-art LLMs by achieving higher scores and more client-ready outputs on realistic long-horizon tasks.
- M-CARE: Standardized Clinical Case Reporting for AI Model Behavioral Disorders, with a 20-Case Atlas and Experimental Validation
  M-CARE provides a medical-inspired reporting system for AI behavioral disorders, demonstrated through 20 cases and a validated experiment showing shell instructions overriding cooperative behavior across game domains.
- τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment
  τ²-bench provides a Dec-POMDP-based telecom domain with compositional task generation and a tool-constrained user simulator to measure agent performance drops in dual-control versus single-control settings.
- τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains
  τ-bench shows state-of-the-art agents like GPT-4o succeed on under 50% of tool-using, rule-following tasks and are inconsistent across repeated trials.
- WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?
  WorkArena benchmark shows LLM web agents achieve partial success on enterprise tasks but have a substantial gap to full automation and perform worse with open-source models.
- GAIA: a benchmark for General AI Assistants
  GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.
- Exploiting LLM Agent Supply Chains via Payload-less Skills
  Semantic Compliance Hijacking lets attackers hijack LLM agents by disguising malicious instructions as compliance rules in skills, reaching up to 77.67% success on confidentiality breaches and 67.33% on RCE while evad...
- SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces
  SkillSafetyBench shows that localized non-user attacks via skills and artifacts can consistently induce unsafe agent behavior across domains and model backends, independent of user intent.
- Domain Restriction via Multi SAE Layer Transitions
  Multi-layer SAE transitions capture domain-specific signatures that distinguish OOD texts in Gemma-2 models.
- On-Policy Self-Evolution via Failure Trajectories for Agentic Safety Alignment
  FATE lets LLM agents self-evolve safer behaviors by generating and filtering repairs from their own failure trajectories using verifiers and Pareto optimization.
- HAGE: Harnessing Agentic Memory via RL-Driven Weighted Graph Evolution
  HAGE proposes a trainable weighted graph memory framework with LLM intent classification, dynamic edge modulation, and RL optimization that improves long-horizon reasoning accuracy in agentic LLMs over static baselines.
- TIDE-Bench: Task-Aware and Diagnostic Evaluation of Tool-Integrated Reasoning
  TIDE-Bench is a new benchmark for tool-integrated reasoning that combines diverse tasks, multi-aspect metrics covering answer quality, process reliability, efficiency and cost, plus filtered challenging test sets.
- OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces
  OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.
- AgentCollabBench: Diagnosing When Good Agents Make Bad Collaborators
  AgentCollabBench shows that multi-agent reliability is limited by communication topology, with converging-DAG nodes causing synthesis bottlenecks that discard constraints and explain 7-40% of information loss variance.
- EvidenT: An Evidence-Preserving Framework for Iterative System-Level Package Repair
  EvidenT repairs 53.88% of real-world RISC-V system-level package build failures by preserving repair history and build artifacts in a closed-loop validation system, outperforming baselines by a wide margin.
- Why Does Agentic Safety Fail to Generalize Across Tasks?
  Agentic safety fails to generalize across tasks because the task-to-safe-controller mapping has a higher Lipschitz constant than the task-to-controller mapping alone, as proven in linear-quadratic control and demonstr...
- Beyond Task Success: Measuring Workflow Fidelity in LLM-Based Agentic Payment Systems
  ASR, a new trajectory-fidelity metric, detects that 10 of 18 LLMs skip confirmation steps in payment agents despite perfect scores on prior metrics, and ASR-guided refinements improve task success by up to 93.8 percen...
- Beyond Accuracy: Policy Invariance as a Reliability Test for LLM Safety Judges
  LLM safety judges flip verdicts on equivalent policy rewrites up to 9.1% of the time and cannot distinguish meaningful from meaningless changes, requiring new invariance-based reliability metrics.
- Evaluating Agentic AI in the Wild: Failure Modes, Drift Patterns, and a Production Evaluation Framework
  The paper presents a taxonomy of seven production-specific failure modes for agentic AI, demonstrates that existing metrics fail to detect four of them entirely, and proposes the PAEF five-dimension framework for cont...
- AgentFloor: How Far Up the Tool-Use Ladder Can Small Open-Weight Models Go?
  Small open-weight models match GPT-5 on routine agent tool-use tasks but lag on long-horizon planning, supporting tiered routing to reduce costs in agentic systems.
- Trace-Level Analysis of Information Contamination in Multi-Agent Systems
  Agent workflows can diverge substantially from contaminated inputs yet recover correct answers, or stay similar while failing, as measured by trace divergence on GAIA tasks.
- LATTICE: Evaluating Decision Support Utility of Crypto Agents
  LATTICE is a scalable LLM-judge benchmark for crypto agent decision support that reveals performance trade-offs among real-world copilots across dimensions and tasks.
- ProEval: Proactive Failure Discovery and Efficient Performance Estimation for Generative AI Evaluation
  ProEval is a proactive framework using pre-trained GPs, Bayesian quadrature, and superlevel set sampling to estimate performance and find failures in generative AI with 8-65x fewer samples than baselines.
- CHORUS: An Agentic Framework for Generating Realistic Deliberation Data
  Chorus generates realistic deliberation discussions via LLM agents with memory and Poisson-timed participation, validated by 30 experts on realism, coherence, and utility.
- An AI Agent Execution Environment to Safeguard User Data
  GAAP guarantees confidentiality of private user data for AI agents by enforcing user-specified permissions deterministically through persistent information flow tracking, without trusting the agent or requiring attack...
- How Adversarial Environments Mislead Agentic AI?
  Adversarial compromise of tool outputs misleads agentic AI via breadth and depth attacks, revealing that epistemic and navigational robustness are distinct and often trade off against each other.
- ClawEnvKit: Automatic Environment Generation for Claw-Like Agents
  ClawEnvKit automates generation of diverse verified environments for claw-like agents from natural language, producing the Auto-ClawEval benchmark of 1,040 environments that matches human-curated quality at 13,800x lo...
- QRAFTI: An Agentic Framework for Empirical Research in Quantitative Finance
  QRAFTI is a multi-agent framework using tool-calling and reflection-based planning to emulate quant research tasks like factor replication and signal testing on financial data.
Reference graph
Works this paper leans on
- [1] PaLM: Scaling Language Modeling with Pathways
- [2] Measuring Coding Challenge Competence With APPS
- [3] TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension
- [4] Llama 2: Open Foundation and Fine-Tuned Chat Models