4 Pith papers cite this work.
citing papers explorer
- Agent Island: A Saturation- and Contamination-Resistant Benchmark from Multiagent Games
  Agent Island is a new multiagent game environment that functions as a dynamic benchmark resistant to saturation and contamination, with Bayesian ranking showing OpenAI GPT-5.5 as the strongest performer among 49 models across 999 games.
- Latent Preference Modeling for Cross-Session Personalized Tool Calling
  Introduces MPT benchmark and PRefine method that models user preferences as evolving hypotheses to improve personalized tool calling accuracy with 1.24% of full-history token cost.
- Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning
  Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.
- Cooperative Profiles Predict Multi-Agent LLM Team Performance in AI for Science Workflows
  Cooperative profiles from behavioral economics games predict LLM team performance in AI-for-science workflows.