Agent Island is a new multiagent game environment that functions as a dynamic benchmark resistant to saturation and contamination, with Bayesian ranking showing OpenAI GPT-5.5 as the strongest performer among 49 models across 999 games.
Title resolution pending
4 Pith papers cite this work. Polarity classification is still indexing.
4
Pith papers citing it
representative citing papers
Introduces MPT benchmark and PRefine method that models user preferences as evolving hypotheses to improve personalized tool calling accuracy with 1.24% of full-history token cost.
Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.
Cooperative profiles from behavioral economics games predict LLM team performance in AI-for-science workflows.
citing papers explorer
-
Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning
Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.