AI model builders mostly highlight unique benchmarks that act as flexible narrative tools for market positioning rather than standardized scientific measurements.
Levels of AGI: Operationalizing progress on the path to AGI.arXiv preprint arXiv:2311.02462
6 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 3polarities
background 3representative citing papers
MMMU provides 11.5K heterogeneous college-level multimodal questions that current models solve at 56-59% accuracy, establishing a new standard for expert multimodal evaluation.
GeoMMBench reveals deficiencies in current multimodal LLMs for geoscience tasks while GeoMMAgent demonstrates that tool-integrated agents achieve significantly higher performance.
MMLU-Pro is a revised benchmark that makes language model evaluation harder and more stable by using ten options per question and emphasizing reasoning over simple knowledge recall.
PCA on AI model benchmarks reveals a general intelligence factor that rises then falls as specialized reasoning models appear, inverting the expected move toward parsimonious mechanisms.
Human-AI coexistence is best modeled as conditional mutualism under governance, formalized as a multiplex dynamical system whose simulations show stable high-coexistence equilibria only under balanced institutional oversight.
citing papers explorer
-
Unsteady Metrics and Benchmarking Cultures of AI Model Builders
AI model builders mostly highlight unique benchmarks that act as flexible narrative tools for market positioning rather than standardized scientific measurements.
-
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
MMMU provides 11.5K heterogeneous college-level multimodal questions that current models solve at 56-59% accuracy, establishing a new standard for expert multimodal evaluation.
-
GeoMMBench and GeoMMAgent: Toward Expert-Level Multimodal Intelligence in Geoscience and Remote Sensing
GeoMMBench reveals deficiencies in current multimodal LLMs for geoscience tasks while GeoMMAgent demonstrates that tool-integrated agents achieve significantly higher performance.
-
MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark
MMLU-Pro is a revised benchmark that makes language model evaluation harder and more stable by using ten options per question and emphasizing reasoning over simple knowledge recall.
-
The Rise and Fall of $G$ in AGI
PCA on AI model benchmarks reveals a general intelligence factor that rises then falls as specialized reasoning models appear, inverting the expected move toward parsimonious mechanisms.
-
A Co-Evolutionary Theory of Human-AI Coexistence: Mutualism, Governance, and Dynamics in Complex Societies
Human-AI coexistence is best modeled as conditional mutualism under governance, formalized as a multiplex dynamical system whose simulations show stable high-coexistence equilibria only under balanced institutional oversight.