BIG-bench is a 204-task benchmark that measures scaling trends, calibration, and absolute limitations of language models across knowledge, reasoning, and social domains.
4 Pith papers cite this work.
citing papers explorer
- Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
  BIG-bench is a 204-task benchmark that measures scaling trends, calibration, and absolute limitations of language models across knowledge, reasoning, and social domains.
- No Epoch Like the Present: Robust Climate Emulation Requires Out-of-Distribution Generalisation
  ML climate emulators degrade under seasonal distribution shifts that proxy long-term climate change, but physically motivated compositional decompositions improve out-of-distribution performance with modest in-distribution trade-offs.
- Polar probe linearly decodes semantic structures from LLMs
  LLMs represent semantic relations geometrically via embedding distance and direction; a linear Polar Probe decodes these structures from middle-layer activations and generalizes to new entities.
- How Do Language Models Compose Functions?
  LLMs solve compositional factual recall either by computing intermediates or directly, with mechanism choice correlated to translation geometry in embedding spaces.