arxiv: 2206.04615 · v3 · submitted 2022-06-09 · 💻 cs.CL · cs.AI· cs.CY· cs.LG· stat.ML

Recognition: 2 theorem links

· Lean Theorem

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

Aarohi Srivastava , Abhinav Rastogi , Abhishek Rao , Abu Awal Md Shoeb , Abubakar Abid , Adam Fisch , Adam R. Brown , Adam Santoro

show 442 more authors

Aditya Gupta Adri\`a Garriga-Alonso Agnieszka Kluska Aitor Lewkowycz Akshat Agarwal Alethea Power Alex Ray Alex Warstadt Alexander W. Kocurek Ali Safaya Ali Tazarv Alice Xiang Alicia Parrish Allen Nie Aman Hussain Amanda Askell Amanda Dsouza Ambrose Slone Ameet Rahane Anantharaman S. Iyer Anders Andreassen Andrea Madotto Andrea Santilli Andreas Stuhlm\"uller Andrew Dai Andrew La Andrew Lampinen Andy Zou Angela Jiang Angelica Chen Anh Vuong Animesh Gupta Anna Gottardi Antonio Norelli Anu Venkatesh Arash Gholamidavoodi Arfa Tabassum Arul Menezes Arun Kirubarajan Asher Mullokandov Ashish Sabharwal Austin Herrick Avia Efrat Aykut Erdem Ayla Karaka\c{s} B. Ryan Roberts Bao Sheng Loe Barret Zoph Bart{\l}omiej Bojanowski Batuhan \"Ozyurt Behnam Hedayatnia Behnam Neyshabur Benjamin Inden Benno Stein Berk Ekmekci Bill Yuchen Lin Blake Howald Bryan Orinion Cameron Diao Cameron Dour Catherine Stinson Cedrick Argueta C\'esar Ferri Ram\'irez Chandan Singh Charles Rathkopf Chenlin Meng Chitta Baral Chiyu Wu Chris Callison-Burch Chris Waites Christian Voigt Christopher D. Manning Christopher Potts Cindy Ramirez Clara E. Rivera Clemencia Siro Colin Raffel Courtney Ashcraft Cristina Garbacea Damien Sileo Dan Garrette Dan Hendrycks Dan Kilman Dan Roth Daniel Freeman Daniel Khashabi Daniel Levy Daniel Mosegu\'i Gonz\'alez Danielle Perszyk Danny Hernandez Danqi Chen Daphne Ippolito Dar Gilboa David Dohan David Drakard David Jurgens Debajyoti Datta Deep Ganguli Denis Emelin Denis Kleyko Deniz Yuret Derek Chen Derek Tam Dieuwke Hupkes Diganta Misra Dilyar Buzan Dimitri Coelho Mollo Diyi Yang Dong-Ho Lee Dylan Schrader Ekaterina Shutova Ekin Dogus Cubuk Elad Segal Eleanor Hagerman Elizabeth Barnes Elizabeth Donoway Ellie Pavlick Emanuele Rodola Emma Lam Eric Chu Eric Tang Erkut Erdem Ernie Chang Ethan A. Chi Ethan Dyer Ethan Jerzak Ethan Kim Eunice Engefu Manyasi Evgenii Zheltonozhskii Fanyue Xia Fatemeh Siar Fernando Mart\'inez-Plumed Francesca Happ\'e Francois Chollet Frieda Rong Gaurav Mishra Genta Indra Winata Gerard de Melo Germ\'an Kruszewski Giambattista Parascandolo Giorgio Mariani Gloria Wang Gonzalo Jaimovitch-L\'opez Gregor Betz Guy Gur-Ari Hana Galijasevic Hannah Kim Hannah Rashkin Hannaneh Hajishirzi Harsh Mehta Hayden Bogar Henry Shevlin Hinrich Sch\"utze Hiromu Yakura Hongming Zhang Hugh Mee Wong Ian Ng Isaac Noble Jaap Jumelet Jack Geissinger Jackson Kernion Jacob Hilton Jaehoon Lee Jaime Fern\'andez Fisac James B. Simon James Koppel James Zheng James Zou Jan Koco\'n Jana Thompson Janelle Wingfield Jared Kaplan Jarema Radom Jascha Sohl-Dickstein Jason Phang Jason Wei Jason Yosinski Jekaterina Novikova Jelle Bosscher Jennifer Marsh Jeremy Kim Jeroen Taal Jesse Engel Jesujoba Alabi Jiacheng Xu Jiaming Song Jillian Tang Joan Waweru John Burden John Miller John U. Balis Jonathan Batchelder Jonathan Berant J\"org Frohberg Jos Rozen Jose Hernandez-Orallo Joseph Boudeman Joseph Guerr Joseph Jones Joshua B. Tenenbaum Joshua S. Rule Joyce Chua Kamil Kanclerz Karen Livescu Karl Krauth Karthik Gopalakrishnan Katerina Ignatyeva Katja Markert Kaustubh D. Dhole Kevin Gimpel Kevin Omondi Kory Mathewson Kristen Chiafullo Ksenia Shkaruta Kumar Shridhar Kyle McDonell Kyle Richardson Laria Reynolds Leo Gao Li Zhang Liam Dugan Lianhui Qin Lidia Contreras-Ochando Louis-Philippe Morency Luca Moschella Lucas Lam Lucy Noble Ludwig Schmidt Luheng He Luis Oliveros Col\'on Luke Metz L\"utfi Kerem \c{S}enel Maarten Bosma Maarten Sap Maartje ter Hoeve Maheen Farooqi Manaal Faruqui Mantas Mazeika Marco Baturan Marco Marelli Marco Maru Maria Jose Ram\'irez Quintana Marie Tolkiehn Mario Giulianelli Martha Lewis Martin Potthast Matthew L. Leavitt Matthias Hagen M\'aty\'as Schubert Medina Orduna Baitemirova Melody Arnaud Melvin McElrath Michael A. Yee Michael Cohen Michael Gu Michael Ivanitskiy Michael Starritt Michael Strube Micha{\l} Sw\k{e}drowski Michele Bevilacqua Michihiro Yasunaga Mihir Kale Mike Cain Mimee Xu Mirac Suzgun Mitch Walker Mo Tiwari Mohit Bansal Moin Aminnaseri Mor Geva Mozhdeh Gheini Mukund Varma T Nanyun Peng Nathan A. Chi Nayeon Lee Neta Gur-Ari Krakover Nicholas Cameron Nicholas Roberts Nick Doiron Nicole Martinez Nikita Nangia Niklas Deckers Niklas Muennighoff Nitish Shirish Keskar Niveditha S. Iyer Noah Constant Noah Fiedel Nuan Wen Oliver Zhang Omar Agha Omar Elbaghdadi Omer Levy Owain Evans Pablo Antonio Moreno Casares Parth Doshi Pascale Fung Paul Pu Liang Paul Vicol Pegah Alipoormolabashi Peiyuan Liao Percy Liang Peter Chang Peter Eckersley Phu Mon Htut Pinyu Hwang Piotr Mi{\l}kowski Piyush Patil Pouya Pezeshkpour Priti Oli Qiaozhu Mei Qing Lyu Qinlang Chen Rabin Banjade Rachel Etta Rudolph Raefer Gabriel Rahel Habacker Ramon Risco Rapha\"el Milli\`ere Rhythm Garg Richard Barnes Rif A. Saurous Riku Arakawa Robbe Raymaekers Robert Frank Rohan Sikand Roman Novak Roman Sitelew Ronan LeBras Rosanne Liu Rowan Jacobs Rui Zhang Ruslan Salakhutdinov Ryan Chi Ryan Lee Ryan Stovall Ryan Teehan Rylan Yang Sahib Singh Saif M. Mohammad Sajant Anand Sam Dillavou Sam Shleifer Sam Wiseman Samuel Gruetter Samuel R. Bowman Samuel S. Schoenholz Sanghyun Han Sanjeev Kwatra Sarah A. Rous Sarik Ghazarian Sayan Ghosh Sean Casey Sebastian Bischoff Sebastian Gehrmann Sebastian Schuster Sepideh Sadeghi Shadi Hamdan Sharon Zhou Shashank Srivastava Sherry Shi Shikhar Singh Shima Asaadi Shixiang Shane Gu Shubh Pachchigar Shubham Toshniwal Shyam Upadhyay Shyamolima (Shammie) Debnath Siamak Shakeri Simon Thormeyer Simone Melzi Siva Reddy Sneha Priscilla Makini Soo-Hwan Lee Spencer Torene Sriharsha Hatwar Stanislas Dehaene Stefan Divic Stefano Ermon Stella Biderman Stephanie Lin Stephen Prasad Steven T. Piantadosi Stuart M. Shieber Summer Misherghi Svetlana Kiritchenko Swaroop Mishra Tal Linzen Tal Schuster Tao Li Tao Yu Tariq Ali Tatsu Hashimoto Te-Lin Wu Th\'eo Desbordes Theodore Rothschild Thomas Phan Tianle Wang Tiberius Nkinyili Timo Schick Timofei Kornev Titus Tunduny Tobias Gerstenberg Trenton Chang Trishala Neeraj Tushar Khot Tyler Shultz Uri Shaham Vedant Misra Vera Demberg Victoria Nyamai Vikas Raunak Vinay Ramasesh Vinay Uday Prabhu Vishakh Padmakumar Vivek Srikumar William Fedus William Saunders William Zhang Wout Vossen Xiang Ren Xiaoyu Tong Xinran Zhao Xinyi Wu Xudong Shen Yadollah Yaghoobzadeh Yair Lakretz Yangqiu Song Yasaman Bahri Yejin Choi Yichi Yang Yiding Hao Yifu Chen Yonatan Belinkov Yu Hou Yufang Hou Yuntao Bai Zachary Seid Zhuoye Zhao Zijian Wang Zijie J. Wang Zirui Wang Ziyi Wu

Authors on Pith no claims yet

Pith reviewed 2026-05-10 23:21 UTC · model grok-4.3

classification 💻 cs.CL cs.AIcs.CYcs.LGstat.ML

keywords language modelsscalingbenchmarksBIG-benchemergent abilitiesmodel evaluationsocial biascalibration

0 comments

The pith

Scale brings gradual gains on knowledge tasks but sudden breakthroughs on complex ones in language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces BIG-bench, a collection of 204 tasks designed to test abilities believed to lie beyond current language models, spanning linguistics, reasoning, science, and social domains. It evaluates a range of transformer models from millions to hundreds of billions of parameters against human expert raters on every task. Performance and calibration both rise with size yet remain far below human levels across architectures. Tasks heavy on knowledge or memorization scale smoothly and predictably, while those needing multiple steps or fragile metrics show abrupt jumps at certain sizes. Social bias often grows with scale in unclear contexts but can be reduced through prompting.

Core claim

BIG-bench evaluations demonstrate that model performance and calibration improve with scale across dense and sparse transformers, yet stay poor in absolute terms relative to human raters. Tasks improve gradually and predictably when they center on knowledge or memorization; tasks show sudden breakthroughs at critical scales when they involve multiple components or brittle metrics. Performance patterns are similar across model classes with some gains from sparsity, and social bias typically rises with scale under ambiguous conditions though prompting mitigates it.

What carries the argument

BIG-bench, a suite of 204 diverse tasks contributed by 450 authors that probes capabilities beyond those of current models and tracks how performance changes across model sizes.

If this is right

Larger models will show predictable improvement on knowledge-based tasks but may suddenly gain new abilities on multi-step tasks at certain sizes.
Calibration of model outputs will continue to improve with size yet remain unreliable compared to human judgments.
Sparse model architectures will retain a modest edge over dense ones at equivalent scales.
Social biases in model outputs will tend to increase with scale unless addressed by techniques such as prompting.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Developers may need to design new tasks focused on multi-step reasoning to better anticipate when abrupt capability jumps will occur.
The observed patterns imply that simple extrapolation from small-model trends will underestimate sudden changes in what models can do.
Maintaining human expert baselines will require ongoing updates as model performance approaches or crosses them on individual tasks.

Load-bearing premise

The 204 tasks chosen represent the capabilities that will matter for future models and human rater performance gives a stable, unbiased ceiling for comparison.

What would settle it

A follow-up evaluation on the same tasks where models exceed human raters on a majority of them or where no clear split appears between gradual and breakthrough scaling behaviors.

read the original abstract

Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-future capabilities and limitations of language models. To address this challenge, we introduce the Beyond the Imitation Game benchmark (BIG-bench). BIG-bench currently consists of 204 tasks, contributed by 450 authors across 132 institutions. Task topics are diverse, drawing problems from linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and beyond. BIG-bench focuses on tasks that are believed to be beyond the capabilities of current language models. We evaluate the behavior of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters. In addition, a team of human expert raters performed all tasks in order to provide a strong baseline. Findings include: model performance and calibration both improve with scale, but are poor in absolute terms (and when compared with rater performance); performance is remarkably similar across model classes, though with benefits from sparsity; tasks that improve gradually and predictably commonly involve a large knowledge or memorization component, whereas tasks that exhibit "breakthrough" behavior at a critical scale often involve multiple steps or components, or brittle metrics; social bias typically increases with scale in settings with ambiguous context, but this can be improved with prompting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

BIG-bench supplies a large collaborative task suite and direct scaling measurements that map where model size helps steadily versus where it produces jumps.

read the letter

The main thing to know is that this paper assembles 204 tasks chosen to be difficult for current models and then measures performance across model sizes from millions to hundreds of billions of parameters, plus human baselines. The central observations are that accuracy and calibration rise with scale but remain low overall, that gradual gains often appear on knowledge-heavy tasks, and that sharper transitions show up on multi-step or brittle ones. Social bias trends are also tracked, with some worsening at larger sizes in ambiguous cases but improvement via prompting. These patterns hold across the GPT family, dense transformers, and sparse Switch-style models, with a modest edge for sparsity. The work is new in its breadth and in the explicit focus on scaling behavior across many domains contributed by 450 people. The human rater comparisons and the public task release are practical strengths that let others build on the data directly. The claims rest on straightforward evaluation rather than derivations or fitted parameters, so the reported trends are descriptive facts from the runs performed. The soft spots are limited but real. Task selection was contributor-driven rather than drawn from a systematic sampling frame, which leaves open how well the set represents capabilities that will matter most later. Some evaluations used closed models, which reduces full reproducibility even though the tasks themselves are released. The paper notes these points without overclaiming generality. This is the sort of resource that evaluation researchers and scaling studies will want to cite for its empirical coverage. Readers who need a broad set of hard tasks or want to compare gradual versus breakthrough scaling will find concrete data here. The work is grounded enough to deserve a full referee process rather than a desk rejection.

Referee Report

0 major / 4 minor

Summary. The manuscript introduces the Beyond the Imitation Game benchmark (BIG-bench) with 204 tasks contributed by 450 authors across 132 institutions, spanning linguistics, math, reasoning, biology, social bias and other domains. It evaluates OpenAI GPT models, Google-internal dense transformers and Switch-style sparse transformers across scales from millions to hundreds of billions of parameters, supplies human expert rater baselines on all tasks, and reports that model performance and calibration improve with scale yet remain poor in absolute terms relative to humans; tasks with gradual scaling tend to involve knowledge or memorization while breakthrough scaling appears in multi-step or brittle-metric tasks; social bias tends to increase with scale under ambiguous context but can be mitigated by prompting.

Significance. If the reported empirical patterns hold, the work supplies a valuable large-scale characterization of current language-model capabilities and limitations that can inform scaling research, capability forecasting and harm mitigation. Credit is due for the multi-institutional task collection, the provision of human baselines, the explicit separation of gradual versus breakthrough scaling behaviors, and the absence of fitted parameters or circular reductions in the analysis.

minor comments (4)

[Abstract] Abstract: the list of findings is presented as a single dense sentence; reformatting the key observations as bullets would improve immediate readability for readers scanning the paper.
[Evaluation] Evaluation protocol: the manuscript should state the precise prompting templates, number of shots, and decoding parameters used for each model family so that the reported scores can be reproduced by independent groups.
[Results] Results section: performance curves are shown without error bars or statistical tests; adding these would allow readers to assess whether observed differences between model classes or scales are reliable.
[Analysis] Task categorization: the distinction between 'gradual' and 'breakthrough' tasks is described qualitatively; a short appendix listing the specific tasks falling into each category with their scaling exponents would make the claim more concrete.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive and accurate summary of our work, the recognition of its significance for scaling research and capability forecasting, and the recommendation of minor revision. No specific major comments were provided in the report, so we have no points to address point-by-point. We are prepared to incorporate any minor suggestions or clarifications if supplied by the editor or referee.

Circularity Check

0 steps flagged

No significant circularity; purely empirical benchmark

full rationale

The paper introduces the BIG-bench dataset of 204 tasks and reports direct empirical measurements of model performance across scales, model classes, and human raters. No mathematical derivations, parameter fits, or predictions are claimed; scaling trends, gradual vs. breakthrough behaviors, and bias observations are presented as descriptive results from the evaluations themselves. The central claims rest on the contributed tasks and rater baselines without reduction to prior fits or self-citation chains. This is the expected non-finding for a large-scale benchmarking effort.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper introduces no new mathematical axioms, free parameters, or invented entities; it relies on standard transformer architectures and human evaluation protocols already established in the field.

pith-pipeline@v0.9.0 · 7756 in / 1136 out tokens · 49013 ms · 2026-05-10T23:21:35.284290+00:00 · methodology

discussion (0)

Forward citations

Cited by 52 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

LiveBench: A Challenging, Contamination-Limited LLM Benchmark
cs.CL 2024-06 unverdicted novelty 8.0

LiveBench is a contamination-limited LLM benchmark with auto-scored challenging tasks from recent sources across math, coding, reasoning and more, where top models score below 70%.
Progress measures for grokking via mechanistic interpretability
cs.LG 2023-01 accept novelty 8.0

Grokking arises from gradual amplification of a Fourier-based circuit in the weights followed by removal of memorizing components.
NARRA-Gym for Evaluating Interactive Narrative Agents
cs.CL 2026-05 unverdicted novelty 7.0

NARRA-Gym is an executable benchmark that generates complete interactive narrative episodes from emotional seeds and logs full model trajectories to expose gaps in coherence, adaptation, and personalization that stati...
TRIP-Evaluate: An Open Multimodal Benchmark for Evaluating Large Models in Transportation
cs.CV 2026-04 accept novelty 7.0

TRIP-Evaluate is a new open multimodal benchmark with 837 text, image, and point-cloud items organized by a role-task-knowledge taxonomy to evaluate large models on transportation workflows.
FrontierFinance: A Long-Horizon Computer-Use Benchmark of Real-World Financial Tasks
cs.CL 2026-04 unverdicted novelty 7.0

FrontierFinance benchmark shows human financial experts outperform state-of-the-art LLMs by achieving higher scores and more client-ready outputs on realistic long-horizon tasks.
KTO: Model Alignment as Prospect Theoretic Optimization
cs.LG 2024-02 conditional novelty 7.0

KTO aligns LLMs by directly maximizing prospect-theoretic utility on binary signals and matches or exceeds preference-based methods like DPO from 1B to 30B parameters.
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
cs.LG 2024-01 conditional novelty 7.0

Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.
C-Pack: Packed Resources For General Chinese Embeddings
cs.CL 2023-09 accept novelty 7.0

C-Pack releases a new Chinese embedding benchmark, large training dataset, and optimized models that outperform priors by up to 10% on C-MTEB while also delivering English SOTA results.
Large Language Models as Optimizers
cs.LG 2023-09 unverdicted novelty 7.0

Large language models can optimize by being prompted with histories of past solutions and scores to propose better ones, producing prompts that raise accuracy up to 8% on GSM8K and 50% on Big-Bench Hard over human-des...
gwBenchmarks: Stress-Testing LLM Agents on High-Precision Gravitational Wave Astronomy
gr-qc 2026-05 unverdicted novelty 6.0

LLM coding agents cannot reach the 10^{-4} relative accuracy required for gravitational wave modeling tasks and show systematic failures including metric misuse and result fabrication.
OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces
cs.AI 2026-05 unverdicted novelty 6.0

OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.
Temporal Reasoning Is Not the Bottleneck: A Probabilistic Inconsistency Framework for Neuro-Symbolic QA
cs.AI 2026-05 unverdicted novelty 6.0

Temporal reasoning is not the core bottleneck for LLMs on time-based QA; the real issue is unstructured text-to-event mapping, addressed by a neuro-symbolic system with PIS that reaches 100% accuracy on benchmarks whe...
A Meta Reinforcement Learning Approach to Goals-Based Wealth Management
cs.LG 2026-05 unverdicted novelty 6.0

MetaRL pre-trained on GBWM problems delivers near-optimal dynamic strategies in 0.01s achieving 97.8% of DP optimal utility and handles larger problems where DP fails.
Evaluating Agentic AI in the Wild: Failure Modes, Drift Patterns, and a Production Evaluation Framework
cs.AI 2026-05 unverdicted novelty 6.0

The paper presents a taxonomy of seven production-specific failure modes for agentic AI, demonstrates that existing metrics fail to detect four of them entirely, and proposes the PAEF five-dimension framework for cont...
AIT Academy: Cultivating the Complete Agent with a Confucian Three-Domain Curriculum
cs.AI 2026-04 unverdicted novelty 6.0

AIT Academy introduces a tripartite curriculum for AI agents across natural science, humanities, and social science domains, with reported gains of 15.9 points in security and 7 points in social reasoning under specif...
QuickScope: Certifying Hard Questions in Dynamic LLM Benchmarks
cs.CL 2026-04 unverdicted novelty 6.0

QuickScope uses modified COUP Bayesian optimization to find truly difficult questions in dynamic LLM benchmarks more sample-efficiently than baselines while cutting false positives.
Beyond Static Snapshots: A Grounded Evaluation Framework for Language Models at the Agentic Frontier
cs.AI 2026-04 unverdicted novelty 6.0

ISOPro replaces learned reward models with deterministic verifiers in a continuous evaluation setup for LLMs, delivering larger average capability gains than GRPO-LoRA across small models in scheduling and MBPP domain...
Parcae: Scaling Laws For Stable Looped Language Models
cs.LG 2026-04 unverdicted novelty 6.0

Parcae stabilizes looped LLMs via spectral norm constraints on injection parameters, enabling power-law scaling for training FLOPs and saturating exponential scaling at test time that improves quality over fixed-depth...
The A-R Behavioral Space: Execution-Level Profiling of Tool-Using Language Model Agents in Organizational Deployment
cs.AI 2026-04 unverdicted novelty 6.0

Execution and refusal in tool-using LLM agents form separable behavioral dimensions whose joint distribution shifts systematically with normative regimes and autonomy scaffolding.
Measuring Representation Robustness in Large Language Models for Geometry
cs.CL 2026-04 unverdicted novelty 6.0

LLMs display accuracy gaps of up to 14 percentage points on the same geometry problems solely due to representation choice, with vector forms consistently weakest and a convert-then-solve prompt helping only high-capa...
Memory in the Age of AI Agents
cs.CL 2025-12 unverdicted novelty 6.0

The paper maps agent memory research via three forms (token-level, parametric, latent), three functions (factual, experiential, working), and dynamics of formation/evolution/retrieval, plus benchmarks and future directions.
Towards an AI co-scientist
cs.AI 2025-02 unverdicted novelty 6.0

A multi-agent AI system generates novel biomedical hypotheses that show promising experimental validation in drug repurposing for leukemia, new targets for liver fibrosis, and a bacterial gene transfer mechanism.
MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark
cs.CL 2024-06 conditional novelty 6.0

MMLU-Pro is a revised benchmark that makes language model evaluation harder and more stable by using ten options per question and emphasizing reasoning over simple knowledge recall.
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
cs.CL 2024-04 accept novelty 6.0

Phi-3-mini (3.8B params, 3.3T tokens) reaches 69% MMLU and 8.38 MT-bench, matching larger models, with scaled-up 7B/14B variants and phi-3.5 extensions for multilingual, MoE, and vision capabilities.
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
cs.SE 2024-03 unverdicted novelty 6.0

LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.
Reinforced Self-Training (ReST) for Language Modeling
cs.CL 2023-08 unverdicted novelty 6.0

ReST improves LLM translation quality on benchmarks via offline RL on self-generated data, achieving gains in a compute-efficient way compared to typical RLHF.
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
cs.CL 2023-06 accept novelty 6.0

GPT-4 as an LLM judge achieves over 80% agreement with human preferences on MT-Bench and Chatbot Arena, matching human agreement levels and providing a scalable evaluation method.
Improving Factuality and Reasoning in Language Models through Multiagent Debate
cs.CL 2023-05 unverdicted novelty 6.0

Multiagent debate among LLMs improves mathematical reasoning, strategic reasoning, and factual accuracy while reducing hallucinations.
BloombergGPT: A Large Language Model for Finance
cs.LG 2023-03 conditional novelty 6.0

BloombergGPT is a 50B parameter LLM trained on a 708B token mixed financial and general dataset that outperforms prior models on financial benchmarks while preserving general LLM performance.
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
cs.CL 2022-11 unverdicted novelty 6.0

BLOOM is a 176B-parameter open-access multilingual language model trained on the ROOTS corpus that achieves competitive performance on benchmarks, with improved results after multitask prompted finetuning.
Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them
cs.CL 2022-10 accept novelty 6.0

Chain-of-thought prompting enables large language models to surpass average human performance on 17 of 23 challenging BIG-Bench tasks.
Language Models (Mostly) Know What They Know
cs.CL 2022-07 unverdicted novelty 6.0

Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.
Emergent Abilities of Large Language Models
cs.CL 2022-06 unverdicted novelty 6.0

Emergent abilities are capabilities present in large language models but absent in smaller ones and cannot be predicted by extrapolating smaller model performance.
The Efficiency Gap in Byte Modeling
cs.LG 2026-05 unverdicted novelty 5.0

Byte modeling incurs greater scaling overhead for masked diffusion than autoregressive models because the diffusion objective destroys local byte contiguity needed to resolve semantics.
Task-Aware Answer Preservation under Audio Compression for Large Audio Language Models
eess.AS 2026-05 unverdicted novelty 5.0

A statistical sign-off protocol for audio compressors ensures worst-case answer preservation across query families in LALMs.
Complexity Horizons of Compressed Models in Analog Circuit Analysis
cs.AI 2026-05 unverdicted novelty 5.0

Prerequisite graphs map compressed LLM performance boundaries in analog circuit analysis to allow selecting the smallest viable model for a given task complexity.
Calibrating Model-Based Evaluation Metrics for Summarization
cs.CL 2026-04 unverdicted novelty 5.0

A reference-free proxy scoring framework combined with GIRB calibration produces better-aligned evaluation metrics for summarization and outperforms baselines across seven datasets.
Consistency Analysis of Sentiment Predictions using Syntactic & Semantic Context Assessment Summarization (SSAS)
cs.CL 2026-04 unverdicted novelty 5.0

SSAS improves LLM sentiment prediction consistency and data quality by up to 30% on three review datasets via syntactic and semantic context assessment summarization.
Empirical Evidence of Complexity-Induced Limits in Large Language Models on Finite Discrete State-Space Problems with Explicit Validity Constraints
cs.CL 2026-04 unverdicted novelty 5.0

Large reasoning models exhibit reasoning collapse, with accuracy dropping sharply beyond task-specific complexity thresholds in controlled versions of nine classical reasoning tasks using strict validity validators.
When Models Know More Than They Say: Probing Analogical Reasoning in LLMs
cs.CL 2026-04 unverdicted novelty 5.0

Probing shows LLMs hold more analogical knowledge internally than prompting reveals, with a task-dependent asymmetry between rhetorical and narrative cases.
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs
cs.CL 2025-03 unverdicted novelty 5.0

Phi-4-Mini achieves strong math and coding performance with only 3.8B parameters via high-quality synthetic data, while Phi-4-Multimodal uses Mixture-of-LoRAs to integrate modalities and top speech recognition leaderboards.
Humanity's Last Exam
cs.LG 2025-01 unverdicted novelty 5.0

Humanity's Last Exam is a new 2,500-question benchmark at the frontier of human knowledge where state-of-the-art LLMs show low accuracy.
The Platonic Representation Hypothesis
cs.LG 2024-05 unverdicted novelty 5.0

Representations learned by large AI models are converging toward a shared statistical model of reality.
PaLM 2 Technical Report
cs.CL 2023-05 unverdicted novelty 5.0

PaLM 2 reports state-of-the-art results on language, reasoning, and multilingual tasks with improved efficiency over PaLM.
Galactica: A Large Language Model for Science
cs.CL 2022-11 unverdicted novelty 5.0

Galactica, a science-specialized LLM, reports higher scores than GPT-3, Chinchilla, and PaLM on LaTeX knowledge, mathematical reasoning, and medical QA benchmarks while outperforming general models on BIG-bench.
BenCSSmark: Making the Social Sciences Count in LLM Research
cs.CL 2026-05 unverdicted novelty 4.0

BenCSSmark is a proposed benchmark that adds social science datasets to LLM evaluation to improve model robustness and relevance across disciplines like sociology and economics.
Leveraging Weighted Syntactic and Semantic Context Assessment Summary (wSSAS) Towards Text Categorization Using LLMs
cs.CL 2026-04 unverdicted novelty 4.0

wSSAS is a two-phase deterministic framework that uses hierarchical text organization and SNR-based feature prioritization to improve clustering integrity, categorization accuracy, and reproducibility when applying LL...
From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review
cs.AI 2025-04 accept novelty 4.0

A survey consolidating benchmarks, agent frameworks, real-world applications, and protocols for LLM-based autonomous agents into a proposed taxonomy with recommendations for future research.
TinyLlama: An Open-Source Small Language Model
cs.CL 2024-01 accept novelty 4.0

TinyLlama is a 1.1B-parameter open-source language model pretrained on 1 trillion tokens that outperforms other open-source models of similar size on downstream tasks.
ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools
cs.CL 2024-06 unverdicted novelty 3.0

GLM-4 models rival or exceed GPT-4 on MMLU, GSM8K, MATH, BBH, GPQA, HumanEval, IFEval, long-context tasks, and Chinese alignment while adding autonomous tool use for web, code, and image generation.
A Survey of Large Language Models
cs.CL 2023-03 accept novelty 3.0

This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.
A Survey on In-context Learning
cs.CL 2022-12 unverdicted novelty 3.0

The paper surveys definitions, techniques, applications, and challenges in in-context learning for large language models.

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · cited by 52 Pith papers · 1 internal anchor

[1]

Amini, S

URL https://arxiv.org/abs/1808.01400. (cited on p. 30) Uri Alon, Roy Sadaka, Omer Levy, and Eran Yahav. Structural language models of code. In Hal Daumé III and Aarti Singh (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 ofProceedings of Machine Learning Research, pp. 245–256. PMLR, 13–18 July 2020. URLhttps://proc...

work page doi:10.18653/v1/n19-1245 2020
[2]

(cited on p

URL https://arxiv.org/abs/1606.06565. (cited on p. 40) Brandon Amos and J. Zico Kolter. Optnet: Differentiable optimization as a layer in neural networks, 2017. URLhttps: //arxiv.org/abs/1703.00443. (cited on p. 38) Philip W. Anderson. More is different.Science, 177(4047):393–396, 1972. doi: 10.1126/science.177.4047.393. URLhttps: //www.science.org/doi/ab...

work page doi:10.1126/science.177.4047.393 2017
[3]

(cited on p

URL https://arxiv.org/abs/2001.08435. (cited on p. 39) Nihat Bayat and Gökhan Çetinkaya. The relationship between inference skills and reading comprehension.TED EĞİTİM VE BİLİM (Education and Science), 45(203):177–190, 2020. doi: 10.15390/EB.2020.8782. URLhttp://egitimvebilim.ted.org. tr/index.php/EB/article/view/8782. (cited on p. 34) Mayur J. Bency, Ahm...

work page doi:10.15390/eb.2020.8782 2001
[4]

On the Opportunities and Risks of Foundation Models

URL https://arxiv.org/abs/2108.07258. (cited on p. 4) Shikha Bordia and Samuel R. Bowman. Identifying and reducing gender bias in word-level language models, 2019. URL https://arxiv.org/abs/1904.03035. (cited on p. 33) Antoine Bosselut, Hannah Rashkin, Maarten Sap, Chaitanya Malaviya, Asli Celikyilmaz, and Yejin Choi. COMET: Commonsense transformers for a...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1109/tvcg.2011.185 2019
[5]

doi: 10.18653/v1/W18-6433

Association for Computational Linguistics. doi: 10.18653/v1/W18-6433. URLhttps://aclanthology.org/W18-6433. (cited on p. 39) Corrado Böhm. On a family of Turing machines and the related programming language.ICC Bulletin, 3:187–194, 1964. (cited on p. 38) Kate Cain and Jane V. Oakhill. Inference making ability and its relation to comprehension failure.Read...

work page doi:10.18653/v1/w18-6433 1964
[6]

Simplicity: a unifying principle in cognitive science? , volume =

doi: 10.1016/S1364-6613(02)00005-0. URL https://doi.org/10.1016/S1364-6613(02)00005-0. (cited on p. 38) Antonio Chella, Arianna Pipitone, Alain Morin, and Famira Racy. Developing self-awareness in robots via inner speech.Frontiers in Robotics and AI, 7, 2020. doi: 10.3389/frobt.2020.00016. URLhttps://www.frontiersin.org/article/10.3389/frobt. 2020.00016. ...

work page doi:10.1016/s1364-6613(02)00005-0 2020
[7]

doi: 10.18653/v1/W19-3824

Association for Computational Linguistics. doi: 10.18653/v1/W19-3824. URLhttps://aclanthology.org/W19-3824. (cited on p. 31) Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-tau Yih, Yejin Choi, Percy Liang, and Luke Zettlemoyer. QuAC: Question answering in context, 2018. URLhttps://arxiv.org/abs/1808.07036. (cited on p. 40) François Chollet. On the mea...

work page doi:10.18653/v1/w19-3824 2018
[8]

doi: 10.1007/978-3-319-40566-7_4

Springer. doi: 10.1007/978-3-319-40566-7_4. URL https://doi.org/10.1007/978-3-319-40566-7_4. (cited on p. 36) Andrew Cropper, Rolf Morel, and Stephen Muggleton. Learning higher-order logic programs.Machine Learning, 109:1289–1322,

work page doi:10.1007/978-3-319-40566-7_4
[9]

overinformative

doi: 10.1007/s10994-019-05862-7. URL https://doi.org/10.1007/s10994-019-05862-7. (cited on p. 34) Joe Cruse. Emoji usage in TV conversation.Twitter blog, 18 Nov 2015. URLhttps://blog.twitter.com/en_us/a/2015/emoji- usage-in-tv-conversation. (cited on p. 31) Marc-Alexandre Côté, Ákos Kádár, Xingdi Yuan, Ben Kybartas, Tavian Barnes, Emery Fine, James Moore,...

work page doi:10.1007/s10994-019-05862-7 2015
[10]

(cited on p

URL https://arxiv.org/abs/1707.03904. (cited on p. 33) Kaustubh Dhole, Gurdeep Singh, Priyadarshini P. Pai, and Sukanta Mondal. Sequence-based prediction of protein–protein interaction sites with l1-logreg classifier.Journal of Theoretical Biology, 348:47–54, 2014. doi: 10.1016/j.jtbi.2014.01.028. URL https://pubmed.ncbi.nlm.nih.gov/24486250/. (cited on p...

work page doi:10.1016/j.jtbi.2014.01.028 2014
[11]

(cited on p

URL https://arxiv.org/abs/1910.02227. (cited on p. 32) Matan Eyal, Tal Baumel, and Michael Elhadad. Question answering as an automatic evaluation metric for news article summarization. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short ...

work page arXiv 1910
[12]

doi: 10.18653/v1/N19-1395

Association for Computational Linguistics. doi: 10.18653/v1/N19-1395. URLhttps://aclanthology.org/N19-1395. (cited on p. 32) Felix Faltings, Michel Galley, Gerold Hintz, Chris Brockett, Chris Quirk, Jianfeng Gao, and Bill Dolan. Text editing by command,

work page doi:10.18653/v1/n19-1395
[13]

Hierarchical neural story generation

URL https://arxiv.org/abs/2010.12826. (cited on p. 39) Angela Fan, Mike Lewis, and Yann Dauphin. Hierarchical neural story generation. InProceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 889–898, Melbourne, Australia, July 2018. Association for Computational Linguistics. doi: 10.18653/v1/P...

work page doi:10.18653/v1/p18-1082 2010
[14]

Fodor and Zenon W

doi: https://doi.org/10.1016/0010-0277(88)90031-5. URL https://www.sciencedirect.com/science/article/pii/ 0010027788900315. (cited on p. 30) 63 Mark Forsyth.The Elements of Eloquence: Secrets of the Perfect Turn of Phrase. Berkley, New York, 2014. (cited on p. 33) Lea Frermann, Shay B. Cohen, and Mirella Lapata. Whodunnit? Crime drama as a case for natura...

work page doi:10.1016/0010-0277(88)90031-5 2014
[15]

doi: 10.5555/1625275.1625535

Morgan Kaufmann. doi: 10.5555/1625275.1625535. URLhttps://dl.acm.org/doi/10.5555/1625275.1625535. (cited on p. 36) Deep Ganguli, Danny Hernandez, Liane Lovitt, Nova DasSarma, Tom Henighan, Andy Jones, Nicholas Joseph, Jackson Kernion, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Dawn Drain, Nelson Elhage, Sheer El Showk, Stanislav Fort, Za...

work page doi:10.5555/1625275.1625535 2022
[16]

(cited on p

URL https://arxiv.org/abs/2109.06838. (cited on p. 28) Edward Gibson, Richard Futrell, Julian Jara-Ettinger, Kyle Mahowald, Leon Bergen, Sivalogeswaran Ratnasingam, Mitchell Gibson, Steven T. Piantadosi, and Bevil R. Conway. Color naming across languages reflects color use.Proceedings of the National Academy of Sciences, 114(40):10785–10790, 2017. doi: 10...

work page doi:10.1073/pnas.1619666114 2017
[17]

doi: 10.18653/v1/N19-1061

Association for Computational Linguistics. doi: 10.18653/v1/N19-1061. URLhttps://aclanthology.org/N19-1061. (cited on p. 33) Roberto González-Ibáñez, Smaranda Muresan, and Nina Wacholder. Identifying sarcasm in Twitter: A closer look. InProceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp...

work page doi:10.18653/v1/n19-1061 2011
[18]

URL https://doi.org/10.35111/0z6y-q265

doi: 10.35111/0z6y-q265. URL https://doi.org/10.35111/0z6y-q265. (cited on p. 5) Alex Graves, Greg Wayne, and Ivo Danihelka. Neural Turing machines, 2014. URLhttps://arxiv.org/abs/1410.5401. (cited on pp. 34 and 38) Alex Graves, Greg Wayne, Malcom Reynolds, Tim Harley, Ivo Danihelka, Agnieszka Grabska-Barwińska, Sergio Gómez Col- menarejo, Edward Grefenst...

work page doi:10.35111/0z6y-q265 2014
[19]

URL https://doi.org/10.1145/1925844.1926423

doi: 10.1145/1925844.1926423. URL https://doi.org/10.1145/1925844.1926423. (cited on p. 36) Sumit Gulwani, William R. Harris, and Rishabh Singh. Spreadsheet data manipulation using examples.Commun. ACM, 55(8): 97–105, Aug. 2012. doi: 10.1145/2240236.2240260. URLhttps://doi.org/10.1145/2240236.2240260. (cited on p. 36) Sumit Gulwani, José Hernández-Orallo,...

work page doi:10.1145/1925844.1926423 2012
[20]

(cited on p

URL https://link.springer.com/article/10.1007/BF02172093. (cited on p. 39) F. Maxwell Harper and Joseph A. Konstan. The MovieLens datasets: History and context.ACM Trans. Interact. Intell. Syst., 5 (4), Dec. 2015. doi: 10.1145/2827872. URLhttps://doi.org/10.1145/2827872. (cited on p. 36) Behnam Hedayatnia, Karthik Gopalakrishnan, Seokhwan Kim, Yang Liu, M...

work page doi:10.1007/bf02172093 2015
[21]

Heine, and Ara Norenzayan

Springer. URL https://www.ecva.net/papers/eccv_2018/papers_ECCV/papers/Lisa_Anne_Hendricks_Women_also_ Snowboard_ECCV_2018_paper.pdf. (cited on p. 37) Dan Hendrycks and Kevin Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. In International Conference on Learning Representations, 2017. URL https://openrev...

work page doi:10.1017/s0140525x0999152x 2017
[22]

29) China Household Management Research Center, Ministry of Public Security

(cited on p. 29) China Household Management Research Center, Ministry of Public Security. National name report 2018. 2019. http: //news.cpd.com.cn/n18151/201901/t20190130_830962.html (Accessed 3 March 2021). (cited on p. 33) China Household Management Research Center, Ministry of Public Security. National name report 2019. 2020. https: //www.mps.gov.cn/n2...

work page doi:10.2478/9783110410167/html 2018
[23]

(cited on p

URL https://instagram-engineering.com/emojineering-part-1-machine-learning-for-emoji-trendsmachine- learning-for-emoji-trends-7f5f9cb979ad. (cited on p. 31) Daphne Ippolito, Daniel Duckworth, Chris Callison-Burch, and Douglas Eck. Automatic detection of generated text is easiest when humans are fooled. InProceedings of the 58th Annual Meeting of the Assoc...

work page doi:10.18653/v1/2020.acl-main.164 2020
[24]

(cited on p

URL https://arxiv.org/abs/2007.01282. (cited on p. 38) 69 Kushal Jain, Adwait Deshpande, Kumar Shridhar, Felix Laumann, and Ayushman Dash. Indic-transformers: An analysis of transformer language models for indian languages, 2020. URLhttps://arxiv.org/abs/2011.02323. (cited on p. 33) Mario Jarmasz. Roget’s Thesaurus as a lexical resource for natural langua...

work page doi:10.18653/v1/2020.acl-main.232 2007
[25]

(cited on p

URL https://arxiv.org/abs/2005.01229. (cited on p. 41) Aditya Joshi, Vinita Sharma, and Pushpak Bhattacharyya. Harnessing context incongruity for sarcasm detection. InProceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pp...

work page doi:10.3115/v1/p15-2124 2005
[26]

Jizhan Fang, Xinle Deng, Haoming Xu, Ziyan Jiang, Yuqi Tang, Ziwen Xu, Shumin Deng, Yunzhi Yao, Mengru Wang, Shuofei Qiao, Huajun Chen, and Ningyu Zhang

doi: 10.1162/tacl_a_00023. URL https://aclanthology.org/Q18-1023. (cited on p. 35) Jan Kocoń, Piotr Miłkowski, and Kamil Kanclerz. MultiEmo: Multilingual, multilevel, multidomain sentiment analysis corpus of consumer reviews. In Maciej Paszynski, Dieter Kranzlmüller, Valeria V. Krzhizhanovskaya, Jack J. Dongarra, and Peter M. A. Sloot (eds.),Computational...

work page doi:10.1162/tacl_a_00023 2021
[27]

URL https://doi.org/10.1007/s10992-020-09581-6

doi: 10.1007/s10992-020-09581-6. URL https://doi.org/10.1007/s10992-020-09581-6. (cited on p. 29) Alexander W. Kocurek, Ethan Jerzak, and Rachel Etta Rudolph. Against conventional wisdom.Philosophers’ Imprint, 20(22): 1–27, 2020. URLhttp://hdl.handle.net/2027/spo.3521354.0020.022. (cited on p. 29) Moshe Koppel and Jonathan Schler. Authorship verification ...

work page doi:10.1007/s10992-020-09581-6 2020
[28]

(cited on p

URL https://arxiv.org/abs/2101.00379. (cited on p. 30) Patrick Lewis, Barlas Oguz, Ruty Rinott, Sebastian Riedel, and Holger Schwenk. MLQA: Evaluating cross-lingual extractive question answering. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 7315–7330, Online, July 2020a. Association for Computational Ling...

work page doi:10.18653/v1/2020.acl-main.653 2020
[29]

doi: 10.18653/v1/W19-3005

Association for Computational Linguistics. doi: 10.18653/v1/W19-3005. URLhttps://aclanthology.org/W19-3005. (cited on p. 39) Chandler May, Alex Wang, Shikha Bordia, Samuel R. Bowman, and Rachel Rudinger. On measuring social biases in sentence encoders. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational...

work page doi:10.18653/v1/w19-3005 2019
[30]

and Rudinger, Rachel

Association for Computational Linguistics. doi: 10.18653/v1/N19-1063. URLhttps://aclanthology.org/N19-1063. (cited on p. 31) Andrew Mayne. OpenAI API alchemy: Emoji storytelling.Andrew Mayne blog, 24 June 2020. URLhttps://andrewmayneblog. wordpress.com/2020/06/24/open-ai-alchemy-emoji-storytelling/. (cited on p. 31) Joshua Maynez, Shashi Narayan, Bernd Bo...

work page doi:10.18653/v1/n19-1063 2020
[31]

Andere zeiten, andere lehren

URL https://arxiv.org/abs/2005.00661. (cited on pp. 30 and 40) Eric Mays, Fred J. Damerau, and Robert L. Mercer. Context based spelling correction.Information Processing & Management, 27(5):517–522, 1991. doi: https://doi.org/10.1016/0306-4573(91)90066-U. URLhttps://www.sciencedirect.com/science/ article/pii/030645739190066U. (cited on p. 41) Momoh Karmah...

work page doi:10.1016/0306-4573(91)90066-u 2005
[32]

31) David Milne and Ian H

(cited on p. 31) David Milne and Ian H. Witten. An effective, low-cost measure of semantic relatedness obtained from Wikipedia links. In Proceeding of AAAI Workshop on Wikipedia and Artificial Intelligence: An Evolving Synergy, pp. 25–30, Menlo Park,

work page
[33]

URLhttps://www.aaai.org/Papers/Workshops/2008/WS- 08-15/WS08-15-005.pdf

Association for the Advancement of Artificial Intelligence. URLhttps://www.aaai.org/Papers/Workshops/2008/WS- 08-15/WS08-15-005.pdf. (cited on p. 36) Republic of China Ministry of the Interior. National name statistical analysis, 2018.https://www.ris.gov.tw/documents/data/ 5/2/107namestat.pdf (Accessed 3 March 2021). (cited on p. 33) Swaroop Mishra, Danie...

work page doi:10.18653/v1/w19-3004 2008
[34]

Nakkiran, B

(cited on p. 14) Preetum Nakkiran, Behnam Neyshabur, and Hanie Sedghi. The deep bootstrap framework: Good online learners are good offline generalizers.arXiv preprint arXiv:2010.08127, 2020. (cited on p. 14) Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever. Deep double descent: Where bigger models and more data hur...

work page arXiv 2010
[35]

and Lapata, Mirella , year =

(cited on p. 34) Shashi Narayan, Shay B. Cohen, and Mirella Lapata. Don’t give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 1797–1807, Brussels, Belgium, October-November 2018. Association for Computationa...

work page doi:10.18653/v1/d18-1206 2018
[36]

doi: 10.18653/v1/P19-1442

Association for Computational Linguistics. doi: 10.18653/v1/P19-1442. URLhttps://aclanthology.org/P19-1442. (cited on p. 31) Marilyn Nippold, Melissa Allen, and Dixon Kirsch. Proverb comprehension as a function of reading proficiency in preadolescents. Language Speech and Hearing Services in Schools, 32:90, 04 2001. doi: 10.1044/0161-1461(2001/ 009). URL ...

work page doi:10.18653/v1/p19-1442 2001
[37]

URL https://doi.org/10.1080/02724980443000566

doi: 10.1080/02724980443000566. URL https://doi.org/10.1080/02724980443000566. (cited on p. 35) The Working Committee on the Revision of the National Standard Occupational Classification. Standard Occupational Classification of the People’s Republic of China. China Labour and Social Security Publishing House, 2015.http://www. jiangmen.gov.cn/bmpd/jmsrlzyh...

work page doi:10.1080/02724980443000566 2015
[38]

32) Judea Pearl.Causality: Models, Reasoning, and Inference

(cited on p. 32) Judea Pearl.Causality: Models, Reasoning, and Inference. Cambridge University Press, Cambridge, 2000. (cited on p. 30) Devin Pelser and Hugh Murrell. Deep and dense sarcasm detection, 2019. URLhttps://arxiv.org/abs/1911.07474. (cited on p. 39) Ethan Perez, Douwe Kiela, and Kyunghyun Cho. True few-shot learning with language models, 2021. ...

work page doi:10.18653/v1/2020.coling-main.518 2000
[39]

29) Tony A

(cited on p. 29) Tony A. Plate.Holographic Reduced Representations: Distributed Representation for Cognitive Structures. CSLI, Stanford, CA,

work page
[40]

29) Robert Plutchik

(cited on p. 29) Robert Plutchik. A general psychoevolutionary theory of emotion. In Robert Plutchik and Henry Kellerman (eds.),Theories of Emotion, pp. 3–33. Academic Press, 1980. doi: https://doi.org/10.1016/B978-0-12-558701-3.50007-7. URL https: //www.sciencedirect.com/science/article/pii/B9780125587013500077. (cited on p. 32) Nadia Polikarpova, Ivan K...

work page doi:10.1016/b978-0-12-558701-3.50007-7 1980
[41]

URLhttps://aclanthology.org/2020.lrec-1.125

European Language Resources Association. URLhttps://aclanthology.org/2020.lrec-1.125. (cited on p. 31) Damien Sileo, Wout Vossen, and Robbe Raymaekers. Zero-shot recommendation as language modeling. In Matthias Hagen, Suzan Verberne, Craig Macdonald, Christin Seifert, Krisztian Balog, Kjetil Nørvåg, and Vinay Setty (eds.),Advances in Information Retrieval...

work page doi:10.1007/978-3-030-99739-7_26 2020
[42]

(cited on p

John Benjamins, Amsterdam, 2010. (cited on p. 35) Bernd Steinbach and Roman Kohut. Neural networks – a model of boolean functions.5th International Workshop on Boolean Problems, Freiburg, Sept. 2002., 2002. URL https://www.researchgate.net/publication/246931125_Neural_Networks_- _A_Model_of_Boolean_Functions. (cited on p. 29) Nisan Stiennon, Long Ouyang, ...

work page doi:10.5555/1597348.1597414 2010
[43]

38) Zijian Wang and David Jurgens

(cited on p. 38) Zijian Wang and David Jurgens. It’s going to be okay: Measuring access to support in online communities. InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 33–45, Brussels, Belgium, October-November

work page 2018
[44]

assessing BERT’s syntactic abilities

Association for Computational Linguistics. doi: 10.18653/v1/D18-1004. URLhttps://aclanthology.org/D18-1004. (cited on p. 39) Zijian Wang and Christopher Potts. TalkDown: A corpus for condescension detection in context. InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Nat...

work page doi:10.18653/v1/d18-1004 2019
[45]

(cited on p

URL https://huggingface.co/bert-syntax/extending-bert-syntax.pdf. (cited on p. 39) Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, M...

work page doi:10.1145/3412841.3441982 2019
[46]

(cited on p

URL https://arxiv.org/abs/1705.10272. (cited on p. 38) Diyi Yang, Alon Lavie, Chris Dyer, and Eduard Hovy. Humor recognition and humor anchor extraction. InProceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 2367–2376, Lisbon, Portugal, September 2015. Association for Computational Linguistics. doi: 10.18653/v1/D15-...

work page doi:10.18653/v1/d15-1284 2015
[47]

(cited on pp

URL https://arxiv.org/abs/2002.04326. (cited on pp. 29 and 35) Xiang Yu, Ngoc Thang Vu, and Jonas Kuhn. Learning the Dyck language with attention-based Seq2Seq models. InProceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pp. 138–146, Florence, Italy, August 2019c. Association for Computational Linguistics...

work page doi:10.18653/v1/w19-4815 2002
[48]

31) Zhou Yu, Dejing Xu, Jun Yu, Ting Yu, Zhou Zhao, Yueting Zhuang, and Dacheng Tao

(cited on p. 31) Zhou Yu, Dejing Xu, Jun Yu, Ting Yu, Zhou Zhao, Yueting Zhuang, and Dacheng Tao. ActivityNet-QA: A dataset for understanding complex web videos via question answering, 2019d. URLhttps://arxiv.org/abs/1906.02467. (cited on p. 32) Eliezer Yudkowsky. Artificial intelligence as a positive and negative factor in global risk. In Nick Bostrom an...

work page doi:10.1109/tpami.2019.2914054 1906
[49]

Gender bias in coreference resolution: Evaluation and debiasing methods

Association for Computational Linguistics. doi: 10.18653/v1/N18-2003. URLhttps://aclanthology.org/N18-2003. (cited on pp. 31 and 41) Tony Z. Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. Calibrate before use: Improving few-shot performance of language models, 2021. URLhttps://arxiv.org/abs/2102.09690. (cited on p. 41) Ben Zhou, Daniel Khashab...

work page doi:10.18653/v1/n18-2003 2003