Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
Pith reviewed 2026-05-10 23:21 UTC · model grok-4.3
The pith
Scale brings gradual gains on knowledge tasks but sudden breakthroughs on complex ones in language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
BIG-bench evaluations demonstrate that model performance and calibration improve with scale across dense and sparse transformers, yet remain poor in absolute terms relative to human raters. Tasks centered on knowledge or memorization improve gradually and predictably, while tasks involving multiple steps or brittle metrics show sudden breakthroughs at critical scales. Performance patterns are similar across model classes, with modest gains from sparsity, and social bias typically rises with scale under ambiguous conditions, though prompting mitigates it.
What carries the argument
BIG-bench, a suite of 204 diverse tasks contributed by 450 authors that probes capabilities beyond those of current models and tracks how performance changes across model sizes.
If this is right
- Larger models will show predictable improvement on knowledge-based tasks but may suddenly gain new abilities on multi-step tasks at certain sizes.
- Calibration of model outputs will continue to improve with size yet remain unreliable compared to human judgments.
- Sparse model architectures will retain a modest edge over dense ones at equivalent scales.
- Social biases in model outputs will tend to increase with scale unless addressed by techniques such as prompting.
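The calibration claim above can be made concrete with expected calibration error (ECE), the standard weighted gap between a model's stated confidence and its realized accuracy. A minimal sketch, with made-up confidences and labels chosen to show an overconfident model:

```python
def ece(confidences, correct, n_bins=5):
    """Expected calibration error: bin predictions by confidence,
    then average |avg confidence - accuracy| weighted by bin size."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    err = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        err += (len(b) / total) * abs(avg_conf - accuracy)
    return err

# Hypothetical overconfident model: high stated confidence, middling accuracy.
confs = [0.9, 0.8, 0.95, 0.7, 0.85, 0.9]
hits  = [1,   0,   1,    0,   1,    0]
print(f"ECE = {ece(confs, hits):.3f}")
```

A well-calibrated model drives this number toward zero; the paper's finding is that it shrinks with scale but stays far from zero relative to human raters.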
Where Pith is reading between the lines
- Developers may need to design new tasks focused on multi-step reasoning to better anticipate when abrupt capability jumps will occur.
- The observed patterns imply that simple extrapolation from small-model trends will underestimate sudden changes in what models can do.
- Maintaining human expert baselines will require ongoing updates as model performance approaches or crosses them on individual tasks.
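The extrapolation point can be illustrated with a toy simulation, assuming (hypothetically) that a "gradual" task follows a smooth power law in parameter count while a "breakthrough" task is a sharp sigmoid past a critical scale. All curve shapes and constants here are illustrative, not taken from the paper:

```python
import math

def gradual(n_params):
    # Smooth, predictable improvement with scale (illustrative power law).
    return 1.0 - 2.0 * n_params ** -0.12

def breakthrough(n_params, critical=1e10, steepness=4.0):
    # Near-zero until a hypothetical critical scale, then a sharp jump.
    return 1.0 / (1.0 + math.exp(-steepness *
                  (math.log10(n_params) - math.log10(critical))))

def linear_extrapolate(task, small_scales, target):
    # Least-squares fit of accuracy vs log10(N) on small models,
    # extrapolated to the target scale.
    xs = [math.log10(n) for n in small_scales]
    ys = [task(n) for n in small_scales]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return mean_y + slope * (math.log10(target) - mean_x)

small, target = [1e6, 1e7, 1e8, 1e9], 1e11
for task in (gradual, breakthrough):
    predicted = linear_extrapolate(task, small, target)
    print(f"{task.__name__}: predicted={predicted:.2f} "
          f"actual={task(target):.2f}")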
Load-bearing premise
The 204 tasks chosen represent the capabilities that will matter for future models and human rater performance gives a stable, unbiased ceiling for comparison.
What would settle it
A follow-up evaluation on the same tasks where models exceed human raters on a majority of them or where no clear split appears between gradual and breakthrough scaling behaviors.
Original abstract
Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-future capabilities and limitations of language models. To address this challenge, we introduce the Beyond the Imitation Game benchmark (BIG-bench). BIG-bench currently consists of 204 tasks, contributed by 450 authors across 132 institutions. Task topics are diverse, drawing problems from linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and beyond. BIG-bench focuses on tasks that are believed to be beyond the capabilities of current language models. We evaluate the behavior of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters. In addition, a team of human expert raters performed all tasks in order to provide a strong baseline. Findings include: model performance and calibration both improve with scale, but are poor in absolute terms (and when compared with rater performance); performance is remarkably similar across model classes, though with benefits from sparsity; tasks that improve gradually and predictably commonly involve a large knowledge or memorization component, whereas tasks that exhibit "breakthrough" behavior at a critical scale often involve multiple steps or components, or brittle metrics; social bias typically increases with scale in settings with ambiguous context, but this can be improved with prompting.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the Beyond the Imitation Game benchmark (BIG-bench) with 204 tasks contributed by 450 authors across 132 institutions, spanning linguistics, math, reasoning, biology, social bias and other domains. It evaluates OpenAI GPT models, Google-internal dense transformers and Switch-style sparse transformers across scales from millions to hundreds of billions of parameters, supplies human expert rater baselines on all tasks, and reports that model performance and calibration improve with scale yet remain poor in absolute terms relative to humans; tasks with gradual scaling tend to involve knowledge or memorization while breakthrough scaling appears in multi-step or brittle-metric tasks; social bias tends to increase with scale under ambiguous context but can be mitigated by prompting.
Significance. If the reported empirical patterns hold, the work supplies a valuable large-scale characterization of current language-model capabilities and limitations that can inform scaling research, capability forecasting and harm mitigation. Credit is due for the multi-institutional task collection, the provision of human baselines, the explicit separation of gradual versus breakthrough scaling behaviors, and the absence of fitted parameters or circular reductions in the analysis.
Minor comments (4)
- Abstract: the list of findings is presented as a single dense sentence; reformatting the key observations as bullets would improve readability for readers scanning the paper.
- Evaluation protocol: the manuscript should state the precise prompting templates, number of shots, and decoding parameters used for each model family so that independent groups can reproduce the reported scores.
- Results: performance curves are shown without error bars or statistical tests; adding these would let readers assess whether observed differences between model classes or scales are reliable.
- Task categorization: the distinction between 'gradual' and 'breakthrough' tasks is described only qualitatively; a short appendix listing the specific tasks in each category, with their scaling exponents, would make the claim concrete.
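The reproducibility comment asks for a fully specified evaluation protocol. A minimal sketch of what such a record might look like, with every field name and value hypothetical rather than drawn from the paper:

```python
# Hypothetical evaluation-protocol record; nothing here is the paper's
# actual configuration, only the kind of detail the report requests.
eval_protocol = {
    "model_family": "dense-transformer",
    "prompt_template": "Q: {question}\nA:",   # exact template, verbatim
    "num_shots": 3,                            # few-shot examples prepended
    "decoding": {
        "method": "greedy",                    # or sampling parameters below
        "temperature": 0.0,
        "max_new_tokens": 64,
    },
    "seed": 1234,                              # fixed for reproducibility
}

for key in ("prompt_template", "num_shots", "decoding"):
    print(key, "->", eval_protocol[key])
```

Publishing one such record per model family would let independent groups reproduce the reported scores exactly.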
Simulated Author's Rebuttal
We thank the referee for the positive and accurate summary of our work, the recognition of its significance for scaling research and capability forecasting, and the recommendation of minor revision. The report raises no major concerns requiring a point-by-point response. We will address the four minor comments in revision: reformatting the abstract's findings as bullets, specifying prompting templates, shot counts, and decoding parameters per model family, adding error bars and statistical tests to the performance curves, and providing an appendix listing the tasks in the gradual and breakthrough categories.
Circularity Check
No significant circularity; purely empirical benchmark
Full rationale
The paper introduces the BIG-bench dataset of 204 tasks and reports direct empirical measurements of model performance across scales, model classes, and human raters. No mathematical derivations, parameter fits, or predictions are claimed; scaling trends, gradual vs. breakthrough behaviors, and bias observations are presented as descriptive results from the evaluations themselves. The central claims rest on the contributed tasks and rater baselines without reduction to prior fits or self-citation chains. This is the expected non-finding for a large-scale benchmarking effort.
Forward citations
Cited by 52 Pith papers
-
LiveBench: A Challenging, Contamination-Limited LLM Benchmark
LiveBench is a contamination-limited LLM benchmark with auto-scored challenging tasks from recent sources across math, coding, reasoning and more, where top models score below 70%.
-
Progress measures for grokking via mechanistic interpretability
Grokking arises from gradual amplification of a Fourier-based circuit in the weights followed by removal of memorizing components.
-
NARRA-Gym for Evaluating Interactive Narrative Agents
NARRA-Gym is an executable benchmark that generates complete interactive narrative episodes from emotional seeds and logs full model trajectories to expose gaps in coherence, adaptation, and personalization that stati...
-
TRIP-Evaluate: An Open Multimodal Benchmark for Evaluating Large Models in Transportation
TRIP-Evaluate is a new open multimodal benchmark with 837 text, image, and point-cloud items organized by a role-task-knowledge taxonomy to evaluate large models on transportation workflows.
-
FrontierFinance: A Long-Horizon Computer-Use Benchmark of Real-World Financial Tasks
FrontierFinance benchmark shows human financial experts outperform state-of-the-art LLMs by achieving higher scores and more client-ready outputs on realistic long-horizon tasks.
-
KTO: Model Alignment as Prospect Theoretic Optimization
KTO aligns LLMs by directly maximizing prospect-theoretic utility on binary signals and matches or exceeds preference-based methods like DPO from 1B to 30B parameters.
-
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.
-
C-Pack: Packed Resources For General Chinese Embeddings
C-Pack releases a new Chinese embedding benchmark, large training dataset, and optimized models that outperform priors by up to 10% on C-MTEB while also delivering English SOTA results.
-
Large Language Models as Optimizers
Large language models can optimize by being prompted with histories of past solutions and scores to propose better ones, producing prompts that raise accuracy up to 8% on GSM8K and 50% on Big-Bench Hard over human-des...
-
gwBenchmarks: Stress-Testing LLM Agents on High-Precision Gravitational Wave Astronomy
LLM coding agents cannot reach the 10^{-4} relative accuracy required for gravitational wave modeling tasks and show systematic failures including metric misuse and result fabrication.
-
OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces
OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.
-
Temporal Reasoning Is Not the Bottleneck: A Probabilistic Inconsistency Framework for Neuro-Symbolic QA
Temporal reasoning is not the core bottleneck for LLMs on time-based QA; the real issue is unstructured text-to-event mapping, addressed by a neuro-symbolic system with PIS that reaches 100% accuracy on benchmarks whe...
-
A Meta Reinforcement Learning Approach to Goals-Based Wealth Management
MetaRL pre-trained on GBWM problems delivers near-optimal dynamic strategies in 0.01s achieving 97.8% of DP optimal utility and handles larger problems where DP fails.
-
Evaluating Agentic AI in the Wild: Failure Modes, Drift Patterns, and a Production Evaluation Framework
The paper presents a taxonomy of seven production-specific failure modes for agentic AI, demonstrates that existing metrics fail to detect four of them entirely, and proposes the PAEF five-dimension framework for cont...
-
AIT Academy: Cultivating the Complete Agent with a Confucian Three-Domain Curriculum
AIT Academy introduces a tripartite curriculum for AI agents across natural science, humanities, and social science domains, with reported gains of 15.9 points in security and 7 points in social reasoning under specif...
-
QuickScope: Certifying Hard Questions in Dynamic LLM Benchmarks
QuickScope uses modified COUP Bayesian optimization to find truly difficult questions in dynamic LLM benchmarks more sample-efficiently than baselines while cutting false positives.
-
Beyond Static Snapshots: A Grounded Evaluation Framework for Language Models at the Agentic Frontier
ISOPro replaces learned reward models with deterministic verifiers in a continuous evaluation setup for LLMs, delivering larger average capability gains than GRPO-LoRA across small models in scheduling and MBPP domain...
-
Parcae: Scaling Laws For Stable Looped Language Models
Parcae stabilizes looped LLMs via spectral norm constraints on injection parameters, enabling power-law scaling for training FLOPs and saturating exponential scaling at test time that improves quality over fixed-depth...
-
The A-R Behavioral Space: Execution-Level Profiling of Tool-Using Language Model Agents in Organizational Deployment
Execution and refusal in tool-using LLM agents form separable behavioral dimensions whose joint distribution shifts systematically with normative regimes and autonomy scaffolding.
-
Measuring Representation Robustness in Large Language Models for Geometry
LLMs display accuracy gaps of up to 14 percentage points on the same geometry problems solely due to representation choice, with vector forms consistently weakest and a convert-then-solve prompt helping only high-capa...
-
Memory in the Age of AI Agents
The paper maps agent memory research via three forms (token-level, parametric, latent), three functions (factual, experiential, working), and dynamics of formation/evolution/retrieval, plus benchmarks and future directions.
-
Towards an AI co-scientist
A multi-agent AI system generates novel biomedical hypotheses that show promising experimental validation in drug repurposing for leukemia, new targets for liver fibrosis, and a bacterial gene transfer mechanism.
-
MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark
MMLU-Pro is a revised benchmark that makes language model evaluation harder and more stable by using ten options per question and emphasizing reasoning over simple knowledge recall.
-
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
Phi-3-mini (3.8B params, 3.3T tokens) reaches 69% MMLU and 8.38 MT-bench, matching larger models, with scaled-up 7B/14B variants and phi-3.5 extensions for multilingual, MoE, and vision capabilities.
-
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.
-
Reinforced Self-Training (ReST) for Language Modeling
ReST improves LLM translation quality on benchmarks via offline RL on self-generated data, achieving gains in a compute-efficient way compared to typical RLHF.
-
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
GPT-4 as an LLM judge achieves over 80% agreement with human preferences on MT-Bench and Chatbot Arena, matching human agreement levels and providing a scalable evaluation method.
-
Improving Factuality and Reasoning in Language Models through Multiagent Debate
Multiagent debate among LLMs improves mathematical reasoning, strategic reasoning, and factual accuracy while reducing hallucinations.
-
BloombergGPT: A Large Language Model for Finance
BloombergGPT is a 50B parameter LLM trained on a 708B token mixed financial and general dataset that outperforms prior models on financial benchmarks while preserving general LLM performance.
-
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
BLOOM is a 176B-parameter open-access multilingual language model trained on the ROOTS corpus that achieves competitive performance on benchmarks, with improved results after multitask prompted finetuning.
-
Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them
Chain-of-thought prompting enables large language models to surpass average human performance on 17 of 23 challenging BIG-Bench tasks.
-
Language Models (Mostly) Know What They Know
Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.
-
Emergent Abilities of Large Language Models
Emergent abilities are capabilities present in large language models but absent in smaller ones and cannot be predicted by extrapolating smaller model performance.
-
The Efficiency Gap in Byte Modeling
Byte modeling incurs greater scaling overhead for masked diffusion than autoregressive models because the diffusion objective destroys local byte contiguity needed to resolve semantics.
-
Task-Aware Answer Preservation under Audio Compression for Large Audio Language Models
A statistical sign-off protocol for audio compressors ensures worst-case answer preservation across query families in LALMs.
-
Complexity Horizons of Compressed Models in Analog Circuit Analysis
Prerequisite graphs map compressed LLM performance boundaries in analog circuit analysis to allow selecting the smallest viable model for a given task complexity.
-
Calibrating Model-Based Evaluation Metrics for Summarization
A reference-free proxy scoring framework combined with GIRB calibration produces better-aligned evaluation metrics for summarization and outperforms baselines across seven datasets.
-
Consistency Analysis of Sentiment Predictions using Syntactic & Semantic Context Assessment Summarization (SSAS)
SSAS improves LLM sentiment prediction consistency and data quality by up to 30% on three review datasets via syntactic and semantic context assessment summarization.
-
Empirical Evidence of Complexity-Induced Limits in Large Language Models on Finite Discrete State-Space Problems with Explicit Validity Constraints
Large reasoning models exhibit reasoning collapse, with accuracy dropping sharply beyond task-specific complexity thresholds in controlled versions of nine classical reasoning tasks using strict validity validators.
-
When Models Know More Than They Say: Probing Analogical Reasoning in LLMs
Probing shows LLMs hold more analogical knowledge internally than prompting reveals, with a task-dependent asymmetry between rhetorical and narrative cases.
-
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs
Phi-4-Mini achieves strong math and coding performance with only 3.8B parameters via high-quality synthetic data, while Phi-4-Multimodal uses Mixture-of-LoRAs to integrate modalities and top speech recognition leaderboards.
-
Humanity's Last Exam
Humanity's Last Exam is a new 2,500-question benchmark at the frontier of human knowledge where state-of-the-art LLMs show low accuracy.
-
The Platonic Representation Hypothesis
Representations learned by large AI models are converging toward a shared statistical model of reality.
-
PaLM 2 Technical Report
PaLM 2 reports state-of-the-art results on language, reasoning, and multilingual tasks with improved efficiency over PaLM.
-
Galactica: A Large Language Model for Science
Galactica, a science-specialized LLM, reports higher scores than GPT-3, Chinchilla, and PaLM on LaTeX knowledge, mathematical reasoning, and medical QA benchmarks while outperforming general models on BIG-bench.
-
BenCSSmark: Making the Social Sciences Count in LLM Research
BenCSSmark is a proposed benchmark that adds social science datasets to LLM evaluation to improve model robustness and relevance across disciplines like sociology and economics.
-
Leveraging Weighted Syntactic and Semantic Context Assessment Summary (wSSAS) Towards Text Categorization Using LLMs
wSSAS is a two-phase deterministic framework that uses hierarchical text organization and SNR-based feature prioritization to improve clustering integrity, categorization accuracy, and reproducibility when applying LL...
-
From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review
A survey consolidating benchmarks, agent frameworks, real-world applications, and protocols for LLM-based autonomous agents into a proposed taxonomy with recommendations for future research.
-
TinyLlama: An Open-Source Small Language Model
TinyLlama is a 1.1B-parameter open-source language model pretrained on 1 trillion tokens that outperforms other open-source models of similar size on downstream tasks.
-
ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools
GLM-4 models rival or exceed GPT-4 on MMLU, GSM8K, MATH, BBH, GPQA, HumanEval, IFEval, long-context tasks, and Chinese alignment while adding autonomous tool use for web, code, and image generation.
-
A Survey of Large Language Models
This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.
-
A Survey on In-context Learning
The paper surveys definitions, techniques, applications, and challenges in in-context learning for large language models.
[35]
(cited on p. 34) Shashi Narayan, Shay B. Cohen, and Mirella Lapata. Don’t give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 1797–1807, Brussels, Belgium, October-November 2018. Association for Computationa...
-
[36]
Association for Computational Linguistics. doi: 10.18653/v1/P19-1442. URLhttps://aclanthology.org/P19-1442. (cited on p. 31) Marilyn Nippold, Melissa Allen, and Dixon Kirsch. Proverb comprehension as a function of reading proficiency in preadolescents. Language Speech and Hearing Services in Schools, 32:90, 04 2001. doi: 10.1044/0161-1461(2001/ 009). URL ...
-
[37]
URL https://doi.org/10.1080/02724980443000566
doi: 10.1080/02724980443000566. URL https://doi.org/10.1080/02724980443000566. (cited on p. 35) The Working Committee on the Revision of the National Standard Occupational Classification. Standard Occupational Classification of the People’s Republic of China. China Labour and Social Security Publishing House, 2015.http://www. jiangmen.gov.cn/bmpd/jmsrlzyh...
-
[38]
32) Judea Pearl.Causality: Models, Reasoning, and Inference
(cited on p. 32) Judea Pearl.Causality: Models, Reasoning, and Inference. Cambridge University Press, Cambridge, 2000. (cited on p. 30) Devin Pelser and Hugh Murrell. Deep and dense sarcasm detection, 2019. URLhttps://arxiv.org/abs/1911.07474. (cited on p. 39) Ethan Perez, Douwe Kiela, and Kyunghyun Cho. True few-shot learning with language models, 2021. ...
-
[39]
(cited on p. 29) Tony A. Plate.Holographic Reduced Representations: Distributed Representation for Cognitive Structures. CSLI, Stanford, CA,
-
[40]
(cited on p. 29) Robert Plutchik. A general psychoevolutionary theory of emotion. In Robert Plutchik and Henry Kellerman (eds.),Theories of Emotion, pp. 3–33. Academic Press, 1980. doi: https://doi.org/10.1016/B978-0-12-558701-3.50007-7. URL https: //www.sciencedirect.com/science/article/pii/B9780125587013500077. (cited on p. 32) Nadia Polikarpova, Ivan K...
-
[41]
URLhttps://aclanthology.org/2020.lrec-1.125
European Language Resources Association. URLhttps://aclanthology.org/2020.lrec-1.125. (cited on p. 31) Damien Sileo, Wout Vossen, and Robbe Raymaekers. Zero-shot recommendation as language modeling. In Matthias Hagen, Suzan Verberne, Craig Macdonald, Christin Seifert, Krisztian Balog, Kjetil Nørvåg, and Vinay Setty (eds.),Advances in Information Retrieval...
-
[42]
John Benjamins, Amsterdam, 2010. (cited on p. 35) Bernd Steinbach and Roman Kohut. Neural networks – a model of boolean functions.5th International Workshop on Boolean Problems, Freiburg, Sept. 2002., 2002. URL https://www.researchgate.net/publication/246931125_Neural_Networks_- _A_Model_of_Boolean_Functions. (cited on p. 29) Nisan Stiennon, Long Ouyang, ...
-
[43]
38) Zijian Wang and David Jurgens
(cited on p. 38) Zijian Wang and David Jurgens. It’s going to be okay: Measuring access to support in online communities. InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 33–45, Brussels, Belgium, October-November
work page 2018
-
[44]
assessing BERT’s syntactic abilities
Association for Computational Linguistics. doi: 10.18653/v1/D18-1004. URLhttps://aclanthology.org/D18-1004. (cited on p. 39) Zijian Wang and Christopher Potts. TalkDown: A corpus for condescension detection in context. InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Nat...
-
[45]
URL https://huggingface.co/bert-syntax/extending-bert-syntax.pdf. (cited on p. 39) Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, M...
-
[46]
URL https://arxiv.org/abs/1705.10272. (cited on p. 38) Diyi Yang, Alon Lavie, Chris Dyer, and Eduard Hovy. Humor recognition and humor anchor extraction. InProceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 2367–2376, Lisbon, Portugal, September 2015. Association for Computational Linguistics. doi: 10.18653/v1/D15-...
-
[47]
URL https://arxiv.org/abs/2002.04326. (cited on pp. 29 and 35) Xiang Yu, Ngoc Thang Vu, and Jonas Kuhn. Learning the Dyck language with attention-based Seq2Seq models. InProceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pp. 138–146, Florence, Italy, August 2019c. Association for Computational Linguistics...
-
[48]
31) Zhou Yu, Dejing Xu, Jun Yu, Ting Yu, Zhou Zhao, Yueting Zhuang, and Dacheng Tao
(cited on p. 31) Zhou Yu, Dejing Xu, Jun Yu, Ting Yu, Zhou Zhao, Yueting Zhuang, and Dacheng Tao. ActivityNet-QA: A dataset for understanding complex web videos via question answering, 2019d. URLhttps://arxiv.org/abs/1906.02467. (cited on p. 32) Eliezer Yudkowsky. Artificial intelligence as a positive and negative factor in global risk. In Nick Bostrom an...
-
[49]
Gender bias in coreference resolution: Evaluation and debiasing methods
Association for Computational Linguistics. doi: 10.18653/v1/N18-2003. URLhttps://aclanthology.org/N18-2003. (cited on pp. 31 and 41) Tony Z. Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. Calibrate before use: Improving few-shot performance of language models, 2021. URLhttps://arxiv.org/abs/2102.09690. (cited on p. 41) Ben Zhou, Daniel Khashab...