pith. machine review for the scientific record.

arxiv: 2304.06364 · v2 · submitted 2023-04-13 · 💻 cs.CL · cs.AI

Recognition: 2 theorem links · Lean Theorem

AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 09:59 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords AGIEval · foundation models · benchmark · GPT-4 · standardized exams · SAT · LSAT · reasoning

The pith

The AGIEval benchmark shows GPT-4 surpassing average human performance on the SAT and LSAT, reaching 95 percent accuracy on SAT math.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces AGIEval, a benchmark that draws questions directly from standardized human exams such as college entrance tests, law school admissions, math competitions, and lawyer qualification exams to measure foundation model abilities on human-level tasks. Evaluations of GPT-4, ChatGPT, and Text-Davinci-003 find that GPT-4 exceeds average human scores on SAT, LSAT, and math competitions, reaching 95 percent accuracy on SAT math and 92.5 percent on the English section of the Chinese national college entrance exam. The models show weaker results on problems that require complex reasoning chains or narrow domain knowledge. The authors break performance into categories of understanding, knowledge, reasoning, and calculation to expose specific strengths and gaps. The approach replaces artificial datasets with tasks tied to real human cognition and decision-making.

Core claim

AGIEval evaluates foundation models on collections of real standardized exams including SAT, LSAT, math competitions, and lawyer qualification tests. GPT-4 surpasses average human performance on SAT, LSAT, and math competitions, attaining 95 percent accuracy on the SAT Math test and 92.5 percent accuracy on the English test of the Chinese national college entrance exam, while remaining less proficient on tasks that demand complex reasoning or specific domain knowledge.

What carries the argument

The AGIEval benchmark, assembled from standardized human exams to test foundation models on understanding, knowledge, reasoning, and calculation in human-relevant contexts.
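The headline numbers reduce to exact-match accuracy over multiple-choice items, broken down by the four capability categories. A minimal sketch of that aggregation, using hypothetical item records rather than the paper's actual data schema:

```python
from collections import defaultdict

def category_accuracy(items):
    """Aggregate exact-match accuracy per capability category.

    Each item is a dict holding the model's chosen option, the gold
    answer, and a category label (understanding / knowledge /
    reasoning / calculation).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        total[item["category"]] += 1
        if item["prediction"] == item["answer"]:
            correct[item["category"]] += 1
    return {cat: correct[cat] / total[cat] for cat in total}

# Hypothetical graded items, not drawn from AGIEval itself.
items = [
    {"category": "calculation", "prediction": "B", "answer": "B"},
    {"category": "calculation", "prediction": "C", "answer": "A"},
    {"category": "reasoning",   "prediction": "D", "answer": "D"},
]
print(category_accuracy(items))  # {'calculation': 0.5, 'reasoning': 1.0}
```

The per-category breakdown is what lets the paper separate "weak at calculation" from "weak at multi-step reasoning" rather than reporting one blended score.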

If this is right

  • Foundation models can now solve many exam-style questions at or above average human levels across multiple subjects.
  • Performance gaps appear most clearly in complex reasoning and domain-specific knowledge, guiding targeted improvements.
  • Capability breakdowns by category supply concrete directions for strengthening general abilities.
  • Human-exam benchmarks connect model results more directly to real-world cognitive demands than synthetic tests do.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Sustained high scores could support deployment of models as automated tutors or graders for these exact exams.
  • Gaps in complex reasoning may require architectural additions rather than further scaling alone.
  • Extending the benchmark with harder or culturally varied exam variants could track whether gains generalize.

Load-bearing premise

Standardized human exams serve as valid and unbiased proxies for general cognitive capabilities without favoring current model training methods or test formats.

What would settle it

A controlled comparison in which models achieve high AGIEval scores yet fail on equivalent non-exam problems that test the same underlying skills in open-ended or novel settings.

read the original abstract

Evaluating the general abilities of foundation models to tackle human-level tasks is a vital aspect of their development and application in the pursuit of Artificial General Intelligence (AGI). Traditional benchmarks, which rely on artificial datasets, may not accurately represent human-level capabilities. In this paper, we introduce AGIEval, a novel benchmark specifically designed to assess foundation model in the context of human-centric standardized exams, such as college entrance exams, law school admission tests, math competitions, and lawyer qualification tests. We evaluate several state-of-the-art foundation models, including GPT-4, ChatGPT, and Text-Davinci-003, using this benchmark. Impressively, GPT-4 surpasses average human performance on SAT, LSAT, and math competitions, attaining a 95% accuracy rate on the SAT Math test and a 92.5% accuracy on the English test of the Chinese national college entrance exam. This demonstrates the extraordinary performance of contemporary foundation models. In contrast, we also find that GPT-4 is less proficient in tasks that require complex reasoning or specific domain knowledge. Our comprehensive analyses of model capabilities (understanding, knowledge, reasoning, and calculation) reveal these models' strengths and limitations, providing valuable insights into future directions for enhancing their general capabilities. By concentrating on tasks pertinent to human cognition and decision-making, our benchmark delivers a more meaningful and robust evaluation of foundation models' performance in real-world scenarios. The data, code, and all model outputs are released in https://github.com/ruixiangcui/AGIEval.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces AGIEval, a benchmark assembled from publicly available human standardized exams (SAT, LSAT, math competitions, Gaokao, lawyer qualification tests). It evaluates GPT-4, ChatGPT, and Text-Davinci-003, reporting that GPT-4 exceeds average human performance on several tests (95% on SAT Math, 92.5% on Gaokao English) while showing weaker results on complex reasoning and domain-knowledge tasks. Capability breakdowns (understanding/knowledge/reasoning/calculation) and full data/code/output release are provided.

Significance. If the headline numbers survive decontamination checks, the work supplies a more ecologically valid signal of foundation-model progress than synthetic benchmarks and supplies concrete capability diagnostics plus reproducible artifacts. The public release of all model outputs strengthens the contribution.

major comments (2)
  1. [Abstract] Abstract and evaluation section: the 95% SAT-Math and 92.5% Gaokao-English figures are presented without the number of items per test, sampling protocol, exact prompt templates, or any statistical testing; these omissions leave the central claim that GPT-4 surpasses humans only moderately supported.
  2. [Evaluation] Evaluation methodology: no membership-inference, decontamination, or paraphrased-variant experiments are reported for the publicly circulated exam questions, even though the central claim (surpassing humans via reasoning) requires that performance not be explained by training-data overlap.
minor comments (2)
  1. [Figures] Figure captions and axis labels could more explicitly state the human baseline source and sample size for each exam.
  2. [Appendix] A short table summarizing prompt templates per task type would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our AGIEval benchmark paper. The comments highlight valuable opportunities to strengthen the presentation of results and the evaluation methodology. We have revised the manuscript to incorporate additional details and experiments where feasible, and we respond to each major comment below.

read point-by-point responses
  1. Referee: [Abstract] Abstract and evaluation section: the 95% SAT-Math and 92.5% Gaokao-English figures are presented without the number of items per test, sampling protocol, exact prompt templates, or any statistical testing; these omissions leave the central claim that GPT-4 surpasses humans only moderately supported.

    Authors: We agree that these supporting details are necessary to substantiate the central claims. In the revised manuscript, we have expanded the evaluation section to report the exact number of items per test (SAT Math: 58 questions; Gaokao English: 40 questions), clarified that evaluations used the full publicly available test sets with no subsampling, included the precise prompt templates in a new appendix, and added statistical testing via binomial proportion tests to confirm that GPT-4's accuracies significantly exceed the reported human averages. These changes provide stronger empirical grounding for the headline figures. revision: yes
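The binomial proportion test the rebuttal describes can be reproduced with a one-sided exact tail: with 55 of 58 SAT Math items correct (about 95 percent), how likely is that score under a human-average accuracy? The 0.70 human baseline below is an assumed placeholder for illustration, not a figure reported by the paper:

```python
from math import comb

def binom_tail(k, n, p):
    """Exact one-sided P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# 55/58 correct is roughly 94.8%; the human-average accuracy of 0.70
# is an assumed placeholder, not the paper's reported baseline.
p_value = binom_tail(55, 58, 0.70)
print(f"{p_value:.2e}")  # far below 0.05, so the null is rejected
```

With only 58 items the exact tail is cheap to compute, so there is no need for a normal approximation.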

  2. Referee: [Evaluation] Evaluation methodology: no membership-inference, decontamination, or paraphrased-variant experiments are reported for the publicly circulated exam questions, even though the central claim (surpassing humans via reasoning) requires that performance not be explained by training-data overlap.

    Authors: We acknowledge the importance of ruling out data contamination to support interpretations of reasoning ability. While full membership-inference or decontamination experiments are not feasible without access to the proprietary training data of the evaluated models, we have added paraphrased-variant experiments on subsets of the SAT and Gaokao questions in the revision; these maintain high performance, indicating robustness beyond exact memorization. We have also expanded the limitations and discussion sections to address contamination risks explicitly, noting the public nature of the exams and known training cutoffs, and we release all model outputs to support community-led analyses. revision: partial
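A paraphrased-variant probe of the kind described can be sketched as a paired comparison: score each item in its original and paraphrased form, and flag a large accuracy drop as a memorization signal. Everything here, including the 10-point drop threshold, is a hypothetical illustration rather than the authors' protocol:

```python
def contamination_signal(orig_correct, para_correct, max_drop=0.10):
    """Compare accuracy on original vs. paraphrased variants.

    orig_correct / para_correct: parallel lists of booleans, one per
    item, marking whether the model answered correctly. An accuracy
    drop larger than max_drop suggests the original wording was
    memorized rather than solved.
    """
    acc_orig = sum(orig_correct) / len(orig_correct)
    acc_para = sum(para_correct) / len(para_correct)
    drop = acc_orig - acc_para
    return {"orig": acc_orig, "para": acc_para,
            "drop": drop, "flagged": drop > max_drop}

# Hypothetical per-item results on ten questions.
orig = [True] * 9 + [False]       # 90% on original wording
para = [True] * 7 + [False] * 3   # 70% on paraphrases
print(contamination_signal(orig, para))  # drop of 0.2 -> flagged
```

A robust model should score comparably on both forms; only the direction and size of the gap are informative, since paraphrasing can also make items genuinely harder.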

Circularity Check

0 steps flagged

No circularity: benchmark is direct measurement on newly assembled external exam items

full rationale

The paper constructs AGIEval by collecting questions from public standardized exams (SAT, LSAT, Gaokao, math contests) and reports model accuracies as direct empirical measurements against published human averages. No equations, fitted parameters, or predictions are derived; the central claims (e.g., GPT-4 at 95% SAT Math) are simple accuracy counts on the collected items. No self-citations, uniqueness theorems, or ansatzes are invoked to justify results. The derivation chain is therefore self-contained as straightforward benchmarking.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The paper introduces a new evaluation benchmark without introducing fitted parameters, unstated mathematical axioms, or new physical entities; the only added construct is the benchmark collection itself.

invented entities (1)
  • AGIEval benchmark (no independent evidence)
    purpose: To provide human-centric standardized exam questions for evaluating foundation models
    The benchmark is assembled and released in this work.

pith-pipeline@v0.9.0 · 5602 in / 1137 out tokens · 42859 ms · 2026-05-16T09:59:54.385025+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

    cs.CL 2023-11 unverdicted novelty 8.0

    MMMU provides 11.5K heterogeneous college-level multimodal questions that current models solve at 56-59% accuracy, establishing a new standard for expert multimodal evaluation.

  2. Human-Grounded Multimodal Benchmark with 900K-Scale Aggregated Student Response Distributions from Japan's National Assessment of Academic Ability

    cs.CL 2026-05 unverdicted novelty 7.0

    A new benchmark dataset drawn from Japan's National Assessment of Academic Ability supplies real exam layouts, diagrams, Japanese text, and nationwide student response distributions for evaluating multimodal LLMs.

  3. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

    cs.CL 2024-05 unverdicted novelty 7.0

    DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.

  4. Train Separately, Merge Together: Modular Post-Training with Mixture-of-Experts

    cs.LG 2026-04 unverdicted novelty 6.0

    BAR trains independent domain experts via separate mid-training, SFT, and RL pipelines then composes them with a MoE router to match monolithic retraining performance at lower cost and without catastrophic forgetting.

  5. Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence

    cs.AI 2026-04 unverdicted novelty 6.0

    Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.

  6. The Master Key Hypothesis: Unlocking Cross-Model Capability Transfer via Linear Subspace Alignment

    cs.LG 2026-04 unverdicted novelty 6.0

    The Master Key Hypothesis states that capabilities are low-dimensional directions transferable across models through linear subspace alignment, with UNLOCK demonstrating gains such as 12.1% accuracy improvement on MAT...

  7. Pixtral 12B

    cs.CV 2024-10 unverdicted novelty 6.0

    Pixtral-12B is a 12B multimodal LLM with a custom vision encoder that ingests images at native resolution and aspect ratio, achieving leading benchmark results among open models while preserving text capabilities.

  8. DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence

    cs.SE 2024-06 unverdicted novelty 6.0

    An open-source MoE code model matches GPT-4 Turbo on coding and math benchmarks while expanding to 338 languages and 128K context length.

  9. MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

    cs.CL 2024-06 conditional novelty 6.0

    MMLU-Pro is a revised benchmark that makes language model evaluation harder and more stable by using ten options per question and emphasizing reasoning over simple knowledge recall.

  10. GPT-4V(ision) is a Generalist Web Agent, if Grounded

    cs.IR 2024-01 conditional novelty 6.0

    GPT-4V achieves 51.1% success on live web tasks as a generalist agent when plans are manually grounded, outperforming text-only models, but automatic grounding lags far behind oracle performance.

  11. The Falcon Series of Open Language Models

    cs.CL 2023-11 conditional novelty 6.0

    Falcon-180B is a 180B-parameter open decoder-only model trained on 3.5 trillion tokens that approaches PaLM-2-Large performance at lower cost and is released with dataset extracts.

  12. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

    cs.CL 2023-06 accept novelty 6.0

    GPT-4 as an LLM judge achieves over 80% agreement with human preferences on MT-Bench and Chatbot Arena, matching human agreement levels and providing a scalable evaluation method.

  13. Kimi K2: Open Agentic Intelligence

    cs.LG 2025-07 unverdicted novelty 5.0

    Kimi K2 is a 1-trillion-parameter MoE model that leads open-source non-thinking models on agentic benchmarks including 65.8 on SWE-Bench Verified and 66.1 on Tau2-Bench.

  14. Humanity's Last Exam

    cs.LG 2025-01 unverdicted novelty 5.0

    Humanity's Last Exam is a new 2,500-question benchmark at the frontier of human knowledge where state-of-the-art LLMs show low accuracy.

  15. "I Don't Know" -- Towards Appropriate Trust with Certainty-Aware Retrieval Augmented Generation

    cs.IR 2026-05 unverdicted novelty 4.0

    CERTA adds relevance-based certainty estimation to RAG so LLMs can better signal uncertainty on non-objective questions, reducing overconfidence.

  16. Ministral 3

    cs.CL 2026-01 unverdicted novelty 4.0

    Ministral 3 releases 3B/8B/14B parameter-efficient language models with base, instruction, and reasoning variants derived via iterative pruning and distillation, including image understanding capabilities.

  17. DeepSeek-VL: Towards Real-World Vision-Language Understanding

    cs.AI 2024-03 unverdicted novelty 4.0

    DeepSeek-VL develops open-source 1.3B and 7B vision-language models that achieve competitive or state-of-the-art results on real-world visual-language benchmarks through diverse data curation, a hybrid vision encoder,...

  18. DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

    cs.CL 2024-01 unverdicted novelty 4.0

    DeepSeek LLM 67B exceeds LLaMA-2 70B on code, mathematics and reasoning benchmarks after pre-training on 2 trillion tokens and alignment via SFT and DPO.

Reference graph

Works this paper leans on

287 extracted references · 287 canonical work pages · cited by 18 Pith papers · 12 internal anchors

  1. [1]

    Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence,

    Reasoning over Hybrid Chain for Table-and-Text Open Domain Question Answering , author =. Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence,. 2022 , month =. doi:10.24963/ijcai.2022/629 , url =

  2. [2]

    Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics , pages=

    Reasoning Over Semantic-Level Graph for Fact Checking , author=. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics , pages=

  3. [3]

    2023 , publisher =

    Beeching, Edward and Han, Sheon and Lambert, Nathan and Rajani, Nazneen and Sanseviero, Omar and Tunstall, Lewis and Wolf, Thomas , title =. 2023 , publisher =

  4. [4]

    Communications of the ACM , volume=

    Datasheets for datasets , author=. Communications of the ACM , volume=. 2021 , publisher=

  5. [5]

    Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

    GLM: General Language Model Pretraining with Autoregressive Blank Infilling , author=. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

  6. [6]

    Proceedings of the 2021 ACM conference on fairness, accountability, and transparency , pages=

    Bold: Dataset and metrics for measuring biases in open-ended language generation , author=. Proceedings of the 2021 ACM conference on fairness, accountability, and transparency , pages=

  7. [9]

    and Stoica, Ion and Xing, Eric P

    Chiang, Wei-Lin and Li, Zhuohan and Lin, Zi and Sheng, Ying and Wu, Zhanghao and Zhang, Hao and Zheng, Lianmin and Zhuang, Siyuan and Zhuang, Yonghao and Gonzalez, Joseph E. and Stoica, Ion and Xing, Eric P. , month =. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90\ url =

  8. [10]

    Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics , pages=

    LogicalFactChecker: Leveraging Logical Operations for Fact Checking with Graph Module Network , author=. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics , pages=

  9. [11]

    Syntax-Enhanced Pre-trained Model , author=. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) , pages=

  10. [12]

    Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) , pages=

    Neural Deepfake Detection with Factual Structure of Text , author=. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) , pages=

  11. [13]

    Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies , pages=

    ProQA: Structural Prompt-based Pre-training for Unified Question Answering , author=. Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies , pages=

  12. [14]

    Findings of the Association for Computational Linguistics: NAACL 2022 , pages=

    Analytical Reasoning of Text , author=. Findings of the Association for Computational Linguistics: NAACL 2022 , pages=

  13. [15]

    Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021 , pages=

    UserAdapter: Few-shot user learning in sentiment analysis , author=. Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021 , pages=

  14. [16]

    Natural Language Processing and Chinese Computing: 8th CCF International Conference, NLPCC 2019, Dunhuang, China, October 9--14, 2019, Proceedings, Part I , pages=

    Improving Question Answering by Commonsense-Based Pre-training , author=. Natural Language Processing and Chinese Computing: 8th CCF International Conference, NLPCC 2019, Dunhuang, China, October 9--14, 2019, Proceedings, Part I , pages=

  15. [17]

    IEEE/ACM Transactions on Audio, Speech, and Language Processing , volume=

    From lsat: The progress and challenges of complex reasoning , author=. IEEE/ACM Transactions on Audio, Speech, and Language Processing , volume=. 2022 , publisher=

  16. [18]

    Proceedings of the Twenty-Ninth International Conference on International Joint Conferences on Artificial Intelligence , pages=

    LogiQA: a challenge dataset for machine reading comprehension with logical reasoning , author=. Proceedings of the Twenty-Ninth International Conference on International Joint Conferences on Artificial Intelligence , pages=

  17. [19]

    Sort , volume=

    Measuring Mathematical Problem Solving With the MATH Dataset , author=. Sort , volume=

  18. [20]

    Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

    Program Induction by Rationale Generation: Learning to Solve and Explain Algebraic Word Problems , author=. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

  19. [21]

    Proceedings of AAAI , year=

    JEC-QA: A Legal-Domain Question Answering Dataset , author=. Proceedings of AAAI , year=

  20. [24]

    2023 , eprint=

    GPT-4 Technical Report , author=. 2023 , eprint=

  21. [25]

    2019 , publisher=

    Rebooting AI: Building artificial intelligence we can trust , author=. 2019 , publisher=

  22. [26]

    Advances in neural information processing systems , volume=

    Language models are few-shot learners , author=. Advances in neural information processing systems , volume=

  23. [29]

    Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing , pages=

    SQuAD: 100,000+ Questions for Machine Comprehension of Text , author=. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing , pages=

  24. [30]

    Proceedings of the 44

    Ehrmann, Maud and Romanello, Matteo and Clematide, Simon and Doucet, Antoine , year =. Proceedings of the 44

  25. [31]

    Advances in neural information processing systems , volume=

    Superglue: A stickier benchmark for general-purpose language understanding systems , author=. Advances in neural information processing systems , volume=

  26. [36]

    Advances in Neural Information Processing Systems , volume=

    Training language models to follow instructions with human feedback , author=. Advances in Neural Information Processing Systems , volume=

  27. [37]

    Conference on Empirical Methods in Natural Language Processing , year=

    Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering , author=. Conference on Empirical Methods in Natural Language Processing , year=

  28. [39]

    Available at SSRN , year=

    Chatgpt goes to law school , author=. Available at SSRN , year=

  29. [40]

    2022 , eprint=

    Solving Quantitative Reasoning Problems with Language Models , author=. 2022 , eprint=

  30. [43]

    Open llm leaderboard

    Edward Beeching, Sheon Han, Nathan Lambert, Nazneen Rajani, Omar Sanseviero, Lewis Tunstall, and Thomas Wolf. Open llm leaderboard. https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard, 2023

  31. [44]

    Bowman, Gabor Angeli, Christopher Potts, and Christopher D

    Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp.\ 632--642, Lisbon, Portugal, September 2015. Association for Computational Linguistics. doi:10.18653/v1/D15-1075. URL...

  32. [45]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33: 0 1877--1901, 2020

  33. [46]

    Sparks of Artificial General Intelligence: Early experiments with GPT-4

    S \'e bastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712, 2023

  34. [47]

    Gonzalez, Ion Stoica, and Eric P

    Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90\ quality, March 2023. URL https://lmsys.org/blog/2023-03-30-vicuna/

  35. [48]

    Chatgpt goes to law school

    Jonathan H Choi, Kristin E Hickman, Amy Monahan, and Daniel Schwarcz. Chatgpt goes to law school. Available at SSRN, 2023

  36. [49]

    Scaling Instruction-Finetuned Language Models

    Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022

  37. [50]

    BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions

    Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions. arXiv preprint arXiv:1905.10044, 2019

  38. [51]

    S ent E val: An evaluation toolkit for universal sentence representations

    Alexis Conneau and Douwe Kiela. S ent E val: An evaluation toolkit for universal sentence representations. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation ( LREC 2018) , Miyazaki, Japan, May 2018. European Language Resources Association (ELRA). URL https://aclanthology.org/L18-1269

  39. [52]

    BERT: Pre-training of deep bidirectional transformers for language understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT : Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) , pp.\ 4171--4186, Minneapo...

  40. [53]

    Bold: Dataset and metrics for measuring biases in open-ended language generation

    Jwala Dhamala, Tony Sun, Varun Kumar, Satyapriya Krishna, Yada Pruksachatkun, Kai-Wei Chang, and Rahul Gupta. Bold: Dataset and metrics for measuring biases in open-ended language generation. In Proceedings of the 2021 ACM conference on fairness, accountability, and transparency, pp.\ 862--872, 2021

  41. [54]

    Glm: General language model pretraining with autoregressive blank infilling

    Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. Glm: General language model pretraining with autoregressive blank infilling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 320--335, 2022

  42. [55]

    Introducing the HIPE 2022 Shared Task:Named Entity Recognition and Linking in Multilingual Historical Documents

    Maud Ehrmann, Matteo Romanello, Simon Clematide, and Antoine Doucet. Introducing the HIPE 2022 Shared Task:Named Entity Recognition and Linking in Multilingual Historical Documents . In Proceedings of the 44 d European Conference on IR Research ( ECIR 2022) , Stavanger, Norway , 2022. Lecture Notes in Computer Science, Springer . URL https://link.springer...

  43. [56]

    Toxigen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection

    Thomas Hartvigsen, Saadia Gabriel, Hamid Palangi, Maarten Sap, Dipankar Ray, and Ece Kamar. Toxigen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection. arXiv preprint arXiv:2203.09509, 2022

  44. [57]

    Measuring mathematical problem solving with the math dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. Sort, 2 0 (4): 0 0--6

  45. [58]

    Measuring Massive Multitask Language Understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020

  46. [59]

    Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov

    Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. Natural questions: A benchmark for question answering research. Transac...

  47. [60]

    Solving quantitative reasoning problems with language models, 2022

    Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, and Vedant Misra. Solving quantitative reasoning problems with language models, 2022

  48. [61]

    Holistic Evaluation of Language Models

    Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, et al. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110, 2022

  49. [62]

    Program induction by rationale generation: Learning to solve and explain algebraic word problems

    Wang Ling, Dani Yogatama, Chris Dyer, and Phil Blunsom. Program induction by rationale generation: Learning to solve and explain algebraic word problems. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 158--167, 2017

  50. [63]

    Logiqa: a challenge dataset for machine reading comprehension with logical reasoning

    Jian Liu, Leyang Cui, Hanmeng Liu, Dandan Huang, Yile Wang, and Yue Zhang. Logiqa: a challenge dataset for machine reading comprehension with logical reasoning. In Proceedings of the Twenty-Ninth International Conference on International Joint Conferences on Artificial Intelligence, pp.\ 3622--3628, 2021

  51. [64]

    Rebooting AI: Building artificial intelligence we can trust

    Gary Marcus and Ernest Davis. Rebooting AI: Building artificial intelligence we can trust. Vintage, 2019

  52. [65]

    The Natural Language Decathlon: Multitask Learning as Question Answering

    Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. The natural language decathlon: Multitask learning as question answering. arXiv preprint arXiv:1806.08730, 2018

  53. [66]

    Can a suit of armor conduct electricity? a new dataset for open book question answering

    Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. In Conference on Empirical Methods in Natural Language Processing, 2018

  54. [67]

    Gpt-4 technical report, 2023

    OpenAI. Gpt-4 technical report, 2023

  55. [68]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35: 27730--27744, 2022

  56. [69]

    The LAMBADA dataset: Word prediction requiring a broad discourse context

    Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Quan Ngoc Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. The LAMBADA dataset: Word prediction requiring a broad discourse context. arXiv preprint arXiv:1606.06031, 2016

  57. [70]

    Squad: 100,000+ questions for machine comprehension of text

    Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp.\ 2383--2392, 2016

  58. [71]

    InternLM, 2023

    SenseTime. InternLM, 2023. https://github.com/InternLM/InternLM-techreport/

  59. [72]

    FEVER: a large-scale dataset for Fact Extraction and VERification

    James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. FEVER: a large-scale dataset for fact extraction and VERification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp.\ 809--819, New Orleans, Louisiana, 2018

  60. [73]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

  61. [74]

    GLUE: A multi-task benchmark and analysis platform for natural language understanding

    Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pp.\ 353--355, Brussels, Belgium, November 2018. Association for Computational Linguistics

  62. [75]

    Superglue: A stickier benchmark for general-purpose language understanding systems

    Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. Superglue: A stickier benchmark for general-purpose language understanding systems. Advances in neural information processing systems, 32, 2019

  63. [76]

    From LSAT: The progress and challenges of complex reasoning

    Siyuan Wang, Zhongkun Liu, Wanjun Zhong, Ming Zhou, Zhongyu Wei, Zhumin Chen, and Nan Duan. From LSAT: The progress and challenges of complex reasoning. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30: 2201--2216, 2022

  64. [77]

    Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903, 2022

  65. [78]

    OPT: Open Pre-trained Transformer Language Models

    Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022a

  66. [79]

    Automatic Chain of Thought Prompting in Large Language Models

    Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola. Automatic chain of thought prompting in large language models. arXiv preprint arXiv:2210.03493, 2022b

  67. [80]

    Jec-qa: A legal-domain question answering dataset

    Haoxi Zhong, Chaojun Xiao, Cunchao Tu, Tianyang Zhang, Zhiyuan Liu, and Maosong Sun. Jec-qa: A legal-domain question answering dataset. In Proceedings of AAAI, 2020

  68. [81]

    Analytical reasoning of text

    Wanjun Zhong, Siyuan Wang, Duyu Tang, Zenan Xu, Daya Guo, Yining Chen, Jiahai Wang, Jian Yin, Ming Zhou, and Nan Duan. Analytical reasoning of text. In Findings of the Association for Computational Linguistics: NAACL 2022, pp.\ 2306--2319, Seattle, United States, July 2022. Association for Computational Linguistics. doi:10.18653/v1/2022.findings-naacl.177...

  69. [82]

    Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021). 2021

  70. [83]

    Exploiting Auxiliary Data for Offensive Language Detection with Bidirectional Transformers

    Singh, Sumer and Li, Sheng. Exploiting Auxiliary Data for Offensive Language Detection with Bidirectional Transformers. Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021). 2021. doi:10.18653/v1/2021.woah-1.1

  71. [84]

    Modeling Profanity and Hate Speech in Social Media with Semantic Subspaces

    Hahn, Vanessa and Ruiter, Dana and Kleinbauer, Thomas and Klakow, Dietrich. Modeling Profanity and Hate Speech in Social Media with Semantic Subspaces. Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021). 2021. doi:10.18653/v1/2021.woah-1.2

  72. [85]

    HateBERT: Retraining BERT for Abusive Language Detection in English

    Caselli, Tommaso and Basile, Valerio and Mitrović, Jelena and Granitzer, Michael. HateBERT: Retraining BERT for Abusive Language Detection in English. Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021). 2021. doi:10.18653/v1/2021.woah-1.3

  73. [86]

    Memes in the Wild: Assessing the Generalizability of the Hateful Memes Challenge Dataset

    Kirk, Hannah and Jun, Yennie and Rauba, Paulius and Wachtel, Gal and Li, Ruining and Bai, Xingjian and Broestl, Noah and Doff-Sotta, Martin and Shtedritski, Aleksandar and Asano, Yuki M. Memes in the Wild: Assessing the Generalizability of the Hateful Memes Challenge Dataset. Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021). 2021. doi...

  74. [87]

    Measuring and Improving Model-Moderator Collaboration using Uncertainty Estimation

    Kivlichan, Ian and Lin, Zi and Liu, Jeremiah and Vasserman, Lucy. Measuring and Improving Model-Moderator Collaboration using Uncertainty Estimation. Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021). 2021. doi:10.18653/v1/2021.woah-1.5

  75. [88]

    DALC: the Dutch Abusive Language Corpus

    Caselli, Tommaso and Schelhaas, Arjan and Weultjes, Marieke and Leistra, Folkert and van der Veen, Hylke and Timmerman, Gerben and Nissim, Malvina. DALC: the Dutch Abusive Language Corpus. Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021). 2021. doi:10.18653/v1/2021.woah-1.6

  76. [89]

    Offensive Language Detection in Nepali Social Media

    Niraula, Nobal B. and Dulal, Saurab and Koirala, Diwa. Offensive Language Detection in Nepali Social Media. Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021). 2021. doi:10.18653/v1/2021.woah-1.7

  77. [90]

    MIN_PT: An European Portuguese Lexicon for Minorities Related Terms

    Fortuna, Paula and Cortez, Vanessa and Sozinho Ramalho, Miguel and Pérez-Mayos, Laura. MIN_PT: An European Portuguese Lexicon for Minorities Related Terms. Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021). 2021. doi:10.18653/v1/2021.woah-1.8

  78. [91]

    Fine-Grained Fairness Analysis of Abusive Language Detection Systems with CheckList

    Manerba, Marta Marchiori and Tonelli, Sara. Fine-Grained Fairness Analysis of Abusive Language Detection Systems with CheckList. Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021). 2021. doi:10.18653/v1/2021.woah-1.9

  79. [92]

    Improving Counterfactual Generation for Fair Hate Speech Detection

    Mostafazadeh Davani, Aida and Omrani, Ali and Kennedy, Brendan and Atari, Mohammad and Ren, Xiang and Dehghani, Morteza. Improving Counterfactual Generation for Fair Hate Speech Detection. Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021). 2021. doi:10.18653/v1/2021.woah-1.10

  80. [93]

    Hell Hath No Fury? Correcting Bias in the NRC Emotion Lexicon

    Zad, Samira and Jimenez, Joshuan and Finlayson, Mark. Hell Hath No Fury? Correcting Bias in the NRC Emotion Lexicon. Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021). 2021. doi:10.18653/v1/2021.woah-1.11
