pith. machine review for the scientific record.

arxiv: 2304.06364 · v2 · submitted 2023-04-13 · 💻 cs.CL · cs.AI

Recognition: 2 theorem links · Lean Theorem

AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 09:59 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords AGIEval · foundation models · benchmark · GPT-4 · standardized exams · SAT · LSAT · reasoning

The pith

The AGIEval benchmark shows GPT-4 surpassing average human performance on the SAT and LSAT, reaching 95 percent accuracy on SAT math.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces AGIEval, a benchmark that draws questions directly from standardized human exams such as college entrance tests, law school admissions, math competitions, and lawyer qualification exams to measure foundation model abilities on human-level tasks. Evaluations of GPT-4, ChatGPT, and Text-Davinci-003 find that GPT-4 exceeds average human scores on SAT, LSAT, and math competitions, reaching 95 percent accuracy on SAT math and 92.5 percent on the English section of the Chinese national college entrance exam. The models show weaker results on problems that require complex reasoning chains or narrow domain knowledge. The authors break performance into categories of understanding, knowledge, reasoning, and calculation to expose specific strengths and gaps. The approach replaces artificial datasets with tasks tied to real human cognition and decision-making.

Core claim

AGIEval evaluates foundation models on collections of real standardized exams including SAT, LSAT, math competitions, and lawyer qualification tests. GPT-4 surpasses average human performance on SAT, LSAT, and math competitions, attaining 95 percent accuracy on the SAT Math test and 92.5 percent accuracy on the English test of the Chinese national college entrance exam, while remaining less proficient on tasks that demand complex reasoning or specific domain knowledge.

What carries the argument

The AGIEval benchmark, assembled from standardized human exams to test foundation models on understanding, knowledge, reasoning, and calculation in human-relevant contexts.
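The headline numbers reduce to exact-match accuracy over multiple-choice items, broken down by the four capability categories. A minimal sketch of that aggregation, using hypothetical item records rather than the paper's actual data schema:

```python
from collections import defaultdict

def category_accuracy(items):
    """Aggregate exact-match accuracy per capability category.

    Each item is a dict holding the model's chosen option, the gold
    answer, and a category label (understanding / knowledge /
    reasoning / calculation).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        total[item["category"]] += 1
        if item["prediction"] == item["answer"]:
            correct[item["category"]] += 1
    return {cat: correct[cat] / total[cat] for cat in total}

# Hypothetical graded items, not drawn from AGIEval itself.
items = [
    {"category": "calculation", "prediction": "B", "answer": "B"},
    {"category": "calculation", "prediction": "C", "answer": "A"},
    {"category": "reasoning",   "prediction": "D", "answer": "D"},
]
print(category_accuracy(items))  # {'calculation': 0.5, 'reasoning': 1.0}
```

The per-category breakdown is what lets the paper separate "weak at calculation" from "weak at multi-step reasoning" rather than reporting one blended score.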

If this is right

  • Foundation models can now solve many exam-style questions at or above average human levels across multiple subjects.
  • Performance gaps appear most clearly in complex reasoning and domain-specific knowledge, guiding targeted improvements.
  • Capability breakdowns by category supply concrete directions for strengthening general abilities.
  • Human-exam benchmarks connect model results more directly to real-world cognitive demands than synthetic tests do.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Sustained high scores could support deployment of models as automated tutors or graders for these exact exams.
  • Gaps in complex reasoning may require architectural additions rather than further scaling alone.
  • Extending the benchmark with harder or culturally varied exam variants could track whether gains generalize.

Load-bearing premise

Standardized human exams serve as valid and unbiased proxies for general cognitive capabilities without favoring current model training methods or test formats.

What would settle it

A controlled comparison in which models achieve high AGIEval scores yet fail on equivalent non-exam problems that test the same underlying skills in open-ended or novel settings.

read the original abstract

Evaluating the general abilities of foundation models to tackle human-level tasks is a vital aspect of their development and application in the pursuit of Artificial General Intelligence (AGI). Traditional benchmarks, which rely on artificial datasets, may not accurately represent human-level capabilities. In this paper, we introduce AGIEval, a novel benchmark specifically designed to assess foundation model in the context of human-centric standardized exams, such as college entrance exams, law school admission tests, math competitions, and lawyer qualification tests. We evaluate several state-of-the-art foundation models, including GPT-4, ChatGPT, and Text-Davinci-003, using this benchmark. Impressively, GPT-4 surpasses average human performance on SAT, LSAT, and math competitions, attaining a 95% accuracy rate on the SAT Math test and a 92.5% accuracy on the English test of the Chinese national college entrance exam. This demonstrates the extraordinary performance of contemporary foundation models. In contrast, we also find that GPT-4 is less proficient in tasks that require complex reasoning or specific domain knowledge. Our comprehensive analyses of model capabilities (understanding, knowledge, reasoning, and calculation) reveal these models' strengths and limitations, providing valuable insights into future directions for enhancing their general capabilities. By concentrating on tasks pertinent to human cognition and decision-making, our benchmark delivers a more meaningful and robust evaluation of foundation models' performance in real-world scenarios. The data, code, and all model outputs are released in https://github.com/ruixiangcui/AGIEval.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces AGIEval, a benchmark assembled from publicly available human standardized exams (SAT, LSAT, math competitions, Gaokao, lawyer qualification tests). It evaluates GPT-4, ChatGPT, and Text-Davinci-003, reporting that GPT-4 exceeds average human performance on several tests (95% on SAT Math, 92.5% on Gaokao English) while showing weaker results on complex reasoning and domain-knowledge tasks. Capability breakdowns (understanding/knowledge/reasoning/calculation) and full data/code/output release are provided.

Significance. If the headline numbers survive decontamination checks, the work supplies a more ecologically valid signal of foundation-model progress than synthetic benchmarks and supplies concrete capability diagnostics plus reproducible artifacts. The public release of all model outputs strengthens the contribution.

major comments (2)
  1. [Abstract] Abstract and evaluation section: the 95% SAT-Math and 92.5% Gaokao-English figures are presented without the number of items per test, sampling protocol, exact prompt templates, or any statistical testing; these omissions leave the central claim that GPT-4 surpasses humans only moderately supported.
  2. [Evaluation] Evaluation methodology: no membership-inference, decontamination, or paraphrased-variant experiments are reported for the publicly circulated exam questions, even though the central claim (surpassing humans via reasoning) requires that performance not be explained by training-data overlap.
minor comments (2)
  1. [Figures] Figure captions and axis labels could more explicitly state the human baseline source and sample size for each exam.
  2. [Appendix] A short table summarizing prompt templates per task type would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our AGIEval benchmark paper. The comments highlight valuable opportunities to strengthen the presentation of results and the evaluation methodology. We have revised the manuscript to incorporate additional details and experiments where feasible, and we respond to each major comment below.

read point-by-point responses
  1. Referee: [Abstract] Abstract and evaluation section: the 95% SAT-Math and 92.5% Gaokao-English figures are presented without the number of items per test, sampling protocol, exact prompt templates, or any statistical testing; these omissions leave the central claim that GPT-4 surpasses humans only moderately supported.

    Authors: We agree that these supporting details are necessary to substantiate the central claims. In the revised manuscript, we have expanded the evaluation section to report the exact number of items per test (SAT Math: 58 questions; Gaokao English: 40 questions), clarified that evaluations used the full publicly available test sets with no subsampling, included the precise prompt templates in a new appendix, and added statistical testing via binomial proportion tests to confirm that GPT-4's accuracies significantly exceed the reported human averages. These changes provide stronger empirical grounding for the headline figures. revision: yes
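The binomial proportion test the rebuttal describes can be reproduced with a one-sided exact tail: with 55 of 58 SAT Math items correct (about 95 percent), how likely is that score under a human-average accuracy? The 0.70 human baseline below is an assumed placeholder for illustration, not a figure reported by the paper:

```python
from math import comb

def binom_tail(k, n, p):
    """Exact one-sided P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# 55/58 correct is roughly 94.8%; the human-average accuracy of 0.70
# is an assumed placeholder, not the paper's reported baseline.
p_value = binom_tail(55, 58, 0.70)
print(f"{p_value:.2e}")  # far below 0.05, so the null is rejected
```

With only 58 items the exact tail is cheap to compute, so there is no need for a normal approximation.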

  2. Referee: [Evaluation] Evaluation methodology: no membership-inference, decontamination, or paraphrased-variant experiments are reported for the publicly circulated exam questions, even though the central claim (surpassing humans via reasoning) requires that performance not be explained by training-data overlap.

    Authors: We acknowledge the importance of ruling out data contamination to support interpretations of reasoning ability. While full membership-inference or decontamination experiments are not feasible without access to the proprietary training data of the evaluated models, we have added paraphrased-variant experiments on subsets of the SAT and Gaokao questions in the revision; these maintain high performance, indicating robustness beyond exact memorization. We have also expanded the limitations and discussion sections to address contamination risks explicitly, noting the public nature of the exams and known training cutoffs, and we release all model outputs to support community-led analyses. revision: partial
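A paraphrased-variant probe of the kind described can be sketched as a paired comparison: score each item in its original and paraphrased form, and flag a large accuracy drop as a memorization signal. Everything here, including the 10-point drop threshold, is a hypothetical illustration rather than the authors' protocol:

```python
def contamination_signal(orig_correct, para_correct, max_drop=0.10):
    """Compare accuracy on original vs. paraphrased variants.

    orig_correct / para_correct: parallel lists of booleans, one per
    item, marking whether the model answered correctly. An accuracy
    drop larger than max_drop suggests the original wording was
    memorized rather than solved.
    """
    acc_orig = sum(orig_correct) / len(orig_correct)
    acc_para = sum(para_correct) / len(para_correct)
    drop = acc_orig - acc_para
    return {"orig": acc_orig, "para": acc_para,
            "drop": drop, "flagged": drop > max_drop}

# Hypothetical per-item results on ten questions.
orig = [True] * 9 + [False]       # 90% on original wording
para = [True] * 7 + [False] * 3   # 70% on paraphrases
print(contamination_signal(orig, para))  # drop of 0.2 -> flagged
```

A robust model should score comparably on both forms; only the direction and size of the gap are informative, since paraphrasing can also make items genuinely harder.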

Circularity Check

0 steps flagged

No circularity: benchmark is direct measurement on newly assembled external exam items

full rationale

The paper constructs AGIEval by collecting questions from public standardized exams (SAT, LSAT, Gaokao, math contests) and reports model accuracies as direct empirical measurements against published human averages. No equations, fitted parameters, or predictions are derived; the central claims (e.g., GPT-4 at 95% SAT Math) are simple accuracy counts on the collected items. No self-citations, uniqueness theorems, or ansatzes are invoked to justify results. The derivation chain is therefore self-contained as straightforward benchmarking.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The paper introduces a new evaluation benchmark without introducing fitted parameters, unstated mathematical axioms, or new physical entities; the only added construct is the benchmark collection itself.

invented entities (1)
  • AGIEval benchmark (no independent evidence)
    purpose: To provide human-centric standardized exam questions for evaluating foundation models
    The benchmark is assembled and released in this work.

pith-pipeline@v0.9.0 · 5602 in / 1137 out tokens · 42859 ms · 2026-05-16T09:59:54.385025+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

    cs.CL 2023-11 unverdicted novelty 8.0

    MMMU provides 11.5K heterogeneous college-level multimodal questions that current models solve at 56-59% accuracy, establishing a new standard for expert multimodal evaluation.

  2. Human-Grounded Multimodal Benchmark with 900K-Scale Aggregated Student Response Distributions from Japan's National Assessment of Academic Ability

    cs.CL 2026-05 unverdicted novelty 7.0

    A new benchmark dataset drawn from Japan's National Assessment of Academic Ability supplies real exam layouts, diagrams, Japanese text, and nationwide student response distributions for evaluating multimodal LLMs.

  3. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

    cs.CL 2024-05 unverdicted novelty 7.0

    DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.

  4. Train Separately, Merge Together: Modular Post-Training with Mixture-of-Experts

    cs.LG 2026-04 unverdicted novelty 6.0

    BAR trains independent domain experts via separate mid-training, SFT, and RL pipelines then composes them with a MoE router to match monolithic retraining performance at lower cost and without catastrophic forgetting.

  5. Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence

    cs.AI 2026-04 unverdicted novelty 6.0

    Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.

  6. The Master Key Hypothesis: Unlocking Cross-Model Capability Transfer via Linear Subspace Alignment

    cs.LG 2026-04 unverdicted novelty 6.0

    The Master Key Hypothesis states that capabilities are low-dimensional directions transferable across models through linear subspace alignment, with UNLOCK demonstrating gains such as 12.1% accuracy improvement on MAT...

  7. Pixtral 12B

    cs.CV 2024-10 unverdicted novelty 6.0

    Pixtral-12B is a 12B multimodal LLM with a custom vision encoder that ingests images at native resolution and aspect ratio, achieving leading benchmark results among open models while preserving text capabilities.

  8. DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence

    cs.SE 2024-06 unverdicted novelty 6.0

    An open-source MoE code model matches GPT-4 Turbo on coding and math benchmarks while expanding to 338 languages and 128K context length.

  9. MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

    cs.CL 2024-06 conditional novelty 6.0

    MMLU-Pro is a revised benchmark that makes language model evaluation harder and more stable by using ten options per question and emphasizing reasoning over simple knowledge recall.

  10. GPT-4V(ision) is a Generalist Web Agent, if Grounded

    cs.IR 2024-01 conditional novelty 6.0

    GPT-4V achieves 51.1% success on live web tasks as a generalist agent when plans are manually grounded, outperforming text-only models, but automatic grounding lags far behind oracle performance.

  11. The Falcon Series of Open Language Models

    cs.CL 2023-11 conditional novelty 6.0

    Falcon-180B is a 180B-parameter open decoder-only model trained on 3.5 trillion tokens that approaches PaLM-2-Large performance at lower cost and is released with dataset extracts.

  12. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

    cs.CL 2023-06 accept novelty 6.0

    GPT-4 as an LLM judge achieves over 80% agreement with human preferences on MT-Bench and Chatbot Arena, matching human agreement levels and providing a scalable evaluation method.

  13. Kimi K2: Open Agentic Intelligence

    cs.LG 2025-07 unverdicted novelty 5.0

    Kimi K2 is a 1-trillion-parameter MoE model that leads open-source non-thinking models on agentic benchmarks including 65.8 on SWE-Bench Verified and 66.1 on Tau2-Bench.

  14. Humanity's Last Exam

    cs.LG 2025-01 unverdicted novelty 5.0

    Humanity's Last Exam is a new 2,500-question benchmark at the frontier of human knowledge where state-of-the-art LLMs show low accuracy.

  15. "I Don't Know" -- Towards Appropriate Trust with Certainty-Aware Retrieval Augmented Generation

    cs.IR 2026-05 unverdicted novelty 4.0

    CERTA adds relevance-based certainty estimation to RAG so LLMs can better signal uncertainty on non-objective questions, reducing overconfidence.

  16. Ministral 3

    cs.CL 2026-01 unverdicted novelty 4.0

    Ministral 3 releases 3B/8B/14B parameter-efficient language models with base, instruction, and reasoning variants derived via iterative pruning and distillation, including image understanding capabilities.

  17. DeepSeek-VL: Towards Real-World Vision-Language Understanding

    cs.AI 2024-03 unverdicted novelty 4.0

    DeepSeek-VL develops open-source 1.3B and 7B vision-language models that achieve competitive or state-of-the-art results on real-world visual-language benchmarks through diverse data curation, a hybrid vision encoder,...

  18. DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

    cs.CL 2024-01 unverdicted novelty 4.0

    DeepSeek LLM 67B exceeds LLaMA-2 70B on code, mathematics and reasoning benchmarks after pre-training on 2 trillion tokens and alignment via SFT and DPO.

Reference graph

Works this paper leans on

287 extracted references · 287 canonical work pages · cited by 18 Pith papers · 12 internal anchors

  1. [1]

    Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence,

    Reasoning over Hybrid Chain for Table-and-Text Open Domain Question Answering , author =. Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence,. 2022 , month =. doi:10.24963/ijcai.2022/629 , url =

  2. [2]

    Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics , pages=

    Reasoning Over Semantic-Level Graph for Fact Checking , author=. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics , pages=

  3. [3]

    2023 , publisher =

    Beeching, Edward and Han, Sheon and Lambert, Nathan and Rajani, Nazneen and Sanseviero, Omar and Tunstall, Lewis and Wolf, Thomas , title =. 2023 , publisher =

  4. [4]

    Communications of the ACM , volume=

    Datasheets for datasets , author=. Communications of the ACM , volume=. 2021 , publisher=

  5. [5]

    Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

    GLM: General Language Model Pretraining with Autoregressive Blank Infilling , author=. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

  6. [6]

    Proceedings of the 2021 ACM conference on fairness, accountability, and transparency , pages=

    Bold: Dataset and metrics for measuring biases in open-ended language generation , author=. Proceedings of the 2021 ACM conference on fairness, accountability, and transparency , pages=

  7. [9]

    and Stoica, Ion and Xing, Eric P

    Chiang, Wei-Lin and Li, Zhuohan and Lin, Zi and Sheng, Ying and Wu, Zhanghao and Zhang, Hao and Zheng, Lianmin and Zhuang, Siyuan and Zhuang, Yonghao and Gonzalez, Joseph E. and Stoica, Ion and Xing, Eric P. , month =. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90\ url =

  8. [10]

    Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics , pages=

    LogicalFactChecker: Leveraging Logical Operations for Fact Checking with Graph Module Network , author=. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics , pages=

  9. [11]

    Syntax-Enhanced Pre-trained Model , author=. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) , pages=

  10. [12]

    Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) , pages=

    Neural Deepfake Detection with Factual Structure of Text , author=. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) , pages=

  11. [13]

    Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies , pages=

    ProQA: Structural Prompt-based Pre-training for Unified Question Answering , author=. Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies , pages=

  12. [14]

    Findings of the Association for Computational Linguistics: NAACL 2022 , pages=

    Analytical Reasoning of Text , author=. Findings of the Association for Computational Linguistics: NAACL 2022 , pages=

  13. [15]

    Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021 , pages=

    UserAdapter: Few-shot user learning in sentiment analysis , author=. Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021 , pages=

  14. [16]

    Natural Language Processing and Chinese Computing: 8th CCF International Conference, NLPCC 2019, Dunhuang, China, October 9--14, 2019, Proceedings, Part I , pages=

    Improving Question Answering by Commonsense-Based Pre-training , author=. Natural Language Processing and Chinese Computing: 8th CCF International Conference, NLPCC 2019, Dunhuang, China, October 9--14, 2019, Proceedings, Part I , pages=

  15. [17]

    IEEE/ACM Transactions on Audio, Speech, and Language Processing , volume=

    From lsat: The progress and challenges of complex reasoning , author=. IEEE/ACM Transactions on Audio, Speech, and Language Processing , volume=. 2022 , publisher=

  16. [18]

    Proceedings of the Twenty-Ninth International Conference on International Joint Conferences on Artificial Intelligence , pages=

    LogiQA: a challenge dataset for machine reading comprehension with logical reasoning , author=. Proceedings of the Twenty-Ninth International Conference on International Joint Conferences on Artificial Intelligence , pages=

  17. [19]

    Sort , volume=

    Measuring Mathematical Problem Solving With the MATH Dataset , author=. Sort , volume=

  18. [20]

    Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

    Program Induction by Rationale Generation: Learning to Solve and Explain Algebraic Word Problems , author=. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

  19. [21]

    Proceedings of AAAI , year=

    JEC-QA: A Legal-Domain Question Answering Dataset , author=. Proceedings of AAAI , year=

  20. [24]

    2023 , eprint=

    GPT-4 Technical Report , author=. 2023 , eprint=

  21. [25]

    2019 , publisher=

    Rebooting AI: Building artificial intelligence we can trust , author=. 2019 , publisher=

  22. [26]

    Advances in neural information processing systems , volume=

    Language models are few-shot learners , author=. Advances in neural information processing systems , volume=

  23. [29]

    Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing , pages=

    SQuAD: 100,000+ Questions for Machine Comprehension of Text , author=. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing , pages=

  24. [30]

    Proceedings of the 44

    Ehrmann, Maud and Romanello, Matteo and Clematide, Simon and Doucet, Antoine , year =. Proceedings of the 44

  25. [31]

    Advances in neural information processing systems , volume=

    Superglue: A stickier benchmark for general-purpose language understanding systems , author=. Advances in neural information processing systems , volume=

  26. [36]

    Advances in Neural Information Processing Systems , volume=

    Training language models to follow instructions with human feedback , author=. Advances in Neural Information Processing Systems , volume=

  27. [37]

    Conference on Empirical Methods in Natural Language Processing , year=

    Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering , author=. Conference on Empirical Methods in Natural Language Processing , year=

  28. [39]

    Available at SSRN , year=

    Chatgpt goes to law school , author=. Available at SSRN , year=

  29. [40]

    2022 , eprint=

    Solving Quantitative Reasoning Problems with Language Models , author=. 2022 , eprint=

  30. [43]

    Open llm leaderboard

    Edward Beeching, Sheon Han, Nathan Lambert, Nazneen Rajani, Omar Sanseviero, Lewis Tunstall, and Thomas Wolf. Open llm leaderboard. https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard, 2023

  31. [44]

    Bowman, Gabor Angeli, Christopher Potts, and Christopher D

    Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp.\ 632--642, Lisbon, Portugal, September 2015. Association for Computational Linguistics. doi:10.18653/v1/D15-1075. URL...

  32. [45]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33: 0 1877--1901, 2020

  33. [46]

    Sparks of Artificial General Intelligence: Early experiments with GPT-4

    S \'e bastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712, 2023

  34. [47]

    Gonzalez, Ion Stoica, and Eric P

    Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90\ quality, March 2023. URL https://lmsys.org/blog/2023-03-30-vicuna/

  35. [48]

    Chatgpt goes to law school

    Jonathan H Choi, Kristin E Hickman, Amy Monahan, and Daniel Schwarcz. Chatgpt goes to law school. Available at SSRN, 2023

  36. [49]

    Scaling Instruction-Finetuned Language Models

    Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022

  37. [50]

    BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions

    Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions. arXiv preprint arXiv:1905.10044, 2019

  38. [51]

    S ent E val: An evaluation toolkit for universal sentence representations

    Alexis Conneau and Douwe Kiela. S ent E val: An evaluation toolkit for universal sentence representations. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation ( LREC 2018) , Miyazaki, Japan, May 2018. European Language Resources Association (ELRA). URL https://aclanthology.org/L18-1269

  39. [52]

    BERT: Pre-training of deep bidirectional transformers for language understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT : Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) , pp.\ 4171--4186, Minneapo...

  40. [53]

    Bold: Dataset and metrics for measuring biases in open-ended language generation

    Jwala Dhamala, Tony Sun, Varun Kumar, Satyapriya Krishna, Yada Pruksachatkun, Kai-Wei Chang, and Rahul Gupta. Bold: Dataset and metrics for measuring biases in open-ended language generation. In Proceedings of the 2021 ACM conference on fairness, accountability, and transparency, pp.\ 862--872, 2021

  41. [54]

    Glm: General language model pretraining with autoregressive blank infilling

    Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. Glm: General language model pretraining with autoregressive blank infilling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 320--335, 2022

  42. [55]

    Introducing the HIPE 2022 Shared Task:Named Entity Recognition and Linking in Multilingual Historical Documents

    Maud Ehrmann, Matteo Romanello, Simon Clematide, and Antoine Doucet. Introducing the HIPE 2022 Shared Task:Named Entity Recognition and Linking in Multilingual Historical Documents . In Proceedings of the 44 d European Conference on IR Research ( ECIR 2022) , Stavanger, Norway , 2022. Lecture Notes in Computer Science, Springer . URL https://link.springer...

  43. [56]

    Toxigen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection

    Thomas Hartvigsen, Saadia Gabriel, Hamid Palangi, Maarten Sap, Dipankar Ray, and Ece Kamar. Toxigen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection. arXiv preprint arXiv:2203.09509, 2022

  44. [57]

    Measuring mathematical problem solving with the math dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. Sort, 2 0 (4): 0 0--6

  45. [58]

    Measuring Massive Multitask Language Understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020

  46. [59]

    Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov

    Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. Natural questions: A benchmark for question answering research. Transac...

  47. [60]

    Solving quantitative reasoning problems with language models, 2022

    Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, and Vedant Misra. Solving quantitative reasoning problems with language models, 2022

  48. [61]

    Holistic Evaluation of Language Models

    Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, et al. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110, 2022

  49. [62]

    Program induction by rationale generation: Learning to solve and explain algebraic word problems

    Wang Ling, Dani Yogatama, Chris Dyer, and Phil Blunsom. Program induction by rationale generation: Learning to solve and explain algebraic word problems. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 158--167, 2017

  50. [63]

    Logiqa: a challenge dataset for machine reading comprehension with logical reasoning

    Jian Liu, Leyang Cui, Hanmeng Liu, Dandan Huang, Yile Wang, and Yue Zhang. Logiqa: a challenge dataset for machine reading comprehension with logical reasoning. In Proceedings of the Twenty-Ninth International Conference on International Joint Conferences on Artificial Intelligence, pp.\ 3622--3628, 2021

  51. [64]

    Rebooting AI: Building artificial intelligence we can trust

    Gary Marcus and Ernest Davis. Rebooting AI: Building artificial intelligence we can trust. Vintage, 2019

  52. [65]

    The Natural Language Decathlon: Multitask Learning as Question Answering

    Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. The natural language decathlon: Multitask learning as question answering. arXiv preprint arXiv:1806.08730, 2018

  53. [66]

    Can a suit of armor conduct electricity? a new dataset for open book question answering

    Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. In Conference on Empirical Methods in Natural Language Processing, 2018

  54. [67]

    Gpt-4 technical report, 2023

    OpenAI. Gpt-4 technical report, 2023

  55. [68]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35: 27730--27744, 2022

  56. [69]

    The LAMBADA dataset: Word prediction requiring a broad discourse context

    Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Quan Ngoc Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. The LAMBADA dataset: Word prediction requiring a broad discourse context. arXiv preprint arXiv:1606.06031, 2016

  57. [70]

    Squad: 100,000+ questions for machine comprehension of text

    Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp.\ 2383--2392, 2016

  58. [71]

    InternLM, 2023

    SenseTime. InternLM, 2023. https://github.com/InternLM/InternLM-techreport/

  59. [72]

    FEVER: a large-scale dataset for Fact Extraction and VERification

    James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. FEVER: a large-scale dataset for fact extraction and VERification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp.\ 809--819, New Orleans, Louisiana, 2018

  60. [73]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

  61. [74]

    GLUE: A multi-task benchmark and analysis platform for natural language understanding

    Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pp.\ 353--355, Brussels, Belgium, November 2018. Association for Computational Linguistics

  62. [75]

    Superglue: A stickier benchmark for general-purpose language understanding systems

    Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. Superglue: A stickier benchmark for general-purpose language understanding systems. Advances in neural information processing systems, 32, 2019

  63. [76]

    From LSAT: The progress and challenges of complex reasoning

    Siyuan Wang, Zhongkun Liu, Wanjun Zhong, Ming Zhou, Zhongyu Wei, Zhumin Chen, and Nan Duan. From LSAT: The progress and challenges of complex reasoning. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30: 2201--2216, 2022

  64. [77]

    Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903, 2022

  65. [78]

    OPT: Open Pre-trained Transformer Language Models

    Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022a

  66. [79]

    Automatic Chain of Thought Prompting in Large Language Models

    Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola. Automatic chain of thought prompting in large language models. arXiv preprint arXiv:2210.03493, 2022b

  67. [80]

    Jec-qa: A legal-domain question answering dataset

    Haoxi Zhong, Chaojun Xiao, Cunchao Tu, Tianyang Zhang, Zhiyuan Liu, and Maosong Sun. Jec-qa: A legal-domain question answering dataset. In Proceedings of AAAI, 2020

  68. [81]

    Analytical reasoning of text

    Wanjun Zhong, Siyuan Wang, Duyu Tang, Zenan Xu, Daya Guo, Yining Chen, Jiahai Wang, Jian Yin, Ming Zhou, and Nan Duan. Analytical reasoning of text. In Findings of the Association for Computational Linguistics: NAACL 2022, pp.\ 2306--2319, Seattle, United States, July 2022. Association for Computational Linguistics. doi:10.18653/v1/2022.findings-naacl.177...

  69. [82]

    Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021). 2021

  70. [83]

    Exploiting Auxiliary Data for Offensive Language Detection with Bidirectional Transformers

    Singh, Sumer and Li, Sheng. Exploiting Auxiliary Data for Offensive Language Detection with Bidirectional Transformers. Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021). 2021. doi:10.18653/v1/2021.woah-1.1

  71. [84]

    Modeling Profanity and Hate Speech in Social Media with Semantic Subspaces

    Hahn, Vanessa and Ruiter, Dana and Kleinbauer, Thomas and Klakow, Dietrich. Modeling Profanity and Hate Speech in Social Media with Semantic Subspaces. Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021). 2021. doi:10.18653/v1/2021.woah-1.2

  72. [85]

    HateBERT: Retraining BERT for Abusive Language Detection in English

    Caselli, Tommaso and Basile, Valerio and Mitrović, Jelena and Granitzer, Michael. HateBERT: Retraining BERT for Abusive Language Detection in English. Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021). 2021. doi:10.18653/v1/2021.woah-1.3

  73. [86]

    Memes in the Wild: Assessing the Generalizability of the Hateful Memes Challenge Dataset

    Kirk, Hannah and Jun, Yennie and Rauba, Paulius and Wachtel, Gal and Li, Ruining and Bai, Xingjian and Broestl, Noah and Doff-Sotta, Martin and Shtedritski, Aleksandar and Asano, Yuki M. Memes in the Wild: Assessing the Generalizability of the Hateful Memes Challenge Dataset. Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021). 2021. doi...

  74. [87]

    Measuring and Improving Model-Moderator Collaboration using Uncertainty Estimation

    Kivlichan, Ian and Lin, Zi and Liu, Jeremiah and Vasserman, Lucy. Measuring and Improving Model-Moderator Collaboration using Uncertainty Estimation. Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021). 2021. doi:10.18653/v1/2021.woah-1.5

  75. [88]

    DALC: the Dutch Abusive Language Corpus

    Caselli, Tommaso and Schelhaas, Arjan and Weultjes, Marieke and Leistra, Folkert and van der Veen, Hylke and Timmerman, Gerben and Nissim, Malvina. DALC: the Dutch Abusive Language Corpus. Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021). 2021. doi:10.18653/v1/2021.woah-1.6

  76. [89]

    Offensive Language Detection in Nepali Social Media

    Niraula, Nobal B. and Dulal, Saurab and Koirala, Diwa. Offensive Language Detection in Nepali Social Media. Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021). 2021. doi:10.18653/v1/2021.woah-1.7

  77. [90]

    MIN_PT: An European Portuguese Lexicon for Minorities Related Terms

    Fortuna, Paula and Cortez, Vanessa and Sozinho Ramalho, Miguel and Pérez-Mayos, Laura. MIN_PT: An European Portuguese Lexicon for Minorities Related Terms. Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021). 2021. doi:10.18653/v1/2021.woah-1.8

  78. [91]

    Fine-Grained Fairness Analysis of Abusive Language Detection Systems with CheckList

    Manerba, Marta Marchiori and Tonelli, Sara. Fine-Grained Fairness Analysis of Abusive Language Detection Systems with CheckList. Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021). 2021. doi:10.18653/v1/2021.woah-1.9

  79. [92]

    Improving Counterfactual Generation for Fair Hate Speech Detection

    Mostafazadeh Davani, Aida and Omrani, Ali and Kennedy, Brendan and Atari, Mohammad and Ren, Xiang and Dehghani, Morteza. Improving Counterfactual Generation for Fair Hate Speech Detection. Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021). 2021. doi:10.18653/v1/2021.woah-1.10

  80. [93]

    Hell Hath No Fury? Correcting Bias in the NRC Emotion Lexicon

    Zad, Samira and Jimenez, Joshuan and Finlayson, Mark. Hell Hath No Fury? Correcting Bias in the NRC Emotion Lexicon. Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021). 2021. doi:10.18653/v1/2021.woah-1.11
