Recognition: 2 theorem links
AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models
Pith reviewed 2026-05-16 09:59 UTC · model grok-4.3
The pith
The AGIEval benchmark shows GPT-4 surpassing average human performance on the SAT, the LSAT, and math competitions, reaching 95 percent accuracy on SAT Math.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AGIEval evaluates foundation models on collections of real standardized exams including SAT, LSAT, math competitions, and lawyer qualification tests. GPT-4 surpasses average human performance on SAT, LSAT, and math competitions, attaining 95 percent accuracy on the SAT Math test and 92.5 percent accuracy on the English test of the Chinese national college entrance exam, while remaining less proficient on tasks that demand complex reasoning or specific domain knowledge.
What carries the argument
The AGIEval benchmark, assembled from standardized human exams to test foundation models on understanding, knowledge, reasoning, and calculation in human-relevant contexts.
If this is right
- Foundation models can now solve many exam-style questions at or above average human levels across multiple subjects.
- Performance gaps appear most clearly in complex reasoning and domain-specific knowledge, guiding targeted improvements.
- Capability breakdowns by category supply concrete directions for strengthening general abilities.
- Human-exam benchmarks connect model results more directly to real-world cognitive demands than synthetic tests do.
Where Pith is reading between the lines
- Sustained high scores could support deployment of models as automated tutors or graders for these exact exams.
- Gaps in complex reasoning may require architectural additions rather than further scaling alone.
- Extending the benchmark with harder or culturally varied exam variants could track whether gains generalize.
Load-bearing premise
Standardized human exams serve as valid and unbiased proxies for general cognitive capabilities without favoring current model training methods or test formats.
What would settle it
A controlled comparison in which models achieve high AGIEval scores yet fail on equivalent non-exam problems that test the same underlying skills in open-ended or novel settings.
read the original abstract
Evaluating the general abilities of foundation models to tackle human-level tasks is a vital aspect of their development and application in the pursuit of Artificial General Intelligence (AGI). Traditional benchmarks, which rely on artificial datasets, may not accurately represent human-level capabilities. In this paper, we introduce AGIEval, a novel benchmark specifically designed to assess foundation models in the context of human-centric standardized exams, such as college entrance exams, law school admission tests, math competitions, and lawyer qualification tests. We evaluate several state-of-the-art foundation models, including GPT-4, ChatGPT, and Text-Davinci-003, using this benchmark. Impressively, GPT-4 surpasses average human performance on SAT, LSAT, and math competitions, attaining a 95% accuracy rate on the SAT Math test and a 92.5% accuracy on the English test of the Chinese national college entrance exam. This demonstrates the extraordinary performance of contemporary foundation models. In contrast, we also find that GPT-4 is less proficient in tasks that require complex reasoning or specific domain knowledge. Our comprehensive analyses of model capabilities (understanding, knowledge, reasoning, and calculation) reveal these models' strengths and limitations, providing valuable insights into future directions for enhancing their general capabilities. By concentrating on tasks pertinent to human cognition and decision-making, our benchmark delivers a more meaningful and robust evaluation of foundation models' performance in real-world scenarios. The data, code, and all model outputs are released at https://github.com/ruixiangcui/AGIEval.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces AGIEval, a benchmark assembled from publicly available human standardized exams (SAT, LSAT, math competitions, Gaokao, lawyer qualification tests). It evaluates GPT-4, ChatGPT, and Text-Davinci-003, reporting that GPT-4 exceeds average human performance on several tests (95% on SAT Math, 92.5% on Gaokao English) while showing weaker results on complex reasoning and domain-knowledge tasks. Capability breakdowns (understanding/knowledge/reasoning/calculation) and full data/code/output release are provided.
Significance. If the headline numbers survive decontamination checks, the work supplies a more ecologically valid signal of foundation-model progress than synthetic benchmarks, along with concrete capability diagnostics and reproducible artifacts. The public release of all model outputs strengthens the contribution.
major comments (2)
- [Abstract] Abstract and evaluation section: the 95% SAT-Math and 92.5% Gaokao-English figures are presented without the number of items per test, sampling protocol, exact prompt templates, or any statistical testing; these omissions leave the central claim that GPT-4 surpasses humans only moderately supported.
- [Evaluation] Evaluation methodology: no membership-inference, decontamination, or paraphrased-variant experiments are reported for the publicly circulated exam questions, even though the central claim (surpassing humans via reasoning) requires that performance not be explained by training-data overlap.
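The statistical-testing gap flagged above can be made concrete. A minimal Python sketch, using the 55-of-58 SAT Math counts the simulated rebuttal later cites; the 0.70 human-average baseline is an illustrative assumption, not a figure from the paper:

```python
from math import comb, sqrt

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

def binom_test_greater(successes: int, n: int, p0: float) -> float:
    """One-sided exact binomial test: P(X >= successes) under H0: p = p0."""
    return sum(comb(n, k) * p0**k * (1 - p0)**(n - k)
               for k in range(successes, n + 1))

lo, hi = wilson_interval(55, 58)            # ~ (0.859, 0.982): still a wide interval
p_value = binom_test_greater(55, 58, 0.70)  # tiny: 55/58 sits far above a 0.70 baseline
print(f"CI ({lo:.3f}, {hi:.3f}), one-sided p = {p_value:.1e}")
```

Even at 95 percent observed accuracy, 58 items leave roughly a twelve-point confidence interval, which is why the referee asks for item counts and significance tests alongside headline figures.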
minor comments (2)
- [Figures] Figure captions and axis labels could more explicitly state the human baseline source and sample size for each exam.
- [Appendix] A short table summarizing prompt templates per task type would improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our AGIEval benchmark paper. The comments highlight valuable opportunities to strengthen the presentation of results and the evaluation methodology. We have revised the manuscript to incorporate additional details and experiments where feasible, and we respond to each major comment below.
read point-by-point responses
-
Referee: [Abstract] Abstract and evaluation section: the 95% SAT-Math and 92.5% Gaokao-English figures are presented without the number of items per test, sampling protocol, exact prompt templates, or any statistical testing; these omissions leave the central claim that GPT-4 surpasses humans only moderately supported.
Authors: We agree that these supporting details are necessary to substantiate the central claims. In the revised manuscript, we have expanded the evaluation section to report the exact number of items per test (SAT Math: 58 questions; Gaokao English: 40 questions), clarified that evaluations used the full publicly available test sets with no subsampling, included the precise prompt templates in a new appendix, and added statistical testing via binomial proportion tests to confirm that GPT-4's accuracies significantly exceed the reported human averages. These changes provide stronger empirical grounding for the headline figures. revision: yes
-
Referee: [Evaluation] Evaluation methodology: no membership-inference, decontamination, or paraphrased-variant experiments are reported for the publicly circulated exam questions, even though the central claim (surpassing humans via reasoning) requires that performance not be explained by training-data overlap.
Authors: We acknowledge the importance of ruling out data contamination to support interpretations of reasoning ability. While full membership-inference or decontamination experiments are not feasible without access to the proprietary training data of the evaluated models, we have added paraphrased-variant experiments on subsets of the SAT and Gaokao questions in the revision; these maintain high performance, indicating robustness beyond exact memorization. We have also expanded the limitations and discussion sections to address contamination risks explicitly, noting the public nature of the exams and known training cutoffs, and we release all model outputs to support community-led analyses. revision: partial
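The paraphrased-variant check described in the response reduces to comparing per-item accuracy on original versus reworded questions and flagging a drop large enough to suggest memorization. A hedged sketch; the item results and the ten-point threshold below are illustrative, not the paper's protocol:

```python
def accuracy(results: list[bool]) -> float:
    """Fraction of items answered correctly."""
    return sum(results) / len(results)

def memorization_flag(original: list[bool], paraphrased: list[bool],
                      max_drop: float = 0.10) -> bool:
    """True if accuracy on paraphrased items falls more than `max_drop`
    below accuracy on the originally worded items."""
    return accuracy(original) - accuracy(paraphrased) > max_drop

# Toy run: 20 paired items, the model keeps most answers under rewording.
original_results = [True] * 19 + [False]        # 95% on original wording
paraphrase_results = [True] * 18 + [False] * 2  # 90% on paraphrases
print(memorization_flag(original_results, paraphrase_results))  # 5-point drop: False
```

A run that prints True on a sizable item subset would be the contamination signal the referee asks about; a small drop, as in the toy numbers here, is consistent with the authors' robustness claim.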
Circularity Check
No circularity: the benchmark is a direct measurement on newly assembled external exam items.
full rationale
The paper constructs AGIEval by collecting questions from public standardized exams (SAT, LSAT, Gaokao, math contests) and reports model accuracies as direct empirical measurements against published human averages. No equations, fitted parameters, or predictions are derived; the central claims (e.g., GPT-4 at 95% SAT Math) are simple accuracy counts on the collected items. No self-citations, uniqueness theorems, or ansatzes are invoked to justify results. The derivation chain is therefore self-contained as straightforward benchmarking.
Axiom & Free-Parameter Ledger
invented entities (1)
-
AGIEval benchmark
no independent evidence
Lean theorems connected to this paper
-
Foundation.HierarchyEmergence.hierarchy_emergence_forces_phi (tagged: unclear)
unclear: the relation between the paper passage and the cited Recognition theorem is ambiguous.
We introduce AGIEval, a novel benchmark specifically designed to assess foundation models in the context of human-centric standardized exams
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 18 Pith papers
-
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
MMMU provides 11.5K heterogeneous college-level multimodal questions that current models solve at 56-59% accuracy, establishing a new standard for expert multimodal evaluation.
-
Human-Grounded Multimodal Benchmark with 900K-Scale Aggregated Student Response Distributions from Japan's National Assessment of Academic Ability
A new benchmark dataset drawn from Japan's National Assessment of Academic Ability supplies real exam layouts, diagrams, Japanese text, and nationwide student response distributions for evaluating multimodal LLMs.
-
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.
-
Train Separately, Merge Together: Modular Post-Training with Mixture-of-Experts
BAR trains independent domain experts via separate mid-training, SFT, and RL pipelines then composes them with a MoE router to match monolithic retraining performance at lower cost and without catastrophic forgetting.
-
Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence
Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.
-
The Master Key Hypothesis: Unlocking Cross-Model Capability Transfer via Linear Subspace Alignment
The Master Key Hypothesis states that capabilities are low-dimensional directions transferable across models through linear subspace alignment, with UNLOCK demonstrating gains such as 12.1% accuracy improvement on MAT...
-
Pixtral 12B
Pixtral-12B is a 12B multimodal LLM with a custom vision encoder that ingests images at native resolution and aspect ratio, achieving leading benchmark results among open models while preserving text capabilities.
-
DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence
An open-source MoE code model matches GPT-4 Turbo on coding and math benchmarks while expanding to 338 languages and 128K context length.
-
MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark
MMLU-Pro is a revised benchmark that makes language model evaluation harder and more stable by using ten options per question and emphasizing reasoning over simple knowledge recall.
-
GPT-4V(ision) is a Generalist Web Agent, if Grounded
GPT-4V achieves 51.1% success on live web tasks as a generalist agent when plans are manually grounded, outperforming text-only models, but automatic grounding lags far behind oracle performance.
-
The Falcon Series of Open Language Models
Falcon-180B is a 180B-parameter open decoder-only model trained on 3.5 trillion tokens that approaches PaLM-2-Large performance at lower cost and is released with dataset extracts.
-
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
GPT-4 as an LLM judge achieves over 80% agreement with human preferences on MT-Bench and Chatbot Arena, matching human agreement levels and providing a scalable evaluation method.
-
Kimi K2: Open Agentic Intelligence
Kimi K2 is a 1-trillion-parameter MoE model that leads open-source non-thinking models on agentic benchmarks including 65.8 on SWE-Bench Verified and 66.1 on Tau2-Bench.
-
Humanity's Last Exam
Humanity's Last Exam is a new 2,500-question benchmark at the frontier of human knowledge where state-of-the-art LLMs show low accuracy.
-
"I Don't Know" -- Towards Appropriate Trust with Certainty-Aware Retrieval Augmented Generation
CERTA adds relevance-based certainty estimation to RAG so LLMs can better signal uncertainty on non-objective questions, reducing overconfidence.
-
Ministral 3
Ministral 3 releases 3B/8B/14B parameter-efficient language models with base, instruction, and reasoning variants derived via iterative pruning and distillation, including image understanding capabilities.
-
DeepSeek-VL: Towards Real-World Vision-Language Understanding
DeepSeek-VL develops open-source 1.3B and 7B vision-language models that achieve competitive or state-of-the-art results on real-world visual-language benchmarks through diverse data curation, a hybrid vision encoder,...
-
DeepSeek LLM: Scaling Open-Source Language Models with Longtermism
DeepSeek LLM 67B exceeds LLaMA-2 70B on code, mathematics and reasoning benchmarks after pre-training on 2 trillion tokens and alignment via SFT and DPO.
Reference graph
Works this paper leans on
-
[1]
Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence,
Reasoning over Hybrid Chain for Table-and-Text Open Domain Question Answering , author =. Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence,. 2022 , month =. doi:10.24963/ijcai.2022/629 , url =
-
[2]
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics , pages=
Reasoning Over Semantic-Level Graph for Fact Checking , author=. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics , pages=
-
[3]
Beeching, Edward and Han, Sheon and Lambert, Nathan and Rajani, Nazneen and Sanseviero, Omar and Tunstall, Lewis and Wolf, Thomas , title =. 2023 , publisher =
work page 2023
-
[4]
Communications of the ACM , volume=
Datasheets for datasets , author=. Communications of the ACM , volume=. 2021 , publisher=
work page 2021
-
[5]
GLM: General Language Model Pretraining with Autoregressive Blank Infilling , author=. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=
-
[6]
Proceedings of the 2021 ACM conference on fairness, accountability, and transparency , pages=
Bold: Dataset and metrics for measuring biases in open-ended language generation , author=. Proceedings of the 2021 ACM conference on fairness, accountability, and transparency , pages=
work page 2021
-
[9]
and Stoica, Ion and Xing, Eric P
Chiang, Wei-Lin and Li, Zhuohan and Lin, Zi and Sheng, Ying and Wu, Zhanghao and Zhang, Hao and Zheng, Lianmin and Zhuang, Siyuan and Zhuang, Yonghao and Gonzalez, Joseph E. and Stoica, Ion and Xing, Eric P. , month =. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90\ url =
-
[10]
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics , pages=
LogicalFactChecker: Leveraging Logical Operations for Fact Checking with Graph Module Network , author=. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics , pages=
-
[11]
Syntax-Enhanced Pre-trained Model , author=. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) , pages=
-
[12]
Neural Deepfake Detection with Factual Structure of Text , author=. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) , pages=
work page 2020
-
[13]
ProQA: Structural Prompt-based Pre-training for Unified Question Answering , author=. Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies , pages=
work page 2022
-
[14]
Findings of the Association for Computational Linguistics: NAACL 2022 , pages=
Analytical Reasoning of Text , author=. Findings of the Association for Computational Linguistics: NAACL 2022 , pages=
work page 2022
-
[15]
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021 , pages=
UserAdapter: Few-shot user learning in sentiment analysis , author=. Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021 , pages=
work page 2021
-
[16]
Improving Question Answering by Commonsense-Based Pre-training , author=. Natural Language Processing and Chinese Computing: 8th CCF International Conference, NLPCC 2019, Dunhuang, China, October 9--14, 2019, Proceedings, Part I , pages=
work page 2019
-
[17]
IEEE/ACM Transactions on Audio, Speech, and Language Processing , volume=
From lsat: The progress and challenges of complex reasoning , author=. IEEE/ACM Transactions on Audio, Speech, and Language Processing , volume=. 2022 , publisher=
work page 2022
-
[18]
LogiQA: a challenge dataset for machine reading comprehension with logical reasoning , author=. Proceedings of the Twenty-Ninth International Conference on International Joint Conferences on Artificial Intelligence , pages=
-
[19]
Measuring Mathematical Problem Solving With the MATH Dataset , author=. Sort , volume=
-
[20]
Program Induction by Rationale Generation: Learning to Solve and Explain Algebraic Word Problems , author=. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=
-
[21]
JEC-QA: A Legal-Domain Question Answering Dataset , author=. Proceedings of AAAI , year=
- [24]
-
[25]
Rebooting AI: Building artificial intelligence we can trust , author=. 2019 , publisher=
work page 2019
-
[26]
Advances in neural information processing systems , volume=
Language models are few-shot learners , author=. Advances in neural information processing systems , volume=
-
[29]
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing , pages=
SQuAD: 100,000+ Questions for Machine Comprehension of Text , author=. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing , pages=
work page 2016
-
[30]
Ehrmann, Maud and Romanello, Matteo and Clematide, Simon and Doucet, Antoine , year =. Proceedings of the 44
-
[31]
Advances in neural information processing systems , volume=
Superglue: A stickier benchmark for general-purpose language understanding systems , author=. Advances in neural information processing systems , volume=
-
[36]
Advances in Neural Information Processing Systems , volume=
Training language models to follow instructions with human feedback , author=. Advances in Neural Information Processing Systems , volume=
-
[37]
Conference on Empirical Methods in Natural Language Processing , year=
Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering , author=. Conference on Empirical Methods in Natural Language Processing , year=
- [39]
-
[40]
Solving Quantitative Reasoning Problems with Language Models , author=. 2022 , eprint=
work page 2022
-
[43]
Edward Beeching, Sheon Han, Nathan Lambert, Nazneen Rajani, Omar Sanseviero, Lewis Tunstall, and Thomas Wolf. Open llm leaderboard. https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard, 2023
work page 2023
-
[44]
Bowman, Gabor Angeli, Christopher Potts, and Christopher D
Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp.\ 632--642, Lisbon, Portugal, September 2015. Association for Computational Linguistics. doi:10.18653/v1/D15-1075. URL...
-
[45]
Language models are few-shot learners
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33: 0 1877--1901, 2020
work page 1901
-
[46]
Sparks of Artificial General Intelligence: Early experiments with GPT-4
S \'e bastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[47]
Gonzalez, Ion Stoica, and Eric P
Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90\ quality, March 2023. URL https://lmsys.org/blog/2023-03-30-vicuna/
work page 2023
-
[48]
Jonathan H Choi, Kristin E Hickman, Amy Monahan, and Daniel Schwarcz. Chatgpt goes to law school. Available at SSRN, 2023
work page 2023
-
[49]
Scaling Instruction-Finetuned Language Models
Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[50]
BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions
Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions. arXiv preprint arXiv:1905.10044, 2019
work page internal anchor Pith review Pith/arXiv arXiv 1905
-
[51]
S ent E val: An evaluation toolkit for universal sentence representations
Alexis Conneau and Douwe Kiela. S ent E val: An evaluation toolkit for universal sentence representations. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation ( LREC 2018) , Miyazaki, Japan, May 2018. European Language Resources Association (ELRA). URL https://aclanthology.org/L18-1269
work page 2018
-
[52]
BERT: Pre-training of deep bidirectional transformers for language understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT : Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) , pp.\ 4171--4186, Minneapo...
-
[53]
Bold: Dataset and metrics for measuring biases in open-ended language generation
Jwala Dhamala, Tony Sun, Varun Kumar, Satyapriya Krishna, Yada Pruksachatkun, Kai-Wei Chang, and Rahul Gupta. Bold: Dataset and metrics for measuring biases in open-ended language generation. In Proceedings of the 2021 ACM conference on fairness, accountability, and transparency, pp.\ 862--872, 2021
work page 2021
-
[54]
Glm: General language model pretraining with autoregressive blank infilling
Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. Glm: General language model pretraining with autoregressive blank infilling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 320--335, 2022
work page 2022
-
[55]
Maud Ehrmann, Matteo Romanello, Simon Clematide, and Antoine Doucet. Introducing the HIPE 2022 Shared Task:Named Entity Recognition and Linking in Multilingual Historical Documents . In Proceedings of the 44 d European Conference on IR Research ( ECIR 2022) , Stavanger, Norway , 2022. Lecture Notes in Computer Science, Springer . URL https://link.springer...
-
[56]
Toxigen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection
Thomas Hartvigsen, Saadia Gabriel, Hamid Palangi, Maarten Sap, Dipankar Ray, and Ece Kamar. Toxigen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection. arXiv preprint arXiv:2203.09509, 2022
-
[57]
Measuring mathematical problem solving with the math dataset
Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. Sort, 2 0 (4): 0 0--6
-
[58]
Measuring Massive Multitask Language Understanding
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2009
-
[59]
Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov
Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. Natural questions: A benchmark for question answering research. Transac...
-
[60]
Solving quantitative reasoning problems with language models, 2022
Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, and Vedant Misra. Solving quantitative reasoning problems with language models, 2022
work page 2022
-
[61]
Holistic Evaluation of Language Models
Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, et al. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[62]
Program induction by rationale generation: Learning to solve and explain algebraic word problems
Wang Ling, Dani Yogatama, Chris Dyer, and Phil Blunsom. Program induction by rationale generation: Learning to solve and explain algebraic word problems. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 158--167, 2017
work page 2017
-
[63]
Logiqa: a challenge dataset for machine reading comprehension with logical reasoning
Jian Liu, Leyang Cui, Hanmeng Liu, Dandan Huang, Yile Wang, and Yue Zhang. Logiqa: a challenge dataset for machine reading comprehension with logical reasoning. In Proceedings of the Twenty-Ninth International Conference on International Joint Conferences on Artificial Intelligence, pp.\ 3622--3628, 2021
work page 2021
-
[64]
Rebooting AI: Building artificial intelligence we can trust
Gary Marcus and Ernest Davis. Rebooting AI: Building artificial intelligence we can trust. Vintage, 2019
work page 2019
-
[65]
The Natural Language Decathlon: Multitask Learning as Question Answering
Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. The natural language decathlon: Multitask learning as question answering. arXiv preprint arXiv:1806.08730, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[66]
Can a suit of armor conduct electricity? a new dataset for open book question answering
Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. In Conference on Empirical Methods in Natural Language Processing, 2018
work page 2018
- [67]
-
[68]
Training language models to follow instructions with human feedback
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35: 0 27730--27744, 2022
work page 2022
-
[69]
The LAMBADA dataset: Word prediction requiring a broad discourse context
Denis Paperno, Germ \'a n Kruszewski, Angeliki Lazaridou, Quan Ngoc Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fern \'a ndez. The lambada dataset: Word prediction requiring a broad discourse context. arXiv preprint arXiv:1606.06031, 2016
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[70]
Squad: 100,000+ questions for machine comprehension of text
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp.\ 2383--2392, 2016
work page 2016
-
[71]
SenseTime . Internlm, 2023. https://github.com/InternLM/InternLM-techreport/
work page 2023
-
[72]
FEVER: a large-scale dataset for Fact Extraction and VERification
James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. FEVER : a large-scale dataset for fact extraction and VER ification. In Proceedings of the 2018 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers) , pp.\ 809--819, New Orleans, Louisiana...
work page internal anchor Pith review doi:10.18653/v1/n18-1074 2018
-
[73]
LLaMA: Open and Efficient Foundation Language Models
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timoth \'e e Lacroix, Baptiste Rozi \`e re, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[74]
GLUE : A multi-task benchmark and analysis platform for natural language understanding
Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. GLUE : A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop B lackbox NLP : Analyzing and Interpreting Neural Networks for NLP , pp.\ 353--355, Brussels, Belgium, November 2018. Association for Computa...
-
[75]
Superglue: A stickier benchmark for general-purpose language understanding systems
Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. Superglue: A stickier benchmark for general-purpose language understanding systems. Advances in neural information processing systems, 32, 2019
work page 2019
-
[76]
From lsat: The progress and challenges of complex reasoning
Siyuan Wang, Zhongkun Liu, Wanjun Zhong, Ming Zhou, Zhongyu Wei, Zhumin Chen, and Nan Duan. From lsat: The progress and challenges of complex reasoning. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30: 0 2201--2216, 2022
-
[77]
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903, 2022
-
[78]
OPT: Open Pre-trained Transformer Language Models
Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022a
-
[79]
Automatic Chain of Thought Prompting in Large Language Models
Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola. Automatic chain of thought prompting in large language models. arXiv preprint arXiv:2210.03493, 2022b
-
[80]
JEC-QA: A Legal-Domain Question Answering Dataset
Haoxi Zhong, Chaojun Xiao, Cunchao Tu, Tianyang Zhang, Zhiyuan Liu, and Maosong Sun. JEC-QA: A legal-domain question answering dataset. In Proceedings of AAAI, 2020
-
[81]
Analytical Reasoning of Text
Wanjun Zhong, Siyuan Wang, Duyu Tang, Zenan Xu, Daya Guo, Yining Chen, Jiahai Wang, Jian Yin, Ming Zhou, and Nan Duan. Analytical reasoning of text. In Findings of the Association for Computational Linguistics: NAACL 2022, pp. 2306–2319, Seattle, United States, July 2022. Association for Computational Linguistics. doi:10.18653/v1/2022.findings-naacl.177
-
[82]
Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021). 2021
-
[83]
Exploiting Auxiliary Data for Offensive Language Detection with Bidirectional Transformers
Sumer Singh and Sheng Li. Exploiting Auxiliary Data for Offensive Language Detection with Bidirectional Transformers. Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021), 2021. doi:10.18653/v1/2021.woah-1.1
-
[84]
Modeling Profanity and Hate Speech in Social Media with Semantic Subspaces
Vanessa Hahn, Dana Ruiter, Thomas Kleinbauer, and Dietrich Klakow. Modeling Profanity and Hate Speech in Social Media with Semantic Subspaces. Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021), 2021. doi:10.18653/v1/2021.woah-1.2
-
[85]
HateBERT: Retraining BERT for Abusive Language Detection in English
Tommaso Caselli, Valerio Basile, Jelena Mitrović, and Michael Granitzer. HateBERT: Retraining BERT for Abusive Language Detection in English. Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021), 2021. doi:10.18653/v1/2021.woah-1.3
-
[86]
Memes in the Wild: Assessing the Generalizability of the Hateful Memes Challenge Dataset
Hannah Kirk, Yennie Jun, Paulius Rauba, Gal Wachtel, Ruining Li, Xingjian Bai, Noah Broestl, Martin Doff-Sotta, Aleksandar Shtedritski, and Yuki M. Asano. Memes in the Wild: Assessing the Generalizability of the Hateful Memes Challenge Dataset. Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021), 2021. doi...
-
[87]
Measuring and Improving Model-Moderator Collaboration using Uncertainty Estimation
Ian Kivlichan, Zi Lin, Jeremiah Liu, and Lucy Vasserman. Measuring and Improving Model-Moderator Collaboration using Uncertainty Estimation. Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021), 2021. doi:10.18653/v1/2021.woah-1.5
-
[88]
DALC: the Dutch Abusive Language Corpus
Tommaso Caselli, Arjan Schelhaas, Marieke Weultjes, Folkert Leistra, Hylke van der Veen, Gerben Timmerman, and Malvina Nissim. DALC: the Dutch Abusive Language Corpus. Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021), 2021. doi:10.18653/v1/2021.woah-1.6
-
[89]
Offensive Language Detection in Nepali Social Media
Nobal B. Niraula, Saurab Dulal, and Diwa Koirala. Offensive Language Detection in Nepali Social Media. Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021), 2021. doi:10.18653/v1/2021.woah-1.7
-
[90]
MIN_PT: An European Portuguese Lexicon for Minorities Related Terms
Paula Fortuna, Vanessa Cortez, Miguel Sozinho Ramalho, and Laura Pérez-Mayos. MIN_PT: An European Portuguese Lexicon for Minorities Related Terms. Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021), 2021. doi:10.18653/v1/2021.woah-1.8
-
[91]
Fine-Grained Fairness Analysis of Abusive Language Detection Systems with CheckList
Marta Marchiori Manerba and Sara Tonelli. Fine-Grained Fairness Analysis of Abusive Language Detection Systems with CheckList. Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021), 2021. doi:10.18653/v1/2021.woah-1.9
-
[92]
Improving Counterfactual Generation for Fair Hate Speech Detection
Aida Mostafazadeh Davani, Ali Omrani, Brendan Kennedy, Mohammad Atari, Xiang Ren, and Morteza Dehghani. Improving Counterfactual Generation for Fair Hate Speech Detection. Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021), 2021. doi:10.18653/v1/2021.woah-1.10
-
[93]
Hell Hath No Fury? Correcting Bias in the NRC Emotion Lexicon
Samira Zad, Joshuan Jimenez, and Mark Finlayson. Hell Hath No Fury? Correcting Bias in the NRC Emotion Lexicon. Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021), 2021. doi:10.18653/v1/2021.woah-1.11