pith. machine review for the scientific record.

arxiv: 2501.14249 · v10 · submitted 2025-01-24 · 💻 cs.LG · cs.AI · cs.CL

Recognition: 2 theorem links · Lean Theorem

Humanity's Last Exam

Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu, Hugh Zhang, Chen Bo Calvin Zhang, Mohamed Shaaban
and 1,100 more authors (see the paper for the complete author list)
Authors on Pith no claims yet

Pith reviewed 2026-05-10 18:36 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords LLM benchmark · AI evaluation · expert human performance · academic questions · model calibration · frontier knowledge · closed-ended questions · multi-modal benchmark

The pith

A benchmark of 2,500 expert-level questions shows state-of-the-art LLMs still perform poorly on hard academic problems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a collection of 2,500 closed-ended questions spanning mathematics, humanities, natural sciences, and other fields, each with a definite answer that experts can check but that resists quick web lookup. These questions were assembled by subject specialists worldwide to sit at the current limits of human knowledge. When tested, leading language models record low accuracy and weak calibration on the set, in contrast to their high scores on easier existing tests. This gap indicates that current systems have not yet reached expert human performance on demanding closed-ended tasks. If the results hold, the benchmark offers a stable reference point for tracking future progress toward that level.

Core claim

The authors assembled 2,500 multi-modal questions across dozens of subjects, each carrying a known, unambiguous solution that is easily verified yet not quickly retrievable from the internet. State-of-the-art LLMs achieve low accuracy and poor calibration on this collection, in contrast to their near-ceiling performance on saturated earlier benchmarks, thereby exposing a measurable distance between present model abilities and the expert human frontier on closed-ended academic questions.
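
The two headline quantities in that claim, accuracy and calibration, are straightforward to compute once per-question grades and stated confidences are available. A minimal sketch, assuming each graded record carries an is_correct flag and a confidence in [0, 1] (illustrative field names, not the paper's schema), using the common binned expected calibration error:

```python
from dataclasses import dataclass

@dataclass
class GradedAnswer:
    is_correct: bool      # automated grader's verdict for one question
    confidence: float     # model's stated confidence in [0, 1]

def accuracy(results: list[GradedAnswer]) -> float:
    return sum(r.is_correct for r in results) / len(results)

def expected_calibration_error(results: list[GradedAnswer], n_bins: int = 10) -> float:
    """Binned ECE: weighted mean gap between confidence and accuracy per bin."""
    bins = [[] for _ in range(n_bins)]
    for r in results:
        idx = min(int(r.confidence * n_bins), n_bins - 1)
        bins[idx].append(r)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(r.confidence for r in bucket) / len(bucket)
        avg_acc = sum(r.is_correct for r in bucket) / len(bucket)
        ece += (len(bucket) / len(results)) * abs(avg_conf - avg_acc)
    return ece
```

On a set like HLE, the reported failure mode would show up as low overall accuracy together with per-bin confidences that sit well above per-bin accuracy, inflating the ECE.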

What carries the argument

The Humanity's Last Exam benchmark itself, a fixed set of 2,500 expert-developed questions with verifiable answers that resist rapid retrieval.

If this is right

  • The benchmark supplies a durable yardstick for measuring gains in reasoning and knowledge on genuinely difficult problems.
  • Model developers gain a concrete signal that current approaches leave substantial headroom before expert-level closed-ended performance.
  • Policymakers receive a clearer view of the distance between deployed systems and human-expert capability on academic tasks.
  • Subsequent evaluation efforts can adopt the same global-expert, verifiable-answer design for other domains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Strong performance on this set may correlate with competence on complex real-world expert workflows that mix facts and reasoning.
  • The multi-modal format points to a need for joint advances in text and visual understanding at frontier difficulty.
  • Repeated use of the same questions over time will let researchers quantify whether gains are genuine or partly due to data leakage.
  • Similar coordinated expert efforts could produce parallel tests for fields where knowledge moves faster than static benchmarks allow.

Load-bearing premise

The questions have clear solutions that cannot be quickly found through internet searches and sit at the current edge of what human experts know.
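
A cheap, partial proxy for that premise is an overlap audit against earlier benchmark questions and other indexed text. The sketch below is a hypothetical check of that kind, using token-set Jaccard similarity against a locally held corpus of prior questions; it does not model the expert web searches a real audit would also need:

```python
import re

def token_set(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def flag_overlapping_questions(questions: list[str],
                               prior_corpus: list[str],
                               threshold: float = 0.6) -> list[tuple[int, float]]:
    """Return (question index, max overlap) for questions too close to prior items."""
    prior_sets = [token_set(p) for p in prior_corpus]
    flagged = []
    for i, q in enumerate(questions):
        qs = token_set(q)
        best = max((jaccard(qs, p) for p in prior_sets), default=0.0)
        if best >= threshold:
            flagged.append((i, best))
    return flagged
```

Questions flagged by a check like this would be candidates for rewriting or removal before the set is treated as non-retrievable.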

What would settle it

An independent check showing that many of the questions can be answered correctly by standard web search, or a demonstration that top LLMs exceed 60 percent accuracy on the full set without additional training.
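
The second half of that test is simple arithmetic: with 2,500 questions, whether an observed score credibly exceeds 60 percent can be read off a binomial confidence interval. A small sketch using a normal approximation (the 60 percent threshold comes from the sentence above; the 95 percent level is an illustrative choice):

```python
import math

def accuracy_exceeds_threshold(n_correct: int, n_total: int,
                               threshold: float = 0.60, z: float = 1.96) -> bool:
    """True if the lower bound of a ~95% normal-approximation CI clears the threshold."""
    p_hat = n_correct / n_total
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n_total)
    return p_hat - half_width > threshold

# Example: 1,550 of 2,500 correct gives 62% observed, lower bound about 60.1%
print(accuracy_exceeds_threshold(1550, 2500))  # True
```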

read the original abstract

Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve over 90% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities. In response, we introduce Humanity's Last Exam (HLE), a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage. HLE consists of 2,500 questions across dozens of subjects, including mathematics, humanities, and the natural sciences. HLE is developed globally by subject-matter experts and consists of multiple-choice and short-answer questions suitable for automated grading. Each question has a known solution that is unambiguous and easily verifiable, but cannot be quickly answered via internet retrieval. State-of-the-art LLMs demonstrate low accuracy and calibration on HLE, highlighting a significant gap between current LLM capabilities and the expert human frontier on closed-ended academic questions. To inform research and policymaking upon a clear understanding of model capabilities, we publicly release HLE at https://lastexam.ai.
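
The abstract's "suitable for automated grading" covers two formats, multiple-choice and short-answer. A minimal sketch of how grading closed-ended answers of each type could look, assuming simple string normalization; the paper's actual grader, which may be more tolerant of equivalent phrasings, is not reproduced here:

```python
import re

def normalize(ans: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace for comparison."""
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", "", ans.lower())).strip()

def grade_multiple_choice(model_answer: str, correct_letter: str) -> bool:
    # Accept 'B', 'b)', 'Answer: B', etc. by extracting the first standalone letter token.
    match = re.search(r"\b([a-e])\b", model_answer.lower())
    return bool(match) and match.group(1) == correct_letter.lower()

def grade_short_answer(model_answer: str, reference: str) -> bool:
    # Exact match after normalization; real graders often also accept equivalent forms.
    return normalize(model_answer) == normalize(reference)

print(grade_multiple_choice("Answer: B", "B"))  # True
print(grade_short_answer("  The Riesz  representation theorem. ",
                         "the riesz representation theorem"))  # True
```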

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces Humanity's Last Exam (HLE), a multi-modal benchmark of 2,500 closed-ended questions (multiple-choice and short-answer) spanning mathematics, humanities, and natural sciences. Questions were developed globally by subject-matter experts and are asserted to have unambiguous, verifiable solutions that cannot be quickly answered via internet retrieval. The paper claims that existing benchmarks like MMLU are saturated (>90% LLM accuracy) and positions HLE as a frontier benchmark on which state-of-the-art LLMs exhibit low accuracy and poor calibration, revealing a substantial gap to expert human performance. The benchmark is released publicly at lastexam.ai.

Significance. If the questions are rigorously validated as non-retrievable and frontier-level, HLE would be a valuable contribution by supplying a non-saturated, broad-coverage benchmark for tracking LLM progress on expert academic tasks. The global expert curation and multi-modal design are strengths, and the public release supports reproducibility. However, the claimed significance of the LLM capability gap rests on unshown validation evidence, limiting its current impact for research and policy.

major comments (2)
  1. [Abstract and question development section] Abstract and the section describing question development: The assertion that 'each question has a known solution that is unambiguous and easily verifiable, but cannot be quickly answered via internet retrieval' is load-bearing for interpreting low LLM accuracy as evidence of a true capability frontier rather than training-data gaps or leakage. No concrete methodology (e.g., expert search audits, originality checks, or quantitative retrievability tests) is supplied that directly addresses this central claim.
  2. [Results and evaluation sections] Results and evaluation sections: The abstract states that SOTA LLMs 'demonstrate low accuracy and calibration' on HLE, yet the provided information contains no quantitative results, specific model accuracies, baselines, calibration metrics, or statistical details. This absence makes it impossible to assess the magnitude or robustness of the reported gap.
minor comments (1)
  1. [Abstract] Abstract: Including one or two concrete accuracy figures (with model names) would make the 'low accuracy' claim more precise and informative for readers.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript introducing Humanity's Last Exam. We address each major comment point by point below, with clear indications of planned revisions to improve clarity and rigor.

read point-by-point responses
  1. Referee: [Abstract and question development section] Abstract and the section describing question development: The assertion that 'each question has a known solution that is unambiguous and easily verifiable, but cannot be quickly answered via internet retrieval' is load-bearing for interpreting low LLM accuracy as evidence of a true capability frontier rather than training-data gaps or leakage. No concrete methodology (e.g., expert search audits, originality checks, or quantitative retrievability tests) is supplied that directly addresses this central claim.

    Authors: We agree that explicit validation details are essential to support the non-retrievability claim and distinguish capability gaps from data leakage. The manuscript describes global expert curation and the requirement for verifiable solutions, but we acknowledge the need for greater specificity. In the revised version, we will add a dedicated subsection under question development that outlines the concrete procedures: expert-conducted web searches for each question, checks against academic databases and prior benchmarks for originality, and any quantitative thresholds or audit logs used to confirm that solutions cannot be quickly retrieved. Examples of such checks for representative questions will be included where feasible without compromising the benchmark. revision: yes

  2. Referee: [Results and evaluation sections] Results and evaluation sections: The abstract states that SOTA LLMs 'demonstrate low accuracy and calibration' on HLE, yet the provided information contains no quantitative results, specific model accuracies, baselines, calibration metrics, or statistical details. This absence makes it impossible to assess the magnitude or robustness of the reported gap.

    Authors: We apologize that the quantitative results were not presented with sufficient prominence or completeness in the version under review. The manuscript does contain an evaluation section reporting model performance, but we will revise it to include explicit tables with per-model accuracies (e.g., for GPT-4o, Claude 3.5 Sonnet, and others), direct comparisons to human expert baselines, calibration metrics such as expected calibration error, and basic statistical details including confidence intervals or variance across question subsets. This will enable readers to evaluate the scale and reliability of the observed gap. revision: yes
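
For the statistical details promised in this response, the simplest way to attach uncertainty to a per-model accuracy figure is a bootstrap over the question set. A hedged sketch (the percentile bootstrap and 95 percent level are illustrative choices, not the authors' stated protocol):

```python
import random

def bootstrap_accuracy_ci(is_correct: list[bool],
                          n_resamples: int = 10_000,
                          alpha: float = 0.05,
                          seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap confidence interval for accuracy over the question set."""
    rng = random.Random(seed)
    n = len(is_correct)
    stats = []
    for _ in range(n_resamples):
        sample = [is_correct[rng.randrange(n)] for _ in range(n)]
        stats.append(sum(sample) / n)
    stats.sort()
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

Called on a model's per-question correctness vector, e.g. bootstrap_accuracy_ci([True] * 200 + [False] * 2300), this brackets an observed 8 percent accuracy with roughly a one-point margin on either side.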

Circularity Check

0 steps flagged

No circularity: benchmark dataset release without derivations or fits

full rationale

The paper introduces Humanity's Last Exam as a new multi-modal benchmark consisting of 2,500 expert-authored questions. It contains no mathematical derivations, model equations, parameter fittings, or predictions derived from internal computations. The central claims—that questions are unambiguous, verifiable, and not quickly retrievable via internet, and that current LLMs show low accuracy—rest on the empirical construction and release of the dataset itself rather than any self-referential reduction of outputs to inputs. No self-citation chains, ansatzes, or renamings of known results are used to justify load-bearing steps. The work is therefore self-contained as a benchmark contribution with no derivation chain to inspect for circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The evaluation of LLM capabilities on HLE depends on the assumption that the questions accurately reflect the frontier of human knowledge without being solvable through non-expert means.

axioms (1)
  • domain assumption Questions have known, unambiguous, and easily verifiable solutions that cannot be quickly answered via internet retrieval.
    This is presented as a core design principle in the abstract.

pith-pipeline@v0.9.0 · 10825 in / 1194 out tokens · 71822 ms · 2026-05-10T18:36:09.526569+00:00 · methodology

discussion (0)


Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Unsteady Metrics and Benchmarking Cultures of AI Model Builders

    cs.AI 2026-05 accept novelty 8.0

    AI model builders mostly highlight unique benchmarks that act as flexible narrative tools for market positioning rather than standardized scientific measurements.

  2. Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs

    cs.CL 2026-05 unverdicted novelty 8.0

    Soohak is a new 439-problem mathematician-authored benchmark showing frontier LLMs reach only 30% on research math and fail to exceed 50% on refusing ill-posed questions.

  3. neuralCAD-Edit: An Expert Benchmark for Multimodal-Instructed 3D CAD Model Editing

    cs.CV 2026-04 unverdicted novelty 8.0

    neuralCAD-Edit benchmark shows even the best foundation model (GPT 5.2) scores 53% lower than human CAD experts in acceptance trials for multimodal-instructed 3D model edits.

  4. PolyBench: Benchmarking LLM Forecasting and Trading Capabilities on Live Prediction Market Data

    q-fin.CP 2026-04 conditional novelty 8.0

    Only two of seven LLMs produce positive returns on live Polymarket data, with MiMo-V2-Flash at 17.6% CWR and Gemini-3-Flash at 6.2% CWR while the other five lose money.

  5. TRIAGE: Evaluating Prospective Metacognitive Control in LLMs under Resource Constraints

    cs.AI 2026-05 unverdicted novelty 7.0

    TRIAGE evaluates LLMs on prospective metacognitive control by requiring a single plan for task selection, sequencing, and token allocation under a calibrated budget, revealing substantial gaps in current models across...

  6. Formal Conjectures: An Open and Evolving Benchmark for Verified Discovery in Mathematics

    cs.AI 2026-05 unverdicted novelty 7.0

    Formal Conjectures is a Lean 4 benchmark containing 2615 formalized problems with 1029 open conjectures, designed to evaluate automated mathematical reasoning and proof discovery.

  7. AssayBench: An Assay-Level Virtual Cell Benchmark for LLMs and Agents

    cs.LG 2026-05 unverdicted novelty 7.0

    AssayBench is a new gene-ranking benchmark for phenotypic CRISPR screens that shows zero-shot generalist LLMs outperform both biology-specific LLMs and trainable baselines on adjusted nDCG.

  8. Towards On-Policy Data Evolution for Visual-Native Multimodal Deep Search Agents

    cs.CL 2026-05 unverdicted novelty 7.0

    A new image-bank harness and closed-loop on-policy data evolution method raises multimodal agent performance on visual search benchmarks from 24.9% to 39.0% for an 8B model and from 30.6% to 41.5% for a 30B model.

  9. MaD Physics: Evaluating information seeking under constraints in physical environments

    cs.AI 2026-05 unverdicted novelty 7.0

    MaD Physics is a new benchmark for evaluating AI agents on constrained information-seeking, model inference, and prediction in three physical environments with altered laws to avoid knowledge contamination.

  10. LLM-Guided Monte Carlo Tree Search over Knowledge Graphs: Composing Mechanistic Explanations for Drug-Disease Pairs

    cs.AI 2026-05 unverdicted novelty 7.0

    TESSERA combines LLMs as local policy and evaluator with MCTS on knowledge graphs to compose mechanistic drug-disease explanations.

  11. DiagnosticIQ: A Benchmark for LLM-Based Industrial Maintenance Action Recommendation from Symbolic Rules

    cs.AI 2026-05 unverdicted novelty 7.0

    DiagnosticIQ benchmark shows frontier LLMs perform similarly on standard rule-to-action tasks but lose substantial accuracy under distractor expansion and condition inversion, pointing to calibration as the key deploy...

  12. AcademiClaw: When Students Set Challenges for AI Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    AcademiClaw is a new benchmark of 80 student-sourced academic tasks where the best frontier AI agents achieve only a 55% pass rate.

  13. Reward Hacking Benchmark: Measuring Exploits in LLM Agents with Tool Use

    cs.LG 2026-05 unverdicted novelty 7.0

    The Reward Hacking Benchmark shows RL post-training raises exploit rates in tool-using LLM agents from 0.6% to 13.9%, with environmental hardening cutting exploits by 87.7% relative without lowering task success.

  14. Super Apriel: One Checkpoint, Many Speeds

    cs.LG 2026-04 unverdicted novelty 7.0

    A single 15B supernet checkpoint supports runtime switching between attention mixer placements for multiple decode speed presets while retaining 77-96% quality relative to the teacher model.

  15. Stargazer: A Scalable Model-Fitting Benchmark Environment for AI Agents under Astrophysical Constraints

    cs.LG 2026-04 unverdicted novelty 7.0

    Stargazer benchmarks AI agents on physics-constrained model fitting for astrophysical data, revealing that agents achieve statistical fits but often fail to recover correct physical parameters.

  16. Stargazer: A Scalable Model-Fitting Benchmark Environment for AI Agents under Astrophysical Constraints

    cs.LG 2026-04 unverdicted novelty 7.0

    Stargazer benchmark shows frontier AI agents achieve statistical fits to radial velocity data but frequently fail to recover correct physical planetary system parameters.

  17. PokeGym: A Visually-Driven Long-Horizon Benchmark for Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 7.0

    PokeGym is a new benchmark that tests VLMs on long-horizon tasks in a complex 3D game using only visual observations, identifying deadlock recovery as the primary failure mode.

  18. GeoBrowse: A Geolocation Benchmark for Agentic Tool Use with Expert-Annotated Reasoning Traces

    cs.CL 2026-04 unverdicted novelty 7.0

    GeoBrowse is a two-level geolocation benchmark combining visual cue composition with knowledge-intensive multi-hop queries, paired with the GATE agent workflow that outperforms no-tool, search-only, and image-only baselines.

  19. The limits of bio-molecular modeling with large language models : a cross-scale evaluation

    cs.LG 2026-04 unverdicted novelty 7.0

    LLMs perform adequately on bio-molecular classification tasks but remain weak on regression, with hybrid architectures outperforming others on long sequences and fine-tuning hurting generalization.

  20. Scaling Latent Reasoning via Looped Language Models

    cs.CL 2025-10 unverdicted novelty 7.0

    Looped language models with latent iterative computation and entropy-regularized depth allocation achieve performance matching up to 12B standard LLMs through superior knowledge manipulation.

  21. OpenDeepThink: Parallel Reasoning via Bradley--Terry Aggregation

    cs.AI 2026-05 unverdicted novelty 6.0

    OpenDeepThink improves LLM reasoning by ranking parallel candidate traces via Bradley-Terry aggregation of LLM pairwise judgments, achieving a +405 Codeforces Elo gain on Gemini 3.1 Pro after eight rounds.

  22. Measuring Five-Nines Reliability: Sample-Efficient LLM Evaluation in Saturated Benchmarks

    cs.LG 2026-05 unverdicted novelty 6.0

    Cross-entropy method sampling reduces inferences needed to estimate five-nines LLM reliability by up to 156x on parameterized GSM8K templates, revealing reliability differences hidden by saturated accuracy scores.

  23. Instructions Shape Production of Language, not Processing

    cs.CL 2026-05 unverdicted novelty 6.0

    Instructions trigger a production-centered mechanism in language models, with task-specific information stable in input tokens but varying strongly in output tokens and correlating with behavior.

  24. The Generalized Turing Test: A Foundation for Comparing Intelligence

    cs.AI 2026-05 unverdicted novelty 6.0

    The Generalized Turing Test defines relative intelligence as the inability of one agent to distinguish an imitator from the original through interaction.

  25. EvoMAS: Learning Execution-Time Workflows for Multi-Agent Systems

    cs.AI 2026-05 unverdicted novelty 6.0

    EvoMAS trains a workflow adapter with policy gradients to dynamically instantiate stage-specific multi-agent workflows from a fixed agent pool, using explicit task-state construction and terminal success signals, and ...

  26. A Semantic-Sampling Framework for Evaluating Calibration in Open-Ended Question Answering

    cs.CL 2026-05 unverdicted novelty 6.0

    Sem-ECE is an asymptotically unbiased calibration error estimator for open-ended QA that uses semantic sampling of answers to derive confidence from class frequencies, with two variants that diverge on hard questions.

  27. Memory-Efficient Looped Transformer: Decoupling Compute from Memory in Looped Language Models

    cs.CL 2026-05 unverdicted novelty 6.0

    MELT decouples reasoning depth from memory in looped LLMs by sharing a single gated KV cache per layer and using two-phase chunk-wise distillation from Ouro, delivering constant memory use while matching or beating st...

  28. Learning Agent Routing From Early Experience

    cs.CL 2026-05 unverdicted novelty 6.0

    BoundaryRouter routes queries to LLM or agent using early experience memory from a seed set, cutting inference time 60.6% versus always using agents and raising performance 28.6% versus always using direct LLM inference.

  29. Cripping AI: Reimagining AI Through Lived Disability Experiences

    cs.HC 2026-05 unverdicted novelty 6.0

    Cripping AI is a proposed framework that dismantles ableist assumptions in AI by centering disabled ways of knowing and respecting disabled labor in co-creation.

  30. SciResearcher: Scaling Deep Research Agents for Frontier Scientific Reasoning

    cs.AI 2026-05 unverdicted novelty 6.0

    SciResearcher automates creation of diverse scientific reasoning tasks from academic evidence to train an 8B model that sets new SOTA at 19.46% on HLE-Bio/Chem-Gold and gains 13-15% on SuperGPQA-Hard-Biology and TRQA-...

  31. ProEval: Proactive Failure Discovery and Efficient Performance Estimation for Generative AI Evaluation

    cs.LG 2026-04 unverdicted novelty 6.0

    ProEval is a proactive framework using pre-trained GPs, Bayesian quadrature, and superlevel set sampling to estimate performance and find failures in generative AI with 8-65x fewer samples than baselines.

  32. Superminds Test: Actively Evaluating Collective Intelligence of Agent Society via Probing Agents

    cs.AI 2026-04 unverdicted novelty 6.0

    Large-scale experiments on two million agents reveal that collective intelligence does not emerge from scale alone due to sparse and shallow interactions.

  33. Large Language Models Decide Early and Explain Later

    cs.CL 2026-04 unverdicted novelty 6.0

    LLMs settle on their answer after a minority of CoT tokens and produce an average of 760 more tokens as post-decision explanation, enabling early stopping that saves 500 tokens per query at a 2% accuracy cost.

  34. ActuBench: A Multi-Agent LLM Pipeline for Generation and Evaluation of Actuarial Reasoning Tasks

    cs.AI 2026-04 unverdicted novelty 6.0

    ActuBench is a multi-agent LLM pipeline for generating and evaluating actuarial reasoning tasks, with evaluations of 50 models showing effective verification, competitive local open-weights models, and differing ranki...

  35. Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence

    cs.AI 2026-04 unverdicted novelty 6.0

    Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.

  36. PRL-Bench: A Comprehensive Benchmark Evaluating LLMs' Capabilities in Frontier Physics Research

    cs.LG 2026-04 unverdicted novelty 6.0

    PRL-Bench evaluates frontier LLMs on 100 real physics research tasks and finds the best models score below 50, exposing a gap to autonomous discovery.

  37. Frontier-Eng: Benchmarking Self-Evolving Agents on Real-World Engineering Tasks with Generative Optimization

    cs.AI 2026-04 unverdicted novelty 6.0

    Frontier-Eng is a new benchmark for generative optimization in engineering where agents iteratively improve designs under fixed interaction budgets using executable verifiers, with top models like GPT 5.4 showing limi...

  38. Towards Knowledgeable Deep Research: Framework and Benchmark

    cs.AI 2026-04 unverdicted novelty 6.0

    The paper introduces the KDR task, HKA multi-agent framework, and KDR-Bench to enable LLM agents to integrate structured knowledge into deep research reports, with experiments showing outperformance over prior agents.

  39. Select-then-Solve: Paradigm Routing as Inference-Time Optimization for LLM Agents

    cs.CL 2026-04 conditional novelty 6.0

    A learned embedding-based router selecting among six reasoning paradigms improves LLM agent accuracy from 47.6% to 53.1% on average, beating the best fixed paradigm by 2.8pp.

  40. Representational Collapse in Multi-Agent LLM Committees: Measurement and Diversity-Aware Consensus

    cs.LG 2026-04 conditional novelty 6.0

    LLM agent committees exhibit representational collapse with mean cosine similarity of 0.888, and diversity-aware consensus reaches 87% accuracy on GSM8K versus 84% for self-consistency at lower cost.

  41. MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

    cs.CL 2025-06 unverdicted novelty 6.0

    MiniMax-M1 is a 456B parameter hybrid-attention MoE model trained with CISPO RL that achieves performance comparable or superior to DeepSeek-R1 and Qwen3-235B on reasoning and software engineering tasks while training...

  42. LLMs Get Lost In Multi-Turn Conversation

    cs.CL 2025-05 unverdicted novelty 6.0

    LLMs drop 39% in performance during multi-turn conversations due to premature assumptions and inability to recover from early errors.

  43. Boosting Reinforcement Learning with Verifiable Rewards via Randomly Selected Few-Shot Guidance

    cs.LG 2026-05 unverdicted novelty 5.0

    FEST improves RLVR sample efficiency on math and coding benchmarks by combining supervised signals, on-policy signals, and decaying weights on just 128 randomly chosen demonstrations, matching full-dataset baselines.

  44. Instructions Shape Production of Language, not Processing

    cs.CL 2026-05 unverdicted novelty 5.0

    Instructions primarily shape the production stage of language models rather than the processing stage, with task-specific information and causal effects stronger in output tokens than input tokens.

  45. Personalized Deep Research: A User-Centric Framework, Dataset, and Hybrid Evaluation for Knowledge Discovery

    cs.IR 2026-05 conditional novelty 5.0

    PDR is a user-context-aware framework for LLM research agents that improves report relevance over static baselines, supported by a new dataset and hybrid evaluation.

  46. pAI/MSc: ML Theory Research with Humans on the Loop

    cs.AI 2026-04 unverdicted novelty 5.0

    pAI/MSc is a customizable multi-agent system that reduces human steering by orders of magnitude when turning a hypothesis into a literature-grounded, mathematically established, experimentally supported manuscript dra...

  47. EvoMaster: A Foundational Evolving Agent Framework for Agentic Science at Scale

    cs.AI 2026-04 unverdicted novelty 5.0

    EvoMaster is a self-evolving agent framework that achieves state-of-the-art results on scientific benchmarks by enabling iterative hypothesis refinement and knowledge accumulation across domains.

  48. Toward Human-AI Complementarity Across Diverse Tasks

    cs.HC 2026-04 unverdicted novelty 5.0

    Human-AI hybrids achieve only +0.4pp over AI alone on diverse tasks because confidence routing fails to identify the small set of cases where humans can correct AI errors.

  49. COMPOSITE-Stem

    cs.AI 2026-04 conditional novelty 5.0

    COMPOSITE-STEM is a new benchmark of 70 expert-curated STEM tasks where frontier AI agents score at most 21% using flexible exact-match and rubric-based grading.

  50. Seed1.8 Model Card: Towards Generalized Real-World Agency

    cs.AI 2026-03 unverdicted novelty 5.0

    Seed1.8 is a new foundation model that adds unified agentic capabilities for search, code execution, and GUI interaction to existing LLM and vision strengths.

  51. GLM-5: from Vibe Coding to Agentic Engineering

    cs.LG 2026-02 unverdicted novelty 5.0

    GLM-5 is a foundation model that claims state-of-the-art results on coding benchmarks and superior performance on end-to-end software engineering tasks via new asynchronous RL methods and cost-saving DSA.

  52. Kimi K2.5: Visual Agentic Intelligence

    cs.CL 2026-02 unverdicted novelty 5.0

    Kimi K2.5 combines joint text-vision training with an Agent Swarm parallel orchestration framework to reach claimed state-of-the-art results on coding, vision, reasoning, and agent tasks while cutting latency up to 4.5 times.

  53. MiMo-V2-Flash Technical Report

    cs.CL 2026-01 unverdicted novelty 5.0

    MiMo-V2-Flash is a 309B/15B MoE model trained on 27T tokens with hybrid attention and multi-teacher on-policy distillation that matches larger models like DeepSeek-V3.2 while enabling 2.6x faster decoding via repurpos...

  54. DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

    cs.CL 2025-12 unverdicted novelty 5.0

    DeepSeek-V3.2 adds sparse attention, scaled RL post-training, and large-scale agentic data synthesis to reach GPT-5-level performance and gold medals in 2025 IMO and IOI with its high-compute variant.

  55. gpt-oss-120b & gpt-oss-20b Model Card

    cs.CL 2025-08 unverdicted novelty 5.0

    OpenAI releases two open-weight reasoning models, gpt-oss-120b and gpt-oss-20b, trained via distillation and RL with claimed strong results on math, coding, and safety benchmarks.

  56. Kimi K2: Open Agentic Intelligence

    cs.LG 2025-07 unverdicted novelty 5.0

    Kimi K2 is a 1-trillion-parameter MoE model that leads open-source non-thinking models on agentic benchmarks including 65.8 on SWE-Bench Verified and 66.1 on Tau2-Bench.

  57. Measuring AI Reasoning: A Guide for Researchers

    cs.AI 2026-05 unverdicted novelty 4.0

    Reasoning in language models should be measured by the faithfulness and validity of their multi-step search processes and intermediate traces, not final-answer accuracy.

  58. Supplement Generation Training for Enhancing Agentic Task Performance

    cs.LG 2026-04 unverdicted novelty 4.0

    SGT trains a lightweight model to generate task-specific supplemental text that improves performance of a larger frozen LLM on agentic tasks without modifying the large model.

  59. GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models

    cs.CL 2025-08 unverdicted novelty 4.0

    GLM-4.5, a 355B-parameter MoE model with hybrid reasoning, scores 70.1% on TAU-Bench, 91.0% on AIME 24, and 64.2% on SWE-bench Verified while ranking 3rd overall and 2nd on agentic benchmarks.

  60. From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review

    cs.AI 2025-04 accept novelty 4.0

    A survey consolidating benchmarks, agent frameworks, real-world applications, and protocols for LLM-based autonomous agents into a proposed taxonomy with recommendations for future research.

Reference graph

Works this paper leans on

300 extracted references · 200 canonical work pages · cited by 60 Pith papers · 23 internal anchors

  1. [1]

    Alberti, K

    C. Alberti, K. Lee, and M. Collins. A bert baseline for the natural questions, 2019. URL https: //arxiv.org/abs/1901.08634

  2. [2]

    AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents

    M. Andriushchenko, A. Souly, M. Dziemian, D. Duenas, M. Lin, J. Wang, D. Hendrycks, A. Zou, Z. Kolter, M. Fredrikson, E. Winsor, J. Wynne, Y . Gal, and X. Davies. Agentharm: A benchmark for measuring harmfulness of llm agents, 2024. URLhttps://arxiv.org/abs/2410.09024

  3. [3]

    The claude 3 model family: Opus, sonnet, haiku, 2024

    Anthropic. The claude 3 model family: Opus, sonnet, haiku, 2024. URL https://api. semanticscholar.org/CorpusID:268232499

  4. [4]

    Model card addendum: Claude 3.5 haiku and upgraded claude 3.5 son- net, 2024

    Anthropic. Model card addendum: Claude 3.5 haiku and upgraded claude 3.5 son- net, 2024. URL https://assets.anthropic.com/m/1cd9d098ac3e6467/original/ Claude-3-Model-Card-October-Addendum.pdf

  5. [5]

    Responsible scaling policy updates, 2024

    Anthropic. Responsible scaling policy updates, 2024. URL https://www.anthropic.com/ rsp-updates

  6. [6]

    R. K. Arora, J. Wei, R. S. Hicks, P. Bowman, J. Quiñonero-Candela, F. Tsimpourlas, M. Sharman, M. Shah, A. Vallone, A. Beutel, J. Heidecke, and K. Singhal. Healthbench: Evaluating large language models towards improved human health, 2025. URLhttps://arxiv.org/abs/2505.08775

  7. [7]

    Austin, A

    J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le, and C. Sutton. Program synthesis with large language models, 2021. URL https://arxiv.org/abs/2108. 07732

  8. [8]

    Y . Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, S. Fort, D. Ganguli, T. Henighan, N. Joseph, S. Kadavath, J. Kernion, T. Conerly, S. El-Showk, N. Elhage, Z. Hatfield-Dodds, D. Hernandez, T. Hume, S. Johnston, S. Kravec, L. Lovitt, N. Nanda, C. Olsson, D. Amodei, T. Brown, J. Clark, S. Mc- Candlish, C. Olah, B. Mann, and J. Kaplan...

  9. [9]

    MS MARCO: A Human Generated MAchine Reading COmprehension Dataset

    P. Bajaj, D. Campos, N. Craswell, L. Deng, J. Gao, X. Liu, R. Majumder, A. McNamara, B. Mitra, T. Nguyen, M. Rosenberg, X. Song, A. Stoica, S. Tiwary, and T. Wang. Ms marco: A human generated machine reading comprehension dataset, 2018. URLhttps://arxiv.org/abs/1611.09268

  10. [10]

    Purple llama CyberSecEval : A secure coding benchmark for language models

    M. Bhatt, S. Chennabasappa, C. Nikolaidis, S. Wan, I. Evtimov, D. Gabi, D. Song, F. Ahmad, C. Ascher- mann, L. Fontana, S. Frolov, R. P. Giri, D. Kapil, Y . Kozyrakis, D. LeBlanc, J. Milazzo, A. Straumann, G. Synnaeve, V . V ontimitta, S. Whitman, and J. Saxe. Purple llama cyberseceval: A secure coding benchmark for language models, 2023. URLhttps://arxiv...

  11. [11]

    J. S. Chan, N. Chowdhury, O. Jaffe, J. Aung, D. Sherburn, E. Mays, G. Starace, K. Liu, L. Maksin, T. Patwardhan, L. Weng, and A. M ˛ adry. Mle-bench: Evaluating machine learning agents on machine learning engineering, 2024. URLhttps://arxiv.org/abs/2410.07095

  12. [12]

    M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y . Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Her...

  13. [13]

    Arc prize 2024: Technical report

    F. Chollet, M. Knoop, G. Kamradt, and B. Landers. Arc prize 2024: Technical report, 2024. URL https://arxiv.org/abs/2412.04604

  14. [14]

    Training Verifiers to Solve Math Word Problems

    K. Cobbe, V . Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman. Training verifiers to solve math word problems, 2021. URL https://arxiv.org/abs/2110.14168

  15. [15]

    Deepseek-v3 technical report, 2024

    DeepSeek-AI. Deepseek-v3 technical report, 2024. URL https://github.com/deepseek-ai/ DeepSeek-V3/blob/main/DeepSeek_V3.pdf

  16. [16]

    D. Dua, Y . Wang, P. Dasigi, G. Stanovsky, S. Singh, and M. Gardner. Drop: A reading comprehension benchmark requiring discrete reasoning over paragraphs, 2019. URL https://arxiv.org/abs/1903. 00161. 10

  17. [17]

    The Llama 3 Herd of Models

    A. Dubey et al. The llama 3 herd of models, 2024. URLhttps://arxiv.org/abs/2407.21783

  18. [18]

    B. Gao, F. Song, Z. Yang, Z. Cai, Y . Miao, Q. Dong, L. Li, C. Ma, L. Chen, R. Xu, Z. Tang, B. Wang, D. Zan, S. Quan, G. Zhang, L. Sha, Y . Zhang, X. Ren, T. Liu, and B. Chang. Omni-math: A universal olympiad level mathematic benchmark for large language models, 2024. URL https://arxiv.org/abs/2410.07985

  19. [19]

    FrontierMath: A benchmark for evaluating advanced mathematical reasoning in AI.arXiv preprint arXiv:2411.04872, 2024

    E. Glazer, E. Erdil, T. Besiroglu, D. Chicharro, E. Chen, A. Gunning, C. F. Olsson, J.-S. Denain, A. Ho, E. de Oliveira Santos, O. Järviniemi, M. Barnett, R. Sandler, J. Sevilla, Q. Ren, E. Pratt, L. Levine, G. Barkley, N. Stewart, B. Grechuk, T. Grechuk, and S. V . Enugandla. Frontiermath: A benchmark for evaluating advanced mathematical reasoning in ai,...

  20. [20]

    C. He, R. Luo, Y . Bai, S. Hu, Z. L. Thai, J. Shen, J. Hu, X. Han, Y . Huang, Y . Zhang, J. Liu, L. Qi, Z. Liu, and M. Sun. Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems, 2024. URLhttps://arxiv.org/abs/2402.14008

  21. [21]

    Measuring Coding Challenge Competence With APPS

    D. Hendrycks, S. Basart, S. Kadavath, M. Mazeika, A. Arora, E. Guo, C. Burns, S. Puranik, H. He, D. Song, and J. Steinhardt. Measuring coding challenge competence with apps, 2021. URL https: //arxiv.org/abs/2105.09938

  22. [22]

    Measuring Massive Multitask Language Understanding

    D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring massive multitask language understanding, 2021. URLhttps://arxiv.org/abs/2009.03300

  23. [23]

    Hendrycks, C

    D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt. Measuring mathematical problem solving with the math dataset, 2021. URL https://arxiv.org/abs/2103. 03874

  24. [24]

    Hendrycks, A

    D. Hendrycks, A. Zou, M. Mazeika, L. Tang, B. Li, D. Song, and J. Steinhardt. Pixmix: Dreamlike pictures comprehensively improve safety measures, 2022. URLhttps://arxiv.org/abs/2112.05135

  25. [25]

    Hosseini, A

    A. Hosseini, A. Sordoni, D. Toyama, A. Courville, and R. Agarwal. Not all llm reasoners are created equal,

  26. [26]

    URLhttps://arxiv.org/abs/2410.01748

  27. [27]

    Jacovi, A

    A. Jacovi, A. Wang, C. Alberti, C. Tao, J. Lipovetz, K. Olszewska, L. Haas, M. Liu, N. Keating, A. Bloniarz, C. Saroufim, C. Fry, D. Marcus, D. Kukliansky, G. S. Tomar, J. Swirhun, J. Xing, L. W. andMadhu Gurumurthy, M. Aaron, M. Ambar, R. Fellinger, R. Wang, R. Sims, Z. Zhang, S. Goldshtein, and D. Das. Facts leaderboard. https://kaggle.com/facts-leaderb...

  28. [28]

    C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. Narasimhan. Swe-bench: Can language models resolve real-world github issues?, 2024. URLhttps://arxiv.org/abs/2310.06770

  29. [29]

    arXiv:2104.14337 (2021), https://arxiv.org/abs/2104.14337

    D. Kiela, M. Bartolo, Y . Nie, D. Kaushik, A. Geiger, Z. Wu, B. Vidgen, G. Prasad, A. Singh, P. Ringshia, Z. Ma, T. Thrush, S. Riedel, Z. Waseem, P. Stenetorp, R. Jia, M. Bansal, C. Potts, and A. Williams. Dynabench: Rethinking benchmarking in nlp, 2021. URLhttps://arxiv.org/abs/2104.14337

  30. [30]

    Refusal-trained llms are easily jailbroken as browser agents.arXiv preprint arXiv:2410.13886,

    P. Kumar, E. Lau, S. Vijayakumar, T. Trinh, S. R. Team, E. Chang, V . Robinson, S. Hendryx, S. Zhou, M. Fredrikson, S. Yue, and Z. Wang. Refusal-trained llms are easily jailbroken as browser agents, 2024. URLhttps://arxiv.org/abs/2410.13886

  31. [31]

    J. M. Laurent, J. D. Janizek, M. Ruzo, M. M. Hinks, M. J. Hammerling, S. Narayanan, M. Ponnapati, A. D. White, and S. G. Rodriques. Lab-bench: Measuring capabilities of language models for biology research,

  32. [32]

    URLhttps://arxiv.org/abs/2407.10362

  33. [33]

    N. Li, A. Pan, A. Gopal, S. Yue, D. Berrios, A. Gatti, J. D. Li, A.-K. Dombrowski, S. Goel, L. Phan, G. Mukobi, N. Helm-Burger, R. Lababidi, L. Justen, A. B. Liu, M. Chen, I. Barrass, O. Zhang, X. Zhu, R. Tamirisa, B. Bharathi, A. Khoja, Z. Zhao, A. Herbert-V oss, C. B. Breuer, S. Marks, O. Patel, A. Zou, M. Mazeika, Z. Wang, P. Oswal, W. Lin, A. A. Hunt,...

  34. [34]

    P. Lu, H. Bansal, T. Xia, J. Liu, C. Li, H. Hajishirzi, H. Cheng, K.-W. Chang, M. Galley, and J. Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts, 2024. URL https://arxiv.org/abs/2310.02255

  35. [35]

    T. R. McIntosh, T. Susnjak, N. Arachchilage, T. Liu, P. Watters, and M. N. Halgamuge. Inadequacies of large language model benchmarks in the era of generative artificial intelligence, 2024. URL https: //arxiv.org/abs/2402.09880. 11

  36. [36]

    Y. Nie, A. Williams, E. Dinan, M. Bansal, J. Weston, and D. Kiela. Adversarial nli: A new benchmark for natural language understanding, 2020. URL https://arxiv.org/abs/1910.14599

  37. [37]

    OpenAI. Openai o1 system card, 2024. URL https://cdn.openai.com/o1-system-card-20240917.pdf

  38. [38]

    OpenAI. Openai and los alamos national laboratory announce bioscience research partnership, 2024. URL https://openai.com/index/openai-and-los-alamos-national-laboratory-work-together/

  39. [39]

    OpenAI. Introducing swe-bench verified, 2024. URL https://openai.com/index/introducing-swe-bench-verified/

  40. [40]

    OpenAI et al. Gpt-4 technical report, 2024. URL https://arxiv.org/abs/2303.08774

  41. [41]

    S. Ott, A. Barbosa-Silva, K. Blagec, J. Brauner, and M. Samwald. Mapping global dynamics of benchmark creation and saturation in artificial intelligence. Nature Communications, 13(1):6793, 2022

  42. [42]

    D. Owen. How predictable is language model benchmark performance?, 2024. URL https://arxiv.org/abs/2401.04757

  43. [43]

    Discovering Language Model Behaviors with Model-Written Evaluations

    E. Perez, S. Ringer, K. Lukošiūtė, K. Nguyen, E. Chen, S. Heiner, C. Pettit, C. Olsson, S. Kundu, S. Kadavath, A. Jones, A. Chen, B. Mann, B. Israel, B. Seethor, C. McKinnon, C. Olah, D. Yan, D. Amodei, D. Amodei, D. Drain, D. Li, E. Tran-Johnson, G. Khundadze, J. Kernion, J. Landis, J. Kerr, J. Mueller, J. Hyun, J. Landau, K. Ndousse, L. Goldberg, L. ...

  44. [44]

    M. Phuong, M. Aitchison, E. Catt, S. Cogan, A. Kaskasoli, V. Krakovna, D. Lindner, M. Rahtz, Y. Assael, S. Hodkinson, H. Howard, T. Lieberum, R. Kumar, M. A. Raad, A. Webson, L. Ho, S. Lin, S. Farquhar, M. Hutter, G. Deletang, A. Ruoss, S. El-Sayed, S. Brown, A. Dragan, R. Shah, A. Dafoe, and T. Shevlane. Evaluating frontier models for dangerous capabil...

  45. [45]

    P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang. Squad: 100,000+ questions for machine comprehension of text, 2016. URL https://arxiv.org/abs/1606.05250

  46. [46]

    P. Rajpurkar, R. Jia, and P. Liang. Know what you don't know: Unanswerable questions for squad, 2018. URL https://arxiv.org/abs/1806.03822

  47. [47]

    D. Rein, B. L. Hou, A. C. Stickland, J. Petty, R. Y. Pang, J. Dirani, J. Michael, and S. R. Bowman. Gpqa: A graduate-level google-proof q&a benchmark, 2023. URL https://arxiv.org/abs/2311.12022

  48. [48]

    K. Singhal, S. Azizi, T. Tu, S. S. Mahdavi, J. Wei, H. W. Chung, N. Scales, A. Tanwani, H. Cole-Lewis, S. Pfohl, et al. Large language models encode clinical knowledge. Nature, 620(7972):172–180, 2023

  49. [49]

    M. Skarlinski, J. Laurent, A. Bou, and A. White. About 30% of Humanity's Last Exam chemistry/biology answers are likely wrong, July 2025. URL https://www.futurehouse.org/research-announcements/hle-exam

  50. [50]

    V. K. Srinivasan, Z. Dong, B. Zhu, B. Yu, H. Mao, D. Mosk-Aoyama, K. Keutzer, J. Jiao, and J. Zhang. Nexusraven: A commercially-permissive language model for function calling. In NeurIPS 2023 Foundation Models for Decision Making Workshop, 2023. URL https://openreview.net/forum?id=5lcPe6DqfI

  51. [51]

    Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

    A. Srivastava, A. Rastogi, A. Rao, A. A. M. Shoeb, A. Abid, A. Fisch, A. R. Brown, A. Santoro, A. Gupta, A. Garriga-Alonso, A. Kluska, A. Lewkowycz, A. Agarwal, A. Power, A. Ray, A. Warstadt, A. W. Kocurek, A. Safaya, A. Tazarv, A. Xiang, A. Parrish, A. Nie, A. Hussain, A. Askell, A. Dsouza, A. Slone, A. Rahane, A. S. Iyer, A. Andreassen, A. Madotto, A. S...

  52. [52]

    S. A. Taghanaki, A. Khani, and A. Khasahmadi. Mmlu-pro+: Evaluating higher-order reasoning and shortcut learning in llms, 2024. URL https://arxiv.org/abs/2409.02257

  53. [53]

    G. Team et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, 2024. URL https://arxiv.org/abs/2403.05530

  55. [55]

    G. Tsoukalas, J. Lee, J. Jennings, J. Xin, M. Ding, M. Jennings, A. Thakur, and S. Chaudhuri. Putnambench: Evaluating neural theorem-provers on the putnam mathematical competition, 2024. URL https://arxiv.org/abs/2407.11214

  56. [56]

    A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman. Glue: A multi-task benchmark and analysis platform for natural language understanding, 2019. URL https://arxiv.org/abs/1804.07461

  57. [57]

    A. Wang, Y. Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman. Superglue: A stickier benchmark for general-purpose language understanding systems, 2020. URL https://arxiv.org/abs/1905.00537

  58. [58]

    Y. Wang, X. Ma, G. Zhang, Y. Ni, A. Chandra, S. Guo, W. Ren, A. Arulraj, X. He, Z. Jiang, T. Li, M. Ku, K. Wang, A. Zhuang, R. Fan, X. Yue, and W. Chen. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark (published at neurips 2024 track datasets and benchmarks), 2024. URL https://arxiv.org/abs/2406.01574

  59. [59]

    J. Wei, N. Karina, H. W. Chung, Y. J. Jiao, S. Papay, A. Glaese, J. Schulman, and W. Fedus. Measuring short-form factuality in large language models, 2024. URL https://arxiv.org/abs/2411.04368

  60. [60]

    H. Wijk, T. Lin, J. Becker, S. Jawhar, N. Parikh, T. Broadley, L. Chan, M. Chen, J. Clymer, J. Dhyani, E. Ericheva, K. Garcia, B. Goodrich, N. Jurkovic, M. Kinniment, A. Lajko, S. Nix, L. Sato, W. Saunders, M. Taran, B. West, and E. Barnes. Re-bench: Evaluating frontier ai r&d capabilities of language model agents against human experts, 2024. URL https://a...

  61. [61]

    xAI. Grok-2 beta release, 2024. URL https://x.ai/blog/grok-2

  62. [62]

    F. Yan, H. Mao, C. C.-J. Ji, T. Zhang, S. G. Patil, I. Stoica, and J. E. Gonzalez. Berkeley function calling leaderboard. https://gorilla.cs.berkeley.edu/blogs/8_berkeley_function_calling_leaderboard.html, 2024

  63. [63]

    Z. Yang, P. Qi, S. Zhang, Y. Bengio, W. W. Cohen, R. Salakhutdinov, and C. D. Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering, 2018. URL https://arxiv.org/abs/1809.09600

  64. [64]

    S. Yao, N. Shinn, P. Razavi, and K. Narasimhan. τ-bench: A benchmark for tool-agent-user interaction in real-world domains, 2024. URL https://arxiv.org/abs/2406.12045

  65. [65]

    A. K. Zhang, N. Perry, R. Dulepet, J. Ji, J. W. Lin, E. Jones, C. Menders, G. Hussein, S. Liu, D. Jasper, P. Peetathawatchai, A. Glenn, V. Sivashankar, D. Zamoshchin, L. Glikbarg, D. Askaryar, M. Yang, T. Zhang, R. Alluri, N. Tran, R. Sangpisit, P. Yiorkadjis, K. Osele, G. Raghupathi, D. Boneh, D. E. Ho, and P. Liang. Cybench: A framework for evaluating ...

  66. [66]

    W. Zhong, R. Cui, Y. Guo, Y. Liang, S. Lu, Y. Wang, A. Saied, W. Chen, and N. Duan. Agieval: A human-centric benchmark for evaluating foundation models, 2023. URL https://arxiv.org/abs/2304.06364

A Authors

We offered optional co-authorship to all question submitters with an accepted question in HUMANITY'S LAST EXAM (including both public and private...

Author affiliations include:

Independent Researcher
University of California, Berkeley
Massachusetts Institute of Technology
University of Cambridge
University of Oxford
Princeton University
Carnegie Mellon University
University of Chicago
University of Michigan
École Polytechnique Fédérale de Lausanne
University of Toronto
University of Illinois Urbana-Champaign
Washington University
University of Wisconsin-Madison
