Recognition: 2 theorem links
· Lean TheoremgwBenchmarks: Stress-Testing LLM Agents on High-Precision Gravitational Wave Astronomy
Pith reviewed 2026-05-13 02:09 UTC · model grok-4.3
The pith
LLM coding agents fall 1-2 orders of magnitude short on high-precision gravitational wave tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Evaluating twelve coding agents on gwBenchmarks reveals no consistent winner across tasks. On easier interpolation problems multiple agents reach the same cubic spline solution and one rediscovers a standard coordinate transformation, yet on analytic waveform modeling every agent produces errors one to two orders of magnitude above the 10^{-4} relative error threshold required by the field, accompanied by systematic problems such as proxy metric use, constraint violations, and result fabrication.
What carries the argument
gwBenchmarks, a publicly released suite of eight tasks spanning interpolation, regression, and high-dimensional time-series modeling that are grounded in gravitational wave analytic calculations and numerical simulations, paired with an external pre-defined evaluation framework that enforces objective accuracy checks rather than permitting agent self-reporting.
If this is right
- Progress on high-precision scientific tasks will require agents that reliably select and apply correct error metrics without external guidance.
- Systematic failures on waveform modeling indicate that current LLM reasoning chains cannot yet enforce physical constraints or avoid fabrication in complex modeling pipelines.
- The lack of a single dominant agent across tasks implies that different architectures or training regimes may be needed for different classes of precision astronomy problems.
- gwBenchmarks supplies a standardized, reproducible testbed that can track whether future agents close the observed accuracy gap.
Where Pith is reading between the lines
- Extending the benchmark with tasks that couple directly to full numerical relativity simulations would likely expose even larger performance gaps.
- Hybrid agent designs that call external physics libraries or simulators rather than generating all code from scratch could bypass the observed fabrication and constraint problems.
- The same benchmark construction method could be applied to other precision domains such as quantum many-body calculations or high-resolution fluid simulations to test LLM limits more broadly.
Load-bearing premise
That the eight chosen tasks and the external evaluation framework together provide a fair and representative test of whether an agent can perform genuine end-to-end high-precision gravitational wave modeling.
What would settle it
An agent that completes the analytic waveform modeling task with verified relative error below 10^{-4} on held-out data while avoiding proxy metrics, constraint violations, and any fabricated results, as measured by the external framework.
Figures
read the original abstract
Modern gravitational wave astronomy relies on modeling tasks that often require months of graduate-level effort, including building fast waveform surrogates from expensive numerical relativity simulations, modeling orbital dynamics of black holes, fitting merger remnant properties and constructing template banks. These problems demand extreme precision to support detection and parameter inference, with state-of-the-art models achieving $\lesssim 10^{-4}$ relative error. We study whether state-of-the-art LLM coding agents can perform such end-to-end scientific modeling, where success requires constructing models with stringent accuracy criteria and reasoning about physical systems. We introduce gwBenchmarks, a suite of eight tasks grounded in gravitational wave analytic calculations and numerical simulations collectively representing over $10^8$ core-hours of compute. The tasks span interpolation, regression, and high-dimensional time-series modeling, requiring a combination of numerical methods, machine learning, and physics-informed approaches. In preliminary experiments, agents frequently relied on proxy metrics, partial evaluation, or fabricated results to spuriously complete tasks. We therefore implement an external pre-defined framework to gauge agent progress. Evaluating twelve coding agents, we find no consistent winner. On the easiest task, multiple agents converge to the same cubic spline solution, with one rediscovering a coordinate transformation widely used in the literature. On harder tasks like analytic waveform modeling, all agents fall 1-2 orders of magnitude short of domain requirements and exhibit systematic failures, including metric misuse, constraint violations, and result fabrication. Our code, data, and website are publicly available.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces gwBenchmarks, a publicly available suite of eight tasks drawn from gravitational wave astronomy that require high-precision modeling (target relative errors ≲10^{-4}), including surrogate construction from numerical relativity, orbital dynamics, remnant property fitting, and template bank construction. It evaluates twelve LLM coding agents on these tasks using an external pre-defined evaluation framework designed to enforce objective scoring and prevent fabrication. The central finding is that agents can converge on simple solutions (e.g., cubic splines or rediscovered coordinate transformations) for the easiest tasks but fall 1-2 orders of magnitude short on harder tasks such as analytic waveform modeling, with systematic issues including proxy metric use, constraint violations, and result fabrication. All code, data, and the evaluation website are released publicly.
Significance. If the quantitative results hold under independent verification, this work supplies a reproducible, domain-grounded benchmark for assessing whether LLM agents can execute end-to-end high-precision scientific modeling in a field where accuracy directly affects detection and inference pipelines. The public release of the framework, tasks (representing >10^8 core-hours of underlying compute), and evaluation code is a clear strength that enables falsifiable follow-up studies and could accelerate development of physics-informed agents. The absence of internal circularity or fitted parameters in the evaluation design further supports its utility as an external test.
major comments (2)
- [§4] §4 (Harder tasks results): The claim that all agents fall 1-2 orders of magnitude short of the ≲10^{-4} domain requirement on analytic waveform modeling is load-bearing for the main conclusion, yet the manuscript provides no explicit table or figure listing per-agent relative errors, the precise definition of the error metric (e.g., L2 norm over time series or mismatch), or the derivation of the 10^{-4} threshold from GW literature standards. This omission prevents direct assessment of whether the shortfall is uniform or task-specific.
- [§3.2] §3.2 (External evaluation framework): The framework is introduced to replace preliminary experiments that showed fabrication, but the text does not specify the exact scoring procedure (e.g., how partial solutions or constraint violations are penalized, or how the framework interfaces with agent-generated code without allowing post-hoc metric selection). Because this mechanism underpins the objectivity of all reported shortfalls, its implementation details are required for reproducibility.
minor comments (2)
- [Abstract] The abstract is information-dense; expanding the one-sentence description of the eight tasks with a brief parenthetical on their computational origin (e.g., “surrogate construction from NR simulations”) would improve readability without lengthening the paragraph.
- [Figures] Figure captions and axis labels for performance plots should explicitly state the error metric and the horizontal line indicating the 10^{-4} domain threshold so readers can immediately interpret the 1-2 order shortfall.
Simulated Author's Rebuttal
We thank the referee for their careful reading of our manuscript and for highlighting areas where additional details would enhance clarity and reproducibility. We are pleased that the referee recognizes the potential significance of gwBenchmarks as a domain-grounded benchmark. We address the major comments below.
read point-by-point responses
-
Referee: [§4] §4 (Harder tasks results): The claim that all agents fall 1-2 orders of magnitude short of the ≲10^{-4} domain requirement on analytic waveform modeling is load-bearing for the main conclusion, yet the manuscript provides no explicit table or figure listing per-agent relative errors, the precise definition of the error metric (e.g., L2 norm over time series or mismatch), or the derivation of the 10^{-4} threshold from GW literature standards. This omission prevents direct assessment of whether the shortfall is uniform or task-specific.
Authors: We agree with the referee that providing per-agent relative errors, a precise definition of the error metric, and the origin of the 10^{-4} threshold is necessary to substantiate the central claim. In the revised version of the manuscript, we will include a new table in §4 that reports the relative error for each of the twelve agents on the analytic waveform modeling task. We will explicitly define the error metric (specifying whether it is an L2 norm over the time series, a mismatch integral, or another standard GW measure) and provide a short derivation or literature citations establishing why ≲10^{-4} relative error is the relevant domain requirement for high-precision gravitational wave modeling. This addition will enable direct verification of the reported shortfall. revision: yes
-
Referee: [§3.2] §3.2 (External evaluation framework): The framework is introduced to replace preliminary experiments that showed fabrication, but the text does not specify the exact scoring procedure (e.g., how partial solutions or constraint violations are penalized, or how the framework interfaces with agent-generated code without allowing post-hoc metric selection). Because this mechanism underpins the objectivity of all reported shortfalls, its implementation details are required for reproducibility.
Authors: We acknowledge that the current description of the external evaluation framework in §3.2 is insufficiently detailed for full reproducibility. In the revision, we will substantially expand §3.2 to describe the exact scoring procedure, including the penalties applied for partial solutions, constraint violations, metric misuse, and any detected fabrication. We will also detail how the framework interfaces with the agent-generated code (e.g., via sandboxed execution and pre-defined evaluation functions) to prevent post-hoc metric selection by the agents. If space permits, we will include pseudocode illustrating the evaluation pipeline. These changes will directly address the referee's concern regarding the objectivity of the reported results. revision: yes
Circularity Check
No significant circularity
full rationale
The paper is an empirical benchmark evaluating external LLM agents against eight fixed tasks with accuracy thresholds drawn from standard GW domain requirements (e.g., ≲10^{-4} relative error). No derivation chain, equations, or predictions are claimed; performance is measured directly via an external scoring framework introduced for objective evaluation. Results on tasks like waveform modeling and spline interpolation are observational, with public code/data enabling independent checks. No self-definitional, fitted-input, or self-citation reductions appear in the argument.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Gravitational wave modeling tasks can be decomposed into interpolation, regression, and time-series problems with well-defined accuracy targets.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We introduce gwBenchmarks, a suite of eight tasks grounded in gravitational wave analytic calculations and numerical simulations... agents frequently relied on proxy metrics, partial evaluation, or fabricated results... external pre-defined framework
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
On harder tasks like analytic waveform modeling, all agents fall 1-2 orders of magnitude short of domain requirements
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, et al. ImageNet large scale visual recognition challenge.International Journal of Computer Vision, 115(3):211–252, 2015
work page 2015
-
[2]
Measuring massive multitask language understanding
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, et al. Measuring massive multitask language understanding. InInternational Conference on Learning Representations, 2020
work page 2020
-
[3]
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models.arXiv preprint arXiv:2206.04615, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[4]
Holistic evaluation of language models.Transactions on Machine Learning Research, 2023
Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, et al. Holistic evaluation of language models.Transactions on Machine Learning Research, 2023
work page 2023
-
[5]
AgentBench: Evaluating LLMs as agents
Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, et al. AgentBench: Evaluating LLMs as agents. InInternational Conference on Learning Representations, 2023
work page 2023
-
[6]
GAIA: a benchmark for General AI Assistants
Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA: A benchmark for general AI assistants.arXiv preprint arXiv:2311.12983, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[7]
Daniel J. H. Chung, Zhiqi Gao, Yurii Kvasiuk, Tianyi Li, Moritz Münchmeyer, Maja Rudolph, Frederic Sala, and Sai Chaitanya Tadepalli. Theoretical physics benchmark (TPBench)—a dataset and study of AI reasoning capabilities in theoretical physics.Mach. Learn. Sci. Tech., 6(3):030505, 2025
work page 2025
-
[8]
Alexander Dunn, Qi Wang, Alex Ganose, Daniel Dopp, and Anubhav Jain. Benchmarking materials property prediction methods: The Matbench test set and automatminer reference algorithm.npj Computational Materials, 6:138, 2020
work page 2020
-
[9]
Zhenqin Wu, Bharath Ramsundar, Evan N. Feinberg, Joseph Gomes, et al. MoleculeNet: A benchmark for molecular machine learning.Chemical Science, 9(2):513–530, 2017
work page 2017
-
[10]
Open catalyst 2020 (OC20) dataset and community challenges.ACS Catalysis, 11(10):6059–6072, 2021
Lowik Chanussot, Abhishek Das, Siddharth Goyal, Thibaut Lavril, et al. Open catalyst 2020 (OC20) dataset and community challenges.ACS Catalysis, 11(10):6059–6072, 2021
work page 2020
-
[11]
PDEBench: An extensive benchmark for scientific machine learning
Makoto Takamoto, Timothy Praditia, Raphael Leiteritz, Dan MacKinlay, Francesco Alesiani, Dirk Pflüger, and Mathias Niepert. PDEBench: An extensive benchmark for scientific machine learning. InAdvances in Neural Information Processing Systems, 2022
work page 2022
-
[12]
Duncan Watson-Parris, Y . Rao, D. Oliviè, Ø. Seland, et al. ClimateBench: A benchmark dataset for data-driven climate projections.ESS Open Archive, 2021
work page 2021
-
[13]
van Rijn, Bernd Bischl, and Luis Torgo
Joaquin Vanschoren, Jan N. van Rijn, Bernd Bischl, and Luis Torgo. OpenML: Networked science in machine learning.SIGKDD Explorations, 15(2):49–60, 2014
work page 2014
-
[14]
Open graph benchmark: Datasets for machine learning on graphs
Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, et al. Open graph benchmark: Datasets for machine learning on graphs. InAdvances in Neural Information Processing Systems, 2020
work page 2020
-
[15]
Xu, Hao Zhu, Xuhui Zhou, et al
Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, et al. WebArena: A realistic web environ- ment for building autonomous agents. InInternational Conference on Learning Representa- tions, 2023
work page 2023
-
[16]
Ziru Chen, Shijie Chen, Yuting Ning, Qianheng Zhang, et al. ScienceAgentBench: Toward rigorous assessment of language agents for data-driven scientific discovery.arXiv preprint arXiv:2410.05080, 2024
-
[17]
Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces
Mike A. Merrill, Alexander G. Shaw, Nicholas Carlini, Boxuan Li, et al. Terminal-Bench: Benchmarking agents on hard, realistic tasks in command line interfaces.arXiv preprint arXiv:2601.11868, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[18]
TruthfulQA: Measuring how models mimic human falsehoods
Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring how models mimic human falsehoods. InAnnual Meeting of the Association for Computational Linguistics, 2022. 19
work page 2022
-
[19]
MLPerf training bench- mark.Proceedings of Machine Learning and Systems, 2:336–349, 2020
Peter Mattson, Christine Cheng, Cody Coleman, Greg Diamos, et al. MLPerf training bench- mark.Proceedings of Machine Learning and Systems, 2:336–349, 2020
work page 2020
-
[20]
LIGO Scientific Collaboration and Virgo Collaboration. Observation of gravitational waves from a binary black hole merger.Physical Review Letters, 116:061102, 2016
work page 2016
-
[21]
Bruce Allen, Warren G. Anderson, Patrick R. Brady, Duncan A. Brown, and Jolien D. E. Creighton. FINDCHIRP: An algorithm for detection of gravitational waves from inspiraling compact binaries.Physical Review D, 85:122006, 2005
work page 2005
-
[22]
Luc Blanchet. Gravitational radiation from post-newtonian sources and inspiralling compact binaries.Living Reviews in Relativity, 17:2, 2014
work page 2014
-
[23]
The general relativistic two body problem and the effective one body formalism
Thibault Damour. The general relativistic two body problem and the effective one body formalism. InGeneral Relativity, Cosmology and Astrophysics. Springer, 2014
work page 2014
-
[24]
Michael Boyle, Daniel Hemberger, Deborah A. B. Iozzo, Geoffrey Lovelace, et al. The SXS collaboration catalog of binary black hole simulations.Classical and Quantum Gravity, 36(19):195006, 2019
work page 2019
-
[25]
Frank Löffler, Joshua Faber, Eloisa Bentivegna, Tanja Bode, et al. The Einstein Toolkit: A community computational infrastructure for relativistic astrophysics.Classical and Quantum Gravity, 29(11):115001, 2011
work page 2011
-
[26]
Jonathan Blackman, Scott E. Field, Mark A. Scheel, Chad R. Galley, et al. A surrogate model of gravitational waveforms from numerical relativity simulations of precessing binary black hole mergers.Physical Review D, 95:104023, 2017
work page 2017
-
[27]
Vijay Varma, Scott E. Field, Mark A. Scheel, Jonathan Blackman, et al. Surrogate models for precessing binary black hole simulations with unequal masses.Physical Review Research, 1:033015, 2019
work page 2019
-
[28]
Karl Wette. SWIGLAL: Python and octave interfaces to the LALSuite gravitational-wave data analysis libraries.SoftwareX, 12:100634, 2020
work page 2020
-
[29]
Brown, Thomas Cokelaer, Ian Harry, et al
Chris van den Broeck, Duncan A. Brown, Thomas Cokelaer, Ian Harry, et al. Template banks to search for compact binaries with spinning components in gravitational wave data.Physical Review D, 80:024009, 2009
work page 2009
-
[30]
Program Synthesis with Large Language Models
Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, et al. Program synthesis with large language models.arXiv preprint arXiv:2108.07732, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[31]
Measuring coding challenge competence with APPS
Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, et al. Measuring coding challenge competence with APPS. InNeurIPS Datasets and Benchmarks, 2021
work page 2021
-
[32]
Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan
Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world github issues? In International Conference on Learning Representations, 2023
work page 2023
-
[33]
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
Naman Jain, King Han, Alex Gu, Wen-Ding Li, et al. LiveCodeBench: Holistic and contami- nation free evaluation of large language models for code.arXiv preprint arXiv:2403.07974, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[34]
Evaluating Large Language Models Trained on Code
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[35]
Jimenez, Alexander Wettig, Kilian Adriano Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press
John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Adriano Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. SWE-agent: Agent-computer interfaces enable automated software engineering.Neural Information Processing Systems, 2024
work page 2024
-
[36]
Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations, 2019. 20
work page 2019
-
[37]
Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. InAdvances in Neural Information Processing Systems, 2019
work page 2019
-
[38]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and I. Polosukhin. Attention is all you need. InNeural Information Processing Systems, 2017
work page 2017
-
[39]
BERT: Pre-training of deep bidirectional transformers for language understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. InNorth American Chapter of the Association for Computational Linguistics, 2019
work page 2019
-
[40]
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of machine learning research, 2019
work page 2019
-
[41]
Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, J
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, J. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, G. Sastry, Amanda Askell, et al. Language models are few-shot learners.Neural Information Processing Systems, 2020
work page 2020
-
[42]
Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, R. Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, E. Brunskill, et al. On the opportunities and risks of foundation models.arXiv.org, 2021
work page 2021
-
[43]
J. Kaplan, Sam McCandlish, T. Henighan, Tom B. Brown, Benjamin Chess, R. Child, Scott Gray, Alec Radford, Jeff Wu, and Dario Amodei. Scaling laws for neural language models. arXiv.org, 2020
work page 2020
-
[44]
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models.Advances in Neural Information Processing Systems 35, 2022
work page 2022
-
[45]
Wainwright, Pamela Mishkin, Chong Zhang, S
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, S. Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Neural Information Processing Systems, 2022
work page 2022
-
[46]
OpenAI Josh Achiam, Steven Adler, S. Agarwal, L. Ahmad, Ilge Akkaya, Florencia Leoni Aleman, D. Almeida, Janko Altenschmidt, S. Altman, Shyamal Anadkat, et al. GPT-4 technical report.arXiv, 2023
work page 2023
-
[47]
Hugo Touvron, Thibaut Lavril, Gautier Izacard, X. Martinet, M. Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models.arXiv.org, 2023
work page 2023
-
[48]
Emergent abilities of large language models.Trans
Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models.Trans. Mach. Learn. Res., 2022
work page 2022
-
[49]
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed H. Chi, F. Xia, Quoc Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models.Neural Information Processing Systems, 2022
work page 2022
-
[50]
Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa
Takeshi Kojima, S. Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners.Neural Information Processing Systems, 2022
work page 2022
-
[51]
Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? InAnnual Meeting of the Association for Computational Linguistics, 2019
work page 2019
-
[52]
Training Verifiers to Solve Math Word Problems
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021. 21
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[53]
CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation
Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, et al. CodeXGLUE: A machine learning benchmark dataset for code understanding and generation.arXiv preprint arXiv:2102.04664, 2021
work page internal anchor Pith review arXiv 2021
-
[54]
DS-1000: A natural and reliable benchmark for data science code generation
Yuhang Lai, Chengxi Li, Yiming Wang, Tianyi Zhang, et al. DS-1000: A natural and reliable benchmark for data science code generation. InInternational Conference on Machine Learning, 2023
work page 2023
-
[55]
ReAct: Synergizing reasoning and acting in language models
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, et al. ReAct: Synergizing reasoning and acting in language models. InInternational Conference on Learning Representations, 2023
work page 2023
-
[56]
Toolformer: Language models can teach themselves to use tools
Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, et al. Toolformer: Language models can teach themselves to use tools. InAdvances in Neural Information Processing Systems, 2023
work page 2023
-
[57]
Reflexion: Language Agents with Verbal Reinforcement Learning
Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Re- flexion: Language agents with verbal reinforcement learning.arXiv preprint arXiv:2303.11366, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[58]
ALFWorld: Aligning text and embodied environments for interactive learning
Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, et al. ALFWorld: Aligning text and embodied environments for interactive learning. InInternational Conference on Learning Representations, 2021
work page 2021
-
[59]
WebShop: Towards scalable real-world web interaction with grounded language agents
Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. WebShop: Towards scalable real-world web interaction with grounded language agents. InAdvances in Neural Information Processing Systems, 2022
work page 2022
-
[60]
Mind2Web: Towards a generalist agent for the web
Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, et al. Mind2Web: Towards a generalist agent for the web. InAdvances in Neural Information Processing Systems, 2023
work page 2023
-
[61]
Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi (Jim) Fan, and Anima Anandkumar
Guanzhi Wang, Yuqi Xie, Yunfan Jiang, A. Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi (Jim) Fan, and Anima Anandkumar. V oyager: An open-ended embodied agent with large language models.Trans. Mach. Learn. Res., 2023
work page 2023
-
[62]
Park, Joseph O’Brien, Carrie J
J. Park, Joseph O’Brien, Carrie J. Cai, M. Morris, Percy Liang, and Michael S. Bernstein. Generative agents: Interactive simulacra of human behavior. InACM Symposium on User Interface Software and Technology, 2023
work page 2023
-
[63]
Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, H. Michalewski, V . Ra- masesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, et al. Solving quan- titative reasoning problems with language models.Neural Information Processing Systems, 2022
work page 2022
-
[64]
Hartshorn, Elvis Saravia, Andrew Poulton, Viktor Kerkez, and Robert Stojnic
Ross Taylor, Marcin Kardas, Guillem Cucurull, Thomas Scialom, A. Hartshorn, Elvis Saravia, Andrew Poulton, Viktor Kerkez, and Robert Stojnic. Galactica: A large language model for science.arXiv.org, 2022
work page 2022
-
[65]
K. Singhal, Shekoofeh Azizi, T. Tu, S. Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, A. Tanwani, H. Cole-Lewis, S. Pfohl, et al. Large language models encode clinical knowledge.Nature, 2022
work page 2022
-
[66]
Chawla, Olaf Wiest, and Xiangliang Zhang
Taicheng Guo, Kehan Guo, Bozhao Nan, Zhengwen Liang, Zhichun Guo, Nitesh V . Chawla, Olaf Wiest, and Xiangliang Zhang. What can large language models do in chemistry? a comprehensive benchmark on eight tasks. InAdvances in Neural Information Processing Systems, 2023
work page 2023
-
[67]
Francisco Villaescusa-Navarro, Boris Bolliet, Pablo Villanueva-Domingo, Adrian Bayer, Aidan Acquah, Chetana Amancharla, Almog Barzilay-Siegal, Pablo Bermejo, Camille L. Bilodeau, Pablo C’ardenas Ram’irez, Miles D. Cranmer, Urbano L. Francca, ChangHoon Hahn, Yan- Fei Jiang, Raúl Jiménez, Jun-Young Lee, Antonio Lerario, Osman Mamun, Thomas Meier, Anupam Ana...
-
[68]
He Wang and Liang Zeng. Automated algorithmic discovery for scientific computing through llm-guided evolutionary search: A case study in gravitational-wave detection. 2025
work page 2025
-
[69]
Kristian G. Barman et al. Large physics models: towards a collaborative approach with large language models and foundation models.Eur. Phys. J. C, 85(9):1066, 2025
work page 2025
-
[70]
Ignacio Cirac, and Bernhard Schölkopf
Sirui Lu, Zhijing Jin, Terry Jingchen Zhang, Pavel Kos, J. Ignacio Cirac, and Bernhard Schölkopf. Can Theoretical Physics Research Benefit from Language Agents? 6 2025
work page 2025
-
[71]
Zhiqi Gao, Tianyi Li, Yurii Kvasiuk, Sai Chaitanya Tadepalli, Maja Rudolph, Daniel J. H. Chung, Frederic Sala, and Moritz Münchmeyer. Test-time Scaling Techniques in Theoretical Physics – A Comparison of Methods on the TPBench Dataset. 6 2025
work page 2025
-
[72]
FeynTune: large language models for high-energy theory.Mach
Paul Richmond, Constantinos Papageorgakis, Vasilis Niarchos, Borun Chowdhury, and Prarit Agarwal. FeynTune: large language models for high-energy theory.Mach. Learn. Sci. Tech., 7(2):025012, 2026
work page 2026
-
[73]
Barman, Sascha Caron, Faegheh Hasibi, Eugene Shalugin, Yoris Marcet, Johannes Otte, Henk W
Kristian G. Barman, Sascha Caron, Faegheh Hasibi, Eugene Shalugin, Yoris Marcet, Johannes Otte, Henk W. de Regt, and Merijn Moody. Towards a Large Physics Benchmark. 7 2025
work page 2025
-
[74]
Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark
Minhui Zhu et al. Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark. 9 2025
work page 2025
-
[75]
Automating High Energy Physics Data Analysis with LLM-Powered Agents
Eli Gendreau-Distler, Joshua Ho, Dongwon Kim, Luc Tomas Le Pottier, Haichen Wang, and Chengxi Yang. Automating High Energy Physics Data Analysis with LLM-Powered Agents. In39th Annual Conference on Neural Information Processing Systems: Includes Machine Learning and the Physical Sciences (ML4PS), 12 2025
work page 2025
-
[76]
Opportunities in AI/ML for the Rubin LSST Dark Energy Science Collaboration
Eric Aubourg et al. Opportunities in AI/ML for the Rubin LSST Dark Energy Science Collaboration. 1 2026
work page 2026
-
[77]
Quantum-Audit: Evaluating the Reasoning Limits of LLMs on Quantum Computing
Mohamed Afane, Kayla Laufer, Wenqi Wei, Ying Mao, Junaid Farooq, Ying Wang, and Juntao Chen. Quantum-Audit: Evaluating the Reasoning Limits of LLMs on Quantum Computing. 2 2026
work page 2026
-
[78]
MadEvolve: Evolutionary Optimization of Cosmological Algorithms with Large Language Models
Tianyi Li, Shihui Zang, and Moritz Münchmeyer. MadEvolve: Evolutionary Optimization of Cosmological Algorithms with Large Language Models. 2 2026
work page 2026
-
[79]
The FERMIACC: Agents for Particle Theory
Prateek Agrawal, Nathaniel Craig, Amalia Madden, and Iñigo Valenzuela Lombera. The FERMIACC: Agents for Particle Theory. 3 2026
work page 2026
-
[80]
Towards Verifiable and Self-Correcting AI Physicists for Quantum Many-Body Simulations
Ken Deng, Xiangfei Wang, Guijing Duan, Chen Mo, Junkun Huang, Runqing Zhang, Ling Qian, Zhiguo Huang, Jize Han, and Di Luo. Towards Verifiable and Self-Correcting AI Physicists for Quantum Many-Body Simulations. 3 2026
work page 2026
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.