Elbeheiry, María Victoria Gil, Christina Glaubitz, Maximilian Greiner, Caroline T

Adrian Mirza, Nawaf Alampara, Sreekanth Kunchapu, Martiño Ríos-García, Benedict Emoekabu, Aswanth Krishnan, Tanya Gupta, Mara Schilling-Wilhelmi, Macjonathan Okereke, Anagha Aneesh, Mehrdad Asgari, Juliane Eberhardt, Amir Mohammad Elahi, Ha · 2025 · DOI 10.1038/s41557-025-01815-x

5 Pith papers cite this work. Polarity classification is still indexing.

5 Pith papers citing it

open at publisher browse 5 citing papers

representative citing papers

Evaluating Large Language Models in Scientific Discovery

cs.AI · 2025-12-17 · unverdicted · novelty 8.0

The SDE benchmark shows LLMs lag on scientific discovery tasks relative to general science tests, with diminishing scaling returns and shared weaknesses across models.

Collider-Bench: Benchmarking AI Agents with Particle Physics Analysis Reproduction

cs.LG · 2026-05-13 · unverdicted · novelty 7.0

Collider-Bench is a new benchmark showing that current LLM agents cannot reliably reproduce LHC analyses at the level of a physicist-in-the-loop.

AgentSPEX: An Agent SPecification and EXecution Language

cs.CL · 2026-04-14 · unverdicted · novelty 6.0

AgentSPEX is a new language and harness for explicitly specifying and running structured LLM-agent workflows with typed steps, control flow, parallel execution, and a visual editor.

SciCore-Mol: Augmenting Large Language Models with Pluggable Molecular Cognition Modules

cs.AI · 2026-05-21 · unverdicted · novelty 5.0

SciCore-Mol augments LLMs with three integrated modules for molecular perception, latent diffusion generation, and reaction reasoning, claiming an 8B open model competes with or exceeds proprietary systems on chemical tasks.

How Human-Like Are Large Language Models? A Register-Aware Linguistic Evaluation Framework

cs.CL · 2026-05-22

citing papers explorer

Showing 5 of 5 citing papers.

Evaluating Large Language Models in Scientific Discovery cs.AI · 2025-12-17 · unverdicted · none · ref 46
The SDE benchmark shows LLMs lag on scientific discovery tasks relative to general science tests, with diminishing scaling returns and shared weaknesses across models.
Collider-Bench: Benchmarking AI Agents with Particle Physics Analysis Reproduction cs.LG · 2026-05-13 · unverdicted · none · ref 13
Collider-Bench is a new benchmark showing that current LLM agents cannot reliably reproduce LHC analyses at the level of a physicist-in-the-loop.
AgentSPEX: An Agent SPecification and EXecution Language cs.CL · 2026-04-14 · unverdicted · none · ref 8
AgentSPEX is a new language and harness for explicitly specifying and running structured LLM-agent workflows with typed steps, control flow, parallel execution, and a visual editor.
SciCore-Mol: Augmenting Large Language Models with Pluggable Molecular Cognition Modules cs.AI · 2026-05-21 · unverdicted · none · ref 33
SciCore-Mol augments LLMs with three integrated modules for molecular perception, latent diffusion generation, and reaction reasoning, claiming an 8B open model competes with or exceeds proprietary systems on chemical tasks.
How Human-Like Are Large Language Models? A Register-Aware Linguistic Evaluation Framework cs.CL · 2026-05-22 · unreviewed · ref 141

Elbeheiry, María Victoria Gil, Christina Glaubitz, Maximilian Greiner, Caroline T

fields

years

verdicts

representative citing papers

citing papers explorer