Collider-Bench is a new benchmark showing that current LLM agents cannot reliably reproduce LHC analyses at the level of a physicist-in-the-loop.
Title resolution pending
5 Pith papers cite this work. Polarity classification is still indexing.
5
Pith papers citing it
years
2026 5representative citing papers
Single-prompt evaluations of instruction-tuned embedding models misrepresent performance and allow any model to be ranked first by favorable prompt choice.
ResearchArena shows that agent-generated papers fail top-tier acceptance standards primarily due to fabricated results, underpowered experiments, and plan-execution mismatches that vary sharply by agent.
Three Metapath2Vec variants create ingredient embeddings by walking a co-occurrence graph from recipes, a typed chemical compound graph from FlavorDB, or a controlled blend of both.