CoRR abs/2407.13168 (2024)

Minyang Tian, Luyu Gao, Shizhuo Dylan Zhang, Xinan Chen, Cunwei Fan, Xuefei Guo, Roland Haas, Pan Ji, Kittithat Krongchon, Yao Li, Shengyan Liu, Di Luo, Yutao Ma, Hao Tong, Kha Trinh, Chenyu Tian, Zihan Wang, Bohao Wu, Yanyu Xiong, Shengzhu · 2024 · arXiv 2407.13168

9 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

dataset: 2 · background: 1

citation-polarity summary

use dataset: 2 · background: 1

representative citing papers

SCICONVBENCH: Benchmarking LLMs on Multi-Turn Clarification for Task Formulation in Computational Science

cs.AI · 2026-05-18 · unverdicted · novelty 7.0

SCICONVBENCH is a new benchmark evaluating LLMs on multi-turn disambiguation and inconsistency resolution for task formulation in computational science, with frontier models reaching only 52.7% success on fluid mechanics disambiguation cases.

Collider-Bench: Benchmarking AI Agents with Particle Physics Analysis Reproduction

cs.LG · 2026-05-13 · unverdicted · novelty 7.0

Collider-Bench is a new benchmark showing that current LLM agents cannot reliably reproduce LHC analyses at the level of a physicist-in-the-loop.

Open-World Evaluations for Measuring Frontier AI Capabilities

cs.AI · 2026-05-19 · conditional · novelty 6.0

Open-world evaluations using qualitative review of real-world tasks can give earlier warnings of frontier AI capabilities than automated benchmarks, as demonstrated by an AI agent publishing a simple iOS app with one minor human fix.

No Test Cases, No Problem: Distillation-Driven Code Generation for Scientific Workflows

cs.SE · 2026-04-25 · unverdicted · novelty 6.0

MOSAIC generates executable scientific code without I/O test cases by combining student-teacher distillation with a consolidated context window to reduce hallucinations across subproblems.

Towards Verifiable and Self-Correcting AI Physicists for Quantum Many-Body Simulations

physics.comp-ph · 2026-03-31 · unverdicted · novelty 6.0

QMP-Bench supplies a realistic test set for AI on quantum many-body problems while PhysVEC uses integrated verifiers to turn unreliable LLM generations into code that passes both syntax and physics checks, outperforming baselines.

MOSAIC: Multi-agent Orchestration for Task-Intelligent Scientific Coding

cs.CL · 2025-10-09 · unverdicted · novelty 6.0

MOSAIC is a training-free multi-agent LLM framework with rationale, coding, reflection, and debugging agents plus a consolidated context window that outperforms prior methods on scientific coding benchmarks.

The Unreasonable Effectiveness of Entropy Minimization in LLM Reasoning

cs.LG · 2025-05-21 · unverdicted · novelty 6.0

Entropy minimization on self-generated outputs elicits strong reasoning in pretrained LLMs, matching or exceeding supervised RL methods on benchmarks.

From Pixels to Digital Agents: An Empirical Study on the Taxonomy and Technological Trends of Reinforcement Learning Environments

cs.AI · 2026-03-25 · unverdicted · novelty 5.0

An empirical literature analysis reveals a bifurcation in RL environments into Semantic Prior (LLM-dominated) and Domain-Specific Generalization ecosystems with distinct cognitive fingerprints.

AI for Auto-Research: Roadmap & User Guide

cs.AI · 2026-05-18 · unverdicted · novelty 4.0

The paper delivers a stage-by-stage roadmap for AI in research, showing reliable assistance in retrieval and tool tasks but fragility in novelty and judgment, advocating human-governed collaboration.

citing papers explorer

SCICONVBENCH: Benchmarking LLMs on Multi-Turn Clarification for Task Formulation in Computational Science
cs.AI · 2026-05-18 · unverdicted · none · ref 64

Collider-Bench: Benchmarking AI Agents with Particle Physics Analysis Reproduction
cs.LG · 2026-05-13 · unverdicted · none · ref 31

Open-World Evaluations for Measuring Frontier AI Capabilities
cs.AI · 2026-05-19 · conditional · none · ref 80

No Test Cases, No Problem: Distillation-Driven Code Generation for Scientific Workflows
cs.SE · 2026-04-25 · unverdicted · none · ref 13

Towards Verifiable and Self-Correcting AI Physicists for Quantum Many-Body Simulations
physics.comp-ph · 2026-03-31 · unverdicted · none · ref 31

MOSAIC: Multi-agent Orchestration for Task-Intelligent Scientific Coding
cs.CL · 2025-10-09 · unverdicted · none · ref 33

The Unreasonable Effectiveness of Entropy Minimization in LLM Reasoning
cs.LG · 2025-05-21 · unverdicted · none · ref 82

From Pixels to Digital Agents: An Empirical Study on the Taxonomy and Technological Trends of Reinforcement Learning Environments
cs.AI · 2026-03-25 · unverdicted · none · ref 178

AI for Auto-Research: Roadmap & User Guide
cs.AI · 2026-05-18 · unverdicted · none · ref 204
