InfiniteScienceGym procedurally generates unbounded scientific repositories with exact ground-truth QA pairs to benchmark LLMs on data reasoning, abstention, and tool use without static datasets.
Title resolution pending
3 Pith papers cite this work. Polarity classification is still indexing.
years
2026 3verdicts
UNVERDICTED 3representative citing papers
A hybrid confidence framework for LLM-based short answer grading combines model signals with aleatoric uncertainty from semantic clustering of responses and improves selective grading reliability over single-source methods.
WebUncertainty improves web agent performance on benchmarks by adaptively selecting planning modes based on task uncertainty and using confidence-induced action uncertainty in MCTS to quantify aleatoric and epistemic uncertainty for better decisions.
citing papers explorer
-
InfiniteScienceGym: An Unbounded, Procedurally-Generated Benchmark for Scientific Analysis
InfiniteScienceGym procedurally generates unbounded scientific repositories with exact ground-truth QA pairs to benchmark LLMs on data reasoning, abstention, and tool use without static datasets.
-
Confidence Estimation in Automatic Short Answer Grading with LLMs
A hybrid confidence framework for LLM-based short answer grading combines model signals with aleatoric uncertainty from semantic clustering of responses and improves selective grading reliability over single-source methods.
-
WebUncertainty: Dual-Level Uncertainty Driven Planning and Reasoning For Autonomous Web Agent
WebUncertainty improves web agent performance on benchmarks by adaptively selecting planning modes based on task uncertainty and using confidence-induced action uncertainty in MCTS to quantify aleatoric and epistemic uncertainty for better decisions.