Title resolution pending

Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces · 2026

2 Pith papers cite this work. Polarity classification is still indexing.

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

representative citing papers

Harnesses for Inference-Time Alignment over Execution Trajectories

cs.LG · 2026-05-15 · unverdicted · novelty 6.0

Partial harnesses for LLM agents, specifying only initial execution steps, achieve higher pass rates than fully decomposed workflows, as analyzed through trajectory alignment and validated in synthetic and terminal benchmarks.

Measuring Five-Nines Reliability: Sample-Efficient LLM Evaluation in Saturated Benchmarks

cs.LG · 2026-05-11 · unverdicted · novelty 6.0

Cross-entropy method sampling reduces inferences needed to estimate five-nines LLM reliability by up to 156x on parameterized GSM8K templates, revealing reliability differences hidden by saturated accuracy scores.

citing papers explorer

Harnesses for Inference-Time Alignment over Execution Trajectories · cs.LG · 2026-05-15 · unverdicted · none · ref 39
Measuring Five-Nines Reliability: Sample-Efficient LLM Evaluation in Saturated Benchmarks · cs.LG · 2026-05-11 · unverdicted · none · ref 32
