A new benchmark with cognitive traps shows frontier deep research agents achieve only 13-16% acceptance on expert consulting tasks under combined verifier and rubric criteria.
The American Journal of Psychology 15, 72–101
16 Pith papers cite this work. Polarity classification is still indexing.
Years: 2026 (16)
Roles: method (1)
Polarities: use method (1)
Representative citing papers
QVal is a new evaluation framework that directly measures dense supervision quality via Q-alignment to a reference policy, showing simple prompting baselines outperform 21 other methods across environments and models.
STEB is a new benchmark dataset and LLM-based evaluation framework for measuring expressiveness preservation in speech-to-speech translation systems.
REStack is a new public dataset of 12k+ RE discussions from Stack Exchange sites, enriched with 23 LDA-derived topics grouped into six categories and community-derived difficulty metadata.
ProactBench measures LLM conversational proactivity in three phases using 198 multi-agent dialogues and finds recovery behavior hard to predict from existing benchmarks.
Gemini 3.0 Pro with rubric prompts reached ICC 0.888 agreement with human graders on low-complexity Linux/bash responses, but lower agreement at higher taxonomy levels, across 1200 student answers graded by three expert raters.
Audit of GSO, SWE-Perf and SWE-fficiency reveals that reference patches satisfy validity rules across machines for only 39/102, 11/140 and 411/498 tasks respectively, public submissions beat references on 85.3% of replay-valid tasks, and scoring rules cause ranking disagreements.
LLMs recover dominant binomial orders from corpora but align less closely with exact preference distributions, with preference strength partially encoded in middle-to-late layers and manipulable via steering.
Develops ACW-based semantic timescale features showing longer autocorrelation windows associate with generic vocabulary and shorter ones with specific words in both human and LLM speech, with the pattern abolished by randomizing word order and timing.
Semantic segmentation decomposes monitoring features into canonical and residual components that concentrate fault-predictive information while preserving operational meaning in predictive maintenance.
CAQFM adds controlled quantum gates based on Pearson, Spearman, Kendall Tau, Mutual Information, and Distance Correlation measures to create richer feature maps, yielding higher accuracy than standard maps in VQC simulations on three benchmark datasets.
The Stakeholder Grounding Exercise shows neural text embeddings are 19-26pp less reliable than human experts at capturing semantic distinctions, with misalignment strongly correlated to poorer clustering performance (ρ=0.9), replicated across Danish policy and US AI domains.
An LLM agent automates iterative refinement of data embedding visualizations by generating semantic evaluation reports and recommending configuration changes.
Cluster-based semantic chunking does not outperform fixed-size or recursive chunking for RAG on academic theses, and RAGAs faithfulness shows limited reliability in this setup.
Self-reported LLM usage frequency associates more consistently with pre-instruction AI perceptions than prior education or self-rated familiarity in graduate trainees.
Off-the-shelf German NER tools produce divergent toponym sets that lead to distinct country assignments for climate event news, affecting assessments of national prominence in media coverage.
citing papers explorer
STEB: A Speech-to-Speech Translation Expressiveness Benchmark for Evaluating Beyond Translation Fidelity