The American Journal of Psychology 15, 72–101
16 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years: 2026 (16)
roles: method (1)
polarities: use method (1)
representative citing papers
STEB is a new benchmark dataset and LLM-based evaluation framework for measuring expressiveness preservation in speech-to-speech translation systems.
citing papers explorer
- Evaluating Deep Research Agents on Expert Consulting Work: A Benchmark with Verifiers, Rubrics, and Cognitive Traps
  A new benchmark with cognitive traps shows frontier deep research agents achieve only 13-16% acceptance on expert consulting tasks under combined verifier and rubric criteria.
- QVal: Cheaply Evaluating Dense Supervision Signals for Long-Horizon LLM Agents
  QVal is a new evaluation framework that directly measures dense supervision quality via Q-alignment to a reference policy, showing simple prompting baselines outperform 21 other methods across environments and models.
- REStack: A Large-Scale Dataset of Reverse Engineering Discussions from Stack Exchange
  REStack is a new public dataset of 12k+ RE discussions from Stack Exchange sites, enriched with 23 LDA-derived topics grouped into six categories and community-derived difficulty metadata.
- ProactBench: Beyond What The User Asked For
  ProactBench measures LLM conversational proactivity in three phases using 198 multi-agent dialogues and finds recovery behavior hard to predict from existing benchmarks.
- Automated grading of Linux/bash examinations using large language models: a four-level cognitive taxonomy approach
  Gemini 3.0 Pro with rubric prompts reached ICC 0.888 agreement with human graders on low-complexity Linux/bash responses but lower agreement at higher taxonomy levels across 1200 student answers from three expert raters.
- Are Performance-Optimization Benchmarks Reliably Measuring Coding Agents?
  Audit of GSO, SWE-Perf and SWE-fficiency reveals that reference patches satisfy validity rules across machines for only 39/102, 11/140 and 411/498 tasks respectively, public submissions beat references on 85.3% of replay-valid tasks, and scoring rules cause ranking disagreements.
- Behavioral and Representational Evidence of Binomial Ordering Preferences in Large Language Models
  LLMs recover dominant binomial orders from corpora but align less closely with exact preference distributions, with preference strength partially encoded in middle-to-late layers and manipulable via steering.
- The Dynamics of Human and AI-Generated Language: How Semantics Fluctuates across Different Timescales
  Develops ACW-based semantic timescale features showing longer autocorrelation windows associate with generic vocabulary and shorter ones with specific words in both human and LLM speech, with the pattern abolished by randomizing word order and timing.
- Semantic Feature Segmentation for Interpretable Predictive Maintenance in Complex Systems
  Semantic segmentation decomposes monitoring features into canonical and residual components that concentrate fault-predictive information while preserving operational meaning in predictive maintenance.
- A Correlation Aware Quantum Feature Map for Variational Quantum Classification
  CAQFM adds controlled quantum gates based on Pearson, Spearman, Kendall Tau, Mutual Information, and Distance Correlation measures to create richer feature maps, yielding higher accuracy than standard maps in VQC simulations on three benchmark datasets.
- Grounding Text Embeddings in Stakeholder Associations
  The Stakeholder Grounding Exercise shows neural text embeddings are 19-26pp less reliable than human experts at capturing semantic distinctions, with misalignment strongly correlated to poorer clustering performance (ρ=0.9), replicated across Danish policy and US AI domains.
- Explainable Iterative Data Visualisation Refinement via an LLM Agent
  An LLM agent automates iterative refinement of data embedding visualizations by generating semantic evaluation reports and recommending configuration changes.
- Evaluating Chunking Strategies for Retrieval-Augmented Generation on Academic Texts
  Cluster-based semantic chunking does not outperform fixed-size or recursive chunking for RAG on academic theses, and RAGAs faithfulness shows limited reliability in this setup.
- Engagement Intensity as a Learner-Modeling Signal for Adaptive AI Ethics Instruction
  Self-reported LLM usage frequency associates more consistently with pre-instruction AI perceptions than prior education or self-rated familiarity in graduate trainees.
- Geolocating News about Extreme Climate Events: A Comparative Analysis of Off-the-Shelf Tools for Toponym Identification in German
  Off-the-shelf German NER tools produce divergent toponym sets that lead to distinct country assignments for climate event news, affecting assessments of national prominence in media coverage.