A new benchmark with cognitive traps shows frontier deep research agents achieve only 13-16% acceptance on expert consulting tasks under combined verifier and rubric criteria.
APACrefauthors \ 1947
17 Pith papers cite this work, alongside 3,350 external citations. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 17representative citing papers
Preregistered placebo-controlled decomposition shows external executable counterexamples drive self-repair gains in small code models more than re-exposure or self-critique.
Chart information is encoded but not routed to predictions in VLMs for claim verification, unlike tables, revealed by layer-wise probing and attention analysis on three models.
LLM reasoning traces can be compiled into reusable symbolic solvers that achieve high accuracy on program synthesis benchmarks at zero inference cost and transfer to other domains.
AI Overviews and Gemini retrieve substantially different sources than traditional Google search (Jaccard similarity <0.2), favor Google-owned content, appear for 51.5% of queries especially controversial ones, and are less consistent across repeated or slightly edited queries.
Autonomous programming agents frequently fail to follow instructed plans, falling back on incomplete internalized workflows, while standard plans and periodic reminders improve performance but poor plans can degrade it more than no plan.
A custom LLM agent achieves 94% manually verified success on a new benchmark of 35 software analysis setups, outperforming baselines at 77%, but struggles with stage mixing, error localization, and overestimating its own success.
A large-scale study of real-world repositories finds that AI-generated code differs from human-written code in complexity, structural traits, defect indicators, and commit-level activity patterns.
Forced CoT produces video-dependent reasoning chains but does not improve MCQ accuracy on Qwen2.5-VL with Video-MME and causes a small drop on the 7B variant.
LLMs fall for deceptive traps at higher rates than humans, lack the human attention-diversion effect, and exploit traps 73.4% of the time even after recognizing them in reasoning.
A spectral vision transformer achieves equitable or superior performance with fewer parameters than standard ViTs, CNNs, and other models by using spectral projections for tokenization in limited-data medical imaging.
SOCpilot supplies a fixed verifier and public artifact that removes 466 non-compliant approval-gated actions from LLM plans on 200 real incidents while preserving task recall.
PLUME jointly models parameter beliefs and conditioned dynamics in a latent space for dexterous manipulation, enabling zero-shot sim-to-real transfer that outperforms offline RL and behavior cloning baselines on turning, lifting, and flicking tasks.
LLMs reach moderate accuracy on a new psychiatric interview benchmark but systematically discount explicit symptoms when preserved functioning or protective factors are present.
TREX detects rectal cancer local regrowth from longitudinal endoscopy image pairs with 97% sensitivity and enables early prediction 3-12 months before clinical confirmation, outperforming baselines.
Crowdsourced judgments reliably flag authentic videos but frequently miss manipulations and struggle to identify whether changes are audio-only, video-only, or both.
Locally deployed LLMs achieve 43-45% accuracy on Python bug detection but frequently produce only partial identifications of problematic code regions.
citing papers explorer
-
Evaluating Deep Research Agents on Expert Consulting Work: A Benchmark with Verifiers, Rubrics, and Cognitive Traps
A new benchmark with cognitive traps shows frontier deep research agents achieve only 13-16% acceptance on expert consulting tasks under combined verifier and rubric criteria.
-
Falsification, Not Exposure: An Internally Preregistered Placebo-Controlled Decomposition of Self-Repair Feedback in Frozen Small Code Models
Preregistered placebo-controlled decomposition shows external executable counterexamples drive self-repair gains in small code models more than re-exposure or self-critique.
-
Encoded but Not Routed: Explaining the Table-Chart Gap in Scientific Claim Verification
Chart information is encoded but not routed to predictions in VLMs for claim verification, unlike tables, revealed by layer-wise probing and attention analysis on three models.
-
ReaComp: Compiling LLM Reasoning into Symbolic Solvers for Efficient Program Synthesis
LLM reasoning traces can be compiled into reusable symbolic solvers that achieve high accuracy on program synthesis benchmarks at zero inference cost and transfer to other domains.
-
How Generative AI Disrupts Search: An Empirical Study of Google Search, Gemini, and AI Overviews
AI Overviews and Gemini retrieve substantially different sources than traditional Google search (Jaccard similarity <0.2), favor Google-owned content, appear for 51.5% of queries especially controversial ones, and are less consistent across repeated or slightly edited queries.
-
Evaluating Plan Compliance in Autonomous Programming Agents
Autonomous programming agents frequently fail to follow instructed plans, falling back on incomplete internalized workflows, while standard plans and periodic reminders improve performance but poor plans can degrade it more than no plan.
-
Evaluating LLM Agents on Automated Software Analysis Tasks
A custom LLM agent achieves 94% manually verified success on a new benchmark of 35 software analysis setups, outperforming baselines at 77%, but struggles with stage mixing, error localization, and overestimating its own success.
-
A Large-Scale Empirical Study of AI-Generated Code in Real-World Repositories
A large-scale study of real-world repositories finds that AI-generated code differs from human-written code in complexity, structural traits, defect indicators, and commit-level activity patterns.
-
Chains That See, Answers That Don't: A Multi-Aspect Evaluation Recipe for Forced Chain-of-Thought on Video-MME
Forced CoT produces video-dependent reasoning chains but does not improve MCQ accuracy on Qwen2.5-VL with Video-MME and causes a small drop on the 7B variant.
-
Honeyquest for LLMs: Rethinking Cyber Deception for AI Attackers
LLMs fall for deceptive traps at higher rates than humans, lack the human attention-diversion effect, and exploit traps 73.4% of the time even after recognizing them in reasoning.
-
Spectral Vision Transformer for Efficient Tokenization with Limited Data
A spectral vision transformer achieves equitable or superior performance with fewer parameters than standard ViTs, CNNs, and other models by using spectral projections for tokenization in limited-data medical imaging.
-
SOCpilot: Verifying Policy Compliance for LLM-Assisted Incident Response
SOCpilot supplies a fixed verifier and public artifact that removes 466 non-compliant approval-gated actions from LLM plans on 200 real incidents while preserving task recall.
-
PLUME: Probabilistic Latent Unified World Modeling and Parameter Estimation for Multi-Finger Manipulation
PLUME jointly models parameter beliefs and conditioned dynamics in a latent space for dexterous manipulation, enabling zero-shot sim-to-real transfer that outperforms offline RL and behavior cloning baselines on turning, lifting, and flicking tasks.
-
When Symptoms Are Not Enough: Evidence-Weighting Patterns in Large Language Model Psychiatric Screening
LLMs reach moderate accuracy on a new psychiatric interview benchmark but systematically discount explicit symptoms when preserved functioning or protective factors are present.
-
Prediction of Rectal Cancer Regrowth from Longitudinal Endoscopy
TREX detects rectal cancer local regrowth from longitudinal endoscopy image pairs with 97% sensitivity and enables early prediction 3-12 months before clinical confirmation, outperforming baselines.
-
Beyond Seeing Is Believing: On Crowdsourced Detection of Audiovisual Deepfakes
Crowdsourced judgments reliably flag authentic videos but frequently miss manipulations and struggle to identify whether changes are audio-only, video-only, or both.
-
An Empirical Evaluation of Locally Deployed LLMs for Bug Detection in Python Code
Locally deployed LLMs achieve 43-45% accuracy on Python bug detection but frequently produce only partial identifications of problematic code regions.