A new benchmark with cognitive traps shows frontier deep research agents achieve only 13-16% acceptance on expert consulting tasks under combined verifier and rubric criteria.
17 Pith papers cite this work, alongside 3,350 external citations. Polarity classification is still indexing.
years: 2026 (17)
representative citing papers
Preregistered placebo-controlled decomposition shows external executable counterexamples drive self-repair gains in small code models more than re-exposure or self-critique.
Chart information is encoded but not routed to predictions in VLMs for claim verification, unlike tables, revealed by layer-wise probing and attention analysis on three models.
LLM reasoning traces can be compiled into reusable symbolic solvers that achieve high accuracy on program synthesis benchmarks at zero inference cost and transfer to other domains.
AI Overviews and Gemini retrieve substantially different sources than traditional Google search (Jaccard similarity <0.2), favor Google-owned content, appear for 51.5% of queries especially controversial ones, and are less consistent across repeated or slightly edited queries.
Autonomous programming agents frequently fail to follow instructed plans, falling back on incomplete internalized workflows, while standard plans and periodic reminders improve performance but poor plans can degrade it more than no plan.
A custom LLM agent achieves 94% manually verified success on a new benchmark of 35 software analysis setups, outperforming baselines at 77%, but struggles with stage mixing, error localization, and overestimating its own success.
A large-scale study of real-world repositories finds that AI-generated code differs from human-written code in complexity, structural traits, defect indicators, and commit-level activity patterns.
Forced CoT produces video-dependent reasoning chains but does not improve MCQ accuracy on Qwen2.5-VL with Video-MME and causes a small drop on the 7B variant.
LLMs fall for deceptive traps at higher rates than humans, lack the human attention-diversion effect, and exploit traps 73.4% of the time even after recognizing them in reasoning.
A spectral vision transformer achieves equitable or superior performance with fewer parameters than standard ViTs, CNNs, and other models by using spectral projections for tokenization in limited-data medical imaging.
SOCpilot supplies a fixed verifier and public artifact that removes 466 non-compliant approval-gated actions from LLM plans on 200 real incidents while preserving task recall.
PLUME jointly models parameter beliefs and conditioned dynamics in a latent space for dexterous manipulation, enabling zero-shot sim-to-real transfer that outperforms offline RL and behavior cloning baselines on turning, lifting, and flicking tasks.
LLMs reach moderate accuracy on a new psychiatric interview benchmark but systematically discount explicit symptoms when preserved functioning or protective factors are present.
TREX detects rectal cancer local regrowth from longitudinal endoscopy image pairs with 97% sensitivity and enables early prediction 3-12 months before clinical confirmation, outperforming baselines.
Crowdsourced judgments reliably flag authentic videos but frequently miss manipulations and struggle to identify whether changes are audio-only, video-only, or both.
Locally deployed LLMs achieve 43-45% accuracy on Python bug detection but frequently produce only partial identifications of problematic code regions.
citing papers explorer
- Chains That See, Answers That Don't: A Multi-Aspect Evaluation Recipe for Forced Chain-of-Thought on Video-MME