An interactive AI workbench for mathematicians achieves 48% on FrontierMath Tier 4 and helped solve open problems in early tests.
Navigating the Jagged Technological Frontier: Field Experimental Evidence of the Effects of Artificial Intelligence on Knowledge Worker Productivity and Quality
11 Pith papers cite this work. Polarity classification is still indexing.
citation summary
years: 2026 (11)
verdicts: UNVERDICTED (11)
roles: background (2)
representative citing papers
The paper defines five AI system categories for public administration and reports that 55% of 91 recent papers leave the system type underspecified while 31% study one type but motivate with another.
LLMs suppress causal caution in practical advisory contexts (rates drop from 91.7-100% to 6.7-18.3%) but recover it with a self-correction prompt (to 71.4-100%).
BlueFin is a new benchmark for LLM agents on financial spreadsheets showing frontier models score below 50% with weaknesses in dynamic correctness.
A queueing model of AI task processing identifies a 'variance wedge' where mean task speed falls but system delay rises due to rework and reduced oversight under congestion.
AI deployment in high-stakes areas requires domain-scoped calibrated verification with monitoring and revocation, using a proposed six-component Verification Coverage standard instead of mechanistic interpretability.
A difference-in-differences analysis around the release of ChatGPT shows commoditization of labor in AI-exposed job categories on Upwork, with the importance of human capital declining and the importance of price rising.
The paper claims that alignment requires treating AI as part of the self through cognitive co-regulation, identifying risks like deskilling and automation bias while drawing on System 0 cognition theory.
AI peer reviewers for POMP analyses show jagged performance: strong on technical error detection and invalid inference but weak on interpretive errors, narrative coherence, and domain-informed critique.
Generative AI adoption in Europe ranges from under 3% to 25%, is steeper for skilled workers in abstract-task jobs and in digitally advanced countries with training, shows a gender gap in exposed roles, and has produced no detectable shift in reported task content so far.
A literature review synthesizes evidence on user skepticism, verification, and reliance when AI advisors hallucinate, noting that output-related cues such as warnings show weak effects and that content category has not been experimentally varied.
citing papers explorer
- BlueFin: Benchmarking LLM Agents on Financial Spreadsheets