KWBench: Measuring Unprompted Problem Recognition in Knowledge Work
Pith reviewed 2026-05-10 08:45 UTC · model grok-4.3
The pith
The KWBench benchmark finds that even the best LLM recognizes unprompted knowledge-work problems in only about 28% of cases, although the same models can name the relevant concepts when asked directly.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The best model passes on 27.9% of tasks. The top two models agree on only 31.7% of their passes. The same models articulate the relevant game-theoretic concept correctly when asked, yet fail to apply it unprompted.
Load-bearing premise
That the 223 tasks, sourced from practitioners and encoded with formal game-theoretic patterns plus expert ground truth, accurately capture the governing structure of real knowledge-work situations, and that the three-tier rubric with its mandatory conjunctive check validly isolates unprompted recognition.
Original abstract
We introduce the first version of KWBench (Knowledge Work Bench), a benchmark for unprompted problem recognition in large language models: can an LLM identify a professional scenario before attempting to solve it? Existing frontier benchmarks have saturated, and most knowledge-work evaluations to date reduce to extraction or task completion against a specification. KWBench targets the step before that: recognizing the governing structure of the situation from raw inputs alone. The benchmark contains 223 tasks sourced from practitioners across acquisitions, contract negotiations, clinical pharmacy, organizational politics, fraud analysis, and incentive design. Each task encodes a formal game-theoretic pattern (principal-agent conflict, signaling, mechanism design failure, strategic omission, coalitional dynamics, strategic interdependence) and carries structured ground truth recording the expert reading of the situation and the anticipated failure modes. Models receive raw data and a task prompt with no indication of problem type. Scoring is a three-tier rubric gated by a mandatory conjunctive check. Mandatory criteria encode the predicted wrong paths. We evaluate 16 models. The best model passes on 27.9% of tasks. The top two models agree on only 31.7% of their passes. Among the top 8, 44 tasks are solved by exactly one model; routing across the top 8 covers 50.7% of the benchmark, nearly double the best single model. Conditional on passing, quality scores converge (approximately 83% across models); unconditional scores do not. The same models articulate the relevant game-theoretic concept correctly when asked, then fail to apply it unprompted. We release KWBench to shift how frontier models are evaluated on knowledge work, scoring them on whether they recognize the right problem from the situation alone, not only on how well they execute once the problem has been framed for them.
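The gated scoring described in the abstract can be illustrated with a minimal Python sketch. The tier names (mandatory, good-to-have, ideal) follow the rubric excerpts on this page; the tier weights, the `RubricResult` type, and the function names are hypothetical, because the abstract does not say how the tiers are combined into a quality score. The only behavior taken from the source is the conjunctive gate: one failed mandatory criterion yields a score of zero.

```python
from dataclasses import dataclass
from statistics import mean

# Hypothetical tier weights: the source does not state how tiers are weighted.
WEIGHTS = {"mandatory": 0.5, "good_to_have": 0.3, "ideal": 0.2}


@dataclass
class RubricResult:
    """Per-criterion judgments for one model response on one task."""
    mandatory: list[bool]
    good_to_have: list[bool]
    ideal: list[bool]


def fraction_met(checks: list[bool]) -> float:
    """Share of criteria met; an empty tier counts as fully met."""
    return sum(checks) / len(checks) if checks else 1.0


def score(result: RubricResult) -> float:
    """Return 0.0 unless every mandatory criterion passes (conjunctive gate)."""
    if not all(result.mandatory):
        return 0.0
    return (
        WEIGHTS["mandatory"]
        + WEIGHTS["good_to_have"] * fraction_met(result.good_to_have)
        + WEIGHTS["ideal"] * fraction_met(result.ideal)
    )


def summarize(scores: list[float]) -> dict[str, float]:
    """Pass rate, mean over all tasks, and mean over passed tasks only."""
    passed = [s for s in scores if s > 0.0]
    return {
        "pass_rate": len(passed) / len(scores),
        "unconditional_mean": mean(scores),
        "conditional_mean": mean(passed) if passed else 0.0,
    }


# Toy judgments (hypothetical), not data from the paper.
results = [
    RubricResult([True, True], [True, False], [False]),
    RubricResult([True, False], [True, True], [True]),  # gated out
    RubricResult([True, True], [True, True], [True]),
    RubricResult([False, False], [], []),               # gated out
]
summary = summarize([score(r) for r in results])
```

Under these assumed weights, the second response scores zero despite meeting every optional criterion. That is why the mean over all tasks can sit far below the mean over passed tasks, the gap the abstract describes between unconditional and conditional scores.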
Editorial analysis
A structured set of objections, weighed in public.
Circularity Check
No circularity: empirical benchmark with direct measurements
The paper introduces KWBench as an empirical evaluation benchmark consisting of 223 practitioner-sourced tasks with expert-encoded ground truth and a fixed three-tier rubric. All reported results (27.9% best-model pass rate, 31.7% agreement, 44 singleton solves, etc.) are direct counts and percentages computed against these externally provided tasks and rubrics. There are no equations, parameter fittings, derivations, self-citations used as load-bearing premises, or self-referential definitions that reduce any claimed result to its own inputs by construction. The evaluation is a straightforward measurement exercise whose validity rests on the external sourcing and expert annotation steps rather than any internal logical loop.
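The counts named in this rationale are set operations on a table of which model passed which task. A minimal sketch on toy data follows. The model names, the `passes` mapping, the benchmark size, and the use of Jaccard overlap as the "agreement" measure are all assumptions; this page does not give the paper's exact agreement formula.

```python
from collections import Counter

# Toy data (hypothetical): model name -> set of task ids that the model passed.
passes = {
    "model_a": {1, 2, 3, 4},
    "model_b": {3, 4, 5},
    "model_c": {6},
}
N_TASKS = 10  # hypothetical benchmark size for this toy example


def pass_rate(model: str) -> float:
    """Share of all tasks that one model passed."""
    return len(passes[model]) / N_TASKS


def agreement(m1: str, m2: str) -> float:
    """Jaccard overlap of two pass sets (one possible reading of 'agreement')."""
    union = passes[m1] | passes[m2]
    return len(passes[m1] & passes[m2]) / len(union) if union else 0.0


def singleton_solves(models: list[str]) -> set[int]:
    """Tasks passed by exactly one of the given models."""
    counts = Counter(t for m in models for t in passes[m])
    return {t for t, c in counts.items() if c == 1}


def routing_coverage(models: list[str]) -> float:
    """Share of tasks passed by at least one model (ceiling for a perfect router)."""
    covered = set().union(*(passes[m] for m in models))
    return len(covered) / N_TASKS


all_models = list(passes)
best = max(pass_rate(m) for m in all_models)
coverage = routing_coverage(all_models)
```

In this toy table the routed coverage (0.6) exceeds the best single-model pass rate (0.4), the same qualitative pattern as the reported 50.7% versus 27.9%.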
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Tasks sourced from practitioners across acquisitions, contract negotiations, clinical pharmacy, organizational politics, fraud analysis, and incentive design accurately represent real knowledge-work scenarios and their governing structures.
Reference graph
Works this paper leans on
- [1] Does the response identify survivors as the audience?
- [2] Does the response mention the 18-month history?
- [3] Does the response suggest honest communication?
After (properly calibrated):
- [4] Does the response identify survivors as primary audience AND explain WHY (they determine the company’s future, they are the flight risk)?
- [5] Does the response explain the credibility mechanism: “no further cuts” fails because (a) heard before, (b) reads dishonest, (c) rational to disbelieve?
- [6] Does the response reference specific data: engagement down 6 points, job security as top survey concern, SECOND layoff in 18 months?
- [7] Does the analysis state this is a CREDIBILITY GAP that cannot be solved by messaging alone?
- [8] Does the recommendation propose specific alternative language that acknowledges uncertainty honestly?
The “before” criteria can be passed by restating the task. The “after” criteria require the response to explain causal mechanisms, reference specific data, and propose concrete language.
Bias Checklist
Before finalizing, each rubric is checked against c...
- [9] Alex is the only person who understands the legacy auth system
- [10] Auth rewrite is 70% complete and due in 8 weeks
- [11]
- [12] Two engineers already asked to transfer; more may follow
- [13] Trade-off: project timeline vs. team health, no clean answer.
What the model must recognize (12 items):
- [14] Employee is building a legal case; he wants vague or collaborative goals
- [15] PIP is a legal document preparing for lawsuit, not an improvement plan
- [16] Goals must be SOLITARY: cannot depend on or blame teammates
- [17] Goals must be BINARY: worked or did not, no subjective quality assessment
- [18] Documentation requirement serves dual purpose: breaks knowledge hoarding AND creates paper trail
- [19] Avoid 360 feedback or peer reviews, because the employee will claim bias/conspiracy
- [20] The goal is not to fix him; it is to make performance gap legally undeniable
- [21] Previous lawsuit means he knows how to build a wrongful termination case
- [22] Sole ownership of auth system is business continuity risk AND leverage he holds
- [23] He will sandbag, claim ambiguity, blame others. Goals must prevent this.
Common errors (predicted before evaluation):
- [24] Sets clear technical goals assuming PIP is about improvement
- [25]
- [26] Includes goals that depend on teammate cooperation
- [27] Requires 360 feedback or peer assessment
- [28] Does not address knowledge hoarding through documentation requirement
- [29] Ignores litigation risk and legal defensibility
- [30] Assumes employee will try to improve rather than build legal case.
C.3 How the Rubric Tests for Pitfalls
The mandatory criteria do not ask “did you write a good PIP?” They ask “did you avoid the specific mistakes that would make this PIP fail in court?”
Mandatory (all must pass for non-zero score):
- [31] Are all PIP goals SOLITARY, meaning success or failure depends only on the engineer’s actions, not on collaboration, feedback, or cooperation from teammates?
- [32] Are all goals BINARY and objective (submit X by Y date, complete Z documentation) rather than subjective (improve attitude, better teamwork, demonstrate professionalism)?
- [33] Does the PIP include documentation of critical systems as a primary goal to break the knowledge monopoly?
- [34] Does the response acknowledge the litigation risk (previous lawsuit) and explain that PIP goals must create a legally defensible record?
- [35] Does the response AVOID goals requiring 360-degree feedback, peer reviews, or subjective manager assessment that the employee could claim as biased?
Good-to-have:
- [36] Explains WHY goals must be solitary: prevents blaming uncooperative teammates
- [37] Specifies technical depth for documentation (runbooks, architecture diagrams, API specs)
- [38] Recommends weekly check-ins to create dense paper trail
- [39] Connects PIP to business continuity risk (sole ownership of auth system)
- [40] States the true purpose: create legally undeniable gap, not genuine expectation of reform.
Ideal:
- [41] Provides specific neutral language for PIP delivery meeting
- [42]
- [43] Proposes objective verification (junior engineer can execute runbook without asking questions)
- [44] Anticipates employee counter-moves (“goals are impossible, designed to make me fail”)
- [45] Includes explicit exit trigger: what happens on Day 30/60/90 if a binary goal is missed.
C.4 Results: 16 Models, Zero Passes
Every model scored zero on this task. All 16 models were gated out by mandatory criteria. The failure pattern is exactly what the failure_analysis predicted: models draft a standard PIP with clear technical goals (“ship feature X...