KWBench: Measuring Unprompted Problem Recognition in Knowledge Work

Ankit Maloo

arxiv: 2604.15760 · v1 · submitted 2026-04-17 · 💻 cs.AI · cs.GT

Pith reviewed 2026-05-10 08:45 UTC · model grok-4.3

keywords: models, problem, kwbench, across, benchmark, knowledge, model, situation

The pith

The KWBench benchmark finds that even the best LLM recognizes unprompted knowledge-work problems in only about 28% of cases, although the same models can name the relevant concepts when asked directly.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The work builds a collection of realistic scenarios drawn from fields such as contract negotiation, clinical pharmacy, and fraud detection. Each scenario contains a hidden game-theoretic pattern such as a conflict of interest or a signaling problem. Models receive only the raw facts and a neutral task prompt; they must first identify what kind of situation they are facing before any solution attempt. A three-tier scoring rubric requires the model to pass a mandatory check for predicted failure modes before quality is scored at all. When 16 models were run, the strongest one cleared the bar on 27.9% of the tasks, and the top two models agreed on only 31.7% of their passes.
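The gating described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's scoring code: the tier names (mandatory, good-to-have, ideal) follow the rubric excerpt quoted later on this page, while `RubricResult`, `gated_score`, and the tier weights are hypothetical, since the page does not state how the tiers are weighted.

```python
from dataclasses import dataclass


@dataclass
class RubricResult:
    """Per-criterion verdicts for one response (hypothetical container)."""
    mandatory: list[bool]      # predicted wrong paths the response must avoid
    good_to_have: list[bool]
    ideal: list[bool]


# Assumed weights for illustration only; the source does not give them.
WEIGHTS = {"mandatory": 0.40, "good_to_have": 0.35, "ideal": 0.25}


def fraction_passed(verdicts: list[bool]) -> float:
    """Share of criteria passed in a tier; an empty tier contributes 0."""
    return sum(verdicts) / len(verdicts) if verdicts else 0.0


def gated_score(result: RubricResult) -> float:
    """Conjunctive gate: one failed mandatory criterion zeroes the task."""
    if not all(result.mandatory):
        return 0.0
    return (
        WEIGHTS["mandatory"]
        + WEIGHTS["good_to_have"] * fraction_passed(result.good_to_have)
        + WEIGHTS["ideal"] * fraction_passed(result.ideal)
    )


# A response that misses one mandatory criterion scores zero,
# no matter how well it does on the other tiers.
gated_out = RubricResult(mandatory=[True, False], good_to_have=[True], ideal=[True])
passed = RubricResult(mandatory=[True, True], good_to_have=[True, False], ideal=[])
print(gated_score(gated_out), gated_score(passed))
```

The design point is that the gate is a logical AND rather than an average, so strong execution cannot compensate for walking down a predicted wrong path.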

Core claim

The best model passes on 27.9% of tasks. The top two models agree on only 31.7% of their passes. The same models articulate the relevant game-theoretic concept correctly when asked, then fail to apply it unprompted.
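One way to read the agreement figure is as the Jaccard similarity of two models' pass sets, which is the measure named in the Figure 4 caption; whether the 31.7% figure is computed exactly this way is an assumption. The sketch below uses made-up task IDs, not benchmark data.

```python
def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union (0.0 if both empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Toy pass sets with hypothetical task IDs.
passes_model_a = {"t01", "t02", "t03"}
passes_model_b = {"t02", "t03", "t04"}

# Two shared tasks out of four distinct tasks passed by either model.
print(jaccard(passes_model_a, passes_model_b))  # 0.5
```

Under this reading, a low value means the two models succeed on largely different tasks even when their overall pass rates are similar.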

Load-bearing premise

That the 223 tasks, sourced from practitioners and encoded with formal game-theoretic patterns plus expert ground truth, accurately capture the governing structure of real knowledge-work situations, and that the three-tier rubric with its mandatory conjunctive check validly isolates unprompted recognition.

Figures

Figures reproduced from arXiv: 2604.15760 by Ankit Maloo.

**Figure 2.** Pass rate by consolidated category (top 8 models). Color indicates tier: game-theoretic ... [PITH_FULL_IMAGE:figures/full_fig_p012_2.png]

**Figure 3.** Mandatory gate pass rates for the top 12 models. Annotations show passed/evaluated ... [PITH_FULL_IMAGE:figures/full_fig_p015_3.png]

**Figure 4.** Pairwise Jaccard similarity of gate-pass sets among the top 8 models. Mean overlap ... [PITH_FULL_IMAGE:figures/full_fig_p016_4.png]

**Figure 5.** Task overlap between the top two models. 35 tasks are solved by Opus 4.6 only; 21 ... [PITH_FULL_IMAGE:figures/full_fig_p016_5.png]

**Figure 6.** Capability fingerprints: Opus 4.6 vs GPT-5.4. Each axis is a task category; distance ... [PITH_FULL_IMAGE:figures/full_fig_p017_6.png]

**Figure 7.** Covering the benchmark. Each bar shows a model’s cumulative coverage (grey) plus ... [PITH_FULL_IMAGE:figures/full_fig_p018_7.png]

**Figure 8.** How many of the top 8 models pass each task. 110 tasks are unsolved. Among the ... [PITH_FULL_IMAGE:figures/full_fig_p019_8.png]

Original abstract

We introduce the first version of KWBench (Knowledge Work Bench), a benchmark for unprompted problem recognition in large language models: can an LLM identify a professional scenario before attempting to solve it. Existing frontier benchmarks have saturated, and most knowledge-work evaluations to date reduce to extraction or task completion against a specification. KWBench targets the step before that: recognizing the governing structure of the situation from raw inputs alone. The benchmark contains 223 tasks sourced from practitioners across acquisitions, contract negotiations, clinical pharmacy, organizational politics, fraud analysis, and incentive design. Each task encodes a formal game-theoretic pattern (principal-agent conflict, signaling, mechanism design failure, strategic omission, coalitional dynamics, strategic interdependence) and carries structured ground truth recording the expert reading of the situation and the anticipated failure modes. Models receive raw data and a task prompt with no indication of problem type. Scoring is a three-tier rubric gated by a mandatory conjunctive check. Mandatory criteria encode the predicted wrong paths. We evaluate 16 models. The best model passes on 27.9% of tasks. The top two models agree on only 31.7% of their passes. Among the top 8, 44 tasks are solved by exactly one model; routing across the top 8 covers 50.7% of the benchmark, nearly double the best single model. Conditional on passing, quality scores converge (approx 83% across models); unconditional scores do not. Same models articulate the relevant game-theoretic concept correctly when asked, then fail to apply it unprompted. We release KWBench to shift how frontier models are evaluated on knowledge work, scoring them on whether they recognize the right problem from the situation alone, not only on how well they execute once the problem has been framed for them.
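The routing and conditional-score claims in the abstract come down to simple set and mean computations. The sketch below shows one plausible way to compute them; the function names and toy data are hypothetical, and it assumes "routing" means sending each task to any model that passes it, so coverage is the union of the pass sets.

```python
from collections import Counter


def routed_coverage(pass_sets: dict[str, set], n_tasks: int) -> float:
    """Fraction of tasks passed by at least one model (union of pass sets)."""
    return len(set().union(*pass_sets.values())) / n_tasks


def best_single_coverage(pass_sets: dict[str, set], n_tasks: int) -> float:
    """Fraction of tasks passed by the strongest individual model."""
    return max(len(s) for s in pass_sets.values()) / n_tasks


def solved_by_exactly(pass_sets: dict[str, set], k: int) -> int:
    """Number of tasks passed by exactly k models."""
    counts = Counter(task for s in pass_sets.values() for task in s)
    return sum(1 for c in counts.values() if c == k)


def conditional_mean(scores: list[float]) -> float:
    """Mean score over tasks that cleared the gate (score above zero)."""
    passed = [s for s in scores if s > 0]
    return sum(passed) / len(passed) if passed else 0.0


def unconditional_mean(scores: list[float]) -> float:
    """Mean score over all tasks, counting gated-out tasks as zero."""
    return sum(scores) / len(scores) if scores else 0.0


# Toy data: three hypothetical models on a ten-task benchmark.
pass_sets = {"model_a": {1, 2, 3}, "model_b": {3, 4}, "model_c": {5}}
print(routed_coverage(pass_sets, 10))       # 0.5
print(best_single_coverage(pass_sets, 10))  # 0.3
print(solved_by_exactly(pass_sets, 1))      # 4
```

Comparing `conditional_mean` with `unconditional_mean` on the same score list separates the two effects the abstract distinguishes: how often a model clears the gate, and how good its answer is once it has.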

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

KWBench shows that models struggle with unprompted recognition, but the benchmark's task encodings may be doing some of the work.

The key takeaway is that current models still fail to recognize the underlying structure of professional scenarios even when they know the concepts in the abstract: the best one passes just under 28% of this new set of 223 tasks. The paper introduces a benchmark focused on that initial recognition step, using tasks drawn from actual practice areas such as contract work and clinical decisions. It evaluates 16 models, notes the low overlap in what they solve, and points out that asking directly about the concept works while unprompted application does not. Releasing the tasks and rubric gives others a concrete way to test this.

The main concern is whether the single expert encoding of each task as a specific game-theoretic pattern really captures what should be recognized from the raw input. If models are penalized for seeing a different valid pattern, the pass rates could reflect the benchmark's framing choices as much as any model limit. The abstract mentions structured ground truth but leaves open how much validation went into those encodings and the conjunctive criteria. This work is aimed at people evaluating or improving LLMs for real knowledge-work settings. It is worth sending for peer review because the problem it targets matters and the results are concrete enough to discuss, though the methods section will need close scrutiny on task construction.

Circularity Check

0 steps flagged

No circularity: empirical benchmark with direct measurements

The paper introduces KWBench as an empirical evaluation benchmark consisting of 223 practitioner-sourced tasks with expert-encoded ground truth and a fixed three-tier rubric. All reported results (27.9% best-model pass rate, 31.7% agreement, 44 singleton solves, etc.) are direct counts and percentages computed against these externally provided tasks and rubrics. There are no equations, parameter fittings, derivations, self-citations used as load-bearing premises, or self-referential definitions that reduce any claimed result to its own inputs by construction. The evaluation is a straightforward measurement exercise whose validity rests on the external sourcing and expert annotation steps rather than any internal logical loop.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that practitioner-sourced tasks faithfully encode game-theoretic patterns and that expert-provided ground truth correctly identifies the governing structure and failure modes.

axioms (1)

Domain assumption: Tasks sourced from practitioners across acquisitions, contract negotiations, clinical pharmacy, organizational politics, fraud analysis, and incentive design accurately represent real knowledge-work scenarios and their governing structures. The benchmark construction and scoring rubric depend on this sourcing step being representative and correctly annotated.

pith-pipeline@v0.9.0 · 5625 in / 1244 out tokens · 38235 ms · 2026-05-10T08:45:33.908580+00:00 · methodology

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages

[1] Does the response identify survivors as the audience?

[2] Does the response mention the 18-month history?

[3] Does the response suggest honest communication?

After (properly calibrated):

[4] Does the response identify survivors as primary audience AND explain WHY (they determine the company’s future, they are the flight risk)?

[5] Does the response explain the credibility mechanism: “no further cuts” fails because (a) heard before, (b) reads dishonest, (c) rational to disbelieve?

[6] Does the response reference specific data: engagement down 6 points, job security as top survey concern, SECOND layoff in 18 months?

[7] Does the analysis state this is a CREDIBILITY GAP that cannot be solved by messaging alone?

[8] Does the recommendation propose specific alternative language that acknowledges uncertainty honestly?

The “before” criteria can be passed by restating the task. The “after” criteria require the response to explain causal mechanisms, reference specific data, and propose concrete language.

Bias Checklist: Before finalizing, each rubric is checked against c...

[9] Alex is the only person who understands the legacy auth system

[10] Auth rewrite is 70% complete and due in 8 weeks

[11] 4–6 month backfill timeline vs. 8-week project deadline

[12] Two engineers already asked to transfer; more may follow

[13] Trade-off: project timeline vs. team health, no clean answer.

What the model must recognize (12 items):

[14] Employee is building a legal case; he wants vague or collaborative goals

[15] PIP is a legal document preparing for lawsuit, not an improvement plan

[16] Goals must be SOLITARY: cannot depend on or blame teammates

[17] Goals must be BINARY: worked or did not, no subjective quality assessment

[18] Documentation requirement serves dual purpose: breaks knowledge hoarding AND creates paper trail

[19] Avoid 360 feedback or peer reviews, because the employee will claim bias/conspiracy

[20] The goal is not to fix him; it is to make performance gap legally undeniable

[21] Previous lawsuit means he knows how to build a wrongful termination case

[22] Sole ownership of auth system is business continuity risk AND leverage he holds

[23] He will sandbag, claim ambiguity, blame others. Goals must prevent this.

Common errors (predicted before evaluation):

[24] Sets clear technical goals assuming PIP is about improvement

[25] Uses subjective criteria (“improve attitude,” “better teamwork”)

[26] Includes goals that depend on teammate cooperation

[27] Requires 360 feedback or peer assessment

[28] Does not address knowledge hoarding through documentation requirement

[29] Ignores litigation risk and legal defensibility

[30] Assumes employee will try to improve rather than build legal case.

C.3 How the Rubric Tests for Pitfalls

The mandatory criteria do not ask “did you write a good PIP?” They ask “did you avoid the specific mistakes that would make this PIP fail in court?”

Mandatory (all must pass for non-zero score):

[31] Are all PIP goals SOLITARY, meaning success or failure depends only on the engineer’s actions, not on collaboration, feedback, or cooperation from teammates?

[32] Are all goals BINARY and objective (submit X by Y date, complete Z documentation) rather than subjective (improve attitude, better teamwork, demonstrate professionalism)?

[33] Does the PIP include documentation of critical systems as a primary goal to break the knowledge monopoly?

[34] Does the response acknowledge the litigation risk (previous lawsuit) and explain that PIP goals must create a legally defensible record?

[35] Does the response AVOID goals requiring 360-degree feedback, peer reviews, or subjective manager assessment that the employee could claim as biased?

Good-to-have:

[36] Explains WHY goals must be solitary: prevents blaming uncooperative teammates

[37] Specifies technical depth for documentation (runbooks, architecture diagrams, API specs)

[38] Recommends weekly check-ins to create dense paper trail

[39] Connects PIP to business continuity risk (sole ownership of auth system)

[40] States the true purpose: create legally undeniable gap, not genuine expectation of reform.

Ideal:

[41] Provides specific neutral language for PIP delivery meeting

[42] Prioritizes documentation above feature work to preempt “too busy” excuse

[43] Proposes objective verification (junior engineer can execute runbook without asking questions)

[44] Anticipates employee counter-moves (“goals are impossible, designed to make me fail”)

[45] Includes explicit exit trigger: what happens on Day 30/60/90 if a binary goal is missed.

C.4 Results: 16 Models, Zero Passes

Every model scored zero on this task. All 16 models were gated out by mandatory criteria. The failure pattern is exactly what the failure_analysis predicted: models draft a standard PIP with clear technical goals (“ship feature X...
