arxiv: 2604.15760 · v1 · submitted 2026-04-17 · 💻 cs.AI · cs.GT

KWBench: Measuring Unprompted Problem Recognition in Knowledge Work

Pith reviewed 2026-05-10 08:45 UTC · model grok-4.3

classification 💻 cs.AI cs.GT
keywords models, problem, kwbench, across, benchmark, knowledge, model, situation

The pith

KWBench finds that the best LLM recognizes the governing problem in a knowledge-work scenario, unprompted, on only about 28% of tasks, even though the same models can name the relevant concepts when asked directly.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The work builds a collection of realistic scenarios drawn from fields such as contract negotiation, clinical pharmacy, and fraud detection. Each scenario contains a hidden game-theoretic pattern such as a conflict of interest or a signaling problem. Models receive only the raw facts and a neutral task prompt; they must first identify what kind of situation they are facing before any solution attempt. A three-tier scoring system requires the model to pass a mandatory check for predicted failure modes before quality is scored. When 16 models were run, the strongest one passed 27.9% of the tasks, and the top two models shared only 31.7% of their passes.
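The gated scoring described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the tier names (mandatory, good-to-have, ideal) follow the rubric excerpt reproduced further down this page, but the `RubricResult` type, the `gated_score` function, and the tier weights are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class RubricResult:
    """Per-criterion pass/fail judgments for one model response (hypothetical type)."""
    mandatory: list[bool]
    good_to_have: list[bool] = field(default_factory=list)
    ideal: list[bool] = field(default_factory=list)


def _fraction_passed(items: list[bool]) -> float:
    # Share of criteria passed in a tier; an empty tier contributes nothing.
    return sum(items) / len(items) if items else 0.0


def gated_score(result: RubricResult,
                weights: tuple[float, float, float] = (0.4, 0.35, 0.25)) -> float:
    """Score a response under a conjunctive mandatory gate.

    The weights are illustrative assumptions, not values from the paper.
    """
    # Conjunctive gate: a single failed mandatory criterion zeroes the task.
    if not all(result.mandatory):
        return 0.0
    w_mandatory, w_good, w_ideal = weights
    return (w_mandatory
            + w_good * _fraction_passed(result.good_to_have)
            + w_ideal * _fraction_passed(result.ideal))


# A response that misses one mandatory criterion is gated out entirely,
# no matter how well it does on the other tiers.
gated_out = RubricResult(mandatory=[True, False], good_to_have=[True], ideal=[True])
passing = RubricResult(mandatory=[True, True], good_to_have=[True, False], ideal=[False])
print(gated_score(gated_out), gated_score(passing))
```

The design point is the early return: quality tiers are only consulted once every mandatory criterion has passed, which is what makes the check conjunctive rather than additive.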

Core claim

The best model passes on 27.9% of tasks. The top two models agree on only 31.7% of their passes. The same models articulate the relevant game-theoretic concept correctly when asked, then fail to apply it unprompted.
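One way to read "agree on only 31.7% of their passes" is as the Jaccard overlap of the two models' pass sets, the measure named in the Figure 4 caption. The sketch below assumes that reading; the task identifiers are made-up toy data, not benchmark results.

```python
def jaccard(passes_a: set[str], passes_b: set[str]) -> float:
    """Jaccard similarity: tasks both models pass / tasks either model passes."""
    union = passes_a | passes_b
    return len(passes_a & passes_b) / len(union) if union else 0.0


# Toy pass sets (hypothetical task ids): 2 shared tasks out of 5 distinct tasks.
model_a_passes = {"t1", "t2", "t3", "t4"}
model_b_passes = {"t3", "t4", "t5"}
print(jaccard(model_a_passes, model_b_passes))
```

A low value here means the two models succeed on largely different tasks even if their headline pass rates are similar.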

Load-bearing premise

The premise is twofold: that the 223 tasks, sourced from practitioners and encoded with formal game-theoretic patterns plus expert ground truth, accurately capture the governing structure of real knowledge-work situations, and that the three-tier rubric with its mandatory conjunctive check validly isolates unprompted recognition.

Figures

Figures reproduced from arXiv: 2604.15760 by Ankit Maloo.

Figure 1: Mean score on KWBench for 16 models from 10 organizations. The best model scores …
Figure 2: Pass rate by consolidated category (top 8 models). Color indicates tier: game-theoretic …
Figure 3: Mandatory gate pass rates for the top 12 models. Annotations show passed/evaluated …
Figure 4: Pairwise Jaccard similarity of gate-pass sets among the top 8 models. Mean overlap …
Figure 5: Task overlap between the top two models. 35 tasks are solved by Opus 4.6 only; 21 …
Figure 6: Capability fingerprints: Opus 4.6 vs GPT-5.4. Each axis is a task category; distance …
Figure 7: Covering the benchmark. Each bar shows a model’s cumulative coverage (grey) plus …
Figure 8: How many of the top 8 models pass each task. 110 tasks are unsolved. Among the …
Original abstract

We introduce the first version of KWBench (Knowledge Work Bench), a benchmark for unprompted problem recognition in large language models: can an LLM identify a professional scenario before attempting to solve it. Existing frontier benchmarks have saturated, and most knowledge-work evaluations to date reduce to extraction or task completion against a specification. KWBench targets the step before that: recognizing the governing structure of the situation from raw inputs alone. The benchmark contains 223 tasks sourced from practitioners across acquisitions, contract negotiations, clinical pharmacy, organizational politics, fraud analysis, and incentive design. Each task encodes a formal game-theoretic pattern (principal-agent conflict, signaling, mechanism design failure, strategic omission, coalitional dynamics, strategic interdependence) and carries structured ground truth recording the expert reading of the situation and the anticipated failure modes. Models receive raw data and a task prompt with no indication of problem type. Scoring is a three-tier rubric gated by a mandatory conjunctive check. Mandatory criteria encode the predicted wrong paths. We evaluate 16 models. The best model passes on 27.9% of tasks. The top two models agree on only 31.7% of their passes. Among the top 8, 44 tasks are solved by exactly one model; routing across the top 8 covers 50.7% of the benchmark, nearly double the best single model. Conditional on passing, quality scores converge (approx 83% across models); unconditional scores do not. Same models articulate the relevant game-theoretic concept correctly when asked, then fail to apply it unprompted. We release KWBench to shift how frontier models are evaluated on knowledge work, scoring them on whether they recognize the right problem from the situation alone, not only on how well they execute once the problem has been framed for them.
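The abstract's routing and conditional-score figures are simple set and mean computations. The sketch below shows how such numbers could be derived from per-model pass sets and per-task scores; the function names, model names, task ids, and scores are hypothetical toy values, and treating a score of zero as "gated out" is an assumption of the example.

```python
from collections import Counter


def routing_coverage(pass_sets: dict[str, set[str]], n_tasks: int) -> float:
    """Fraction of the benchmark passed by at least one model (ideal routing)."""
    covered = set().union(*pass_sets.values())
    return len(covered) / n_tasks


def singleton_tasks(pass_sets: dict[str, set[str]]) -> set[str]:
    """Tasks passed by exactly one model."""
    counts = Counter(task for passed in pass_sets.values() for task in passed)
    return {task for task, n in counts.items() if n == 1}


def mean_scores(scores: list[float]) -> tuple[float, float]:
    """Return (unconditional mean, mean conditional on passing the gate).

    Assumes a gated-out task is recorded as a score of 0.0.
    """
    passed = [s for s in scores if s > 0.0]
    unconditional = sum(scores) / len(scores) if scores else 0.0
    conditional = sum(passed) / len(passed) if passed else 0.0
    return unconditional, conditional


# Toy data: three models on an eight-task benchmark.
toy_passes = {"model_a": {"t1", "t2"}, "model_b": {"t2", "t3"}, "model_c": {"t4"}}
print(routing_coverage(toy_passes, n_tasks=8))
print(sorted(singleton_tasks(toy_passes)))
print(mean_scores([0.0, 0.0, 0.8, 0.9]))
```

The gap between the two means is the point the abstract makes: once zeros from the gate are excluded, models look alike, so the differences between them sit almost entirely in which tasks they get past the gate.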

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Circularity Check

0 steps flagged

No circularity: empirical benchmark with direct measurements

The paper introduces KWBench as an empirical evaluation benchmark consisting of 223 practitioner-sourced tasks with expert-encoded ground truth and a fixed three-tier rubric. All reported results (27.9% best-model pass rate, 31.7% agreement, 44 singleton solves, etc.) are direct counts and percentages computed against these externally provided tasks and rubrics. There are no equations, parameter fittings, derivations, self-citations used as load-bearing premises, or self-referential definitions that reduce any claimed result to its own inputs by construction. The evaluation is a straightforward measurement exercise whose validity rests on the external sourcing and expert annotation steps rather than any internal logical loop.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that practitioner-sourced tasks faithfully encode game-theoretic patterns and that expert-provided ground truth correctly identifies the governing structure and failure modes.

axioms (1)
  • Domain assumption: tasks sourced from practitioners across acquisitions, contract negotiations, clinical pharmacy, organizational politics, fraud analysis, and incentive design accurately represent real knowledge-work scenarios and their governing structures.
    The benchmark construction and scoring rubric depend on this sourcing step being representative and correctly annotated.

pith-pipeline@v0.9.0 · 5625 in / 1244 out tokens · 38235 ms · 2026-05-10T08:45:33.908580+00:00 · methodology

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages

  1. [1]

    Does the response identify survivors as the audience?

  2. [2]

    Does the response mention the 18-month history?

  3. [3]

    Does the response suggest honest communication?

    After (properly calibrated):

  4. [4]

    Does the response identify survivors as primary audience AND explain WHY (they determine the company’s future, they are the flight risk)?

  5. [5]

    Does the response explain the credibility mechanism: “no further cuts” fails because (a) heard before, (b) reads dishonest, (c) rational to disbelieve?

  6. [6]

    Does the response reference specific data: engagement down 6 points, job security as top survey concern, SECOND layoff in 18 months?

  7. [7]

    Does the analysis state this is a CREDIBILITY GAP that cannot be solved by messaging alone?

  8. [8]

    Does the recommendation propose specific alternative language that acknowledges uncertainty honestly?

    The “before” criteria can be passed by restating the task. The “after” criteria require the response to explain causal mechanisms, reference specific data, and propose concrete language.

    Bias Checklist

    Before finalizing, each rubric is checked against c...

  9. [9]

    Alex is the only person who understands the legacy auth system

  10. [10]

    Auth rewrite is 70% complete and due in 8 weeks

  11. [11]

    4–6 month backfill timeline vs. 8-week project deadline

  12. [12]

    Two engineers already asked to transfer; more may follow

  13. [13]

    Trade-off: project timeline vs. team health, no clean answer.

    What the model must recognize (12 items):

  14. [14]

    Employee is building a legal case; he wants vague or collaborative goals

  15. [15]

    PIP is a legal document preparing for lawsuit, not an improvement plan

  16. [16]

    Goals must be SOLITARY: cannot depend on or blame teammates

  17. [17]

    Goals must be BINARY: worked or did not, no subjective quality assessment

  18. [18]

    Documentation requirement serves dual purpose: breaks knowledge hoarding AND creates paper trail

  19. [19]

    Avoid 360 feedback or peer reviews, because the employee will claim bias/conspiracy

  20. [20]

    The goal is not to fix him; it is to make performance gap legally undeniable

  21. [21]

    Previous lawsuit means he knows how to build a wrongful termination case

  22. [22]

    Sole ownership of auth system is business continuity risk AND leverage he holds

  23. [23]

    He will sandbag, claim ambiguity, blame others. Goals must prevent this.

    Common errors (predicted before evaluation):

  24. [24]

    Sets clear technical goals assuming PIP is about improvement

  25. [25]

    Uses subjective criteria (“improve attitude,” “better teamwork”)

  26. [26]

    Includes goals that depend on teammate cooperation

  27. [27]

    Requires 360 feedback or peer assessment

  28. [28]

    Does not address knowledge hoarding through documentation requirement

  29. [29]

    Ignores litigation risk and legal defensibility

  30. [30]

    Assumes employee will try to improve rather than build legal case.

    C.3 How the Rubric Tests for Pitfalls

    The mandatory criteria do not ask “did you write a good PIP?” They ask “did you avoid the specific mistakes that would make this PIP fail in court?”

    Mandatory (all must pass for non-zero score):

  31. [31]

    Are all PIP goals SOLITARY, meaning success or failure depends only on the engineer’s actions, not on collaboration, feedback, or cooperation from teammates?

  32. [32]

    Are all goals BINARY and objective (submit X by Y date, complete Z documentation) rather than subjective (improve attitude, better teamwork, demonstrate professionalism)?

  33. [33]

    Does the PIP include documentation of critical systems as a primary goal to break the knowledge monopoly?

  34. [34]

    Does the response acknowledge the litigation risk (previous lawsuit) and explain that PIP goals must create a legally defensible record?

  35. [35]

    Does the response AVOID goals requiring 360-degree feedback, peer reviews, or subjective manager assessment that the employee could claim as biased?

    Good-to-have:

  36. [36]

    Explains WHY goals must be solitary: prevents blaming uncooperative teammates

  37. [37]

    Specifies technical depth for documentation (runbooks, architecture diagrams, API specs)

  38. [38]

    Recommends weekly check-ins to create dense paper trail

  39. [39]

    Connects PIP to business continuity risk (sole ownership of auth system)

  40. [40]

    States the true purpose: create legally undeniable gap, not genuine expectation of reform.

    Ideal:

  41. [41]

    Provides specific neutral language for PIP delivery meeting

  42. [42]

    Prioritizes documentation above feature work to preempt “too busy” excuse

  43. [43]

    Proposes objective verification (junior engineer can execute runbook without asking questions)

  44. [44]

    Anticipates employee counter-moves (“goals are impossible, designed to make me fail”)

  45. [45]

    Includes explicit exit trigger: what happens on Day 30/60/90 if a binary goal is missed.

    C.4 Results: 16 Models, Zero Passes

    Every model scored zero on this task. All 16 models were gated out by mandatory criteria. The failure pattern is exactly what the failure_analysis predicted: models draft a standard PIP with clear technical goals (“ship feature X...