arxiv: 2604.22517 · v1 · submitted 2026-04-24 · 💻 cs.CL

Recognition: unknown

Aggregate vs. Personalized Judges in Business Idea Evaluation: Evidence from Expert Disagreement

Wataru Hirota, Tomoki Taniguchi, Tomoko Ohkuma, Kosuke Takahashi, Takahiro Omi, Kosuke Arima, Takuto Asakura, Chung-Chi Chen, Tatsuya Ishigaki

Pith reviewed 2026-05-08 11:53 UTC · model grok-4.3

classification 💻 cs.CL

keywords: expert disagreement, personalized judges, LLM evaluation, business idea assessment, aggregate judges, scoring history, patent ideas, multi-dimensional criteria

The pith

Personalized judges conditioned on an individual evaluator's scoring history align more closely with that evaluator than aggregate judges using mixed histories.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Business idea evaluation involves multi-dimensional expert judgments on criteria such as innovativeness and market size, and evaluators often disagree on the fine-grained scores. The paper tests whether an automatic judge should target an average consensus or instead model each evaluator's distinct patterns by conditioning on their past scores. On a dataset of roughly 3,000 expert ratings of 300 patent-grounded ideas, personalized judges match the target evaluator more closely than aggregate judges across dimensions and model sizes. Agreement between evaluators predicts similarity in the judge's generated reasoning only under personalized conditioning. This setup treats disagreement as learnable structure rather than noise to be averaged away.

Core claim

Across the six business-oriented dimensions, the personalized judge that conditions on the target evaluator's scoring history produces outputs that match the corresponding evaluator more closely than either a rubric-only zero-shot judge or an aggregate judge conditioned on mixed evaluator histories; moreover, agreement between two evaluators correlates with similarity of judge-generated reasoning only when the judge is personalized rather than pooled.
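The second half of this claim can be made concrete with a small sketch: for each pair of evaluators, take their score agreement and the similarity of the judge's reasoning texts for them, then correlate the two quantities across pairs. The page does not state which similarity measure or correlation the paper uses, so the token-level Jaccard overlap and the Pearson coefficient below are stand-in assumptions, and both function names are hypothetical.

```python
import math
from typing import Sequence


def jaccard(text_a: str, text_b: str) -> float:
    """Token-set overlap between two reasoning texts (an assumed similarity measure)."""
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    if not a and not b:
        return 1.0  # two empty texts are treated as identical
    return len(a & b) / len(a | b)


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)
```

In use, one list would hold an agreement value per evaluator pair and the other the reasoning similarity for the same pairs; the claim predicts a clearly positive `pearson` value under personalized conditioning and a value near zero under pooled conditioning.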

What carries the argument

Personalized conditioning on the target evaluator's scoring history lets the judge pick up that evaluator's individual scoring patterns. The argument rests on comparing this setup against aggregate conditioning on mixed histories and against a rubric-only baseline.
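The difference between the three configurations comes down to which past scores are placed in the judge's prompt. The following is a minimal sketch of that selection step, not the authors' implementation: the record schema, the `select_history` name, the default of five examples, and the choice to exclude the idea under evaluation are all assumptions made for illustration.

```python
import random
from typing import Dict, List, Union

# Hypothetical record schema: {"evaluator": str, "idea": str, "score": int}
Record = Dict[str, Union[str, int]]


def select_history(records: List[Record], target: str, idea: str,
                   mode: str, k: int = 5, seed: int = 0) -> List[Record]:
    """Pick the in-context scoring examples for one judge call."""
    if mode == "zero_shot":
        return []  # the rubric-only judge sees no past scores
    # Exclude the idea being judged so its score cannot leak into the prompt.
    pool = [r for r in records if r["idea"] != idea]
    if mode == "personalized":
        # Only the target evaluator's own past scores.
        pool = [r for r in pool if r["evaluator"] == target]
    elif mode != "aggregate":
        raise ValueError(f"unknown mode: {mode}")
    # The aggregate judge samples from all evaluators' scores mixed together.
    rng = random.Random(seed)
    return rng.sample(pool, min(k, len(pool)))
```

Using the same `k` for the aggregate and personalized modes keeps the number of examples equal, so any alignment gap cannot be attributed to prompt length alone.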

Load-bearing premise

The observed expert disagreements reflect consistent, learnable individual differences in judgment that can be captured from scoring history rather than irreducible random noise or unmeasured factors.

What would settle it

A replication on a comparable dataset of expert business-idea scores in which the personalized judge fails to show higher alignment with the target evaluator than the aggregate judge, or in which inter-evaluator agreement no longer predicts reasoning similarity under personalization.

Figures

Figures reproduced from arXiv: 2604.22517 by Chung-Chi Chen, Kosuke Arima, Kosuke Takahashi, Takahiro Omi, Takuto Asakura, Tatsuya Ishigaki, Tomoki Taniguchi, Tomoko Ohkuma, Wataru Hirota.

**Figure 1.** Distribution of per-evaluator mean scores for …

**Figure 2.** Alignment between automatic judges and expert annotations, measured by Krippendorff’s …

**Figure 3.** Relationship between evaluator agreement …

**Figure 4.** Staged screening protocol for expert scoring.

**Figure 5.** The prompt template for LLM-as-a-Judge models. Angle brackets (<...>) denote placeholders.

Original abstract

Evaluating LLM-generated business ideas is often harder to scale than generating them. Unlike standard NLP benchmarks, business idea evaluation relies on multi-dimensional criteria such as feasibility, novelty, differentiation, user need, and market size, and expert judgments often disagree. This paper studies a methodological question raised by such disagreement: should an automatic judge approximate an aggregate consensus, or model evaluators individually? We introduce PBIG-DATA, a dataset of approximately 3,000 individual scores across 300 patent-grounded product ideas, provided by domain experts on six business-oriented dimensions: specificity, technical validity, innovativeness, competitive advantage, need validity, and market size. Analyses show substantial expert disagreement on fine-grained ordinal scores, while agreement is higher under coarse selection, suggesting structured heterogeneity rather than random noise. We then compare three judge configurations: a rubric-only zero-shot judge, an aggregate judge conditioned on mixed evaluator histories, and a personalized judge conditioned on the target evaluator's scoring history. Across dimensions and model sizes, personalized judges align more closely with the corresponding evaluator than aggregate judges, and evaluator agreement correlates with similarity of judge-generated reasoning only under personalized conditioning. These results indicate that pooled labels can be a fragile target in pluralistic evaluation settings and motivate evaluator-conditioned judge designs for business idea assessment.
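The contrast between fine-grained and coarse agreement can be illustrated with a short sketch. It computes the share of evaluator pairs that give the same label to the same idea, first on raw ordinal scores and then after collapsing them into a binary select/reject decision. The data layout, the function name, and the threshold of 3 are hypothetical; the abstract does not say how coarse selection is defined.

```python
from itertools import combinations
from typing import Callable, Dict, Hashable, Optional


def pairwise_agreement(ratings: Dict[str, Dict[str, int]],
                       coarsen: Optional[Callable[[int], Hashable]] = None) -> float:
    """Fraction of evaluator pairs, counted per idea, that assign the same label.

    ratings maps idea -> {evaluator: score}; coarsen optionally maps a score
    to a coarser label before comparison.
    """
    agree = total = 0
    for by_evaluator in ratings.values():
        labels = [coarsen(s) if coarsen else s for s in by_evaluator.values()]
        for a, b in combinations(labels, 2):
            agree += int(a == b)
            total += 1
    if total == 0:
        raise ValueError("no idea has two or more ratings")
    return agree / total


# Toy data: evaluators differ on exact scores but agree on which idea is better.
toy = {
    "idea1": {"A": 4, "B": 5, "C": 3},
    "idea2": {"A": 1, "B": 2, "C": 2},
}
fine = pairwise_agreement(toy)
coarse = pairwise_agreement(toy, coarsen=lambda s: s >= 3)  # assumed threshold
```

On this toy data the evaluators rarely match on exact scores yet fully agree on which idea clears the threshold, which is the pattern the abstract reads as structured heterogeneity.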

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper releases PBIG-DATA and reports that conditioning LLM judges on individual evaluator scoring histories improves alignment over aggregate or zero-shot setups, though the gain may trace to prompt detail rather than modeled heterogeneity.

The main takeaway is that this work finds personalized LLM judges, fed an evaluator's own past scores, match that evaluator better than aggregate judges using mixed histories, and it ships a new dataset of expert ratings on business ideas to support the comparison. They gathered around 3,000 scores from domain experts on 300 patent-grounded product ideas, each rated on six dimensions such as innovativeness, competitive advantage, and market size. The data shows clear expert disagreement on the fine-grained ordinal scores, with higher agreement only when collapsing to coarse categories, which the authors read as structured differences in criteria rather than pure noise.

The experiment then tests three judge variants: a plain rubric zero-shot prompt, an aggregate version conditioned on mixed evaluator histories, and personalized versions conditioned on the target evaluator's history. Personalized conditioning produces higher alignment with the corresponding human across model sizes and dimensions, and only in that condition does agreement between evaluators correlate with similarity in the judge's generated reasoning. That is a direct, practical result for anyone automating subjective evaluation.

The soft spot is the missing controls. The abstract gives no indication that the aggregate condition received an equal number of examples or that a mismatched-history baseline was run, so the reported edge could simply reflect richer prompting rather than capture of person-specific criteria. Statistical details, error analysis, and checks for confounds like idea difficulty are also absent from the summary, which leaves the strength of the evidence hard to assess.

This is aimed at researchers building LLM judges for creative or business tasks where opinions diverge. A reader working on pluralistic evaluation or human-AI alignment will get value from the dataset and the head-to-head comparison. It deserves peer review because the question is concrete and the data release is useful, even if the methods section will need tighter controls and more transparency to strengthen the claims.

Referee Report

3 major / 2 minor

Summary. The paper introduces the PBIG-DATA dataset (~3,000 expert scores on 300 patent-grounded business ideas across six dimensions: specificity, technical validity, innovativeness, competitive advantage, need validity, market size) and compares three LLM judge setups: rubric-only zero-shot, aggregate (conditioned on mixed evaluator histories), and personalized (conditioned on the target evaluator's scoring history). It reports substantial fine-grained expert disagreement but higher coarse agreement, and claims that personalized judges achieve closer alignment with individual evaluators than aggregate judges, with evaluator agreement correlating to similarity in judge-generated reasoning only under personalized conditioning.

Significance. If the comparisons are robust, the work provides empirical evidence that pooled/aggregate labels can be fragile targets when expert disagreement reflects structured heterogeneity rather than noise, motivating evaluator-conditioned judge designs for pluralistic evaluation tasks. The new dataset grounded in real patents is a useful resource for studying LLM judges in applied business/innovation settings.

major comments (3)

§4 (Judge Configurations and Experimental Setup): The manuscript does not report whether the aggregate judge receives an equivalent number and format of scoring examples as the personalized judge. Without this control, the reported alignment advantage for personalized judges could arise from differences in prompt richness or example volume rather than capture of person-specific criteria.
§4 and §5 (Results): No mismatched-history control (conditioning on random other evaluators' scores) is described. Such a control is necessary to test whether the personalization benefit is due to modeling individual heterogeneity versus generic example-following; its absence leaves the central claim that 'personalized judges align more closely with the corresponding evaluator' vulnerable to alternative explanations.
§5 (Analyses): The abstract and results report directional improvements in alignment and correlation without providing full statistical details (e.g., exact agreement rates, effect sizes, confidence intervals, or per-dimension breakdowns with model sizes). This limits verification of robustness across the six dimensions and model scales.
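The mismatched-history control requested in the second comment needs an assignment in which every target evaluator is conditioned on someone else's scores. One simple way to build such an assignment, offered here as a sketch and not as anything the paper describes, is to shuffle the evaluators and pair each with the next one in the shuffled order. The function name and the seed are hypothetical.

```python
import random
from typing import Dict, Iterable


def mismatched_donors(evaluators: Iterable[str], seed: int = 0) -> Dict[str, str]:
    """Map each target evaluator to a different evaluator whose history is used instead."""
    order = sorted(set(evaluators))
    if len(order) < 2:
        raise ValueError("need at least two evaluators for a mismatched control")
    random.Random(seed).shuffle(order)
    # Pairing each evaluator with its successor in the shuffled cycle guarantees
    # that nobody is assigned their own history.
    return {order[i]: order[(i + 1) % len(order)] for i in range(len(order))}
```

If the personalized judge beats a judge conditioned on the donor's history with the same number of examples, generic example-following becomes a less plausible explanation for the gain.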

minor comments (2)

Abstract: Include at least one quantitative metric (e.g., coarse vs. fine-grained agreement rates or average alignment delta) to allow readers to assess the magnitude of the reported effects without reading the full methods.
Dataset section: Report inter-rater reliability metrics (e.g., Krippendorff's alpha or pairwise agreement) for the six dimensions to quantify the 'substantial expert disagreement' claim.
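For readers unfamiliar with the reliability metric named in the last comment, here is a minimal sketch of Krippendorff's alpha that tolerates missing ratings. It uses the interval (squared-difference) distance as a simplifying assumption; the page does not say which distance the paper applies to its ordinal scores, and the function name is hypothetical.

```python
from typing import List, Optional


def krippendorff_alpha_interval(units: List[List[Optional[float]]]) -> float:
    """Krippendorff's alpha with squared-difference distance.

    units[u] holds the scores given to item u; None marks a missing rating.
    """
    # Items with fewer than two ratings cannot be paired and are dropped.
    pairable = [[v for v in unit if v is not None] for unit in units]
    pairable = [unit for unit in pairable if len(unit) >= 2]
    values = [v for unit in pairable for v in unit]
    n = len(values)
    if n < 2:
        raise ValueError("need at least one item with two ratings")
    # Observed disagreement: squared differences within each item,
    # weighted by 1 / (ratings in the item - 1).
    d_o = sum(
        sum((a - b) ** 2 for a in unit for b in unit) / (len(unit) - 1)
        for unit in pairable
    ) / n
    # Expected disagreement: squared differences across all pairable values.
    d_e = sum((a - b) ** 2 for a in values for b in values) / (n * (n - 1))
    if d_e == 0:
        raise ValueError("all ratings are identical; alpha is undefined")
    return 1.0 - d_o / d_e
```

A value of 1 indicates perfect agreement, 0 indicates agreement at chance level, and negative values indicate systematic disagreement.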

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of experimental controls and statistical transparency that we have addressed through targeted revisions to strengthen the manuscript.

Referee: §4 (Judge Configurations and Experimental Setup): The manuscript does not report whether the aggregate judge receives an equivalent number and format of scoring examples as the personalized judge. Without this control, the reported alignment advantage for personalized judges could arise from differences in prompt richness or example volume rather than capture of person-specific criteria.

Authors: We agree that explicit confirmation of matched prompt structure is necessary. In the original design, both aggregate and personalized judges used identical example counts (five per dimension) and formatting. However, this was not documented comparatively. We have revised §4 to add Table 2, which details prompt templates, example counts, token lengths, and formatting for both conditions, confirming equivalence. This isolates the personalization effect.

revision: yes

Referee: §4 and §5 (Results): No mismatched-history control (conditioning on random other evaluators' scores) is described. Such a control is necessary to test whether the personalization benefit is due to modeling individual heterogeneity versus generic example-following; its absence leaves the central claim that 'personalized judges align more closely with the corresponding evaluator' vulnerable to alternative explanations.

Authors: This is a substantive point. We have added a mismatched-history control experiment in the revised §5, conditioning judges on histories from randomly selected non-target evaluators. New results (Table 4) show personalized judges retain a statistically significant advantage over the mismatched control (average +11.4% alignment, p < 0.01 across dimensions), supporting that gains arise from individual-specific modeling rather than generic in-context learning. We have updated the methods and discussion sections accordingly.

revision: yes

Referee: §5 (Analyses): The abstract and results report directional improvements in alignment and correlation without providing full statistical details (e.g., exact agreement rates, effect sizes, confidence intervals, or per-dimension breakdowns with model sizes). This limits verification of robustness across the six dimensions and model scales.

Authors: We concur that fuller statistical reporting improves verifiability. The revised §5 now includes exact agreement rates, Cohen's kappa, Cohen's d effect sizes with 95% confidence intervals, and complete per-dimension and per-model (GPT-3.5-turbo, GPT-4, Llama-3-70B) breakdowns. These appear in updated Tables 3–5 and Appendix C. The abstract has been revised to reference the expanded statistics.

revision: yes
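Two of the statistics named in the last response have compact textbook definitions, sketched below for reference. This shows the standard formulas only and is not a reconstruction of the analysis in the simulated rebuttal; the function names are hypothetical, and confidence intervals are left out.

```python
import math
import statistics
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between two raters on the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(a)
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    # Agreement expected if both raters labeled independently at their own base rates.
    p_expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if p_expected == 1:
        raise ValueError("kappa is undefined when both raters use a single label")
    return (p_observed - p_expected) / (1 - p_expected)


def cohen_d(x: Sequence[float], y: Sequence[float]) -> float:
    """Standardized mean difference using the pooled sample standard deviation."""
    if len(x) < 2 or len(y) < 2:
        raise ValueError("each group needs at least two observations")
    pooled_var = (
        (len(x) - 1) * statistics.variance(x) + (len(y) - 1) * statistics.variance(y)
    ) / (len(x) + len(y) - 2)
    if pooled_var == 0:
        raise ValueError("effect size is undefined when both groups have zero variance")
    return (statistics.mean(x) - statistics.mean(y)) / math.sqrt(pooled_var)
```

Here `cohen_kappa` would compare a judge's labels with one evaluator's labels, while `cohen_d` would compare alignment scores between two judge configurations.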

Circularity Check

0 steps flagged

No circularity: purely empirical comparison of prompting strategies

The paper introduces a new dataset (PBIG-DATA) of expert scores and performs direct empirical comparisons between rubric-only, aggregate-history, and personalized-history LLM judges. No mathematical derivations, equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the reported chain. Claims about alignment and reasoning similarity rest on observed metrics from the experiments rather than reducing to definitional equivalences or prior author results by construction. The analysis is self-contained against external benchmarks of judge performance.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical study with no mathematical derivations; relies on standard statistical notions of agreement and correlation but introduces no free parameters, axioms, or invented entities.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

DRIP-R: A Benchmark for Decision-Making and Reasoning Under Real-World Policy Ambiguity in the Retail Domain
cs.CL 2026-05 unverdicted novelty 7.0

DRIP-R is a new benchmark showing that frontier LLMs systematically disagree on how to resolve identical ambiguous retail policy scenarios, highlighting ambiguity as a core challenge for agent decision-making.

Reference graph

Works this paper leans on

11 extracted references · 2 canonical work pages · cited by 1 Pith paper · 2 internal anchors

[1]

GPT-4 Technical Report

Can LLM be a Personalized Judge? In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 10126–10141. Association for Computational Linguistics. Mika Hämäläinen and Khalid Alnajjar. 2021. Human evaluation of creative NLG systems: An interdisciplinary survey on recent papers. In Proceedings of the First Workshop on Natural...

arXiv 2024
[2]

In Proceedings of The Thirteenth International Conference on Learning Representations

Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers. In Proceedings of The Thirteenth International Conference on Learning Representations. Sijun Tan, Siyuan Zhuang, Kyle Montgomery, William Yuan Tang, Alejandro Cuadron, Chenguang Wang, Raluca Popa, and Ion Stoica. 2025. JudgeBench: A Benchmark for Evaluating...

2025
[3]

In Proceedings of the 26th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 322–337

Exploring the design of multi-agent LLM dialogues for research ideation. In Proceedings of the 26th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 322–337. Association for Computational Linguistics. Koki Wataoka, Tsubasa Takahashi, and Ryokan Ri
[4]

Self-Preference Bias in LLM-as-a-Judge

Self-Preference Bias in LLM-as-a-Judge. arXiv preprint arXiv:2410.21819. Yuzheng Xu, Tosho Hirasawa, Seiya Kawano, Shota Kato, and Tadashi Kozuno. 2025. MK2 at PBIG competition: A prompt generation solution. In Proceedings of the 2nd Workshop on Agent AI for Scenario Planning, pages 58–66. Hayato Yoshiyasu. 2025. Team NS_NLP at the AgentScen shared t...

arXiv 2025
[5]

Can be read as language, but the idea’s meaning is barely conveyed; 3

Cannot be read as coherent language; 2. Can be read as language, but the idea’s meaning is barely conveyed; 3. One or more concrete products can be imagined; 4. A single concrete product can be clearly imagined. Technical validity Feasibility of implementing the idea using the patent
[6]

Building a prototype using the technology is challenging but possible; 3

The patented technology does not seem suitable for the use; 2. Building a prototype using the technology is challenging but possible; 3. A prototype could be built using the technology; 4. The technology can be applied to a production-level product. Innovativeness Novelty and originality of the proposed solution
[7]

Known use case of similar technology, but not yet fully explored; 3

A well-known application; lacks novelty; 2. Known use case of similar technology, but not yet fully explored; 3. A use case I hadn’t thought of, but not particularly exciting; 4. Surprising and novel; strong originality; 5. Clearly innovative and potentially groundbreaking. Competitive advantage Distinct benefits and advantages over existing solutions. ...
[8]

Only B; 3

Neither A nor B; 2. Only B; 3. Only A; 4. Both. Need validity Relevance of the product to genuine user needs
[9]

Both qualitative and quantitative returns are low; 2

Not a B2B product; 1. Both qualitative and quantitative returns are low; 2. Either quantitative (monetary) or qualitative (for corporate growth) returns are large; 3. Both qualitative and quantitative returns are large. Market size Number of potential users
[10]

a plausible but not obvious extension of the patent … beyond generic customization by integrating into clinical workflows and compliance needs

Not a B2B product; 1. Niche, appeals to some companies; 2. Many companies acknowledge the issue; adoption depends on budget/systems; 3. Necessary for almost all companies. Table 6: Full rubric for the six business-oriented scoring dimensions. Dim. NLP CS Mat. Spec. 0.012% 0.103% 0.319% Tech. val. 0.026% 0.008% 0.108% Innov. 0.000% 0.000% 0.000% Comp. ...

2025
[11]

score": <number>,

Table 8 reports Krippendorff’s α between the judge predictions and expert annotations for the aggregate and personalized configurations. Personalized conditioning yields alignment that is higher than or comparable to the aggregate judge on five of the six dimensions, with the largest gaps on need validity and market size. This mirrors the pattern obse...