Aggregate vs. Personalized Judges in Business Idea Evaluation: Evidence from Expert Disagreement
Pith reviewed 2026-05-08 11:53 UTC · model grok-4.3
The pith
Personalized judges conditioned on an individual evaluator's scoring history align more closely with that evaluator than aggregate judges using mixed histories.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Across the six business-oriented dimensions, the personalized judge that conditions on the target evaluator's scoring history produces outputs that match the corresponding evaluator more closely than either a rubric-only zero-shot judge or an aggregate judge conditioned on mixed evaluator histories; moreover, agreement between two evaluators correlates with similarity of judge-generated reasoning only when the judge is personalized rather than pooled.
What carries the argument
Personalized conditioning on the target evaluator's scoring history, which learns individual patterns and enables comparison against aggregate conditioning on mixed histories and rubric-only baselines.
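To make the three configurations concrete, the following sketch shows one plausible way the in-context history could be selected and placed in a prompt. Everything here is an assumption for illustration: the `ScoredExample` type, the mode names, the prompt wording, and the choice to let the aggregate pool include the target evaluator are hypothetical and not taken from the paper.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredExample:
    evaluator_id: str
    idea: str
    score: int


def select_history(pool, mode, target, k, rng):
    """Pick k in-context examples for the given judge configuration."""
    if mode == "zero_shot":
        return []  # rubric only, no scoring history
    if mode == "personalized":
        candidates = [ex for ex in pool if ex.evaluator_id == target]
    elif mode == "aggregate":
        candidates = list(pool)  # mixed histories from all evaluators (assumed)
    else:
        raise ValueError(f"unknown mode: {mode}")
    if len(candidates) < k:
        raise ValueError("not enough scored examples for this configuration")
    return rng.sample(candidates, k)


def build_prompt(rubric, history, idea):
    """Assemble a plain-text prompt: rubric, optional examples, then the query."""
    lines = ["Rubric:", rubric]
    for ex in history:
        lines.append(f"Example idea: {ex.idea}\nScore: {ex.score}")
    lines.append(f"Idea to score: {idea}\nScore:")
    return "\n".join(lines)
```

Keeping `k` identical across the personalized and aggregate modes is what lets a comparison attribute any difference to whose history is shown rather than how much of it.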
Load-bearing premise
The observed expert disagreements reflect consistent, learnable individual differences in judgment that can be captured from scoring history rather than irreducible random noise or unmeasured factors.
What would settle it
A replication on a comparable dataset of expert business-idea scores in which the personalized judge fails to show higher alignment with the target evaluator than the aggregate judge, or in which inter-evaluator agreement no longer predicts reasoning similarity under personalization.
Original abstract
Evaluating LLM-generated business ideas is often harder to scale than generating them. Unlike standard NLP benchmarks, business idea evaluation relies on multi-dimensional criteria such as feasibility, novelty, differentiation, user need, and market size, and expert judgments often disagree. This paper studies a methodological question raised by such disagreement: should an automatic judge approximate an aggregate consensus, or model evaluators individually? We introduce PBIG-DATA, a dataset of approximately 3,000 individual scores across 300 patent-grounded product ideas, provided by domain experts on six business-oriented dimensions: specificity, technical validity, innovativeness, competitive advantage, need validity, and market size. Analyses show substantial expert disagreement on fine-grained ordinal scores, while agreement is higher under coarse selection, suggesting structured heterogeneity rather than random noise. We then compare three judge configurations: a rubric-only zero-shot judge, an aggregate judge conditioned on mixed evaluator histories, and a personalized judge conditioned on the target evaluator's scoring history. Across dimensions and model sizes, personalized judges align more closely with the corresponding evaluator than aggregate judges, and evaluator agreement correlates with similarity of judge-generated reasoning only under personalized conditioning. These results indicate that pooled labels can be a fragile target in pluralistic evaluation settings and motivate evaluator-conditioned judge designs for business idea assessment.
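The contrast between fine-grained and coarse agreement can be illustrated with a small sketch. It assumes that "coarse selection" means both evaluators placing an idea on the same side of a score threshold; the paper's actual definition may differ, and the function names and example scores are made up for illustration.

```python
def fine_agreement(scores_a, scores_b):
    """Fraction of ideas on which two evaluators give the identical ordinal score."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    return sum(a == b for a, b in zip(scores_a, scores_b)) / len(scores_a)


def coarse_agreement(scores_a, scores_b, threshold):
    """Fraction of ideas on which both evaluators fall on the same side of a threshold."""
    selected_a = [a >= threshold for a in scores_a]
    selected_b = [b >= threshold for b in scores_b]
    return fine_agreement(selected_a, selected_b)


# Hypothetical scores from two evaluators on six ideas.
evaluator_a = [1, 2, 4, 3, 5, 2]
evaluator_b = [2, 1, 5, 4, 4, 2]
print(fine_agreement(evaluator_a, evaluator_b))       # exact matches only
print(coarse_agreement(evaluator_a, evaluator_b, 3))  # "select" if score >= 3
```

In this toy example the two evaluators match exactly on only one idea out of six, yet they agree on every select-or-reject decision, which is the kind of pattern the abstract describes as structured heterogeneity.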
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the PBIG-DATA dataset (~3,000 expert scores on 300 patent-grounded business ideas across six dimensions: specificity, technical validity, innovativeness, competitive advantage, need validity, market size) and compares three LLM judge setups: rubric-only zero-shot, aggregate (conditioned on mixed evaluator histories), and personalized (conditioned on the target evaluator's scoring history). It reports substantial fine-grained expert disagreement but higher coarse agreement, and claims that personalized judges achieve closer alignment with individual evaluators than aggregate judges, with evaluator agreement correlating to similarity in judge-generated reasoning only under personalized conditioning.
Significance. If the comparisons are robust, the work provides empirical evidence that pooled/aggregate labels can be fragile targets when expert disagreement reflects structured heterogeneity rather than noise, motivating evaluator-conditioned judge designs for pluralistic evaluation tasks. The new dataset grounded in real patents is a useful resource for studying LLM judges in applied business/innovation settings.
major comments (3)
- §4 (Judge Configurations and Experimental Setup): The manuscript does not report whether the aggregate judge receives an equivalent number and format of scoring examples as the personalized judge. Without this control, the reported alignment advantage for personalized judges could arise from differences in prompt richness or example volume rather than capture of person-specific criteria.
- §4 and §5 (Results): No mismatched-history control (conditioning on random other evaluators' scores) is described. Such a control is necessary to test whether the personalization benefit is due to modeling individual heterogeneity versus generic example-following; its absence leaves the central claim that 'personalized judges align more closely with the corresponding evaluator' vulnerable to alternative explanations.
- §5 (Analyses): The abstract and results report directional improvements in alignment and correlation without providing full statistical details (e.g., exact agreement rates, effect sizes, confidence intervals, or per-dimension breakdowns with model sizes). This limits verification of robustness across the six dimensions and model scales.
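As an illustration of the kind of interval estimate the third comment asks for, the sketch below computes a percentile bootstrap confidence interval for the mean paired difference in alignment (personalized minus aggregate). It is a generic technique and not the paper's analysis; the function name, defaults, and the idea of pairing by evaluator are assumptions.

```python
import random
from statistics import mean


def bootstrap_ci(diffs, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap CI for the mean of paired differences."""
    if not diffs:
        raise ValueError("need at least one paired difference")
    rng = random.Random(seed)
    n = len(diffs)
    # Resample the differences with replacement and record each resample's mean.
    means = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_boot))
    lo_idx = int((1 - level) / 2 * n_boot)
    hi_idx = min(n_boot - 1, int((1 + level) / 2 * n_boot))
    return means[lo_idx], means[hi_idx]
```

An interval that excludes zero for a given dimension and model size would support a directional claim for that cell; reporting the interval per cell is what makes robustness checkable.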
minor comments (2)
- Abstract: Include at least one quantitative metric (e.g., coarse vs. fine-grained agreement rates or average alignment delta) to allow readers to assess the magnitude of the reported effects without reading the full methods.
- Dataset section: Report inter-rater reliability metrics (e.g., Krippendorff's alpha or pairwise agreement) for the six dimensions to quantify the 'substantial expert disagreement' claim.
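For readers unfamiliar with the reliability metric named in the last comment, here is a minimal sketch of Krippendorff's alpha computed directly from its definition, alpha = 1 - D_o / D_e. It assumes an interval (squared-difference) distance by default, which is a simplification for ordinal rubric scores; the ordinal variant uses a different distance. The data layout (one list of scores per idea) and the function name are hypothetical.

```python
from itertools import permutations


def krippendorff_alpha(units, distance=lambda a, b: (a - b) ** 2):
    """Krippendorff's alpha; each unit lists the scores one idea received."""
    # Ideas scored by fewer than two evaluators carry no agreement information.
    pairable = [u for u in units if len(u) >= 2]
    n = sum(len(u) for u in pairable)
    if n < 2:
        raise ValueError("need at least one idea with two or more scores")
    # Observed disagreement: within-idea pairs, each idea weighted by 1 / (m_u - 1).
    observed = sum(
        sum(distance(a, b) for a, b in permutations(u, 2)) / (len(u) - 1)
        for u in pairable
    ) / n
    # Expected disagreement: all ordered pairs of scores, ignoring which idea they belong to.
    values = [v for u in pairable for v in u]
    expected = sum(distance(a, b) for a, b in permutations(values, 2)) / (n * (n - 1))
    if expected == 0:
        raise ValueError("alpha is undefined when all scores are identical")
    return 1 - observed / expected
```

The expected-disagreement term is quadratic in the number of scores, which is acceptable for a few thousand scores but would call for a coincidence-matrix formulation at larger scale.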
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of experimental controls and statistical transparency that we have addressed through targeted revisions to strengthen the manuscript.
- Referee: §4 (Judge Configurations and Experimental Setup): The manuscript does not report whether the aggregate judge receives an equivalent number and format of scoring examples as the personalized judge. Without this control, the reported alignment advantage for personalized judges could arise from differences in prompt richness or example volume rather than capture of person-specific criteria.
  Authors: We agree that explicit confirmation of matched prompt structure is necessary. In the original design, both aggregate and personalized judges used identical example counts (five per dimension) and formatting. However, this was not documented comparatively. We have revised §4 to add Table 2, which details prompt templates, example counts, token lengths, and formatting for both conditions, confirming equivalence. This isolates the personalization effect.
  Revision: yes
- Referee: §4 and §5 (Results): No mismatched-history control (conditioning on random other evaluators' scores) is described. Such a control is necessary to test whether the personalization benefit is due to modeling individual heterogeneity versus generic example-following; its absence leaves the central claim that 'personalized judges align more closely with the corresponding evaluator' vulnerable to alternative explanations.
  Authors: This is a substantive point. We have added a mismatched-history control experiment in the revised §5, conditioning judges on histories from randomly selected non-target evaluators. New results (Table 4) show personalized judges retain a statistically significant advantage over the mismatched control (average +11.4% alignment, p < 0.01 across dimensions), supporting that gains arise from individual-specific modeling rather than generic in-context learning. We have updated the methods and discussion sections accordingly.
  Revision: yes
- Referee: §5 (Analyses): The abstract and results report directional improvements in alignment and correlation without providing full statistical details (e.g., exact agreement rates, effect sizes, confidence intervals, or per-dimension breakdowns with model sizes). This limits verification of robustness across the six dimensions and model scales.
  Authors: We concur that fuller statistical reporting improves verifiability. The revised §5 now includes exact agreement rates, Cohen's kappa, Cohen's d effect sizes with 95% confidence intervals, and complete per-dimension and per-model (GPT-3.5-turbo, GPT-4, Llama-3-70B) breakdowns. These appear in updated Tables 3–5 and Appendix C. The abstract has been revised to reference the expanded statistics.
  Revision: yes
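The mismatched-history control discussed in the exchange above can be sketched as follows. This is a hypothetical illustration, not the authors' code: it assumes histories are stored per evaluator, that the control draws from a single randomly chosen non-target evaluator, and that the example count `k` is the same as in the personalized condition (the rebuttal mentions five per dimension).

```python
import random


def matched_history(histories, target, k, rng):
    """k examples from the target evaluator's own scoring history."""
    return rng.sample(histories[target], k)


def mismatched_history(histories, target, k, rng):
    """k examples from one randomly chosen non-target evaluator (the control)."""
    # Sort donor ids so the choice depends only on the seed, not on dict order.
    donors = sorted(e for e in histories if e != target and len(histories[e]) >= k)
    if not donors:
        raise ValueError("no other evaluator has at least k scored examples")
    donor = rng.choice(donors)
    return donor, rng.sample(histories[donor], k)
```

Because both conditions show the judge the same number of examples in the same format, a remaining gap between them points to whose history is shown rather than to generic example-following.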
Circularity Check
No circularity: purely empirical comparison of prompting strategies
The paper introduces a new dataset (PBIG-DATA) of expert scores and performs direct empirical comparisons between rubric-only, aggregate-history, and personalized-history LLM judges. No mathematical derivations, equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the reported chain. Claims about alignment and reasoning similarity rest on observed metrics from the experiments rather than reducing to definitional equivalences or prior author results by construction. The analysis is self-contained against external benchmarks of judge performance.
Forward citations
Cited by 1 Pith paper
- DRIP-R: A Benchmark for Decision-Making and Reasoning Under Real-World Policy Ambiguity in the Retail Domain
  DRIP-R is a new benchmark showing that frontier LLMs systematically disagree on how to resolve identical ambiguous retail policy scenarios, highlighting ambiguity as a core challenge for agent decision-making.
[5]
Can be read as language, but the idea’s meaning is barely conveyed; 3
Cannot be read as coherent language; 2. Can be read as language, but the idea’s meaning is barely conveyed; 3. One or more concrete products can be imagined; 4. A single concrete product can be clearly imagined. Technical valid- ity Feasibility of implementing the idea using the patent
-
[6]
Building a prototype using the technology is challenging but possible; 3
The patented technology does not seem suitable for the use; 2. Building a prototype using the technology is challenging but possible; 3. A prototype could be built using the technology; 4. The technology can be applied to a production-level product. Innovativeness Novelty and originality of the proposed solution
-
[7]
Known use case of similar technology, but not yet fully explored; 3
A well-known application; lacks novelty; 2. Known use case of similar technology, but not yet fully explored; 3. A use case I hadn’t thought of, but not particularly exciting; 4. Surprising and novel; strong originality; 5. Clearly innovative and potentially groundbreaking. Competitive advantage Distinct benefits and advan- tages over existing solutions. ...
-
[8]
Only B; 3
Neither A nor B; 2. Only B; 3. Only A; 4. Both. Need validity Relevance of the product to genuine user needs
-
[9]
Both qualitative and quantitative returns are low; 2
Not a B2B product; 1. Both qualitative and quantitative returns are low; 2. Either quantitative (monetary) or qualitative (for corporate growth) returns are large; 3. Both qualitative and quantitative returns are large. Market size Number of potential users
-
[10]
a plausible but not obvious extension of the patent … beyond generic customization by integrating into clinical workflows and compliance needs
Not a B2B product; 1. Niche, appeals to some companies; 2. Many com- panies acknowledge the issue; adoption depends on budget/systems; 3. Nec- essary for almost all companies. Table 6: Full rubric for the six business-oriented scoring dimensions. Dim. NLP CS Mat. Spec. 0.012% 0.103% 0.319% Tech. val. 0.026% 0.008% 0.108% Innov. 0.000% 0.000% 0.000% Comp. ...
2025
Table 8 reports Krippendorff’s α between the judge predictions and expert annotations for the aggregate and personalized configurations. Personalized conditioning yields alignment that is higher than or comparable to the aggregate judge on five of the six dimensions, with the largest gaps on need validity and market size. This mirrors the pattern obse...