arXiv: 2606.30987 · v1 · pith:N2BPTAP4new · submitted 2026-06-29 · 💻 cs.CL · econ.GN · q-fin.EC

Measuring Judgment Quality in Natural-Language Explanations: Evidence from Forecasting Tournaments

Pith reviewed 2026-07-01 01:16 UTC · model grok-4.3

classification 💻 cs.CL · econ.GN · q-fin.EC
keywords explanation quality · forecasting · large language models · judgment · natural language explanations · accuracy prediction · reasoning patterns

The pith

LLM-scored Explanation Quality Markers predict forecast accuracy from written rationales.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether large language models can extract measurable signals of judgment quality from natural-language explanations in forecasting. It defines sixty theory-guided patterns called Explanation Quality Markers and scores over 55,000 forecast-rationale pairs. These scores predict accuracy at the level of individual forecasts and individual forecasters, and they do so more reliably than earlier text-analysis techniques. The patterns also show consistent directional links to accuracy in over ninety percent of significant cases.

Core claim

Explanation Quality Markers (EQMs) are sixty theory-guided reasoning patterns scored by LLMs that extract judgment-relevant information from written explanations, and in a pre-registered study of more than 55,000 forecast-rationale pairs these markers predict accuracy at both forecast and forecaster levels while outperforming pre-LLM methods.

What carries the argument

Explanation Quality Markers (EQMs), sixty theory-guided reasoning patterns in explanations that are scored automatically by large language models to measure judgment quality.

Load-bearing premise

LLM scoring of the sixty patterns yields reliable, low-noise measurements of reasoning quality.

What would settle it

A direct comparison where human coders score the same patterns on a subset of rationales and the human scores fail to predict accuracy while LLM scores do, or vice versa.
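
A comparison of this kind needs an agreement statistic between human and LLM codes for each pattern. The sketch below computes Cohen's κ, the statistic named in the Figure 14 caption, for two raters who label the same rationales. It assumes each pattern is reduced to a categorical label (for example, present or absent); the function name and the toy labels are illustrative and not taken from the paper.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: share of items where the two raters match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independence, from each rater's marginals.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / n ** 2
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined,
        # so this sketch returns 1.0 by convention (an assumption).
        return 1.0
    return (observed - expected) / (1 - expected)

# Hypothetical codes for one pattern on four rationales (1 = present).
human_codes = [1, 1, 0, 0]
llm_codes = [1, 0, 0, 0]
kappa = cohens_kappa(human_codes, llm_codes)
```

Agreement alone would not settle the question; the test described above also requires checking whether each rater's scores predict accuracy.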

Figures

Figures reproduced from arXiv: 2606.30987 by Christopher W. Karvetski, Ezra Karger, Jingyu Hu, Nadja Flechner, Philip Tetlock, Sheldon S. Huang, Simas Kučinskas.

Figure 1: Overview of the EQM scoring pipeline. A forecast–rationale pair is scored by an LLM against 60 theory-derived patterns that are defined in natural language, each with a directional hypothesis. The 60 scores are converted into a single composite quality score via LASSO regression and validated against realized forecasting accuracy.
Figure 2: Distributions of rationale length and number of EQM patterns per rationale. Main ACE analysis sample (𝑁 = 55,463). Left panel: word count per rationale. Right panel: number of EQM patterns with a score greater than zero. Solid vertical lines denote means; dashed vertical lines denote medians.
Figure 3: Mean scores and prevalence of high-prevalence EQM patterns. Main ACE analysis sample (𝑁 = 55,463); patterns with prevalence below 10% are omitted. Bar length: mean EQM score across all rationales (including zeros). Bar color: EQM family, with family labels shown inline. Right column labels: prevalence, defined as the percentage of rationales with a score greater than zero for the given EQM pattern.
Figure 4: Pattern-level correlations with forecasting accuracy. Each point represents one EQM pattern or one pre-LLM feature. 𝑥-axis: correlation with forecaster-level accuracy (𝐴𝐵𝑆norm, 𝑁 = 1,770). 𝑦-axis: correlation with forecast-level accuracy (𝐵𝑆diff, 𝑁 = 55,463). EQM patterns are colored by family; pre-LLM features are shown in gray. Labels are provided for selected EQM patterns with large absolute correlation…
Figure 5: Mean accuracy across the EQM composite score distribution. Observations are sorted into nine equal-sized bins by the EQM composite score. Bars show mean accuracy within each bin, with bins grouped visually into bottom, middle, and top thirds of the EQM distribution. Top panel: forecast-level accuracy, measured by 𝐵𝑆diff (𝑁 = 55,463). Bottom panel: forecaster-level accuracy, measured by 𝐴𝐵𝑆norm (𝑁 = 1,770).…
Figure 6: Aggregate Brier scores by indicator and selection rule. Left panel: mean aggregation. Right panel: median aggregation. The horizontal line marks the baseline Brier score of the aggregate computed from all thirty forecasts on each question. Bars show the average Brier score after retaining either the top third (ten forecasts) or top two-thirds (twenty forecasts) within each question, ranked by the indicator…
Figure 7: Distributions of rationale length and number of EQM patterns per rationale. Rated-rationale sample (𝑁 = 5,190). Left panel: word count per rationale. Right panel: number of EQM patterns with a score greater than zero. Solid vertical lines denote means; dashed vertical lines denote medians.
Figure 8: Mean accuracy across human-rating and EQM composite score bins. Observations are sorted into nine equal-sized bins by the relevant measure. Left column: bins formed by the average human rating. Right column: bins formed by the EQM composite score. Top row: forecast-level accuracy, measured by 𝐵𝑆diff (𝑁 = 5,190). Bottom row: forecaster-level accuracy, measured by 𝐴𝐵𝑆norm (𝑁 = 141). Positive values indicate …
Figure 9: Pattern-level correlations with average human ratings and forecast-level accuracy. Rated-rationale sample (𝑁 = 5,190). Each point represents one EQM pattern, with colors indicating EQM family; Sqrt. Word Count is shown separately in gray. 𝑥-axis: correlation with average human ratings. 𝑦-axis: correlation with forecast-level accuracy, measured by 𝐵𝑆diff. Vertical and horizontal reference lines mark zero co…
Figure 10: Normalized multivariate regression weights for average human ratings and forecast-level accuracy. We estimated two OLS regressions using the same predictors: the 50 EQM patterns and the square root of word count (Sqrt. Word Count). Predictors were standardized before model fitting. Coefficients from each model were then normalized by dividing by the sum of their absolute values, making the two coefficien…
Figure 11: Mean accuracy across human-rating and EQM composite score bins in the Team Dynamics dataset. Observations are sorted into nine equal-sized bins by the relevant measure. Left column: bins formed by average human rating. Right column: bins formed by the EQM composite score. Top row: forecast-level accuracy, measured by within-question rank (𝑁 = 42,809). Bottom row: forecaster-level accuracy, measured by ave…
Figure 12: Correlations among accuracy-relevant EQM patterns at the forecast level. Patterns are included if they had a positive or negative directional hypothesis and were negatively or positively significantly correlated with forecast-level accuracy, measured by 𝐵𝑆diff, at 𝑝 < .001 in hypothesized ways. The first column shows correlations between each EQM pattern and forecast-level accuracy (𝐵𝑆diff). Remaining cel…
Figure 13: Correlations among accuracy-relevant EQM patterns at the forecaster level. Patterns are included if they had a positive or negative directional hypothesis and were negatively or positively significantly correlated with forecaster-level accuracy, measured by 𝐴𝐵𝑆norm, at 𝑝 < .001 in hypothesized ways. The first column shows correlations between each EQM pattern and forecaster-level accuracy (𝐴𝐵𝑆norm). Remai…
Figure 14: Inter-model reliability between GPT-4o and Gemini 2.5 Pro for each EQM pattern. Horizontal axis: Pearson correlation 𝑟. Vertical axis: Cohen’s 𝜅. Horizontal reference bands reflect the agreement categories of Landis and Koch (1977), and vertical bands reflect effect size conventions from Cohen (1988).
Figure 15: Season-over-season correlations of indicators with forecast-level accuracy under alternative participation thresholds. This figure extends …
Figure 16: Season-over-season correlations of indicators with forecaster-level accuracy under alternative participation thresholds. This figure extends …
Figure 17: Aggregate Brier scores by indicator, selection rule, aggregation method, and participation threshold. This figure extends …
Figure 18: Adjusted training effects across EQM reasoning patterns. Points show the estimated coefficient on probability training from separate forecaster-level models for each EQM pattern, controlling for average word count. Bars show 95% confidence intervals. Black dots indicate effects that remain significant after Benjamini-Hochberg FDR correction.
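
The selection rule in the Figure 6 caption (rank the forecasts on a question by an indicator, keep the top third or top two-thirds, pool them, and score the pooled forecast) can be sketched as follows. This is a minimal illustration, not the authors' code: the data layout, the function names, and the assumption that a higher indicator score marks a better forecast are all hypothetical.

```python
from statistics import mean, median

def brier(probability, outcome):
    """Binary Brier score for one pooled forecast."""
    return (probability - outcome) ** 2

def aggregate_brier(forecasts, outcome, keep_fraction=1.0, pool=mean):
    """Pool the top-ranked forecasts on one question and score the result.

    forecasts: list of (probability, indicator_score) pairs (hypothetical layout).
    keep_fraction: share of forecasts retained after ranking by the indicator.
    pool: aggregation function, e.g. statistics.mean or statistics.median.
    """
    if not forecasts:
        raise ValueError("need at least one forecast")
    # Assumption: a higher indicator score means a better-ranked forecast.
    ranked = sorted(forecasts, key=lambda pair: pair[1], reverse=True)
    keep = max(1, round(len(ranked) * keep_fraction))
    pooled = pool([p for p, _ in ranked[:keep]])
    return brier(pooled, outcome)

# Toy question that resolved "yes" (outcome = 1); all values are made up.
question = [(0.9, 3.0), (0.8, 2.0), (0.2, 1.0)]
baseline = aggregate_brier(question, outcome=1)
top_third = aggregate_brier(question, outcome=1, keep_fraction=1 / 3)
```

In the toy data the indicator happens to rank the most accurate forecast first, so the filtered aggregate scores better than the baseline; with an uninformative indicator the filtered aggregate would show no such gain.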
Original abstract

Decision-makers routinely rely on expert judgments accompanied by written explanations, yet explanation quality is difficult to measure at scale. Forecasting tournaments offer a natural testing ground: probabilistic judgments are paired with natural-language rationales and scored against realized outcomes. We introduce Explanation Quality Markers (EQMs), a set of sixty theory-guided reasoning patterns scored by large language models (LLMs). In a pre-registered analysis of over 55,000 forecast-rationale pairs from a multiyear forecasting tournament, EQMs predict accuracy at both the forecast and forecaster levels, consistently outperforming pre-LLM text-analysis methods. More than 90% of statistically significant pattern-level EQM-accuracy correlations match our directional hypotheses. The signal is asymmetric: EQMs identify likely underperformers more reliably than they distinguish the very best forecasters. Benchmarked against traditional indicators of forecasting skill, EQMs are the strongest predictor at the forecast level and competitive at the forecaster level, though weaker than prior accuracy. Human ratings of rationale quality are less consistently correlated with accuracy and place disproportionate weight on rationale length. Results transfer to an independent forecasting study. EQMs provide a scalable, interpretable method for extracting judgment-relevant information from written explanations.
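
For readers unfamiliar with how forecasts are "scored against realized outcomes", the sketch below shows the Brier score in its common binary form: the squared difference between the forecast probability and the 0/1 outcome, where lower is better. The function names and sample values are illustrative assumptions; the paper's derived accuracy measures (𝐵𝑆diff and 𝐴𝐵𝑆norm) are not defined on this page and are not reproduced here.

```python
def brier_score(probability, outcome):
    """Squared error between a forecast probability and a binary outcome."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    if outcome not in (0, 1):
        raise ValueError("outcome must be 0 or 1")
    return (probability - outcome) ** 2

def mean_brier(pairs):
    """Average Brier score over (probability, outcome) pairs."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one forecast")
    return sum(brier_score(p, o) for p, o in pairs) / len(pairs)

# Made-up example: a confident correct forecast and a hedged wrong one.
confident_hit = brier_score(0.8, 1)
hedged_miss = brier_score(0.5, 0)
```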

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces Explanation Quality Markers (EQMs), a set of sixty theory-guided reasoning patterns scored by LLMs, and reports a pre-registered analysis of over 55,000 forecast-rationale pairs from a multiyear forecasting tournament. EQM scores predict accuracy at both the forecast and forecaster levels and consistently outperform pre-LLM text-analysis methods, with more than 90% of statistically significant pattern-level correlations matching the directional hypotheses. The signal is asymmetric (stronger for identifying underperformers). Benchmarked against traditional skill indicators, EQMs are the strongest predictor at the forecast level. Human ratings of rationale quality correlate less consistently with accuracy and emphasize length, and the results transfer to an independent study.

Significance. If the LLM-based measurements prove reliable, the work supplies a scalable, interpretable approach for extracting judgment-relevant information from natural-language explanations, with potential applications in AI-supported decision systems and expert evaluation. The pre-registration, directional hypotheses, large sample size, and transfer result are methodological strengths that anchor the empirical claims.

major comments (2)
  1. [Abstract] The headline results rest on the assumption that LLM-assigned scores for the 60 EQMs are reliable, low-noise measures of reasoning quality. The abstract states only that LLMs score the patterns; it reports no human-coder agreement statistics, inter-model consistency checks, or prompt-robustness results. This assumption is load-bearing for the central claim, because correlations with forecast accuracy (and the 90% directional match rate) could arise from LLM artifacts such as length or fluency biases rather than from the intended constructs.
  2. [Abstract] Abstract and results sections: The paper reports that human ratings of rationale quality are less consistently correlated with accuracy than EQMs and place disproportionate weight on length, yet it does not describe whether these human ratings were used to validate the LLM scoring procedure or to calibrate the EQM prompts. Without such linkage, the superiority claim over human ratings does not directly address measurement validity of the LLM step.
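
One way to probe the length-artifact concern raised in comment 1 is a partial correlation: correlate an EQM score with accuracy after removing the linear effect of rationale length from both. The sketch below uses the standard first-order partial correlation formula; it is an illustrative check under assumed variable names and toy data, not the analysis reported in the paper.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def partial_correlation(x, y, z):
    """Correlation of x and y after linearly controlling for z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Toy data (made up): the control is uncorrelated with both variables,
# so the partial correlation equals the raw correlation.
eqm_score = [1, 2, 3, 4]
accuracy = [2, 1, 4, 3]
length_control = [1, -1, -1, 1]
raw = pearson(eqm_score, accuracy)
adjusted = partial_correlation(eqm_score, accuracy, length_control)
```

If an EQM score predicted accuracy only because long rationales score higher, the adjusted value would shrink toward zero relative to the raw one.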

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for highlighting the importance of measurement validity for the LLM-scored EQMs. We address the two major comments below, providing additional context from the manuscript while agreeing to strengthen the abstract where appropriate.

Point-by-point responses
  1. Referee: [Abstract] The headline results rest on the assumption that LLM-assigned scores for the 60 EQMs are reliable, low-noise measures of reasoning quality. The abstract states only that LLMs score the patterns; it reports no human-coder agreement statistics, inter-model consistency checks, or prompt-robustness results. This assumption is load-bearing for the central claim, because correlations with forecast accuracy (and the 90% directional match rate) could arise from LLM artifacts such as length or fluency biases rather than from the intended constructs.

    Authors: The manuscript's Methods and Appendix sections detail the theory-guided prompt design for each EQM, report inter-model consistency between GPT-4o and Gemini 2.5 Pro (Figure 14; agreement rates above 0.75 for most markers), and include robustness checks that vary prompt phrasing and temperature. We also include length as a covariate in all regressions to address fluency biases. The 90% directional match rate further reduces the plausibility of artifact-driven results, because systematic length or fluency biases would not be expected to align with pre-registered predictions derived from judgment research. We will revise the abstract to briefly note inter-model consistency and length controls. revision: partial

  2. Referee: [Abstract] Abstract and results sections: The paper reports that human ratings of rationale quality are less consistently correlated with accuracy than EQMs and place disproportionate weight on length, yet it does not describe whether these human ratings were used to validate the LLM scoring procedure or to calibrate the EQM prompts. Without such linkage, the superiority claim over human ratings does not directly address measurement validity of the LLM step.

    Authors: Human ratings were collected as a separate benchmark to contrast with EQMs, not as a validation target or calibration source. EQM validity rests on their theory-derived definitions and direct prediction of objective forecast accuracy (the ground truth), plus the pre-registered directional hypothesis tests. Using human ratings for calibration would have risked circularity, as the paper shows humans overweight length and correlate less with accuracy. The LLM step is instead validated by outcome correlations and cross-model stability rather than alignment with potentially flawed human judgments. revision: no

Circularity Check

0 steps flagged

No significant circularity; empirical correlations are externally anchored

Full rationale

The paper reports pre-registered empirical correlations between LLM-scored EQMs and realized forecast accuracy across 55,000+ pairs, with directional hypotheses serving as an external benchmark and no equations or fitted parameters that reduce the accuracy measure to a quantity defined from the same data. No self-definitional loops, fitted-input predictions, or load-bearing self-citations appear in the derivation chain; the central results remain falsifiable against held-out outcomes and independent studies.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the empirical validity of the sixty patterns and the reliability of LLM scoring; no free parameters are fitted to the target accuracy data in the abstract, and no new physical or mathematical entities are postulated.

axioms (1)
  • Domain assumption: LLMs can be prompted to score the presence of theory-guided reasoning patterns in natural language with sufficient consistency for downstream correlation analysis.
    The abstract states that LLMs score the patterns but does not detail validation against human coders or prompt sensitivity tests.
invented entities (1)
  • Explanation Quality Markers (EQMs) · no independent evidence
    Purpose: a set of sixty theory-guided reasoning patterns used as features to predict forecast accuracy.
    EQMs are introduced as a new measurement construct; the abstract provides no independent falsifiable handle outside the reported correlations.

pith-pipeline@v0.9.1-grok · 5777 in / 1579 out tokens · 30813 ms · 2026-07-01T01:16:25.959963+00:00 · methodology


Reference graph

Works this paper leans on

139 extracted references · 58 canonical work pages · 1 internal anchor

  1. [1] Fu, Jinlan and Ng, See-Kiong and Jiang, Zhengbao and Liu, Pengfei. 2024.

  2. [2] X-eval: Generalizable multi-aspect text evaluation via augmented instruction tuning with auxiliary evaluation aspects. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2024.

  3. [3] Can large language models transform computational social science? Computational Linguistics, 2024.

  4. [4] Gilardi, Fabrizio and Alizadeh, Meysam and Kubli, Ma. Proceedings of the National Academy of Sciences, 2023.

  5. [5] Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems.

  6. [6] Halterman, Andrew and Keith, Katherine A. Codebook. 2026.

  7. [7] Schroeder, Kayla and Wood-Doughty, Zach. Can You Trust

  8. [8] Prompt selection matters: enhancing text annotations for social sciences with large language models. Journal of Computational Social Science, 2025.

  9. [9] What's in a Prompt?: A Large-Scale Experiment to Assess the Impact of Prompt Design on the Compliance and Accuracy of LLM-Generated Text Annotations. Proceedings of the International AAAI Conference on Web and Social Media, 2025.

  10. [10] Large language models are not fair evaluators. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024.

  11. [11] Calibrating Language Models via Augmented Prompt Ensembles. Workshop on Challenges in Deployable Generative AI at the 40th International Conference on Machine Learning, 2023.

  12. [12] Non-Determinism of “Deterministic” LLM System Settings in Hosted Environments. Proceedings of the 5th Workshop on Evaluation and Comparison of NLP Systems, 2025.

  13. [13] Statistical Power Analysis for the Behavioral Sciences.

  14. [14] Evaluating Quantile Assessments. Operations Research, 2009.

  15. [15] The Measurement of Observer Agreement for Categorical Data. Biometrics, 1977.

  16. [16] Interrater Reliability: The Kappa Statistic. Biochemia Medica, 2012.

  17. [17] Rathje, Steve and Mirea, Dan-Mircea and Sucholutsky, Ilia and Marjieh, Raja and Robertson, Claire E. and Van Bavel, Jay J. 2024.

  18. [18] Duckworth, Angela.

  19. [19] Boyd, Ryan L. and Ashokkumar, Ashwini and Seraj, Sarah and Pennebaker, James W. The Development and Psychometric Properties of

  20. [20] What do forecasting rationales reveal about thinking patterns of top geopolitical forecasters? International Journal of Forecasting, 2022.

  21. [21] Coherence of probability judgments from uncertain evidence: Does ACH help? Judgment and Decision Making, 2020.

  22. [22] Effective Forecasting and Judgmental Adjustments: An Empirical Evaluation and Strategies for Improvement in Supply-Chain Planning. International Journal of Forecasting, 2009.

  23. [23] Two Ways to Be Complex and Why They Matter: Implications for Attitude Strength and Lying. Journal of Personality and Social Psychology, 2008.

  24. [24] Calculation for the Test of the Difference Between Two Dependent Correlations With One Variable in Common. 2013.

  25. [25] Tests for Comparing Elements of a Correlation Matrix. Psychological Bulletin, 1980.

  26. [26] Superforecasting: The Art and Science of Prediction.

  27. [27] Shadows of Wisdom: Classifying Meta-Cognitive and Morally Grounded Narrative Content via Large Language Models. Behavior Research Methods, 2024.

  28. [28] The Forecasting Proficiency Test: A General Use Assessment of Forecasting Ability. 2024.

  29. [29] Small Steps to Accuracy: Incremental Belief Updaters Are Better Forecasters. Organizational Behavior and Human Decision Processes, 2020.

  30. [30] Verification of Forecasts Expressed in Terms of Probability. Monthly Weather Review, 1950.

  31. [31] Accountability and Adaptive Performance Under Uncertainty: A Long-Term View. Judgment and Decision Making, 2017.

  32. [32] Developing Expert Political Judgment: The Impact of Training and Practice on Judgmental Accuracy in Geopolitical Forecasting Tournaments. Judgment and Decision Making, 2016.

  33. [33] Automated Integrative Complexity. Political Psychology, 2014.

  34. [34] Kahneman, Daniel and Tversky, Amos. Econometrica, 1979.

  35. [35] Thinking, Fast and Slow.

  36. [36] Kelly, Megan O. and Budescu, David V. and Dhami, Mandeep and Mandel, David R. Judgment and Decision Making, 2025.

  37. [37] Mandel, David R. and Irwin, Daniel and Dhami, Mandeep K. and Budescu, David V. Journal of Behavioral Decision Making, 2023.

  38. [38] Psychological Strategies for Winning a Geopolitical Forecasting Tournament. Psychological Science, 2014.

  39. [39] The psychology of intelligence analysis: drivers of prediction accuracy in world politics. Journal of Experimental Psychology: Applied, 2015.

  40. [40] Identifying and Cultivating Superforecasters as a Method of Improving Probabilistic Predictions. Perspectives on Psychological Science, 2015.

  41. [41] How Generalizable Is Good Judgment? A Multi-Task, Multi-Benchmark Study. Judgment and Decision Making, 2017.

  42. [42] Confidence Calibration in a Multiyear Geopolitical Forecasting Competition. Management Science, 2017.

  43. [43] Pennebaker, James W. and Boyd, Ryan L. and Jordan, Kayla and Blackburn, Kate. The Development and Psychometric Properties of

  44. [44] Combining Multiple Probability Predictions Using a Simple Logit Model. International Journal of Forecasting, 2014.

  45. [45] Assessing Objective Recommendation Quality through Political Forecasting. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, September 2017.

  46. [46] Forecasting Tournaments: Tools for Increasing Transparency and Improving the Quality of Debate. Current Directions in Psychological Science, 2014.

  47. [47] Expert Political Judgment: How Good Is It? How Can We Know?

  48. [48] Judgment Under Uncertainty: Heuristics and Biases. Science, 1974.

  49. [49] Measuring Forecasting Skill from Text. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, July 2020.

  50. [50] Knowing with Certainty: The Appropriateness of Extreme Confidence. Journal of Experimental Psychology: Human Perception and Performance, 1977.

  51. [51] Individual Differences in Reasoning: Implications for the Rationality Debate? Behavioral and Brain Sciences, 2000.

  52. [52] Confirmation Bias: A Ubiquitous Phenomenon in Many Guises. Review of General Psychology, 1998.

  53. [53] On the Psychology of Prediction. Psychological Review, 1973.

  54. [54] Fast and Frugal Forecasting. International Journal of Forecasting, 2009.

  55. [55] Construal-Level Theory of Psychological Distance. Psychological Review, 2010.

  56. [56] Thinking and Deciding.

  57. [57] What Makes Foreign Policy Teams Tick: Explaining Variation in Group Performance at Geopolitical Forecasting. The Journal of Politics, 2019.

  58. [58] Forecasting Forecaster Accuracy: Contributions of Past Performance and Individual Differences. Judgment and Decision Making, 2021.

  59. [59] Morrison, Donald G. Journal of the American Statistical Association, 1972.

  60. [60] Murphy, Allan H. Journal of Applied Meteorology, 1973.

  61. [62] Stanovich, Keith E. Thinking & Reasoning, 2018.

  62. [63] Strictly Proper Scoring Rules, Prediction, and Estimation. Journal of the American Statistical Association, 2007.

  63. [64] Peter Suedfeld and Philip E. Tetlock and Siegfried Streufert. In Motivation and Personality: Handbook of Thematic Content Analysis, 1992.

  64. [65] Peter Suedfeld and Philip Tetlock. Journal of Conflict Resolution, March 1977.

  65. [66] G. Baker-Brown and E. J. Ballard and S. Bluck and B. de Vries and P. Suedfeld and P. E. Tetlock. In Motivation and Personality: Handbook of Thematic Content Analysis, 1992.

  66. [68] The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2009.

  67. [69] Barker, Jessica and Karger, Ezra and Himmelstein, Mark and Budescu, David and Benjamin, Daniel M. March 2025. doi:10.17605/OSF.IO/FM7VJ

  68. [70] Introducing. April 2026.

  69. [71] Louis Abraham, Charles Arnal, and Antoine Marie. Prompt selection matters: enhancing text annotations for social sciences with large language models. Journal of Computational Social Science, 8(3):73, 2025. doi:10.1007/s42001-025-00388-6

  70. [72] Anthropic. Introducing Claude Opus 4.7, April 2026. URL https://www.anthropic.com/news/claude-opus-4-7. Accessed 2026-05-05. API pricing: $5/$25 per million input/output tokens; 50\

  71. [73] Pavel Atanasov, Jens Witkowski, Lyle Ungar, Barbara Mellers, and Philip E. Tetlock. Small steps to accuracy: Incremental belief updaters are better forecasters. Organizational Behavior and Human Decision Processes, 160:19–35, 2020. doi:10.1016/j.obhdp.2020.02.001

  72. [74] Berk Atıl, Sarp Aykent, Alexa Chittams, Lisheng Fu, Rebecca J Passonneau, Evan Radcliffe, Guru Rajan Rajagopal, Adam Sloan, Tomasz Tudrej, Ferhan Türe, Zhe Wu, Lixinyu Xu, and Breck Baldwin. Non-determinism of “deterministic” LLM system settings in hosted environments. In Proceedings of the 5th Workshop on Evaluation and Comparison of NLP Systems, pag...

  73. [75] Shubham Atreja, Joshua Ashkinaze, Lingyao Li, Julia Mendelsohn, and Libby Hemphill. What's in a prompt?: A large-scale experiment to assess the impact of prompt design on the compliance and accuracy of LLM-generated text annotations. In Proceedings of the International AAAI Conference on Web and Social Media, volume 19, pages 122–145, 2025. doi:10.1609/i...

  74. [76] G. Baker-Brown, E. J. Ballard, S. Bluck, B. de Vries, P. Suedfeld, and P. E. Tetlock. The conceptual/integrative complexity scoring manual. In Charles P. Smith, editor, Motivation and Personality: Handbook of Thematic Content Analysis, pages 401–418. Cambridge University Press, Cambridge, 1992. doi:10.1017/CBO9780511527937.029

  75. [77] Jessica Barker, Ezra Karger, Mark Himmelstein, David Budescu, and Daniel M. Benjamin. Team Dynamics: An experiment testing how group size and information sharing affect forecasting accuracy, March 2025. URL https://osf.io/fm7vj

  76. [78] Jonathan Baron. Thinking and Deciding. Cambridge University Press, 4th edition, 2008.

  77. [79] Ryan L. Boyd, Ashwini Ashokkumar, Sarah Seraj, and James W. Pennebaker. The development and psychometric properties of LIWC-22. Technical report, University of Texas at Austin, Austin, TX, 2022.

  78. [80] Glenn W. Brier. Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78(1):1–3, 1950. doi:10.1175/1520-0493(1950)078<0001:VOFEIT>2.0.CO;2

  79. [81] Welton Chang, Eva Chen, Barbara Mellers, and Philip Tetlock. Developing expert political judgment: The impact of training and practice on judgmental accuracy in geopolitical forecasting tournaments. Judgment and Decision Making, 11(5):509–526, 2016. doi:10.1017/s1930297500004599

  80. [82] Welton Chang, Pavel Atanasov, Shefali Patil, Barbara A. Mellers, and Philip E. Tetlock. Accountability and adaptive performance under uncertainty: A long-term view. Judgment and Decision Making, 12(6):610–626, 2017. doi:10.1017/s1930297500006732

Showing first 80 references.