Measurement Risk in Supervised Financial NLP: Rubric and Metric Sensitivity on JF-ICR

Peiying Zhu; Rongdong Chai; Sidi Chang; Yuxiao Chen

arXiv: 2604.27374 · v1 · submitted 2026-04-30 · cs.AI · cs.CL

Pith reviewed 2026-05-07 08:59 UTC · model grok-4.3

keywords: measurement risk · rubric sensitivity · financial NLP · benchmark evaluation · implicit commitment recognition · LLM labeling · metric identifiability · earnings call analysis

The pith

Rubric wording changes model labels on financial implicit-commitment data, with agreement between variants ranging from 70 to 83 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether gold labels in supervised financial NLP remain stable when the rubric used to create them is reworded or when the evaluation metric is swapped. It applies five rubric variants and five ordinal metrics across four frontier models on a fixed 253-item Japanese earnings-call test set for implicit-commitment recognition. Rubric differences move a substantial fraction of labels, especially near the boundary between explicit and implicit commitment. Certain standard metrics become uninformative because the class distribution makes near-misses too easy or the rarest class too sparse. Model rankings stabilize only after the audit restricts attention to the three identifiable metrics.

Core claim

Gold labels in supervised financial NLP are not objective because they depend on the precise wording of the annotation rubric and on the choice of performance metric; on the JF-ICR split, inter-rubric agreement ranges from 70.0 percent to 83.4 percent with most movement at the +1/0 boundary, within-one accuracy and worst-class accuracy fail identifiability tests, and Bradley-Terry, Borda, and Ranked Pairs rankings agree only on the subset of exact accuracy, macro-F1, and weighted kappa.
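To make the agreement figures concrete, here is a minimal sketch of how pairwise agreement between two rubric variants, and the count of label movements across the +1/0 boundary, could be computed. The function names, the integer label scale, and the toy label vectors are illustrative assumptions; they are not the authors' code or data.

```python
def percent_agreement(labels_a, labels_b):
    """Share of items (in percent) that receive the same label under both rubrics."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label vectors must cover the same items")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return 100.0 * matches / len(labels_a)


def boundary_flips(labels_a, labels_b, boundary=(0, 1)):
    """Number of items whose label moves between the two boundary classes."""
    low, high = boundary
    return sum({a, b} == {low, high} for a, b in zip(labels_a, labels_b))


# Toy labels for ten items under two hypothetical rubric variants.
# The values -1, 0, 1, 2 are an assumed ordinal scale for illustration only.
rubric_2 = [1, 1, 0, 0, 2, -1, 1, 0, 1, 0]
rubric_3 = [1, 0, 0, 1, 2, -1, 1, 0, 1, 0]

print(percent_agreement(rubric_2, rubric_3))  # share of identical labels
print(boundary_flips(rubric_2, rubric_3))     # movements across the +1/0 boundary
```

Reporting the boundary count next to the raw agreement percentage shows where the disagreement concentrates, not only how much of it there is.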

What carries the argument

The measurement-risk audit that crosses five rubric variants, three sampling temperatures, and five ordinal metrics on four LLMs evaluated against the pinned 253-item JF-ICR test split.

If this is right

Most label movement occurs at the pragmatic boundary between +1 and 0 implicit commitment.
Exact accuracy, macro-F1, and weighted kappa remain informative while within-one accuracy is too lenient and worst-class accuracy is too noisy.
Ranking methods agree on model order once the audit is restricted to the three identifiable metrics.
Full sweeps across all five metrics produce unstable orderings for the closest model pairs.
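The five metrics named above can be sketched in plain Python to show why two of them behave poorly on a skewed class distribution. This is a sketch under stated assumptions: integer ordinal labels, quadratic disagreement weights for kappa (the source does not say which weighting the paper uses), and a made-up ten-item gold vector in which the rarest class has two examples.

```python
from collections import Counter


def exact_accuracy(gold, pred):
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def within_one_accuracy(gold, pred):
    # Credits any prediction at most one ordinal step away from gold.
    return sum(abs(g - p) <= 1 for g, p in zip(gold, pred)) / len(gold)


def worst_class_accuracy(gold, pred):
    # Minimum per-class recall over the classes present in gold.
    support = Counter(gold)
    hits = Counter(g for g, p in zip(gold, pred) if g == p)
    return min(hits[c] / n for c, n in support.items())


def macro_f1(gold, pred):
    classes = sorted(set(gold) | set(pred))
    scores = []
    for c in classes:
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def weighted_kappa(gold, pred, power=2):
    # Cohen's weighted kappa with |i - j|**power disagreement weights.
    # power=2 (quadratic) is an assumption made for this sketch.
    n = len(gold)
    gold_counts, pred_counts = Counter(gold), Counter(pred)
    observed = sum(abs(g - p) ** power for g, p in zip(gold, pred)) / n
    expected = sum(
        gold_counts[a] * pred_counts[b] * abs(a - b) ** power
        for a in gold_counts
        for b in pred_counts
    ) / n ** 2
    if expected == 0:
        return 0.0  # undefined when both sides use a single class; convention only
    return 1 - observed / expected


# Toy skewed distribution: majority class 0, rarest class 2 with two items.
gold = [0] * 6 + [1] * 2 + [2] * 2
majority_baseline = [0] * 10  # a predictor that always outputs the majority class

for name, fn in [
    ("exact accuracy", exact_accuracy),
    ("within-one accuracy", within_one_accuracy),
    ("worst-class accuracy", worst_class_accuracy),
    ("macro-F1", macro_f1),
    ("weighted kappa", weighted_kappa),
]:
    print(f"{name}: {fn(gold, majority_baseline):.3f}")
```

On this toy data the constant predictor earns a high within-one score while weighted kappa reports no agreement beyond chance, which is the kind of leniency the audit flags.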

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Benchmark creators should publish the exact rubric text and the metric-selection rule they used.
Similar rubric sensitivity is likely in other financial NLP tasks that rely on ordinal judgment of language.
Controlled experiments that isolate changes in semantics from changes in example count or verbosity could test the pragmatic-boundary hypothesis directly.

Load-bearing premise

The five rubric variants are representative of the wording differences that actually occur when financial benchmarks are built in practice.

What would settle it

Repeat the full rubric-and-metric sweep on an independently constructed financial dataset of similar size and class balance; if label shifts and metric informativeness patterns fail to repeat, the claimed measurement risk does not generalize.

Original abstract

As LLMs become credible readers of earnings calls, investor-relations Q&A, guidance, and disclosure language, supervised financial NLP benchmarks increasingly function as decision evidence for model selection and deployment. A hidden assumption is that gold labels make such evidence objective. This assumption breaks down when the benchmark ruler itself is sensitive to rubric wording, metric choice, or aggregation policy. We study this measurement risk on Japanese Financial Implicit-Commitment Recognition (JF-ICR; a pinned 253-item test split × 4 frontier LLMs × 5 rubrics × 3 temperatures × 5 ordinal metrics). Three findings follow. First, rubric wording materially changes model-assigned labels: R2--R3 agreement ranges from 70.0% to 83.4%, with the dominant movement near the +1 / 0 implicit-commitment boundary. This pattern is consistent with a pragmatic-boundary interpretation, but is not a validated linguistic-causality claim because the present rubric variants confound semantics, examples, and verbosity. Second, not every metric remains informative under the JF-ICR class distribution. Within-one accuracy is too easy because near misses receive credit and the majority class dominates; worst-class accuracy is too noisy because the rarest class has only two examples. Exact accuracy, macro-F1, and weighted \k{appa} are therefore the identifiable metrics under our operational rule. Third, ranking claims become more defensible only after this metric-identifiability audit: Bradley--Terry, Borda, and Ranked Pairs agree on the identifiable metric subset, while the full five-metric sweep produces disagreement on the closest pair. The contribution is not a new leaderboard, but a reporting discipline for supervised financial benchmarks whose gold labels exist and whose evaluation ruler still requires governance.
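The third finding, that a ranking can change depending on which metrics enter the aggregation, can be illustrated with a small Borda-count sketch. Everything here is hypothetical: the model names, the scores, and the helper functions are invented for illustration and do not reproduce the paper's results or its exact aggregation procedure.

```python
def borda_points(table, metrics):
    """Borda count: on each metric a model earns one point per model it strictly beats."""
    models = list(table)
    points = {m: 0 for m in models}
    for metric in metrics:
        for m in models:
            points[m] += sum(
                table[m][metric] > table[other][metric]
                for other in models
                if other != m
            )
    return points


def ranking(points):
    """Models ordered by descending points; ties broken alphabetically."""
    return sorted(points, key=lambda m: (-points[m], m))


# Hypothetical scores (higher is better). A and B are the close pair.
scores = {
    "A": {"exact": 0.62, "macro_f1": 0.48, "kappa": 0.55, "within_one": 0.95, "worst_class": 0.0},
    "B": {"exact": 0.60, "macro_f1": 0.47, "kappa": 0.53, "within_one": 0.97, "worst_class": 1.0},
    "C": {"exact": 0.50, "macro_f1": 0.40, "kappa": 0.40, "within_one": 0.96, "worst_class": 0.5},
}

identifiable = ["exact", "macro_f1", "kappa"]
all_five = identifiable + ["within_one", "worst_class"]

print(ranking(borda_points(scores, identifiable)))  # order on the identifiable subset
print(ranking(borda_points(scores, all_five)))      # order on the full five-metric sweep
```

In this toy table the two extra metrics each carry a full vote, so they are enough to swap the closest pair even though the three identifiable metrics agree with one another.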

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper shows rubric changes shift labels on JF-ICR and narrows to three stable metrics, but the variants mix too many factors to isolate wording and the tiny rare class limits how far the findings travel.

The core point is that gold labels in financial NLP benchmarks are more fragile than people treat them, and this work measures that fragility on a Japanese implicit-commitment dataset with four LLMs and five rubric versions. They find pairwise agreement between 70 and 83 percent, mostly flipping at the +1/0 boundary, and they argue that only exact accuracy, macro-F1, and weighted kappa stay informative once you account for the class distribution. The practical upshot is that model rankings look more consistent once you drop the noisier metrics. That is a useful reminder for anyone picking LLMs for earnings-call or disclosure analysis where the output feeds real decisions.

The work does a clean job of laying out the operational rule for which metrics count as identifiable and showing how the full metric sweep creates disagreement on close pairs. It stays empirical and does not overclaim a new theory. The main limitation is exactly what the stress-test flags: the rubric variants change semantics, examples, and verbosity together, so the label shifts cannot be pinned on wording alone. The 253-item pinned split also has a rare class with only two examples, which makes worst-class accuracy look noisy by construction and turns the choice of stable metrics into something that may not hold on other distributions. No statistical tests or replication sets are described in the abstract, which leaves the generalizability open.

This is worth a referee for groups that build or audit financial NLP benchmarks. Readers who care about evaluation hygiene in high-stakes domains will get something concrete from it, even if they end up wanting tighter controls on the rubric changes and a bigger or more varied test collection. I would send it to review rather than desk-reject.

Referee Report

2 major / 2 minor

Summary. The paper claims that rubric wording materially changes model-assigned labels on the JF-ICR benchmark (R2--R3 agreement 70.0% to 83.4%, dominant movement at the +1/0 boundary), that not every metric is informative under the observed class distribution (within-one accuracy too easy due to majority class and near-miss credit; worst-class accuracy too noisy with only two examples in the rarest class), and that model rankings (Bradley-Terry, Borda, Ranked Pairs) become more defensible and consistent only after restricting to the identifiable metrics (exact accuracy, macro-F1, weighted kappa). The contribution is framed as a reporting discipline and metric-identifiability audit rather than a new leaderboard, based on direct empirical measurements across 4 LLMs, 5 rubrics, 3 temperatures, and 5 ordinal metrics on a pinned 253-item split.

Significance. If the empirical patterns hold, the work is significant for supervised financial NLP because it supplies concrete agreement percentages and an operational rule for metric identifiability, directly addressing the hidden assumption that gold labels are objective when LLMs serve as readers of earnings calls and disclosures. The absence of free parameters, derivations, or invented entities, combined with the use of fixed models and a pinned split, strengthens the reproducibility of the reported measurements and provides a practical template for benchmark governance.

major comments (2)

[Abstract, first finding] The central claim attributes material label changes to 'rubric wording' (70.0%--83.4% R2--R3 agreement), yet the manuscript itself states that the five rubric variants confound semantics, examples, and verbosity; without controls that hold examples and length fixed while varying only the commitment definition, the observed shifts cannot isolate wording effects and therefore do not yet establish that the reported measurement risk is representative of supervised financial NLP.
[Abstract, second finding] Worst-class accuracy is dismissed as noisy precisely because the rarest class contains only two examples on the 253-item pinned split; this makes the 'identifiable metric' subset (exact accuracy, macro-F1, weighted kappa) an artifact of this particular distribution rather than a general property, so the claim that ranking defensibility improves after the audit requires either sensitivity analysis on larger or differently distributed benchmarks or explicit qualification of its scope.
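The two-example concern in the second comment can be quantified with a binomial confidence interval. The sketch below uses the Wilson score interval at an assumed 95% level (z = 1.96); whether the paper computes intervals this way is not stated in the material reviewed here, so this is an illustration only.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# A class with two examples can only score 0/2, 1/2, or 2/2.
for correct in range(3):
    low, high = wilson_interval(correct, 2)
    print(f"{correct}/2 correct -> interval [{low:.2f}, {high:.2f}]")

# For contrast: a proportion estimated on all 253 items.
low, high = wilson_interval(126, 253)
print(f"126/253 correct -> interval [{low:.2f}, {high:.2f}]")
```

With two items, every possible outcome leaves an interval covering most of the unit range, so differences between models on worst-class accuracy carry little evidential weight at this sample size.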

minor comments (2)

[Abstract] The notation 'weighted k{appa}' is a likely typesetting artifact for weighted kappa and should be rendered consistently as weighted kappa or weighted κ.
[Abstract and methods] The five rubric variants are described only at a high level; including their full text (or a pointer to an appendix) would allow readers to assess the confounding factors directly.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on causal isolation and scope of claims. We will revise the abstract and discussion to align phrasing with the manuscript's existing qualifications, framing the results as evidence of measurement risk under realistic rubric variations on this specific benchmark rather than isolated wording effects or universal metric properties.

Referee: [Abstract, first finding] the central claim attributes material label changes to 'rubric wording' (70.0%--83.4% R2--R3 agreement), yet the manuscript itself states that the five rubric variants confound semantics, examples, and verbosity; without controls that hold examples and length fixed while varying only the commitment definition, the observed shifts cannot isolate wording effects and therefore do not yet establish that the reported measurement risk is representative of supervised financial NLP.

Authors: We agree that the current rubric variants do not isolate wording alone. The manuscript already states in the abstract that 'the present rubric variants confound semantics, examples, and verbosity' and that the pattern 'is not a validated linguistic-causality claim.' The abstract's shorthand 'rubric wording' is imprecise. We will revise the abstract's first sentence to read 'changes across five rubric variants (confounding semantics, examples, and verbosity) materially change model-assigned labels' while retaining the explicit disclaimer. This preserves the empirical agreement ranges and the contribution as a reporting discipline for benchmark governance without claiming pure wording causality.

Revision: yes

Referee: [Abstract, second finding] worst-class accuracy is dismissed as noisy precisely because the rarest class contains only two examples on the 253-item pinned split; this makes the 'identifiable metric' subset (exact accuracy, macro-F1, weighted kappa) an artifact of this particular distribution rather than a general property, so the claim that ranking defensibility improves after the audit requires either sensitivity analysis on larger or differently distributed benchmarks or explicit qualification of its scope.

Authors: We accept that metric identifiability is distribution-dependent. The paper already qualifies the rule as 'under our operational rule' and 'under the JF-ICR class distribution.' We will add explicit scope language in the abstract and conclusion: 'The identifiable metrics (exact accuracy, macro-F1, weighted kappa) and improved ranking consistency hold for the observed 253-item distribution with its small rarest class; applicability to other distributions requires separate validation.' No new experiments are needed because the work is scoped to this pinned split; the qualification directly addresses the concern while keeping the contribution as a template for benchmark-specific audits.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; purely empirical measurements on fixed inputs

The paper reports direct experimental results from evaluating four LLMs on a pinned 253-item JF-ICR test split across five rubric variants, three temperatures, and five metrics. All claims (rubric agreement ranges of 70.0-83.4%, identification of exact accuracy/macro-F1/weighted kappa as informative, and improved ranking consensus on that subset) are observational outputs from those runs rather than derivations, fitted parameters, or self-referential definitions. No equations, predictions, or uniqueness theorems appear that reduce to the inputs by construction, and no load-bearing self-citations are invoked to justify the central findings. The work is therefore self-contained as an empirical audit.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is an empirical sensitivity study; it introduces no free parameters, axioms, or invented entities.
