MARS: Magnitude-Aware Rank Statistics

Afsaneh M. Nejad; Arthur Zimek; Muhammad Rajabinasab

arxiv: 2605.23563 · v1 · pith:LSVUOY7Cnew · submitted 2026-05-22 · 💻 cs.LG

Muhammad Rajabinasab, Afsaneh M. Nejad, Arthur Zimek

Pith reviewed 2026-05-25 04:51 UTC · model grok-4.3

keywords critical difference diagrams · magnitude-aware ranks · model performance evaluation · ranking statistics · performance gaps · machine learning benchmarks · statistical comparison

The pith

MARS weights discrete ranks by the size of performance gaps to address magnitude-blindness in critical difference diagrams.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper targets the limitation in standard critical difference diagrams, which rely on discrete ranks and ignore how large the actual performance differences are between machine learning models. MARS introduces a relative margin coefficient that scales those ranks according to the distance from the best performer to the worst one. Dynamic projection handles cases where models sit at the extremes. The adjusted ranks then feed into the usual critical difference calculation. This produces diagrams that better reflect real differences in large-scale experiments.
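To make the magnitude-blindness problem concrete, the following minimal sketch ranks two made-up score vectors. The scores and the helper name `ranks_desc` are assumptions for illustration only; nothing here is taken from the paper. Both vectors receive identical discrete ranks even though the gaps between models differ greatly.

```python
def ranks_desc(scores):
    """Rank 1 = highest score; tied scores share their average rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next score ties with the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        for j in range(pos, end + 1):
            ranks[order[j]] = (pos + end) / 2 + 1  # mean of 1-based positions
        pos = end + 1
    return ranks


# Hypothetical accuracies of three models on one dataset.
close_race = [0.91, 0.90, 0.89]  # tiny gaps
blowout = [0.91, 0.60, 0.20]     # same ordering, very large gaps

print(ranks_desc(close_race))  # [1.0, 2.0, 3.0]
print(ranks_desc(blowout))     # [1.0, 2.0, 3.0]
```

A rank-based procedure sees these two datasets as the same observation, which is the information loss MARS is meant to address.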

Core claim

MARS incorporates a relative margin coefficient as a weight for the discrete ranks. This coefficient scales ranks based on the distance between the best and worst performers, with a dynamic projection to handle boundary cases. Followed by the calculation of a CD value, MARS results in a more realistic statistical representation of differences of model performances.

What carries the argument

Relative margin coefficient that scales discrete ranks according to best-to-worst performance distance, with dynamic projection for boundary cases.

If this is right

Critical difference diagrams will separate models more when gaps between them are large and cluster them when gaps are small.
Slightly better methods will no longer appear equivalent to top performers if the margin is accounted for.
Large experimental comparisons will yield rankings that reflect the scale of improvements rather than order alone.
Statistical tests after MARS will distinguish meaningful performance edges from minor variations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

MARS could change which methods count as state of the art when many models post close scores on the same tasks.
The same weighting idea might apply to other rank-based tests used in algorithm comparison.
Testing MARS on synthetic performance data with controlled gap sizes would show whether the adjusted diagrams match expected significance patterns.
If the scaling changes conclusions too often, it might indicate that magnitude information should be handled separately from the rank test itself.

Load-bearing premise

Multiplying discrete ranks by a relative margin coefficient derived from the best-to-worst performance distance, together with dynamic projection for boundary cases, preserves the statistical validity of the subsequent critical difference calculation.

What would settle it

Apply both standard critical difference diagrams and MARS to the same collection of model performance scores that include both small and large gaps, then check whether the sets of models declared significantly different are the same.
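The comparison described above can be sketched as a set difference over significant pairs. The average ranks, the CD value, and the function name `significant_pairs` are all hypothetical; in particular, the `adjusted` numbers merely stand in for whatever a magnitude-aware procedure would output, and reusing one CD value for both is exactly the assumption under question.

```python
from itertools import combinations


def significant_pairs(avg_ranks, cd):
    """Return the set of method pairs whose average ranks differ by at least cd."""
    return {
        (a, b)
        for a, b in combinations(sorted(avg_ranks), 2)
        if abs(avg_ranks[a] - avg_ranks[b]) >= cd
    }


# Hypothetical average ranks for four methods (lower is better).
standard = {"A": 1.4, "B": 2.1, "C": 3.0, "D": 3.9}
adjusted = {"A": 1.1, "B": 2.6, "C": 2.9, "D": 3.8}  # stand-in for adjusted ranks
cd = 1.0  # hypothetical critical difference, shared by both procedures

std_pairs = significant_pairs(standard, cd)
adj_pairs = significant_pairs(adjusted, cd)

print("only significant under standard ranks:", sorted(std_pairs - adj_pairs))
print("only significant under adjusted ranks:", sorted(adj_pairs - std_pairs))
```

If both difference sets are empty on data with mixed gap sizes, the adjustment has not changed any conclusion; if they are not, the changed pairs are the ones to inspect.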

Figures

Figures reproduced from arXiv: 2605.23563 by Afsaneh M. Nejad, Arthur Zimek, Muhammad Rajabinasab.

**Figure 2.** Comparison for Scenario 2: Standard (Top) vs. MARS (Bot…

**Figure 5.** Comparison for Scenario 5: Standard (Top) vs. MARS (Bot…

**Figure 6.** Comparison for Scenario 6: Standard (Top) vs. MARS (Bot…

Original abstract

Comprehensive evaluation of machine learning models is the key to make sure that they perform as robustly and consistently as desired. In order to summarize the experimental results and pick a winner, Critical Difference (CD) diagrams are used. Standard CD diagrams rely on discrete ranks, discarding the magnitude of performance gaps between models, raising an issue which we call magnitude-blindness. In order to address this issue, we propose Magnitude-Aware Rank Statistics (MARS) that incorporates a relative margin coefficient as a weight for the discrete ranks. This coefficient scales ranks based on the distance between the best and worst performers, with a dynamic projection to handle boundary cases. Followed by the calculation of a CD value, MARS results in a more realistic statistical representation of differences of model performances and more insights on how methods actually perform in vast and extensive experimental settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

MARS weights ranks by magnitude but does not verify that the CD test still works.

The main takeaway is that this paper proposes weighting discrete ranks with a performance magnitude factor for use in critical difference diagrams, but it does not address how this affects the validity of the subsequent statistical test. Standard CD diagrams use ranks from the Friedman test and a critical difference threshold from the Nemenyi test to decide which models differ significantly. The new MARS method multiplies those ranks by a coefficient based on the gap between the best and worst model, with special handling for boundary cases. This is a reasonable attempt to incorporate more information from the actual performance values rather than just their ordering. The paper does well in clearly stating the magnitude-blindness problem and offering a simple fix that could be easy to implement.

However, the central assumption is that the original critical difference value can still be used after this transformation. Because the ranks are no longer integers and now depend on the observed performances, the null distribution of the average ranks changes. There is no evidence provided that the type I error rate remains controlled. The abstract mentions no experiments, no simulations under the null, and no new derivation for the threshold. This makes the soundness low, as the claim of more realistic representation depends on an untested step.

The work targets the community of machine learning researchers who rely on CD diagrams for comparing many algorithms across datasets. It could be of interest to those looking to refine their evaluation practices. A serious reader would want to see the full details on how the coefficient is computed and whether the test properties hold. I recommend sending it for peer review. Referees can check the math on the distribution and suggest if simulations are needed to validate the approach.
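For reference, the standard pipeline the letter describes (per-dataset ranks, average ranks, Nemenyi critical difference) can be sketched as follows. The score matrix is made up, the function names are illustrative, and the critical values are the alpha = 0.05 entries tabulated in Demšar (2006); this shows the baseline procedure, not MARS.

```python
import math

# Two-tailed Nemenyi critical values q_alpha at alpha = 0.05, keyed by the
# number of compared methods k (values as tabulated in Demšar, 2006).
Q_ALPHA_005 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728, 6: 2.850,
               7: 2.949, 8: 3.031, 9: 3.102, 10: 3.164}


def ranks_desc(scores):
    """Rank 1 = highest score; tied scores share their average rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        for j in range(pos, end + 1):
            ranks[order[j]] = (pos + end) / 2 + 1
        pos = end + 1
    return ranks


def average_ranks(score_rows):
    """Average each method's rank over all datasets (one row per dataset)."""
    k = len(score_rows[0])
    totals = [0.0] * k
    for row in score_rows:
        for j, r in enumerate(ranks_desc(row)):
            totals[j] += r
    return [t / len(score_rows) for t in totals]


def nemenyi_cd(k, n_datasets):
    """CD = q_alpha * sqrt(k (k + 1) / (6 N))."""
    return Q_ALPHA_005[k] * math.sqrt(k * (k + 1) / (6.0 * n_datasets))


# Hypothetical accuracies: 4 datasets (rows) x 3 methods (columns).
scores = [
    [0.90, 0.85, 0.80],
    [0.88, 0.90, 0.70],
    [0.95, 0.91, 0.60],
    [0.81, 0.80, 0.79],
]
avg = average_ranks(scores)       # [1.25, 1.75, 3.0]
cd = nemenyi_cd(k=3, n_datasets=4)
print(avg, round(cd, 3))
```

Two methods are declared significantly different when their average ranks differ by at least `cd`. Note that the last row, where all three scores are within 0.02 of each other, contributes exactly as much to the average ranks as the third row, where the spread is 0.35.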

Referee Report

2 major / 2 minor

Summary. The paper claims that standard Critical Difference (CD) diagrams suffer from magnitude-blindness by relying on discrete ranks that ignore performance gap sizes. It proposes Magnitude-Aware Rank Statistics (MARS), which weights these ranks by a relative margin coefficient derived from the best-to-worst performance distance, applies dynamic projection for boundary cases, and then computes a CD value, yielding more realistic representations of model performance differences in large experiments.

Significance. If the weighted ranks preserve the statistical properties required for valid CD inference, the approach could improve interpretation of comparative results by incorporating magnitude information. The manuscript provides no machine-checked proofs, reproducible code, or falsifiable predictions to support this.

major comments (2)

[Abstract, §3] Abstract and §3 (MARS construction): the central claim that MARS produces 'more realistic statistical representation' and valid CD diagrams rests on the unverified assumption that multiplying discrete ranks by the relative margin coefficient (plus boundary projection) preserves the null distribution underlying the Nemenyi critical difference. No derivation, closed-form adjustment, or Monte Carlo simulation of type-I error rates under the null is supplied.
[§4] §4 (experiments, if present) or evaluation section: no results are reported that compare false-positive rates of MARS-based CD tests against the nominal level when all methods are equivalent, leaving the weakest assumption untested.

minor comments (2)

[§3] Notation for the relative margin coefficient should be introduced with an explicit equation number rather than inline description.
[Abstract] The abstract states the method 'results in' improved insights but provides no quantitative comparison (e.g., changed rankings or different significant pairs) on any benchmark.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the statistical validity of MARS. We address each major comment below and outline the revisions we will make.

Referee: [Abstract, §3] Abstract and §3 (MARS construction): the central claim that MARS produces 'more realistic statistical representation' and valid CD diagrams rests on the unverified assumption that multiplying discrete ranks by the relative margin coefficient (plus boundary projection) preserves the null distribution underlying the Nemenyi critical difference. No derivation, closed-form adjustment, or Monte Carlo simulation of type-I error rates under the null is supplied.

Authors: We acknowledge that the manuscript does not contain a formal derivation or Monte Carlo study confirming preservation of the Nemenyi null distribution after weighting. The relative margin coefficient is defined to scale ranks by the observed performance range while preserving order, and the dynamic projection is introduced only to keep values within [1, k]. Nevertheless, these design choices alone do not constitute a proof of distributional invariance. In the revised manuscript we will add a dedicated subsection with Monte Carlo simulations that estimate type-I error rates under the global null (all models equivalent) for both standard CD and MARS-based CD at nominal levels 0.05 and 0.01.

revision: yes

Referee: [§4] §4 (experiments, if present) or evaluation section: no results are reported that compare false-positive rates of MARS-based CD tests against the nominal level when all methods are equivalent, leaving the weakest assumption untested.

Authors: The submitted manuscript is a methodological proposal and therefore contains no experimental section that reports false-positive rates. We agree that such verification is required before the method can be recommended for statistical inference. The revision will include controlled simulations in which all algorithms draw from identical distributions; we will tabulate empirical type-I error for MARS-CD and compare it to the nominal level and to the standard Nemenyi CD.

revision: yes
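The null simulation discussed in this exchange could look roughly like the sketch below. It estimates how often at least one pair is declared different when all methods draw from the same distribution. Everything here is an assumption for illustration: the uniform score model, the trial counts, and especially `minmax_positions`, which is a hypothetical stand-in that places each method between 1 and k in proportion to its distance from the best score. It is not the paper's relative margin coefficient, whose definition is not given on this page.

```python
import math
import random

# Two-tailed Nemenyi critical values at alpha = 0.05 (Demšar, 2006), keyed by k.
Q_ALPHA_005 = {3: 2.343, 4: 2.569, 5: 2.728}


def ranks_desc(scores):
    """Rank 1 = highest score; tied scores share their average rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        for j in range(pos, end + 1):
            ranks[order[j]] = (pos + end) / 2 + 1
        pos = end + 1
    return ranks


def minmax_positions(ranks, scores):
    """HYPOTHETICAL magnitude-aware rank: best -> 1, worst -> k, linear between."""
    best, worst = max(scores), min(scores)
    if best == worst:  # no spread, fall back to the plain ranks
        return ranks
    k = len(scores)
    return [1 + (k - 1) * (best - s) / (best - worst) for s in scores]


def familywise_error(k, n_datasets, trials, transform=None, seed=0):
    """Share of null trials in which any pair exceeds the standard Nemenyi CD."""
    rng = random.Random(seed)
    cd = Q_ALPHA_005[k] * math.sqrt(k * (k + 1) / (6.0 * n_datasets))
    false_alarms = 0
    for _ in range(trials):
        totals = [0.0] * k
        for _ in range(n_datasets):
            # Global null: every method's score comes from the same distribution.
            scores = [rng.random() for _ in range(k)]
            r = ranks_desc(scores)
            if transform is not None:
                r = transform(r, scores)
            for j in range(k):
                totals[j] += r[j]
        avg = [t / n_datasets for t in totals]
        # Some pair is flagged exactly when the extreme averages differ by >= CD.
        if max(avg) - min(avg) >= cd:
            false_alarms += 1
    return false_alarms / trials


standard_rate = familywise_error(k=5, n_datasets=20, trials=2000)
adjusted_rate = familywise_error(k=5, n_datasets=20, trials=2000,
                                 transform=minmax_positions)
print("standard ranks:", standard_rate)
print("adjusted ranks:", adjusted_rate)
```

Comparing the two printed rates against the nominal 0.05 is the check the referee asks for; any real study would plug in the actual MARS transform and vary `k`, `n_datasets`, and the score distribution.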

Circularity Check

0 steps flagged

No circularity: MARS is a direct definitional extension of ranks using observed performance margins.

The paper introduces MARS as a new weighted rank statistic constructed explicitly from raw performance values (best-to-worst distance) and a dynamic projection rule. This is a definitional proposal, not a derivation that reduces by construction to prior fitted quantities, self-citations, or renamed known results. No equations equate the output to its inputs tautologically, and the central claim (modified CD diagrams) rests on the explicit new formula rather than any load-bearing self-reference. The statistical validity concern raised by the skeptic is a separate question of null-distribution preservation, not circularity.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

The method rests on the introduction of an unspecified relative margin coefficient whose exact functional form and selection procedure are not derived from first principles.

free parameters (1)

relative margin coefficient
Weighting factor applied to discrete ranks that scales with the distance between best and worst performers; its precise definition is not given.

pith-pipeline@v0.9.0 · 5673 in / 937 out tokens · 45223 ms · 2026-05-25T04:51:16.322347+00:00 · methodology
