GF-Score: Certified Class-Conditional Robustness Evaluation with Fairness Guarantees

Arya Shah; Kaveri Visavadiya; Manisha Padala

arxiv: 2604.12757 · v1 · submitted 2026-04-14 · 💻 cs.LG · cs.AI


Pith reviewed 2026-05-10 16:22 UTC · model grok-4.3


keywords certified robustness · class-conditional robustness · fairness in robustness · adversarial robustness evaluation · self-calibration · neural networks · image classification


The pith

Certified robustness can be exactly decomposed into per-class profiles with fairness disparity metrics and attack-free calibration.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to break an overall certified robustness score into precise scores for each individual class. This split allows four fairness-inspired metrics to quantify how much protection varies from one class to another. The calibration of a key parameter relies only on patterns in clean data accuracy rather than any adversarial testing. If successful, this creates a practical way to audit whether all classes receive equal robustness guarantees in neural network models.

Core claim

The decomposition of the certified robustness score into per-class robustness profiles is exact and enables quantification of disparity via four welfare-economics metrics. Dependence on adversarial attacks is eliminated by self-calibrating the temperature parameter solely from clean accuracy correlations. Testing on multiple models for image classification datasets confirms the exact decomposition, identifies consistently vulnerable classes, and notes greater disparity in more robust models.

What carries the argument

Exact decomposition of aggregate certified robustness into per-class profiles combined with self-calibration from clean accuracy correlations.

If this is right

The per-class profiles recombine exactly to the original aggregate score.
Disparity can be measured using metrics that capture worst-case and distribution aspects without attack data.
Certain classes emerge as systematically weaker across a range of models.
Higher overall robustness tends to coincide with larger class-to-class differences.
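The recombination property in the first point can be sketched in a few lines. This is a minimal illustration, assuming the aggregate score is a plain sample mean of per-sample certified scores; the scores, labels, and function names below are hypothetical and not taken from the paper's code. Under that assumption the aggregate equals the class-frequency-weighted average of the per-class means, which reduces to a simple average when classes are balanced.

```python
from collections import defaultdict

def per_class_profile(scores, labels):
    """Return {class: (mean score, sample count)} from per-sample scores."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for score, label in zip(scores, labels):
        sums[label] += score
        counts[label] += 1
    return {c: (sums[c] / counts[c], counts[c]) for c in sums}

def recombine(profile):
    """Class-frequency-weighted average of the per-class means."""
    total = sum(count for _, count in profile.values())
    return sum(mean * count for mean, count in profile.values()) / total

# Hypothetical per-sample certified scores and class labels.
scores = [0.5, 0.25, 0.75, 1.0, 0.0, 0.5]
labels = ["cat", "cat", "dog", "dog", "cat", "ship"]

profile = per_class_profile(scores, labels)
aggregate = sum(scores) / len(scores)

print(profile)
print(aggregate, recombine(profile))
```

If the paper's aggregate is defined differently (for example with per-sample weights), the same identity holds only with the matching weights.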

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Auditors could apply these metrics to enforce balanced robustness in deployed systems.
The calibration technique may transfer to other certification methods that use similar parameters.
Training procedures could be adjusted to reduce disparity for identified weak classes.
The findings suggest a need for class-aware robustness benchmarks in future evaluations.

Load-bearing premise

The procedure that tunes the temperature parameter using only clean accuracy correlations produces accurate and unbiased certified per-class scores.

What would settle it

Computing per-class certified scores with full adversarial calibration and observing large differences from the self-calibrated versions would show the method is not equivalent.

Figures

Figures reproduced from arXiv: 2604.12757 by Arya Shah, Kaveri Visavadiya, Manisha Padala.

**Figure 2.** Aggregate GREAT Score vs. RDI. Higher robustness correlates with greater class-level ...

**Figure 3.** Disparity metric bar charts. Models are sorted by aggregate GREAT Score. Higher RDI ...

**Figure 4.** Class vulnerability analysis. On CIFAR-10, “cat” appears as the worst-case class in 13 of ...

**Figure 5.** FP-GREAT re-ranking. On CIFAR-10, models with low disparity (e.g., Wu2020) rise, ...

**Figure 6.** Self-calibration curves. Both curves are smooth, confirming that the two-phase grid search ...

**Figure 7.** RDI concentration bounds. The bound tightens as per-class sample size increases. Our ...

**Figure 8.** Radar chart of per-class GREAT Scores for selected CIFAR-10 models. Each axis represents ...

**Figure 9.** Per-class GREAT Score heatmap for ImageNet models, analogous to ...

Original abstract

Adversarial robustness is essential for deploying neural networks in safety-critical applications, yet standard evaluation methods either require expensive adversarial attacks or report only a single aggregate score that obscures how robustness is distributed across classes. We introduce the GF-Score (GREAT-Fairness Score), a framework that decomposes the certified GREAT Score into per-class robustness profiles and quantifies their disparity through four metrics grounded in welfare economics: the Robustness Disparity Index (RDI), the Normalized Robustness Gini Coefficient (NRGC), Worst-Case Class Robustness (WCR), and a Fairness-Penalized GREAT Score (FP-GREAT). The framework further eliminates the original method's dependence on adversarial attacks through a self-calibration procedure that tunes the temperature parameter using only clean accuracy correlations. Evaluating 22 models from RobustBench across CIFAR-10 and ImageNet, we find that the decomposition is exact, that per-class scores reveal consistent vulnerability patterns (e.g., "cat" is the weakest class in 76% of CIFAR-10 models), and that more robust models tend to exhibit greater class-level disparity. These results establish a practical, attack-free auditing pipeline for diagnosing where certified robustness guarantees fail to protect all classes equally. We release our code on GitHub: https://github.com/aryashah2k/gf-score.
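The abstract names four disparity metrics, but this page does not reproduce their formulas. As a rough illustration only, the sketch below computes the kind of quantities the names suggest from a dictionary of per-class scores. The Gini coefficient is the standard one and the worst-case score is a plain minimum; the `gap` stand-in for a disparity index, the `n / (n - 1)` normalization, and the multiplicative penalty are assumptions made for this example and may differ from the paper's RDI, NRGC, and FP-GREAT definitions.

```python
def gini(values):
    """Standard Gini coefficient of non-negative values (0 means equal)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted form: G = sum_i (2i - n - 1) * x_i / (n * sum(x)), i = 1..n
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1))
    return weighted / (n * total)

def disparity_summary(class_scores, penalty=1.0):
    """Summarize per-class scores; all metric definitions here are assumed."""
    vals = list(class_scores.values())
    n = len(vals)
    mean = sum(vals) / n
    worst_class = min(class_scores, key=class_scores.get)
    # Dividing by the maximum attainable Gini, (n - 1) / n, maps it to [0, 1].
    gini_norm = gini(vals) * n / (n - 1) if n > 1 else 0.0
    return {
        "worst_class": worst_class,
        "worst_case_score": class_scores[worst_class],
        "gap": max(vals) - min(vals),             # assumed stand-in for a disparity index
        "gini_normalized": gini_norm,             # assumed normalization
        "penalized_mean": mean * (1 - penalty * gini_norm),  # hypothetical penalty form
    }

# Hypothetical per-class certified scores.
summary = disparity_summary({"cat": 0.2, "dog": 0.4, "ship": 0.6})
print(summary)
```

Whatever the exact formulas, the design point is the same: each metric is a deterministic function of the per-class profile, so no adversarial attack is needed once the profile is available.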

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

GF-Score decomposes certified robustness into per-class scores and adds welfare-economics fairness metrics, but the clean-data self-calibration step lacks a clear argument that the original certification bounds survive the temperature tuning.


The main thing to know is that the paper takes the existing GREAT certified score, splits it exactly by class, and defines four new disparity measures (RDI, NRGC, WCR, FP-GREAT) drawn from welfare economics. It also replaces the usual adversarial attack step with a temperature calibration that uses only clean accuracy correlations on the evaluation set itself. They run this on 22 RobustBench models for CIFAR-10 and ImageNet and report that some classes (cat on CIFAR-10) are repeatedly the weakest while stronger overall models tend to show larger class gaps. The code release helps anyone who wants to check the numbers directly.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces the GF-Score framework, which decomposes the certified GREAT Score into exact per-class robustness profiles and quantifies disparity via four welfare-economics-inspired metrics: Robustness Disparity Index (RDI), Normalized Robustness Gini Coefficient (NRGC), Worst-Case Class Robustness (WCR), and Fairness-Penalized GREAT Score (FP-GREAT). A central innovation is a self-calibration procedure that tunes the temperature parameter solely from clean accuracy correlations, thereby removing dependence on adversarial attacks. The method is evaluated on 22 models from RobustBench on CIFAR-10 and ImageNet, reporting that the decomposition is exact, that certain classes (e.g., 'cat') are consistently weakest, and that higher aggregate robustness correlates with greater class-level disparity.

Significance. If the self-calibration is rigorously shown to preserve certification validity, the work supplies a practical, attack-free auditing pipeline for class-conditional certified robustness and fairness that goes beyond single aggregate scores. The public release of code on GitHub is a clear strength that supports reproducibility and further use of the RDI/NRGC/WCR/FP-GREAT metrics.

major comments (3)

[Abstract / calibration procedure] Abstract and calibration procedure: the central claim that temperature tuning via clean-accuracy correlations on the same evaluation data produces valid certified per-class GREAT scores (i.e., that the original Lipschitz/concentration bounds still hold) is asserted without any derivation, theorem statement, or verification that the proxy choice satisfies the conditions of the base certified method. This is load-bearing for the 'attack-free' guarantee.
[Abstract] Abstract: the statement that 'the decomposition is exact' is presented as a finding, yet no supporting theorem, proof sketch, or equation showing how the per-class profiles are obtained from the aggregate GREAT Score while preserving certification is supplied, leaving the exactness claim unverified.
[Evaluation section] Evaluation on 22 models: while patterns such as 'cat' being weakest in 76% of CIFAR-10 models are reported, no table or section quantifies the tightness of the resulting certified bounds or compares them against the original attack-dependent procedure, so it is unclear whether the self-calibrated scores remain conservative.

minor comments (2)

[Abstract] The four new metrics (RDI, NRGC, WCR, FP-GREAT) are introduced without explicit formulas or pseudocode in the abstract; adding these in the main text would improve clarity.
[Method] The manuscript would benefit from a short discussion of how the self-calibration temperature is chosen in practice (e.g., optimization objective or closed-form expression) to allow readers to reproduce the procedure exactly.
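The second minor comment asks how the temperature is chosen in practice. The page does not say, beyond "clean accuracy correlations" and a figure caption that mentions a two-phase grid search. The sketch below shows one plausible reading under stated assumptions: the objective is the Spearman rank correlation between each model's score and its clean accuracy, and the search runs a coarse grid followed by a finer grid around the coarse optimum. The scoring function `scores_at`, the margins, the accuracies, and the grid sizes are all hypothetical.

```python
import math

def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out

def spearman(a, b):
    """Spearman correlation, computed as the Pearson correlation of ranks."""
    ra, rb = ranks(a), ranks(b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    return cov / math.sqrt(va * vb) if va > 0 and vb > 0 else 0.0

def linspace(lo, hi, n):
    return [lo + (hi - lo) * k / (n - 1) for k in range(n)]

def two_phase_search(objective, lo, hi, coarse=9, fine=21):
    """Coarse grid over [lo, hi], then a finer grid around the coarse best."""
    best = max(linspace(lo, hi, coarse), key=objective)
    step = (hi - lo) / (coarse - 1)
    fine_grid = linspace(max(lo, best - step), min(hi, best + step), fine)
    fine_grid.append(best)  # the fine phase can never do worse than the coarse one
    return max(fine_grid, key=objective)

# Hypothetical per-model confidence margins and clean accuracies.
MARGINS = [[0.2, 0.3, 2.5], [0.9, 1.0, 1.1], [0.1, 0.1, 0.2], [1.5, 1.6, 0.1]]
CLEAN_ACC = [0.84, 0.88, 0.79, 0.91]

def scores_at(temperature):
    """Toy temperature-dependent score per model (an assumed form)."""
    return [sum(1 - math.exp(-m / temperature) for m in ms) / len(ms)
            for ms in MARGINS]

def objective(temperature):
    return spearman(scores_at(temperature), CLEAN_ACC)

best_t = two_phase_search(objective, 0.1, 5.0)
print(best_t, objective(best_t))
```

Note that a rank correlation is piecewise constant in the temperature, so many grid points can tie; a reproducible description of the procedure would also need to state the tie-breaking rule.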

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, acknowledging where the manuscript requires strengthening and outlining the specific revisions we will implement.


Referee: [Abstract / calibration procedure] Abstract and calibration procedure: the central claim that temperature tuning via clean-accuracy correlations on the same evaluation data produces valid certified per-class GREAT scores (i.e., that the original Lipschitz/concentration bounds still hold) is asserted without any derivation, theorem statement, or verification that the proxy choice satisfies the conditions of the base certified method. This is load-bearing for the 'attack-free' guarantee.

Authors: We agree that the validity of the self-calibration must be formally established rather than asserted. The current manuscript does not contain a derivation or theorem confirming that bounds are preserved. In the revised manuscript we will add a new subsection (Section 3.3) containing a theorem and proof sketch. The theorem states that selecting the temperature via correlation with clean accuracy on the evaluation set preserves the Lipschitz constant and concentration inequalities of the base certified method, because the calibration operates exclusively on clean-data statistics and does not alter the model's certified radius computation. We will also include a short empirical check on three models showing that the self-calibrated bounds remain conservative relative to the attack-dependent baseline. revision: yes
Referee: [Abstract] Abstract: the statement that 'the decomposition is exact' is presented as a finding, yet no supporting theorem, proof sketch, or equation showing how the per-class profiles are obtained from the aggregate GREAT Score while preserving certification is supplied, leaving the exactness claim unverified.

Authors: The claim of exact decomposition is currently stated without supporting formalism. We will revise the manuscript by inserting an explicit equation and brief proof in Section 3.2. The equation defines the per-class GREAT score as the certified robustness computed independently on the subset of examples belonging to that class; the aggregate score is their average. The short proof shows that because each per-class certification applies the identical Lipschitz and concentration arguments to its own data subset, the individual bounds remain valid and the decomposition does not relax any guarantee. This addition will make the exactness claim verifiable. revision: yes
Referee: [Evaluation section] Evaluation on 22 models: while patterns such as 'cat' being weakest in 76% of CIFAR-10 models are reported, no table or section quantifies the tightness of the resulting certified bounds or compares them against the original attack-dependent procedure, so it is unclear whether the self-calibrated scores remain conservative.

Authors: We accept that the evaluation section lacks a direct tightness comparison. In the revision we will add a new table (Table 4) and accompanying paragraph that reports, for five representative models on CIFAR-10, both the self-calibrated per-class GREAT scores and the scores obtained with the original attack-dependent temperature tuning. The table will include the mean absolute difference in certified values and the fraction of classes for which the self-calibrated bound is at least as tight as the attack-based bound, thereby demonstrating conservatism. revision: yes
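The two summary statistics proposed for the comparison table in the last response can be sketched as follows. This is an illustration under an assumption about what "conservative" means: here a class counts as conservative when its self-calibrated score does not exceed the attack-calibrated score. The function name and all numbers are hypothetical.

```python
def compare_calibrations(self_cal, attack_cal):
    """Compare two {class: certified score} dictionaries over the same classes."""
    classes = sorted(self_cal)
    if classes != sorted(attack_cal):
        raise ValueError("both calibrations must cover the same classes")
    diffs = [abs(self_cal[c] - attack_cal[c]) for c in classes]
    # Assumed reading of "conservative": self-calibrated score <= attack-based score.
    conservative = [c for c in classes if self_cal[c] <= attack_cal[c]]
    return {
        "mean_abs_diff": sum(diffs) / len(diffs),
        "fraction_conservative": len(conservative) / len(classes),
    }

# Hypothetical per-class scores for one model under the two calibrations.
self_cal = {"cat": 0.25, "dog": 0.5, "ship": 0.75, "frog": 0.5}
attack_cal = {"cat": 0.5, "dog": 0.5, "ship": 0.5, "frog": 1.0}

result = compare_calibrations(self_cal, attack_cal)
print(result)
```

A fraction below 1.0 would flag classes where the self-calibrated score is larger than the attack-based one, which is exactly the case the referee's third comment is concerned about.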

Circularity Check

1 step flagged

Self-calibration tunes temperature from clean-data correlations, reducing claimed certified per-class scores to a fitted quantity

specific steps

fitted input called prediction [Abstract]
"The framework further eliminates the original method's dependence on adversarial attacks through a self-calibration procedure that tunes the temperature parameter using only clean accuracy correlations."

The temperature parameter is fitted directly to clean accuracy correlations computed on the identical dataset and models used to produce the per-class robustness profiles and the four fairness metrics. Consequently the reported 'certified' GF-Score values and disparity quantifiers are not independent bounds but are statistically forced by the same clean-data fit that the calibration procedure employs.

full rationale

The paper presents GF-Score as delivering exact decomposition of certified GREAT scores into per-class profiles plus attack-free fairness metrics via self-calibration of the temperature parameter. The abstract explicitly describes this calibration as using only clean accuracy correlations on the evaluation data. This matches the fitted-input-called-prediction pattern: the key hyperparameter controlling the certified bounds is adjusted to the same clean-data statistics that the per-class scores and disparity indices (RDI, NRGC, WCR, FP-GREAT) are later computed on. While the decomposition step itself may be algebraic, the certification claim loses independence once the smoothing parameter is data-fitted rather than fixed by external Lipschitz or concentration assumptions. No other circular patterns (self-citation load-bearing, ansatz smuggling, etc.) are evident from the provided text.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 4 invented entities

The framework rests on the validity of the prior GREAT Score, assumes exact decomposability into classes, and introduces a fitted temperature parameter plus four new metrics without independent evidence outside the evaluation.

free parameters (1)

temperature parameter
Tuned via self-calibration on clean accuracy correlations to eliminate attack dependence.

axioms (1)

domain assumption: Certified GREAT Score admits exact per-class decomposition
Invoked as the basis for the GF-Score profiles.

invented entities (4)

Robustness Disparity Index (RDI) · no independent evidence
purpose: Quantify disparity in per-class robustness
New metric introduced in the framework.
Normalized Robustness Gini Coefficient (NRGC) · no independent evidence
purpose: Measure inequality of robustness across classes
New metric introduced in the framework.
Worst-Case Class Robustness (WCR) · no independent evidence
purpose: Identify the least robust class
New metric introduced in the framework.
Fairness-Penalized GREAT Score (FP-GREAT) · no independent evidence
purpose: Overall score adjusted for class disparity
New metric introduced in the framework.



Reference graph

Works this paper leans on

3 extracted references · 3 canonical work pages

[1]

Anish Athalye, Nicholas Carlini, and David Wagner

URL https://arxiv.org/abs/2503.16179. Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples, 2018. URL https://arxiv.org/abs/1802.00420. Philipp Benz, Chaoning Zhang, Adil Karjauv, and In So Kweon. Robustness may be at odds with fairness: An empirical study o...

work page arXiv 2018
[2]

Hadi Salman, Greg Yang, Jerry Li, Pengchuan Zhang, Huan Zhang, Ilya Razenshteyn, and Sebastien Bubeck

URL https://arxiv.org/abs/1911.08731. Hadi Salman, Greg Yang, Jerry Li, Pengchuan Zhang, Huan Zhang, Ilya Razenshteyn, and Sebastien Bubeck. Provably robust deep learning via adversarially trained smoothed classifiers, 2020. URL https://arxiv.org/abs/1906.04584. C Spearman. The proof and measurement of association between two things. Int. J. Epidemiol., 3...

work page doi:10.1145/3219819.3220046 1911
[3]

Fast is better than free: Revisiting adversarial training

URL https://arxiv.org/abs/2001.03994. Dongxian Wu, Shu tao Xia, and Yisen Wang. Adversarial weight perturbation helps robust generalization, 2020. URL https://arxiv.org/abs/2004.05884. Han Xu, Xiaorui Liu, Yaxin Li, Anil K. Jain, and Jiliang Tang. To be robust or to be fair: Towards fairness in adversarial training, 2021. URL https://arxiv.org/abs/20...

work page arXiv 2001
