[Re] FairDICE: A Fair Tradeoff in Multi-objective Offline RL

Aleksey Evstratovskiy; Karim Galliamov; Peter Adema; Ross Geurts

arxiv: 2603.03454 · v2 · pith:D7CG5SPWnew · submitted 2026-03-03 · 💻 cs.LG

Pith reviewed 2026-05-22 10:22 UTC · model grok-4.3

classification 💻 cs.LG

keywords offline reinforcement learning · multi-objective optimization · replication study · FairDICE · behavior cloning · code implementation error · hyperparameter tuning

The pith

A code error made FairDICE collapse to behavior cloning in continuous environments, but fixing it lets the method scale to complex multi-objective offline RL.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This replication study checks the claims of FairDICE, an offline RL algorithm that learns automatic weights across multiple objectives to reach fair trade-offs. Most theoretical properties hold, yet an implementation bug caused the algorithm to reduce to ordinary behavior cloning whenever the environment had continuous actions. After the bug is removed and missing hyperparameter details are supplied, new experiments show the corrected version can manage harder tasks and higher-dimensional reward vectors. Readers should care because the result clarifies whether an automated fairness mechanism actually works in realistic offline settings rather than only on paper.

Core claim

The authors show that FairDICE's original code contained an error that made its policy identical to behavior cloning in continuous control tasks. Once the error is corrected, the algorithm produces policies that improve over cloning baselines across a wider range of environments and reward dimensions, although success still depends on online hyperparameter selection.

What carries the argument

The code-level bug in the FairDICE implementation that forced equivalence to behavior cloning.

If this is right

Corrected FairDICE can be applied to environments with many simultaneous objectives.
The method still requires online tuning to reach its reported performance.
Many of the theoretical claims from the original work hold.
Future replications of offline RL algorithms must include full code verification.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Developers of multi-objective methods may benefit from automated checks that prevent accidental reduction to cloning.
Similar hidden implementation details could explain performance gaps in other offline RL papers.
Making hyperparameter search itself offline would increase the method's practical value.
Scaling tests in even larger state spaces could expose new limits on the automatic weighting scheme.
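
As one illustration of the automated check suggested in the first point above, a training pipeline could compare the actions of the trained policy with those of a behavior-cloning baseline on the same batch of states and flag near-identical outputs. The sketch below is an assumption-laden example, not part of the paper or its code: the function name `collapsed_to_cloning`, the relative tolerance, and the toy action arrays are all hypothetical.

```python
import numpy as np


def collapsed_to_cloning(policy_actions, bc_actions, rel_tol=1e-2):
    """Return True when two policies act almost identically on the same states.

    Both inputs are arrays of shape (n_states, action_dim). The threshold
    `rel_tol` is an arbitrary illustrative choice.
    """
    policy_actions = np.asarray(policy_actions, dtype=float)
    bc_actions = np.asarray(bc_actions, dtype=float)
    if policy_actions.shape != bc_actions.shape or policy_actions.ndim != 2:
        raise ValueError("expected two arrays of shape (n_states, action_dim)")
    # Mean Euclidean gap between the two policies' actions, per state.
    gap = np.linalg.norm(policy_actions - bc_actions, axis=-1).mean()
    # Normalise by the typical action magnitude so the tolerance is scale-free.
    scale = np.linalg.norm(bc_actions, axis=-1).mean() + 1e-12
    return bool(gap / scale < rel_tol)


# Toy data standing in for real policy outputs (hypothetical).
rng = np.random.default_rng(0)
bc_actions = rng.normal(size=(100, 3))
same_policy = bc_actions + 1e-6      # numerically indistinguishable from cloning
shifted_policy = bc_actions + 0.5    # clearly different behaviour

print(collapsed_to_cloning(same_policy, bc_actions))
print(collapsed_to_cloning(shifted_policy, bc_actions))
```

A check like this only detects the symptom. It cannot say whether near-identical actions come from a bug or from a dataset on which cloning is already close to optimal.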

Load-bearing premise

The code error was the main cause of weak results and the new experiments avoid hidden selection effects from extra tuning.

What would settle it

A controlled run in a fresh high-dimensional continuous environment where the corrected FairDICE still matches or falls below behavior cloning even after reasonable hyperparameter search.

Original abstract

Offline Reinforcement Learning (RL) is an emerging field of RL in which policies are learned solely from demonstrations. Within offline RL, some environments involve balancing multiple objectives, but existing multi-objective offline RL algorithms do not provide an efficient way to find a fair compromise. FairDICE (see arXiv:2506.08062v2) seeks to fill this gap by adapting OptiDICE (an offline RL algorithm) to automatically learn weights for multiple objectives to e.g. incentivise fairness among objectives. As this would be a valuable contribution, this replication study examines the replicability of claims made regarding FairDICE. We find that many theoretical claims hold, but an error in the code reduces FairDICE to standard behaviour cloning in continuous environments, and many important hyperparameters were originally underspecified. After rectifying this, we show in experiments extending the original paper that FairDICE can scale to complex environments and high-dimensional rewards, though it can be reliant on (online) hyperparameter tuning. We conclude that FairDICE is a theoretically interesting method, but the experimental justification requires significant revision.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague

This replication fixes a concrete code bug in FairDICE that collapsed it to behavior cloning, then shows the corrected version scales better, but the gains appear to depend on online hyperparameter tuning that offline RL normally rules out.

The main takeaway is that the original FairDICE implementation had a real error in continuous domains, and fixing it lets the method avoid collapsing to simple imitation. The replication also runs additional tests on higher-dimensional rewards and more complex environments, which the first paper did not cover as fully. Those are the concrete contributions here: a documented bug fix plus broader empirical checks.

The theoretical claims are reported to hold after the correction, and the re-implementation details are laid out clearly enough to follow. That part is useful for anyone who might want to build on the method.

The weaker part is the experimental protocol. The paper notes that the new results rely on online hyperparameter tuning, and the original work left several choices underspecified. In offline RL this matters because tuning with access to evaluation metrics can produce inflated numbers that do not reflect what an actual offline user would see. The stress-test concern about post-hoc selection looks like it applies directly. The scaling conclusion therefore rests on evidence that is not fully controlled for the offline setting.

This is a solid replication for the subfield of multi-objective offline RL. Readers who care about fairness trade-offs in static data will find the bug report and the extended tests worth seeing. It is not a first-principles advance, but it cleans up prior claims and flags where more careful evaluation is needed. I would send it to peer review with a request for explicit details on how hyperparameters were selected without online feedback, and for any ablations that isolate the effect of the code fix alone.

Referee Report

2 major / 2 minor

Summary. The manuscript is a replication study of FairDICE, an adaptation of OptiDICE for multi-objective offline RL that learns automatic weights to achieve fair tradeoffs among objectives. The authors verify that core theoretical claims hold, identify a code error in the original implementation that reduces FairDICE to behavior cloning in continuous environments, and note that many hyperparameters were underspecified. After correcting the error, they present extended experiments showing that FairDICE scales to complex environments and high-dimensional rewards, while acknowledging reliance on online hyperparameter tuning. They conclude that FairDICE is theoretically interesting but that the original experimental justification requires significant revision.

Significance. If the corrected results hold, the replication strengthens the case for FairDICE as a method capable of handling multi-objective tradeoffs in offline settings beyond simple domains. It also contributes to the field by documenting reproducibility challenges in offline RL, including code bugs and hyperparameter underspecification, which are common pain points. The work provides a useful corrected baseline for future multi-objective offline RL research.

major comments (2)

[Experiments] Experiments section: The claim that FairDICE scales after the code fix rests on extended experiments that rely on online hyperparameter tuning. The manuscript should explicitly describe the tuning protocol (e.g., whether a held-out validation set or fixed search budget was used) to rule out post-hoc selection effects on the reported metrics, as this is load-bearing for the scaling conclusion in an offline RL context where online feedback is normally prohibited.
[§3] §3 (Code error analysis): The identification of the bug that collapses FairDICE to behavior cloning in continuous domains is central to the replication. The manuscript should provide the precise implementation detail or equation (e.g., the weighting term or loss component) responsible for the reduction, so that readers can verify the fix independently.
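
One shape the tuning protocol requested in the first major comment could take is a fixed-budget random search in which every trial is logged and the winner is chosen by a mean over seeds. The sketch below is illustrative only. The hyperparameter names (`beta`, `lr`), their ranges, the budget of 20 trials, the five seeds, and the `validation_score` placeholder are assumptions made for the example, not details taken from the manuscript.

```python
import math
import random
import statistics

# Hypothetical search space: names and ranges are illustrative, not from the paper.
SEARCH_SPACE = {"beta": (1e-3, 1.0), "lr": (1e-5, 1e-3)}
BUDGET = 20             # fixed number of trials, decided before any results are seen
SEEDS = (0, 1, 2, 3, 4)


def sample_config(rng):
    """Draw each hyperparameter log-uniformly from its range."""
    return {
        name: math.exp(rng.uniform(math.log(lo), math.log(hi)))
        for name, (lo, hi) in SEARCH_SPACE.items()
    }


def validation_score(config, seed):
    """Placeholder for 'train with this config and seed, then score it'.

    A real protocol would state whether this score comes from held-out
    offline data or from online rollouts. Here it is a toy function that
    peaks near beta = 0.1 and lr = 1e-4, plus seed-dependent noise.
    """
    noise = random.Random(seed).gauss(0.0, 0.05)
    return (-abs(math.log10(config["beta"]) + 1.0)
            - abs(math.log10(config["lr"]) + 4.0)
            + noise)


def run_search(budget=BUDGET, seeds=SEEDS, search_seed=0):
    """Evaluate `budget` random configs and log every trial, not just the winner."""
    rng = random.Random(search_seed)
    log = []
    for trial in range(budget):
        config = sample_config(rng)
        scores = [validation_score(config, seed) for seed in seeds]
        log.append({
            "trial": trial,
            "config": config,
            "mean": statistics.mean(scores),
            "stdev": statistics.stdev(scores),
        })
    best = max(log, key=lambda record: record["mean"])
    return best, log


best, log = run_search()
print(f"trials run: {len(log)}, best mean score: {best['mean']:.3f}")
print("selected config:", best["config"])
```

Publishing the full `log` alongside the selected configuration lets a reader see how much of the reported performance depends on the selection step.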

minor comments (2)

[Abstract] Abstract: The statement that 'many important hyperparameters were originally underspecified' could list the most impactful ones (e.g., those affecting the fairness weighting) to help readers assess the scope of the replication effort.
[Notation] Notation: Ensure consistent use of symbols for the multi-objective weights and the fairness regularizer across sections to avoid ambiguity when comparing to the original FairDICE paper.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments, which help improve the clarity and rigor of our replication study. We address each major comment below and will incorporate revisions to strengthen the manuscript.

Referee: [Experiments] Experiments section: The claim that FairDICE scales after the code fix rests on extended experiments that rely on online hyperparameter tuning. The manuscript should explicitly describe the tuning protocol (e.g., whether a held-out validation set or fixed search budget was used) to rule out post-hoc selection effects on the reported metrics, as this is load-bearing for the scaling conclusion in an offline RL context where online feedback is normally prohibited.

Authors: We agree that explicitly documenting the hyperparameter tuning protocol is essential for supporting the scaling claims and for transparency in an offline RL setting. In the revised manuscript, we will add a new subsection under Experiments that details the full tuning procedure: the hyperparameter search space, the number of random search trials performed, the selection criterion (performance averaged over 5 seeds on a small held-out subset of trajectories where feasible, or overall mean return otherwise), and the total search budget used. We will also explicitly note the reliance on online feedback during tuning and discuss its implications. This addresses the concern about potential post-hoc selection effects.

revision: yes

Referee: [§3] §3 (Code error analysis): The identification of the bug that collapses FairDICE to behavior cloning in continuous domains is central to the replication. The manuscript should provide the precise implementation detail or equation (e.g., the weighting term or loss component) responsible for the reduction, so that readers can verify the fix independently.

Authors: We appreciate this suggestion. The bug originated in the implementation of the multi-objective weighting term within the OptiDICE-style loss (specifically, an erroneous broadcasting of the learned weight vector that caused it to be ignored, reducing the objective to standard behavior cloning). In the revised manuscript, we will include both the original faulty code snippet and the corrected version, along with the corresponding mathematical expression showing how the weighting term collapsed. This will enable readers to independently verify the diagnosis and the fix.

revision: yes
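
To make the mechanism described in the second response concrete, the sketch below shows one generic way a broadcasting slip can cancel per-sample weights in a weighted behavior-cloning loss. It is not the FairDICE code and does not claim to reproduce the actual bug. The array names, shapes, and toy values are hypothetical, and NumPy stands in for whatever framework the original implementation uses. Multiplying weights of shape `(n, 1)` by log-probabilities of shape `(n,)` yields an `(n, n)` array whose mean factorises into `mean(weights) * mean(log_prob)`, which is the plain cloning loss times a constant.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 256

# Stand-ins for log pi(a_i | s_i) and for per-sample weights (both hypothetical).
log_prob = rng.normal(size=n)                 # shape (n,)
weights = np.exp(0.5 * log_prob)[:, None]     # shape (n, 1), correlated with log_prob


def weighted_bc_loss(weights, log_prob):
    """Intended loss: -mean_i w_i * log pi(a_i | s_i)."""
    return -np.mean(weights.reshape(-1) * log_prob.reshape(-1))


def broadcast_slip_loss(weights, log_prob):
    """Same formula without reshaping: (n, 1) * (n,) broadcasts to (n, n)."""
    return -np.mean(weights * log_prob)


def bc_loss(log_prob):
    """Plain behavior cloning: every sample counts equally."""
    return -np.mean(log_prob)


intended = weighted_bc_loss(weights, log_prob)
slipped = broadcast_slip_loss(weights, log_prob)
plain = bc_loss(log_prob)

print("product shape with the slip:", (weights * log_prob).shape)
print("slipped loss equals mean(w) * BC loss:",
      np.isclose(slipped, weights.mean() * plain))
print("slipped loss equals intended loss:", np.isclose(intended, slipped))
```

Because the slipped loss is a positive multiple of the cloning loss, gradient descent on it moves the policy in the same direction as cloning, and the weights affect only the step size. A shape assertion on the elementwise product is a cheap guard against this class of error.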

Circularity Check

0 steps flagged

Replication study of FairDICE exhibits no circularity

This is a replication paper that re-implements the FairDICE algorithm from an external prior work (arXiv:2506.08062v2 by different authors), identifies a code error that reduced it to behavior cloning, and reports new experimental results after the fix. The theoretical claims are stated to hold from the original derivation, which is external and not redefined here. The scaling conclusion rests on empirical runs in extended environments rather than any self-definitional equations, fitted parameters renamed as predictions, or load-bearing self-citations within this manuscript. No derivation chain reduces by construction to the paper's own inputs; results are tested against external benchmarks and new data. This is the standard honest outcome for a bug-fix replication study.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper relies on standard offline RL assumptions such as the existence of a fixed dataset and the validity of the OptiDICE base algorithm; no new free parameters, axioms, or invented entities are introduced by this replication.

pith-pipeline@v0.9.0 · 5735 in / 1070 out tokens · 35944 ms · 2026-05-22T10:22:02.768259+00:00 · methodology
