pith. machine review for the scientific record.

arxiv: 2604.00904 · v2 · submitted 2026-04-01 · 💻 cs.LG


Fatigue-Aware Learning to Defer via Constrained Optimisation


Pith reviewed 2026-05-13 22:37 UTC · model grok-4.3

classification 💻 cs.LG
keywords learning to defer · human-AI collaboration · fatigue modeling · constrained Markov decision process · reinforcement learning · workload-aware optimization

The pith

FALCON models human fatigue via workload curves in a constrained MDP to improve learning-to-defer decisions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard learning-to-defer systems assume human experts maintain fixed accuracy, yet psychological findings show that performance degrades with cumulative workload. The paper introduces FALCON to close this gap, building fatigue curves that link workload to accuracy into the deferral policy. It casts the problem as a constrained Markov decision process whose state tracks task features plus cumulative human workload, then trains the policy with PPO-Lagrangian to maximize accuracy while respecting cooperation budgets. This matters for sustained human-AI teams, where fatigue accumulates over time: the system can defer adaptively rather than follow a static rule. Experiments show consistent gains over prior methods, zero-shot transfer to new experts, and an advantage over AI-only and human-only baselines whenever decisions must be shared between the AI and the human.

Core claim

FALCON formulates learning to defer as a Constrained Markov Decision Process whose state includes both task features and cumulative human workload, uses psychologically grounded fatigue curves to model how human accuracy declines with workload, and optimizes the policy via PPO-Lagrangian training to maximize accuracy under explicit human-AI cooperation budgets.
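The workload-augmented formulation above can be sketched as a minimal transition function. The exponential fatigue shape, the AI accuracy of 0.8, and the unit workload increment below are illustrative assumptions, not the paper's specification.

```python
import math
import random

def human_accuracy(rho, w0=0.9, w_base=0.7, k=0.1):
    # Hypothetical fatigue curve: human accuracy decays from w0 toward
    # w_base as cumulative workload rho grows at rate k (stand-in shape).
    return w_base + (w0 - w_base) * math.exp(-k * rho)

def step(state, defer, ai_accuracy=0.8):
    """One CMDP transition with state = (task_features, cumulative_workload)."""
    features, rho = state
    if defer:  # route the task to the human expert
        correct = random.random() < human_accuracy(rho)
        rho += 1.0   # workload accumulates only when the human is used
        cost = 1.0   # counted against the human-AI cooperation budget
    else:      # the AI acts autonomously; workload is unchanged
        correct = random.random() < ai_accuracy
        cost = 0.0
    reward = 1.0 if correct else 0.0
    next_features = [random.random() for _ in range(4)]  # placeholder task
    return (next_features, rho), reward, cost
```

PPO-Lagrangian would then maximize expected reward while keeping the expected episode cost under the cooperation budget.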

What carries the argument

Constrained Markov decision process (CMDP) whose state augments task features with cumulative workload, paired with fatigue curves that map workload to human accuracy and optimized by PPO-Lagrangian.

If this is right

  • Adaptive policies outperform state-of-the-art L2D methods at every coverage level tested.
  • Zero-shot generalization holds to unseen experts whose fatigue patterns differ from those seen in training.
  • When coverage must lie strictly between 0 and 1, the fatigue-aware policy yields higher accuracy than either an AI-only or human-only baseline.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Deployed systems could feed live workload estimates from sensors or task logs into the state to keep the policy current.
  • The same CMDP-plus-fatigue structure could extend to other human-in-the-loop settings such as medical image review or moderation queues.
  • One could test whether a policy trained on one family of fatigue curves transfers to a different family without retraining.

Load-bearing premise

Psychologically grounded fatigue curves accurately capture how human accuracy degrades with cumulative workload in the specific decision-deferral tasks studied.
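The extracted text names the curve parameters (w0, wpeak, wbase, k) but not the equation, so the shape below, a brief warm-up to a peak followed by exponential decay toward a floor, is a guess at the family consistent with the parameter names in Figure 2, not the paper's w(ρ).

```python
import math

def fatigue_curve(rho, w0=0.9, w_peak=1.0, w_base=0.7, k=0.1, rho_peak=0.375):
    # Hypothetical w(rho): warm-up from w0 to w_peak, then exponential
    # decay toward the floor w_base at rate k. Parameter names follow
    # Figure 2; the functional form itself is an assumption.
    if rho <= rho_peak:
        return w0 + (w_peak - w0) * (rho / rho_peak)
    return w_base + (w_peak - w_base) * math.exp(-k * (rho - rho_peak))
```

The defaults mirror the first parameter tuple reported in Figure 2's Example 1.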

What would settle it

An experiment that measures real human accuracy degradation under increasing workload in the studied tasks and finds it deviates substantially from the modeled fatigue curves, or a direct comparison showing FALCON loses its performance edge once fatigue is present.

Figures

Figures reproduced from arXiv: 2604.00904 by Cuong C. Nguyen, David Rosewarne, Gustavo Carneiro, Kevin Wells, Zheng Zhang.

Figure 1: Example of an L2D scenario illustrating workload-variant human performance in human-AI task allocation.

Figure 2: (a) Examples of w(ρ). The parameter values (w0, wpeak, wbase, k, ρ̄, ρ̂) in Examples 1, 2, and 3 are (0.9, 1, 0.7, 0.1, 0.375, 0.05), (0.8, 0.95, 0.5, 0.09, 0.5, 0.025), and (0.8, 0.9, 0.6, 0.2, 0.6, 0.1). (b) The architecture of FALCON with workload-variant human performance. A backbone model extracts visual features from the input xt, while the cumulative human workload ρt is passed through an embeddi…

Figure 3: Human performance vs. cumulative workload curves on various datasets. The blue and red lines denote the …

Figure 4: Training time of FALCON and competing methods on Cifar100 (1e7 iterations).

Figure 5: Inference time of FALCON and competing methods on Cifar100 (50 episodes).

Figure 6: Accuracy-coverage curves of several L2D strategies and FALCON on various datasets.

Figure 7: Different human performance during testing (left column) and corresponding results with fine-tuning (middle …)

Figure 8: Accuracy-coverage curves of the CMDP ablation on the Cifar100 dataset.

Figure 9: Validation against clinical human performance data. (a) Comparison between simulated human performance …
Original abstract

Learning to defer (L2D) enables human-AI cooperation by deciding when an AI system should act autonomously or defer to a human expert. Existing L2D methods, however, assume static human performance, contradicting well-established findings on fatigue-induced degradation. We propose Fatigue-Aware Learning to Defer via Constrained Optimisation (FALCON), which explicitly models workload-varying human performance using psychologically grounded fatigue curves. FALCON formulates L2D as a Constrained Markov Decision Process (CMDP) whose state includes both task features and cumulative human workload, and optimises accuracy under human-AI cooperation budgets via PPO-Lagrangian training. We further introduce FA-L2D, a benchmark that systematically varies fatigue dynamics from near-static to rapidly degrading regimes. Experiments across multiple datasets show that FALCON consistently outperforms state-of-the-art L2D methods across coverage levels, generalises zero-shot to unseen experts with different fatigue patterns, and demonstrates the advantage of adaptive human-AI collaboration over AI-only or human-only decision-making when coverage lies strictly between 0 and 1.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes FALCON, which formulates learning to defer as a Constrained Markov Decision Process whose state augments task features with cumulative human workload, models human accuracy via psychologically grounded parametric fatigue curves, and optimizes accuracy subject to cooperation budgets using PPO-Lagrangian. It introduces the FA-L2D benchmark that varies fatigue dynamics from near-static to rapid decay and reports that FALCON outperforms prior L2D methods across coverage levels, generalizes zero-shot to unseen fatigue parameters, and yields better performance than AI-only or human-only baselines when coverage is strictly between 0 and 1.

Significance. If the fatigue curves accurately reflect real workload-induced degradation and transfer across experts, the work would meaningfully extend L2D beyond static human-performance assumptions and provide a reproducible benchmark for testing robustness to fatigue variation. The use of constrained optimization on an augmented CMDP state is a clean technical contribution that could be adopted in other human-AI settings.

major comments (3)
  1. [Abstract] Abstract and Experiments section: the zero-shot generalization claim to 'unseen experts with different fatigue patterns' is evaluated exclusively inside the FA-L2D simulation that samples parameters from the same functional family; no human-subject data is collected to test whether the chosen curves match observed accuracy decay on the actual decision tasks.
  2. [§3] CMDP formulation (state transition and reward): human accuracy is defined as a deterministic function of cumulative workload via the parametric fatigue curves; if real degradation is non-monotonic, task-dependent, or exhibits higher variance than the simulated family, both the transition model and the learned policy become misspecified, directly undermining the reported gains over static L2D baselines.
  3. [§5] Results across coverage levels: all quantitative comparisons (outperformance, advantage of adaptive collaboration when coverage lies strictly between 0 and 1) are obtained under the same simulated fatigue dynamics used to train the policy; this makes the central empirical claim circular with respect to the modeling assumptions rather than an external validation.
minor comments (2)
  1. [§4.1] The precise functional forms and parameter ranges for the 'near-static to rapidly degrading' regimes should be stated explicitly with equations rather than described qualitatively.
  2. [§3] Notation for the Lagrangian multiplier schedule and the workload accumulator is introduced without a consolidated table; a single reference table would improve readability.
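For context on the multiplier schedule flagged in minor comment 2: the standard PPO-Lagrangian recipe updates the multiplier by dual ascent on the constraint violation. The sketch below is that generic recipe, not the paper's exact schedule.

```python
def update_multiplier(lmbda, episode_cost, budget, lr=0.01):
    # Dual ascent: increase lambda when the measured cooperation cost
    # exceeds the budget; decay it (clipped at zero) when under budget.
    return max(0.0, lmbda + lr * (episode_cost - budget))

# Each PPO iteration then optimizes the penalized return
# reward - lambda * cost with lambda held fixed.
```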

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive feedback. We address each major comment below, clarifying the simulation-based scope of the work while agreeing where revisions are needed to improve clarity.

point-by-point responses
  1. Referee: [Abstract] Abstract and Experiments section: the zero-shot generalization claim to 'unseen experts with different fatigue patterns' is evaluated exclusively inside the FA-L2D simulation that samples parameters from the same functional family; no human-subject data is collected to test whether the chosen curves match observed accuracy decay on the actual decision tasks.

    Authors: We agree that the zero-shot generalization experiments sample unseen parameters from within the same parametric family used to define the FA-L2D benchmark. No human-subject data was collected to validate the fatigue curves against observed accuracy decay on the decision tasks. We will revise the abstract and experiments section to explicitly qualify the generalization claim as holding within the modeled family and to note the simulation-based nature of the evaluation as a limitation. revision: yes

  2. Referee: [§3] CMDP formulation (state transition and reward): human accuracy is defined as a deterministic function of cumulative workload via the parametric fatigue curves; if real degradation is non-monotonic, task-dependent, or exhibits higher variance than the simulated family, both the transition model and the learned policy become misspecified, directly undermining the reported gains over static L2D baselines.

    Authors: The formulation does define human accuracy deterministically via the chosen parametric curves. If real degradation deviates (non-monotonic, task-dependent, or higher variance), the model would be misspecified. The contribution is to incorporate psychologically grounded curves into an L2D CMDP; the benchmark then tests robustness across regimes within this family. We will add discussion in §3 and the limitations section acknowledging the deterministic assumption and potential misspecification risks. revision: partial

  3. Referee: [§5] Results across coverage levels: all quantitative comparisons (outperformance, advantage of adaptive collaboration when coverage lies strictly between 0 and 1) are obtained under the same simulated fatigue dynamics used to train the policy; this makes the central empirical claim circular with respect to the modeling assumptions rather than an external validation.

    Authors: All reported comparisons are generated inside the FA-L2D simulation that encodes the fatigue dynamics. This design isolates the benefit of fatigue-aware modeling versus static baselines under controlled conditions. We will revise §5 to frame the results explicitly as evidence under the assumed fatigue model and to emphasize the benchmark's role in systematic, reproducible testing rather than claiming external validation. revision: partial

standing simulated objections not resolved
  • No human-subject data is available to validate whether the parametric fatigue curves match observed accuracy decay on the actual decision tasks.

Circularity Check

0 steps flagged

No load-bearing circularity; standard PPO-Lagrangian on workload-augmented CMDP with external baseline comparisons

full rationale

The paper's derivation formulates L2D as a CMDP whose state includes cumulative workload and applies PPO-Lagrangian for constrained optimization of accuracy under cooperation budgets. These are established techniques independent of the specific fatigue curves. Performance claims are obtained by direct comparison to external SOTA L2D baselines on the FA-L2D benchmark, which varies parameters in the fatigue family but does not define the reported metrics or outperformance as a function of fitted values. No step reduces a prediction to a self-fit by construction, no uniqueness theorem is imported from self-citation, and any self-citations are peripheral rather than load-bearing for the central optimization or generalization results. The zero-shot tests apply the policy to different simulated fatigue parameters within the same functional family, constituting an empirical evaluation inside the model rather than a definitional equivalence.
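The "inside the model" point can be made concrete: zero-shot experts are new parameter draws from the same functional family. The ranges below loosely bracket the three example tuples in Figure 2 and are illustrative, not FA-L2D's actual ranges.

```python
import random

def sample_expert(rng):
    # Draw one expert's fatigue parameters; the ranges are hypothetical,
    # chosen to bracket the three example tuples shown in Figure 2.
    return {
        "w0": rng.uniform(0.8, 0.9),
        "w_peak": rng.uniform(0.9, 1.0),
        "w_base": rng.uniform(0.5, 0.7),
        "k": rng.uniform(0.09, 0.2),
    }

rng = random.Random(0)
train_experts = [sample_expert(rng) for _ in range(8)]
unseen_experts = [sample_expert(rng) for _ in range(4)]
# Zero-shot evaluation runs the trained policy only on unseen_experts;
# both splits share one functional family, which is the referee's point.
```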

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 1 invented entity

The central claim rests on the validity of psychological fatigue curves as a model for human accuracy decay and on the standard convergence assumptions of PPO-Lagrangian; the new benchmark is an invented evaluation artifact rather than an independent empirical finding.

free parameters (2)
  • fatigue curve parameters
    Parameters that define the rate and shape of performance degradation with cumulative workload; these must be chosen or fitted from psychological data or domain knowledge.
  • Lagrangian multiplier schedule
    The multiplier used to enforce the human-workload budget constraint during PPO-Lagrangian training.
axioms (2)
  • domain assumption Human accuracy degrades according to psychologically grounded fatigue curves as a function of cumulative workload
    Invoked to justify the state augmentation and the zero-shot generalization claim.
  • domain assumption The CMDP formulation with workload state and coverage budget constraint correctly captures the human-AI deferral trade-off
    Required for the PPO-Lagrangian training to produce the claimed accuracy improvements.
invented entities (1)
  • FA-L2D benchmark (no independent evidence)
    purpose: Synthetic testbed that systematically varies fatigue dynamics from near-static to rapidly degrading regimes
    New evaluation artifact introduced to demonstrate robustness across fatigue patterns.

pith-pipeline@v0.9.0 · 5498 in / 1619 out tokens · 56804 ms · 2026-05-13T22:37:29.842692+00:00 · methodology


Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages

  1. [1]

    Disparate interactions: An algorithm-in-the-loop analysis of fairness in risk assessments

    Ben Green and Yiling Chen. Disparate interactions: An algorithm-in-the-loop analysis of fairness in risk assessments. InConference on Fairness, Accountability, and Transparency, pages 90–99, 2019. 1

  2. [2]

    OPTIMAM mammog- raphy image database: a large-scale resource of mammography images and clinical data.Radiology: Artificial Intelligence, 3(1):e200103, 2020

    Mark D Halling-Brown, Lucy M Warren, Dominic Ward, Emma Lewis, Alistair Mackenzie, Matthew G Wallis, Louise S Wilkinson, Rosalind M Given-Wilson, Rita McAvinchey, and Kenneth C Young. OPTIMAM mammog- raphy image database: a large-scale resource of mammography images and clinical data.Radiology: Artificial Intelligence, 3(1):e200103, 2020. 1

  3. [3]

    Hybrid llm: Cost-efficient and quality-aware query routing

    Dujian Ding, Ankur Mallick, Chi Wang, Robert Sim, Subhabrata Mukherjee, Victor Ruhle, Laks VS Lakshmanan, and Ahmed Hassan Awadallah. Hybrid llm: Cost-efficient and quality-aware query routing. InInternational Conference on Learning Representations, 2024. 1

  4. [4]

    Cognitive challenges in human–artificial intelligence collaboration: Investigating the path toward productive delegation.Information Systems Research, 33(2):678–696, 2022

    Andreas F ¨ugener, J ¨orn Grahl, Alok Gupta, and Wolfgang Ketter. Cognitive challenges in human–artificial intelligence collaboration: Investigating the path toward productive delegation.Information Systems Research, 33(2):678–696, 2022. 2

  5. [5]

    Predict responsibly: improving fairness and accuracy by learning to defer

    David Madras, Toni Pitassi, and Richard Zemel. Predict responsibly: improving fairness and accuracy by learning to defer. InAdvances in Neural Information Processing Systems, volume 31, 2018. 2, 3, 9, 12, 13, 15

  6. [6]

    Consistent estimators for learning to defer to an expert

    Hussein Mozannar and David Sontag. Consistent estimators for learning to defer to an expert. In Hal Daum ´e Iii and Aarti Singh, editors,International Conference on Machine Learning, volume 119 ofProceedings of Machine Learning Research, pages 7076–7087. PMLR, 2020. 2, 3, 9, 12, 13, 15

  7. [7]

    Learning to defer to a population: A meta-learning approach

    Dharmesh Tailor, Aditya Patra, Rajeev Verma, Putra Manggala, and Eric Nalisnick. Learning to defer to a population: A meta-learning approach. InInternational Conference on Artificial Intelligence and Statistics, 2024. 2, 3, 9, 12, 13, 15

  8. [8]

    Expert-agnostic learning to defer

    Joshua Strong, Pramit Saha, Yasin Ibrahim, Cheng Ouyang, and Alison Noble. Expert-agnostic learning to defer. arXiv preprint arXiv:2502.10533, 2025. 2, 3, 9, 12, 13, 15

  9. [9]

    The rise of human factors: optimising performance of individuals and teams to improve patients’ outcomes.Journal of thoracic disease, 11(Suppl 7):S998, 2019

    Gianluca Casali, William Cullen, and Gareth Lock. The rise of human factors: optimising performance of individuals and teams to improve patients’ outcomes.Journal of thoracic disease, 11(Suppl 7):S998, 2019. 2, 14

  10. [10]

    Analysis of human performance as a measure of mental fatigue

    Andr´e Pimenta, Davide Carneiro, Paulo Novais, and Jos´e Neves. Analysis of human performance as a measure of mental fatigue. InHybrid Artificial Intelligence Systems: 9th International Conference, HAIS 2014, Salamanca, Spain, June 11-13, 2014. Proceedings 9, pages 389–401. Springer, 2014. 2

  11. [11]

    Regression- based continuous driving fatigue estimation: Toward practical implementation.IEEE Transactions on Cognitive and Developmental Systems, 12(2):323–331, 2019

    Rohit Bose, Hongtao Wang, Andrei Dragomir, Nitish V Thakor, Anastasios Bezerianos, and Junhua Li. Regression- based continuous driving fatigue estimation: Toward practical implementation.IEEE Transactions on Cognitive and Developmental Systems, 12(2):323–331, 2019. 2

  12. [12]

    Psychometric curves reveal changes in bias, lapse rate, and guess rate in an online vigilance task.Attention, Perception, & Psychophysics, 85(8):2879–2893, 2023

    Shannon P Gyles, Jason S McCarley, and Yusuke Yamani. Psychometric curves reveal changes in bias, lapse rate, and guess rate in an online vigilance task.Attention, Perception, & Psychophysics, 85(8):2879–2893, 2023. 2, 4, 14

  13. [13]

    Double-sigmoid model for fitting fatigue profiles in mouse fast-and slow-twitch muscle.Experimental physiology, 93(7):851–862, 2008

    SP Cairns, DM Robinson, and DS Loiselle. Double-sigmoid model for fitting fatigue profiles in mouse fast-and slow-twitch muscle.Experimental physiology, 93(7):851–862, 2008. 2

  14. [14]

    Cognitive and system factors contributing to diagnostic errors in radiology.American Journal of Roentgenology, 201(3):611–617, 2013

    Cindy S Lee, Paul G Nagy, Sallie J Weaver, and David E Newman-Toker. Cognitive and system factors contributing to diagnostic errors in radiology.American Journal of Roentgenology, 201(3):611–617, 2013. 2 16 APREPRINT- APRIL7, 2026

  15. [15]

    The insidious problem of fatigue in medical imaging practice.Journal of digital imaging, 25(1):3–6, 2012

    Bruce I Reiner and Elizabeth Krupinski. The insidious problem of fatigue in medical imaging practice.Journal of digital imaging, 25(1):3–6, 2012. 2

  16. [16]

    Tired in the reading room: the influence of fatigue in radiology.Journal of the American College of Radiology, 14(2):191–197, 2017

    Stephen Waite, Srinivas Kolla, Jean Jeudy, Alan Legasto, Stephen L Macknik, Susana Martinez-Conde, Elizabeth A Krupinski, and Deborah L Reede. Tired in the reading room: the influence of fatigue in radiology.Journal of the American College of Radiology, 14(2):191–197, 2017. 2

  17. [17]

    Fatigue in radiology: a fertile area for future research.The British journal of radiology, 92(1099):20190043, 2019

    Sian Taylor-Phillips and Chris Stinton. Fatigue in radiology: a fertile area for future research.The British journal of radiology, 92(1099):20190043, 2019. 2

  18. [18]

    Liability of interpreting too many radiographs.American Journal of Roentgenology, 175(1):17–22,

    Leonard Berlin. Liability of interpreting too many radiographs.American Journal of Roentgenology, 175(1):17–22,

  19. [19]

    The workload curve: Subjective mental workload.Human factors, 57(7):1174–1187, 2015

    Steven Estes. The workload curve: Subjective mental workload.Human factors, 57(7):1174–1187, 2015. 2, 3, 4, 14

  20. [20]

    Mechanisms of skill acquisition and the law of practice

    Allen Newell and Paul S Rosenbloom. Mechanisms of skill acquisition and the law of practice. InCognitive skills and their acquisition, pages 1–55. Psychology Press, 2013. 2, 4

  21. [21]

    Learning with noisy labels revisited: A study using real-world human annotations

    Jiaheng Wei, Zhaowei Zhu, Hao Cheng, Tongliang Liu, Gang Niu, and Yang Liu. Learning with noisy labels revisited: A study using real-world human annotations. InInternational Conference on Learning Representations,

  22. [22]

    Learning visual sentiment distributions via augmented conditional probability neural network

    Jufeng Yang, Ming Sun, and Xiaoxiao Sun. Learning visual sentiment distributions via augmented conditional probability neural network. InProceedings of the AAAI Conference on Artificial Intelligence, volume 31(1), 2017. 3, 6, 9

  23. [23]

    A data-centric approach for improving ambiguous labels with combined semi- supervised classification and clustering

    Lars Schmarje, Monty Santarossa, Simon-Martin Schr¨oder, Claudius Zelenka, Rainer Kiko, Jenny Stracke, Nina V olkmann, and Reinhard Koch. A data-centric approach for improving ambiguous labels with combined semi- supervised classification and clustering. InEuropean Conference on Computer Vision, pages 363–380. Springer,

  24. [24]

    Hard sample aware noise robust learning for histopathology image classification.IEEE Transactions on Medical Imaging, 41(4):881–894, 2021

    Chuang Zhu, Wenkai Chen, Ting Peng, Ying Wang, and Mulan Jin. Hard sample aware noise robust learning for histopathology image classification.IEEE Transactions on Medical Imaging, 41(4):881–894, 2021. 3, 6, 8

  25. [25]

    Calibrated learning to defer with one-vs-all classifiers

    Rajeev Verma and Eric Nalisnick. Calibrated learning to defer with one-vs-all classifiers. InInternational Conference on Machine Learning, pages 22184–22202. PMLR, 2022. 3

  26. [26]

    Consistent estimators for learning to defer to an expert

    Hussein Mozannar and David Sontag. Consistent estimators for learning to defer to an expert. InInternational Conference on Machine Learning, pages 7076–7087. PMLR, 2020. 3

  27. [27]

    Mental effort, workload, time on task, and certainty: Beyond linear models.Educational Psychology Review, 31:421–438, 2019

    Jimmie Leppink and Patricia P´erez-Fuster. Mental effort, workload, time on task, and certainty: Beyond linear models.Educational Psychology Review, 31:421–438, 2019. 3, 14

  28. [28]

    Routledge, 2021

    Eitan Altman.Constrained Markov decision processes. Routledge, 2021. 3

  29. [29]

    Sarah K Hopko, Riya Khurana, Ranjana K Mehta, and Prabhakar R Pagilla. Effect of cognitive fatigue, operator sex, and robot assistance on task performance metrics, workload, and situation awareness in human-robot collaboration.IEEE Robotics and Automation Letters, 6(2):3049–3056, 2021. 4

  30. [30]

    Psychometric curves reveal three mechanisms of vigilance decrement

    Jason S McCarley and Yusuke Yamani. Psychometric curves reveal three mechanisms of vigilance decrement. Psychological science, 32(10):1675–1683, 2021. 4, 14

  31. [31]

    Structured state space models for in-context reinforcement learning.Advances in Neural Information Processing Systems, 36:47016–47031, 2023

    Chris Lu, Yannick Schroecker, Albert Gu, Emilio Parisotto, Jakob Foerster, Satinder Singh, and Feryal Behbahani. Structured state space models for in-context reinforcement learning.Advances in Neural Information Processing Systems, 36:47016–47031, 2023. 5, 6

  32. [32]

    Simplified state space layers for sequence modeling

    Jimmy TH Smith, Andrew Warrington, and Scott Linderman. Simplified state space layers for sequence modeling. InThe Eleventh International Conference on Learning Representations, 2023. 5, 6

  33. [33]

    Efficiently modeling long sequences with structured state spaces

    Albert Gu, Karan Goel, and Christopher R´e. Efficiently modeling long sequences with structured state spaces. International Conference on Learning Representations, 2022. 5, 6

  34. [34]

    Benchmarking safe exploration in deep reinforcement learning,

    Alex Ray, Joshua Achiam, and Dario Amodei. Benchmarking safe exploration in deep reinforcement learning. arXiv preprint arXiv:1910.01708, 7(1):2, 2019. 5

  35. [35]

    Constrained Markov decision processes with total cost criteria: Lagrangian approach and dual linear program.Mathematical methods of operations research, 48(3):387–417, 1998

    Eitan Altman. Constrained Markov decision processes with total cost criteria: Lagrangian approach and dual linear program.Mathematical methods of operations research, 48(3):387–417, 1998. 5

  36. [36]

    An efficient end-to-end training approach for zero-shot human-ai coordination.Advances in Neural Information Processing Systems, 36:2636– 2658, 2023

    Xue Yan, Jiaxian Guo, Xingzhou Lou, Jun Wang, Haifeng Zhang, and Yali Du. An efficient end-to-end training approach for zero-shot human-ai coordination.Advances in Neural Information Processing Systems, 36:2636– 2658, 2023. 6 17 APREPRINT- APRIL7, 2026

  37. [37]

    Cross- environment cooperation enables zero-shot multi-agent coordination

    Kunal Jha, Wilka Carvalho, Yancheng Liang, Simon S Du, Max Kleiman-Weiner, and Natasha Jaques. Cross- environment cooperation enables zero-shot multi-agent coordination. InInternational Conference on Machine Learning, 2025. 6

  38. [38]

    Overcookedv2: Rethinking overcooked for zero-shot coordination

    Tobias Gessler, Tin Dizdarevic, Ani Calinescu, Benjamin Ellis, Andrei Lupu, and Jakob Nicolaus Foerster. Overcookedv2: Rethinking overcooked for zero-shot coordination. InInternational Conference on Learning Representations, 2025. 6

  39. [39]

    Popgym: Benchmarking partially observable reinforcement learning.The Eleventh International Conference on Learning Representations,

    Steven Morad, Ryan Kortvelesy, Matteo Bettini, Stephan Liwicki, and Amanda Prorok. Popgym: Benchmarking partially observable reinforcement learning.The Eleventh International Conference on Learning Representations,

  40. [40]

    Decision s4: Efficient sequence-based rl via state spaces layers

    Shmuel Bar David, Itamar Zimerman, Eliya Nachmani, and Lior Wolf. Decision s4: Efficient sequence-based rl via state spaces layers. InThe Eleventh International Conference on Learning Representations, 2022. 6

  41. [41]

    Decision transformer: Reinforcement learning via sequence modeling.Advances in neural information processing systems, 34:15084–15097, 2021

    Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Misha Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. Decision transformer: Reinforcement learning via sequence modeling.Advances in neural information processing systems, 34:15084–15097, 2021. 6

  42. [42]

    Stabilizing transformers for reinforcement learning

    Emilio Parisotto, Francis Song, Jack Rae, Razvan Pascanu, Caglar Gulcehre, Siddhant Jayakumar, Max Jaderberg, Raphael Lopez Kaufman, Aidan Clark, Seb Noury, et al. Stabilizing transformers for reinforcement learning. In International conference on machine learning, pages 7487–7498. PMLR, 2020. 6

  43. [43]

    Gradients are not all you need.arXiv preprint arXiv:2111.05803, 2021

    Luke Metz, C Daniel Freeman, Samuel S Schoenholz, and Tal Kachman. Gradients are not all you need.arXiv preprint arXiv:2111.05803, 2021. 6

  44. [44]

    Learning multiple layers of features from tiny images

    Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009. 6, 8

  45. [45]

    Do we train on test data? Purging CIFAR of near-duplicates.Journal of Imaging, 6(6):41, 2020

    Bj¨orn Barz and Joachim Denzler. Do we train on test data? Purging CIFAR of near-duplicates.Journal of Imaging, 6(6):41, 2020. 8

  46. [46]

    Large-scale visual sentiment ontology and detectors using adjective noun pairs

    Damian Borth, Rongrong Ji, Tao Chen, Thomas Breuel, and Shih-Fu Chang. Large-scale visual sentiment ontology and detectors using adjective noun pairs. InProceedings of the 21st ACM international conference on Multimedia, pages 223–232, 2013. 9

  47. [47]

    Coverage- constrained human-ai cooperation with multiple experts.The Fortieth AAAI Conference on Artificial Intelligence,

    Zheng Zhang, Cuong Nguyen, Kevin Wells, Thanh-Toan Do, David Rosewarne, and Gustavo Carneiro. Coverage- constrained human-ai cooperation with multiple experts.The Fortieth AAAI Conference on Artificial Intelligence,

  48. [48]

    Accuracy-rejection curves (arcs) for comparing classification methods with a reject option

    Malik Sajjad Ahmed Nadeem, Jean-Daniel Zucker, and Blaise Hanczar. Accuracy-rejection curves (arcs) for comparing classification methods with a reject option. InMachine Learning in Systems Biology, pages 65–81. PMLR, 2009. 12, 15

  49. [49]

    Time-of-day effects on mammographic film reading performance

    Helen C Cowley and Alastair G Gale. Time-of-day effects on mammographic film reading performance. In Medical Imaging 1997: Image Perception, volume 3036, pages 212–221. SPIE, 1997. 13, 15

  50. [50]

    Exploiting human-AI dependence for learning to defer

    Zixi Wei, Yuzhou Cao, and Lei Feng. Exploiting human-AI dependence for learning to defer. InInternational Conference on Machine Learning, 2024. 13

  51. [51]

    In defense of softmax parametrization for calibrated and consistent learning to defer

    Yuzhou Cao, Hussein Mozannar, Lei Feng, Hongxin Wei, and Bo An. In defense of softmax parametrization for calibrated and consistent learning to defer. In Advances in Neural Information Processing Systems, volume 36,

  52. [52]

    Two-stage learning to defer with multiple experts

    Anqi Mao, Christopher Mohri, Mehryar Mohri, and Yutao Zhong. Two-stage learning to defer with multiple experts. In Advances in Neural Information Processing Systems, 2023. 13

  53. [53]

    Learning to defer to multiple experts: Consistent surrogate losses, confidence calibration, and conformal ensembles

    Rajeev Verma, Daniel Barrejon, and Eric Nalisnick. Learning to defer to multiple experts: Consistent surrogate losses, confidence calibration, and conformal ensembles. In International Conference on Artificial Intelligence and Statistics, pages 11415–11434. PMLR, 25–27 Apr 2023. 13

  54. [54]

    Regression with multi-expert deferral

    Anqi Mao, Mehryar Mohri, and Yutao Zhong. Regression with multi-expert deferral. In International Conference on Machine Learning, 2024. 13

  55. [55]

    Learning to complement and to defer to multiple users

    Zheng Zhang, Wenjie Ai, Kevin Wells, David Rosewarne, Thanh-Toan Do, and Gustavo Carneiro. Learning to complement and to defer to multiple users. In European Conference on Computer Vision, pages 144–162. Springer, 2025. 13

  56. [56]

    Probabilistic learning to defer: Handling missing expert annotations and controlling workload distribution

    Cuong C Nguyen, Thanh-Toan Do, and Gustavo Carneiro. Probabilistic learning to defer: Handling missing expert annotations and controlling workload distribution. In The Thirteenth International Conference on Learning Representations, 2025. 13

  57. [57]

    Learning-to-defer for sequential medical decision-making under uncertainty

    Shalmali Joshi, Sonali Parbhoo, and Finale Doshi-Velez. Learning-to-defer for sequential medical decision-making under uncertainty. Transactions on Machine Learning Research, 2023. 14

  58. [58]

    Stress and human performance

    James E Driskell and Eduardo Salas. Stress and human performance. Psychology Press, 2013. 14

  59. [59]

    A drop in cognitive performance, whodunit? Subjective mental fatigue, brain deactivation or increased parasympathetic activity? It's complicated!

    Jeroen Van Cutsem, Peter Van Schuerbeek, Nathalie Pattyn, Hubert Raeymaekers, Johan De Mey, Romain Meeusen, and Bart Roelands. A drop in cognitive performance, whodunit? Subjective mental fatigue, brain deactivation or increased parasympathetic activity? It's complicated! Cortex, 155:30–45, 2022. 14

  60. [60]

    Neural and computational mechanisms of momentary fatigue and persistence in effort-based choice

    Tanja Müller, Miriam C Klein-Flügge, Sanjay G Manohar, Masud Husain, and Matthew AJ Apps. Neural and computational mechanisms of momentary fatigue and persistence in effort-based choice. Nature Communications, 12(1):4593, 2021. 14

  61. [61]

    Perceived—and not manipulated—self-control depletion predicts students' achievement outcomes in foreign language assessments

    Christoph Lindner and Jan Retelsdorf. Perceived—and not manipulated—self-control depletion predicts students' achievement outcomes in foreign language assessments. Educational Psychology, 40(4):490–508, 2020. 14

  62. [62]

    The psychology of fatigue: Work, effort and control

    Robert Hockey. The psychology of fatigue: Work, effort and control. Cambridge University Press, 2013. 14

  63. [63]

    Translating fatigue to human performance

    Roger M Enoka and Jacques Duchateau. Translating fatigue to human performance. Medicine and science in sports and exercise, 48(11):2228, 2016. 14

  64. [64]

    The effects of mental fatigue on physical performance: a systematic review

    Jeroen Van Cutsem, Samuele Marcora, Kevin De Pauw, Stephen Bailey, Romain Meeusen, and Bart Roelands. The effects of mental fatigue on physical performance: a systematic review. Sports medicine, 47(8):1569–1588,

  65. [65]

    Mental fatigue impairs physical performance in humans

    Samuele M Marcora, Walter Staiano, and Victoria Manning. Mental fatigue impairs physical performance in humans. Journal of applied physiology, 106(3):857–864, 2009. 14

  66. [66]

    Cognitive tasks elicit mental fatigue and impair subsequent physical task endurance: Effects of task duration and type

    Neil Dallaway, Samuel JE Lucas, and Christopher Ring. Cognitive tasks elicit mental fatigue and impair subsequent physical task endurance: Effects of task duration and type. Psychophysiology, 59(12):e14126, 2022. 14

  67. [67]

    Cognitive skills and their acquisition

    John R Anderson. Cognitive skills and their acquisition. Psychology Press, 2013. 14

  68. [68]

    Incorporating human fatigue and recovery into the learning–forgetting process

    Mohamad Y Jaber, ZS Givi, and W Patrick Neumann. Incorporating human fatigue and recovery into the learning–forgetting process. Applied mathematical modelling, 37(12-13):7287–7299, 2013. 14