pith. machine review for the scientific record.

arxiv: 2604.00904 · v2 · submitted 2026-04-01 · 💻 cs.LG


Fatigue-Aware Learning to Defer via Constrained Optimisation


Pith reviewed 2026-05-13 22:37 UTC · model grok-4.3

classification 💻 cs.LG
keywords learning to defer · human-AI collaboration · fatigue modeling · constrained Markov decision process · reinforcement learning · workload-aware optimization

The pith

FALCON models human fatigue via workload curves in a constrained MDP to improve learning-to-defer decisions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard learning-to-defer systems assume human experts maintain fixed accuracy, yet psychological findings show that performance degrades with cumulative workload. The paper introduces FALCON to close this gap, building fatigue curves that link workload to accuracy into the deferral policy. It casts the problem as a constrained Markov decision process whose state tracks task features plus cumulative human workload, then trains the policy with PPO-Lagrangian to maximize accuracy while respecting cooperation budgets. This matters for sustained human-AI teams, where fatigue accumulates over time: the system can defer adaptively rather than follow a static rule. Experiments show consistent gains over prior methods, zero-shot transfer to new experts, and an advantage over AI-only and human-only baselines whenever decisions must be shared between the AI and the human.

Core claim

FALCON formulates learning to defer as a Constrained Markov Decision Process whose state includes both task features and cumulative human workload, uses psychologically grounded fatigue curves to model how human accuracy declines with workload, and optimizes the policy via PPO-Lagrangian training to maximize accuracy under explicit human-AI cooperation budgets.
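The workload-augmented formulation above can be sketched as a minimal transition function. The exponential fatigue shape, the AI accuracy of 0.8, and the unit workload increment below are illustrative assumptions, not the paper's specification.

```python
import math
import random

def human_accuracy(rho, w0=0.9, w_base=0.7, k=0.1):
    # Hypothetical fatigue curve: human accuracy decays from w0 toward
    # w_base as cumulative workload rho grows at rate k (stand-in shape).
    return w_base + (w0 - w_base) * math.exp(-k * rho)

def step(state, defer, ai_accuracy=0.8):
    """One CMDP transition with state = (task_features, cumulative_workload)."""
    features, rho = state
    if defer:  # route the task to the human expert
        correct = random.random() < human_accuracy(rho)
        rho += 1.0   # workload accumulates only when the human is used
        cost = 1.0   # counted against the human-AI cooperation budget
    else:      # the AI acts autonomously; workload is unchanged
        correct = random.random() < ai_accuracy
        cost = 0.0
    reward = 1.0 if correct else 0.0
    next_features = [random.random() for _ in range(4)]  # placeholder task
    return (next_features, rho), reward, cost
```

PPO-Lagrangian would then maximize expected reward while keeping the expected episode cost under the cooperation budget.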

What carries the argument

Constrained Markov decision process (CMDP) whose state augments task features with cumulative workload, paired with fatigue curves that map workload to human accuracy and optimized by PPO-Lagrangian.

If this is right

  • Adaptive policies outperform state-of-the-art L2D methods at every coverage level tested.
  • Zero-shot generalization holds to unseen experts whose fatigue patterns differ from those seen in training.
  • When coverage must lie strictly between 0 and 1, the fatigue-aware policy yields higher accuracy than either an AI-only or human-only baseline.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Deployed systems could feed live workload estimates from sensors or task logs into the state to keep the policy current.
  • The same CMDP-plus-fatigue structure could extend to other human-in-the-loop settings such as medical image review or moderation queues.
  • One could test whether a policy trained on one family of fatigue curves transfers to a different family without retraining.

Load-bearing premise

Psychologically grounded fatigue curves accurately capture how human accuracy degrades with cumulative workload in the specific decision-deferral tasks studied.
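The extracted text names the curve parameters (w0, wpeak, wbase, k) but not the equation, so the shape below, a brief warm-up to a peak followed by exponential decay toward a floor, is a guess at the family consistent with the parameter names in Figure 2, not the paper's w(ρ).

```python
import math

def fatigue_curve(rho, w0=0.9, w_peak=1.0, w_base=0.7, k=0.1, rho_peak=0.375):
    # Hypothetical w(rho): warm-up from w0 to w_peak, then exponential
    # decay toward the floor w_base at rate k. Parameter names follow
    # Figure 2; the functional form itself is an assumption.
    if rho <= rho_peak:
        return w0 + (w_peak - w0) * (rho / rho_peak)
    return w_base + (w_peak - w_base) * math.exp(-k * (rho - rho_peak))
```

The defaults mirror the first parameter tuple reported in Figure 2's Example 1.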

What would settle it

An experiment that measures real human accuracy degradation under increasing workload in the studied tasks and finds it deviates substantially from the modeled fatigue curves, or a direct comparison showing FALCON loses its performance edge once fatigue is present.

Figures

Figures reproduced from arXiv: 2604.00904 by Cuong C. Nguyen, David Rosewarne, Gustavo Carneiro, Kevin Wells, Zheng Zhang.

Figure 1: Example of an L2D scenario illustrating workload-variant human performance in human-AI task allocation.

Figure 2: (a) Examples of w(ρ). The parameter values (w0, wpeak, wbase, k, ρ̄, ρ̂) in Examples 1, 2, and 3 are (0.9, 1, 0.7, 0.1, 0.375, 0.05), (0.8, 0.95, 0.5, 0.09, 0.5, 0.025), and (0.8, 0.9, 0.6, 0.2, 0.6, 0.1). (b) The architecture of FALCON with workload-variant human performance. A backbone model extracts visual features from the input xt, while the cumulative human workload ρt is passed through an embeddi…

Figure 3: Human performance vs. cumulative workload curves on various datasets. The blue and red lines denote the …

Figure 4: Training time of FALCON and competing methods on Cifar100 (1e7 iterations).

Figure 5: Inference time of FALCON and competing methods on Cifar100 (50 episodes).

Figure 6: Accuracy-coverage curves of several L2D strategies and FALCON on various datasets.

Figure 7: Different human performance during testing (left column) and corresponding results with fine-tuning (middle …)

Figure 8: Accuracy-coverage curves of the CMDP ablation on the Cifar100 dataset.

Figure 9: Validation against clinical human performance data. (a) Comparison between simulated human performance …
Original abstract

Learning to defer (L2D) enables human-AI cooperation by deciding when an AI system should act autonomously or defer to a human expert. Existing L2D methods, however, assume static human performance, contradicting well-established findings on fatigue-induced degradation. We propose Fatigue-Aware Learning to Defer via Constrained Optimisation (FALCON), which explicitly models workload-varying human performance using psychologically grounded fatigue curves. FALCON formulates L2D as a Constrained Markov Decision Process (CMDP) whose state includes both task features and cumulative human workload, and optimises accuracy under human-AI cooperation budgets via PPO-Lagrangian training. We further introduce FA-L2D, a benchmark that systematically varies fatigue dynamics from near-static to rapidly degrading regimes. Experiments across multiple datasets show that FALCON consistently outperforms state-of-the-art L2D methods across coverage levels, generalises zero-shot to unseen experts with different fatigue patterns, and demonstrates the advantage of adaptive human-AI collaboration over AI-only or human-only decision-making when coverage lies strictly between 0 and 1.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes FALCON, which formulates learning to defer as a Constrained Markov Decision Process whose state augments task features with cumulative human workload, models human accuracy via psychologically grounded parametric fatigue curves, and optimizes accuracy subject to cooperation budgets using PPO-Lagrangian. It introduces the FA-L2D benchmark that varies fatigue dynamics from near-static to rapid decay and reports that FALCON outperforms prior L2D methods across coverage levels, generalizes zero-shot to unseen fatigue parameters, and yields better performance than AI-only or human-only baselines when coverage is strictly between 0 and 1.

Significance. If the fatigue curves accurately reflect real workload-induced degradation and transfer across experts, the work would meaningfully extend L2D beyond static human-performance assumptions and provide a reproducible benchmark for testing robustness to fatigue variation. The use of constrained optimization on an augmented CMDP state is a clean technical contribution that could be adopted in other human-AI settings.

major comments (3)
  1. [Abstract] Abstract and Experiments section: the zero-shot generalization claim to 'unseen experts with different fatigue patterns' is evaluated exclusively inside the FA-L2D simulation that samples parameters from the same functional family; no human-subject data is collected to test whether the chosen curves match observed accuracy decay on the actual decision tasks.
  2. [§3] CMDP formulation (state transition and reward): human accuracy is defined as a deterministic function of cumulative workload via the parametric fatigue curves; if real degradation is non-monotonic, task-dependent, or exhibits higher variance than the simulated family, both the transition model and the learned policy become misspecified, directly undermining the reported gains over static L2D baselines.
  3. [§5] Results across coverage levels: all quantitative comparisons (outperformance, advantage of adaptive collaboration when coverage lies strictly between 0 and 1) are obtained under the same simulated fatigue dynamics used to train the policy; this makes the central empirical claim circular with respect to the modeling assumptions rather than an external validation.
minor comments (2)
  1. [§4.1] The precise functional forms and parameter ranges for the 'near-static to rapidly degrading' regimes should be stated explicitly with equations rather than described qualitatively.
  2. [§3] Notation for the Lagrangian multiplier schedule and the workload accumulator is introduced without a consolidated table; a single reference table would improve readability.
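For context on the multiplier schedule flagged in minor comment 2: the standard PPO-Lagrangian recipe updates the multiplier by dual ascent on the constraint violation. The sketch below is that generic recipe, not the paper's exact schedule.

```python
def update_multiplier(lmbda, episode_cost, budget, lr=0.01):
    # Dual ascent: increase lambda when the measured cooperation cost
    # exceeds the budget; decay it (clipped at zero) when under budget.
    return max(0.0, lmbda + lr * (episode_cost - budget))

# Each PPO iteration then optimizes the penalized return
# reward - lambda * cost with lambda held fixed.
```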

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive feedback. We address each major comment below, clarifying the simulation-based scope of the work while agreeing where revisions are needed to improve clarity.

point-by-point responses
  1. Referee: [Abstract] Abstract and Experiments section: the zero-shot generalization claim to 'unseen experts with different fatigue patterns' is evaluated exclusively inside the FA-L2D simulation that samples parameters from the same functional family; no human-subject data is collected to test whether the chosen curves match observed accuracy decay on the actual decision tasks.

    Authors: We agree that the zero-shot generalization experiments sample unseen parameters from within the same parametric family used to define the FA-L2D benchmark. No human-subject data was collected to validate the fatigue curves against observed accuracy decay on the decision tasks. We will revise the abstract and experiments section to explicitly qualify the generalization claim as holding within the modeled family and to note the simulation-based nature of the evaluation as a limitation. revision: yes

  2. Referee: [§3] CMDP formulation (state transition and reward): human accuracy is defined as a deterministic function of cumulative workload via the parametric fatigue curves; if real degradation is non-monotonic, task-dependent, or exhibits higher variance than the simulated family, both the transition model and the learned policy become misspecified, directly undermining the reported gains over static L2D baselines.

    Authors: The formulation does define human accuracy deterministically via the chosen parametric curves. If real degradation deviates (non-monotonic, task-dependent, or higher variance), the model would be misspecified. The contribution is to incorporate psychologically grounded curves into an L2D CMDP; the benchmark then tests robustness across regimes within this family. We will add discussion in §3 and the limitations section acknowledging the deterministic assumption and potential misspecification risks. revision: partial

  3. Referee: [§5] Results across coverage levels: all quantitative comparisons (outperformance, advantage of adaptive collaboration when coverage lies strictly between 0 and 1) are obtained under the same simulated fatigue dynamics used to train the policy; this makes the central empirical claim circular with respect to the modeling assumptions rather than an external validation.

    Authors: All reported comparisons are generated inside the FA-L2D simulation that encodes the fatigue dynamics. This design isolates the benefit of fatigue-aware modeling versus static baselines under controlled conditions. We will revise §5 to frame the results explicitly as evidence under the assumed fatigue model and to emphasize the benchmark's role in systematic, reproducible testing rather than claiming external validation. revision: partial

standing simulated objections not resolved
  • No human-subject data is available to validate whether the parametric fatigue curves match observed accuracy decay on the actual decision tasks.

Circularity Check

0 steps flagged

No load-bearing circularity; standard PPO-Lagrangian on workload-augmented CMDP with external baseline comparisons

full rationale

The paper's derivation formulates L2D as a CMDP whose state includes cumulative workload and applies PPO-Lagrangian for constrained optimization of accuracy under cooperation budgets. These are established techniques independent of the specific fatigue curves. Performance claims are obtained by direct comparison to external SOTA L2D baselines on the FA-L2D benchmark, which varies parameters in the fatigue family but does not define the reported metrics or outperformance as a function of fitted values. No step reduces a prediction to a self-fit by construction, no uniqueness theorem is imported from self-citation, and any self-citations are peripheral rather than load-bearing for the central optimization or generalization results. The zero-shot tests apply the policy to different simulated fatigue parameters within the same functional family, constituting an empirical evaluation inside the model rather than a definitional equivalence.
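The "inside the model" point can be made concrete: zero-shot experts are new parameter draws from the same functional family. The ranges below loosely bracket the three example tuples in Figure 2 and are illustrative, not FA-L2D's actual ranges.

```python
import random

def sample_expert(rng):
    # Draw one expert's fatigue parameters; the ranges are hypothetical,
    # chosen to bracket the three example tuples shown in Figure 2.
    return {
        "w0": rng.uniform(0.8, 0.9),
        "w_peak": rng.uniform(0.9, 1.0),
        "w_base": rng.uniform(0.5, 0.7),
        "k": rng.uniform(0.09, 0.2),
    }

rng = random.Random(0)
train_experts = [sample_expert(rng) for _ in range(8)]
unseen_experts = [sample_expert(rng) for _ in range(4)]
# Zero-shot evaluation runs the trained policy only on unseen_experts;
# both splits share one functional family, which is the referee's point.
```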

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 1 invented entity

The central claim rests on the validity of psychological fatigue curves as a model for human accuracy decay and on the standard convergence assumptions of PPO-Lagrangian; the new benchmark is an invented evaluation artifact rather than an independent empirical finding.

free parameters (2)
  • fatigue curve parameters
    Parameters that define the rate and shape of performance degradation with cumulative workload; these must be chosen or fitted from psychological data or domain knowledge.
  • Lagrangian multiplier schedule
    The multiplier used to enforce the human-workload budget constraint during PPO-Lagrangian training.
axioms (2)
  • domain assumption Human accuracy degrades according to psychologically grounded fatigue curves as a function of cumulative workload
    Invoked to justify the state augmentation and the zero-shot generalization claim.
  • domain assumption The CMDP formulation with workload state and coverage budget constraint correctly captures the human-AI deferral trade-off
    Required for the PPO-Lagrangian training to produce the claimed accuracy improvements.
invented entities (1)
  • FA-L2D benchmark (no independent evidence)
    purpose: Synthetic testbed that systematically varies fatigue dynamics from near-static to rapidly degrading regimes
    New evaluation artifact introduced to demonstrate robustness across fatigue patterns.

pith-pipeline@v0.9.0 · 5498 in / 1619 out tokens · 56804 ms · 2026-05-13T22:37:29.842692+00:00 · methodology


Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages

  1. [1]

    Disparate interactions: An algorithm-in-the-loop analysis of fairness in risk assessments

    Ben Green and Yiling Chen. Disparate interactions: An algorithm-in-the-loop analysis of fairness in risk assessments. InConference on Fairness, Accountability, and Transparency, pages 90–99, 2019. 1

  2. [2]

    OPTIMAM mammog- raphy image database: a large-scale resource of mammography images and clinical data.Radiology: Artificial Intelligence, 3(1):e200103, 2020

    Mark D Halling-Brown, Lucy M Warren, Dominic Ward, Emma Lewis, Alistair Mackenzie, Matthew G Wallis, Louise S Wilkinson, Rosalind M Given-Wilson, Rita McAvinchey, and Kenneth C Young. OPTIMAM mammog- raphy image database: a large-scale resource of mammography images and clinical data.Radiology: Artificial Intelligence, 3(1):e200103, 2020. 1

  3. [3]

    Hybrid llm: Cost-efficient and quality-aware query routing

    Dujian Ding, Ankur Mallick, Chi Wang, Robert Sim, Subhabrata Mukherjee, Victor Ruhle, Laks VS Lakshmanan, and Ahmed Hassan Awadallah. Hybrid llm: Cost-efficient and quality-aware query routing. InInternational Conference on Learning Representations, 2024. 1

  4. [4]

    Cognitive challenges in human–artificial intelligence collaboration: Investigating the path toward productive delegation.Information Systems Research, 33(2):678–696, 2022

    Andreas F ¨ugener, J ¨orn Grahl, Alok Gupta, and Wolfgang Ketter. Cognitive challenges in human–artificial intelligence collaboration: Investigating the path toward productive delegation.Information Systems Research, 33(2):678–696, 2022. 2

  5. [5]

    Predict responsibly: improving fairness and accuracy by learning to defer

    David Madras, Toni Pitassi, and Richard Zemel. Predict responsibly: improving fairness and accuracy by learning to defer. InAdvances in Neural Information Processing Systems, volume 31, 2018. 2, 3, 9, 12, 13, 15

  6. [6]

    Consistent estimators for learning to defer to an expert

    Hussein Mozannar and David Sontag. Consistent estimators for learning to defer to an expert. In Hal Daum ´e Iii and Aarti Singh, editors,International Conference on Machine Learning, volume 119 ofProceedings of Machine Learning Research, pages 7076–7087. PMLR, 2020. 2, 3, 9, 12, 13, 15

  7. [7]

    Learning to defer to a population: A meta-learning approach

    Dharmesh Tailor, Aditya Patra, Rajeev Verma, Putra Manggala, and Eric Nalisnick. Learning to defer to a population: A meta-learning approach. InInternational Conference on Artificial Intelligence and Statistics, 2024. 2, 3, 9, 12, 13, 15

  8. [8]

    Expert-agnostic learning to defer

    Joshua Strong, Pramit Saha, Yasin Ibrahim, Cheng Ouyang, and Alison Noble. Expert-agnostic learning to defer. arXiv preprint arXiv:2502.10533, 2025. 2, 3, 9, 12, 13, 15

  9. [9]

    The rise of human factors: optimising performance of individuals and teams to improve patients’ outcomes.Journal of thoracic disease, 11(Suppl 7):S998, 2019

    Gianluca Casali, William Cullen, and Gareth Lock. The rise of human factors: optimising performance of individuals and teams to improve patients’ outcomes.Journal of thoracic disease, 11(Suppl 7):S998, 2019. 2, 14

  10. [10]

    Analysis of human performance as a measure of mental fatigue

    Andr´e Pimenta, Davide Carneiro, Paulo Novais, and Jos´e Neves. Analysis of human performance as a measure of mental fatigue. InHybrid Artificial Intelligence Systems: 9th International Conference, HAIS 2014, Salamanca, Spain, June 11-13, 2014. Proceedings 9, pages 389–401. Springer, 2014. 2

  11. [11]

    Regression- based continuous driving fatigue estimation: Toward practical implementation.IEEE Transactions on Cognitive and Developmental Systems, 12(2):323–331, 2019

    Rohit Bose, Hongtao Wang, Andrei Dragomir, Nitish V Thakor, Anastasios Bezerianos, and Junhua Li. Regression- based continuous driving fatigue estimation: Toward practical implementation.IEEE Transactions on Cognitive and Developmental Systems, 12(2):323–331, 2019. 2

  12. [12]

    Psychometric curves reveal changes in bias, lapse rate, and guess rate in an online vigilance task.Attention, Perception, & Psychophysics, 85(8):2879–2893, 2023

    Shannon P Gyles, Jason S McCarley, and Yusuke Yamani. Psychometric curves reveal changes in bias, lapse rate, and guess rate in an online vigilance task.Attention, Perception, & Psychophysics, 85(8):2879–2893, 2023. 2, 4, 14

  13. [13]

    Double-sigmoid model for fitting fatigue profiles in mouse fast-and slow-twitch muscle.Experimental physiology, 93(7):851–862, 2008

    SP Cairns, DM Robinson, and DS Loiselle. Double-sigmoid model for fitting fatigue profiles in mouse fast-and slow-twitch muscle.Experimental physiology, 93(7):851–862, 2008. 2

  14. [14]

    Cognitive and system factors contributing to diagnostic errors in radiology.American Journal of Roentgenology, 201(3):611–617, 2013

    Cindy S Lee, Paul G Nagy, Sallie J Weaver, and David E Newman-Toker. Cognitive and system factors contributing to diagnostic errors in radiology.American Journal of Roentgenology, 201(3):611–617, 2013. 2 16 APREPRINT- APRIL7, 2026

  15. [15]

    The insidious problem of fatigue in medical imaging practice.Journal of digital imaging, 25(1):3–6, 2012

    Bruce I Reiner and Elizabeth Krupinski. The insidious problem of fatigue in medical imaging practice.Journal of digital imaging, 25(1):3–6, 2012. 2

  16. [16]

    Tired in the reading room: the influence of fatigue in radiology.Journal of the American College of Radiology, 14(2):191–197, 2017

    Stephen Waite, Srinivas Kolla, Jean Jeudy, Alan Legasto, Stephen L Macknik, Susana Martinez-Conde, Elizabeth A Krupinski, and Deborah L Reede. Tired in the reading room: the influence of fatigue in radiology.Journal of the American College of Radiology, 14(2):191–197, 2017. 2

  17. [17]

    Fatigue in radiology: a fertile area for future research.The British journal of radiology, 92(1099):20190043, 2019

    Sian Taylor-Phillips and Chris Stinton. Fatigue in radiology: a fertile area for future research.The British journal of radiology, 92(1099):20190043, 2019. 2

  18. [18]

    Liability of interpreting too many radiographs.American Journal of Roentgenology, 175(1):17–22,

    Leonard Berlin. Liability of interpreting too many radiographs.American Journal of Roentgenology, 175(1):17–22,

  19. [19]

    The workload curve: Subjective mental workload.Human factors, 57(7):1174–1187, 2015

    Steven Estes. The workload curve: Subjective mental workload.Human factors, 57(7):1174–1187, 2015. 2, 3, 4, 14

  20. [20]

    Mechanisms of skill acquisition and the law of practice

    Allen Newell and Paul S Rosenbloom. Mechanisms of skill acquisition and the law of practice. InCognitive skills and their acquisition, pages 1–55. Psychology Press, 2013. 2, 4

  21. [21]

    Learning with noisy labels revisited: A study using real-world human annotations

    Jiaheng Wei, Zhaowei Zhu, Hao Cheng, Tongliang Liu, Gang Niu, and Yang Liu. Learning with noisy labels revisited: A study using real-world human annotations. InInternational Conference on Learning Representations,

  22. [22]

    Learning visual sentiment distributions via augmented conditional probability neural network

    Jufeng Yang, Ming Sun, and Xiaoxiao Sun. Learning visual sentiment distributions via augmented conditional probability neural network. InProceedings of the AAAI Conference on Artificial Intelligence, volume 31(1), 2017. 3, 6, 9

  23. [23]

    A data-centric approach for improving ambiguous labels with combined semi- supervised classification and clustering

    Lars Schmarje, Monty Santarossa, Simon-Martin Schr¨oder, Claudius Zelenka, Rainer Kiko, Jenny Stracke, Nina V olkmann, and Reinhard Koch. A data-centric approach for improving ambiguous labels with combined semi- supervised classification and clustering. InEuropean Conference on Computer Vision, pages 363–380. Springer,

  24. [24]

    Hard sample aware noise robust learning for histopathology image classification.IEEE Transactions on Medical Imaging, 41(4):881–894, 2021

    Chuang Zhu, Wenkai Chen, Ting Peng, Ying Wang, and Mulan Jin. Hard sample aware noise robust learning for histopathology image classification.IEEE Transactions on Medical Imaging, 41(4):881–894, 2021. 3, 6, 8

  25. [25]

    Calibrated learning to defer with one-vs-all classifiers

    Rajeev Verma and Eric Nalisnick. Calibrated learning to defer with one-vs-all classifiers. InInternational Conference on Machine Learning, pages 22184–22202. PMLR, 2022. 3

  26. [26]

    Consistent estimators for learning to defer to an expert

    Hussein Mozannar and David Sontag. Consistent estimators for learning to defer to an expert. InInternational Conference on Machine Learning, pages 7076–7087. PMLR, 2020. 3

  27. [27]

    Mental effort, workload, time on task, and certainty: Beyond linear models.Educational Psychology Review, 31:421–438, 2019

    Jimmie Leppink and Patricia P´erez-Fuster. Mental effort, workload, time on task, and certainty: Beyond linear models.Educational Psychology Review, 31:421–438, 2019. 3, 14

  28. [28]

    Routledge, 2021

    Eitan Altman.Constrained Markov decision processes. Routledge, 2021. 3

  29. [29]

    Sarah K Hopko, Riya Khurana, Ranjana K Mehta, and Prabhakar R Pagilla. Effect of cognitive fatigue, operator sex, and robot assistance on task performance metrics, workload, and situation awareness in human-robot collaboration.IEEE Robotics and Automation Letters, 6(2):3049–3056, 2021. 4

  30. [30]

    Psychometric curves reveal three mechanisms of vigilance decrement

    Jason S McCarley and Yusuke Yamani. Psychometric curves reveal three mechanisms of vigilance decrement. Psychological science, 32(10):1675–1683, 2021. 4, 14

  31. [31]

    Structured state space models for in-context reinforcement learning.Advances in Neural Information Processing Systems, 36:47016–47031, 2023

    Chris Lu, Yannick Schroecker, Albert Gu, Emilio Parisotto, Jakob Foerster, Satinder Singh, and Feryal Behbahani. Structured state space models for in-context reinforcement learning.Advances in Neural Information Processing Systems, 36:47016–47031, 2023. 5, 6

  32. [32]

    Simplified state space layers for sequence modeling

    Jimmy TH Smith, Andrew Warrington, and Scott Linderman. Simplified state space layers for sequence modeling. InThe Eleventh International Conference on Learning Representations, 2023. 5, 6

  33. [33]

    Efficiently modeling long sequences with structured state spaces

    Albert Gu, Karan Goel, and Christopher R´e. Efficiently modeling long sequences with structured state spaces. International Conference on Learning Representations, 2022. 5, 6

  34. [34]

    Benchmarking safe exploration in deep reinforcement learning,

    Alex Ray, Joshua Achiam, and Dario Amodei. Benchmarking safe exploration in deep reinforcement learning. arXiv preprint arXiv:1910.01708, 7(1):2, 2019. 5

  35. [35]

    Constrained Markov decision processes with total cost criteria: Lagrangian approach and dual linear program.Mathematical methods of operations research, 48(3):387–417, 1998

    Eitan Altman. Constrained Markov decision processes with total cost criteria: Lagrangian approach and dual linear program.Mathematical methods of operations research, 48(3):387–417, 1998. 5

  36. [36]

    An efficient end-to-end training approach for zero-shot human-ai coordination.Advances in Neural Information Processing Systems, 36:2636– 2658, 2023

    Xue Yan, Jiaxian Guo, Xingzhou Lou, Jun Wang, Haifeng Zhang, and Yali Du. An efficient end-to-end training approach for zero-shot human-ai coordination.Advances in Neural Information Processing Systems, 36:2636– 2658, 2023. 6 17 APREPRINT- APRIL7, 2026

  37. [37]

    Cross- environment cooperation enables zero-shot multi-agent coordination

    Kunal Jha, Wilka Carvalho, Yancheng Liang, Simon S Du, Max Kleiman-Weiner, and Natasha Jaques. Cross- environment cooperation enables zero-shot multi-agent coordination. InInternational Conference on Machine Learning, 2025. 6

  38. [38]

    Overcookedv2: Rethinking overcooked for zero-shot coordination

    Tobias Gessler, Tin Dizdarevic, Ani Calinescu, Benjamin Ellis, Andrei Lupu, and Jakob Nicolaus Foerster. Overcookedv2: Rethinking overcooked for zero-shot coordination. InInternational Conference on Learning Representations, 2025. 6

  39. [39]

    Popgym: Benchmarking partially observable reinforcement learning.The Eleventh International Conference on Learning Representations,

    Steven Morad, Ryan Kortvelesy, Matteo Bettini, Stephan Liwicki, and Amanda Prorok. Popgym: Benchmarking partially observable reinforcement learning.The Eleventh International Conference on Learning Representations,

  40. [40]

    Decision s4: Efficient sequence-based rl via state spaces layers

    Shmuel Bar David, Itamar Zimerman, Eliya Nachmani, and Lior Wolf. Decision s4: Efficient sequence-based rl via state spaces layers. InThe Eleventh International Conference on Learning Representations, 2022. 6

  41. [41]

    Decision transformer: Reinforcement learning via sequence modeling.Advances in neural information processing systems, 34:15084–15097, 2021

    Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Misha Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. Decision transformer: Reinforcement learning via sequence modeling.Advances in neural information processing systems, 34:15084–15097, 2021. 6

  42. [42]

    Stabilizing transformers for reinforcement learning

    Emilio Parisotto, Francis Song, Jack Rae, Razvan Pascanu, Caglar Gulcehre, Siddhant Jayakumar, Max Jaderberg, Raphael Lopez Kaufman, Aidan Clark, Seb Noury, et al. Stabilizing transformers for reinforcement learning. In International conference on machine learning, pages 7487–7498. PMLR, 2020. 6

  43. [43]

    Gradients are not all you need.arXiv preprint arXiv:2111.05803, 2021

    Luke Metz, C Daniel Freeman, Samuel S Schoenholz, and Tal Kachman. Gradients are not all you need.arXiv preprint arXiv:2111.05803, 2021. 6

  44. [44]

    Learning multiple layers of features from tiny images

    Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009. 6, 8

  45. [45]

    Do we train on test data? Purging CIFAR of near-duplicates.Journal of Imaging, 6(6):41, 2020

    Bj¨orn Barz and Joachim Denzler. Do we train on test data? Purging CIFAR of near-duplicates.Journal of Imaging, 6(6):41, 2020. 8

  46. [46]

    Large-scale visual sentiment ontology and detectors using adjective noun pairs

    Damian Borth, Rongrong Ji, Tao Chen, Thomas Breuel, and Shih-Fu Chang. Large-scale visual sentiment ontology and detectors using adjective noun pairs. InProceedings of the 21st ACM international conference on Multimedia, pages 223–232, 2013. 9

  47. [47]

    Coverage- constrained human-ai cooperation with multiple experts.The Fortieth AAAI Conference on Artificial Intelligence,

    Zheng Zhang, Cuong Nguyen, Kevin Wells, Thanh-Toan Do, David Rosewarne, and Gustavo Carneiro. Coverage- constrained human-ai cooperation with multiple experts.The Fortieth AAAI Conference on Artificial Intelligence,

  48. [48]

    Accuracy-rejection curves (arcs) for comparing classification methods with a reject option

    Malik Sajjad Ahmed Nadeem, Jean-Daniel Zucker, and Blaise Hanczar. Accuracy-rejection curves (arcs) for comparing classification methods with a reject option. InMachine Learning in Systems Biology, pages 65–81. PMLR, 2009. 12, 15

  49. [49]

    Time-of-day effects on mammographic film reading performance

    Helen C Cowley and Alastair G Gale. Time-of-day effects on mammographic film reading performance. In Medical Imaging 1997: Image Perception, volume 3036, pages 212–221. SPIE, 1997. 13, 15

  50. [50]

    Exploiting human-AI dependence for learning to defer

    Zixi Wei, Yuzhou Cao, and Lei Feng. Exploiting human-AI dependence for learning to defer. InInternational Conference on Machine Learning, 2024. 13

  51. [51]

    In defense of softmax parametrization for calibrated and consistent learning to defer

    Yuzhou Cao, Hussein Mozannar, Lei Feng, Hongxin Wei, and Bo An. In defense of softmax parametrization for calibrated and consistent learning to defer. In Advances in Neural Information Processing Systems, volume 36,

  52. [52]

    Two-stage learning to defer with multiple experts

    Anqi Mao, Christopher Mohri, Mehryar Mohri, and Yutao Zhong. Two-stage learning to defer with multiple experts. In Advances in Neural Information Processing Systems, 2023. 13

  53. [53]

    Learning to defer to multiple experts: Consistent surrogate losses, confidence calibration, and conformal ensembles

    Rajeev Verma, Daniel Barrejon, and Eric Nalisnick. Learning to defer to multiple experts: Consistent surrogate losses, confidence calibration, and conformal ensembles. In International Conference on Artificial Intelligence and Statistics, pages 11415–11434. PMLR, 25–27 Apr 2023. 13

  54. [54]

    Regression with multi-expert deferral

    Anqi Mao, Mehryar Mohri, and Yutao Zhong. Regression with multi-expert deferral. In International Conference on Machine Learning, 2024. 13

  55. [55]

    Learning to complement and to defer to multiple users

    Zheng Zhang, Wenjie Ai, Kevin Wells, David Rosewarne, Thanh-Toan Do, and Gustavo Carneiro. Learning to complement and to defer to multiple users. In European Conference on Computer Vision, pages 144–162. Springer, 2025. 13

  56. [56]

    Probabilistic learning to defer: Handling missing expert annotations and controlling workload distribution

    Cuong C Nguyen, Thanh-Toan Do, and Gustavo Carneiro. Probabilistic learning to defer: Handling missing expert annotations and controlling workload distribution. In The Thirteenth International Conference on Learning Representations, 2025. 13

  57. [57]

    Learning-to-defer for sequential medical decision-making under uncertainty

    Shalmali Joshi, Sonali Parbhoo, and Finale Doshi-Velez. Learning-to-defer for sequential medical decision-making under uncertainty. Transactions on Machine Learning Research, 2023. 14

  58. [58]

    Stress and human performance

    James E Driskell and Eduardo Salas. Stress and human performance. Psychology Press, 2013. 14

  59. [59]

    A drop in cognitive performance, whodunit? Subjective mental fatigue, brain deactivation or increased parasympathetic activity? It's complicated!

    Jeroen Van Cutsem, Peter Van Schuerbeek, Nathalie Pattyn, Hubert Raeymaekers, Johan De Mey, Romain Meeusen, and Bart Roelands. A drop in cognitive performance, whodunit? Subjective mental fatigue, brain deactivation or increased parasympathetic activity? It's complicated! Cortex, 155:30–45, 2022. 14

  60. [60]

    Neural and computational mechanisms of momentary fatigue and persistence in effort-based choice

    Tanja Müller, Miriam C Klein-Flügge, Sanjay G Manohar, Masud Husain, and Matthew AJ Apps. Neural and computational mechanisms of momentary fatigue and persistence in effort-based choice. Nature Communications, 12(1):4593, 2021. 14

  61. [61]

    Perceived—and not manipulated—self-control depletion predicts students' achievement outcomes in foreign language assessments

    Christoph Lindner and Jan Retelsdorf. Perceived—and not manipulated—self-control depletion predicts students' achievement outcomes in foreign language assessments. Educational Psychology, 40(4):490–508, 2020. 14

  62. [62]

    The psychology of fatigue: Work, effort and control

    Robert Hockey. The psychology of fatigue: Work, effort and control. Cambridge University Press, 2013. 14

  63. [63]

    Translating fatigue to human performance

    Roger M Enoka and Jacques Duchateau. Translating fatigue to human performance. Medicine and science in sports and exercise, 48(11):2228, 2016. 14

  64. [64]

    The effects of mental fatigue on physical performance: a systematic review

    Jeroen Van Cutsem, Samuele Marcora, Kevin De Pauw, Stephen Bailey, Romain Meeusen, and Bart Roelands. The effects of mental fatigue on physical performance: a systematic review. Sports medicine, 47(8):1569–1588,

  65. [65]

    Mental fatigue impairs physical performance in humans

    Samuele M Marcora, Walter Staiano, and Victoria Manning. Mental fatigue impairs physical performance in humans. Journal of applied physiology, 106(3):857–864, 2009. 14

  66. [66]

    Cognitive tasks elicit mental fatigue and impair subsequent physical task endurance: Effects of task duration and type

    Neil Dallaway, Samuel JE Lucas, and Christopher Ring. Cognitive tasks elicit mental fatigue and impair subsequent physical task endurance: Effects of task duration and type. Psychophysiology, 59(12):e14126, 2022. 14

  67. [67]

    Cognitive skills and their acquisition

    John R Anderson. Cognitive skills and their acquisition. Psychology Press, 2013. 14

  68. [68]

    Incorporating human fatigue and recovery into the learning–forgetting process

    Mohamad Y Jaber, ZS Givi, and W Patrick Neumann. Incorporating human fatigue and recovery into the learning–forgetting process. Applied mathematical modelling, 37(12-13):7287–7299, 2013. 14