arxiv: 2606.08508 · v1 · pith:PT2C45XLnew · submitted 2026-06-07 · 💻 cs.RO · cs.AI

ActProbe: Action-Space Probe for Early Failure Detection of Generative Robot Policies

Pith reviewed 2026-06-27 18:33 UTC · model grok-4.3

classification 💻 cs.RO cs.AI
keywords: failure detection, generative robot policies, action space, early warning, temporal consistency, robot learning, online monitoring

The pith

Action chunks from generative robot policies already carry strong predictive signals for impending failures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that generative robot policies produce action sequences whose internal patterns can reveal failures in advance, without any need to inspect the policy's internals or run extra samples. It demonstrates this by defining two compact action-derived signals and feeding them through a lightweight predictor to output failure probabilities step by step. A sympathetic reader would care because current detectors either demand white-box access or add runtime cost, while this method stays black-box and cheap yet moves the accuracy-timeliness frontier forward. If the claim holds, early detection becomes portable across policies and transferable to real robots with measurable gains in warning time and downstream learning efficiency.

Core claim

Emitted action chunks from generative robot policies already encode predictive information about failures. Two signals are extracted from a single forward pass: Temporal Consistency Error (TCE) between consecutive chunks and Action Chunk Magnitude (ACM) of the current chunk. Passed through a task-conditioned LSTM-MLP, they yield per-step failure probabilities that improve the F1-timeliness Pareto frontier by +12.7% hypervolume on average and early-detection ROC-AUC by +9.0% on unseen tasks. The detector also transfers to unseen real-robot pick tasks and accelerates PPO fine-tuning, which then needs 2.9x fewer environment interactions.
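How the paper turns per-step probabilities into an early-detection ROC-AUC is not stated on this page. The sketch below shows one plausible reading, under two assumptions: each episode is scored by the maximum failure probability within its first `horizon` steps, and ROC-AUC is computed with the standard pairwise-ranking definition. The names `early_scores`, `roc_auc`, and `horizon` are hypothetical.

```python
def early_scores(prob_traces, horizon):
    """Assumed episode score: max per-step failure probability in the first `horizon` steps.

    prob_traces: list of non-empty per-step probability lists, one per episode.
    horizon: number of leading steps to consider (must be >= 1).
    """
    if horizon < 1:
        raise ValueError("horizon must be at least 1")
    return [max(trace[:horizon]) for trace in prob_traces]


def roc_auc(scores, labels):
    """ROC-AUC as the probability that a failed episode outranks a successful one.

    labels: 1 for episodes that end in failure, 0 otherwise. Ties count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one failed and one successful episode")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Restricting the score to an early window is what makes the metric reward warnings that arrive before the failure is obvious; a different aggregation (for example, the mean over the window) would be an equally valid assumption.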

What carries the argument

ActProbe, a detector that extracts Temporal Consistency Error (TCE) and Action Chunk Magnitude (ACM) from action chunks and maps them via a task-conditioned LSTM-MLP to failure probabilities.
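The exact formulas for TCE and ACM are not reproduced on this page, so the following is only a minimal sketch under stated assumptions. TCE is taken here as the mean squared difference over the steps where two consecutive chunks overlap (the previous chunk shifted by the number of steps already executed), and ACM as the mean Euclidean norm of the actions in the current chunk. The function names and the `executed` parameter are hypothetical.

```python
import math


def temporal_consistency_error(prev_chunk, curr_chunk, executed):
    """Assumed TCE: mean squared error over the overlap of two consecutive chunks.

    prev_chunk, curr_chunk: lists of action vectors (lists of floats).
    executed: number of steps of prev_chunk that ran before the policy re-planned.
    """
    overlap = min(len(prev_chunk) - executed, len(curr_chunk))
    if overlap <= 0:
        return 0.0  # no shared horizon to compare
    total, count = 0.0, 0
    for k in range(overlap):
        old, new = prev_chunk[executed + k], curr_chunk[k]
        total += sum((a - b) ** 2 for a, b in zip(old, new))
        count += len(new)
    return total / count


def action_chunk_magnitude(chunk):
    """Assumed ACM: mean L2 norm of the actions in the current chunk."""
    return sum(math.sqrt(sum(a * a for a in action)) for action in chunk) / len(chunk)
```

Both quantities need only the chunks the policy has already emitted, which is consistent with the black-box, single-forward-pass setting described above.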

If this is right

  • Alerts can be issued before failures become visually recognizable.
  • The accuracy-timeliness trade-off improves by +12.7% hypervolume on average over internal- and external-feature baselines.
  • A +9.0% lead in early-detection ROC-AUC holds on tasks never seen during training.
  • The detector transfers directly to real-robot deployment and reduces environment interactions needed for PPO fine-tuning by a factor of 2.9.
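The hypervolume figure quoted above summarizes a whole accuracy-timeliness trade-off curve in one number. As a rough illustration, the sketch below computes the two-dimensional hypervolume of a set of (F1, timeliness) operating points, assuming both objectives are maximized, both are at least as large as the reference point, and the reference point is the origin. The paper's actual reference point and normalization are not given here; the function names are hypothetical.

```python
def pareto_front(points):
    """Keep the points not dominated by any other point (both objectives maximized)."""
    front = []
    for p in points:
        dominated = any(q != p and q[0] >= p[0] and q[1] >= p[1] for q in points)
        if not dominated:
            front.append(p)
    return sorted(set(front))


def hypervolume_2d(points, ref=(0.0, 0.0)):
    """Area dominated by the Pareto front and bounded below by the reference point."""
    area, prev_x = 0.0, ref[0]
    # Sweep by increasing first objective; along the front the second objective decreases,
    # so each strip is as tall as the current point.
    for x, y in pareto_front(points):
        area += (x - prev_x) * (y - ref[1])
        prev_x = x
    return area
```

A detector whose operating points push outward in either direction (higher F1 at the same warning time, or earlier warnings at the same F1) enlarges this area, which is why hypervolume is a natural single-number comparison between frontiers.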

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Action-only probing could be combined with minimal observation features to handle edge cases where action signals weaken.
  • The same chunk-consistency idea might extend to other autoregressive sequence generators outside robotics.
  • Because no resampling is required, the method could be inserted as a lightweight safety layer in existing deployed policies.

Load-bearing premise

The two action-derived signals together with the task-conditioned LSTM-MLP suffice to generalize failure prediction across policies, benchmarks, and unseen real-robot tasks without internal access or resampling.

What would settle it

An experiment on a new generative policy or real-robot task in which ActProbe raises no earlier alerts than visual inspection or than baselines that use policy internals or observation features.
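Comparing who alerts earlier requires a concrete alert rule. The sketch below is one simple possibility, not the paper's procedure: an alert fires at the first step where the failure probability has stayed at or above a threshold for `patience` consecutive steps, and lead time is the gap between that step and the step at which the failure is recognized. The threshold, `patience`, and both function names are hypothetical.

```python
def first_alert_step(probs, threshold=0.5, patience=1):
    """Index of the first step ending a run of `patience` probabilities >= threshold, else None."""
    run = 0
    for t, p in enumerate(probs):
        run = run + 1 if p >= threshold else 0
        if run >= patience:
            return t
    return None


def lead_time(probs, failure_step, threshold=0.5, patience=1):
    """Steps of warning before `failure_step`; None if no alert fired in time."""
    alert = first_alert_step(probs, threshold, patience)
    if alert is None or alert > failure_step:
        return None
    return failure_step - alert
```

Running the same rule on ActProbe and on each baseline, with `failure_step` set from visual inspection, would give directly comparable lead times for the experiment described above.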

Original abstract

Generative robot policies fail unpredictably at deployment: they hesitate at critical moments, drift off-task, or commit to unrecoverable actions. Existing online failure detectors either require white-box access to policy internals or add runtime overhead through resampling and observation-side signals. Our empirical analysis shows that emitted action chunks themselves already carry strong predictive signal for impending failures in generative robot policies. Motivated by this observation, we introduce ActProbe, a lightweight, pure action-space detector that uses two compact signals available from a single forward pass: Temporal Consistency Error (TCE) between consecutive action chunks and Action Chunk Magnitude (ACM) of the current chunk. ActProbe maps these signals to per-step failure probabilities with a task-conditioned LSTM-MLP architecture. Across a diverse suite of generative robot policies and benchmarks, ActProbe raises alerts before failures become visually recognizable, improving the accuracy (F1)-timeliness Pareto frontier of failure detection by an average hypervolume gain of +12.7% over both internal- and external-feature baselines, with a +9.0% early-detection ROC-AUC lead on unseen tasks. ActProbe further transfers to deployment, predicting failures on unseen real-robot pick tasks and accelerating RL fine-tuning (PPO) with 2.9x fewer environment interactions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that action chunks emitted by generative robot policies contain strong predictive signals for impending failures. It introduces ActProbe, a lightweight detector using two action-derived signals—Temporal Consistency Error (TCE) and Action Chunk Magnitude (ACM)—fed into a task-conditioned LSTM-MLP to output per-step failure probabilities. The method is evaluated across generative policies and benchmarks, reporting +12.7% average hypervolume gain on the accuracy-timeliness Pareto frontier and +9.0% ROC-AUC improvement on unseen tasks, plus successful transfer to real-robot pick tasks and 2.9x faster RL fine-tuning, all without policy internals or resampling.

Significance. If the empirical results hold under rigorous validation, ActProbe would provide a practical, low-overhead failure detector for generative policies that operates purely in action space. This could improve safety monitoring and accelerate training loops in robotics without requiring white-box access, addressing a key deployment challenge. The core observation that action chunks carry failure signal is a useful empirical contribution if the generalization claims are substantiated.

major comments (2)
  1. [Architecture description] Architecture description (likely §3): the task-conditioning mechanism for the LSTM-MLP is under-specified for unseen tasks. The generalization claim of +9.0% ROC-AUC lead and successful transfer to unseen real-robot pick tasks depends on how task embeddings or conditioning vectors are obtained or adapted at deployment; if this requires per-task labels or parameters unavailable without internal access, the zero-shot transfer result rests on an untested assumption.
  2. [Experimental results] Experimental results section (likely §4 or §5): the reported hypervolume gains and ROC-AUC improvements lack details on error bars, dataset splits, number of runs, or statistical significance testing. Without these, it is difficult to assess whether the +12.7% and +9.0% margins are robust or sensitive to benchmark selection.
minor comments (2)
  1. [Abstract] The abstract and introduction would benefit from explicit forward references to the specific tables or figures that support the hypervolume and ROC-AUC claims.
  2. [Method] Notation for TCE and ACM should be defined with equations in the main text rather than relying solely on prose descriptions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point-by-point below, indicating where revisions will strengthen the paper.

  1. Referee: [Architecture description] Architecture description (likely §3): the task-conditioning mechanism for the LSTM-MLP is under-specified for unseen tasks. The generalization claim of +9.0% ROC-AUC lead and successful transfer to unseen real-robot pick tasks depends on how task embeddings or conditioning vectors are obtained or adapted at deployment; if this requires per-task labels or parameters unavailable without internal access, the zero-shot transfer result rests on an untested assumption.

    Authors: We agree that the task-conditioning mechanism in Section 3 requires additional specification to fully support the generalization claims. In the revised manuscript we will expand the architecture description to detail exactly how task embeddings are computed from task descriptions (via a fixed encoder) and injected into the LSTM-MLP, and we will explicitly state the procedure used for unseen tasks. This clarification will confirm that no per-task labels or policy-internal parameters are required at deployment. revision: yes

  2. Referee: [Experimental results] Experimental results section (likely §4 or §5): the reported hypervolume gains and ROC-AUC improvements lack details on error bars, dataset splits, number of runs, or statistical significance testing. Without these, it is difficult to assess whether the +12.7% and +9.0% margins are robust or sensitive to benchmark selection.

    Authors: We acknowledge that the experimental results section would benefit from more rigorous statistical reporting. In the revised manuscript we will add error bars (standard deviation across seeds), explicitly describe the train/validation/test splits, report the number of independent runs, and include statistical significance tests (e.g., paired t-tests) for the reported hypervolume and ROC-AUC improvements. revision: yes
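For readers who want to see what the promised paired test involves, here is a minimal sketch of the paired t-statistic computed over per-seed scores of two methods. It assumes the two lists are aligned by seed; the function name is hypothetical, and no numbers from the paper are used. Converting the statistic into a p-value needs the t-distribution, for which a library routine such as `scipy.stats.ttest_rel` could be used instead.

```python
import math
import statistics


def paired_t_statistic(method_scores, baseline_scores):
    """Return (t, degrees_of_freedom) for per-seed paired scores.

    t = mean(d) / (stdev(d) / sqrt(n)), where d are the per-seed differences.
    """
    n = len(method_scores)
    if n != len(baseline_scores) or n < 2:
        raise ValueError("need two aligned lists with at least two seeds")
    diffs = [m - b for m, b in zip(method_scores, baseline_scores)]
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation (n - 1 in the denominator)
    if sd_d == 0.0:
        raise ValueError("all differences are identical; t-statistic is undefined")
    return mean_d / (sd_d / math.sqrt(n)), n - 1
```

Pairing by seed matters because it removes seed-to-seed variance shared by both methods, so a consistent margin can be significant even when the raw scores overlap.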

Circularity Check

0 steps flagged

No circularity; purely empirical detector with no derivations or self-referential fits

The paper defines two action-derived signals (TCE, ACM) from emitted chunks, then trains a standard task-conditioned LSTM-MLP to map them to failure probabilities. This is conventional supervised learning on observable inputs; the reported gains are measured against external baselines on held-out tasks and real-robot data. No equations, uniqueness theorems, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The central claim reduces to empirical validation rather than any input-equivalent construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no explicit free parameters, axioms, or invented entities identifiable. The LSTM-MLP architecture and task-conditioning are treated as standard components.

pith-pipeline@v0.9.1-grok · 5783 in / 910 out tokens · 18368 ms · 2026-06-27T18:33:52.060913+00:00 · methodology


A. Architecture and training hyperparameters

Training loss. The probe is trained with the per-step binary cross-entropy, averaged over valid (non-padded) timesteps: $L = -\frac{1}{\sum_i T_i} \sum_i \sum_{t=1}^{T_i} y_i \log s$ ...
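The loss formula is cut off on this page, so the sketch below fills in only what the prose states: binary cross-entropy per step, averaged over non-padded timesteps. Two assumptions are made and marked in the code: the label is one value per episode (as the visible y_i suggests), and the second term is the standard (1 - y) log(1 - s) form of binary cross-entropy. The function name and the `eps` clamp are hypothetical.

```python
import math


def masked_bce(scores, labels, lengths, eps=1e-7):
    """Per-step binary cross-entropy averaged over valid (non-padded) timesteps.

    scores: padded per-step failure probabilities, one list per episode.
    labels: assumed per-episode labels (1 = failure, 0 = success).
    lengths: number of valid steps T_i in each episode; padding beyond T_i is ignored.
    """
    total, n_valid = 0.0, 0
    for s_i, y_i, t_i in zip(scores, labels, lengths):
        for t in range(t_i):
            s = min(max(s_i[t], eps), 1.0 - eps)  # clamp to keep log() finite
            # Standard BCE term; the (1 - y) part is assumed, since the source is truncated.
            total += y_i * math.log(s) + (1 - y_i) * math.log(1.0 - s)
        n_valid += t_i
    return -total / n_valid
```

Dividing by the total number of valid steps, rather than averaging per episode first, matches the 1 / sum_i T_i normalization visible in the formula and weights long episodes more heavily.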