arxiv: 2605.20270 · v1 · pith:IKRKQWDInew · submitted 2026-05-18 · 💻 cs.LG · cs.AI · stat.ML

Conformal Selective Acting: Anytime-Valid Risk Control for RLVR-Trained LLMs

Pith reviewed 2026-05-21 07:27 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · stat.ML
keywords conformal prediction · anytime validity · selective risk control · RLVR · LLM deployment · e-process · risk bounds

The pith

Conformal Selective Acting wraps RLVR-trained LLMs to deliver anytime-pathwise selective risk control at every round.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Conformal Selective Acting as a deployment wrapper for specialist LLMs fine-tuned with reinforcement learning from verifiable rewards. It addresses the need for per-deployment safety certificates in regulated settings without waiting for long-run averages or pooling data across deployments. The method maintains a Ville-type e-process for each threshold on a Bonferroni grid and evaluates it against the RLVR filtration. Under the assumptions of predictable updates and isotonic-calibrated monotone risk, it establishes bounds on selective risk that hold pathwise at every time step. This enables operators to certify and release decisions with controlled error at each round while maintaining high release rates.

Core claim

Conformal Selective Acting fills an empty cell in the (test statistic, validity guarantee, deployment rule) framework by using a per-round wrapper that maintains a Ville-type e-process per threshold on a Bonferroni grid evaluated against the RLVR filtration. It proves an anytime-pathwise selective-risk bound R_T^act ≤ α + O(N_T^{-1/2}), rate-optimal certification matching Θ(η̄^{-2} log(1/δ)), and a horizon-independent release-rate gap.
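
Why a per-threshold e-process on a Bonferroni grid can give an anytime guarantee is easiest to see through Ville's inequality. The sketch below is illustrative only: the grid size $m$, the equal split $\delta_q = \delta/m$, the null hypothesis $H_q$ (acting at threshold $q$ is unsafe), and the set $\mathcal{Q}_0$ of unsafe thresholds are assumed notation, and the paper's own construction may allocate $\delta$ differently.

```latex
% Ville's inequality: (E_t^{(q)}) is a nonnegative supermartingale
% with E_0^{(q)} = 1 whenever the null H_q holds.
\Pr_{H_q}\!\left( \exists\, t \ge 1 : E_t^{(q)} \ge \delta_q^{-1} \right) \le \delta_q .

% Union (Bonferroni) bound over the unsafe thresholds Q_0 on the grid,
% assuming the equal split \delta_q = \delta / m.
\Pr\!\left( \exists\, q \in \mathcal{Q}_0,\ \exists\, t \ge 1 :
            E_t^{(q)} \ge \delta_q^{-1} \right)
  \le \sum_{q \in \mathcal{Q}_0} \delta_q \le \delta .
```

Because the event covers all $t$ at once, a threshold certified at any round stays validly certified without a pre-specified horizon.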

What carries the argument

The central mechanism is the Conformal Selective Acting wrapper that constructs and evaluates a Ville-type e-process per threshold on a Bonferroni grid to achieve selective risk control.
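
A minimal sketch of such a wrapper follows. It is not the authors' algorithm. It assumes a betting-style e-process with a fixed bet `lam`, an equal Bonferroni split of `delta`, a surrogate score for which lower means safer (so acting at a larger threshold is more permissive), and a verifier loss in {0, 1} that is observed every round. The class and method names are hypothetical.

```python
import math


class ThresholdEProcess:
    """Betting-style e-process for one candidate threshold q.

    Assumed null: the selective risk when acting at threshold q is at
    least alpha. Under that null each factor 1 + lam * (alpha - loss)
    has conditional mean <= 1, so the running product is a nonnegative
    supermartingale.
    """

    def __init__(self, q, alpha, lam):
        if not 0.0 < lam < 1.0 / (1.0 - alpha):
            raise ValueError("lam must lie in (0, 1/(1-alpha))")
        self.q = q
        self.alpha = alpha
        self.lam = lam
        self.log_e = 0.0        # log of the e-process, starts at log(1)
        self.certified = False  # sticky: Ville covers the running supremum

    def update(self, loss, log_bar):
        self.log_e += math.log1p(self.lam * (self.alpha - loss))
        if self.log_e >= log_bar:
            self.certified = True


class SelectiveActingSketch:
    """One e-process per grid threshold, gated by the largest certified one."""

    def __init__(self, thresholds, alpha, delta, lam=None):
        if lam is None:
            lam = 0.5 / (1.0 - alpha)  # fixed bet, hence trivially predictable
        self.procs = [ThresholdEProcess(q, alpha, lam) for q in sorted(thresholds)]
        # Bonferroni: each threshold gets delta / m, so the bar is m / delta.
        self.log_bar = math.log(len(self.procs) / delta)

    def gate(self):
        """Largest certified threshold, or None if nothing is certified yet."""
        certified = [p.q for p in self.procs if p.certified]
        return max(certified) if certified else None

    def act(self, score):
        """Release only if the score clears the current gate."""
        q = self.gate()
        return q is not None and score <= q

    def update(self, score, loss):
        """Feed the verified loss to every threshold that would have acted."""
        for p in self.procs:
            if score <= p.q:
                p.update(loss, self.log_bar)
```

In a deployment loop the order matters: call `act(score)` first, using only past data, and call `update(score, loss)` once the verifier has returned the loss. That ordering is what keeps the gate predictable with respect to the stream.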

If this is right

  • The selective risk stays below α plus a term that shrinks as N_T^{-1/2}.
  • Certification matches the optimal rate Θ(η̄^{-2} log(1/δ)).
  • The release rate gap stays independent of the time horizon.
  • It satisfies pathwise validity and non-refusing deployment on all tested benchmark and shift cells.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This wrapper could apply to other online updating models that satisfy predictable updates.
  • Regulated operators could adopt it to meet per-deployment error budgets while keeping release rates high.
  • Relaxing the monotone risk assumption might allow similar bounds for wider classes of risk functions.

Load-bearing premise

The central claim relies on the model updates being predictable and the risk function being isotonic-calibrated and monotone with respect to the threshold.

What would settle it

Observing a deployment stream that satisfies predictable updates and monotone risk, yet whose running selective risk R_T^act exceeds α plus a term shrinking as N_T^{-1/2}, would falsify the bound.
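
A check of this kind could be run on logged deployment data along the following lines. This is a hedged sketch: it assumes N_T counts the rounds on which the wrapper acted, and the constant `c` standing in for the O(·) term is hypothetical, since the summary does not state the paper's constant.

```python
import math


def running_selective_risk(acted, losses):
    """Error rate among acted rounds, reported after every round."""
    n_acted, n_err, risks = 0, 0, []
    for a, loss in zip(acted, losses):
        if a:
            n_acted += 1
            n_err += loss
        # Before the first acted round the risk is reported as 0.0.
        risks.append(n_err / n_acted if n_acted else 0.0)
    return risks


def first_violation(acted, losses, alpha, c):
    """Index of the first round where the risk exceeds alpha + c / sqrt(N_t).

    Returns None if the envelope is never crossed. `c` is a hypothetical
    stand-in for the constant hidden in the O(N_T^{-1/2}) term.
    """
    n_acted, n_err = 0, 0
    for t, (a, loss) in enumerate(zip(acted, losses)):
        if not a:
            continue  # the risk only changes on acted rounds
        n_acted += 1
        n_err += loss
        if n_err / n_acted > alpha + c / math.sqrt(n_acted):
            return t
    return None
```

A crossing found this way is evidence against the bound only if the two assumptions also hold on that stream, so it should be read together with a check of those assumptions.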

Figures

Figures reproduced from arXiv: 2605.20270 by Hamed Khosravi, Xiaoming Huo.

Figure 1: CSA as a plug-and-play deployment wrapper. Top (blue): the RLVR system is unchanged. Middle (green): CSA computes a surrogate score, updates one e-process per candidate threshold, certifies when the e-process crosses δ_q^{-1}, and gates release via the largest certified threshold. Bottom (red): the anytime-valid guarantee holds simultaneously for all T under predictable updates and monotone risk, without ex…
Figure 2: Empirical validation of the RLVR structural layer on the live LoRA experiment (K=8 SFT rounds, 100 MATH problems each). (a) Estimated oracle frontier q_k^⋆; red dashed lines mark drop rounds (D_T = 3). (b) Per-round slack ξ_k; nonzero only at drop rounds. (c) Cumulative slack B_T = Σ ξ_k ≈ 0.40.
Figure 3: Per-benchmark, per-α phase-budget view. Each panel is a benchmark; x-axis is the 6-α grid; each point is one (method, α) cell with Risk on the y-axis and AR encoded by point size. The y=α diagonal (dotted) is the pathwise-safety boundary; the shaded region above is the violation zone. CSA stays strictly at or below the diagonal on every (benchmark, α) cell.
Figure 4: Risk vs. nominal budget α across all 8 benchmarks (10 replications per cell). Dotted diagonal y = α: pathwise-safety boundary. Shaded red: violation zone. CSA is the only method whose Risk stays at or below the diagonal on every benchmark at every α.
Figure 5: Safe-Coverage radar, one mini-radar per method with at least one nonzero vertex. Vertex at benchmark b is Cov at α_b^⋆ if the method is pathwise-safe, else 0.
Figure 6: Live RLVR + online LoRA on Qwen2.5-Math-7B (MATH cell, T=4,000, 20 replications). Top: running selective risk R_t^act for all ten methods; dashed: y=α; red: violation zone. Bottom: running action rate AR_t restricted to methods whose final risk is at or below α.
Original abstract

A local specialist LLM, fine-tuned with reinforcement learning from verifiable rewards (RLVR) on operator-local data, is installed in a regulated organization with per-deployment error budget $\alpha$. The operator needs a safety certificate for this deployment's stream at every round: no pooling across deployments, no waiting for a long-run average. Existing wrappers cannot deliver this on adaptive, online-updated streams: offline conformal-risk methods require exchangeability; online-conformal methods bound only long-run averages; non-exchangeable extensions are marginally valid; and the closest anytime wrapper, A-RCPS, controls marginal rather than selective risk. Using a (test statistic, validity guarantee, deployment rule) framework, we identify one empty cell forced by deployment requirements: e-process per threshold, selective risk, anytime-pathwise validity, max-certified-threshold rule. Conformal Selective Acting (CSA) fills it as a per-round wrapper maintaining a Ville-type e-process per threshold on a Bonferroni grid, evaluated against the RLVR filtration. Under predictable updates and isotonic-calibrated monotone risk we prove (i) an anytime-pathwise selective-risk bound $R_T^{\mathrm{act}}\le\alpha+O(N_T^{-1/2})$, (ii) rate-optimal certification matching $\Theta(\bar\eta^{-2}\log(1/\delta))$, and (iii) a horizon-independent release-rate gap. Across eight specialist benchmarks ($480$ streams), sixteen adversarial distribution-shift cells ($160$ streams), and five live Expert-Iteration RLVR cells with online LoRA over four base models in three architecture families ($10{,}300$ rounds), CSA is the only method among ten compared that satisfies pathwise validity and non-refusing deployment on every cell. We do not propose a new LLM, training algorithm, or policy class; CSA is the deployment-side complement, orthogonal to the model, for operators who cannot use a frontier API.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Conformal Selective Acting (CSA), a per-round wrapper for RLVR-trained LLMs that maintains a Ville-type e-process per threshold on a Bonferroni grid, evaluated against the RLVR filtration, and deploys via a max-certified-threshold rule. Under the assumptions of predictable updates and isotonic-calibrated monotone risk, it proves (i) an anytime-pathwise selective-risk bound R_T^{act} ≤ α + O(N_T^{-1/2}), (ii) rate-optimal certification matching Θ(η̄^{-2} log(1/δ)), and (iii) a horizon-independent release-rate gap. Across 480 benchmark streams, 160 adversarial-shift streams (sixteen cells), and 10,300 live Expert-Iteration rounds with online LoRA, CSA is the only one of the ten methods compared that satisfies pathwise validity and non-refusing deployment on every cell.

Significance. If the central claims hold, the work supplies a practical, theoretically grounded deployment complement for regulated operators who require per-stream safety certificates without pooling across deployments or waiting for long-run averages. It receives credit for identifying an empty cell in the (test statistic, validity guarantee, deployment rule) framework, for applying standard e-process and conformal ingredients to the online RLVR setting, and for the scale of the empirical evaluation (480 benchmark streams, 160 adversarial-shift streams, and 10,300 live rounds), which directly tests pathwise validity.

major comments (2)
  1. [§3.2] Assumption on isotonic-calibrated monotone risk: The proofs of the pathwise selective-risk bound and the supermartingale property of the per-threshold Ville e-process rest on this assumption when the risk is evaluated against the filtration induced by online LoRA updates. The manuscript invokes the assumption for the 10,300 live Expert-Iteration rounds and the 160 adversarial-shift streams, yet reports no diagnostic (e.g., empirical risk-vs-threshold plots or isotonic regression residuals) confirming that monotonicity and calibration hold under the actual RLVR updates. If the assumption is violated on even a positive fraction of rounds, the claimed pathwise control does not follow.
  2. [Theorem 1 / §4.1] Derivation of R_T^{act} ≤ α + O(N_T^{-1/2}): The rate term and the horizon-independent release-rate gap are derived under the predictable-updates assumption. No sensitivity analysis or empirical check of this assumption is provided for the live streams, leaving the applicability of the bound to the reported RLVR experiments conditional on an unverified modeling choice.
minor comments (2)
  1. [§3] The Bonferroni grid construction and the precise definition of the max-certified-threshold rule would benefit from an explicit algorithmic pseudocode block in §3.
  2. [Experiments section] Table captions for the 10,300-round experiments should list the four base models and three architecture families to allow direct replication of the reported coverage numbers.
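
The monotonicity diagnostic requested in the first major comment could look roughly like the sketch below. It assumes the same score orientation as a lower-is-safer surrogate (act when the score is at most q), uses a plain pool-adjacent-violators fit rather than any routine from the paper, and weights each threshold by its number of selected rounds, which is a heuristic because the selected sets overlap. All function names are hypothetical, and the check covers monotonicity only, not calibration.

```python
def empirical_risk_curve(scores, losses, thresholds):
    """(q, empirical selective risk, count) for each threshold with data."""
    curve = []
    for q in sorted(thresholds):
        picked = [l for s, l in zip(scores, losses) if s <= q]
        if picked:  # skip thresholds that select no rounds
            curve.append((q, sum(picked) / len(picked), len(picked)))
    return curve


def pava_nondecreasing(values, weights):
    """Weighted least-squares nondecreasing fit (pool adjacent violators)."""
    blocks = []  # each block is [mean, total weight, number of points]
    for v, w in zip(values, weights):
        blocks.append([v, w, 1])
        # Merge backwards while the last two blocks violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            blocks.append([(m1 * w1 + m2 * w2) / (w1 + w2), w1 + w2, c1 + c2])
    fit = []
    for mean, _, count in blocks:
        fit.extend([mean] * count)
    return fit


def max_isotonic_residual(scores, losses, thresholds):
    """Largest gap between the empirical risk curve and its isotonic fit.

    Zero means the empirical curve is already nondecreasing in q.
    """
    curve = empirical_risk_curve(scores, losses, thresholds)
    risks = [r for _, r, _ in curve]
    fit = pava_nondecreasing(risks, [n for _, _, n in curve])
    return max((abs(r - f) for r, f in zip(risks, fit)), default=0.0)
```

A residual near zero on every stream would be consistent with the monotone-risk assumption; a large residual would point to the rounds where it deserves a closer look.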

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive feedback on our manuscript. The comments highlight important points regarding the empirical support for our modeling assumptions. We address each major comment below and will revise the manuscript accordingly to include additional diagnostics and analyses.

Point-by-point responses
  1. Referee: [§3.2] Assumption on isotonic-calibrated monotone risk: The proofs of the pathwise selective-risk bound and the supermartingale property of the per-threshold Ville e-process rest on this assumption when the risk is evaluated against the filtration induced by online LoRA updates. The manuscript invokes the assumption for the 10,300 live Expert-Iteration rounds and the 160 adversarial-shift streams, yet reports no diagnostic (e.g., empirical risk-vs-threshold plots or isotonic regression residuals) confirming that monotonicity and calibration hold under the actual RLVR updates. If the assumption is violated on even a positive fraction of rounds, the claimed pathwise control does not follow.

    Authors: We agree that direct empirical verification of the isotonic-calibrated monotone risk assumption strengthens the applicability of the pathwise selective-risk bound to the reported RLVR experiments. In the revised manuscript we will add risk-versus-threshold plots together with isotonic regression residual diagnostics for the 10,300 live Expert-Iteration rounds and the 160 adversarial-shift streams. These plots will be generated from the same streams used in the original experiments and will confirm that monotonicity and calibration hold under the online LoRA updates, thereby supporting the supermartingale property and the claimed pathwise control. revision: yes

  2. Referee: [Theorem 1 / §4.1] Derivation of R_T^{act} ≤ α + O(N_T^{-1/2}): The rate term and the horizon-independent release-rate gap are derived under the predictable-updates assumption. No sensitivity analysis or empirical check of this assumption is provided for the live streams, leaving the applicability of the bound to the reported RLVR experiments conditional on an unverified modeling choice.

    Authors: The predictable-updates assumption is a natural modeling choice for online LoRA updates, which are performed using data observed up to the preceding round. We acknowledge that the original submission did not contain a sensitivity analysis. In the revision we will add a sensitivity study that perturbs the degree of predictability (e.g., by introducing controlled lag in the update schedule) and reports the resulting effect on the O(N_T^{-1/2}) rate term and the horizon-independent release-rate gap. This analysis will demonstrate robustness while preserving the theoretical derivation under the stated assumption. revision: yes

Circularity Check

0 steps flagged

No circularity: bounds derived from e-process and conformal methods under explicit external assumptions

The paper applies standard Ville-type e-processes and conformal risk control to the RLVR online-update setting. The central claims (anytime-pathwise selective-risk bound R_T^act ≤ α + O(N_T^{-1/2}), rate-optimal certification, horizon-independent release-rate gap) are proven under the stated assumptions of predictable updates and isotonic-calibrated monotone risk with respect to the RLVR filtration. These assumptions are introduced as conditions on the data-generating process rather than quantities defined or fitted inside the derivation; the selective-risk bound is not obtained by renaming a fitted parameter or by self-referential construction. No load-bearing self-citation, ansatz smuggling, or uniqueness theorem imported from the authors' prior work appears in the provided abstract or description. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on two domain assumptions required for the filtration and the risk bound; no free parameters or new entities are introduced in the abstract.

axioms (2)
  • domain assumption predictable updates
    Invoked when the method is evaluated against the RLVR filtration for online streams.
  • domain assumption isotonic-calibrated monotone risk
    Required to obtain the selective-risk bound R_T^{act} ≤ α + O(N_T^{-1/2}).

