Towards Robust Endogenous Reasoning: Unifying Drift Adaptation in Non-Stationary Tuning

En Yu; Jie Lu; Wei Duan; Xiaoyu Yang

arxiv: 2604.15705 · v1 · submitted 2026-04-17 · 💻 cs.LG

Xiaoyu Yang, En Yu, Wei Duan, Jie Lu

Pith reviewed 2026-05-10 08:25 UTC · model grok-4.3

classification 💻 cs.LG

keywords reasoning · drift · endogenous · multi-modal · across · mllms · autonomous · concept

The pith

CPO++ adapts reinforcement fine-tuning of MLLMs to endogenous multi-modal concept drift through counterfactual reasoning and preference optimization, yielding better coherence and cross-domain robustness in safety-critical settings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Multi-modal large language models process images, text, and other inputs together. When they are fine-tuned using reinforcement methods, their step-by-step reasoning can shift unpredictably on its own, even if the outside world stays the same. The authors call this endogenous reasoning drift and treat it as a form of concept drift across modalities. Their CPO++ method creates controlled changes in both thinking and perception steps, then uses preference optimization to break unwanted links between inputs and outputs. Tests in medical and driving scenarios show improved consistency and the ability to handle new situations without retraining.
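The preference-optimization step described above can be illustrated with the standard Direct Preference Optimization (DPO) loss, which the Figure 7 caption names as the baseline. The sketch below is not the paper's CPO++ objective, which this page does not reproduce. It assumes, for illustration only, that the original reasoning trace is the preferred response and a counterfactually perturbed trace is the dispreferred one. The names `dpo_loss` and `softplus` and the default `beta=0.1` are hypothetical choices.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed temperature, not a value from the paper
) -> float:
    """Standard DPO loss for a single preference pair.

    Each argument is the summed token log-probability of a full response
    under the trained policy or under the frozen reference model.
    """
    # Implicit reward of a response: beta * (policy log-prob - reference log-prob).
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) equals softplus(-margin).
    return softplus(-margin)


# Illustrative call: the policy favours the chosen trace more than the
# reference does, so the loss falls below log(2).
example = dpo_loss(-1.0, -3.0, -2.0, -2.0)
print(example)
```

When the policy and the reference agree on both responses the margin is zero and the loss equals log 2; the loss shrinks as the policy raises the chosen response relative to the rejected one.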

Core claim

MLLMs are highly susceptible to endogenous reasoning drift... CPO++ achieves superior performance in reasoning coherence, decision-making precision, and inherent robustness against extreme interference with exceptional zero-shot cross-domain generalization.

Load-bearing premise

That controlled counterfactual perturbations combined with preference optimization can reliably disentangle spurious correlations caused by endogenous drift without introducing new instabilities or domain-specific biases.

Figures

Figures reproduced from arXiv: 2604.15705 by En Yu, Jie Lu, Wei Duan, Xiaoyu Yang.

**Figure 1.** Endogenous Reasoning Drift in RFT. probability and semantic differentiation during the thinking process, the resulting predictions for distinct pathologies can become diametrically opposed. This instability highlights a critical vulnerability where the reasoning trajectory of the model undergoes a systemic divergence, unmooring the final decision from its original logical premises. We further extend the an…

**Figure 2.** The proposed Counterfactual Preference Optimization ++ (CPO++) framework. To mitigate endogenous reasoning drift, the methodology theoretically characterizes it as multi-modal concept drift, and incorporates counterfactual inference to disentangle spurious correlations from genuine causal logic within the original outputs. By leveraging hierarchical domain knowledge and perception-thinking consistency prot…

**Figure 3.** Structural Causal Graph. X: Inputs, Z: Prediction Results, T: Chain-of-Thought, and D: Latent Concept Drift within Non-Stationary CoT. Fortunately, counterfactual causes provide an explicit manner to decouple these two competing goals. We construct a structural causal graph [61], [62] to formalize the causal…

**Figure 4.** Example Case of Hierarchical Domain Knowledge Graph in Medical Domain. To disentangle detrimental drift, we introduce the graph that generates plausible counterfactual CoTs through controlled attribute perturbations. Green lines represent attributes that are positively associated with the disease, while the red denotes that they are exclusive. • Entities (E): The core objects of interest (e.g., Pneumonia …

**Figure 5.** Qualitative evaluation of the attention scores over visual tokens during CoT decoding. When the term ’lung opacity’ is subtly altered to ’opacity’, the model still produces high responses in key areas, such as the visual tokens at the right side of ① and the pneumonia ②. To qualitatively evaluate the effectiveness of the proposed framework in mitigating endogenous reasoning drift, a visualization of cross…

**Figure 6.** Quantitative evaluation of diagnostic robustness against counterfactual interference on the medical MSCXR-T [84] dataset. It reports the Top-1 accuracy across five pulmonary pathologies, including consolidation (Con.), pleural effusion (PE), pneumonia (Pna.), pneumothorax (Pnx.), and Edema (Ede.) and their overall average (Avg.). To simulate non-stationary and complex reasoning, varying ratios of counterf…

**Figure 7.** Ablation study of the counterfactual decoupling mechanism. The evaluation systematically compares the DPO baseline against isolated interventions, specifically integrating only reasoning counterfactuals or only visual counterfactuals, alongside the complete CPO++ framework featuring dual alignment. Performance is rigorously measured across four critical dimensions: 1) reasoning capability, evaluated via th…
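The Figure 4 caption describes generating counterfactual CoTs through controlled attribute perturbations over a hierarchical domain knowledge graph. A minimal sketch of that idea follows, assuming a flat toy graph in which each entity lists positively associated attributes (green edges) and exclusive attributes (red edges), and assuming the perturbed trace serves as the rejected side of a preference pair. The `Entity` class, both function names, and the attribute `clear lung fields` are hypothetical placeholders rather than content from the paper; only `Pneumonia` and `lung opacity` appear in the captions.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    """One node of a toy domain knowledge graph (hypothetical structure)."""
    name: str
    positive: list[str]   # attributes positively associated with the entity
    exclusive: list[str]  # attributes that are exclusive with the entity


def make_counterfactuals(cot: str, entity: Entity) -> list[str]:
    """Swap each positive attribute found in the CoT for each exclusive one."""
    variants = []
    for pos in entity.positive:
        if pos not in cot:
            continue  # only perturb attributes the trace actually mentions
        for neg in entity.exclusive:
            variants.append(cot.replace(pos, neg))
    return variants


def preference_pairs(prompt: str, cot: str, entity: Entity) -> list[dict]:
    """Pair the original CoT (chosen) with each perturbed CoT (rejected)."""
    return [
        {"prompt": prompt, "chosen": cot, "rejected": cf}
        for cf in make_counterfactuals(cot, entity)
    ]


# Placeholder graph entry; "clear lung fields" is an illustrative stand-in.
pneumonia = Entity("Pneumonia", ["lung opacity"], ["clear lung fields"])
pairs = preference_pairs(
    "Describe the finding.", "The image shows lung opacity.", pneumonia
)
print(pairs)
```

String replacement keeps the perturbation controlled: everything except the targeted attribute stays identical, so any change in preference is attributable to that attribute alone.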

Original abstract

Reinforcement Fine-Tuning (RFT) has established itself as a critical paradigm for aligning Multi-modal Large Language Models (MLLMs) with complex human values and domain-specific requirements. However, current research primarily focuses on mitigating exogenous distribution shifts arising from data-centric factors, while the non-stationarity inherent in endogenous reasoning remains largely unexplored. This work reveals a critical vulnerability in MLLMs: they are highly susceptible to endogenous reasoning drift, across both thinking and perception perspectives. The drift manifests as unpredictable distribution changes that emerge spontaneously during the autoregressive generation process, independent of external environmental perturbations. To adapt to it, we first theoretically define endogenous reasoning drift within the RFT of MLLMs as multi-modal concept drift. On this basis, this paper proposes Counterfactual Preference Optimization ++ (CPO++), a comprehensive and autonomous framework adapted to multi-modal concept drift. It integrates counterfactual reasoning with domain knowledge to execute controlled perturbations across thinking and perception, employing preference optimization to disentangle spurious correlations. Extensive empirical evaluations across two highly dynamic and safety-critical domains, medical diagnosis and autonomous driving, demonstrate that the proposed framework achieves superior performance in reasoning coherence, decision-making precision, and inherent robustness against extreme interference. The methodology also exhibits exceptional zero-shot cross-domain generalization, providing a principled foundation for reliable multi-modal reasoning in safety-critical applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Circularity Check

0 steps flagged

No significant circularity; derivation chain is conceptual with no equations or reductions shown

full rationale

The abstract and available text present a high-level framework proposal (CPO++) and a conceptual definition of endogenous reasoning drift as multi-modal concept drift, but contain no mathematical derivations, equations, parameter fits, or self-citations. No load-bearing steps reduce to inputs by construction, and the description does not invoke uniqueness theorems or rename known results. This is the expected honest non-finding when the paper's chain is not mathematically specified.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Abstract introduces endogenous reasoning drift as a new theoretical construct and relies on the assumption that preference optimization can separate spurious correlations; no explicit free parameters or external benchmarks are stated.

axioms (1)

domain assumption: Endogenous reasoning drift manifests as unpredictable distribution changes during autoregressive generation, independent of external perturbations.
Directly stated in the abstract as the core vulnerability to be addressed.
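
For orientation, the axiom can be read against the standard definition of concept drift from the stream-learning literature (for example, the review listed as [50]). The notation below is the conventional one and is not taken from the paper, whose own multi-modal formalization is not reproduced on this page.

```latex
% Standard concept drift: the joint distribution changes between time steps.
\exists\, t:\quad P_t(X, y) \neq P_{t+1}(X, y)

% Decomposition commonly used to locate the source of the change.
P_t(X, y) = P_t(X)\, P_t(y \mid X)
```

A change in $P_t(X)$ alone is usually called virtual drift, and a change in $P_t(y \mid X)$ real drift. One plausible reading, offered here as an assumption, is that the exogenous shifts mentioned in the abstract act mainly on the input term, while endogenous reasoning drift concerns the model's own conditional over generated reasoning with the inputs held fixed.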

invented entities (1)

endogenous reasoning drift (no independent evidence)
purpose: To capture spontaneous multi-modal distribution shifts inside MLLM generation.
Newly defined in the paper as distinct from exogenous shifts.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Autonomous Drift Learning in Data Streams: A Unified Perspective
cs.LG · 2026-05 · unverdicted · novelty 7.0

A survey proposes a novel 3D taxonomy classifying drifts into time stream, data stream, and model stream categories to unify research on non-stationary autonomous learning.

Reference graph

Works this paper leans on

114 extracted references · 114 canonical work pages · cited by 1 Pith paper · 15 internal anchors

[1]

ReFT: Reasoning with Reinforced Fine-Tuning

L. Trung, X. Zhang, Z. Jie, P. Sun, X. Jin, and H. Li, “ReFT: Reasoning with Reinforced Fine-Tuning,” in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), L.-W. Ku, A. Martins, and V. Srikumar, Eds. Bangkok, Thailand: Association for Computational Linguistics, 2024, pp. 7601–7614

[2]

Visual-RFT: Visual Reinforcement Fine-Tuning

Z. Liu, Z. Sun, Y. Zang, X. Dong, Y. Cao, H. Duan, D. Lin, and J. Wang, “Visual-rft: Visual reinforcement fine-tuning,” arXiv preprint arXiv:2503.01785, 2025

[3]

Reason-rft: Reinforcement fine-tuning for visual reasoning of vision language models

H. Tan, Y. Ji, X. Hao, X. Chen, P. Wang, Z. Wang, and S. Zhang, “Reason-rft: Reinforcement fine-tuning for visual reasoning of vision language models,” in The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

[4]

Sft memorizes, rl generalizes: A comparative study of foundation model post-training

T. Chu, Y. Zhai, J. Yang, S. Tong, S. Xie, D. Schuurmans, Q. V. Le, S. Levine, and Y. Ma, “Sft memorizes, rl generalizes: A comparative study of foundation model post-training,” in International Conference on Machine Learning. PMLR, 2025, pp. 10 818–10 838

[5]

RL fine-tuning heals the OOD forgetting in SFT

H. Jin, S. Luan, S. Lyu, G. Rabusseau, D. Precup, and M. Hamdaqa, “RL fine-tuning heals the OOD forgetting in SFT,” in First Workshop on Foundations of Reasoning in Language Models, 2025. [Online]. Available: https://openreview.net/forum?id=SN1PCQ0ApV

[6]

Reinforcement learning for out-of-distribution reasoning in LLMs: An empirical study on diagnosis-related group coding

H. Wang, Z. Wu, G. J. Kolar, H. R. Korsapati, B. Bartlett, B. Hull, and J. Sun, “Reinforcement learning for out-of-distribution reasoning in LLMs: An empirical study on diagnosis-related group coding,” in The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. [Online]. Available: https://openreview.net/forum?id=0jvnfH0WYV

[7]

Information-theoretic reward modeling for stable rlhf: Detecting and mitigating reward hacking

Y. Miao, L. Ding, S. Zhang, R. Bao, L. Zhang, and D. Tao, “Information-theoretic reward modeling for stable rlhf: Detecting and mitigating reward hacking,” arXiv preprint arXiv:2510.13694, 2025

[8]

Reward shaping to mitigate reward hacking in RLHF

J. Fu, X. Zhao, C. Yao, H. Wang, Q. Han, and Y. Xiao, “Reward shaping to mitigate reward hacking in RLHF,” in ICML 2025 Workshop on Reliable and Responsible Foundation Models, 2025. [Online]. Available: https://openreview.net/forum?id=62A4d5Mokc

[9]

RRM: Robust reward model training mitigates reward hacking

T. Liu, W. Xiong, J. Ren, L. Chen, J. Wu, R. Joshi, Y. Gao, J. Shen, Z. Qin, T. Yu, D. Sohn, A. Makarova, J. Z. Liu, Y. Liu, B. Piot, A. Ittycheriah, A. Kumar, and M. Saleh, “RRM: Robust reward model training mitigates reward hacking,” in The Thirteenth International Conference on Learning Representations, 2025. [Online]. Available: https://openreview.ne...

[10]

DAPO: An open-source LLM reinforcement learning system at scale

Q. Yu, Z. Zhang, R. Zhu, Y. Yuan, X. Zuo, YuYue, W. Dai, T. Fan, G. Liu, J. Liu, L. Liu, X. Liu, H. Lin, Z. Lin, B. Ma, G. Sheng, Y. Tong, C. Zhang, M. Zhang, R. Zhang, W. Zhang, H. Zhu, J. Zhu, J. Chen, J. Chen, C. Wang, H. Yu, Y. Song, X. Wei, H. Zhou, J. Liu, W.-Y. Ma, Y.-Q. Zhang, L. Yan, Y. Wu, and M. Wang, “DAPO: An open-source LLM reinforceme...

[11]

Beyond reverse KL: Generalizing direct preference optimization with diverse divergence constraints

C. Wang, Y. Jiang, C. Yang, H. Liu, and Y. Chen, “Beyond reverse KL: Generalizing direct preference optimization with diverse divergence constraints,” in The Twelfth International Conference on Learning Representations, 2024. [Online]. Available: https://openreview.net/forum?id=2cRzmWXK9N

[12]

Is DPO superior to PPO for LLM alignment? a comprehensive study

S. Xu, W. Fu, J. Gao, W. Ye, W. Liu, Z. Mei, G. Wang, C. Yu, and Y. Wu, “Is DPO superior to PPO for LLM alignment? a comprehensive study,” in Forty-first International Conference on Machine Learning, 2024. [Online]. Available: https://openreview.net/forum?id=6XH8R7YrSk

[13]

Learning dynamics of LLM finetuning

Y. Ren and D. J. Sutherland, “Learning dynamics of LLM finetuning,” in The Thirteenth International Conference on Learning Representations, 2025. [Online]. Available: https://openreview.net/forum?id=tPNHOoZFl9

[14]

Llm post-training: A deep dive into reasoning large language models

K. Kumar, T. Ashraf, O. Thawakar, R. M. Anwer, H. Cholakkal, M. Shah, M.-H. Yang, P. H. Torr, F. S. Khan, and S. Khan, “Llm post-training: A deep dive into reasoning large language models,” arXiv preprint arXiv:2502.21321, 2025

[15]

Qwen2.5-vl

Q. Team, “Qwen2.5-vl,” January 2025. [Online]. Available: https://qwenlm.github.io/blog/qwen2.5-vl/

[16]

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn, “Direct Preference Optimization: Your Language Model is Secretly a Reward Model,” vol. 36, pp. 53 728–53 741, 2023. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2023/hash/a85b405ed65c6477a4fe8302b5e06ce7-Abstract-Conference.html

[17]

Mimic-cxr, a de-identified publicly available database of chest radiographs with free-text reports

A. E. Johnson, T. J. Pollard, S. J. Berkowitz, N. R. Greenbaum, M. P. Lungren, C.-y. Deng, R. G. Mark, and S. Horng, “Mimic-cxr, a de-identified publicly available database of chest radiographs with free-text reports,” Scientific data, vol. 6, no. 1, p. 317, 2019

[18]

Efficient streaming language models with attention sinks

G. Xiao, Y. Tian, B. Chen, S. Han, and M. Lewis, “Efficient streaming language models with attention sinks,” in The Twelfth International Conference on Learning Representations, 2024. [Online]. Available: https://openreview.net/forum?id=NG7sS51zVF

[19]

Walking the tightrope: Autonomous disentangling beneficial and detrimental drifts in non-stationary custom-tuning

X. Yang, J. Lu, and E. Yu, “Walking the tightrope: Autonomous disentangling beneficial and detrimental drifts in non-stationary custom-tuning,” in The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. [Online]. Available: https://openreview.net/forum?id=1BAiQmAFsx

[20]

Models, reasoning and inference

J. Pearl et al., “Models, reasoning and inference,” Cambridge, UK: Cambridge University Press, vol. 19, no. 2, p. 3, 2000

[21]

Interpretation and identification of causal mediation

J. Pearl, “Interpretation and identification of causal mediation.” Psychological methods, vol. 19, no. 4, p. 459, 2014

[22]

Reinforced Self-Training (ReST) for Language Modeling

C. Gulcehre, T. L. Paine, S. Srinivasan, K. Konyushkova, L. Weerts, A. Sharma, A. Siddhant, A. Ahern, M. Wang, C. Gu, W. Macherey, A. Doucet, O. Firat, and N. de Freitas, “Reinforced self-training (rest) for language modeling,” 2023. [Online]. Available: https://arxiv.org/abs/2308.08998

[23]

Scaling relationship on learning mathematical reasoning with large language models

Z. Yuan, H. Yuan, C. Li, G. Dong, K. Lu, C. Tan, C. Zhou, and J. Zhou, “Scaling relationship on learning mathematical reasoning with large language models,” 2024. [Online]. Available: https://openreview.net/forum?id=cijO0f8u35

[24]

B-STar: Monitoring and balancing exploration and exploitation in self-taught reasoners

W. Zeng, Y. Huang, L. Zhao, Y. Wang, Z. Shan, and J. He, “B-STar: Monitoring and balancing exploration and exploitation in self-taught reasoners,” in The Thirteenth International Conference on Learning Representations, 2025. [Online]. Available: https://openreview.net/forum?id=P6dwZJpJ4m

[25]

The Lessons of Developing Process Reward Models in Mathematical Reasoning

Z. Zhang, C. Zheng, Y. Wu, B. Zhang, R. Lin, B. Yu, D. Liu, J. Zhou, and J. Lin, “The lessons of developing process reward models in mathematical reasoning,” arXiv preprint arXiv:2501.07301, 2025

[26]

Scalar: Spatial-concept alignment for robust vision in harsh open world

X. Yang, L. Xu, X. Zeng, X. Wang, H. Li, and S. Zhang, “Scalar: Spatial-concept alignment for robust vision in harsh open world,” Pattern Recognition, p. 113203, 2026

[27]

Fewer tokens, greater scaling: Self-adaptive visual bases for efficient and expansive representation learning

S. Young, X. Zeng, and L. Xu, “Fewer tokens, greater scaling: Self-adaptive visual bases for efficient and expansive representation learning,” arXiv preprint arXiv:2511.19515, 2025

[28]

Turning Drift into Constraint: Robust Reasoning Alignment in Non-Stationary Multi-Stream Environments

X. Yang, J. Lu, and E. Yu, “Learning from all: Concept alignment for autonomous distillation from multiple drifting mllms,” arXiv preprint arXiv:2510.04142, 2025. [Online]. Available: https://arxiv.org/abs/2510.04142

[29]

Deep reinforcement learning from human preferences

P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. Amodei, “Deep reinforcement learning from human preferences,” Advances in neural information processing systems, vol. 30, 2017

[30]

Training language models to follow instructions with human feedback

L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray et al., “Training language models to follow instructions with human feedback,” Advances in neural information processing systems, vol. 35, pp. 27 730–27 744, 2022

[31]

Proximal Policy Optimization Algorithms

J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017

[32]

Constitutional AI: Harmlessness from AI Feedback

Y. Bai, S. Kadavath, S. Kundu, A. Askell, J. Kernion, A. Jones, A. Chen, A. Goldie, A. Mirhoseini, C. McKinnon et al., “Constitutional ai: Harmlessness from ai feedback,” arXiv preprint arXiv:2212.08073, 2022

[33]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi et al., “Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning,” arXiv preprint arXiv:2501.12948, 2025

[34]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li et al., “Deepseekmath: Pushing the limits of mathematical reasoning in open language models,” arXiv preprint arXiv:2402.03300, 2024

[35]

From System 1 to System 2: A Survey of Reasoning Large Language Models

Z.-Z. Li, D. Zhang, M.-L. Zhang, J. Zhang, Z. Liu, Y. Yao, H. Xu, J. Zheng, P.-J. Wang, X. Chen et al., “From system 1 to system 2: A survey of reasoning large language models,” arXiv preprint arXiv:2502.17419, 2025

[36]

Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

Q. Chen, L. Qin, J. Liu, D. Peng, J. Guan, P. Wang, M. Hu, Y. Zhou, T. Gao, and W. Che, “Towards reasoning era: A survey of long chain-of-thought for reasoning large language models,” arXiv preprint arXiv:2503.09567, 2025

[37]

Unleashing the potential of diffusion models towards diversified sequential recommendations

Z. Cai, S. Wang, V. W. Chu, U. Naseem, Y. Wang, and F. Chen, “Unleashing the potential of diffusion models towards diversified sequential recommendations,” in Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2025, pp. 1476–1486

[38]

From Query to Counsel: Structured Reasoning with a Multi-Agent Framework and Dataset for Legal Consultation

M. Lu, Y. Zhang, M. Wu, and Y. Feng, “From query to counsel: Structured reasoning with a multi-agent framework and dataset for legal consultation,” 2026. [Online]. Available: https://arxiv.org/abs/2604.10470

[39]

Multimodal Chain-of-Thought Reasoning in Language Models

Z. Zhang, A. Zhang, M. Li, H. Zhao, G. Karypis, and A. Smola, “Multimodal chain-of-thought reasoning in language models,” arXiv preprint arXiv:2302.00923, 2023

[40]

M3cot: A novel benchmark for multi-domain multi-step multi-modal chain-of-thought

Q. Chen, L. Qin, J. Zhang, Z. Chen, X. Xu, and W. Che, “M3cot: A novel benchmark for multi-domain multi-step multi-modal chain-of-thought,” arXiv preprint arXiv:2405.16473, 2024

[41]

Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey

Y. Wang, S. Wu, Y. Zhang, S. Yan, Z. Liu, J. Luo, and H. Fei, “Multimodal chain-of-thought reasoning: A comprehensive survey,” arXiv preprint arXiv:2503.12605, 2025

[42]

Ddcot: Duty-distinct chain-of-thought prompting for multimodal reasoning in language models

G. Zheng, B. Yang, J. Tang, H.-Y. Zhou, and S. Yang, “Ddcot: Duty-distinct chain-of-thought prompting for multimodal reasoning in language models,” Advances in Neural Information Processing Systems, vol. 36, pp. 5168–5191, 2023

[43]

Steering diffusion models towards credible content recommendation

Z. Cai, S. Wang, J. Li, P. Zhou, V. W. Chu, F. Chen, T. Zhu, and C. C. Aggarwal, “Steering diffusion models towards credible content recommendation,” in The Fourteenth International Conference on Learning Representations, 2026

[44]

From newborn to impact: Bias-aware citation prediction

M. Lu, M. Wu, J. Xu, W. Li, F. Liu, Y. Ding, Y. Sun, J. Lu, and Y. Zhang, “From newborn to impact: Bias-aware citation prediction,” arXiv preprint arXiv:2510.19246, 2025

[45]

Choosing how to remember: Adaptive memory structures for llm agents

M. Lu, M. Wu, F. Liu, J. Xu, W. Li, H. Wang, Z. Hu, Y. Ding, Y. Sun, J. Lu et al., “Choosing how to remember: Adaptive memory structures for llm agents,” arXiv preprint arXiv:2602.14038, 2026

[46]

Revealing multimodal causality with large language models

J. Li, S. Wang, Q. Zhang, F. Liu, T. Liu, L. Cao, S. Yu, and F. Chen, “Revealing multimodal causality with large language models,” in The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. [Online]. Available: https://openreview.net/forum?id=nufqobhME7

[47]

Mitigating modality prior-induced hallucinations in multimodal large language models via deciphering attention causality

G. Zhou, Y. Yan, X. Zou, K. Wang, A. Liu, and X. Hu, “Mitigating modality prior-induced hallucinations in multimodal large language models via deciphering attention causality,” in The Thirteenth International Conference on Learning Representations, 2025. [Online]. Available: https://openreview.net/forum?id=AV7OXVlAyi

[48]

Causal-cog: A causal-effect look at context generation for boosting multi-modal language models

S. Zhao, Z. Li, Y. Lu, A. Yuille, and Y. Wang, “Causal-cog: A causal-effect look at context generation for boosting multi-modal language models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2024, pp. 13 342–13 351

[49]

Ensemble learning for data stream analysis: A survey

B. Krawczyk, L. L. Minku, J. Gama, J. Stefanowski, and M. Woźniak, “Ensemble learning for data stream analysis: A survey,” Information Fusion, vol. 37, pp. 132–156, 2017

[50]

Learning under Concept Drift: A Review

J. Lu, A. Liu, F. Dong, F. Gu, J. Gama, and G. Zhang, “Learning under Concept Drift: A Review,” vol. 31, no. 12, pp. 2346–2363, 2019. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/8496795

[51]

Recent Advances in Concept Drift Adaptation Methods for Deep Learning

L. Yuan, H. Li, B. Xia, C. Gao, M. Liu, W. Yuan, and X. You, “Recent Advances in Concept Drift Adaptation Methods for Deep Learning.” in IJCAI, 2022, pp. 5654–5661

[52]

Concept Neural Network Based on Time-Delay Regret for Dynamic Stream Learning

Y.-L. Mi, “Concept Neural Network Based on Time-Delay Regret for Dynamic Stream Learning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 5, pp. 3796–3814, May 2025

[53]

Drift-aware collaborative assistance mixture of experts for heterogeneous multistream learning

E. Yu, J. Lu, K. Wang, X. Yang, and G. Zhang, “Drift-aware collaborative assistance mixture of experts for heterogeneous multistream learning,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 40, no. 19, 2026, pp. 16 199–16 207

[54]

Generalized incremental learning under concept drift across evolving data streams

E. Yu, J. Lu, and G. Zhang, “Generalized incremental learning under concept drift across evolving data streams,” arXiv preprint arXiv:2506.05736, 2025

[55]

Automated Concept Drift Handling for Fault Prediction in Edge Clouds Using Reinforcement Learning

B. Shayesteh, C. Fu, A. Ebrahimzadeh, and R. H. Glitho, “Automated Concept Drift Handling for Fault Prediction in Edge Clouds Using Reinforcement Learning,” IEEE Transactions on Network and Service Management, vol. 19, no. 2, pp. 1321–1335, Jun. 2022

[56]

DRMD: Deep Reinforcement Learning for Malware Detection under Concept Drift

S. McFadden, M. Foley, M. D’Onghia, C. Hicks, V. Mavroudis, N. Paoletti, and F. Pierazzi, “DRMD: Deep Reinforcement Learning for Malware Detection under Concept Drift,” in The Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), Nov. 2025

[57]

Adapting multi-modal large language model to concept drift from pre-training onwards

X. Yang, J. Lu, and E. Yu, “Adapting multi-modal large language model to concept drift from pre-training onwards,” in The Thirteenth International Conference on Learning Representations, Y. Yue, A. Garg, N. Peng, F. Sha, and R. Yu, Eds., vol. 2025, 2025, pp. 90 869–90 891. [Online]. Available: https://proceedings.iclr.cc/paper files/paper/2025/ file/e25d8...

[58]

T-distributed Spherical Feature Representation for Imbalanced Classification

X. Yang, Y. Chen, X. Yue, S. Xu, and C. Ma, “T-distributed Spherical Feature Representation for Imbalanced Classification,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 9, pp. 10 825–10 833, 2023. [Online]. Available: https://ojs.aaai.org/index.php/AAAI/article/view/26284

[59]

Resilient contrastive pre-training under non-stationary drift

X. Yang, J. Lu, E. Yu, and W. Duan, “Resilient contrastive pre-training under non-stationary drift,” arXiv preprint arXiv:2502.07620, 2025. [Online]. Available: https://arxiv.org/abs/2502.07620

[60]

One leaf reveals the season: Occlusion-based contrastive learning with semantic-aware views for efficient visual representation

X. Yang, L. Xu, H. Li, and S. Zhang, “One leaf reveals the season: Occlusion-based contrastive learning with semantic-aware views for efficient visual representation,” in Forty-second International Conference on Machine Learning, 2025. [Online]. Available: https://openreview.net/forum?id=toZOqONu9x

[61]

Causal diagrams for empirical research,

J. Pearl, “Causal diagrams for empirical research,”Biometrika, vol. 82, no. 4, pp. 669–688, 1995

work page 1995
[62]

John Wiley & Sons, 2016

——,Causal inference in statistics: a primer. John Wiley & Sons, 2016

work page 2016
[63]

Direct and indirect effects,

——, “Direct and indirect effects,” inProbabilistic and causal infer- ence: the works of Judea Pearl, 2022, pp. 373–392

work page 2022
[64]

Segmentation and vascular vectorization for coronary artery by geometry-based cascaded neural network,

X. Yang, L. Xu, S. Yu, Q. Xia, H. Li, and S. Zhang, “Segmentation and vascular vectorization for coronary artery by geometry-based cascaded neural network,”IEEE Transactions on Medical Imaging, vol. 44, no. 1, pp. 259–269, 2024

work page 2024
[65]

Local linear embedding based interpolation neural network in pancreatic tumor segmentation,

X. Yang, Y . Chen, X. Yue, C. Ma, and P. Yang, “Local linear embedding based interpolation neural network in pancreatic tumor segmentation,” Applied Intelligence, vol. 52, no. 8, pp. 8746–8756, 2022

work page 2022
[66]

arXiv preprint arXiv:2603.01143 (2026)

Z. Chen, S. Young, and L. Xu, “Tc-ssa: Token compression via semantic slot aggregation for gigapixel pathology reasoning,”arXiv preprint arXiv:2603.01143, 2026

work page arXiv 2026
[67]

Knowledge matters: Chest radiology report generation with general and specific knowledge,

S. Yang, X. Wu, S. Ge, S. K. Zhou, and L. Xiao, “Knowledge matters: Chest radiology report generation with general and specific knowledge,”Medical image analysis, vol. 80, p. 102510, 2022

work page 2022
[68]

Clinical-bert: Vision-language pre-training for radiograph diagnosis and reports generation,

B. Yan and M. Pei, “Clinical-bert: Vision-language pre-training for radiograph diagnosis and reports generation,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 3, 2022, pp. 2982–2990

[69] Z. Wang, L. Liu, L. Wang, and L. Zhou, “Metransformer: Radiology report generation by transformer with multiple learnable expert tokens,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 11 558–11 567.

[70] Y. Yang, X. You, K. Zhang, Z. Fu, X. Wang, J. Ding, J. Sun, Z. Yu, Q. Huang, W. Han et al., “Spatio-temporal and retrieval-augmented modelling for chest x-ray report generation,” IEEE Transactions on Medical Imaging, 2025.

[71] Z. Wang, L. Wang, X. Li, and L. Zhou, “Diagnostic Captioning by Cooperative Task Interactions and Sample-Graph Consistency,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 8, pp. 6585–6598, Aug. 2025.

[72] Z. Wang, L. Liu, L. Wang, and L. Zhou, “R2gengpt: Radiology report generation with frozen llms,” Meta-Radiology, vol. 1, no. 3, p. 100033, 2023.

[73] H. Jin, H. Che, Y. Lin, and H. Chen, “Promptmrg: Diagnosis-driven prompts for medical report generation,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 3, 2024, pp. 2607–2615.

[74] C. Liu, Y. Tian, W. Chen, Y. Song, and Y. Zhang, “Bootstrapping large language models for radiology report generation,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 17, 2024, pp. 18 635–18 643.

[75] X. Wang, F. Wang, Y. Li, Q. Ma, S. Wang, B. Jiang, and J. Tang, “Cxpmrg-bench: Pre-training and benchmarking for x-ray medical report generation on chexpert plus dataset,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2025, pp. 5123–5133.

[76] P. Jing, K. Lee, Z. Zhang, H. Zhou, Z. Yuan, Z. Gao, L. Zhu, G. Papanastasiou, Y. Fang, and G. Yang, “Reason like a radiologist: Chain-of-thought and reinforcement learning for verifiable report generation,” Medical Image Analysis, vol. 109, p. 103910, Mar. 2026.

[77] T. Xiao, L. Shi, P. Liu, Z. Wang, and C. Bai, “Radiology report generation via multi-objective preference optimization,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 8, 2025, pp. 8664–8672.

[78] X. Mei, L. Yang, D. Gao, X. Cai, J. Han, and T. Liu, “Fir-rad: Fine-grained reinforcement with structured reasoning for chest x-ray report generation,” IEEE Transactions on Medical Imaging, 2026.

[79] J. Kim, A. Rohrbach, T. Darrell, J. Canny, and Z. Akata, “Textual explanations for self-driving vehicles,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 563–578.

[80] Z. Xu, Y. Zhang, E. Xie, Z. Zhao, Y. Guo, K.-Y. K. Wong, Z. Li, and H. Zhao, “Drivegpt4: Interpretable end-to-end autonomous driving via large language model,” IEEE Robotics and Automation Letters, vol. 9, no. 10, pp. 8186–8193, 2024.

