YouZhi: Towards High-Concurrency Financial LLMs via Adaptive GQA-to-MLA Transition

Bin Wang; Chang Liu; Chenghan Jiang; Dong Guo; Duo Zhang; Hang Wang; Huawei LLM Team: Ruihan Long; Hu Zhao; Jiabin Li; Jiahui Zhang

arXiv: 2606.05868 · v1 · submitted 2026-06-04 · cs.CL


PSBC LLM Team, Huawei LLM Team: Ruihan Long, Junjie Wu, Tianan Zhang, Duo Zhang, Yaozong Wu, Jinbin Fu, Chang Liu, Zhentao Tang, Wenshuang Yang, Xin Wang, Zhihao Song, Ning Huang, Wenjing Xu, Shuai Zong, Shupei Sun, Sen Wang, Jing Hu, Bin Wang, Xinyu Wang, Junkui Ju, Zequn Ding, Jie Ran, Man Luo, Shixiong Kai, Linkai Hou, Kaichao Liang, Hu Zhao, Yang Zhao, Shucheng Lin, Wei Yu, Chenghan Jiang, Jingjing Ding, Jiahui Zhang, Tian Jin, Yuhang Zhang, Dong Guo, Wei Sun, Jun Xie, Jianwei Li, Lei Cao, Pei Li, Jiabin Li, Jia Yuan, Rui Yuan, Jing Zhu, Mingxuan Yuan, Zhangcheng Lv, Xin Jiang, Xiuhong Fei, Xiaozhe Ren, Yulong Li, Zhipeng Zhang, Hang Wang, Zhaohui Xu, Rui Zhao, Yibo He, Xinzhuang Niu



keywords: financial, transition, adaptive, ascend, concurrency, degradation, deployment, gqa-to-mla



Large language models (LLMs) drive significant financial innovations, yet their high-concurrency deployment is severely bottlenecked by KV cache memory overhead, which inflates infrastructure costs and throttles scalability. To address this, we propose YouZhi-LLM, a highly efficient financial LLM empowered by a comprehensive structural transition and training pipeline natively built on the Huawei Ascend ecosystem. At its algorithmic core, YouZhi-LLM features a layer-adaptive GQA-to-MLA transition framework that dynamically assigns per-layer FreqFold sizes, maximizing KV-cache compression while minimizing perplexity degradation. To recover representation capacity and inject domain expertise, the Ascend-based training pipeline seamlessly integrates generalized knowledge distillation with financial-specific supervised fine-tuning. Evaluations demonstrate the superiority of this systematic approach, with the adaptive transition reducing perplexity degradation by up to 35% over uniform baselines. Crucially, when evaluated on Ascend NPUs via vLLM-Ascend, the massive KV-cache reduction translates directly into deployment efficiency. Compared to their respective base models, YouZhi-7B yields a 12.3% improvement in average financial benchmark score alongside a 2.69$\times$ increase in maximum concurrency; similarly, YouZhi-14B achieves a 7.0% accuracy gain and a 2.43$\times$ concurrency boost, establishing a new paradigm for cost-effective, high-throughput financial inference.
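The link the abstract draws between KV-cache size and maximum concurrency can be illustrated with back-of-the-envelope arithmetic. The sketch below is not the paper's method or configuration: the layer count, head sizes, per-layer latent widths, RoPE width, memory budget, and context length are all hypothetical values chosen for illustration, and the per-layer latent widths are only a stand-in for whatever the adaptive per-layer assignment produces. It assumes a GQA cache stores keys and values for every KV head, while an MLA-style cache stores one compressed latent plus a small decoupled RoPE key per layer.

```python
def gqa_kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes cached per token under GQA: keys and values for every KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def mla_kv_bytes_per_token(latent_dims, rope_dim, dtype_bytes=2):
    """Bytes cached per token under an MLA-style layout.

    latent_dims holds one (hypothetical) compressed width per layer, so
    layers may differ in size.
    """
    return sum((d + rope_dim) * dtype_bytes for d in latent_dims)


def max_concurrency(cache_budget_bytes, bytes_per_token, context_len):
    """Number of full-length sequences whose cache fits in the budget."""
    return cache_budget_bytes // (bytes_per_token * context_len)


# Hypothetical model shape and serving budget (assumptions, not from the paper).
NUM_LAYERS = 28
gqa_bytes = gqa_kv_bytes_per_token(NUM_LAYERS, num_kv_heads=4, head_dim=128)

# Hypothetical non-uniform allocation: half the layers get a wider latent.
latent_dims = [256, 384] * (NUM_LAYERS // 2)
mla_bytes = mla_kv_bytes_per_token(latent_dims, rope_dim=64)

budget = 16 * 1024**3   # 16 GiB reserved for the KV cache (assumed)
context_len = 4096      # tokens per sequence (assumed)

gqa_seqs = max_concurrency(budget, gqa_bytes, context_len)
mla_seqs = max_concurrency(budget, mla_bytes, context_len)
print(gqa_bytes, mla_bytes, gqa_seqs, mla_seqs)
```

With these made-up numbers the cache shrinks by roughly 2.7x and the number of sequences that fit grows by about the same factor. Real serving concurrency also depends on scheduling, weights, and activation memory, so this only shows the direction of the effect, not the reported figures.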


