YouZhi: Towards High-Concurrency Financial LLMs via Adaptive GQA-to-MLA Transition

Bin Wang; Chang Liu; Chenghan Jiang; Dong Guo; Duo Zhang; Hang Wang; Huawei LLM Team: Ruihan Long; Hu Zhao; Jiabin Li; Jiahui Zhang

arxiv: 2606.05868 · v1 · pith:5XODL4VYnew · submitted 2026-06-04 · 💻 cs.CL

YouZhi: Towards High-Concurrency Financial LLMs via Adaptive GQA-to-MLA Transition

PSBC LLM Team , Huawei LLM Team: Ruihan Long , Junjie Wu , Tianan Zhang , Duo Zhang , Yaozong Wu , Jinbin Fu , Chang Liu

show 50 more authors

Zhentao Tang Wenshuang Yang Xin Wang Zhihao Song Ning Huang Wenjing Xu Shuai Zong Shupei Sun Sen Wang Jing Hu Bin Wang Xinyu Wang Junkui Ju Zequn Ding Jie Ran Man Luo Shixiong Kai Linkai Hou Kaichao Liang Hu Zhao Yang Zhao Shucheng Lin Wei Yu Chenghan Jiang Jingjing Ding Jiahui Zhang Tian Jin Yuhang Zhang Dong Guo Wei Sun Jun Xie Jianwei Li Lei Cao Pei Li Jiabin Li Jia Yuan Rui Yuan Jing Zhu Mingxuan Yuan Zhangcheng Lv Xin Jiang Xiuhong Fei Xiaozhe Ren Yulong Li Zhipeng Zhang Hang Wang Zhaohui Xu Rui Zhao Yibo He Xinzhuang Niu

This is my paper

Pith reviewed 2026-06-28 01:31 UTC · model grok-4.3

classification 💻 cs.CL

keywords financial LLMKV cache compressionGQA-to-MLA transitionadaptive compressionhigh-concurrency inferenceknowledge distillationsupervised fine-tuningAscend deployment

0 comments

The pith

YouZhi-LLM applies a layer-adaptive GQA-to-MLA transition to compress KV cache and raise maximum concurrency 2.4-2.7 times while lifting financial benchmark scores.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces YouZhi-LLM as a financial-domain model built around a structural change that switches attention mechanisms on a per-layer basis. By assigning different FreqFold sizes to each layer, the approach seeks to shrink the KV cache memory footprint more aggressively than a uniform switch would allow. A training sequence of generalized knowledge distillation followed by financial supervised fine-tuning is used to restore any capacity lost during compression. On Ascend NPUs the resulting models show both higher accuracy on financial tasks and substantially more concurrent requests than their base counterparts. The central argument is that this adaptive compression-plus-recovery pipeline directly removes the memory bottleneck that currently limits high-throughput financial LLM deployment.

Core claim

The layer-adaptive GQA-to-MLA transition framework dynamically assigns per-layer FreqFold sizes to maximize KV-cache compression while minimizing perplexity degradation; when combined with generalized knowledge distillation and financial-specific supervised fine-tuning, the method produces YouZhi-7B (12.3 percent higher average financial benchmark score, 2.69 times concurrency) and YouZhi-14B (7.0 percent accuracy gain, 2.43 times concurrency) relative to their base models, with the adaptive choice cutting perplexity degradation by up to 35 percent versus uniform baselines.

What carries the argument

The layer-adaptive GQA-to-MLA transition framework that dynamically assigns per-layer FreqFold sizes to compress the KV cache.

If this is right

YouZhi-7B delivers a 12.3 percent gain in average financial benchmark score together with a 2.69 times increase in maximum concurrency.
YouZhi-14B delivers a 7.0 percent accuracy improvement together with a 2.43 times concurrency increase.
The adaptive transition reduces perplexity degradation by up to 35 percent relative to uniform GQA-to-MLA baselines.
The KV-cache reduction translates directly into higher throughput when the models run on Ascend NPUs through vLLM-Ascend.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same per-layer decision rule could be tested on non-financial tasks to check whether the recovery steps preserve general capabilities outside the training domain.
If the adaptive schedule proves hardware-agnostic, the technique might extend to other attention variants that trade memory for compute.
Measuring end-to-end latency under varying batch sizes on the same hardware would quantify whether the reported concurrency gains appear in production workloads.

Load-bearing premise

Dynamic per-layer FreqFold sizes can be chosen to maximize KV-cache compression while generalized knowledge distillation and financial SFT fully recover any lost representation capacity without domain-specific biases or later adjustments.

What would settle it

Measure perplexity degradation and concurrency on identical hardware when the same models use uniform rather than per-layer FreqFold sizes; if degradation exceeds 35 percent or concurrency gains fall below 2 times, the adaptive benefit is not established.

Figures

Figures reproduced from arXiv: 2606.05868 by Bin Wang, Chang Liu, Chenghan Jiang, Dong Guo, Duo Zhang, Hang Wang, Huawei LLM Team: Ruihan Long, Hu Zhao, Jiabin Li, Jiahui Zhang, Jianwei Li, Jia Yuan, Jie Ran, Jinbin Fu, Jing Hu, Jingjing Ding, Jing Zhu, Junjie Wu, Junkui Ju, Jun Xie, Kaichao Liang, Lei Cao, Linkai Hou, Man Luo, Mingxuan Yuan, Ning Huang, Pei Li, PSBC LLM Team, Rui Yuan, Rui Zhao, Sen Wang, Shixiong Kai, Shuai Zong, Shucheng Lin, Shupei Sun, Tianan Zhang, Tian Jin, Wei Sun, Wei Yu, Wenjing Xu, Wenshuang Yang, Xiaozhe Ren, Xin Jiang, Xin Wang, Xinyu Wang, Xinzhuang Niu, Xiuhong Fei, Yang Zhao, Yaozong Wu, Yibo He, Yuhang Zhang, Yulong Li, Zequn Ding, Zhangcheng Lv, Zhaohui Xu, Zhentao Tang, Zhihao Song, Zhipeng Zhang.

**Figure 2.** Figure 2: High-Concurrency Mobile Banking. GPT (Yang et al., 2023), YiZhao-12B-Chat (Lab, 2024), and DianJin-R1 (Zhu et al., 2025)typically inherit the standard Transformer architecture with Grouped-Query Attention (GQA) (Ainslie et al., 2023) or Multi-Head Attention (MHA) (Vaswani et al., 2017), enhanced solely through domain pre-training or supervised fine-tuning. While this paradigm improves financial knowledge u… view at source ↗

**Figure 4.** Figure 4: Layer-wise perplexity degradation of OpenPangu-7B on the WikiText-2 dataset in the partial RoPE modification stage using TransMLA with a RoPE dimension of 128. tively for each layer in LLMs. 3.1 Layer-Adaptive GQA2MLA Transition In this section, we introduce a layer-adaptive FreqFold size selection algorithm to minimize perplexity degradation in the GQA2MLA transition. More specifically, we first formul… view at source ↗

**Figure 5.** Figure 5: Schematic Comparison of the Tansition Trajectory: TransMLA vs. Layer-adaptive. [PITH_FULL_IMAGE:figures/full_fig_p005_5.png] view at source ↗

**Figure 6.** Figure 6: Two-stage Post-training Pipeline. cover performance by learning from the teacher’s annotated sequences. Stage 2: Financial Domain-Specific SFT. The objective of this stage is to inject financial knowledge, building on the general capability restored from stage 1. To this end, we construct a comprehensive, multi-source supervised fine-tuning (SFT) dataset that spans all major financial sub-domains, comp… view at source ↗

**Figure 7.** Figure 7: Training loss curve of two-stage post-training pipeline on Ascend A3 Cluster. [PITH_FULL_IMAGE:figures/full_fig_p007_7.png] view at source ↗

**Figure 8.** Figure 8: Construction of Training Data. et al., 2025) to achieve large-scale data amplification from a small number of seed samples. Specifically, we first construct dozens of highquality seed samples covering typical scenarios and edge cases for the target capability. The SelfInstruct iterative bootstrapping mechanism then guides LLMs to generate new instructions and instances based on the seed samples. Simult… view at source ↗

**Figure 9.** Figure 9: Perplexity Comparison on WikiText-2. (3) Evaluation on High-Concurrency Capability. Leveraging MLA operators on Ascend, we evaluate the system’s efficiency, quantifying the reduction in KV cache size and the increase in maximum concurrency on vLLM-Ascend. 4.1 Experimental Settings Hardware Platform. All experiments are conducted on Huawei Ascend A3 clusters. Models and Datasets. We experiment with off… view at source ↗

read the original abstract

Large language models (LLMs) drive significant financial innovations, yet their high-concurrency deployment is severely bottlenecked by KV cache memory overhead, which inflates infrastructure costs and throttles scalability. To address this, we propose YouZhi-LLM, a highly efficient financial LLM empowered by a comprehensive structural transition and training pipeline natively built on the Huawei Ascend ecosystem. At its algorithmic core, YouZhi-LLM features a layer-adaptive GQA-to-MLA transition framework that dynamically assigns per-layer FreqFold sizes, maximizing KV-cache compression while minimizing perplexity degradation. To recover representation capacity and inject domain expertise, the Ascend-based training pipeline seamlessly integrates generalized knowledge distillation with financial-specific supervised fine-tuning. Evaluations demonstrate the superiority of this systematic approach, with the adaptive transition reducing perplexity degradation by up to 35% over uniform baselines. Crucially, when evaluated on Ascend NPUs via vLLM-Ascend, the massive KV-cache reduction translates directly into deployment efficiency. Compared to their respective base models, YouZhi-7B yields a 12.3% improvement in average financial benchmark score alongside a 2.69$\times$ increase in maximum concurrency; similarly, YouZhi-14B achieves a 7.0% accuracy gain and a 2.43$\times$ concurrency boost, establishing a new paradigm for cost-effective, high-throughput financial inference.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The layer-adaptive FreqFold assignment during GQA-to-MLA is the actual new piece here, and the concurrency numbers look practically useful, but the accuracy gains are not isolated from the KD plus financial SFT.

read the letter

The paper's central move is making the GQA-to-MLA shift layer-dependent by picking different FreqFold sizes per layer instead of one size for all. This is meant to cut KV cache more aggressively while limiting the drop in perplexity, then they run generalized distillation and financial SFT to get the capacity back.

What they report is concrete: YouZhi-7B shows 12.3% higher average financial benchmark score and 2.69× max concurrency over the base model; the 14B version gets 7.0% accuracy lift and 2.43× concurrency. They also claim the adaptive choice reduces perplexity degradation by up to 35% versus uniform baselines, all measured on Ascend NPUs with vLLM-Ascend.

The work targets a real production constraint—high-concurrency financial inference where KV cache dominates memory—so the deployment numbers are the part that could matter to practitioners.

The soft spot is exactly the one in the stress-test note. There is no ablation that runs the identical KD and financial SFT pipeline on a uniform (non-adaptive) GQA-to-MLA baseline. Without that control, the benchmark improvements cannot be cleanly attributed to the per-layer adaptation rather than the domain fine-tuning. The abstract also gives no protocol details, error bars, or baseline definitions, which makes the numerical claims hard to evaluate on their own.

This is for teams already working on efficient inference for domain LLMs, especially those on Ascend hardware. If the full paper supplies the missing ablations and the numbers hold, it is worth a serious referee because the concurrency results address a practical bottleneck even if the attribution needs tightening.

Referee Report

1 major / 2 minor

Summary. The paper proposes YouZhi-LLM, a financial-domain LLM that performs a layer-adaptive transition from GQA to MLA by dynamically selecting per-layer FreqFold sizes to maximize KV-cache compression while limiting perplexity degradation. The approach is paired with generalized knowledge distillation and financial-specific supervised fine-tuning on the Huawei Ascend stack. The authors claim the adaptive transition reduces perplexity degradation by up to 35% versus uniform baselines and, when deployed via vLLM-Ascend, yields YouZhi-7B (12.3% average financial-benchmark improvement, 2.69× concurrency) and YouZhi-14B (7.0% accuracy gain, 2.43× concurrency) relative to their base models.

Significance. If the reported accuracy and concurrency gains can be causally attributed to the adaptive FreqFold mechanism rather than to the accompanying KD + financial SFT pipeline, the work would provide a concrete, hardware-aware recipe for scaling financial LLM inference under tight KV-cache constraints. The explicit integration with Ascend NPUs and vLLM-Ascend is a practical strength.

major comments (1)

[Abstract] Abstract: the central claim attributes the 12.3% / 7.0% benchmark gains and the 2.69× / 2.43× concurrency increases to the layer-adaptive GQA-to-MLA transition. However, the manuscript supplies no ablation that applies the identical generalized KD + financial SFT pipeline to a uniform (non-adaptive) GQA-to-MLA baseline. Without this isolation, the performance improvements cannot be attributed to the adaptive FreqFold mechanism rather than domain-specific fine-tuning.

minor comments (2)

[Abstract] Abstract: the financial benchmarks, base-model checkpoints, number of evaluation runs, and statistical tests underlying the reported percentages and concurrency multipliers are not specified, preventing direct verification of the numerical claims.
[Abstract] Abstract: the precise criterion or search procedure used to choose the per-layer FreqFold sizes is not described, leaving the reproducibility of the “dynamic assignment” unclear.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive comment on attribution of gains. We respond point-by-point below.

read point-by-point responses

Referee: [Abstract] Abstract: the central claim attributes the 12.3% / 7.0% benchmark gains and the 2.69× / 2.43× concurrency increases to the layer-adaptive GQA-to-MLA transition. However, the manuscript supplies no ablation that applies the identical generalized KD + financial SFT pipeline to a uniform (non-adaptive) GQA-to-MLA baseline. Without this isolation, the performance improvements cannot be attributed to the adaptive FreqFold mechanism rather than domain-specific fine-tuning.

Authors: We agree the referee's point is valid: while the manuscript reports an adaptive-vs-uniform comparison for perplexity degradation (up to 35% reduction), it does not provide the requested ablation that holds the generalized KD + financial SFT pipeline fixed and varies only the FreqFold selection strategy (adaptive vs. uniform) when measuring downstream financial-benchmark accuracy and concurrency. The reported 12.3%/7.0% gains and concurrency multipliers are versus the original base models (which lack both the transition and the domain-specific training), so causal attribution to adaptivity alone is not fully isolated. We will add this ablation study in the revised manuscript, training both adaptive and uniform variants under identical KD + SFT conditions and reporting the resulting benchmark scores and concurrency numbers. This will allow readers to quantify the incremental benefit of the layer-adaptive mechanism. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical results rest on external benchmarks

full rationale

The paper reports measured improvements (12.3% and 7.0% benchmark gains, 2.69× and 2.43× concurrency) from an adaptive GQA-to-MLA transition plus KD and financial SFT, compared against uniform baselines and base models. No equations, fitted parameters, or self-citations are shown that define the reported gains in terms of themselves or reduce the central claim to a renaming or construction. The derivation chain consists of architectural description followed by empirical evaluation on held-out financial tasks, which is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no information on free parameters, background axioms, or new postulated entities.

pith-pipeline@v0.9.1-grok · 6006 in / 1180 out tokens · 58111 ms · 2026-06-28T01:31:21.747284+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

67 extracted references · 3 canonical work pages

[1]

Attention is All you Need , url =

Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and Kaiser, ukasz and Polosukhin, Illia , booktitle =. Attention is All you Need , url =
[2]

Advances in neural information processing systems , volume=

C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models , author=. Advances in neural information processing systems , volume=
[3]

arXiv preprint arXiv:2311.07911 , year=

Instruction-following evaluation for large language models , author=. arXiv preprint arXiv:2311.07911 , year=

Pith/arXiv arXiv
[4]

arXiv preprint arXiv:2103.03874 , year=

Measuring mathematical problem solving with the math dataset , author=. arXiv preprint arXiv:2103.03874 , year=

Pith/arXiv arXiv
[5]

International Conference on Learning Representations , volume=

Livecodebench: Holistic and contamination free evaluation of large language models for code , author=. International Conference on Learning Representations , volume=
[6]

Proceedings of the 57th annual meeting of the association for computational linguistics , pages=

Hellaswag: Can a machine really finish your sentence? , author=. Proceedings of the 57th annual meeting of the association for computational linguistics , pages=
[7]

Journal of the Association for Information Science and Technology , volume=

Good debt or bad debt: Detecting semantic orientations in economic texts , author=. Journal of the Association for Information Science and Technology , volume=. 2014 , publisher=

2014
[8]

Proceedings of the 32nd ACM international conference on information and knowledge management , pages=

Xuanyuan 2.0: A large chinese financial chat model with hundreds of billions parameters , author=. Proceedings of the 32nd ACM international conference on information and knowledge management , pages=
[9]

2024 , note =

OpenFinData: A Large Language Model Open-source Financial Evaluation Dataset , author =. 2024 , note =

2024
[10]

Fineval: A chinese financial domain knowledge evaluation benchmark for large language models , author=. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) , pages=

2025
[11]

Findings of the Association for Computational Linguistics: ACL 2024 , pages=

Benchmarking large language models on CFLUE-a Chinese financial language understanding evaluation dataset , author=. Findings of the Association for Computational Linguistics: ACL 2024 , pages=

2024
[12]

Proceedings of the AAAI conference on artificial intelligence , volume=

Crossner: Evaluating cross-domain named entity recognition , author=. Proceedings of the AAAI conference on artificial intelligence , volume=
[13]

Proceedings of the 2013 conference on empirical methods in natural language processing , pages=

Recursive deep models for semantic compositionality over a sentiment treebank , author=. Proceedings of the 2013 conference on empirical methods in natural language processing , pages=

2013
[14]

arXiv preprint arXiv:2406.11903 , year=

A survey of large language models for financial applications: Progress, prospects and challenges , author=. arXiv preprint arXiv:2406.11903 , year=

arXiv
[15]

Advances in Neural Information Processing Systems , volume=

Finben: A holistic financial benchmark for large language models , author=. Advances in Neural Information Processing Systems , volume=
[16]

International Conference on Learning Representations , volume=

On-policy distillation of language models: Learning from self-generated mistakes , author=. International Conference on Learning Representations , volume=
[17]

arXiv preprint arXiv:2402.03300 , year=

Deepseekmath: Pushing the limits of mathematical reasoning in open language models , author=. arXiv preprint arXiv:2402.03300 , year=

Pith/arXiv arXiv
[18]

C ar E xpert: Leveraging Large Language Models for In-Car Conversational Question Answering

Rony, Md Rashad Al Hasan and Suess, Christian and Bhat, Sinchana Ramakanth and Sudhi, Viju and Schneider, Julia and Vogel, Maximilian and Teucher, Roman and Friedl, Ken and Sahoo, Soumya. C ar E xpert: Leveraging Large Language Models for In-Car Conversational Question Answering. Proceedings of the 2023 Conference on Empirical Methods in Natural Language ...

work page doi:10.18653/v1/2023.emnlp-industry.56 2023
[19]

Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track , pages=

The program testing ability of large language models for code , author=. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track , pages=

2024
[20]

arXiv preprint arXiv:2509.14507 , year=

DeKeyNLU: Enhancing Natural Language to SQL Generation through Task Decomposition and Keyword Extraction , author=. arXiv preprint arXiv:2509.14507 , year=

arXiv
[21]

2025 , eprint=

Pangu Embedded: An Efficient Dual-system LLM Reasoner with Metacognition , author=. 2025 , eprint=

2025
[22]

2024 , eprint=

GPT-4 Technical Report , author=. 2024 , eprint=

2024
[23]

arXiv preprint arXiv:2501.12948 , year=

Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning , author=. arXiv preprint arXiv:2501.12948 , year=

Pith/arXiv arXiv
[24]

2025 , publisher=

Qwen 2.5: A comprehensive review of the leading resource-efficient llm with potentioal to surpass all competitors , author=. 2025 , publisher=

2025
[25]

2022 , eprint=

Efficiently Scaling Transformer Inference , author=. 2022 , eprint=

2022
[26]

arXiv preprint arXiv:2510.26585 , year=

Stop wasting your tokens: Towards efficient runtime multi-agent systems , author=. arXiv preprint arXiv:2510.26585 , year=

arXiv
[27]

arXiv preprint arXiv:2407.21783 , year=

The llama 3 herd of models , author=. arXiv preprint arXiv:2407.21783 , year=

Pith/arXiv arXiv
[28]

arXiv preprint arXiv:2405.08944 , year=

Challenges in deploying long-context transformers: A theoretical peak performance analysis , author=. arXiv preprint arXiv:2405.08944 , year=

arXiv
[29]

Advances in Neural Information Processing Systems , volume=

Chain of agents: Large language models collaborating on long-context tasks , author=. Advances in Neural Information Processing Systems , volume=
[30]

arXiv preprint arXiv:2405.04434 , year=

Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model , author=. arXiv preprint arXiv:2405.04434 , year=

Pith/arXiv arXiv
[31]

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Towards economical inference: Enabling deepseek’s multi-head latent attention in any transformer-based llms , author=. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=
[32]

arXiv preprint arXiv:2602.15763 , year=

Glm-5: from vibe coding to agentic engineering , author=. arXiv preprint arXiv:2602.15763 , year=

Pith/arXiv arXiv
[33]

arXiv preprint arXiv:2502.07864 , year=

Transmla: Multi-head latent attention is all you need , author=. arXiv preprint arXiv:2502.07864 , year=

arXiv
[34]

arXiv preprint arXiv:2412.14838 , year=

DynamicKV: Task-aware adaptive KV cache compression for long context LLMs , author=. arXiv preprint arXiv:2412.14838 , year=

arXiv
[35]

arXiv preprint arXiv:2412.12706 , year=

More tokens, lower precision: Towards the optimal token-precision trade-off in kv cache compression , author=. arXiv preprint arXiv:2412.12706 , year=

arXiv
[36]

Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages=

A simple and effective l\_2 norm-based strategy for kv cache compression , author=. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages=

2024
[37]

Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pages=

Gqa: Training generalized multi-query transformer models from multi-head checkpoints , author=. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pages=

2023
[38]

Proceedings of machine learning and systems , volume=

Efficiently scaling transformer inference , author=. Proceedings of machine learning and systems , volume=
[39]

2026 , eprint=

SWAA: Sliding Window Attention Adaptation for Efficient and Quality Preserving Long Context Processing , author=. 2026 , eprint=

2026
[40]

Parallelizing Linear Transformers with the Delta Rule over Sequence Length , url =

Yang, Songlin and Wang, Bailin and Zhang, Yu and Shen, Yikang and Kim, Yoon , booktitle =. Parallelizing Linear Transformers with the Delta Rule over Sequence Length , url =. doi:10.52202/079017-3668 , editor =

work page doi:10.52202/079017-3668
[41]

arXiv preprint arXiv:2512.23914 , year=

Hardware Acceleration for Neural Networks: A Comprehensive Survey , author=. arXiv preprint arXiv:2512.23914 , year=

arXiv
[42]

Available at SSRN 5381584 , year=

A review of llm agent applications in finance and banking , author=. Available at SSRN 5381584 , year=
[43]

arXiv preprint arXiv:2407.00365 , year=

Financial knowledge large language model , author=. arXiv preprint arXiv:2407.00365 , year=

arXiv
[44]

arXiv preprint arXiv:1609.07843 , year=

Pointer sentinel mixture models , author=. arXiv preprint arXiv:1609.07843 , year=

Pith/arXiv arXiv
[45]

Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations , pages=

Alpha-gpt: Human-ai interactive alpha mining for quantitative investment , author=. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations , pages=

2025
[46]

arXiv preprint arXiv:2407.17866 , year=

Financial statement analysis with large language models , author=. arXiv preprint arXiv:2407.17866 , year=

arXiv
[47]

2023 , isbn =

Lakkaraju, Kausik and Jones, Sara E and Vuruma, Sai Krishna Revanth and Pallagani, Vishal and Muppasani, Bharath C and Srivastava, Biplav , title =. 2023 , isbn =. doi:10.1145/3604237.3626867 , booktitle =

work page doi:10.1145/3604237.3626867 2023
[48]

arXiv preprint arXiv:2006.08097 , year=

Finbert: A pretrained language model for financial communications , author=. arXiv preprint arXiv:2006.08097 , year=

arXiv 2006
[49]

arXiv preprint arXiv:2302.09432 , year=

Bbt-fin: Comprehensive construction of chinese financial domain pre-trained language model, corpus and benchmark , author=. arXiv preprint arXiv:2302.09432 , year=

arXiv
[50]

arXiv preprint arXiv:2303.17564 , year=

Bloomberggpt: A large language model for finance , author=. arXiv preprint arXiv:2303.17564 , year=

Pith/arXiv arXiv
[51]

arXiv preprint arXiv:2306.06031 , year=

Fingpt: Open-source financial large language models , author=. arXiv preprint arXiv:2306.06031 , year=

arXiv
[52]

2024 , publisher=

YiZhao-12B-Chat: A Chinese Financial Large Language Model , author=. 2024 , publisher=

2024
[53]

arXiv preprint arXiv:2504.15716 , year=

Dianjin-r1: Evaluating and enhancing financial reasoning in large language models , author=. arXiv preprint arXiv:2504.15716 , year=

arXiv
[54]

Bert: Pre-training of deep bidirectional transformers for language understanding , author=. Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers) , pages=

2019
[55]

Journal of machine learning research , volume=

Exploring the limits of transfer learning with a unified text-to-text transformer , author=. Journal of machine learning research , volume=
[56]

Advances in neural information processing systems , volume=

Language models are few-shot learners , author=. Advances in neural information processing systems , volume=
[57]

, author=

Lora: Low-rank adaptation of large language models. , author=. Iclr , volume=
[58]

Direct preference optimization: Your language model is secretly a reward model , author=
[59]

2024 , url =

YiZhao-12B-Chat: A 12B-Parameter Financial Large Language Model , author =. 2024 , url =

2024
[60]

arXiv preprint arXiv:2505.07608 , year=

MiMo: Unlocking the Reasoning Potential of Language Model--From Pretraining to Posttraining , author=. arXiv preprint arXiv:2505.07608 , year=

arXiv
[61]

arXiv preprint arXiv:2407.06645 , year=

Entropy Law: The Story Behind Data Compression and LLM Performance , author=. arXiv preprint arXiv:2407.06645 , year=

arXiv
[62]

International Conference on Learning Representations (ICLR) , year=

\#INSTAG: Instruction Tagging for Analyzing Supervised Fine-Tuning of Large Language Models , author=. International Conference on Learning Representations (ICLR) , year=
[63]

International Conference on Learning Representations (ICLR) , year=

What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning , author=. International Conference on Learning Representations (ICLR) , year=
[64]

arXiv preprint arXiv:2505.13990 , year=

DecIF: Improving Instruction-Following through Meta-Decomposition , author=. arXiv preprint arXiv:2505.13990 , year=

arXiv
[65]

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics , year=

TreeCut: A Synthetic Unanswerable Math Word Problem Dataset for LLM Hallucination Evaluation , author=. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics , year=
[66]

Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics , year=

Self-Instruct: Aligning Language Models with Self-Generated Instructions , author=. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics , year=
[67]

arXiv preprint arXiv:2304.12244 , year=

WizardLM: Empowering Large Pre-trained Language Models to Follow Complex Instructions , author=. arXiv preprint arXiv:2304.12244 , year=

Pith/arXiv arXiv

[1] [1]

Attention is All you Need , url =

Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and Kaiser, ukasz and Polosukhin, Illia , booktitle =. Attention is All you Need , url =

[2] [2]

Advances in neural information processing systems , volume=

C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models , author=. Advances in neural information processing systems , volume=

[3] [3]

arXiv preprint arXiv:2311.07911 , year=

Instruction-following evaluation for large language models , author=. arXiv preprint arXiv:2311.07911 , year=

Pith/arXiv arXiv

[4] [4]

arXiv preprint arXiv:2103.03874 , year=

Measuring mathematical problem solving with the math dataset , author=. arXiv preprint arXiv:2103.03874 , year=

Pith/arXiv arXiv

[5] [5]

International Conference on Learning Representations , volume=

Livecodebench: Holistic and contamination free evaluation of large language models for code , author=. International Conference on Learning Representations , volume=

[6] [6]

Proceedings of the 57th annual meeting of the association for computational linguistics , pages=

Hellaswag: Can a machine really finish your sentence? , author=. Proceedings of the 57th annual meeting of the association for computational linguistics , pages=

[7] [7]

Journal of the Association for Information Science and Technology , volume=

Good debt or bad debt: Detecting semantic orientations in economic texts , author=. Journal of the Association for Information Science and Technology , volume=. 2014 , publisher=

2014

[8] [8]

Proceedings of the 32nd ACM international conference on information and knowledge management , pages=

Xuanyuan 2.0: A large chinese financial chat model with hundreds of billions parameters , author=. Proceedings of the 32nd ACM international conference on information and knowledge management , pages=

[9] [9]

2024 , note =

OpenFinData: A Large Language Model Open-source Financial Evaluation Dataset , author =. 2024 , note =

2024

[10] [10]

Fineval: A chinese financial domain knowledge evaluation benchmark for large language models , author=. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) , pages=

2025

[11] [11]

Findings of the Association for Computational Linguistics: ACL 2024 , pages=

Benchmarking large language models on CFLUE-a Chinese financial language understanding evaluation dataset , author=. Findings of the Association for Computational Linguistics: ACL 2024 , pages=

2024

[12] [12]

Proceedings of the AAAI conference on artificial intelligence , volume=

Crossner: Evaluating cross-domain named entity recognition , author=. Proceedings of the AAAI conference on artificial intelligence , volume=

[13] [13]

Proceedings of the 2013 conference on empirical methods in natural language processing , pages=

Recursive deep models for semantic compositionality over a sentiment treebank , author=. Proceedings of the 2013 conference on empirical methods in natural language processing , pages=

2013

[14] [14]

arXiv preprint arXiv:2406.11903 , year=

A survey of large language models for financial applications: Progress, prospects and challenges , author=. arXiv preprint arXiv:2406.11903 , year=

arXiv

[15] [15]

Advances in Neural Information Processing Systems , volume=

Finben: A holistic financial benchmark for large language models , author=. Advances in Neural Information Processing Systems , volume=

[16] [16]

International Conference on Learning Representations , volume=

On-policy distillation of language models: Learning from self-generated mistakes , author=. International Conference on Learning Representations , volume=

[17] [17]

arXiv preprint arXiv:2402.03300 , year=

Deepseekmath: Pushing the limits of mathematical reasoning in open language models , author=. arXiv preprint arXiv:2402.03300 , year=

Pith/arXiv arXiv

[18] [18]

C ar E xpert: Leveraging Large Language Models for In-Car Conversational Question Answering

Rony, Md Rashad Al Hasan and Suess, Christian and Bhat, Sinchana Ramakanth and Sudhi, Viju and Schneider, Julia and Vogel, Maximilian and Teucher, Roman and Friedl, Ken and Sahoo, Soumya. C ar E xpert: Leveraging Large Language Models for In-Car Conversational Question Answering. Proceedings of the 2023 Conference on Empirical Methods in Natural Language ...

work page doi:10.18653/v1/2023.emnlp-industry.56 2023

[19] [19]

Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track , pages=

The program testing ability of large language models for code , author=. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track , pages=

2024

[20] [20]

arXiv preprint arXiv:2509.14507 , year=

DeKeyNLU: Enhancing Natural Language to SQL Generation through Task Decomposition and Keyword Extraction , author=. arXiv preprint arXiv:2509.14507 , year=

arXiv

[21] [21]

2025 , eprint=

Pangu Embedded: An Efficient Dual-system LLM Reasoner with Metacognition , author=. 2025 , eprint=

2025

[22] [22]

2024 , eprint=

GPT-4 Technical Report , author=. 2024 , eprint=

2024

[23] [23]

arXiv preprint arXiv:2501.12948 , year=

Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning , author=. arXiv preprint arXiv:2501.12948 , year=

Pith/arXiv arXiv

[24] [24]

2025 , publisher=

Qwen 2.5: A comprehensive review of the leading resource-efficient llm with potentioal to surpass all competitors , author=. 2025 , publisher=

2025

[25] [25]

2022 , eprint=

Efficiently Scaling Transformer Inference , author=. 2022 , eprint=

2022

[26] [26]

arXiv preprint arXiv:2510.26585 , year=

Stop wasting your tokens: Towards efficient runtime multi-agent systems , author=. arXiv preprint arXiv:2510.26585 , year=

arXiv

[27] [27]

arXiv preprint arXiv:2407.21783 , year=

The llama 3 herd of models , author=. arXiv preprint arXiv:2407.21783 , year=

Pith/arXiv arXiv

[28] [28]

arXiv preprint arXiv:2405.08944 , year=

Challenges in deploying long-context transformers: A theoretical peak performance analysis , author=. arXiv preprint arXiv:2405.08944 , year=

arXiv

[29] [29]

Advances in Neural Information Processing Systems , volume=

Chain of agents: Large language models collaborating on long-context tasks , author=. Advances in Neural Information Processing Systems , volume=

[30] [30]

arXiv preprint arXiv:2405.04434 , year=

Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model , author=. arXiv preprint arXiv:2405.04434 , year=

Pith/arXiv arXiv

[31] [31]

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Towards economical inference: Enabling deepseek’s multi-head latent attention in any transformer-based llms , author=. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

[32] [32]

arXiv preprint arXiv:2602.15763 , year=

Glm-5: from vibe coding to agentic engineering , author=. arXiv preprint arXiv:2602.15763 , year=

Pith/arXiv arXiv

[33] [33]

arXiv preprint arXiv:2502.07864 , year=

Transmla: Multi-head latent attention is all you need , author=. arXiv preprint arXiv:2502.07864 , year=

arXiv

[34] [34]

arXiv preprint arXiv:2412.14838 , year=

DynamicKV: Task-aware adaptive KV cache compression for long context LLMs , author=. arXiv preprint arXiv:2412.14838 , year=

arXiv

[35] [35]

arXiv preprint arXiv:2412.12706 , year=

More tokens, lower precision: Towards the optimal token-precision trade-off in kv cache compression , author=. arXiv preprint arXiv:2412.12706 , year=

arXiv

[36] [36]

Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages=

A simple and effective l\_2 norm-based strategy for kv cache compression , author=. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages=

2024

[37] [37]

Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pages=

Gqa: Training generalized multi-query transformer models from multi-head checkpoints , author=. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pages=

2023

[38] [38]

Proceedings of machine learning and systems , volume=

Efficiently scaling transformer inference , author=. Proceedings of machine learning and systems , volume=

[39] [39]

2026 , eprint=

SWAA: Sliding Window Attention Adaptation for Efficient and Quality Preserving Long Context Processing , author=. 2026 , eprint=

2026

[40] [40]

Parallelizing Linear Transformers with the Delta Rule over Sequence Length , url =

Yang, Songlin and Wang, Bailin and Zhang, Yu and Shen, Yikang and Kim, Yoon , booktitle =. Parallelizing Linear Transformers with the Delta Rule over Sequence Length , url =. doi:10.52202/079017-3668 , editor =

work page doi:10.52202/079017-3668

[41] [41]

arXiv preprint arXiv:2512.23914 , year=

Hardware Acceleration for Neural Networks: A Comprehensive Survey , author=. arXiv preprint arXiv:2512.23914 , year=

arXiv

[42] [42]

Available at SSRN 5381584 , year=

A review of llm agent applications in finance and banking , author=. Available at SSRN 5381584 , year=

[43] [43]

arXiv preprint arXiv:2407.00365 , year=

Financial knowledge large language model , author=. arXiv preprint arXiv:2407.00365 , year=

arXiv

[44] [44]

arXiv preprint arXiv:1609.07843 , year=

Pointer sentinel mixture models , author=. arXiv preprint arXiv:1609.07843 , year=

Pith/arXiv arXiv

[45] [45]

Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations , pages=

Alpha-gpt: Human-ai interactive alpha mining for quantitative investment , author=. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations , pages=

2025

[46] [46]

arXiv preprint arXiv:2407.17866 , year=

Financial statement analysis with large language models , author=. arXiv preprint arXiv:2407.17866 , year=

arXiv

[47] [47]

2023 , isbn =

Lakkaraju, Kausik and Jones, Sara E and Vuruma, Sai Krishna Revanth and Pallagani, Vishal and Muppasani, Bharath C and Srivastava, Biplav , title =. 2023 , isbn =. doi:10.1145/3604237.3626867 , booktitle =

work page doi:10.1145/3604237.3626867 2023

[48] [48]

arXiv preprint arXiv:2006.08097 , year=

Finbert: A pretrained language model for financial communications , author=. arXiv preprint arXiv:2006.08097 , year=

arXiv 2006

[49] [49]

arXiv preprint arXiv:2302.09432 , year=

Bbt-fin: Comprehensive construction of chinese financial domain pre-trained language model, corpus and benchmark , author=. arXiv preprint arXiv:2302.09432 , year=

arXiv

[50] [50]

arXiv preprint arXiv:2303.17564 , year=

Bloomberggpt: A large language model for finance , author=. arXiv preprint arXiv:2303.17564 , year=

Pith/arXiv arXiv

[51] [51]

arXiv preprint arXiv:2306.06031 , year=

Fingpt: Open-source financial large language models , author=. arXiv preprint arXiv:2306.06031 , year=

arXiv

[52] [52]

2024 , publisher=

YiZhao-12B-Chat: A Chinese Financial Large Language Model , author=. 2024 , publisher=

2024

[53] [53]

arXiv preprint arXiv:2504.15716 , year=

Dianjin-r1: Evaluating and enhancing financial reasoning in large language models , author=. arXiv preprint arXiv:2504.15716 , year=

arXiv

[54] [54]

Bert: Pre-training of deep bidirectional transformers for language understanding , author=. Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers) , pages=

2019

[55] [55]

Journal of machine learning research , volume=

Exploring the limits of transfer learning with a unified text-to-text transformer , author=. Journal of machine learning research , volume=

[56] [56]

Advances in neural information processing systems , volume=

Language models are few-shot learners , author=. Advances in neural information processing systems , volume=

[57] [57]

, author=

Lora: Low-rank adaptation of large language models. , author=. Iclr , volume=

[58] [58]

Direct preference optimization: Your language model is secretly a reward model , author=

[59] [59]

2024 , url =

YiZhao-12B-Chat: A 12B-Parameter Financial Large Language Model , author =. 2024 , url =

2024

[60] [60]

arXiv preprint arXiv:2505.07608 , year=

MiMo: Unlocking the Reasoning Potential of Language Model--From Pretraining to Posttraining , author=. arXiv preprint arXiv:2505.07608 , year=

arXiv

[61] [61]

arXiv preprint arXiv:2407.06645 , year=

Entropy Law: The Story Behind Data Compression and LLM Performance , author=. arXiv preprint arXiv:2407.06645 , year=

arXiv

[62] [62]

International Conference on Learning Representations (ICLR) , year=

\#INSTAG: Instruction Tagging for Analyzing Supervised Fine-Tuning of Large Language Models , author=. International Conference on Learning Representations (ICLR) , year=

[63] [63]

International Conference on Learning Representations (ICLR) , year=

What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning , author=. International Conference on Learning Representations (ICLR) , year=

[64] [64]

arXiv preprint arXiv:2505.13990 , year=

DecIF: Improving Instruction-Following through Meta-Decomposition , author=. arXiv preprint arXiv:2505.13990 , year=

arXiv

[65] [65]

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics , year=

TreeCut: A Synthetic Unanswerable Math Word Problem Dataset for LLM Hallucination Evaluation , author=. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics , year=

[66] [66]

Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics , year=

Self-Instruct: Aligning Language Models with Self-Generated Instructions , author=. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics , year=

[67] [67]

arXiv preprint arXiv:2304.12244 , year=

WizardLM: Empowering Large Pre-trained Language Models to Follow Complex Instructions , author=. arXiv preprint arXiv:2304.12244 , year=

Pith/arXiv arXiv