RetroInfer: A Vector Storage Engine for Scalable Long-Context LLM Inference
Recent large language models (LLMs) are rapidly extending their context windows, yet inference throughput lags due to increasing GPU memory and bandwidth demands. This is because the key-value (KV) cache, an intermediate structure storing token representations, grows linearly with context length and requires an iterative linear scan for attention computation. A promising direction to accelerate long-context inference is to exploit attention's inherent sparsity by offloading the KV cache to CPU memory and retrieving only a small subset of tokens important to the current generation step. However, prior sparse attention approaches struggle to balance accuracy and retrieval cost due to varying sparsity patterns and inefficient GPU-CPU memory management. We present RetroInfer, a vector storage engine that realizes a sparsity-based KV cache for long-context inference. RetroInfer introduces an Attention-aWare VEctor index (wave index), which fundamentally improves the tradeoff between attention accuracy and retrieval cost through tripartite attention approximation, accuracy-bound attention estimation, and segmented clustering. We also design the wave buffer, a GPU-CPU buffer manager that assigns computation and manages data across heterogeneous hardware. We evaluate RetroInfer across a range of models and workloads, demonstrating up to 4.4X decoding throughput over full attention at 120K context and up to 12.2X over sparse attention baselines at 1 million tokens -- all while preserving full-attention-level accuracy.
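The abstract's core mechanism — cluster the cached keys, then at each decoding step retrieve only the clusters whose centroids score highest against the current query — can be sketched in a small numpy toy. This is an illustrative approximation of centroid-based sparse attention, not RetroInfer's wave index; the function names, parameters, and plain k-means here are invented for illustration (the paper uses segmented clustering and accuracy-bound estimation, which this sketch omits):

```python
import numpy as np

def kmeans_centroids(keys, n_clusters, iters=10, seed=0):
    """Toy k-means over cached key vectors: returns centroids and
    the cluster assignment of each cached token."""
    rng = np.random.default_rng(seed)
    centroids = keys[rng.choice(len(keys), n_clusters, replace=False)].astype(float)
    for _ in range(iters):
        # Assign each key to its nearest centroid, then recompute centroids.
        dists = ((keys[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(axis=1)
        for c in range(n_clusters):
            members = keys[assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    return centroids, assign

def clustered_sparse_attention(q, keys, values, centroids, assign, top_clusters=2):
    """Score only the tokens in the clusters whose centroids best match q,
    instead of linearly scanning the whole KV cache."""
    cluster_scores = centroids @ q
    keep = np.argsort(cluster_scores)[-top_clusters:]      # most promising clusters
    mask = np.isin(assign, keep)                           # tokens to retrieve
    logits = keys[mask] @ q
    weights = np.exp(logits - logits.max())                # stable softmax over subset
    weights /= weights.sum()
    return weights @ values[mask]
```

In a real offloading system the `keys[mask]` / `values[mask]` gather is the expensive step (a CPU-to-GPU transfer), which is why the paper pairs the index with a dedicated GPU-CPU buffer manager.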
This paper has not been read by Pith yet.
Forward citations
Cited by 2 Pith papers
- AB-Sparse: Sparse Attention with Adaptive Block Size for Accurate and Efficient Long-Context Inference
  AB-Sparse adaptively allocates per-head block sizes for sparse attention, adds lossless centroid quantization and custom variable-block GPU kernels, and reports up to 5.43% accuracy gain over fixed-block baselines wit...
- Unifying Sparse Attention with Hierarchical Memory for Scalable Long-Context LLM Serving
  SPIN co-designs sparse attention with hierarchical memory to achieve 1.66-5.66x higher throughput, 7-9x lower TTFT, and up to 58% lower TPOT than vLLM and original sparse implementations.
Reference graph
Works this paper leans on
- [1] 01-ai. 2024. Yi-6B-200K. https://huggingface.co/01-ai/Yi-6B-200K. Accessed: 2024-11-11.
- [2] 01-ai. 2024. Yi-9B-200K. https://huggingface.co/01-ai/Yi-9B-200K. Accessed: 2024-11-11.
- [3] Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Alexey Tumanov, and Ramachandran Ramjee. 2024. Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve. In 18th USENIX Symposium on Operating Systems Design and Implementation. USENIX Association, Santa Clara, CA, USA, 117–134. https://www.usenix....
- [4] Amey Agrawal, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, and Ramachandran Ramjee. 2023. SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills. CoRR abs/2308.16369 (2023). https://doi.org/10.48550/ARXIV.2308.16369
- [5] Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. 2023. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Singapore, 4895–4901. https:...
- [6] Keivan Alizadeh, Iman Mirzadeh, Dmitry Belenko, S. Khatamifard, Minsik Cho, Carlo C. del Mundo, Mohammad Rastegari, and Mehrdad Farajtabar. 2024. LLM in a flash: Efficient Large Language Model Inference with Limited Memory. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for C...
- [7] Anthropic. 2025. Claude. https://www.anthropic.com/claude. Accessed: 2025-08-01.
- [8] Ramakrishna Bairi, Atharv Sonwane, Aditya Kanade, Vageesh D. C., Arun Iyer, Suresh Parthasarathy, Sriram K. Rajamani, Balasubramanyan Ashok, and Shashank Shet. 2024. CodePlan: Repository-Level Coding using LLMs and Planning. Proceedings of the ACM on Software Engineering 1, FSE (2024), 675–
- [9] https://doi.org/10.1145/3643757
- [10] Zefan Cai, Yichi Zhang, Bofei Gao, Yuliang Liu, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, Baobao Chang, Junjie Hu, and Wen Xiao. 2024. PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling. CoRR abs/2406.02069 (2024). https://doi.org/10.48550/ARXIV.2406.02069
- [11] Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. 2023. Accelerating Large Language Model Decoding with Speculative Sampling. CoRR abs/2302.01318 (2023). https://doi.org/10.48550/ARXIV.2302.01318
- [12] Cheng Chen, Chenzhe Jin, Yunan Zhang, Sasha Podolsky, Chun Wu, Szu-Po Wang, Eric Hanson, Zhou Sun, Robert Walzer, and Jianguo Wang. 2024. SingleStore-V: An Integrated Vector Database System in SingleStore. Proc. VLDB Endow. 17, 12 (2024), 3772–3785. https://doi.org/10.14778/3685800.3685805
- [13] Meng Chen, Kai Zhang, Zhenying He, Yinan Jing, and X. Sean Wang. 2024. RoarGraph: A Projected Bipartite Graph for Efficient Cross-Modal Approximate Nearest Neighbor Search. Proc. VLDB Endow. 17, 11 (2024), 2735–2749. https://doi.org/10.14778/3681954.3681959
- [14] Weijian Chen, Shuibing He, Haoyang Qu, Ruidong Zhang, Siling Yang, Ping Chen, Yi Zheng, Baoxing Huai, and Gang Chen. 2025. IMPRESS: An Importance-Informed Multi-Tier Prefix KV Storage System for Large Language Model Inference. In 23rd USENIX Conference on File and Storage Technologies. USENIX Association, Santa Clara, CA, USA, 187–201. https://www.use...
- [15] Xingyu Chen, Jiahao Xu, Tian Liang, Zhiwei He, Jianhui Pang, Dian Yu, Linfeng Song, Qiuzhi Liu, Mengfei Zhou, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, and Dong Yu. 2024. Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs. CoRR abs/2412.21187 (2024). https://doi.org/10.48550/ARXIV.2412.21187
- [16] Zhuoming Chen, Ranajoy Sadhukhan, Zihao Ye, Yang Zhou, Jianyu Zhang, Niklas Nolte, Yuandong Tian, Matthijs Douze, Léon Bottou, Zhihao Jia, and Beidi Chen. 2025. MagicPIG: LSH Sampling for Efficient LLM Generation. In The Thirteenth International Conference on Learning Representations. OpenReview.net, Singapore. https://openreview.net/forum?id=ALzTQUgW8a
- [17] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. Generating Long Sequences with Sparse Transformers. CoRR abs/1904.10509 (2019). http://arxiv.org/abs/1904.10509
- [18] Gonçalo M. Correia, Vlad Niculae, and André F. T. Martins. 2019. Adaptively Sparse Transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing. Association for Computational Linguistics, Hong Kong, China, 2174–2184. https://doi.org/10.18653...
- [19] Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. In The Thirty-Sixth Annual Conference on Neural Information Processing Systems. New Orleans, LA, USA. http://papers.nips.cc/paper_files/paper/2022/hash/67d57c32e20fd0a7a302cb81d36e40d5-Abstract-Conference.html
- [20] DeepSeek. 2025. DeepSeek-R1-Distill-Llama-8B. https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-8B. Accessed: 2025-08-01.
- [21] DeepSeek. 2025. DeepSeek-R1-Distill-Qwen-7B. https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B. Accessed: 2025-08-01.
- [22] DeepSeek-AI. 2025. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. CoRR abs/2501.12948 (2025). https://doi.org/10.48550/ARXIV.2501.12948
- [23]
- [24] Yangshen Deng, Zhengxin You, Long Xiang, Qilong Li, Peiqi Yuan, Zhaoyang Hong, Yitao Zheng, Wanting Li, Runzhong Li, Haotian Liu, Kyriakos Mouratidis, Man Lung Yiu, Huan Li, Qiaomu Shen, Rui Mao, and Bo Tang. 2025. AlayaDB: The Data Foundation for Efficient and Effective Long-context LLM Inference. In Companion of the 2025 International Conference on Manag...
- [26] Cong Fu, Chao Xiang, Changxu Wang, and Deng Cai. 2019. Fast Approximate Nearest Neighbor Search With The Navigating Spreading-out Graph. Proc. VLDB Endow. 12, 5 (2019), 461–474. https://doi.org/10.14778/3303753.3303754
- [27] Yao Fu, Leyang Xue, Yeqi Huang, Andrei-Octavian Brabete, Dmitrii Ustiugov, Yuvraj Patel, and Luo Mai. 2024. ServerlessLLM: Low-Latency Serverless Inference for Large Language Models. In 18th USENIX Symposium on Operating Systems Design and Implementation. USENIX Association, Santa Clara, CA, USA, 135–153. https://www.usenix.org/conference/osdi24/presentation/fu
- [28] Bin Gao, Zhuomin He, Puru Sharma, Qingxuan Kang, Djordje Jevdjic, Junbo Deng, Xingkun Yang, Zhou Yu, and Pengfei Zuo. 2024. Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention. In Proceedings of the 2024 USENIX Annual Technical Conference. USENIX Association, Santa Clara, CA, USA, 111–126. https://www.usenix.org/...
- [29] Shiwei Gao, Youmin Chen, and Jiwu Shu. 2025. Fast State Restoration in LLM Serving with HCache. In Proceedings of the Twentieth European Conference on Computer Systems. ACM, Rotterdam, The Netherlands, 128–143. https://doi.org/10.1145/3689031.3696072
- [30] Shihong Gao, Xin Zhang, Yanyan Shen, and Lei Chen. 2025. Apt-Serve: Adaptive Request Scheduling on Hybrid Cache for Scalable LLM Inference Serving. Proceedings of the ACM on Management of Data 3, 3 (2025), 130:1–130:28. https://doi.org/10.1145/3725394
- [31]
- [32] Google. 2025. Gemini. https://gemini.google.com/app. Accessed: 2025-08-01.
- [33] gradientai. 2024. Llama-3-8B-Instruct-Gradient-1048k. https://huggingface.co/gradientai/Llama-3-8B-Instruct-Gradient-1048k. Accessed: 2024-10-29.
- [34] Greg Kamradt. 2023. Needle in a haystack - pressure testing llms. https://github.com/gkamradt/LLMTest_NeedleInAHaystack. Accessed: 2024-08-12.
- [35] Rentong Guo, Xiaofan Luan, Long Xiang, Xiao Yan, Xiaomeng Yi, Jigao Luo, Qianya Cheng, Weizhi Xu, Jiarui Luo, Frank Liu, Zhenshan Cao, Yanliang Qiao, Ting Wang, Bo Tang, and Charles Xie. 2022. Manu: A Cloud Native Vector Database Management System. Proc. VLDB Endow. 15, 12 (2022), 3548–3561. https://doi.org/10.14778/3554821.3554843
- [36] Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W. Mahoney, Yakun Sophia Shao, Kurt Keutzer, and Amir Gholami. 2024. KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization. In The Thirty-Eighth Annual Conference on Neural Information Processing Systems. Vancouver, BC, Canada. http://papers.nips.cc/paper_files/paper/202...
- [37] Coleman Richard Charles Hooper, Sehoon Kim, Hiva Mohammadzadeh, Monishwaran Maheswaran, Sebastian Zhao, June Paik, Michael W. Mahoney, Kurt Keutzer, and Amir Gholami. 2025. Squeezed Attention: Accelerating Long Context Length LLM Inference. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)....
- [38] Kurt Hornik, Ingo Feinerer, Martin Kober, and Christian Buchta. 2012. Spherical k-means clustering. Journal of Statistical Software 50 (2012), 1–22.
- [39] Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. 2024. RULER: What’s the Real Context Size of Your Long-Context Language Models? CoRR abs/2404.06654 (2024). https://doi.org/10.48550/ARXIV.2404.06654
- [40] Piotr Indyk and Rajeev Motwani. 1998. Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. In Proceedings of the Thirtieth Annual ACM Symposium on the Theory of Computing. ACM, Dallas, Texas, USA, 604–613. https://doi.org/10.1145/276698.276876
- [41] InfiniGen. 2024. InfiniGen Code. https://github.com/snu-comparch/InfiniGen. Accessed: 2025-04-01.
- [42] Johan Ludwig William Valdemar Jensen. 1906. Sur les fonctions convexes et les inégalités entre les valeurs moyennes. Acta Mathematica 30, 1 (1906), 175–193.
- [43] Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H. Abdi, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. 2024. MInference 1.0: Accelerating Prefilling for Long-Context LLMs via Dynamic Sparse Attention. In The Thirty-Eighth Annual Conference on Neural Information Processing Systems. Vancouver...
- [44] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient Memory Management for Large Language Model Serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles. ACM, Koblenz, Germany, 611–626. https://doi.org/10.1145/3600006.3613165
- [45] Wonbeom Lee, Jungi Lee, Junghwan Seo, and Jaewoong Sim. 2024. InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management. In 18th USENIX Symposium on Operating Systems Design and Implementation. USENIX Association, Santa Clara, CA, 155–172. https://www.usenix.org/conference/osdi24/presentation/lee
- [46]
- [47] Viktor Leis, Adnan Alhomssi, Tobias Ziegler, Yannick Loeck, and Christian Dietrich. 2023. Virtual-Memory Assisted Buffer Management. Proceedings of the ACM on Management of Data 1, 1 (2023), 7:1–7:25. https://doi.org/10.1145/3588687
- [48] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In The Thirty-fourth Annual Conference on Neural Information Processing Systems. virtual. ht...
- [49] Jinyang Li, Binyuan Hui, Ge Qu, Jiaxi Yang, Binhua Li, Bowen Li, Bailin Wang, Bowen Qin, Ruiying Geng, Nan Huo, Xuanhe Zhou, Chenhao Ma, Guoliang Li, Kevin Chen-Chuan Chang, Fei Huang, Reynold Cheng, and Yongbin Li. 2023. Can LLM Already Serve as A Database Interface? A Big Bench for Large-Scale Database Grounded Text-to-SQLs. In The Thirty-Seventh Annual ...
- [50] Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. 2024. SnapKV: LLM Knows What You are Looking for Before Generation. In The Thirty-Eighth Annual Conference on Neural Information Processing Systems. Vancouver, BC, Canada. http://papers.nips.cc/paper_files/paper/2024/hash/28a...
- [51] Zhuohan Li, Lianmin Zheng, Yinmin Zhong, Vincent Liu, Ying Sheng, Xin Jin, Yanping Huang, Zhifeng Chen, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving. In 17th USENIX Symposium on Operating Systems Design and Implementation. USENIX Association, 663–679. https://www.usenix.org/conference/osdi23/presentation/li-zhouhan
- [52]
- [53] Chaofan Lin, Zhenhua Han, Chengruidong Zhang, Yuqing Yang, Fan Yang, Chen Chen, and Lili Qiu. 2024. Parrot: Efficient Serving of LLM-based Applications with Semantic Variable. In 18th USENIX Symposium on Operating Systems Design and Implementation. USENIX Association, Santa Clara, CA, USA, 929–945. https://www.usenix.org/conference/osdi24/presentation/lin-chaofan
- [54] Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. 2024. AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration. In Proceedings of the Seventh Annual Conference on Machine Learning and Systems. mlsys.org, Santa Clara, CA, USA. https://proc...
- [55] Di Liu, Meng Chen, Baotong Lu, Huiqiang Jiang, Zhenhua Han, Qianxi Zhang, Qi Chen, Chengruidong Zhang, Bailu Ding, Kai Zhang, Chen Chen, Fan Yang, Yuqing Yang, and Lili Qiu. 2024. RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval. CoRR abs/2409.10516 (2024). https://doi.org/10.48550/ARXIV.2409.10516
- [56] Guangda Liu, Chengwei Li, Jieru Zhao, Chenqi Zhang, and Minyi Guo. 2025. ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression. In 62nd ACM/IEEE Design Automation Conference. IEEE, San Francisco, CA, USA, 1–7. https://doi.org/10.1109/DAC63849.2025.11132479
- [57] Shige Liu, Zhifang Zeng, Li Chen, Adil Ainihaer, Arun Ramasami, Songting Chen, Yu Xu, Mingxi Wu, and Jianguo Wang. 2025. TigerVector: Supporting Vector Search in Graph Databases for Advanced RAGs. In Companion of the 2025 International Conference on Management of Data. ACM, Berlin, Germany, 553–565. https://doi.org/10.1145/3722212.3724456
- [58] Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen (Henry) Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu. 2024. KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache. In Forty-first International Conference on Machine Learning. OpenReview.net, Vienna, Austria. https://openreview.net/forum?id=L057s2Rq8O
- [59] Kejing Lu, Mineichi Kudo, Chuan Xiao, and Yoshiharu Ishikawa. 2021. HVS: Hierarchical Graph Structure Based on Voronoi Diagrams for Solving Approximate Nearest Neighbor Search. Proc. VLDB Endow. 15, 2 (2021), 246–258. https://doi.org/10.14778/3489496.3489506
- [60] MagicPIG. 2024. MagicPIG Code. https://github.com/Infini-AI-Lab/MagicPIG. Accessed: 2025-04-01.
- [61] Yury A. Malkov and Dmitry A. Yashunin. 2020. Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence 42, 4 (2020), 824–836. https://doi.org/10.1109/TPAMI.2018.2889473
- [62] Yixuan Mei, Yonghao Zhuang, Xupeng Miao, Juncheng Yang, Zhihao Jia, and Rashmi Vinayak. 2025. Helix: Serving Large Language Models over Heterogeneous GPUs and Network via Max-Flow. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1. ACM, Rotterdam, The Netherlands, 58...
- [63] Meta. 2024. Llama-3.1-8B-Instruct. https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct. Accessed: 2024-09-25.
- [64] Meta. 2025. The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation. https://ai.meta.com/blog/llama-4-multimodal-intelligence/. Accessed: 2025-04-05.
- [65] Jason Mohoney, Anil Pacaci, Shihabur Rahman Chowdhury, Ali Mousavi, Ihab F. Ilyas, Umar Farooq Minhas, Jeffrey Pound, and Theodoros Rekatsinas. 2023. High-Throughput Vector Similarity Search in Knowledge Graphs. Proceedings of the ACM on Management of Data 1, 2 (2023), 197:1–197:25. https://doi.org/10.1145/3589777
- [66] Jiaqi Mu and Pramod Viswanath. 2018. All-but-the-Top: Simple and Effective Postprocessing for Word Representations. In The Sixth International Conference on Learning Representations. OpenReview.net, Vancouver, BC, Canada. https://openreview.net/forum?id=HkuGJ3kCb
- [67] Ansong Ni, Jeevana Priya Inala, Chenglong Wang, Alex Polozov, Christopher Meek, Dragomir Radev, and Jianfeng Gao. 2023. Learning Math Reasoning from Self-Sampled Correct and Partially-Correct Solutions. In The Eleventh International Conference on Learning Representations. OpenReview.net, Kigali, Rwanda. https://openreview.net/forum?id=4D4TSJE6-K
- [68] NVIDIA. 2020. NVIDIA A100 Tensor Core GPU. https://www.nvidia.com/en-us/data-center/a100/. Accessed: 2025-04-01.
- [69] NVIDIA. 2020. NVIDIA RTX A6000 Graphics Card. https://www.nvidia.com/en-us/products/workstations/rtx-a6000/. Accessed: 2025-10-01.
- [70] Art of Problem Solving. 2024. AIME Problems and Solutions. https://artofproblemsolving.com/wiki/index.php/AIME_Problems_and_Solutions. Accessed: 2025-08-01.
- [71] Hiroyuki Ootomo, Akira Naruse, Corey Nolet, Ray Wang, Tamas Feher, and Yong Wang. 2024. CAGRA: Highly Parallel Graph Construction and Approximate Nearest Neighbor Search for GPUs. In 40th IEEE International Conference on Data Engineering. IEEE, Utrecht, The Netherlands, 4236–4247. https://doi.org/10.1109/ICDE60146.2024.00323
- [72] OpenAI. 2025. ChatGPT. https://chat.chatbotapp.ai/. Accessed: 2025-08-01.
- [73] James Jie Pan, Jianguo Wang, and Guoliang Li. 2024. Survey of vector database management systems. VLDB J. 33, 5 (2024), 1591–1615. https://doi.org/10.1007/S00778-024-00864-X
- [74] Liana Patel, Peter Kraft, Carlos Guestrin, and Matei Zaharia. 2024. ACORN: Performant and Predicate-Agnostic Search Over Vector Embeddings and Structured Data. Proceedings of the ACM on Management of Data 2, 3 (2024), 120. https://doi.org/10.1145/3654923
- [75] Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Jonathan Heek, Kefan Xiao, Shivani Agrawal, and Jeff Dean. 2023. Efficiently Scaling Transformer Inference. In Proceedings of the Sixth Conference on Machine Learning and Systems. mlsys.org, Miami, FL, USA. https://proceedings.mlsys.org/paper_files/paper/2023/hash/c4be71ab8d24cdfb45e3d06dbfca2780-Abstract-mlsys2023.html
- [76]
- [77] PQCache. 2024. PQCache. https://github.com/HugoZHL/PQCache. Accessed: 2025-04-01.
- [78] Ruoyu Qin, Zheming Li, Weiran He, Jialei Cui, Feng Ren, Mingxing Zhang, Yongwei Wu, Weimin Zheng, and Xinran Xu. 2025. Mooncake: Trading More Storage for Less Computation - A KVCache-centric Architecture for Serving LLM Chatbot. In 23rd USENIX Conference on File and Storage Technologies. USENIX Association, Santa Clara, CA, USA, 155–170. https://www.usenix...
- [79] Quest. 2024. Quest Code. https://github.com/mit-han-lab/Quest. Accessed: 2025-04-01.
- [80] Qwen. 2024. Qwen2.5-72B-Instruct. https://huggingface.co/Qwen/Qwen2.5-72B-Instruct. Accessed: 2025-01-12.
- [81] Qwen. 2024. Qwen2.5-7B-Instruct. https://huggingface.co/Qwen/Qwen2.5-7B-Instruct. Accessed: 2025-01-12.