DisCEdge: Distributed Context Management for Large Language Models at the Edge

David Bermbach; Minghe Wang; Mohammadreza Malekabbasi

arxiv: 2511.22599 · v2 · submitted 2025-11-27 · 💻 cs.DC · cs.DB· cs.LG

DisCEdge: Distributed Context Management for Large Language Models at the Edge

Mohammadreza Malekabbasi , Minghe Wang , David Bermbach This is my paper

Pith reviewed 2026-05-17 04:13 UTC · model grok-4.3

classification 💻 cs.DC cs.DBcs.LG

keywords edge computingdistributed systemslarge language modelscontext managementtokenized replicationdata consistencyresponse latency

0 comments

The pith

DisCEdge stores and replicates LLM user context as token sequences across edge nodes to cut response times and synchronization costs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models need ongoing user context such as conversation history to answer coherently, yet running them at the edge means this context must travel between scattered nodes without losing the speed advantage. DisCEdge keeps context in tokenized form and replicates the token sequences between nodes instead of raw text. This removes repeated tokenization work and allows lighter updates to travel between nodes. In tests on a realistic edge setup the system delivers faster replies to users, lower traffic between nodes, and far smaller messages from clients, all while keeping context identical everywhere.

Core claim

The paper claims that maintaining user context as sequences of tokens and replicating these sequences across geo-distributed edge nodes eliminates redundant tokenization and enables efficient synchronization, producing up to 14.46 percent better median response times and up to 15 percent lower median inter-node synchronization overhead than raw-text methods, plus a 90 percent median reduction in client request size versus client-side storage, all while preserving data consistency.

What carries the argument

Tokenized context replication, which stores and shares user context as pre-processed token sequences so nodes exchange compact updates without re-tokenizing or sending full text.

Load-bearing premise

That storing and replicating context in tokenized form avoids redundant computation and enables efficient replication without introducing new consistency, overhead, or scalability problems in geo-distributed real-world settings.

What would settle it

A test deployment showing higher synchronization overhead or context mismatches under realistic network partitions and node churn would disprove the claimed gains.

Figures

Figures reproduced from arXiv: 2511.22599 by David Bermbach, Minghe Wang, Mohammadreza Malekabbasi.

**Figure 1.** Figure 1: DisCEdge Architecture Overview In this section, we present the architecture of DisCEdge, our distributed context management system for edge LLMs. The system is designed to efficiently manage user context across geo-distributed edge nodes, enabling low-latency interactions with LLMs as they move. As shown in [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗

**Figure 2.** Figure 2: The “LLM Service” as an inference framework, and [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 5.** Figure 5: Network overhead for synchronizing context data [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗

**Figure 4.** Figure 4: Tokens generated per second (TPS) for tokenized [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 6.** Figure 6: Client-observable response time per turn in a mo [PITH_FULL_IMAGE:figures/full_fig_p006_6.png] view at source ↗

**Figure 7.** Figure 7: Client-to-server network usage per request turn. [PITH_FULL_IMAGE:figures/full_fig_p007_7.png] view at source ↗

read the original abstract

Deploying Large Language Model (LLM) services at the edge benefits latency-sensitive and privacy-aware applications. However, the stateless nature of LLMs makes managing user context (e.g., sessions, preferences) across geo-distributed edge nodes challenging. Existing solutions, such as client-side context storage, introduce network latency and bandwidth overhead, undermining edge deployment advantages. We propose DisCEdge, a distributed context management system that stores and replicates user context in tokenized form across edge nodes. By maintaining context as token sequences, our system avoids redundant computation and enables efficient data replication. We evaluate an open-source prototype in a realistic edge environment. DisCEdge improves median response times by up to 14.46% and lowers median inter-node synchronization overhead by up to 15% compared to a raw-text-based system. It also reduces client request sizes by a median of 90% compared to client-side context management, while guaranteeing data consistency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces DisCEdge, a distributed context management system for LLMs at the edge. Context is stored and replicated in tokenized form across geo-distributed edge nodes to avoid redundant tokenization, enable efficient replication, and reduce client-side overheads. An open-source prototype is evaluated in a realistic edge environment, claiming up to 14.46% improvement in median response times and 15% lower median inter-node synchronization overhead versus a raw-text baseline, plus a 90% median reduction in client request sizes versus client-side context management, while guaranteeing data consistency.

Significance. If the performance claims are robust, the tokenized replication strategy could meaningfully improve latency and bandwidth efficiency for edge LLM deployments in privacy-sensitive settings. The open-source prototype and focus on consistency guarantees are strengths that support potential adoption and further study in distributed edge systems.

major comments (2)

[Abstract / Evaluation] Abstract and Evaluation section: The reported gains (14.46% median response time, 15% synchronization overhead) are presented without methodology details such as baseline definitions (exact configuration of the raw-text system), edge environment parameters (node count, network RTT distribution), workload characteristics, or statistical measures (error bars, trial counts). This directly affects verifiability of the central empirical claims, especially given the skeptic concern that geo-distributed latencies (50-200 ms) and concurrent updates could alter the observed overhead reductions.
[System Design] System Design / Replication Protocol: The core claim that tokenized context storage enables efficient replication with strong consistency guarantees is load-bearing, yet the manuscript does not detail the protocol for handling concurrent appends, ordering, or conflict resolution on token sequences. Without this, it is unclear whether the 15% overhead reduction would persist under realistic concurrent session updates or higher inter-node latencies.

minor comments (2)

[Abstract] The abstract states 'guaranteeing data consistency' without specifying strong vs. eventual consistency; a short clarification in the introduction or design section would improve precision.
[Evaluation] Figure or table captions in the evaluation could more explicitly label the exact baselines and metrics used for the 14.46% and 15% figures to aid quick comprehension.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and outline the revisions we will make to improve the clarity and verifiability of our claims.

read point-by-point responses

Referee: [Abstract / Evaluation] Abstract and Evaluation section: The reported gains (14.46% median response time, 15% synchronization overhead) are presented without methodology details such as baseline definitions (exact configuration of the raw-text system), edge environment parameters (node count, network RTT distribution), workload characteristics, or statistical measures (error bars, trial counts). This directly affects verifiability of the central empirical claims, especially given the skeptic concern that geo-distributed latencies (50-200 ms) and concurrent updates could alter the observed overhead reductions.

Authors: We agree that additional methodological details are required for full verifiability. In the revised manuscript we will expand the Evaluation section to define the raw-text baseline (replication of untokenized text strings with on-demand tokenization at each node), specify the experimental setup (five edge nodes with emulated RTTs drawn from a 50-200 ms distribution), characterize the workloads (session lengths, append frequencies, and update concurrency levels), and report statistical measures (medians with interquartile ranges from 20 independent trials). We will also add a short discussion of how the observed overhead reductions behave under the cited latency range and concurrent updates, drawing on the trace data already collected. revision: yes
Referee: [System Design] System Design / Replication Protocol: The core claim that tokenized context storage enables efficient replication with strong consistency guarantees is load-bearing, yet the manuscript does not detail the protocol for handling concurrent appends, ordering, or conflict resolution on token sequences. Without this, it is unclear whether the 15% overhead reduction would persist under realistic concurrent session updates or higher inter-node latencies.

Authors: We acknowledge that the replication protocol for concurrent appends and conflict resolution was described at too high a level. DisCEdge maintains token sequences as a causally ordered log using vector clocks; appends are assigned logical timestamps and merged deterministically by timestamp order to preserve strong consistency. In the revised version we will add a dedicated subsection with pseudocode and a step-by-step description of append handling, ordering, and conflict resolution. Our existing evaluation already includes concurrent-update traces; we will clarify this and note that while absolute synchronization cost scales with latency, the relative 15 % reduction from avoiding re-tokenization remains consistent across the tested range. revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical system evaluation

full rationale

The paper proposes DisCEdge as a distributed context management system that stores and replicates user context in tokenized form, then reports direct empirical measurements from an open-source prototype evaluated in a realistic edge environment. Performance numbers (e.g., 14.46% response time improvement, 15% synchronization overhead reduction, 90% client request size reduction) are presented as observed results of the prototype rather than predictions derived from equations, fitted parameters, or first-principles derivations. No self-definitional steps, fitted-input predictions, load-bearing self-citations, uniqueness theorems, or ansatz smuggling appear in the provided abstract or description. The central claims rest on experimental data against external benchmarks (raw-text baseline and client-side management), making the work self-contained without any reduction of outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work relies on standard assumptions from distributed systems and LLM inference pipelines but introduces no explicit free parameters, axioms, or invented entities beyond the system name itself. All reported gains are empirical.

pith-pipeline@v0.9.0 · 5468 in / 1181 out tokens · 39713 ms · 2026-05-17T04:13:35.628793+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · 5 internal anchors

[1]

David Bermbach, Jörn Kuhlenkamp, Bugra Derre, Markus Klems, and Stefan Tai. 2013. A Middleware Guaranteeing Client-Centric Consistency on Top of Eventually Consistent Datastores. InProceedings of the 1st IEEE International Conference on Cloud Engineering(San Francisco, CA, USA)(IC2E 2013). IEEE, New York, NY, USA, 114–123. doi:10.1109/IC2E.2013.32

work page doi:10.1109/ic2e.2013.32 2013
[2]

Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al

work page
[3]

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

𝜋0: A Vision-Language-Action Flow Model for General Robot Control. arXiv:2410.24164 [cs.LG] https://arxiv.org/abs/2410.24164

work page internal anchor Pith review Pith/arXiv arXiv
[4]

Eric A Brewer. 2000. Towards robust distributed systems. InPODC, Vol. 7. Portland, OR, 343–477

work page 2000
[5]

Yuji Chai, Mujin Kwen, David Brooks, and Gu-Yeon Wei. 2025. FlexQuant: Elastic Quantization Framework for Locally Hosted LLM on Edge Devices.arXiv preprint arXiv:2501.07139(2025)

work page arXiv 2025
[6]

Guojun Chen, Xiaojing Yu, Neiwen Ling, and Lin Zhong. 2025. ChatFly: Low- Latency Drone Planning with Large Language Models.IEEE Transactions on Mobile Computing(2025)

work page 2025
[7]

Kaiyuan Chen, Nan Tian, Christian Juette, Tianshuang Qiu, Liu Ren, John Kubia- towicz, and Ken Goldberg. 2024. FogROS2-PLR: Probabilistic Latency-Reliability For Cloud Robotics.arXiv preprint arXiv:2410.05562(2024)

work page arXiv 2024
[8]

Cloudflare Workers AI. [n. d.]. Cloudflare Workers AI. https://developers. cloudflare.com/workers-ai. Accessed: 2025-04-23

work page 2025
[9]

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. 2025. Gemini 2.5: Pushing the frontier with advanced reasoning, multi- modality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[10]

Yiran Ding, Li Lyna Zhang, Chengruidong Zhang, Yuanyuan Xu, Ning Shang, Jiahang Xu, Fan Yang, and Mao Yang. 2024. Longrope: Extending llm context window beyond 2 million tokens.arXiv preprint arXiv:2402.13753(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[11]

Weizhi Fei, Xueyan Niu, Pingyi Zhou, Lu Hou, Bo Bai, Lei Deng, and Wei Han

work page
[12]

Extending context window of large language models via semantic com- pression.arXiv preprint arXiv:2312.09571(2023)

work page arXiv 2023
[13]

Wei Gao, Xinyu Zhou, Peng Sun, Tianwei Zhang, and Yonggang Wen. 2025. Rethinking Key-Value Cache Compression Techniques for Large Language Model Serving.arXiv preprint arXiv:2503.24000(2025)

work page arXiv 2025
[14]

Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yixin Dai, Jiawei Sun, Haofen Wang, and Haofen Wang. 2023. Retrieval-augmented generation for large language models: A survey.arXiv preprint arXiv:2312.10997 2, 1 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[15]

Georgi Gerganov. 2023. GGUF: GGML Universal File Format. https://github. com/ggml-org/ggml/blob/master/docs/gguf.md

work page 2023
[16]

Georgi Gerganov et al . 2023. llama.cpp: LLM inference in C/C++. GitHub repository. https://github.com/ggml-org/llama.cpp Commit a33e6a (Feb 26, 2024); accessed 2025-04-01

work page 2023
[17]

Google Cloud. [n. d.]. Vertex AI. https://cloud.google.com/vertex-ai. Accessed: 2025-04-23

work page 2025
[18]

Harshit Gupta and Umakishore Ramachandran. 2018. Fogstore: A geo-distributed key-value store guaranteeing low latency for strongly consistent access. In Proceedings of the 12th ACM International Conference on Distributed and Event- based Systems. 148–159

work page 2018
[19]

Hugging Face. [n. d.]. Hugging Face Inference Endpoints. https://endpoints. huggingface.co. Accessed: 2025-04-23

work page 2025
[20]

Jeffrey Ichnowski, Kaiyuan Chen, Karthik Dharmarajan, Simeon Adebola, Michael Danielczuk, Víctor Mayoral-Vilches, Nikhil Jha, Hugo Zhan, Edith LLon- top, Derek Xu, et al . 2023. Fogros2: An adaptive platform for cloud and fog robotics using ros 2. In2023 IEEE international conference on robotics and automa- tion (ICRA). IEEE, 5493–5500

work page 2023
[21]

Ishan Kavathekar, Raghav Donakanti, Ponnurangam Kumaraguru, and Karthik Vaidhyanathan. 2025. Small models, big tasks: An exploratory empirical study on small language models for function calling.arXiv preprint arXiv:2504.19277 (2025)

work page arXiv 2025
[22]

Guangyuan Liu, Yinqiu Liu, Jiacheng Wang, Hongyang Du, Dusit Niyato, Jiawen Kang, and Zehui Xiong. 2025. Adaptive Contextual Caching for Mobile Edge Large Language Model Service.arXiv preprint arXiv:2501.09383(2025)

work page arXiv 2025
[23]

Mohammadreza Malekabbasi, Tobias Pfandzelter, Trever Schirmer, and David Bermbach. 2024. GeoFaaS: An Edge-to-Cloud FaaS Platform. InProceedings of the 12th IEEE International Conference on Cloud Engineering(Paphos, Cyprus) (IC2E ’24). IEEE, New York, NY, USA, 66–71. doi:10.1109/IC2E61754.2024.00014

work page doi:10.1109/ic2e61754.2024.00014 2024
[24]

2024.What’s an LLM context window and why is it getting larger? https://research.ibm.com/blog/larger-context-window

Kim Martineau. 2024.What’s an LLM context window and why is it getting larger? https://research.ibm.com/blog/larger-context-window

work page 2024
[25]

Matteo Mendula, Paolo Bellavista, Marco Levorato, and Sharon Ladron de Gue- vara Contreras. 2024. Furcifer: a Context Adaptive Middleware for Real-world Object Detection Exploiting Local, Edge, and Split Computing in the Cloud Continuum. In2024 IEEE International Conference on Pervasive Computing and Communications (PerCom). IEEE, 47–56

work page 2024
[26]

Mirza Alim Mutasodirin and Radityo Eko Prasojo. 2021. Investigating text shortening strategy in bert: Truncation vs summarization. In2021 international conference on advanced computer science and information systems (icacsis). IEEE, 1–5

work page 2021
[27]

Avanika Narayan, Dan Biderman, Sabri Eyuboglu, Avner May, Scott Linderman, James Zou, and Christopher Re. 2025. Minions: Cost-efficient collaboration between on-device and cloud language models.arXiv preprint arXiv:2502.15964 (2025)

work page arXiv 2025
[28]

2024.Fine-tuning LLMs for longer context and better RAG systems

Artur Niederfahrenhorst and Kourosh Hakhamaneshi. 2024.Fine-tuning LLMs for longer context and better RAG systems. https://www.anyscale.com/blog/fine- tuning-llms-for-longer-context-and-better-rag-systems

work page 2024
[29]

Tobias Pfandzelter, Nils Japke, Trever Schirmer, Jonathan Hasenburg, and David Bermbach. 2023. Managing Data Replication and Distribution in the Fog with FReD.Software: Practice and Experience53, 10 (Oct. 2023), 1958–1981. doi:10. 1002/spe.3237

work page 2023
[30]

SearchWing Project. [n. d.]. SearchWing: Drones for Sea Rescue. https://tha.de/ searchwing/. Accessed: 2025-04-18

work page 2025
[31]

Guanqiao Qu, Qiyuan Chen, Wei Wei, Zheng Lin, Xianhao Chen, and Kaibin Huang. 2025. Mobile edge intelligence for large language models: A contempo- rary survey.IEEE Communications Surveys & Tutorials(2025)

work page 2025
[32]

Replicate. [n. d.]. Replicate. https://replicate.com/home. Accessed: 2025-04-23

work page 2025
[33]

Peter Schafhalter, Sukrit Kalra, Le Xu, Joseph E Gonzalez, and Ion Stoica. 2023. Leveraging cloud computing to make autonomous vehicles safer. In2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 5559– 5566

work page 2023
[34]

Peter Schafhalter, Alexander Krentsel, Joseph E Gonzalez, Sylvia Ratnasamy, Scott Shenker, and Ion Stoica. 2025. Bandwidth Allocation for Cloud-Augmented Autonomous Driving.arXiv preprint arXiv:2503.20127(2025)

work page arXiv 2025
[35]

Guanqun Wang, Jiaming Liu, Chenxuan Li, Yuan Zhang, Junpeng Ma, Xinyu Wei, Kevin Zhang, Maurice Chong, Renrui Zhang, Yijiang Liu, et al. 2024. Cloud-device collaborative learning for multimodal large language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12646–12655

work page 2024
[36]

Qingyue Wang, Yanhe Fu, Yanan Cao, Shuai Wang, Zhiliang Tian, and Liang Ding. 2025. Recursively summarizing enables long-term dialogue memory in large language models.Neurocomputing639 (2025), 130193

work page 2025
[37]

Zichen Wen, Yifeng Gao, Weijia Li, Conghui He, and Linfeng Zhang. 2025. To- ken Pruning in Multimodal Large Language Models: Are We Solving the Right Problem?arXiv preprint arXiv:2502.11501(2025)

work page arXiv 2025
[38]

Zijun Wu, Bingyuan Liu, Ran Yan, Lei Chen, and Thomas Delteil. 2024. Reducing Distraction in Long-Context Language Models by Focused Learning.arXiv preprint arXiv:2411.05928(2024)

work page arXiv 2024
[39]

Daliang Xu, Wangsong Yin, Hao Zhang, Xin Jin, Ying Zhang, Shiyun Wei, Meng- wei Xu, and Xuanzhe Liu. 2024. EdgeLLM: Fast On-device LLM Inference with Speculative Decoding.IEEE Transactions on Mobile Computing(2024)

work page 2024
[40]

Shengyuan Ye, Bei Ouyang, Liekang Zeng, Tianyi Qian, Xiaowen Chu, Jian Tang, and Xu Chen. 2025. Jupiter: Fast and resource-efficient collaborative inference of generative llms on edge devices.arXiv preprint arXiv:2504.08242(2025)

work page arXiv 2025
[41]

Wangsong Yin, Mengwei Xu, Yuanchun Li, and Xuanzhe Liu. 2024. Llm as a system service on mobile devices.arXiv preprint arXiv:2403.11805(2024)

work page arXiv 2024
[42]

Mingjin Zhang, Xiaoming Shen, Jiannong Cao, Zeyang Cui, and Shan Jiang

work page
[43]

Edgeshard: Efficient llm inference via collaborative edge computing.IEEE Internet of Things Journal(2024)

work page 2024
[44]

Peiyuan Zhang, Guangtao Zeng, Tianduo Wang, and Wei Lu. 2024. Tinyllama: An open-source small language model.arXiv preprint arXiv:2401.02385(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[45]

R o b o t i c s ␣ and ␣ Autonomous ␣ Systems ␣ T e s t

Yue Zheng, Yuhao Chen, Bin Qian, Xiufang Shi, Yuanchao Shu, and Jiming Chen. 2025. A review on edge large language models: Design, execution, and applications.Comput. Surveys57, 8 (2025), 1–35. DisCEdge: Distributed Context Management for Large Language Models at the Edge A Experiment Details This appendix provides the complete configuration details for t...

work page 2025

[1] [1]

David Bermbach, Jörn Kuhlenkamp, Bugra Derre, Markus Klems, and Stefan Tai. 2013. A Middleware Guaranteeing Client-Centric Consistency on Top of Eventually Consistent Datastores. InProceedings of the 1st IEEE International Conference on Cloud Engineering(San Francisco, CA, USA)(IC2E 2013). IEEE, New York, NY, USA, 114–123. doi:10.1109/IC2E.2013.32

work page doi:10.1109/ic2e.2013.32 2013

[2] [2]

Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al

work page

[3] [3]

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

𝜋0: A Vision-Language-Action Flow Model for General Robot Control. arXiv:2410.24164 [cs.LG] https://arxiv.org/abs/2410.24164

work page internal anchor Pith review Pith/arXiv arXiv

[4] [4]

Eric A Brewer. 2000. Towards robust distributed systems. InPODC, Vol. 7. Portland, OR, 343–477

work page 2000

[5] [5]

Yuji Chai, Mujin Kwen, David Brooks, and Gu-Yeon Wei. 2025. FlexQuant: Elastic Quantization Framework for Locally Hosted LLM on Edge Devices.arXiv preprint arXiv:2501.07139(2025)

work page arXiv 2025

[6] [6]

Guojun Chen, Xiaojing Yu, Neiwen Ling, and Lin Zhong. 2025. ChatFly: Low- Latency Drone Planning with Large Language Models.IEEE Transactions on Mobile Computing(2025)

work page 2025

[7] [7]

Kaiyuan Chen, Nan Tian, Christian Juette, Tianshuang Qiu, Liu Ren, John Kubia- towicz, and Ken Goldberg. 2024. FogROS2-PLR: Probabilistic Latency-Reliability For Cloud Robotics.arXiv preprint arXiv:2410.05562(2024)

work page arXiv 2024

[8] [8]

Cloudflare Workers AI. [n. d.]. Cloudflare Workers AI. https://developers. cloudflare.com/workers-ai. Accessed: 2025-04-23

work page 2025

[9] [9]

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. 2025. Gemini 2.5: Pushing the frontier with advanced reasoning, multi- modality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[10] [10]

Yiran Ding, Li Lyna Zhang, Chengruidong Zhang, Yuanyuan Xu, Ning Shang, Jiahang Xu, Fan Yang, and Mao Yang. 2024. Longrope: Extending llm context window beyond 2 million tokens.arXiv preprint arXiv:2402.13753(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[11] [11]

Weizhi Fei, Xueyan Niu, Pingyi Zhou, Lu Hou, Bo Bai, Lei Deng, and Wei Han

work page

[12] [12]

Extending context window of large language models via semantic com- pression.arXiv preprint arXiv:2312.09571(2023)

work page arXiv 2023

[13] [13]

Wei Gao, Xinyu Zhou, Peng Sun, Tianwei Zhang, and Yonggang Wen. 2025. Rethinking Key-Value Cache Compression Techniques for Large Language Model Serving.arXiv preprint arXiv:2503.24000(2025)

work page arXiv 2025

[14] [14]

Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yixin Dai, Jiawei Sun, Haofen Wang, and Haofen Wang. 2023. Retrieval-augmented generation for large language models: A survey.arXiv preprint arXiv:2312.10997 2, 1 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[15] [15]

Georgi Gerganov. 2023. GGUF: GGML Universal File Format. https://github. com/ggml-org/ggml/blob/master/docs/gguf.md

work page 2023

[16] [16]

Georgi Gerganov et al . 2023. llama.cpp: LLM inference in C/C++. GitHub repository. https://github.com/ggml-org/llama.cpp Commit a33e6a (Feb 26, 2024); accessed 2025-04-01

work page 2023

[17] [17]

Google Cloud. [n. d.]. Vertex AI. https://cloud.google.com/vertex-ai. Accessed: 2025-04-23

work page 2025

[18] [18]

Harshit Gupta and Umakishore Ramachandran. 2018. Fogstore: A geo-distributed key-value store guaranteeing low latency for strongly consistent access. In Proceedings of the 12th ACM International Conference on Distributed and Event- based Systems. 148–159

work page 2018

[19] [19]

Hugging Face. [n. d.]. Hugging Face Inference Endpoints. https://endpoints. huggingface.co. Accessed: 2025-04-23

work page 2025

[20] [20]

Jeffrey Ichnowski, Kaiyuan Chen, Karthik Dharmarajan, Simeon Adebola, Michael Danielczuk, Víctor Mayoral-Vilches, Nikhil Jha, Hugo Zhan, Edith LLon- top, Derek Xu, et al . 2023. Fogros2: An adaptive platform for cloud and fog robotics using ros 2. In2023 IEEE international conference on robotics and automa- tion (ICRA). IEEE, 5493–5500

work page 2023

[21] [21]

Ishan Kavathekar, Raghav Donakanti, Ponnurangam Kumaraguru, and Karthik Vaidhyanathan. 2025. Small models, big tasks: An exploratory empirical study on small language models for function calling.arXiv preprint arXiv:2504.19277 (2025)

work page arXiv 2025

[22] [22]

Guangyuan Liu, Yinqiu Liu, Jiacheng Wang, Hongyang Du, Dusit Niyato, Jiawen Kang, and Zehui Xiong. 2025. Adaptive Contextual Caching for Mobile Edge Large Language Model Service.arXiv preprint arXiv:2501.09383(2025)

work page arXiv 2025

[23] [23]

Mohammadreza Malekabbasi, Tobias Pfandzelter, Trever Schirmer, and David Bermbach. 2024. GeoFaaS: An Edge-to-Cloud FaaS Platform. InProceedings of the 12th IEEE International Conference on Cloud Engineering(Paphos, Cyprus) (IC2E ’24). IEEE, New York, NY, USA, 66–71. doi:10.1109/IC2E61754.2024.00014

work page doi:10.1109/ic2e61754.2024.00014 2024

[24] [24]

2024.What’s an LLM context window and why is it getting larger? https://research.ibm.com/blog/larger-context-window

Kim Martineau. 2024.What’s an LLM context window and why is it getting larger? https://research.ibm.com/blog/larger-context-window

work page 2024

[25] [25]

Matteo Mendula, Paolo Bellavista, Marco Levorato, and Sharon Ladron de Gue- vara Contreras. 2024. Furcifer: a Context Adaptive Middleware for Real-world Object Detection Exploiting Local, Edge, and Split Computing in the Cloud Continuum. In2024 IEEE International Conference on Pervasive Computing and Communications (PerCom). IEEE, 47–56

work page 2024

[26] [26]

Mirza Alim Mutasodirin and Radityo Eko Prasojo. 2021. Investigating text shortening strategy in bert: Truncation vs summarization. In2021 international conference on advanced computer science and information systems (icacsis). IEEE, 1–5

work page 2021

[27] [27]

Avanika Narayan, Dan Biderman, Sabri Eyuboglu, Avner May, Scott Linderman, James Zou, and Christopher Re. 2025. Minions: Cost-efficient collaboration between on-device and cloud language models.arXiv preprint arXiv:2502.15964 (2025)

work page arXiv 2025

[28] [28]

2024.Fine-tuning LLMs for longer context and better RAG systems

Artur Niederfahrenhorst and Kourosh Hakhamaneshi. 2024.Fine-tuning LLMs for longer context and better RAG systems. https://www.anyscale.com/blog/fine- tuning-llms-for-longer-context-and-better-rag-systems

work page 2024

[29] [29]

Tobias Pfandzelter, Nils Japke, Trever Schirmer, Jonathan Hasenburg, and David Bermbach. 2023. Managing Data Replication and Distribution in the Fog with FReD.Software: Practice and Experience53, 10 (Oct. 2023), 1958–1981. doi:10. 1002/spe.3237

work page 2023

[30] [30]

SearchWing Project. [n. d.]. SearchWing: Drones for Sea Rescue. https://tha.de/ searchwing/. Accessed: 2025-04-18

work page 2025

[31] [31]

Guanqiao Qu, Qiyuan Chen, Wei Wei, Zheng Lin, Xianhao Chen, and Kaibin Huang. 2025. Mobile edge intelligence for large language models: A contempo- rary survey.IEEE Communications Surveys & Tutorials(2025)

work page 2025

[32] [32]

Replicate. [n. d.]. Replicate. https://replicate.com/home. Accessed: 2025-04-23

work page 2025

[33] [33]

Peter Schafhalter, Sukrit Kalra, Le Xu, Joseph E Gonzalez, and Ion Stoica. 2023. Leveraging cloud computing to make autonomous vehicles safer. In2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 5559– 5566

work page 2023

[34] [34]

Peter Schafhalter, Alexander Krentsel, Joseph E Gonzalez, Sylvia Ratnasamy, Scott Shenker, and Ion Stoica. 2025. Bandwidth Allocation for Cloud-Augmented Autonomous Driving.arXiv preprint arXiv:2503.20127(2025)

work page arXiv 2025

[35] [35]

Guanqun Wang, Jiaming Liu, Chenxuan Li, Yuan Zhang, Junpeng Ma, Xinyu Wei, Kevin Zhang, Maurice Chong, Renrui Zhang, Yijiang Liu, et al. 2024. Cloud-device collaborative learning for multimodal large language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12646–12655

work page 2024

[36] [36]

Qingyue Wang, Yanhe Fu, Yanan Cao, Shuai Wang, Zhiliang Tian, and Liang Ding. 2025. Recursively summarizing enables long-term dialogue memory in large language models.Neurocomputing639 (2025), 130193

work page 2025

[37] [37]

Zichen Wen, Yifeng Gao, Weijia Li, Conghui He, and Linfeng Zhang. 2025. To- ken Pruning in Multimodal Large Language Models: Are We Solving the Right Problem?arXiv preprint arXiv:2502.11501(2025)

work page arXiv 2025

[38] [38]

Zijun Wu, Bingyuan Liu, Ran Yan, Lei Chen, and Thomas Delteil. 2024. Reducing Distraction in Long-Context Language Models by Focused Learning.arXiv preprint arXiv:2411.05928(2024)

work page arXiv 2024

[39] [39]

Daliang Xu, Wangsong Yin, Hao Zhang, Xin Jin, Ying Zhang, Shiyun Wei, Meng- wei Xu, and Xuanzhe Liu. 2024. EdgeLLM: Fast On-device LLM Inference with Speculative Decoding.IEEE Transactions on Mobile Computing(2024)

work page 2024

[40] [40]

Shengyuan Ye, Bei Ouyang, Liekang Zeng, Tianyi Qian, Xiaowen Chu, Jian Tang, and Xu Chen. 2025. Jupiter: Fast and resource-efficient collaborative inference of generative llms on edge devices.arXiv preprint arXiv:2504.08242(2025)

work page arXiv 2025

[41] [41]

Wangsong Yin, Mengwei Xu, Yuanchun Li, and Xuanzhe Liu. 2024. Llm as a system service on mobile devices.arXiv preprint arXiv:2403.11805(2024)

work page arXiv 2024

[42] [42]

Mingjin Zhang, Xiaoming Shen, Jiannong Cao, Zeyang Cui, and Shan Jiang

work page

[43] [43]

Edgeshard: Efficient llm inference via collaborative edge computing.IEEE Internet of Things Journal(2024)

work page 2024

[44] [44]

Peiyuan Zhang, Guangtao Zeng, Tianduo Wang, and Wei Lu. 2024. Tinyllama: An open-source small language model.arXiv preprint arXiv:2401.02385(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[45] [45]

R o b o t i c s ␣ and ␣ Autonomous ␣ Systems ␣ T e s t

Yue Zheng, Yuhao Chen, Bin Qian, Xiufang Shi, Yuanchao Shu, and Jiming Chen. 2025. A review on edge large language models: Design, execution, and applications.Comput. Surveys57, 8 (2025), 1–35. DisCEdge: Distributed Context Management for Large Language Models at the Edge A Experiment Details This appendix provides the complete configuration details for t...

work page 2025