KV cache eviction is unified under an information capacity maximization principle derived from a linear-Gaussian attention surrogate, with CapKV proposed as a leverage-score based implementation that outperforms prior heuristics in experiments.
hub
On-device language models: A comprehensive review.arXiv preprint arXiv:2409.00088
18 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
roles
background 3polarities
background 3representative citing papers
WISV uses a channel-aware semantic acceptance policy on hidden representations to boost accepted sequence length by up to 60.8% and cut interaction rounds by 37.3% in distributed speculative decoding, with under 1% accuracy loss.
An LLM-orchestrated physics simulation search identifies polymers with strong insulin interactions, outperforming standard optimization methods by significant margins.
SOD reweights on-policy distillation strength step-by-step using divergence to stabilize tool use in small language model agents, yielding up to 20.86% gains and 26.13% on AIME 2025 for a 0.6B model.
NVLLM offloads FFN computations to integrated 3D NAND flash with page-level access and keeps attention in DRAM, delivering 16.7x-37.9x speedups over GPU out-of-core baselines for models up to 30B parameters.
A new benchmark shows LLM smartphone agents achieve comparable success with screen text alone as with screenshots, but both fail often due to UI accessibility and reasoning gaps.
FlexServe achieves up to 10x faster time-to-first-token for secure LLM inference on mobile devices by using flexible resource isolation in TrustZone compared to standard approaches.
CoreQ delivers adaptive mismatch correction via closed-form geometric coefficient and successive rounding to improve PTQ accuracy for large language models.
AgentProg reframes interaction history as a program with variables and control flow, plus a belief state for partial observability, achieving SOTA success rates on long-horizon GUI benchmarks while baselines degrade.
Personal agents require edge deployment to preserve high-fidelity local context and zero-latency loops, as claimed through three structural shifts away from cloud-centric designs.
An LLM-based framework recommends drill-down paths in visual analytics by approximating a greedy algorithm, interpreting user intent, and managing exploration branches to reduce cognitive load.
Medium-sized LLMs can supply useful feedback on game concepts in early design stages, as demonstrated by model comparisons and a positive student pilot study.
RecGPT-Mobile runs a compact LLM on phones to understand evolving user intent from behaviors and improve mobile e-commerce recommendations.
A multi-agent multimodal system with fact-grounded adjudication and a dynamic two-tier preference graph cuts false positives in content filtering by 74.3% and nearly doubles F1-score versus text-only baselines while supporting user-driven Delta adjustments.
Corpus scaling in RAG frequently matches the accuracy gains from larger LLMs on open-domain QA tasks, with mid-sized models benefiting most due to better passage coverage.
A survey synthesizing challenges, system architectures, model optimizations, deployment methods, and resource management techniques for large language model inference at the network edge.
This research agenda argues that cloud-native architectures, microservices, autoscaling, and emerging trends like serverless inference and federated learning are required to make large language models efficient and scalable.
citing papers explorer
-
Rethinking KV Cache Eviction via a Unified Information-Theoretic Objective
KV cache eviction is unified under an information capacity maximization principle derived from a linear-Gaussian attention surrogate, with CapKV proposed as a leverage-score based implementation that outperforms prior heuristics in experiments.
-
WISV: Wireless-Informed Semantic Verification for Distributed Speculative Decoding in Device-Edge LLM Inference
WISV uses a channel-aware semantic acceptance policy on hidden representations to boost accepted sequence length by up to 60.8% and cut interaction rounds by 37.3% in distributed speculative decoding, with under 1% accuracy loss.
-
Towards Discovery of Polymers for Insulin Delivery via Physics-Grounded Agentic Workflows
An LLM-orchestrated physics simulation search identifies polymers with strong insulin interactions, outperforming standard optimization methods by significant margins.
-
SOD: Step-wise On-policy Distillation for Small Language Model Agents
SOD reweights on-policy distillation strength step-by-step using divergence to stabilize tool use in small language model agents, yielding up to 20.86% gains and 26.13% on AIME 2025 for a 0.6B model.
-
NVLLM: A 3D NAND-Centric Architecture Enabling Edge on-Device LLM Inference
NVLLM offloads FFN computations to integrated 3D NAND flash with page-level access and keeps attention in DRAM, delivering 16.7x-37.9x speedups over GPU out-of-core baselines for models up to 30B parameters.
-
Do LLMs Need to See Everything? A Benchmark and Study of Failures in LLM-driven Smartphone Automation using Screentext vs. Screenshots
A new benchmark shows LLM smartphone agents achieve comparable success with screen text alone as with screenshots, but both fail often due to UI accessibility and reasoning gaps.
-
FlexServe: A Fast and Secure LLM Serving System for Mobile Devices with Flexible Resource Isolation
FlexServe achieves up to 10x faster time-to-first-token for secure LLM inference on mobile devices by using flexible resource isolation in TrustZone compared to standard approaches.
-
CoreQ: Learning-Free Mismatch Correction and Successive Rounding for Quantization
CoreQ delivers adaptive mismatch correction via closed-form geometric coefficient and successive rounding to improve PTQ accuracy for large language models.
-
AgentProg: Empowering Long-Horizon GUI Agents with Program-Guided Context Management
AgentProg reframes interaction history as a program with variables and control flow, plus a belief state for partial observability, achieving SOTA success rates on long-horizon GUI benchmarks while baselines degrade.
-
Beyond Scaling: Agents Are Heading to the Edge
Personal agents require edge deployment to preserve high-fidelity local context and zero-latency loops, as claimed through three structural shifts away from cloud-centric designs.
-
Intelligent Drill-Down: Large Language Model-Driven Drill-Down Technique for Human-AI Collaborative Visual Exploration
An LLM-based framework recommends drill-down paths in visual analytics by approximating a greedy algorithm, interpreting user intent, and managing exploration branches to reduce cognitive load.
-
Diamonds in the rough: Transforming SPARCs of imagination into a game concept by leveraging medium sized LLMs
Medium-sized LLMs can supply useful feedback on game concepts in early design stages, as demonstrated by model comparisons and a positive student pilot study.
-
RecGPT-Mobile: On-Device Large Language Models for User Intent Understanding in Taobao Feed Recommendation
RecGPT-Mobile runs a compact LLM on phones to understand evolving user intent from behaviors and improve mobile e-commerce recommendations.
-
Transparent and Controllable Recommendation Filtering via Multimodal Multi-Agent Collaboration
A multi-agent multimodal system with fact-grounded adjudication and a dynamic two-tier preference graph cuts false positives in content filtering by 74.3% and nearly doubles F1-score versus text-only baselines while supporting user-driven Delta adjustments.
-
Less LLM, More Documents: Searching for Improved RAG
Corpus scaling in RAG frequently matches the accuracy gains from larger LLMs on open-domain QA tasks, with mid-sized models benefiting most due to better passage coverage.
-
Network Edge Inference for Large Language Models: Principles, Techniques, and Opportunities
A survey synthesizing challenges, system architectures, model optimizations, deployment methods, and resource management techniques for large language model inference at the network edge.
-
Cloud-native and Distributed Systems for Efficient and Scalable Large Language Models -- A Research Agenda
This research agenda argues that cloud-native architectures, microservices, autoscaling, and emerging trends like serverless inference and federated learning are required to make large language models efficient and scalable.
- ScrapMem: A Bio-inspired Framework for On-device Personalized Agent Memory via Optical Forgetting