archive
Every paper Pith has read.
1164 papers in cs.DC
-
SplitFT speeds LLM fine-tuning with adaptive client cut layers
SplitFT: An Adaptive Federated Split Learning System For LLMs Fine-Tuning
-
Pipelined sharding speeds client xLM inference up to 30x with 10x less VRAM
Efficient, VRAM-Constrained xLM Inference on Clients
-
Folding parallelism cuts memory for long-context transformers
Folding Tensor and Sequence Parallelism for Memory-Efficient Transformer Training & Inference
-
Multi-version rollout lifts LLM RL throughput 2-3x while keeping convergence
DORA: A Scalable Asynchronous Reinforcement Learning System for Language Model Training
-
Memory-centric chiplets cut attention latency 15x
AMMA: A Multi-Chiplet Memory-Centric Architecture for Low-Latency 1M Context Attention Serving
-
Direct remote access beats prefetching for LLM GPU offloading
DAK: Direct-Access-Enabled GPU Memory Offloading with Optimal Efficiency for LLM Inference
-
Wave cost model picks MoE kernels with 0.93% regret
RaMP: Runtime-Aware Megakernel Polymorphism for Mixture-of-Experts
-
Workflow structure lets Pythia speed up multi-agent LLM serving
Pythia: Exploiting Workflow Predictability for Efficient Agent-Native LLM Serving
-
Simple interface lifts multi-agent LLM serving throughput
Pythia: Exploiting Workflow Predictability for Efficient Agent-Native LLM Serving
-
Speculative decoding cuts federated LLM communication
SpecFed: Accelerating Federated LLM Inference with Speculative Decoding and Compressed Transmission
-
Exclusive scans finish in log p rounds with bounded operator uses
Two Efficient Message-passing Exclusive Scan Algorithms
-
Hierarchical FL setups lower energy for plant disease classification
Performance and Energy Trade-Off Analysis of Hierarchical Federated Learning for Plant Disease Classification
-
Volitional states guard atomic machine actions in people-machine systems
Volitional Multiagent Atomic Transactions: Describing People and their Machines
-
Computing clusters cut emissions by timing jobs to renewable surplus
Economical and ecological impact of sector coupling applied to computing clusters
-
Warp-tiled kernels cut depthwise convolution time 3.26x
CUDA Kernel Optimization and Counter-Free Performance Analysis for Depthwise Convolution in Cloud Environments
-
Microservice systems often model only partial production dynamics
Adaptive Management of Microservices in Dynamic Computing Environments: A Taxonomy and Future Directions
-
3D parallelism cuts first-token time in LLM serving by 10-62%
CacheFlow: Efficient LLM Serving with 3D-Parallel KV Cache Restoration
-
Fixed-input lock keeps Spark policy outputs identical under repartitioning
Spark Policy Toolkit: Semantic Contracts and Scalable Execution for Policy Learning in Spark
-
IoE unifies people, data, and things for 6G automation
Internet of Everything in the 6G Era: Paradigms, Enablers, Potentials and Future Directions
-
Repository blockchain turns fork chains into trees for single-process access
A Tree-Based Repository Blockchain Framework for Shared Governance in Collaborative Fork Ecosystems
-
One shared KV cache serves 15 agents at 97.7% less memory
PolyKV: A Shared Asymmetrically-Compressed KV Cache Pool for Multi-Agent LLM Inference
-
Merkle trees allow 2-3x larger post-quantum cert chains
Network Impact of Post-Quantum Certificate Chain sizes on Time to First Byte in TLS Deployments
-
SpotVista picks multi-node spots with 81% higher availability
SpotVista: Availability-Aware Recommendation System for Reliable and Cost-Efficient Multi-Node Spot Instances
-
Split learning lets clients fine-tune LLMs without sharing data
A Survey on Split Learning for LLM Fine-Tuning: Models, Systems, and Privacy Optimizations
-
Incisor is a cloud system that pairs program analysis tools with large language models to…
Incisor: Ex Ante Cloud Instance Selection for HPC Jobs
-
Exact scheduler improves IoT latency
Exact, Efficient, and Reliable Multi-Objective and Multi-Constrained IoT Workflow Scheduling in Edge-Hub-Cloud Cyber-Physical Systems
-
Multi-agent LLM tutor runs full semester without boundary failures
ITAS: A Multi-Agent Architecture for LLM-Based Intelligent Tutoring
-
Priority PayGo holds tutoring under 4s at 50 users
Latency and Cost of Multi-Agent Intelligent Tutoring at Scale
-
Atomistic model reaches year-and-meter scales for RPV steel
Unfolding an Atomistic World: Atomistic Simulation of Reactor Pressure Vessel Steel Across Year-and-Meter Scales
-
AtomWorld simulates RPV steel atom by atom at meter and year scales
Unfolding an Atomistic World: Atomistic Simulation of Reactor Pressure Vessel Steel Across Year-and-Meter Scales
-
TACO cuts tensor-parallel communication to raise LLM training speed 1.87x
TACO: Efficient Communication Compression of Intermediate Tensors for Scalable Tensor-Parallel LLM Training
-
FreeScale cuts bubbles by 90 percent in recommendation model training
FreeScale: Distributed Training for Sequence Recommendation Models with Minimal Scaling Cost
-
Kubernetes spot system gains 55% more performance per dollar
KubePACS: Kubernetes Cluster Using Performant, Highly Available, and Cost Efficient Spot Instances
-
CommFuse removes tail latency from LLM training overlaps
CommFuse: Hiding Tail Latency via Communication Decomposition and Fusion for Distributed LLM Training
-
Distributed solver speeds large IPMs up to 97 times over single-node codes
SDSL-Solver: Scalable Distributed Sparse Linear Solvers for Large-Scale Interior Point Methods
-
Invariants proven for local-first access control data type
Towards System-Oriented Formal Verification of Local-First Access Control
-
Full-block fusion raises Pythia decoding speed 1.34x
ClusterFusion++: Expanding Cluster-Level Fusion to Full Transformer-Block Decoding
-
Isolated tracks let federated learning respect client exclusions
A Taxonomy and Resolution Strategy for Client-Level Disagreements in Federated Learning
-
Genetic algorithm lifts blockchain validator profits by 15%
The Blockchain Execution Dilemma: Optimizing Revenue XOR Fair Ordering
-
RL policy adapts caches to save 43% energy in GNN training
GreenDyGNN: Runtime-Adaptive Energy-Efficient Communication for Distributed GNN Training
-
Structured overlays beat gossip for AI agent discovery under node churn
Usable Agent Discovery for Decentralized AI Systems
-
Survey maps path for large language model inference on edge networks
Network Edge Inference for Large Language Models: Principles, Techniques, and Opportunities
-
Peer-to-peer grids obey transport lower bounds and monoid reduction rules
Mathematical Foundations for Peer-to-Peer Lattice Computation
-
Accelerators improve LLM speed on edge single-board computers
Cloud to Edge: Benchmarking LLM Inference On Hardware-Accelerated Single-Board Computers
-
Gradient entropy ranks client contributions without validation data
Data-Free Contribution Estimation in Federated Learning using Gradient von Neumann Entropy
-
Continuous bids cut cloud contention losses by 8-23%
LaissezCloud: Continuous Resource Renegotiation for the Public Cloud
-
MPS gains or loses 30% in GPU sharing depending on memory contention
A comprehensive evaluation of spatial co-execution on GPUs using MPS and MIG technologies
-
Top-K method speeds sparse decode 1.88x on Blackwell
Guess-Verify-Refine: Data-Aware Top-K for Sparse-Attention Decoding on Blackwell via Temporal Correlation
-
Multi-path GPU links with CUDA Graphs boost bandwidth 2.95x
Accelerating Intra-Node GPU-to-GPU Communication Through Multi-Path Transfers with CUDA Graphs
-
Algorithm achieves 8K-approximation for coflow scheduling in K-core OCS networks
O(K)-Approximation Coflow Scheduling in K-Core Optical Circuit Switching Networks