archive
Every paper Pith has read.
1164 papers in cs.DC · page 19
-
Simpler recovery fixes leaderless Paxos and generalizes it optimally
Making Democracy Work: Fixing and Simplifying Egalitarian Paxos (Extended Version)
-
Neuromorphic chip learns new classes with 113x lower latency
Online Continual Learning on Intel Loihi 2 via a Co-designed Spiking Neural Network
-
Stable errors bound local clock skew by O(Δ + δ log D)
Gradient Clock Synchronization with Practically Constant Local Skew
-
Uniform RDMA WriteImm interface reaches 400 Gbps on NVIDIA and AWS NICs
fabric-lib: RDMA Point-to-Point Communication for LLM Systems
-
kNN predicts good sub-system sizes for GPU tridiagonal partition
ML-Based Optimum Sub-system Size Heuristic for the GPU Implementation of the Tridiagonal Partition Method
-
AI matches human experts in designing LLM cluster algorithms
Glia: A Human-Inspired AI for Automated Systems Design and Optimization
-
Dictator clients erase honest contributions in federated learning
Power to the Clients: Federated Learning in a Dictatorship Setting
-
Quasipolylog rounds for (Δ+1)-coloring when neighborhood independence is bounded
Distributed $(\Delta+1)$-Coloring in Graphs of Bounded Neighborhood Independence
-
SWOT cuts collective communication time up to 89.7% by overlapping reconfiguration
Enabling Reconfiguration-Communication Overlap for Collective Communication in Optical Networks
-
Spot GPUs raise LLM RL throughput 1.5-2x at 28-49% lower cost
RLBoost: Harvesting Preemptible Resources for Cost-Efficient Reinforcement Learning on LLMs
-
Sparse attention trains 512K-context LLMs at 6x speed
MTraining: Distributed Dynamic Sparse Attention for Efficient Ultra-Long Context Training
-
TokenCake trims multi-agent LLM latency over 47% with smart KV cache moves
TokenCake: A KV-Cache-centric Serving Framework for LLM-based Multi-Agent Applications
-
New algorithm sustains node activity to cut broadcast latency
A New Broadcast Model for Several Network Topologies
-
Statistical method quantifies probabilistic training time guarantees at scale
PRISM: Probabilistic Runtime Insights and Scalable Performance Modeling for Large-Scale Distributed Training
-
Decode rescheduling cuts LLM P99 TPOT by 75%
STAR: Decode-Phase Rescheduling for LLM Inference
-
FlexPipe cuts reserved GPUs for LLM serving from 75% to 30% of peak
FlexPipe: Adapting Dynamic LLM Serving Through Inflight Pipeline Refactoring in Fragmented Serverless Clusters
-
Layered prefill slashes MoE TTFT by 70% without stalls
From Tokens to Layers: Redefining Stall-Free Scheduling for MoE Serving with Layered Prefill
-
Sketches cut communication waste in Byzantine-robust decentralized federated learning
SketchGuard: Scaling Byzantine-Robust Decentralized Federated Learning via Sketch-Based Screening
-
Framework measures real makespans from abstract graphs on CPU-GPU-FPGA hardware
Evaluating Rapid Makespan Predictions for Heterogeneous Systems with Programmable Logic
-
Async algorithm takes consistent snapshots with O(n) messages
Asynchronous Checkpoint for Eventually Consistent Databases
-
Fused models win on long-range atomistic properties
When Does Global Attention Help? A Unified Empirical Study on Atomistic Graph Learning
-
Profiling uncovers patterns that speed up large MoE inference 6.6x
Patterns behind Chaos: Forecasting Data Movement for Efficient Large-Scale MoE LLM Inference
-
Speculative predictions cut agent latency by up to 20%
Speculative Actions: A Lossless Framework for Faster Agentic Systems
-
Canonical rounds block optimal Byzantine consensus
Why Canonical Rounds Fail for Optimal Byzantine Resilience
-
Cutting GPU data movement lowers both time and energy for large sparse solves
On the energy efficiency of sparse matrix computations on multi-GPU clusters
-
GRACE-MoE speeds up distributed MoE inference up to 4.66x
GRACE-MoE: Grouping and Replication with Locality-Aware Routing for Efficient Distributed MoE Inference
-
HARP speeds heterogeneous GPU training by 1.3x-1.6x
HARP: Orchestrating Automated Parallel Training on Heterogeneous GPU Clusters
-
Graph model spots silent smart-grid eavesdroppers at 98% accuracy
Federated Spatiotemporal Graph Learning for Passive Attack Detection in Smart Grids
-
Simulator detects TON contract race conditions missed by static checks
BugMagnifier: TON Transaction Simulator for Revealing Smart Contract Vulnerabilities
-
132k FaaS workflow runs on AWS and Azure show scaling and cost patterns
Characterizing FaaS Workflows on Public Clouds: The Good, the Bad and the Ugly
-
Modular bricks cut multimodal AI energy by 42% on small batteries
Tiny but Mighty: A Software-Hardware Co-Design Approach for Efficient Multimodal Inference on Battery-Powered Small Devices
-
Elastic pipeline parallelism achieves 1.69x speedup for long-context LLM training
InfiniPipe: Elastic Pipeline Parallelism for Efficient Variable-Length Long-Context LLM Training
-
Frontier AI models average 0.34 Wh per query on real hardware
Energy Use of AI Inference, Efficiency Pathways, and Test-Time Scaling
-
Dynamic tensor-parallel changes boost LLM inference throughput 1.75x-6.57x
Amoeba: Runtime Tensor Parallel Transformation for LLM Inference Services
-
Metric picks key workers to steady swarm learning on uneven data
Multi-Worker Selection based Distributed Swarm Learning for Edge IoT with Non-i.i.d. Data
-
Tuned whole-brain model gains alpha rhythms and complexity
Emergent complexity and rhythms in evoked and spontaneous dynamics of human whole-brain models after tuning through analysis tools
-
TON checklist built from 233 real audit findings
From Paradigm Shift to Audit Rift: Empirical Analysis and Validation of Security Audit Methodologies for Asynchronous Smart Contract Systems
-
Single code base runs radiation hydrodynamics on any hardware scale
HARD: A Performance Portable Radiation Hydrodynamics Code based on FleCSI Framework
-
Dual-phase expert scheduling cuts MoE LLM latency up to 7.55x
DuoServe-MoE: Dual-Phase Expert Prefetch and Caching for LLM Inference QoS Assurance
-
Differentially private training gains 2.21x throughput with dynamic layer quantization
DPQuant: Efficient and Differentially-Private Model Training via Dynamic Quantization Scheduling
-
Chameleon recovers training within 11% of normal speed after faults
Chameleon: Adaptive Fault Tolerance for Distributed Training via Real-time Policy Selection
-
Resource estimates find feasible setups for distributed quantum computers
Architecting Distributed Quantum Computers: Design Insights from Resource Estimation
-
Default collectives up to 5x slower than tuned choices
PICO: Performance Insights for Collective Operations
-
HFX raises LLM SLO attainment 4.44x with joint scheduling and scaling
HFX: Joint Design of Algorithms and Systems for Multi-SLO Serving and Fast Scaling
-
CausalMesh keeps caches consistent during client migrations
CausalMesh: A Formally Verified Causally Consistent Distributed Cache with Support for Client Migration
-
Engine cuts mixed-precision LLM latency by up to 61%
LMDeploy Accelerates Mixed-Precision LLM Inference with TurboMind
-
Client KMeans filtering yields near-IID results in federated distillation
Federated Distillation on Edge Devices: Efficient Client-Side Filtering for Non-IID Data
-
Production control alone realizes any polynomial analog dynamics
Analog computation with transcriptional networks
-
Expert placement strategy cuts MoE edge latency up to 30%
Accelerating Edge Inference for Distributed MoE Models with Latency-Optimized Expert Placement
-
OPEN predicts GPU performance at 98% accuracy with minimal profiling
Coordinated Power Management on Heterogeneous Systems