archive
Every paper Pith has read.
1164 papers in cs.DC
-
Probe-first scheduler holds control overhead near constant in GPU clusters
Laminar: A Probe-First Scheduling Paradigm with Deterministic Runtime Survival
-
Benchmark scores LLM Azure SDK code without running it
ACE-Bench: A Lightweight Benchmark for Evaluating Azure SDK Usage Correctness
-
Mismatched weights suppress higher ranks in federated LoRA
Preventing Rank Collapse in Federated Low-Rank Adaptation with Client Heterogeneity
-
HE operations run 334 times faster on PIM hardware
DRAMatic Speedup: Accelerating HE Operations on a Processing-in-Memory System
-
Narrower overrides contain exploits as well as broad ones
Legitimate Overrides in Decentralized Protocols
-
Adaptive model deployments speed LLM serving 1.5x on average
OServe: Accelerating LLM Serving via Spatial-Temporal Workload Orchestration
-
StreamServe is a new system for running large language models that splits input…
StreamServe: Adaptive Speculative Flows for Low-Latency Disaggregated LLM Serving
-
626 autonomous AI agents form emergent social networks on their own
Emergent Social Structures in Autonomous AI Agent Networks: A Metadata Analysis of 626 Agents on the Pilot Protocol
-
The paper describes an integrated methodology combining hardware modeling
Interferences within a certifiable design methodology for high-performance multi-core platforms
-
VTC introduces virtual tensors in DNN compilation to track data movement via index…
VTC: DNN Compilation with Virtual Tensors for Data Movement Elimination
-
M3 Ultra hits 22.7 FPS real-time diffusion img2img
Systematic Optimization of Real-Time Diffusion Model Inference on Apple M3 Ultra
-
Benchmark standardizes speculative decoding across realistic loads
SPEED-Bench: A Unified and Diverse Benchmark for Speculative Decoding
-
This paper describes Para-B&B
Para-B&B: Load-Balanced Deterministic Parallelization of Solving MIP
-
Video codecs cut remote KV cache TTFT by 3.5x for LLMs
Efficient Remote KV Cache Reuse with GPU-native Video Codec
-
Three Rashomon sets formalize model multiplicity in federated learning
Rashomon Sets and Model Multiplicity in Federated Learning
-
WebGPU dispatch overhead is 24-36 μs on Vulkan
Characterizing WebGPU Dispatch Overhead for LLM Inference Across Four GPU Vendors, Three Backends, and Three Browsers
-
Equilibria enforces CXL fairness and raises performance 52 percent
Equilibria: Fair Multi-Tenant CXL Memory Tiering At Scale
-
Original papers outperform tutorials for system design mastery
The Computer System Trail
-
Grassroots logic programs get correct deterministic multiagent form
Implementing Grassroots Logic Programs with Multiagent Transition Systems and AI (Full Version)
-
Wonderboom aggregates million Ethereum signatures in one slot
Wonderboom -- Efficient, and Censorship-Resilient Signature Aggregation for Million Scale Consensus
-
GPU kernels solve stochastic optimization for over a million scenarios
From Sequential to Parallel: Reformulating Dynamic Programming as GPU Kernels for Large-Scale Stochastic Combinatorial Optimization
-
Host RAM enables single-GPU training of 120B LLMs
Horizon-LM: A RAM-Centric Architecture for LLM Training
-
Multi-agent system recovers better from smart contract audit failures
SPEAR: An Engineering Case Study of Multi-Agent Coordination for Smart Contract Auditing
-
XaaS cuts edge AI explanation latency by 38 percent
Scalable Explainability-as-a-Service (XaaS) for Edge AI Systems
-
Epoch events resolve duelling admins in CRDT groups
ERA: Epoch-Resolved Arbitration for Duelling Admins in Group Management CRDTs
-
Data centers offset carbon by supplying grid regulation
Coordinating GPU Data Centers and Power Grid Regulation Service for Exogenous Carbon Benefits
-
Stored updates remove partial-participation bias from federated training
FedAdaVR: Adaptive Variance Reduction for Robust Federated Learning under Limited Client Participation
-
Centralized critic beats decentralized critics in LLM collaboration
Learning Decentralized LLM Collaboration with Multi-Agent Actor Critic
-
Primary access hints speed Ethereum replay 25x
Ira: Efficient Transaction Replay for Distributed Systems
-
ZipMoE cuts MoE latency up to 73% on edge devices
ZipMoE: Efficient On-Device MoE Serving via Lossless Compression and Cache-Affinity Scheduling
-
First NPU designed for diffusion language model inference
NPU Design for Diffusion Language Model Inference
-
Chunk scheduling overlaps compute and comms inside one kernel
Syncopate: Efficient Multi-GPU AI Kernels via Automatic Chunk-Centric Compute-Communication Overlap
-
Rotary scheduler raises LLM TTFT SLO rates by 75% on superchips
SuperInfer: SLO-Aware Rotary Scheduling and Memory Management for LLM Inference on Superchips
-
eIDAS 2.0 can evolve to support Self-Sovereign Identity
Self-Sovereign Identity and eIDAS 2.0: An Analysis of Control, Privacy, and Legal Implications
-
Edge AI framework surpasses IPW=1.0 on quantized LLM
QEIL v2: Heterogeneous Computing for Edge Intelligence via Roofline-Derived Pareto-Optimal Energy Modeling and Multi-Objective Orchestration
-
Space-filling curves simplify fast matrix multiplication
Space Filling Curves is All You Need: Communication-Avoiding Matrix Multiplication Made Simple
-
Fitness score ranks IoT subnets in 20 seconds
DeepFedNAS: Efficient Hardware-Aware Architecture Adaptation for Heterogeneous IoT Federations via Pareto-Guided Supernet Training
-
PyTorch library unifies differentiable sparse solvers across backends
torch-sla: Differentiable Sparse Linear Algebra with Adjoint Solvers and Sparse Tensor Parallelism for PyTorch
-
Co-design lets agentic LLMs handle 77% more load at same latency
Sutradhara: An Intelligent Orchestrator-Engine Co-design for Tool-based Agentic Inference
-
Beta metric delivers 96.5% optimal edge AI performance
Mitigating GIL Bottlenecks in Edge AI Systems
-
WISP boosts distributed LLM capacity up to 4.1x
WISP: Waste- and Interference-Suppressed Distributed Speculative LLM Serving at the Edge via Dynamic Drafting and SLO-Aware Batching
-
Oblique projection preserves symmetry in pseudo-Hermitian eigensolves
Chebyshev Accelerated Subspace Eigensolver for Pseudo-hermitian Hamiltonians
-
Multi-GPU framework scales PDHG to massive linear programs
D-PDLP: Scaling PDLP to Distributed Multi-GPU Systems
-
Style transfer and prompts boost federated domain generalization
Multi-Modal Style Transfer-based Prompt Tuning for Efficient Federated Domain Generalization
-
Three-layer memory system lifts distributed AI speed and efficiency
Self-Evolving Distributed Memory Architecture for Scalable AI Systems
-
Self-evolving memory architecture reaches 87% utilization in distributed AI
Self-Evolving Distributed Memory Architecture for Scalable AI Systems
-
Blockchains must share data across chains for complex uses
Exploring Blockchain Interoperability: Frameworks, Use Cases, and Future Challenges
-
Oblivious routing cannot beat √(2k)/4 load on sparse tori
Optimal Oblivious Load-Balancing for Sparse Traffic in Large-Scale Satellite Networks
-
GCP 23% faster on retail POS workloads but Azure 72% cheaper
Cost-Performance Analysis of Cloud-Based Retail Point-of-Sale Systems: A Comparative Study of Google Cloud Platform and Microsoft Azure
-
Consensus protocol secures multi-client data until unanimous agreement
Secure, Verifiable, and Scalable Multi-Client Data Sharing via Consensus-Based Privacy-Preserving Data Distribution