archive
Every paper Pith has read.
1164 papers in cs.DC · page 13
-
Decoupled matrix units deliver up to 2.31x AI speedups on CPUs
CUTEv2: Unified and Configurable Matrix Extension for Diverse CPU Architectures with Minimal Design Overhead
-
Self-calibrating digital twin reaches 4.39% MAPE on datacenter predictions
OpenDT: Exploring Datacenter Performance and Sustainability with a Self-Calibrating Digital Twin
-
HPC fabrics show distinct congestion under AI-like bursts
Characterizing the Impact of Congestion in Modern HPC Interconnects
-
Pipeline compresses federated models over 11 times for 60% faster training
A Full Compression Pipeline for Green Federated Learning in Communication-Constrained Environments
-
Hierarchical search tunes GPU apps better and faster
Record-Remix-Replay: Hierarchical GPU Kernel Optimization using Evolutionary Search
-
Proactive DQN scaling outperforms reactive Kubernetes autoscalers
NimbusGuard: A Novel Framework for Proactive Kubernetes Autoscaling Using Deep Q-Networks
-
Scheduler runs multiple quantum jobs in parallel on linked QPUs
QuMod: Parallel Quantum Job Scheduling on Modular QPUs using Circuit Cutting
-
Different GPU splits across LLMs change quality by 87% at fixed latency
RouterWise: Joint Resource Allocation and Routing for Latency-Aware Multi-Model LLM Serving
-
Hybrid backend speeds cross-silo FL up to 3.8x for large models
Understanding Communication Backends in Cross-Silo Federated Learning
-
AI workload mix smooths power variability but keeps fast ramps
Workload composition smooths aggregate power demand while sustaining short-horizon ramps in AI data centers
-
Thinning to degree two extends data center stability region
Bipartite matching under communication constraints
-
Protocol hides verifier claim choices from holders
COD-ssi: Enforcing Mutual Privacy for Credential Oblivious Disclosure in Self Sovereign Identity
-
Stackelberg game optimizes incentives and privacy noise in federated learning
FEDBUD: Joint Incentive and Privacy Optimization for Resource-Constrained Federated Learning
-
One CIR image deploys on any platform after lazy build
CIR: Lightweight Container Image for Cross-Platform Deployment
-
LLMs derive exact GPU thread maps that cut energy use up to 4833x
Leveraging Mathematical Reasoning of LLMs for Efficient GPU Thread Mapping
-
Icicle indexes billion-file HPC systems in real time
Icicle: Scalable Metadata Indexing and Real-Time Monitoring for HPC File Systems
-
INCGuard verifies in-network computing for packet-loss risks
Verifying In-Network Computing Systems for Design Risks
-
Deep unrolling turns SP routines into reusable RF sensing blocks
RF-LEGO: Modularized Signal Processing-Deep Learning Co-Design for RF Sensing via Deep Unrolling
-
Sparse measurements predict latency at every CPU-GPU frequency
Taming Asynchronous CPU-GPU Coupling for Frequency-aware Latency Estimation on Mobile Edge
-
Kernel disaggregation lifts heterogeneous GPU throughput by 2.3x
Tessera: Unlocking Heterogeneous GPUs through Kernel-Granularity Disaggregation
-
FlexVector speeds GCN inference 3.78x with flexible registers
FlexVector: A SpMM Vector Processor with Flexible VRF for GCNs on Varying-Sparsity Graphs
-
Local adaptive steps multiply comms savings in decentralized training
LoDAdaC: a unified local training-based decentralized framework with adaptive gradients and compressed communication
-
Microkernel validation eliminates harm from agent restarts
Rebooting Microreboot: Architectural Support for Safe, Parallel Recovery in Microservice Systems
-
System choices scale HPL to 1.01 EF/s FP64 with 11.5x mixed precision gain
Sustaining Exascale Performance: Lessons from HPL and HPL-MxP on Aurora
-
Lone attackers poison federated learning models
XFED: Non-Collusive Model Poisoning Attack Against Byzantine-Robust Federated Classifiers
-
NOMAD speeds up massive graph embeddings by 10-100x on CPU clusters
NOMAD: Generating Embeddings for Massive Distributed Graphs
-
Adaptive layer resolves LLM scaling paradox on NPUs
A-IO: Adaptive Inference Orchestration for Memory-Bound NPUs
-
MATCHA cuts DNN inference latency up to 35% on heterogeneous edge SoCs
MATCHA: Efficient Deployment of Deep Neural Networks on Multi-Accelerator Heterogeneous Edge SoCs
-
Reference storage cuts LLM RL rollout stalls up to 19x
TensorHub: Scalable and Elastic Weight Transfer for LLM RL Training
-
Adaptive quantization cuts mobile LLM cold starts by 4x
EdgeFlow: Fast Cold Starts for LLMs on Mobile Devices
-
Right GPU cuts LLM energy use by 70% in servers
Watt Counts: Energy-Aware Benchmark for Sustainable LLM Inference on Heterogeneous GPU Architectures
-
DAG consensus protocol lifts CFT throughput in wide-area nets
Finding Nemo-Nemo: CFT DAG-based Consensus in the WAN
-
Method scales sensor optimization to billion-DOF tsunami models on GPUs
Sensor Placement for Tsunami Early Warning via Large-Scale Bayesian Optimal Experimental Design
-
CPU offload over NVLink-C2C fixes rigid GPU slice mismatches
Taming GPU Underutilization via Static Partitioning and Fine-grained CPU Offloading
-
Neural bandits learn better Kubernetes control-plane placements
NL-CPS: Reinforcement Learning-Based Kubernetes Control Plane Placement in Multi-Region Clusters
-
GPU HyperBall scales visibility graphs to 236k cells in 137 seconds
City-Scale Visibility Graph Analysis via GPU-Accelerated HyperBall
-
Causality arguments hold for quantum distributed snapshots
Asynchronous Quantum Distributed Computing: Causality, Snapshots, and Global Operations
-
Joint algorithm minimizes weighted coflow time across OCS cores
Scheduling Coflows in Multi-Core OCS Networks with Performance Guarantee
-
Speculative trees grow only when they cut inference time
SMART: When is it Actually Worth Expanding a Speculative Tree?
-
Energy-efficient GPUs deliver better value under budget limits
Wattlytics: A Web Platform for Co-Optimizing Performance, Energy, and TCO in HPC Clusters
-
Decomposed diffusion workflows handle 3x more requests
LegoDiffusion: Micro-Serving Text-to-Image Diffusion Workflows
-
Shared log makes LLM agent actions visible and stoppable
LogAct: Enabling Agentic Reliability via Shared Logs
-
Beam speculation yields 1.4x LLM agent speedup on edge
B-PASTE: Beam-Aware Pattern-Guided Speculative Execution for Resource-Constrained LLM Agents
-
Decentralized edge agents lift mobile task success 21.7%
Administrative Decentralization in Edge-Cloud Multi-Agent for Mobile Automation
-
Integrated panels give orbital AI 100 kW per ton
Reduced-Mass Orbital AI Inference via Integrated Solar, Compute, and Radiator Panels
-
No single config optimizes all goals in edge speculative LLM serving
ConfigSpec: Profiling-Based Configuration Selection for Distributed Edge-Cloud Speculative LLM Serving
-
CPU-free LLM serving cuts P99 latency up to 8x
Blink: CPU-Free LLM Inference by Delegating the Serving Stack to GPU and SmartNIC
-
Bonded identities and delay randomness fix MEV ordering
MEV-ACE: Identity-Authenticated Fair Ordering for Proposer-Controlled MEV Mitigation
-
Batch algorithm updates maximal independent set in O(b log^3 n) work
Parallel Batch-Dynamic Maximal Independent Set
-
AI workload power data scales to full data center energy profiles
Measurement of Generative AI Workload Power Profiles for Whole-Facility Data Center Infrastructure Planning