archive
Every paper Pith has read.
1164 papers in cs.DC · page 12
-
Invariants let agents match hand-optimized GPU kernels
ARGUS: Agentic GPU Optimization Guided by Data-Flow Invariants
-
SCENIC hits 200G SmartNIC speed with programmable stream units
SCENIC: Stream Computation-Enhanced SmartNIC
-
Hybrid models let prefill run in a separate datacenter
Prefill-as-a-Service: KVCache of Next-Generation Models Could Go Cross-Datacenter
-
Closed forms give exact multi-NUMA VM counts per host
Efficient calculation of available space for multi-NUMA virtual machines
-
Block placement and cache rules cut LLM serving latency
Serving Chain-structured Jobs with Large Memory Footprints with Application to Large Foundation Model Serving
-
Game equilibria set synthetic data volumes in coopetitive learning
Cooperate to Compete: Strategic Data Generation and Incentivization Framework for Coopetitive Cross-Silo Federated Learning
-
FL compression gains depend on correlation strength
Exploiting Correlations in Federated Learning: Opportunities and Practical Limitations
-
MoE serving gains 6.6x speedup via elastic self-speculation on 3D stacks
ELMoE-3D: Leveraging Intrinsic Elasticity of MoE for Hybrid-Bonding-Enabled Self-Speculative Decoding in On-Premises Serving
-
Direct propagation matches locality lower bound in distributed DP
Locality, Not Spectral Mixing, Governs Direct Propagation in Distributed Offline Dynamic Programming
-
Forkable shared logs let AI agents branch streaming data
AgileLog: A Forkable Shared Log for Agents on Data Streams
-
CoCoDiff speeds up distributed DiT inference 3.6x on average
CoCoDiff: Optimizing Collective Communications for Distributed Diffusion Transformer Inference Under Ulysses Sequence Parallelism
-
Registers achieve log P latency despite contention
Fast Concurrent Primitives Despite Contention
-
PIM hardware speeds R-tree queries up to 3.66x with less energy
Parallel R-tree-based Spatial Query Processing on a Commercial Processing-in-Memory System
-
VQLS cuts circuit count 256x for 10-qubit systems
Distributed Variational Quantum Linear Solver
-
GPU hypergraph partitioner reaches 940x speedup with improved quality
Incidence Constraints in Hypergraph Partitioning on GPU
-
Five themes together build cyber-physical resilience
Digital Guardians: The Past and The Future of Cyber-Physical Resilience
-
Finite withholding beats infinite withholding by unbounded factor in pools
Temporary Power Adjusting Withholding Attack
-
Temporary withholding boosts pool attack rewards 22x over permanent version
Temporary Power Adjusting Withholding Attack
-
Inference tasks replace mining in AI blockchain consensus
HadAgent: Harness-Aware Decentralized Agentic AI Serving with Proof-of-Inference Blockchain Consensus
-
OffloadFS moves database compaction to storage nodes for 3.36x speedup
OffloadFS: Leveraging Disaggregated Storage for Computation Offloading
-
Encrypted face data counts crowds without naming anyone
Head Count: Privacy-Preserving Face-Based Crowd Monitoring
-
Open Ethernet HPC cluster ranks 49th on TOP500
SAKURAONE: An Open Ethernet-Based AI HPC System and Its Observed Workload Dynamics in a Single-Tenant LLM Development Environment
-
Adaptive edge system raises robotics AI service quality
Self-adaptive Multi-Access Edge Architectures: A Robotics Case
-
Distributed servers with MPC cut costs for private vertical federated learning
Secure and Privacy-Preserving Vertical Federated Learning
-
PackSELL packs deltas and values to speed GPU SpMV 1.63x in FP16
PackSELL: A Sparse Matrix Format for Precision-Agnostic High-Performance SpMV
-
Event Tensor abstraction compiles dynamic megakernels
Event Tensor: A Unified Abstraction for Compiling Dynamic Megakernel
-
DySkew cuts UDF skew delays with runtime data swaps
DySkew: Dynamic Data Redistribution for Skew-Resilient Snowpark UDF Execution
-
Academia trains 70B open LLM on Alps supercomputer
An Engineering Journey Training Large Language Models at Scale on Alps: The Apertus Experience
-
Virtual machine speeds array programs 147x on GPUs
Towards a Linear-Algebraic Hypervisor
-
EPAC RISC-V chip with three tiles taped out in 22nm
EPAC: The Last Dance
-
ML ensemble cuts CI memory waste by 36 GB per build
Intelligent resource prediction for SAP HANA continuous integration build workloads
-
Hybrid platform extends supercomputers to full AI model lifecycle
Beyond Pre-Training: The Full Lifecycle of Foundation Models on HPC Systems
-
The paper proposes pAirZero, a framework combining zeroth-order optimization and…
Three Birds, One Stone: Solving the Communication-Memory-Privacy Trilemma in LLM Fine-tuning Over Wireless Networks with Zeroth-Order Optimization
-
Local routing plus compression cuts cloud LLM tokens 45-79%
Local-Splitter: A Measurement Study of Seven Tactics for Reducing Cloud LLM Token Usage on Coding-Agent Workloads
-
MARS cuts agentic latency by 5.94x via co-scheduling
MARS: Efficient, Adaptive Co-Scheduling for Heterogeneous Agentic Systems
-
Compiler cuts NPU transformer energy use by up to 41%
Forge-UGC: FX optimization and register-graph engine for universal graph compiler
-
Lévy jumps fix trapping in decentralized random walks
Decentralized Learning via Random Walk with Jumps
-
Periodic framework organizes distributed computing
A Periodic Space of Distributed Computing: Vision & Framework
-
Physics-informed DLinear forecasts AI data center power more accurately
A Physics-Aware Framework for Short-Term GPU Power Forecasting of AI Data Centers
-
BlazingAML matches AML accuracy at 210x CPU speed
BlazingAML: High-Throughput Anti-Money Laundering (AML) via Multi-Stage Graph Mining
-
Live pipeline changes cut LLM first-token time by 2.5x
PipeLive: Efficient Live In-place Pipeline Parallelism Reconfiguration for Dynamic LLM Serving
-
Reference-based replication creates AI agents in constant time
Aethon: A Reference-Based Replication Primitive for Constant-Time Instantiation of Stateful AI Agents
-
StableHLO unifies ML performance modeling across GPUs and TPUs
Evaluating Cross-Architecture Performance Modeling of Distributed ML Workloads Using StableHLO
-
Pipelined Parareal on GPUs speeds microswimmer simulations
Accelerating Microswimmer Simulations via a Heterogeneous Pipelined Parallel-in-Time Framework
-
Bayesian Noisy-OR model cuts failure detection time by 60%
Predictive Bayesian Arbitration: A Scalable Noisy-OR Model with Service Criticality Awareness
-
Remote Git service delivers monorepo checkouts in under a second
GitFarm: Git as a Service for Large-Scale Monorepos
-
Visual analytics clusters HPC nodes to expose behavioral differences
Understanding Large-Scale HPC System Behavior Through Cluster-Based Visual Analytics
-
Residual bottlenecks deliver 128x activation compression for pipelines
ResBM: Residual Bottleneck Models for Low-Bandwidth Pipeline Parallelism
-
Nanvix cuts serverless server needs by 20-100x
Nanvix: A Multikernel OS Design for High-Density Serverless Deployments
-
Sparse FHE matmul on GPUs runs up to 3x faster than CPU
GPU Acceleration of Sparse Fully Homomorphic Encrypted DNNs