archive
Every paper Pith has read. Search by title, abstract, or pith.
1164 papers in cs.DC · page 5
-
Algorithm learns adversary utilities for sublinear regret in coding game
Learning from Acceptance: Cumulative Regret in the Game of Coding
-
KV-cache movement regularization cuts static-graph LLM latency spikes
KV-RM: Regularizing KV-Cache Movement for Static-Graph LLM Serving
-
Held-out gates catch regressions in LLM Metal kernel search
Metal-Sci: A Scientific Compute Benchmark for Evolutionary LLM Kernel Search on Apple Silicon
-
Small subsets approximate the global ranking median
A Scalable and Unified Framework to Weighted Rank Aggregation
-
Adaptive DNN splits cut energy by 27-36% on real edge-cloud hardware
Adaptive DNN Partitioning and Offloading in Heterogeneous Edge-Cloud Continuum
-
CaMPL type system blocks deadlocks in concurrent code
Categorical Message Passing Language (CaMPL) for programmers
-
Air quality sensors detect cooking at 99.68 percent accuracy on-device
PoHAR: Understanding Hyperlocal Human Activities with Pollution Sensor Networks
-
ATLAS cuts GNN inference time 12-30x for billion-edge graphs
ATLAS: Efficient Out-of-Core Inference for Billion-Scale Graph Neural Networks
-
Multi-metric detection catches all GPU failures in 504-GPU LLM run
From Detection to Recovery: Operational Analysis on LLM Pre-training with 504 GPUs
-
Kernel-level splits let networked MCUs run large CNNs
Split CNN Inference on Networked Microcontrollers
-
DisagMoE overlaps MoE layers for 1.8x training speedup
DisagMoE: Computation-Communication overlapped MoE Training via Disaggregated AF-Pipe Parallelism
-
Split TCB design routes encrypted packets at native speed
Enforcing Attestable Workflows across Untrusted Networks
-
Consistency models collapse into three entangled constraints
Light Cone Consistency: Closure, Ordering, and the Single-Observer Boundary
-
System achieves up to 7.57x faster dynamic multimodal LLM training
MegaScale-Omni: A Hyper-Scale, Workload-Resilient System for MultiModal LLM Training in Production
-
Variance reduction shortens time complexity in parallel optimization
Rennala MVR: Improved Time Complexity for Parallel Stochastic Optimization via Momentum-Based Variance Reduction
-
VAEs recover mixture proportions for personalized federated learning
FedGMI: Generative Model-Driven Federated Learning for Probabilistic Mixture Inference
-
Basic Verkle trees cost more than Merkle trees
TS-Verkle: A TypeScript Native Verkle Library With On-chain Verifier
-
Agent framework cuts data leaks 2-6x while raising accuracy 15-36%
PAAC: Privacy-Aware Agentic Device-Cloud Collaboration
-
Generative model compresses Earth data by up to 10,000 times
Transforming the Use of Earth Observation Data: Exascale Training of a Generative Compression Model with Historical Priors for up to 10,000x Data Reduction
-
LLMs collaborate across devices and cloud to meet resource limits
Large Language Models over Networks: Collaborative Intelligence under Resource Constraints
-
Same code runs in abstract rounds and real sockets for distributed algorithms
QUANTAS 2: An Abstract, Concrete and Byzantine Simulator
-
Concurrent RL fine-tunes match single-task quality at 4.3x efficiency
MARLaaS: Multi-Tenant Asynchronous Reinforcement Learning as a Service
-
Block-level sharding scales context parallelism to 256 GPUs
Unleashing Scalable Context Parallelism for Foundation Models Pre-Training via FCP
-
Asynchronous stages raise agent evolution throughput 3.5x
FlashEvolve: Accelerating Agent Self-Evolution with Asynchronous Stage Orchestration
-
Hybrid head speeds secure time-series inference up to 44x
Private Vertical Federated Inference for Time-Series
-
LLM profiler reuses work across models to cut GPU hours 56%
Dooly: Configuration-Agnostic, Redundancy-Aware Profiling for LLM Inference Simulation
-
Dooly reuses LLM op profiles across configs to cut costs 56%
Dooly: Configuration-Agnostic, Redundancy-Aware Profiling for LLM Inference Simulation
-
FLAM computes exact global performance in federated learning locally
FLAM: Evaluating Model Performance with Aggregatable Measures in Federated Learning
-
Stencil kernels run up to 342x faster on wafer-scale engine
Stencil Computations on Cerebras Wafer-Scale Engine
-
Adaptive tuning keeps decentralized SGD converging under adversary majority
VISTA: Decentralized Machine Learning in Adversary Dominated Environments
-
Model runs 1024-core chip sims 115x faster at under 7% error
Accelerating Precise End-to-End Simulation: Latency-Sensitive Many-core System Modeling
-
175B models trained at 10% peak FLOPs with standard parallelism
A Scalable Recipe on SuperMUC-NG Phase 2: Efficient Large-Scale Training of Language Models
-
Wormhole stencil kernels match CPU speed but lose to transfers
Stencil Computations on Tenstorrent Wormhole
-
HexiSeq trains long-context LLMs 1.36x faster on mixed GPU clusters
HexiSeq: Accommodating Long Context Training of LLMs over Heterogeneous Hardware
-
Hierarchical agents lift AI-RAN SLO fulfillment to 90%
Deadline-Driven Hierarchical Agentic Resource Sharing for AI Services and RAN Functions in AI-RAN
-
RcLLM cuts TTFT 1.31x-9.51x for generative recommendation
RcLLM: Accelerating Generative Recommendation via Beyond-Prefix KV Caching
-
Shared spectral operator aligns mismatched sensors in federated learning
UMEDA: Unified Multi-modal Efficient Data Fusion for Privacy-Preserving Graph Federated Learning via Spectral-Gated Attention and Diffusion-Based Operator Alignment
-
MERBIT speeds irregular SpMV 27 percent on GPUs
MERBIT: A GPU-Based SpMV Method for Iterative Workloads
-
RL weight sync uses 100 times less data with full fidelity
SparseRL-Sync: Lossless Weight Synchronization with ~100x Less Communication
-
TREA accelerator reduces edge detection latency up to 9x
TREA: Low-precision Time-Multiplexed, Resource-Efficient Edge Accelerator for Object Detection and Classification
-
Energy subtraction on paired elements recovers signed OTA aggregates
Resource-Element Energy Difference for Noncoherent Over-the-Air Federated Learning
-
Energy difference on two resources replaces CSI for wireless federated learning
Resource-Element Energy Difference for Noncoherent Over-the-Air Federated Learning
-
Future-state scheduler cuts LLM workflow makespan by 32 percent
FATE: Future-State-Aware Scheduling for Heterogeneous LLM Workflows
-
AI backends gain one admission seam for governance across requests
Execution Envelopes: A Shared Admission Contract for Backend AI Execution Requests
-
Hardware usage metrics match Kripke kernel to RAJA proxy
On Similarity of Computational Kernels in our Codes and Proxies
-
Per-step slack regulator raises LLM goodput 1.77x
Regulating Branch Parallelism in LLM Serving
-
IoT security model gains 30% detection boost with mostly unlabeled data
CLAD: A Clustered Label-Agnostic Federated Learning Framework for Joint Anomaly Detection and Attack Classification
-
Traces reveal LLM setups 3x slower on identical hardware
CCL-Bench 1.0: A Trace-Based Benchmark for LLM Infrastructure
-
Sharing serving GPUs boosts agentic RL throughput 1.3-3.3x
ROSE: Rollout On Serving GPUs via Cooperative Elasticity for Agentic RL
-
Serving GPUs accelerate agentic RL rollouts up to 3.3x
ROSE: Rollout On Serving GPUs via Cooperative Elasticity for Agentic RL