Tessera performs kernel-granularity disaggregation on heterogeneous GPUs, achieving up to 2.3x throughput and 1.6x cost efficiency gains for large model inference while generalizing beyond prior methods.
hub Canonical reference
Splitwise: Efficient generative llm inference using phase splitting
Canonical reference. 100% of citing Pith papers cite this work as background.
hub tools
citation-role summary
citation-polarity summary
years
2026 28roles
background 10polarities
background 10representative citing papers
TCM finds provably optimal DNN accelerator mappings by pruning the search space up to 32 orders of magnitude with a new dataplacement concept, delivering 1.2-6.5x better energy-delay-product in 17 seconds instead of hours.
AtomTwin.jl is a physics-native Julia framework for simulating neutral-atom quantum processors, with a demonstration of logical Bell state preparation using four ytterbium-171 atoms in movable tweezers.
Fleet adds a Chiplet-task level to GPU task models, enabling per-chiplet scheduling and cooperative cache reuse in persistent megakernels, yielding 1.3-1.5x lower LLM decode latency and up to 37% less HBM traffic on AMD MI350 hardware.
KOVAL-Q uses SAT solving to optimize and verify surface-code logical operations with general encodings, finding d-cycle CNOTs and 2d-cycle rotations that reduce FTQC application runtime by about 10 percent.
InfiniLoRA decouples LoRA execution from base-model inference and reports 3.05x higher request throughput plus 54% more adapters meeting strict latency SLOs.
NestPipe achieves up to 3.06x speedup and 94.07% scaling efficiency on 1,536 workers via dual-buffer inter-batch and frozen-window intra-batch pipelining that overlaps communication with computation.
Qurator jointly optimizes queue time and fidelity for hybrid quantum-classical workflows across providers using quantum-aware DAG scheduling and a unified logarithmic fidelity score, achieving 30-75% wait reduction at high load with bounded accuracy cost.
Dynamic replication of predicted overloaded experts in MoE models achieves near-100% GPU utilization and up to 3x faster inference while retaining 90-95% of baseline performance.
KV-RM regularizes KV-cache movement in static-graph LLM serving via block paging and merge-staged transport to improve throughput, tail latency, and memory use for variable-length decoding.
SPECTRE achieves up to 2.28x speedup for large-model LLM serving by running speculative draft generation and target verification in parallel using idle tail-model services.
A new taxonomy for dynamics-aware microservice management, synthesized from 84 systems, finds that production dynamics are often only partially modeled and that reported performance gains depend on evaluation realism.
LEO performs cross-vendor backward slicing from stalled GPU instructions to attribute root causes to source code, enabling optimizations that produce geometric-mean speedups of 1.73-1.82x on 21 workloads.
Proxics introduces lightweight virtual processors and low-latency communication channels as portable OS abstractions for programming near-data processing accelerators, demonstrated on real hardware for memory-intensive workloads.
KAIROS reduces power by 27% on average (up to 39.8%) for agentic AI inference by using long-lived context to jointly manage GPU frequency, concurrency, and request routing across instances.
O3LS reduces space overhead by up to 46.7% and time overhead by up to 36% in lattice surgery while suppressing logical error rates by up to an order of magnitude compared with prior layout and scheduling approaches.
ELMoE-3D achieves 6.6x average speedup and 4.4x energy efficiency gain for MoE serving on 3D hardware by scaling expert and bit elasticity for elastic self-speculative decoding.
WaveTune introduces a wave-aware bilinear latency predictor and wave-structured sparse sampling to enable fast runtime auto-tuning of GPU kernels, achieving up to 1.83x kernel speedup and 1.33x TTFT reduction with drastically lower overhead.
Profile-guided opcode labeling removes consistently independent loads from the MDP working set, cutting queries 79%, false dependencies 77%, and raising small-core IPC 1.47% on SPEC2017 intspeed.
Blink enables CPU-free LLM inference via SmartNIC offload and persistent GPU kernel, delivering up to 8.47x lower P99 TTFT, 3.4x lower P99 TPOT, 2.1x higher decode throughput, and 48.6% lower energy per token while remaining stable under CPU interference.
CoGPU resolves the tradeoff in GPU sharing by introducing GPU coroutines for semantic-preserving resource migration, delivering up to 79.2% higher training throughput and zero token mismatch in inference.
DCGen generates customizable datacenter configurations with IT, power, and cooling components optimized for power, compute, and area targets using real equipment catalogs and workload-specific IT mixes.
PipeWeave predicts GPU kernel performance with 6.1% average error and end-to-end inference with 8.5% error by feeding analytical pipeline features into ML, cutting prior method errors by 4-7x across 11 GPUs.
EnergAIzer predicts module-level GPU utilization from structured kernel patterns and feeds it into a power model to estimate dynamic power with 8% error on Ampere GPUs and 7% on H100 forecasts.
citing papers explorer
-
Tessera: Unlocking Heterogeneous GPUs through Kernel-Granularity Disaggregation
Tessera performs kernel-granularity disaggregation on heterogeneous GPUs, achieving up to 2.3x throughput and 1.6x cost efficiency gains for large model inference while generalizing beyond prior methods.
-
The Turbo-Charged Mapper: Fast and Optimal Mapping for Energy-efficient and Low-latency Accelerator Design
TCM finds provably optimal DNN accelerator mappings by pruning the search space up to 32 orders of magnitude with a new dataplacement concept, delivering 1.2-6.5x better energy-delay-product in 17 seconds instead of hours.
-
AtomTwin.jl: a physics-native digital twin framework for neutral-atom quantum processors
AtomTwin.jl is a physics-native Julia framework for simulating neutral-atom quantum processors, with a demonstration of logical Bell state preparation using four ytterbium-171 atoms in movable tweezers.
-
Fleet: Hierarchical Task-based Abstraction for Megakernels on Multi-Die GPUs
Fleet adds a Chiplet-task level to GPU task models, enabling per-chiplet scheduling and cooperative cache reuse in persistent megakernels, yielding 1.3-1.5x lower LLM decode latency and up to 37% less HBM traffic on AMD MI350 hardware.
-
Design automation and space-time reduction for surface-code logical operations using a SAT-based EDA kernel compatible with general encodings
KOVAL-Q uses SAT solving to optimize and verify surface-code logical operations with general encodings, finding d-cycle CNOTs and 2d-cycle rotations that reduce FTQC application runtime by about 10 percent.
-
InfiniLoRA: Disaggregated Multi-LoRA Serving for Large Language Models
InfiniLoRA decouples LoRA execution from base-model inference and reports 3.05x higher request throughput plus 54% more adapters meeting strict latency SLOs.
-
NestPipe: Large-Scale Recommendation Training on 1,500+ Accelerators via Nested Pipelining
NestPipe achieves up to 3.06x speedup and 94.07% scaling efficiency on 1,536 workers via dual-buffer inter-batch and frozen-window intra-batch pipelining that overlaps communication with computation.
-
Qurator: Scheduling Hybrid Quantum-Classical Workflows Across Heterogeneous Cloud Providers
Qurator jointly optimizes queue time and fidelity for hybrid quantum-classical workflows across providers using quantum-aware DAG scheduling and a unified logarithmic fidelity score, achieving 30-75% wait reduction at high load with bounded accuracy cost.
-
Fast MoE Inference via Predictive Prefetching and Expert Replication
Dynamic replication of predicted overloaded experts in MoE models achieves near-100% GPU utilization and up to 3x faster inference while retaining 90-95% of baseline performance.
-
KV-RM: Regularizing KV-Cache Movement for Static-Graph LLM Serving
KV-RM regularizes KV-cache movement in static-graph LLM serving via block paging and merge-staged transport to improve throughput, tail latency, and memory use for variable-length decoding.
-
SPECTRE: Hybrid Ordinary-Parallel Speculative Serving for Resource-Efficient LLM Inference
SPECTRE achieves up to 2.28x speedup for large-model LLM serving by running speculative draft generation and target verification in parallel using idle tail-model services.
-
Adaptive Management of Microservices in Dynamic Computing Environments: A Taxonomy and Future Directions
A new taxonomy for dynamics-aware microservice management, synthesized from 84 systems, finds that production dynamics are often only partially modeled and that reported performance gains depend on evaluation realism.
-
LEO: Tracing GPU Stall Root Causes via Cross-Vendor Backward Slicing
LEO performs cross-vendor backward slicing from stalled GPU instructions to attribute root causes to source code, enabling optimizations that produce geometric-mean speedups of 1.73-1.82x on 21 workloads.
-
Proxics: an efficient programming model for far memory accelerators
Proxics introduces lightweight virtual processors and low-latency communication channels as portable OS abstractions for programming near-data processing accelerators, demonstrated on real hardware for memory-intensive workloads.
-
KAIROS: Stateful, Context-Aware Power-Efficient Agentic Inference Serving
KAIROS reduces power by 27% on average (up to 39.8%) for agentic AI inference by using long-lived context to jointly manage GPU frequency, concurrency, and request routing across instances.
-
O3LS: Optimizing Lattice Surgery via Automatic Layout Searching and Loose Scheduling
O3LS reduces space overhead by up to 46.7% and time overhead by up to 36% in lattice surgery while suppressing logical error rates by up to an order of magnitude compared with prior layout and scheduling approaches.
-
ELMoE-3D: Leveraging Intrinsic Elasticity of MoE for Hybrid-Bonding-Enabled Self-Speculative Decoding in On-Premises Serving
ELMoE-3D achieves 6.6x average speedup and 4.4x energy efficiency gain for MoE serving on 3D hardware by scaling expert and bit elasticity for elastic self-speculative decoding.
-
WaveTune: Wave-aware Bilinear Modeling for Efficient GPU Kernel Auto-tuning
WaveTune introduces a wave-aware bilinear latency predictor and wave-structured sparse sampling to enable fast runtime auto-tuning of GPU kernels, achieving up to 1.83x kernel speedup and 1.33x TTFT reduction with drastically lower overhead.
-
PG-MDP: Profile-Guided Memory Dependence Prediction for Area-Constrained Cores
Profile-guided opcode labeling removes consistently independent loads from the MDP working set, cutting queries 79%, false dependencies 77%, and raising small-core IPC 1.47% on SPEC2017 intspeed.
-
Blink: CPU-Free LLM Inference by Delegating the Serving Stack to GPU and SmartNIC
Blink enables CPU-free LLM inference via SmartNIC offload and persistent GPU kernel, delivering up to 8.47x lower P99 TTFT, 3.4x lower P99 TPOT, 2.1x higher decode throughput, and 48.6% lower energy per token while remaining stable under CPU interference.
-
Performance Isolation and Semantic Determinism in Efficient GPU Spatial Sharing
CoGPU resolves the tradeoff in GPU sharing by introducing GPU coroutines for semantic-preserving resource migration, delivering up to 79.2% higher training throughput and zero token mismatch in inference.
-
DCGen 1.1 Technical Report: Generating Datacenter Configurations (including IT, Power, Cooling)
DCGen generates customizable datacenter configurations with IT, power, and cooling components optimized for power, compute, and area targets using real equipment catalogs and workload-specific IT mixes.
-
PipeWeave: Synergizing Analytical and Learning Models for Unified GPU Performance Prediction
PipeWeave predicts GPU kernel performance with 6.1% average error and end-to-end inference with 8.5% error by feeding analytical pipeline features into ML, cutting prior method errors by 4-7x across 11 GPUs.
-
EnergAIzer: Fast and Accurate GPU Power Estimation Framework for AI Workloads
EnergAIzer predicts module-level GPU utilization from structured kernel patterns and feeds it into a power model to estimate dynamic power with 8% error on Ampere GPUs and 7% on H100 forecasts.
-
Compiler Framework for Directional Transport in Zoned Neutral Atom Systems with AOD Assistance: A Hybrid Remote CZ Approach
A hybrid DT-AOD compiler framework enables faster remote CZ gates in neutral atom systems by transporting Rydberg excitations directionally along resettable ancilla paths.
-
FILCO: Flexible Composing Architecture with Real-Time Reconfigurability for DNN Acceleration
FILCO introduces a real-time reconfigurable composing architecture for DNN acceleration that achieves 1.3x-5x better throughput and hardware efficiency than prior designs on diverse workloads via an analytical model and two-stage design space exploration.
-
Measurement of Generative AI Workload Power Profiles for Whole-Facility Data Center Infrastructure Planning
High-resolution power profiles for AI workloads on H100 GPUs are measured and scaled to whole-facility energy demand using a bottom-up model, with the dataset made public.
-
Computing In Spintronic Memory: A Thermal Perspective
Spintronic CiM shows uniform temperature that increases linearly with participating memory cells and decreases linearly with array size.