Tessera performs kernel-granularity disaggregation on heterogeneous GPUs, achieving up to 2.3x throughput and 1.6x cost efficiency gains for large model inference while generalizing beyond prior methods.
Canonical reference
Tender: Accelerating large language models via tensor decomposition and runtime requantization,
Canonical reference. 87% of citing Pith papers cite this work as background.
citation-role summary
citation-polarity summary
representative citing papers
TCM finds provably optimal DNN accelerator mappings by pruning the search space up to 32 orders of magnitude with a new dataplacement concept, delivering 1.2-6.5x better energy-delay-product in 17 seconds instead of hours.
HMA-Serve enables efficient cross-vendor disaggregated LLM serving on memory-heterogeneous accelerators via phase-wise quantization, compute-transfer pipelining, and deferred dequantization, delivering up to 3.2x goodput and 4.8x goodput-per-dollar.
LENS predicts NPU LLM inference latency with 2.15% mean error by profiling each bucket with two E2E measurements and composing results to capture bucketing non-linearity.
Introduces three linearizable GPU concurrent queues: an adapted wait-free queue using segments, a bounded lock-free queue with wave-batched paths, and a bounded wait-free queue using 64-bit CAS operations.
AVMP separates KV and SSM cache pools behind unified virtual addressing with failure-triggered migration, cutting OOM events 7.6% and raising throughput 1.83-13.3x on synthetic loads and 2.36x on ShareGPT traces.
ITHICA generates functional tests via intra-thread instruction duplication and comparison, detecting 39% more defective servers than baseline methods on over 3000 real CPUs while revealing new defect behaviors.
AtomTwin.jl is a physics-native Julia framework for simulating neutral-atom quantum processors, with a demonstration of logical Bell state preparation using four ytterbium-171 atoms in movable tweezers.
Fleet adds a Chiplet-task level to GPU task models, enabling per-chiplet scheduling and cooperative cache reuse in persistent megakernels, yielding 1.3-1.5x lower LLM decode latency and up to 37% less HBM traffic on AMD MI350 hardware.
KOVAL-Q uses SAT solving to optimize and verify surface-code logical operations with general encodings, finding d-cycle CNOTs and 2d-cycle rotations that reduce FTQC application runtime by about 10 percent.
InfiniLoRA decouples LoRA execution from base-model inference and reports 3.05x higher request throughput plus 54% more adapters meeting strict latency SLOs.
NestPipe achieves up to 3.06x speedup and 94.07% scaling efficiency on 1,536 workers via dual-buffer inter-batch and frozen-window intra-batch pipelining that overlaps communication with computation.
Qurator jointly optimizes queue time and fidelity for hybrid quantum-classical workflows across providers using quantum-aware DAG scheduling and a unified logarithmic fidelity score, achieving 30-75% wait reduction at high load with bounded accuracy cost.
PrefixWall mitigates APC side channels in multi-tenant LLM systems via selective prefix isolation, delivering up to 70% higher cache reuse and 30% lower latency than full-isolation baselines.
GreenCache dynamically manages LLM KV cache resources to reduce carbon emissions by 15.1% on average (up to 25.3%) while meeting latency constraints for over 90% of requests on real traces.
DiLaServe improves SLO attainment for diffusion language models by up to 56.6 percentage points and reduces latency by up to 46% with less than 1% accuracy drop via deadline-aware scheduling and dynamic reconfiguration.
Kamera stores a low-rank patch with each position-free KV chunk to restore cross-chunk conditioning lost in naive reuse, enabling cheap reordering, sliding windows, and recall across attention mechanisms.
A graph-theoretic nonlinear integer program solved via genetic algorithm reduces qubit transfers in neutral atom quantum circuit compilation compared to prior zoned-architecture compilers.
ANNS-AMP adapts distance-computation precision to vector-space regions via a lightweight cluster-level predictor and a bit-serial accelerator, delivering 163.76x/10.57x/2.06x average speedups and 1100x/39.41x/6.66x energy reductions versus CPU/GPU/custom baselines with <2.7% accuracy loss.
CRAM-ER combines spintronic computational RAM with CMOS adder trees and software fine-tuning to deliver near-lossless DNN accuracy at up to 100x lower latency than CPU/GPU baselines.
ThriftAttention recovers 89.1% of the FP16 quality gap versus pure FP4 attention by running only 5% of query-key blocks in FP16 on long-context benchmarks.
NasZip delivers up to 8.4x speedup over CPU baselines and 1.69x over prior NDP accelerators for ANNS by combining near-data processing with statistics-based PCA early exiting, dynamic-float encoding, and data-aware neighbor mapping.
Develops a simulation framework showing multi-resource stranding changes deployable capacity and effective costs in AI datacenters, arguing the key metric is deployable capacity over time rather than installed megawatts.
Dynamic replication of predicted overloaded experts in MoE models achieves near-100% GPU utilization and up to 3x faster inference while retaining 90-95% of baseline performance.
citing papers explorer
-
The Turbo-Charged Mapper: Fast and Optimal Mapping for Energy-efficient and Low-latency Accelerator Design
TCM finds provably optimal DNN accelerator mappings by pruning the search space up to 32 orders of magnitude with a new dataplacement concept, delivering 1.2-6.5x better energy-delay-product in 17 seconds instead of hours.
-
HBM Is Not All You Need: Efficient Disaggregated LLM Serving across Memory-heterogeneous Accelerators
HMA-Serve enables efficient cross-vendor disaggregated LLM serving on memory-heterogeneous accelerators via phase-wise quantization, compute-transfer pipelining, and deferred dequantization, delivering up to 3.2x goodput and 4.8x goodput-per-dollar.
-
ITHICA: Intra-Thread Instruction Checking Approach for Defect-Induced Silent Data Corruptions
ITHICA generates functional tests via intra-thread instruction duplication and comparison, detecting 39% more defective servers than baseline methods on over 3000 real CPUs while revealing new defect behaviors.
-
Fleet: Hierarchical Task-based Abstraction for Megakernels on Multi-Die GPUs
Fleet adds a Chiplet-task level to GPU task models, enabling per-chiplet scheduling and cooperative cache reuse in persistent megakernels, yielding 1.3-1.5x lower LLM decode latency and up to 37% less HBM traffic on AMD MI350 hardware.
-
CRAM-ER: Error-Resilient Spintronic Computational Random Access Memory for Scalable In-Memory Computation
CRAM-ER combines spintronic computational RAM with CMOS adder trees and software fine-tuning to deliver near-lossless DNN accuracy at up to 100x lower latency than CPU/GPU baselines.
-
NasZip: Software and Hardware Co-Design to Accelerate Approximate Nearest Neighbor Search with DIMM-Based Near-Data Processing
NasZip delivers up to 8.4x speedup over CPU baselines and 1.69x over prior NDP accelerators for ANNS by combining near-data processing with statistics-based PCA early exiting, dynamic-float encoding, and data-aware neighbor mapping.
-
Cassandra: Enabling Reasoning LLMs at Edge via Self-Speculative Decoding
Cassandra is a self-speculative decoding system that builds a draft model via fine-grained data selection and optimized pruning/mantissa truncation, achieving up to 2.41x speedup over BF16 and 1.81x more tokens than Eagle-3 on Llama 3 8B without training.
-
Co-Designing Graph-based Approximate Nearest Neighbor Search at Billion Scale for Processing-in-Memory
Co-design of 14.5x compacted index, asynchronous scheduler, and multiplication-free kernel for PIM-based graph ANNS delivers up to 20x CPU and 17.1x GPU throughput on billion-scale benchmarks.
-
EVA: Accelerating LLM Decoding via an Efficient Vector Quantization Architecture
EVA is a vector-quantization hardware architecture that transforms LLM decoding from GEMV to GEMM via direct codebook dot products and conflict-free output buffering, claiming up to 11.17x speedup over prior lookup designs.
-
KV-RM: Regularizing KV-Cache Movement for Static-Graph LLM Serving
KV-RM regularizes KV-cache movement via block paging and coalesced transfers to improve throughput, tail latency, and memory efficiency in static-graph LLM serving without changing the decoder interface.
-
EnergAIzer: Fast and Accurate GPU Power Estimation Framework for AI Workloads
EnergAIzer predicts module-level GPU utilization from structured kernel patterns and feeds it into a power model to estimate dynamic power with 8% error on Ampere GPUs and 7% on H100 forecasts.
-
FILCO: Flexible Composing Architecture with Real-Time Reconfigurability for DNN Acceleration
FILCO introduces a real-time reconfigurable composing architecture for DNN acceleration that achieves 1.3x-5x better throughput and hardware efficiency than prior designs on diverse workloads via an analytical model and two-stage design space exploration.