Canonical reference

Tender: Accelerating large language models via tensor decomposition and runtime requantization

· 2025 · arXiv 9077.2024

Canonical reference. 87% of classified citations from Pith papers cite this work as background.

57 Pith papers citing it

Background 87% of classified citations


citation-role summary

background: 13 · baseline: 1 · dataset: 1

citation-polarity summary

background: 13 · baseline: 1 · use dataset: 1

representative citing papers

Tessera: Unlocking Heterogeneous GPUs through Kernel-Granularity Disaggregation

cs.DC · 2026-04-11 · unverdicted · novelty 8.0

Tessera performs kernel-granularity disaggregation on heterogeneous GPUs, achieving up to 2.3x throughput and 1.6x cost efficiency gains for large model inference while generalizing beyond prior methods.

The Turbo-Charged Mapper: Fast and Optimal Mapping for Energy-efficient and Low-latency Accelerator Design

cs.AR · 2026-02-16 · unverdicted · novelty 8.0

TCM finds provably optimal DNN accelerator mappings by pruning the search space up to 32 orders of magnitude with a new dataplacement concept, delivering 1.2-6.5x better energy-delay-product in 17 seconds instead of hours.

HBM Is Not All You Need: Efficient Disaggregated LLM Serving across Memory-heterogeneous Accelerators

cs.AR · 2026-06-29 · unverdicted · novelty 7.0

HMA-Serve enables efficient cross-vendor disaggregated LLM serving on memory-heterogeneous accelerators via phase-wise quantization, compute-transfer pipelining, and deferred dequantization, delivering up to 3.2x goodput and 4.8x goodput-per-dollar.

Latency Prediction for LLM Inference on NPU Systems

cs.DC · 2026-06-16 · unverdicted · novelty 7.0

LENS predicts NPU LLM inference latency with 2.15% mean error by profiling each bucket with two E2E measurements and composing results to capture bucketing non-linearity.

Scalable Concurrent Queues for GPU

cs.DC · 2026-06-01 · unverdicted · novelty 7.0

Introduces three linearizable GPU concurrent queues: an adapted wait-free queue using segments, a bounded lock-free queue with wave-batched paths, and a bounded wait-free queue using 64-bit CAS operations.

Asymmetric Virtual Memory Paging for Hybrid Mamba-Transformer Inference

cs.LG · 2026-05-21 · unverdicted · novelty 7.0

AVMP separates KV and SSM cache pools behind unified virtual addressing with failure-triggered migration, cutting OOM events 7.6% and raising throughput 1.83-13.3x on synthetic loads and 2.36x on ShareGPT traces.

ITHICA: Intra-Thread Instruction Checking Approach for Defect-Induced Silent Data Corruptions

cs.AR · 2026-05-15 · unverdicted · novelty 7.0

ITHICA generates functional tests via intra-thread instruction duplication and comparison, detecting 39% more defective servers than baseline methods on over 3000 real CPUs while revealing new defect behaviors.

AtomTwin.jl: a physics-native digital twin framework for neutral-atom quantum processors

quant-ph · 2026-04-20 · unverdicted · novelty 7.0

AtomTwin.jl is a physics-native Julia framework for simulating neutral-atom quantum processors, with a demonstration of logical Bell state preparation using four ytterbium-171 atoms in movable tweezers.

Fleet: Hierarchical Task-based Abstraction for Megakernels on Multi-Die GPUs

cs.AR · 2026-04-15 · unverdicted · novelty 7.0

Fleet adds a Chiplet-task level to GPU task models, enabling per-chiplet scheduling and cooperative cache reuse in persistent megakernels, yielding 1.3-1.5x lower LLM decode latency and up to 37% less HBM traffic on AMD MI350 hardware.

Design automation and space-time reduction for surface-code logical operations using a SAT-based EDA kernel compatible with general encodings

quant-ph · 2026-04-14 · unverdicted · novelty 7.0

KOVAL-Q uses SAT solving to optimize and verify surface-code logical operations with general encodings, finding d-cycle CNOTs and 2d-cycle rotations that reduce FTQC application runtime by about 10 percent.

InfiniLoRA: Disaggregated Multi-LoRA Serving for Large Language Models

cs.DC · 2026-04-08 · unverdicted · novelty 7.0

InfiniLoRA decouples LoRA execution from base-model inference and reports 3.05x higher request throughput plus 54% more adapters meeting strict latency SLOs.

NestPipe: Large-Scale Recommendation Training on 1,500+ Accelerators via Nested Pipelining

cs.DC · 2026-04-08 · unverdicted · novelty 7.0

NestPipe achieves up to 3.06x speedup and 94.07% scaling efficiency on 1,536 workers via dual-buffer inter-batch and frozen-window intra-batch pipelining that overlaps communication with computation.

Qurator: Scheduling Hybrid Quantum-Classical Workflows Across Heterogeneous Cloud Providers

quant-ph · 2026-04-07 · unverdicted · novelty 7.0

Qurator jointly optimizes queue time and fidelity for hybrid quantum-classical workflows across providers using quantum-aware DAG scheduling and a unified logarithmic fidelity score, achieving 30-75% wait reduction at high load with bounded accuracy cost.

PrefixWall: Mitigating Prefix Caching Side Channels in Shared LLM Systems

cs.CR · 2026-03-11 · unverdicted · novelty 7.0

PrefixWall mitigates APC side channels in multi-tenant LLM systems via selective prefix isolation, delivering up to 70% higher cache reuse and 30% lower latency than full-isolation baselines.

Cache Your Prompt When It's Green: Carbon-Aware Caching for Large Language Model Serving

cs.DC · 2025-05-29 · conditional · novelty 7.0

GreenCache dynamically manages LLM KV cache resources to reduce carbon emissions by 15.1% on average (up to 25.3%) while meeting latency constraints for over 90% of requests on real traces.

DiLaServe: High SLO Attainment Serving for Diffusion Language Models

cs.LG · 2026-06-27 · unverdicted · novelty 6.0

DiLaServe improves SLO attainment for diffusion language models by up to 56.6 percentage points and reduces latency by up to 46% with less than 1% accuracy drop via deadline-aware scheduling and dynamic reconfiguration.

KernelSight-LM: A Kernel-Level LLM Inference Simulator

cs.PF · 2026-06-26 · unverdicted · novelty 6.0

KernelSight-LM simulates token-level LLM inference to predict per-kernel latencies and end-to-end metrics (TTFT, TPOT, throughput) with 12.1% and 3.8% kernel errors in cross-generation and target-measured tiers.

Kamera: Unified Position-Invariant Multimodal KV Cache for Training-Free Reuse

cs.DC · 2026-06-22 · unverdicted · novelty 6.0

Kamera stores a low-rank patch with each position-free KV chunk to restore cross-chunk conditioning lost in naive reuse, enabling cheap reordering, sliding windows, and recall across attention mechanisms.

General circuit mapping algorithm for neutral atom quantum computers

quant-ph · 2026-06-18 · unverdicted · novelty 6.0

A graph-theoretic nonlinear integer program solved via genetic algorithm reduces qubit transfers in neutral atom quantum circuit compilation compared to prior zoned-architecture compilers.

ANNS-AMP: Accelerating Approximate Nearest Neighbor Search via Adaptive Mixed-Precision Computing

cs.PF · 2026-06-05 · unverdicted · novelty 6.0

ANNS-AMP adapts distance-computation precision to vector-space regions via a lightweight cluster-level predictor and a bit-serial accelerator, delivering 163.76x/10.57x/2.06x average speedups and 1100x/39.41x/6.66x energy reductions versus CPU/GPU/custom baselines with <2.7% accuracy loss.

CRAM-ER: Error-Resilient Spintronic Computational Random Access Memory for Scalable In-Memory Computation

cs.AR · 2026-06-01 · unverdicted · novelty 6.0

CRAM-ER combines spintronic computational RAM with CMOS adder trees and software fine-tuning to deliver near-lossless DNN accuracy at up to 100x lower latency than CPU/GPU baselines.

ThriftAttention: Selective Mixed Precision for Long-Context FP4 Attention

cs.LG · 2026-05-21 · unverdicted · novelty 6.0

ThriftAttention recovers 89.1% of the FP16 quality gap versus pure FP4 attention by running only 5% of query-key blocks in FP16 on long-context benchmarks.

NasZip: Software and Hardware Co-Design to Accelerate Approximate Nearest Neighbor Search with DIMM-Based Near-Data Processing

cs.AR · 2026-05-21 · conditional · novelty 6.0

NasZip delivers up to 8.4x speedup over CPU baselines and 1.69x over prior NDP accelerators for ANNS by combining near-data processing with statistics-based PCA early exiting, dynamic-float encoding, and data-aware neighbor mapping.

Designing Datacenter Power Delivery Hierarchies for the AI Era

cs.DC · 2026-05-15 · unverdicted · novelty 6.0

Develops a simulation framework showing multi-resource stranding changes deployable capacity and effective costs in AI datacenters, arguing the key metric is deployable capacity over time rather than installed megawatts.

citing papers explorer

Showing 50 of 57 citing papers.

Tessera: Unlocking Heterogeneous GPUs through Kernel-Granularity Disaggregation cs.DC · 2026-04-11 · unverdicted · none · ref 9
Tessera performs kernel-granularity disaggregation on heterogeneous GPUs, achieving up to 2.3x throughput and 1.6x cost efficiency gains for large model inference while generalizing beyond prior methods.
The Turbo-Charged Mapper: Fast and Optimal Mapping for Energy-efficient and Low-latency Accelerator Design cs.AR · 2026-02-16 · unverdicted · none · ref 15
TCM finds provably optimal DNN accelerator mappings by pruning the search space up to 32 orders of magnitude with a new dataplacement concept, delivering 1.2-6.5x better energy-delay-product in 17 seconds instead of hours.
HBM Is Not All You Need: Efficient Disaggregated LLM Serving across Memory-heterogeneous Accelerators cs.AR · 2026-06-29 · unverdicted · none · ref 7
HMA-Serve enables efficient cross-vendor disaggregated LLM serving on memory-heterogeneous accelerators via phase-wise quantization, compute-transfer pipelining, and deferred dequantization, delivering up to 3.2x goodput and 4.8x goodput-per-dollar.
Latency Prediction for LLM Inference on NPU Systems cs.DC · 2026-06-16 · unverdicted · none · ref 31
LENS predicts NPU LLM inference latency with 2.15% mean error by profiling each bucket with two E2E measurements and composing results to capture bucketing non-linearity.
Scalable Concurrent Queues for GPU cs.DC · 2026-06-01 · unverdicted · none · ref 10
Introduces three linearizable GPU concurrent queues: an adapted wait-free queue using segments, a bounded lock-free queue with wave-batched paths, and a bounded wait-free queue using 64-bit CAS operations.
Asymmetric Virtual Memory Paging for Hybrid Mamba-Transformer Inference cs.LG · 2026-05-21 · unverdicted · none · ref 9
AVMP separates KV and SSM cache pools behind unified virtual addressing with failure-triggered migration, cutting OOM events 7.6% and raising throughput 1.83-13.3x on synthetic loads and 2.36x on ShareGPT traces.
ITHICA: Intra-Thread Instruction Checking Approach for Defect-Induced Silent Data Corruptions cs.AR · 2026-05-15 · unverdicted · none · ref 36
ITHICA generates functional tests via intra-thread instruction duplication and comparison, detecting 39% more defective servers than baseline methods on over 3000 real CPUs while revealing new defect behaviors.
AtomTwin.jl: a physics-native digital twin framework for neutral-atom quantum processors quant-ph · 2026-04-20 · unverdicted · none · ref 16
AtomTwin.jl is a physics-native Julia framework for simulating neutral-atom quantum processors, with a demonstration of logical Bell state preparation using four ytterbium-171 atoms in movable tweezers.
Fleet: Hierarchical Task-based Abstraction for Megakernels on Multi-Die GPUs cs.AR · 2026-04-15 · unverdicted · none · ref 17
Fleet adds a Chiplet-task level to GPU task models, enabling per-chiplet scheduling and cooperative cache reuse in persistent megakernels, yielding 1.3-1.5x lower LLM decode latency and up to 37% less HBM traffic on AMD MI350 hardware.
Design automation and space-time reduction for surface-code logical operations using a SAT-based EDA kernel compatible with general encodings quant-ph · 2026-04-14 · unverdicted · none · ref 30
KOVAL-Q uses SAT solving to optimize and verify surface-code logical operations with general encodings, finding d-cycle CNOTs and 2d-cycle rotations that reduce FTQC application runtime by about 10 percent.
InfiniLoRA: Disaggregated Multi-LoRA Serving for Large Language Models cs.DC · 2026-04-08 · unverdicted · none · ref 28
InfiniLoRA decouples LoRA execution from base-model inference and reports 3.05x higher request throughput plus 54% more adapters meeting strict latency SLOs.
NestPipe: Large-Scale Recommendation Training on 1,500+ Accelerators via Nested Pipelining cs.DC · 2026-04-08 · unverdicted · none · ref 41
NestPipe achieves up to 3.06x speedup and 94.07% scaling efficiency on 1,536 workers via dual-buffer inter-batch and frozen-window intra-batch pipelining that overlaps communication with computation.
Qurator: Scheduling Hybrid Quantum-Classical Workflows Across Heterogeneous Cloud Providers quant-ph · 2026-04-07 · unverdicted · none · ref 15
Qurator jointly optimizes queue time and fidelity for hybrid quantum-classical workflows across providers using quantum-aware DAG scheduling and a unified logarithmic fidelity score, achieving 30-75% wait reduction at high load with bounded accuracy cost.
PrefixWall: Mitigating Prefix Caching Side Channels in Shared LLM Systems cs.CR · 2026-03-11 · unverdicted · none · ref 54
PrefixWall mitigates APC side channels in multi-tenant LLM systems via selective prefix isolation, delivering up to 70% higher cache reuse and 30% lower latency than full-isolation baselines.
Cache Your Prompt When It's Green: Carbon-Aware Caching for Large Language Model Serving cs.DC · 2025-05-29 · conditional · none · ref 75
GreenCache dynamically manages LLM KV cache resources to reduce carbon emissions by 15.1% on average (up to 25.3%) while meeting latency constraints for over 90% of requests on real traces.
DiLaServe: High SLO Attainment Serving for Diffusion Language Models cs.LG · 2026-06-27 · unverdicted · none · ref 42
DiLaServe improves SLO attainment for diffusion language models by up to 56.6 percentage points and reduces latency by up to 46% with less than 1% accuracy drop via deadline-aware scheduling and dynamic reconfiguration.
KernelSight-LM: A Kernel-Level LLM Inference Simulator cs.PF · 2026-06-26 · unverdicted · none · ref 34
KernelSight-LM simulates token-level LLM inference to predict per-kernel latencies and end-to-end metrics (TTFT, TPOT, throughput) with 12.1% and 3.8% kernel errors in cross-generation and target-measured tiers.
Kamera: Unified Position-Invariant Multimodal KV Cache for Training-Free Reuse cs.DC · 2026-06-22 · unverdicted · none · ref 5
Kamera stores a low-rank patch with each position-free KV chunk to restore cross-chunk conditioning lost in naive reuse, enabling cheap reordering, sliding windows, and recall across attention mechanisms.
General circuit mapping algorithm for neutral atom quantum computers quant-ph · 2026-06-18 · unverdicted · none · ref 11
A graph-theoretic nonlinear integer program solved via genetic algorithm reduces qubit transfers in neutral atom quantum circuit compilation compared to prior zoned-architecture compilers.
ANNS-AMP: Accelerating Approximate Nearest Neighbor Search via Adaptive Mixed-Precision Computing cs.PF · 2026-06-05 · unverdicted · none · ref 18
ANNS-AMP adapts distance-computation precision to vector-space regions via a lightweight cluster-level predictor and a bit-serial accelerator, delivering 163.76x/10.57x/2.06x average speedups and 1100x/39.41x/6.66x energy reductions versus CPU/GPU/custom baselines with <2.7% accuracy loss.
CRAM-ER: Error-Resilient Spintronic Computational Random Access Memory for Scalable In-Memory Computation cs.AR · 2026-06-01 · unverdicted · none · ref 7
CRAM-ER combines spintronic computational RAM with CMOS adder trees and software fine-tuning to deliver near-lossless DNN accuracy at up to 100x lower latency than CPU/GPU baselines.
ThriftAttention: Selective Mixed Precision for Long-Context FP4 Attention cs.LG · 2026-05-21 · unverdicted · none · ref 13
ThriftAttention recovers 89.1% of the FP16 quality gap versus pure FP4 attention by running only 5% of query-key blocks in FP16 on long-context benchmarks.
NasZip: Software and Hardware Co-Design to Accelerate Approximate Nearest Neighbor Search with DIMM-Based Near-Data Processing cs.AR · 2026-05-21 · conditional · none · ref 40
NasZip delivers up to 8.4x speedup over CPU baselines and 1.69x over prior NDP accelerators for ANNS by combining near-data processing with statistics-based PCA early exiting, dynamic-float encoding, and data-aware neighbor mapping.
Designing Datacenter Power Delivery Hierarchies for the AI Era cs.DC · 2026-05-15 · unverdicted · none · ref 47
Develops a simulation framework showing multi-resource stranding changes deployable capacity and effective costs in AI datacenters, arguing the key metric is deployable capacity over time rather than installed megawatts.
Fast MoE Inference via Predictive Prefetching and Expert Replication cs.LG · 2026-05-12 · conditional · none · ref 9
Dynamic replication of predicted overloaded experts in MoE models achieves near-100% GPU utilization and up to 3x faster inference while retaining 90-95% of baseline performance.
ROSE: Rollout On Serving GPUs via Cooperative Elasticity for Agentic RL cs.DC · 2026-05-07 · unverdicted · none · ref 47
ROSE is a system for cooperative elasticity that co-locates serving and rollout models on shared GPUs, delivering 1.3-3.3x higher end-to-end throughput than fixed-resource baselines while preserving serving SLOs.
SPECTRE: Hybrid Ordinary-Parallel Speculative Serving for Resource-Efficient LLM Inference cs.DC · 2026-05-04 · unverdicted · none · ref 3 · 2 links
SPECTRE achieves up to 2.28x speedup for large-model LLM serving by running speculative draft generation and target verification in parallel using idle tail-model services.
Adaptive Management of Microservices in Dynamic Computing Environments: A Taxonomy and Future Directions cs.DC · 2026-04-28 · unverdicted · none · ref 20
A new taxonomy for dynamics-aware microservice management, synthesized from 84 systems, finds that production dynamics are often only partially modeled and that reported performance gains depend on evaluation realism.
LEO: Tracing GPU Stall Root Causes via Cross-Vendor Backward Slicing cs.DC · 2026-04-21 · unverdicted · none · ref 49
LEO performs cross-vendor backward slicing from stalled GPU instructions to attribute root causes to source code, enabling optimizations that produce geometric-mean speedups of 1.73-1.82x on 21 workloads.
Proxics: an efficient programming model for far memory accelerators cs.OS · 2026-04-20 · conditional · none · ref 36
Proxics introduces lightweight virtual processors and low-latency communication channels as portable OS abstractions for programming near-data processing accelerators, demonstrated on real hardware for memory-intensive workloads.
KAIROS: Stateful, Context-Aware Power-Efficient Agentic Inference Serving cs.DC · 2026-04-17 · unverdicted · none · ref 48
KAIROS reduces power by 27% on average (up to 39.8%) for agentic AI inference by using long-lived context to jointly manage GPU frequency, concurrency, and request routing across instances.
O3LS: Optimizing Lattice Surgery via Automatic Layout Searching and Loose Scheduling quant-ph · 2026-04-16 · unverdicted · none · ref 48
O3LS reduces space overhead by up to 46.7% and time overhead by up to 36% in lattice surgery while suppressing logical error rates by up to an order of magnitude compared with prior layout and scheduling approaches.
ELMoE-3D: Leveraging Intrinsic Elasticity of MoE for Hybrid-Bonding-Enabled Self-Speculative Decoding in On-Premises Serving cs.LG · 2026-04-16 · unverdicted · none · ref 25
ELMoE-3D achieves 6.6x average speedup and 4.4x energy efficiency gain for MoE serving on 3D hardware by scaling expert and bit elasticity for elastic self-speculative decoding.
WaveTune: Wave-aware Bilinear Modeling for Efficient GPU Kernel Auto-tuning cs.PF · 2026-04-11 · unverdicted · none · ref 50
WaveTune introduces a wave-aware bilinear latency predictor and wave-structured sparse sampling to enable fast runtime auto-tuning of GPU kernels, achieving up to 1.83x kernel speedup and 1.33x TTFT reduction with drastically lower overhead.
PG-MDP: Profile-Guided Memory Dependence Prediction for Area-Constrained Cores cs.PL · 2026-04-09 · unverdicted · none · ref 4
Profile-guided opcode labeling removes consistently independent loads from the MDP working set, cutting queries 79%, false dependencies 77%, and raising small-core IPC 1.47% on SPEC2017 intspeed.
Blink: CPU-Free LLM Inference by Delegating the Serving Stack to GPU and SmartNIC cs.DC · 2026-04-08 · unverdicted · none · ref 15
Blink enables CPU-free LLM inference via SmartNIC offload and persistent GPU kernel, delivering up to 8.47x lower P99 TTFT, 3.4x lower P99 TPOT, 2.1x higher decode throughput, and 48.6% lower energy per token while remaining stable under CPU interference.
Performance Isolation and Semantic Determinism in Efficient GPU Spatial Sharing cs.DC · 2026-03-16 · unverdicted · none · ref 51
CoGPU resolves the tradeoff in GPU sharing by introducing GPU coroutines for semantic-preserving resource migration, delivering up to 79.2% higher training throughput and zero token mismatch in inference.
DCGen 1.1 Technical Report: Generating Datacenter Configurations (including IT, Power, Cooling) cs.DC · 2026-03-15 · accept · none · ref 34
DCGen generates customizable datacenter configurations with IT, power, and cooling components optimized for power, compute, and area targets using real equipment catalogs and workload-specific IT mixes.
PipeWeave: Synergizing Analytical and Learning Models for Unified GPU Performance Prediction cs.PF · 2026-01-21 · unverdicted · none · ref 55 · 2 links
PipeWeave predicts GPU kernel performance with 6.1% average error and end-to-end inference with 8.5% error by feeding analytical pipeline features into ML, cutting prior method errors by 4-7x across 11 GPUs.
Amoeba: Runtime Tensor Parallel Transformation for LLM Inference Services cs.DC · 2025-09-24 · unverdicted · none · ref 22
Amoeba adaptively adjusts tensor parallelism at runtime for LLM inference services to handle mixed short and long context requests, delivering 1.75x-6.57x throughput gains over prior solutions in real-world trace evaluations.
KernelFlume: Elastic Core-Attention Scaling for Agentic Long-Context Decoding cs.DC · 2026-06-28 · unverdicted · none · ref 25
KernelFlume presents a disaggregated decode architecture that separates core attention from projection/FFN paths to enable elastic scaling of attention nodes, reporting up to 61% lower cost per million tokens versus full-instance scaling on H100 hardware for Llama-3.1-8B under dynamic long-context w
ShuntServe: Cost-Efficient LLM Serving on Heterogeneous Spot GPU Clusters cs.DC · 2026-06-17 · unverdicted · none · ref 9
ShuntServe reports 1.42x and 1.35x higher throughput than baselines plus 31.9 percent and 31.2 percent cost-efficiency gains over on-demand instances for Llama-3.1-70B and Qwen3-32B on heterogeneous AWS spot clusters.
Cassandra: Enabling Reasoning LLMs at Edge via Self-Speculative Decoding cs.AR · 2026-05-26 · unverdicted · none · ref 26
Cassandra is a self-speculative decoding system that builds a draft model via fine-grained data selection and optimized pruning/mantissa truncation, achieving up to 2.41x speedup over BF16 and 1.81x more tokens than Eagle-3 on Llama 3 8B without training.
Co-Designing Graph-based Approximate Nearest Neighbor Search at Billion Scale for Processing-in-Memory cs.AR · 2026-05-25 · unverdicted · none · ref 59
Co-design of 14.5x compacted index, asynchronous scheduler, and multiplication-free kernel for PIM-based graph ANNS delivers up to 20x CPU and 17.1x GPU throughput on billion-scale benchmarks.
EVA: Accelerating LLM Decoding via an Efficient Vector Quantization Architecture cs.AR · 2026-05-22 · unverdicted · none · ref 50
EVA is a vector-quantization hardware architecture that transforms LLM decoding from GEMV to GEMM via direct codebook dot products and conflict-free output buffering, claiming up to 11.17x speedup over prior lookup designs.
KV-RM: Regularizing KV-Cache Movement for Static-Graph LLM Serving cs.AR · 2026-05-10 · unverdicted · none · ref 34 · 2 links
KV-RM regularizes KV-cache movement via block paging and coalesced transfers to improve throughput, tail latency, and memory efficiency in static-graph LLM serving without changing the decoder interface.
EnergAIzer: Fast and Accurate GPU Power Estimation Framework for AI Workloads cs.AR · 2026-04-22 · unverdicted · none · ref 38
EnergAIzer predicts module-level GPU utilization from structured kernel patterns and feeds it into a power model to estimate dynamic power with 8% error on Ampere GPUs and 7% on H100 forecasts.
Compiler Framework for Directional Transport in Zoned Neutral Atom Systems with AOD Assistance: A Hybrid Remote CZ Approach quant-ph · 2026-04-13 · unverdicted · none · ref 31
A hybrid DT-AOD compiler framework enables faster remote CZ gates in neutral atom systems by transporting Rydberg excitations directionally along resettable ancilla paths.
FILCO: Flexible Composing Architecture with Real-Time Reconfigurability for DNN Acceleration cs.AR · 2026-04-08 · unverdicted · none · ref 25
FILCO introduces a real-time reconfigurable composing architecture for DNN acceleration that achieves 1.3x-5x better throughput and hardware efficiency than prior designs on diverse workloads via an analytical model and two-stage design space exploration.
Measurement of Generative AI Workload Power Profiles for Whole-Facility Data Center Infrastructure Planning eess.SY · 2026-04-08 · unverdicted · none · ref 16
High-resolution power profiles for AI workloads on H100 GPUs are measured and scaled to whole-facility energy demand using a bottom-up model, with the dataset made public.
