Canonical reference

Tender: Accelerating large language models via tensor decomposition and runtime requantization

2025 · arXiv 9077.2024

Canonical reference. 87% of citing Pith papers cite this work as background.

57 Pith papers citing it

citation-role summary

background 13 · baseline 1 · dataset 1

citation-polarity summary

background 13 · baseline 1 · use dataset 1

representative citing papers

Tessera: Unlocking Heterogeneous GPUs through Kernel-Granularity Disaggregation

cs.DC · 2026-04-11 · unverdicted · novelty 8.0

Tessera performs kernel-granularity disaggregation on heterogeneous GPUs, achieving up to 2.3x throughput and 1.6x cost efficiency gains for large model inference while generalizing beyond prior methods.

The Turbo-Charged Mapper: Fast and Optimal Mapping for Energy-efficient and Low-latency Accelerator Design

cs.AR · 2026-02-16 · unverdicted · novelty 8.0

TCM finds provably optimal DNN accelerator mappings by pruning the search space up to 32 orders of magnitude with a new dataplacement concept, delivering 1.2-6.5x better energy-delay-product in 17 seconds instead of hours.

HBM Is Not All You Need: Efficient Disaggregated LLM Serving across Memory-heterogeneous Accelerators

cs.AR · 2026-06-29 · unverdicted · novelty 7.0

HMA-Serve enables efficient cross-vendor disaggregated LLM serving on memory-heterogeneous accelerators via phase-wise quantization, compute-transfer pipelining, and deferred dequantization, delivering up to 3.2x goodput and 4.8x goodput-per-dollar.

Latency Prediction for LLM Inference on NPU Systems

cs.DC · 2026-06-16 · unverdicted · novelty 7.0

LENS predicts NPU LLM inference latency with 2.15% mean error by profiling each bucket with two E2E measurements and composing results to capture bucketing non-linearity.

Scalable Concurrent Queues for GPU

cs.DC · 2026-06-01 · unverdicted · novelty 7.0

Introduces three linearizable GPU concurrent queues: an adapted wait-free queue using segments, a bounded lock-free queue with wave-batched paths, and a bounded wait-free queue using 64-bit CAS operations.

Asymmetric Virtual Memory Paging for Hybrid Mamba-Transformer Inference

cs.LG · 2026-05-21 · unverdicted · novelty 7.0

AVMP separates KV and SSM cache pools behind unified virtual addressing with failure-triggered migration, cutting OOM events 7.6% and raising throughput 1.83-13.3x on synthetic loads and 2.36x on ShareGPT traces.

ITHICA: Intra-Thread Instruction Checking Approach for Defect-Induced Silent Data Corruptions

cs.AR · 2026-05-15 · unverdicted · novelty 7.0

ITHICA generates functional tests via intra-thread instruction duplication and comparison, detecting 39% more defective servers than baseline methods on over 3000 real CPUs while revealing new defect behaviors.

AtomTwin.jl: a physics-native digital twin framework for neutral-atom quantum processors

quant-ph · 2026-04-20 · unverdicted · novelty 7.0

AtomTwin.jl is a physics-native Julia framework for simulating neutral-atom quantum processors, with a demonstration of logical Bell state preparation using four ytterbium-171 atoms in movable tweezers.

Fleet: Hierarchical Task-based Abstraction for Megakernels on Multi-Die GPUs

cs.AR · 2026-04-15 · unverdicted · novelty 7.0

Fleet adds a Chiplet-task level to GPU task models, enabling per-chiplet scheduling and cooperative cache reuse in persistent megakernels, yielding 1.3-1.5x lower LLM decode latency and up to 37% less HBM traffic on AMD MI350 hardware.

Design automation and space-time reduction for surface-code logical operations using a SAT-based EDA kernel compatible with general encodings

quant-ph · 2026-04-14 · unverdicted · novelty 7.0

KOVAL-Q uses SAT solving to optimize and verify surface-code logical operations with general encodings, finding d-cycle CNOTs and 2d-cycle rotations that reduce FTQC application runtime by about 10 percent.

InfiniLoRA: Disaggregated Multi-LoRA Serving for Large Language Models

cs.DC · 2026-04-08 · unverdicted · novelty 7.0

InfiniLoRA decouples LoRA execution from base-model inference and reports 3.05x higher request throughput plus 54% more adapters meeting strict latency SLOs.

NestPipe: Large-Scale Recommendation Training on 1,500+ Accelerators via Nested Pipelining

cs.DC · 2026-04-08 · unverdicted · novelty 7.0

NestPipe achieves up to 3.06x speedup and 94.07% scaling efficiency on 1,536 workers via dual-buffer inter-batch and frozen-window intra-batch pipelining that overlaps communication with computation.

Qurator: Scheduling Hybrid Quantum-Classical Workflows Across Heterogeneous Cloud Providers

quant-ph · 2026-04-07 · unverdicted · novelty 7.0

Qurator jointly optimizes queue time and fidelity for hybrid quantum-classical workflows across providers using quantum-aware DAG scheduling and a unified logarithmic fidelity score, achieving 30-75% wait reduction at high load with bounded accuracy cost.

PrefixWall: Mitigating Prefix Caching Side Channels in Shared LLM Systems

cs.CR · 2026-03-11 · unverdicted · novelty 7.0

PrefixWall mitigates APC side channels in multi-tenant LLM systems via selective prefix isolation, delivering up to 70% higher cache reuse and 30% lower latency than full-isolation baselines.

Cache Your Prompt When It's Green: Carbon-Aware Caching for Large Language Model Serving

cs.DC · 2025-05-29 · conditional · novelty 7.0

GreenCache dynamically manages LLM KV cache resources to reduce carbon emissions by 15.1% on average (up to 25.3%) while meeting latency constraints for over 90% of requests on real traces.

DiLaServe: High SLO Attainment Serving for Diffusion Language Models

cs.LG · 2026-06-27 · unverdicted · novelty 6.0

DiLaServe improves SLO attainment for diffusion language models by up to 56.6 percentage points and reduces latency by up to 46% with less than 1% accuracy drop via deadline-aware scheduling and dynamic reconfiguration.

KernelSight-LM: A Kernel-Level LLM Inference Simulator

cs.PF · 2026-06-26 · unverdicted · novelty 6.0

KernelSight-LM simulates token-level LLM inference to predict per-kernel latencies and end-to-end metrics (TTFT, TPOT, throughput) with 12.1% and 3.8% kernel errors in cross-generation and target-measured tiers.

Kamera: Unified Position-Invariant Multimodal KV Cache for Training-Free Reuse

cs.DC · 2026-06-22 · unverdicted · novelty 6.0

Kamera stores a low-rank patch with each position-free KV chunk to restore cross-chunk conditioning lost in naive reuse, enabling cheap reordering, sliding windows, and recall across attention mechanisms.

General circuit mapping algorithm for neutral atom quantum computers

quant-ph · 2026-06-18 · unverdicted · novelty 6.0

A graph-theoretic nonlinear integer program solved via genetic algorithm reduces qubit transfers in neutral atom quantum circuit compilation compared to prior zoned-architecture compilers.

ANNS-AMP: Accelerating Approximate Nearest Neighbor Search via Adaptive Mixed-Precision Computing

cs.PF · 2026-06-05 · unverdicted · novelty 6.0

ANNS-AMP adapts distance-computation precision to vector-space regions via a lightweight cluster-level predictor and a bit-serial accelerator, delivering 163.76x/10.57x/2.06x average speedups and 1100x/39.41x/6.66x energy reductions versus CPU/GPU/custom baselines with <2.7% accuracy loss.

CRAM-ER: Error-Resilient Spintronic Computational Random Access Memory for Scalable In-Memory Computation

cs.AR · 2026-06-01 · unverdicted · novelty 6.0

CRAM-ER combines spintronic computational RAM with CMOS adder trees and software fine-tuning to deliver near-lossless DNN accuracy at up to 100x lower latency than CPU/GPU baselines.

ThriftAttention: Selective Mixed Precision for Long-Context FP4 Attention

cs.LG · 2026-05-21 · unverdicted · novelty 6.0

ThriftAttention recovers 89.1% of the FP16 quality gap versus pure FP4 attention by running only 5% of query-key blocks in FP16 on long-context benchmarks.

NasZip: Software and Hardware Co-Design to Accelerate Approximate Nearest Neighbor Search with DIMM-Based Near-Data Processing

cs.AR · 2026-05-21 · conditional · novelty 6.0

NasZip delivers up to 8.4x speedup over CPU baselines and 1.69x over prior NDP accelerators for ANNS by combining near-data processing with statistics-based PCA early exiting, dynamic-float encoding, and data-aware neighbor mapping.

Designing Datacenter Power Delivery Hierarchies for the AI Era

cs.DC · 2026-05-15 · unverdicted · novelty 6.0

Develops a simulation framework showing multi-resource stranding changes deployable capacity and effective costs in AI datacenters, arguing the key metric is deployable capacity over time rather than installed megawatts.

citing papers explorer

Showing 7 of 57 citing papers.

FlexPipe: Adapting Dynamic LLM Serving Through Inflight Pipeline Refactoring in Fragmented Serverless Clusters

cs.DC · 2025-10-13 · unverdicted · none · ref 33

FlexPipe introduces runtime pipeline refactoring for LLMs to achieve higher resource efficiency and lower latency in serverless GPU clusters with fragmentation.

Combating the Memory Walls: Optimization Pathways for Long-Context Agentic LLM Inference

cs.AR · 2025-09-11 · unverdicted · none · ref 38 · 2 links

PLENA introduces a co-designed system with three optimization pathways for long-context agentic LLM inference, claiming up to 2.23x throughput over A100 and 4.04x energy efficiency.

Performance Analysis and Optimization of 3D Generative Diffusion Models across GPU Architectures

cs.LG · 2026-06-11 · unverdicted · none · ref 27

Profiling of Med-DDPM shows cuDNN kernels dominate training; TF32 Tensor Core activation and 3D channels-last layout reduce SM cycles up to 100x and raise Tensor Core utilization on A100 without quality loss.

Reference-Augmented Learning for Precise Tracking Policy of Tendon-Driven Continuum Robots

cs.RO · 2026-04-28 · unverdicted · none · ref 25

A reference-augmented offline learning framework for 6-DOF tracking control of tendon-driven continuum robots achieves 50.9% lower average position error than non-augmented baselines.

Computing In Spintronic Memory: A Thermal Perspective

cs.ET · 2026-04-08 · unverdicted · none · ref 4

Spintronic CiM shows uniform temperature that increases linearly with participating memory cells and decreases linearly with array size.

Energy-Aware Computing in the Year 2026

cs.DC · 2026-05-23 · unverdicted · none · ref 43 · 2 links

The paper reviews energy-aware computing literature and constructs a taxonomy organized by hardware/software aspects, measurement, optimizations, scheduling, scaling, consolidation, federated learning, and cooling.

PureMagic: A Dynamic Scheduler for Lattice Surgery

quant-ph · 2025-12-06 · unreviewed · ref 28
