{"total":40,"items":[{"citing_arxiv_id":"2605.23081","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ThriftAttention: Selective Mixed Precision for Long-Context FP4 Attention","primary_cat":"cs.LG","submitted_at":"2026-05-21T22:28:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ThriftAttention recovers 89.1% of the FP16 quality gap versus pure FP4 attention by running only 5% of query-key blocks in FP16 on long-context benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22416","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Asymmetric Virtual Memory Paging for Hybrid Mamba-Transformer Inference","primary_cat":"cs.LG","submitted_at":"2026-05-21T12:37:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AVMP separates KV and SSM cache pools behind unified virtual addressing with failure-triggered migration, cutting OOM events 7.6% and raising throughput 1.83-13.3x on synthetic loads and 2.36x on ShareGPT traces.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21952","ref_index":40,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"NasZip: Software and Hardware Co-Design to Accelerate Approximate Nearest Neighbor Search with DIMM-Based Near-Data Processing","primary_cat":"cs.AR","submitted_at":"2026-05-21T03:36:27+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"NasZip delivers up to 8.4x speedup over CPU baselines and 1.69x over prior NDP accelerators for ANNS by combining near-data processing with statistics-based PCA early exiting, dynamic-float encoding, and data-aware neighbor mapping.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16255","ref_index":47,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Designing Datacenter Power Delivery Hierarchies for the AI Era","primary_cat":"cs.DC","submitted_at":"2026-05-15T17:58:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Develops a simulation framework showing multi-resource stranding changes deployable capacity and effective costs in AI datacenters, arguing the key metric is deployable capacity over time rather than installed megawatts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15638","ref_index":36,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ITHICA: Intra-Thread Instruction Checking Approach for Defect-Induced Silent Data Corruptions","primary_cat":"cs.AR","submitted_at":"2026-05-15T05:43:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ITHICA generates functional tests via intra-thread instruction duplication and comparison, detecting 39% more defective servers than baseline methods on over 3000 real CPUs while revealing new defect behaviors.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11537","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Fast MoE Inference via Predictive Prefetching and Expert Replication","primary_cat":"cs.LG","submitted_at":"2026-05-12T05:03:53+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Dynamic replication of predicted overloaded experts in MoE models achieves near-100% GPU utilization and up to 3x faster inference while retaining 90-95% of baseline performance.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Figure 2: SwitchTransformers MoE Layer with all experts, active and inactive(white) in GPU traditional routing approaches, including softmax-based mecha- nisms, contribute substantially to the observed inefficiencies. The computational burden of softmax routing, combined with all-to-all communications and the uneven load distribution across experts, necessitates novel architectural adaptations [9] [8] [2]. MoE Overhead.The growing number of experts in MoE models introduces significant sparsity, which results in substantial under- utilization of GPU resources. As more experts are added, a larger proportion of GPU memory sits idle during each forward pass, leading to severe inefficiencies during inference. This sparsity not only wastes precious compute resources but also increases the"},{"citing_arxiv_id":"2605.09735","ref_index":34,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"KV-RM: Regularizing KV-Cache Movement for Static-Graph LLM Serving","primary_cat":"cs.AR","submitted_at":"2026-05-10T20:10:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"KV-RM regularizes KV-cache movement in static-graph LLM serving via block paging and merge-staged transport to improve throughput, tail latency, and memory use for variable-length decoding.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"machine-learning-for-peak-performance/ [33] Biao Sun, Ziming Huang, Hanyu Zhao, Wencong Xiao, Xinyi Zhang, Yong Li, and Wei Lin. 2024. Llumnix: Dynamic Scheduling for Large Language Model Serving. 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI).https://www.usenix.org /conference/osdi24/presentation/sun-biaoUSENIX Association. [34] Jiaming Tang, Yilong Zhao, Kan Zhu, Guangxuan Xiao, Baris Kasikci, and Song Han. 2024. QUEST: query-aware sparsity for efficient long- context LLM inference. InProceedings of the 41st International Con- ference on Machine Learning(Vienna, Austria)(ICML'24). JMLR.org, Cambridge, MA, USA, Article 1955, 11 pages.https://arxiv.org/abs/24 06.10774 13"},{"citing_arxiv_id":"2605.06534","ref_index":47,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ROSE: Rollout On Serving GPUs via Cooperative Elasticity for Agentic RL","primary_cat":"cs.DC","submitted_at":"2026-05-07T16:33:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ROSE is a system for cooperative elasticity that co-locates serving and rollout models on shared GPUs, delivering 1.3-3.3x higher end-to-end throughput than fixed-resource baselines while preserving serving SLOs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08151","ref_index":3,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SPECTRE: Hybrid Ordinary-Parallel Speculative Serving for Resource-Efficient LLM Inference","primary_cat":"cs.DC","submitted_at":"2026-05-04T01:27:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SPECTRE achieves up to 2.28x speedup for large-model LLM serving by running speculative draft generation and target verification in parallel using idle tail-model services.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"ment over the strongest speculative decoding baselines. Talk is cheap, we show you the code: https://github.com/sgl-project/sglang/pull/22272. Date: May 13, 2026 1 Introduction Large language model (LLM) serving platforms[ 1, 2] are increasingly deployed as multi-model cloud systems, where shared infrastructure supports models with different sizes, capabilities, and service roles [ 3]. In practice, user demand in such systems is often long-tailed: a small number of popular large models receive most requests, while many smaller models in the tail see much lighter traﬀic [ 4]. As these tail models remain online to serve the full model portfolio, their own traﬀic often falls short of fully utilizing their generation capacity. This imbalance motivates the reuse of idle tail-model capacity to assist heavily loaded large-model"},{"citing_arxiv_id":"2604.25222","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Adaptive Management of Microservices in Dynamic Computing Environments: A Taxonomy and Future Directions","primary_cat":"cs.DC","submitted_at":"2026-04-28T04:59:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A new taxonomy for dynamics-aware microservice management, synthesized from 84 systems, finds that production dynamics are often only partially modeled and that reported performance gains depend on evaluation realism.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"co-located with batch jobs, leading to interference on CPU, memory, storage, and accelerators [24, 25, 95, 101, 108, 112, 156, 162]. Both system-level mechanisms (isolation, bandwidth control) and scheduler-level policies are used to mitigate contention [38, 64, 156, 169]. At the microservice layer, specialized resource managers aim to improve utilization while meeting service-level agreement (SLA)/SLO targets [20, 66, 88, 90, 114, 163]. • Additional dynamics: failures and sustainability.Dynamic environments also include failure events (node failures, stragglers, overload, QoS degradation) [ 51, 112, 164], and sustainability-related variability such as time-varying carbon intensity or energy budgets [100, 122, 157]. Recent systems work shows both the promise of"},{"citing_arxiv_id":"2604.20105","ref_index":38,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"EnergAIzer: Fast and Accurate GPU Power Estimation Framework for AI Workloads","primary_cat":"cs.AR","submitted_at":"2026-04-22T02:02:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"EnergAIzer predicts module-level GPU utilization from structured kernel patterns and feeds it into a power model to estimate dynamic power with 8% error on Ampere GPUs and 7% on H100 forecasts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.20032","ref_index":49,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LEO: Tracing GPU Stall Root Causes via Cross-Vendor Backward Slicing","primary_cat":"cs.DC","submitted_at":"2026-04-21T22:23:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LEO performs cross-vendor backward slicing from stalled GPU instructions to attribute root causes to source code, enabling optimizations that produce geometric-mean speedups of 1.73-1.82x on 21 workloads.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[7] evaluate seven programming models across NVIDIA and AMD GPUs, and Kwack et al. [11] benchmark 12 HPC and machine-learning applications across Frontier, Aurora, and Polaris. These studies establish the exis- tence of cross-platform performance gaps. LEO complements such studies by explaining performance gaps at the instruction level. e) Top-Down Analysis:Yasin [49] introduced top-down microarchitecture analysis for Intel CPUs, and Nowak et al. [50] proposed hierarchical cycle accounting. For GPUs, Saiz et al. [51] adapted top-down profiling to NVIDIA GPUs, and DrGPU [16] extended the idea into a portable profiler. LEO is complementary rather than competitive: top-down methods classify where cycles go, whereas LEO traces specific stalls"},{"citing_arxiv_id":"2604.18531","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AtomTwin.jl: a physics-native digital twin framework for neutral-atom quantum processors","primary_cat":"quant-ph","submitted_at":"2026-04-20T17:26:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AtomTwin.jl is a physics-native Julia framework for simulating neutral-atom quantum processors, with a demonstration of logical Bell state preparation using four ytterbium-171 atoms in movable tweezers.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.18120","ref_index":36,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Proxics: an efficient programming model for far memory accelerators","primary_cat":"cs.OS","submitted_at":"2026-04-20T11:38:54+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Proxics introduces lightweight virtual processors and low-latency communication channels as portable OS abstractions for programming near-data processing accelerators, demonstrated on real hardware for memory-intensive workloads.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"split-phase copy engine. Such compiler support would also enable more efficient sharing of MCC hardware. At present, each CP has exclusive use of a MCC, but should be feasible to compile multiple CP into a single binary which interleaves accesses by different application code according to static schedule chosen by the compiler, as in some hard real-time systems [36]. References [1] Advanced Micro Devices, Inc. 2024.UltraScale Architecture-Based FPGAs Memory IP v1.4 LogiCORE IP Product Guide. Technical Report PG150. 955 pages.https://docs.amd.com/r/en-US/pg150-ultrascale- memory-ip [2] Advanced Micro Devices, Inc. 2025.MicroBlaze V Processor Reference Guide. Technical Report UG1629. 152 pages.https://docs.amd."},{"citing_arxiv_id":"2604.16682","ref_index":48,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"KAIROS: Stateful, Context-Aware Power-Efficient Agentic Inference Serving","primary_cat":"cs.DC","submitted_at":"2026-04-17T20:39:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"KAIROS reduces power by 27% on average (up to 39.8%) for agentic AI inference by using long-lived context to jointly manage GPU frequency, concurrency, and request routing across instances.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"48550/arXiv.2511.00739 [47] Yeonju Ro, Haoran Qiu, Íñigo Goiri, Rodrigo Fonseca, Ricardo Bian- chini, Aditya Akella, Zhangyang Wang, Mattan Erez, and Esha Choukse. 2025. Sherlock: Reliable and Efficient Agentic Workflow Execution.arXiv preprint arXiv:2511.00330(2025). doi:10.48550/arXiv. 2511.00330 13 Yichao Yuan, Mosharaf Chowdhury, and Nishil Talati [48] Keshav Santhanam, Deepti Raghavan, Muhammad Shahir Rahman, Thejas Venkatesh, Neha Kunjal, Pratiksha Thaker, Philip Levis, and Matei Zaharia. 2024. ALTO: An Efficient Network Orchestrator for Compound AI Systems. InProceedings of the 4th Workshop on Machine Learning and Systems (EuroMLSys '24). 117-125. doi:10.1145/3642970. 3655844 [49] Noah Shinn, Federico Cassano, Bailin Labash, Ashwin Gopinath,"},{"citing_arxiv_id":"2604.15099","ref_index":48,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"O3LS: Optimizing Lattice Surgery via Automatic Layout Searching and Loose Scheduling","primary_cat":"quant-ph","submitted_at":"2026-04-16T14:57:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"O3LS reduces space overhead by up to 46.7% and time overhead by up to 36% in lattice surgery while suppressing logical error rates by up to an order of magnitude compared with prior layout and scheduling approaches.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"recent locality-aware methodLAPBC[25], which enhances circuit parallelism and outperforms prior compilers [4], [39]. We also compare againstSPARO[28], another automated data- layout design method that aims to expand data layouts. c) Benchmarks.We benchmark using a representative set of FT quantum algorithms, following prior FTQC compiler studies [28], [38], [39], [48], [51]. These include circuits for Hamiltonian simulation, Quantum Fourier Transform, key components of Shor's algorithm (e.g., adders and multipliers), and SW AP tests for quantum machine learning,many of which serve as building blocks for larger algorithms.We source the QASM files from MQT Bench [41] and FTCircuitBench [21]. Some FTCircuitBench circuits were originally taken from"},{"citing_arxiv_id":"2604.14626","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ELMoE-3D: Leveraging Intrinsic Elasticity of MoE for Hybrid-Bonding-Enabled Self-Speculative Decoding in On-Premises Serving","primary_cat":"cs.LG","submitted_at":"2026-04-16T05:12:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ELMoE-3D achieves 6.6x average speedup and 4.4x energy efficiency gain for MoE serving on 3D hardware by scaling expert and bit elasticity for elastic self-speculative decoding.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Among HB processes, Die-to-Wafer (D2W) [22] bonds individual Known Good Dies (KGDs) onto a wafer, enabling heterogeneous die sizes and process nodes with high yield, so that system specifi- cation can be tailored to workload requirements. Figure 3 shows the D2W hybrid bonding process and the resulting system architec- ture. DRAM dies are directly stacked on the logic die via HB, while off-package LPDDR5 [25] serves residual capacity, forming a hier- archical memory system. The on-die DRAM provides TB/s-class bandwidth, and the large bandwidth gap between HB and LPDDR5 naturally lends itself to a caching mechanism [64, 69], where fre- quently accessed data reside on-die and the rest are fetched from LPDDR5 on demand. 3 Motivation 3.1 Low Arithmetic Intensity in MoE Serving"},{"citing_arxiv_id":"2604.15379","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Fleet: Hierarchical Task-based Abstraction for Megakernels on Multi-Die GPUs","primary_cat":"cs.AR","submitted_at":"2026-04-15T21:49:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Fleet adds a Chiplet-task level to GPU task models, enabling per-chiplet scheduling and cooperative cache reuse in persistent megakernels, yielding 1.3-1.5x lower LLM decode latency and up to 37% less HBM traffic on AMD MI350 hardware.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.12560","ref_index":30,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Design automation and space-time reduction for surface-code logical operations using a SAT-based EDA kernel compatible with general encodings","primary_cat":"quant-ph","submitted_at":"2026-04-14T10:42:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"KOVAL-Q uses SAT solving to optimize and verify surface-code logical operations with general encodings, finding d-cycle CNOTs and 2d-cycle rotations that reduce FTQC application runtime by about 10 percent.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.11000","ref_index":31,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Compiler Framework for Directional Transport in Zoned Neutral Atom Systems with AOD Assistance: A Hybrid Remote CZ Approach","primary_cat":"quant-ph","submitted_at":"2026-04-13T04:58:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A hybrid DT-AOD compiler framework enables faster remote CZ gates in neutral atom systems by transporting Rydberg excitations directionally along resettable ancilla paths.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.10187","ref_index":50,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"WaveTune: Wave-aware Bilinear Modeling for Efficient GPU Kernel Auto-tuning","primary_cat":"cs.PF","submitted_at":"2026-04-11T12:41:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"WaveTune introduces a wave-aware bilinear latency predictor and wave-structured sparse sampling to enable fast runtime auto-tuning of GPU kernels, achieving up to 1.83x kernel speedup and 1.33x TTFT reduction with drastically lower overhead.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"been widely adopted to optimize GPU kernel performance by explor- ing large configuration spaces. Early systems such as AutoTVM [6] and Ansor [49] employ learned cost models (e.g., gradient-boosted trees) to guide search and reduce tuning cost. Subsequent works further improve search efficiency through program sampling and transfer learning [32]. Analytical approaches such as Roller [ 50] attempt to model hardware behavior to prune the search space. 11 Zhang et al. 5 10 20 30 W (Maximum Profiled Wave Count) 1.34 1.36 1.38 1.40Geometric Mean Speedup Sequence Length 64 1023 I = 2 I = 3 I = 5 5 10 20 30 1.25 1.50 1.75 2.00 2.25 2.50 Sequence Length 1024 16384 I = 2 I = 3 I = 5 Figure 11: Impact of profiling range (𝑊 , 𝐼) on performance"},{"citing_arxiv_id":"2604.10180","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Tessera: Unlocking Heterogeneous GPUs through Kernel-Granularity Disaggregation","primary_cat":"cs.DC","submitted_at":"2026-04-11T12:19:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Tessera performs kernel-granularity disaggregation on heterogeneous GPUs, achieving up to 2.3x throughput and 1.6x cost efficiency gains for large model inference while generalizing beyond prior methods.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"demand is further amplified by the emergence of agentic AI applications, where a single user request may trigger multiple sequential model invocations, significantly increasing inference workload and cost [6], [7]. In this context, a primary challenge for cloud providers is to maximize the serving performance of heterogeneous GPU clusters, and improve cost efficiency (Perf/$) [8], [9]. However, existing scheduling mechanisms fail to fully exploit the architectural diversity of heterogeneous GPUs, leaving substantial performance and cost-efficiency gains untapped. State-of-the-art.Recent studies [10], [11], [12], [13], [9] have attempted to leverage the GPU heterogeneity by disag- gregation. These methods partition AI workloads according"},{"citing_arxiv_id":"2604.08445","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PG-MDP: Profile-Guided Memory Dependence Prediction for Area-Constrained Cores","primary_cat":"cs.PL","submitted_at":"2026-04-09T16:41:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Profile-guided opcode labeling removes consistently independent loads from the MDP working set, cutting queries 79%, false dependencies 77%, and raising small-core IPC 1.47% on SPEC2017 intspeed.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.07609","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Blink: CPU-Free LLM Inference by Delegating the Serving Stack to GPU and SmartNIC","primary_cat":"cs.DC","submitted_at":"2026-04-08T21:27:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Blink enables CPU-free LLM inference via SmartNIC offload and persistent GPU kernel, delivering up to 8.47x lower P99 TTFT, 3.4x lower P99 TPOT, 2.1x higher decode throughput, and 48.6% lower energy per token while remaining stable under CPU interference.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.07523","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"FILCO: Flexible Composing Architecture with Real-Time Reconfigurability for DNN Acceleration","primary_cat":"cs.AR","submitted_at":"2026-04-08T18:54:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"FILCO introduces a real-time reconfigurable composing architecture for DNN acceleration that achieves 1.3x-5x better throughput and hardware efficiency than prior designs on diverse workloads via an analytical model and two-stage design space exploration.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.07345","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Measurement of Generative AI Workload Power Profiles for Whole-Facility Data Center Infrastructure Planning","primary_cat":"eess.SY","submitted_at":"2026-04-08T17:56:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"High-resolution power profiles for AI workloads on H100 GPUs are measured and scaled to whole-facility energy demand using a bottom-up model, with the dataset made public.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"We simulated operation at a 1 MW data center serving LLM inference in production (online) mode. In accordance with the benchmark tests in Section 3.5 and the power sam- ples in Figure 13, we assumed Llama-3 70B model instances to serve user requests answer- ing coding and conversation questions, respectively. To inform the data center utilization, we used distributions published by Microsoft Azure [16, 15]. These distributions include a probability function dictating the fraction of coding and conversation prompts - 38.1 and 61.9%, respectively - as well as functions to shape how the request rate (prompts/second) varies with time - hourly and by day of the week. Similarly to the colocation use case, 29 Table 4: Inference data center aggregate metrics across the whole year for different target utilization levels."},{"citing_arxiv_id":"2604.07173","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"InfiniLoRA: Disaggregated Multi-LoRA Serving for Large Language Models","primary_cat":"cs.DC","submitted_at":"2026-04-08T15:01:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"InfiniLoRA decouples LoRA execution from base-model inference and reports 3.05x higher request throughput plus 54% more adapters meeting strict latency SLOs.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"https://www.usenix.org/conference/fast25/ presentation/qin [27] Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, and Yuxiong He. 2022. DeepSpeed-MoE: Advancing Mixture-of- Experts Inference and Training to Power Next-Generation AI Scale. arXiv:2201.05596 [cs.LG]https://arxiv.org/abs/2201.05596 [28] John Schulman and Thinking Machines Lab. 2025. LoRA Without Regret.Thinking Machines Lab: Connectionism(2025). doi:10.64434/ tml.20250929https://thinkingmachines.ai/blog/lora/. [29] Sheng Shen, Le Hou, Yanqi Zhou, Nan Du, Shayne Longpre, Jason Wei, Hyung Won Chung, Barret Zoph, William Fedus, Xinyun Chen, Tu Vu, Yuexin Wu, Wuyang Chen, Albert Webson, Yunxuan Li, Vin-"},{"citing_arxiv_id":"2604.06956","ref_index":41,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"NestPipe: Large-Scale Recommendation Training on 1,500+ Accelerators via Nested Pipelining","primary_cat":"cs.DC","submitted_at":"2026-04-08T11:19:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"NestPipe achieves up to 3.06x speedup and 94.07% scaling efficiency on 1,536 workers via dual-buffer inter-batch and frozen-window intra-batch pipelining that overlaps communication with computation.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"lookup bottlenecks [12], [13], [14], [15]. Other pipeline paral- lelism schemes [36], [37], [38] decouple data loading from model computation by prefetching embedding vectors for future batches while the worker computes the current batch. However, these solutions may cause parameter staleness and fundamentally lack reproducibility, thus compromising model convergence [41]. Furthermore, the prohibitive communication bottleneck also continues to limit their effectiveness in decen- tralized embedding training. Even if local lookup latency is hidden, scaling to thousands of workers still exacerbates the communication overhead imposed by the All2All primitive. Another category of works reduces communication over- head via embedding compression, such as hashing [6], [16],"},{"citing_arxiv_id":"2604.06667","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Computing In Spintronic Memory: A Thermal Perspective","primary_cat":"cs.ET","submitted_at":"2026-04-08T04:35:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Spintronic CiM shows uniform temperature that increases linearly with participating memory cells and decreases linearly with array size.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.05505","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Qurator: Scheduling Hybrid Quantum-Classical Workflows Across Heterogeneous Cloud Providers","primary_cat":"quant-ph","submitted_at":"2026-04-07T06:58:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Qurator jointly optimizes queue time and fidelity for hybrid quantum-classical workflows across providers using quantum-aware DAG scheduling and a unified logarithmic fidelity score, achieving 30-75% wait reduction at high load with bounded accuracy cost.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"has addressed parts of the problem, but none has tackled the joint optimization of queue time and fidelity across hetero- geneous providers under the full set of quantum constraints. We survey each area in turn and identify the specific gaps that Qurator fills. 7.1 Classical Task Scheduling Classical scheduling has been studied extensively across single processor [15, 48, 66], grid systems [2, 7], data cen- ters [79], and clusters [ 40, 60], with algorithms spanning static list scheduling [5, 56, 61, 75, 76], task duplication [9, 73], genetic algorithms [ 1, 46, 50, 53], and dynamic tech- niques [26, 32, 37, 42, 57, 59]. Qurator draws most directly from two classical threads. First, scheduling on shared cloud resources where the"},{"citing_arxiv_id":"2603.15042","ref_index":51,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Performance Isolation and Semantic Determinism in Efficient GPU Spatial Sharing","primary_cat":"cs.DC","submitted_at":"2026-03-16T09:48:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CoGPU resolves the tradeoff in GPU sharing by introducing GPU coroutines for semantic-preserving resource migration, delivering up to 79.2% higher training throughput and zero token mismatch in inference.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.09616","ref_index":34,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DCGen 1.1 Technical Report: Generating Datacenter Configurations (including IT, Power, Cooling)","primary_cat":"cs.DC","submitted_at":"2026-03-15T00:34:51+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DCGen generates customizable datacenter configurations with IT, power, and cooling components optimized for power, compute, and area targets using real equipment catalogs and workload-specific IT mixes.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.10726","ref_index":54,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PrefixWall: Mitigating Prefix Caching Side Channels in Shared LLM Systems","primary_cat":"cs.CR","submitted_at":"2026-03-11T12:59:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PrefixWall mitigates APC side channels in multi-tenant LLM systems via selective prefix isolation, delivering up to 70% higher cache reuse and 30% lower latency than full-isolation baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.15172","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"The Turbo-Charged Mapper: Fast and Optimal Mapping for Energy-efficient and Low-latency Accelerator Design","primary_cat":"cs.AR","submitted_at":"2026-02-16T20:21:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"TCM finds provably optimal DNN accelerator mappings by pruning the search space up to 32 orders of magnitude with a new dataplacement concept, delivering 1.2-6.5x better energy-delay-product in 17 seconds instead of hours.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.14910","ref_index":55,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PipeWeave: Synergizing Analytical and Learning Models for Unified GPU Performance Prediction","primary_cat":"cs.PF","submitted_at":"2026-01-21T11:47:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PipeWeave predicts GPU kernel performance with 6.1% average error and end-to-end inference with 8.5% error by feeding analytical pipeline features into ML, cutting prior method errors by 4-7x across 11 GPUs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.06484","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PureMagic: A Dynamic Scheduler for Lattice Surgery","primary_cat":"quant-ph","submitted_at":"2025-12-06T16:16:20+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.11938","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"FlexPipe: Adapting Dynamic LLM Serving Through Inflight Pipeline Refactoring in Fragmented Serverless Clusters","primary_cat":"cs.DC","submitted_at":"2025-10-13T21:01:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"FlexPipe introduces runtime pipeline refactoring for LLMs to achieve higher resource efficiency and lower latency in serverless GPU clusters with fragmentation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.19729","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Amoeba: Runtime Tensor Parallel Transformation for LLM Inference Services","primary_cat":"cs.DC","submitted_at":"2025-09-24T03:15:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Amoeba adaptively adjusts tensor parallelism at runtime for LLM inference services to handle mixed short and long context requests, delivering 1.75x-6.57x throughput gains over prior solutions in real-world trace evaluations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.09505","ref_index":38,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Combating the Memory Walls: Optimization Pathways for Long-Context Agentic LLM Inference","primary_cat":"cs.AR","submitted_at":"2025-09-11T14:49:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"PLENA introduces a co-designed system with three optimization pathways for long-context agentic LLM inference, claiming up to 2.23x throughput over A100 and 4.04x energy efficiency.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.23970","ref_index":75,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Cache Your Prompt When It's Green: Carbon-Aware Caching for Large Language Model Serving","primary_cat":"cs.DC","submitted_at":"2025-05-29T19:52:44+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"GreenCache dynamically manages LLM KV cache resources to reduce carbon emissions by 15.1% on average (up to 25.3%) while meeting latency constraints for over 90% of requests on real traces.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}