{"total":48,"items":[{"citing_arxiv_id":"2605.22416","ref_index":21,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Asymmetric Virtual Memory Paging for Hybrid Mamba-Transformer Inference","primary_cat":"cs.LG","submitted_at":"2026-05-21T12:37:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AVMP separates KV and SSM cache pools behind unified virtual addressing with failure-triggered migration, cutting OOM events 7.6% and raising throughput 1.83-13.3x on synthetic loads and 2.36x on ShareGPT traces.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21974","ref_index":40,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Format-Constraint Coupling in Knowledge Graph Construction from Statistical Tables","primary_cat":"cs.AI","submitted_at":"2026-05-21T04:08:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Empirical 2x2 factorial study on 6 statistical datasets shows format and schema constraints in LLM-based KG construction from CSV tables produce super-additive fidelity loss up to +1.180, with mismatched pairs falling below baseline, plus release of CSVFidelity-Bench.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17172","ref_index":97,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"OpenJarvis: Personal AI, On Personal Devices","primary_cat":"cs.LG","submitted_at":"2026-05-16T22:00:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OpenJarvis decomposes personal AI into Intelligence, Engine, Agents, Tools & Memory, and Learning primitives and applies LLM-guided spec search to produce on-device configurations that reach 
within 3.2 pp of cloud baselines on average across eight tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15422","ref_index":25,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DualKV: Shared-Prompt Flash Attention for Efficient RL Training with Large Rollouts and Long Contexts","primary_cat":"cs.LG","submitted_at":"2026-05-14T21:11:32+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13784","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Attention Once Is All You Need: Efficient Streaming Inference with Stateful Transformers","primary_cat":"cs.LG","submitted_at":"2026-05-13T17:06:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Stateful sessions with incremental KV cache and flash queries allow O(|q|) latency in streaming transformer inference, delivering up to 5.9x speedup over conventional engines while preserving full attention.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12396","ref_index":18,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"NCCLZ: Compression-Enabled GPU Collectives with Decoupled Quantization and Entropy Coding","primary_cat":"cs.DC","submitted_at":"2026-05-12T16:57:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"NCCLZ decouples quantization and entropy coding across NCCL stack layers to enable overlapped compression, delivering up to 9.65x speedup over plain NCCL on scientific and training 
workloads.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"compression. We first discuss NCCL, and then we briefly summarize data compression. A. NCCL NCCL (NVIDIA Collective Communications Library) is a widely used GPU communication library for distributed deep learning and other multi-GPU workloads. It is inte- grated into mainstream frameworks and systems to execute high-throughput collectives [1], [13]-[18]. NCCL provides collectives such asAllReduce,Broadcast,AlltoAll, Fig. 1. System architecture of NCCLZ, with the interaction between appli- cation workloads, the NCCL runtime, GPU-resident entropy coding, and the underlying transport and network layers. andAllGather; acollectiveoperates on a group of GPU ranks, andAllReduceaggregates values across ranks and"},{"citing_arxiv_id":"2605.10832","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Towards On-Policy Data Evolution for Visual-Native Multimodal Deep Search Agents","primary_cat":"cs.CL","submitted_at":"2026-05-11T16:49:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A new image-bank harness and closed-loop on-policy data evolution method raises multimodal agent performance on visual search benchmarks from 24.9% to 39.0% for an 8B model and from 30.6% to 41.5% for a 30B model.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10670","ref_index":34,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Surviving Partial Rank Failures in Wide Expert-Parallel MoE Inference","primary_cat":"cs.DC","submitted_at":"2026-05-11T14:53:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"EEP makes wide expert-parallel MoE serving survive single-rank failures with an 11s recovery pause, 8s 
reintegration pause, and throughput restored to 95% of pre-fault level within 52s while staying within 4.4% of a fixed-membership baseline in steady state.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11030","ref_index":18,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"An Executable Benchmarking Suite for Tool-Using Agents","primary_cat":"cs.SE","submitted_at":"2026-05-10T21:24:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"The paper delivers a unified executable benchmarking suite for tool-using agents that enforces a shared evidence-admission contract across web, code, and micro-task environments.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09735","ref_index":47,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"KV-RM: Regularizing KV-Cache Movement for Static-Graph LLM Serving","primary_cat":"cs.AR","submitted_at":"2026-05-10T20:10:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"KV-RM regularizes KV-cache movement in static-graph LLM serving via block paging and merge-staged transport to improve throughput, tail latency, and memory use for variable-length decoding.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07985","ref_index":41,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Dooly: Configuration-Agnostic, Redundancy-Aware Profiling for LLM Inference Simulation","primary_cat":"cs.DC","submitted_at":"2026-05-08T16:44:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Dooly reduces LLM inference profiling GPU-hours by 56.4% across 12 
models while keeping simulation MAPE under 5% for TTFT and 8% for TPOT by making profiling configuration-agnostic and redundancy-aware.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"customizable attention engine for llm inference serving, 2025. URL https://arxiv.org/ abs/2501.01005. [40] Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, and Ying Sheng. Sglang: Efficient execution of structured language model programs, 2024. URL https://arxiv.org/abs/2312.07104. [41] Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. Distserve: Disaggregating prefill and decoding for goodput-optimized large language model serving, 2024. URLhttps://arxiv.org/abs/2401.09670. 12 A Comparison with Existing LLM Simulators Dooly RT [5] VD [3] FT [17] LS[11] AP [24] LS2 [12] Profiler Automatic Model Discovery ✓ ✗ ✗ ✗ ✗ ✗ ✗"},{"citing_arxiv_id":"2605.06850","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"How to Compress KV Cache in RL Post-Training? 
Shadow Mask Distillation for Memory-Efficient Alignment","primary_cat":"cs.LG","submitted_at":"2026-05-07T18:51:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Shadow Mask Distillation enables KV cache compression in RL post-training of LLMs by mitigating amplified off-policy bias that defeats standard importance reweighting.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06068","ref_index":84,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?","primary_cat":"cs.AI","submitted_at":"2026-05-07T11:54:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"VibeServe demonstrates that AI agents can synthesize bespoke LLM serving systems end-to-end, remaining competitive with vLLM in standard settings while outperforming it in six non-standard scenarios involving unusual models, workloads, or hardware.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05899","ref_index":59,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"VisMMOE: Exploiting Visual-Expert Affinity for Efficient Visual-Language MoE Offloading","primary_cat":"cs.LG","submitted_at":"2026-05-07T09:11:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VisMMoE exploits visual-expert affinity via token pruning to achieve up to 2.68x faster VL-MoE inference on memory-constrained hardware while keeping accuracy competitive.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05287","ref_index":45,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Securing the 
Agent: Vendor-Neutral, Multitenant Enterprise Retrieval and Tool Use","primary_cat":"cs.CR","submitted_at":"2026-05-06T17:59:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A server-side architecture with policy-aware ingestion and ABAC-based retrieval gating prevents cross-tenant data leakage in multitenant enterprise RAG and agent systems.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.03351","ref_index":30,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"VLMaxxing through FrameMogging Training-Free Anti-Recomputation for Video Vision-Language Models","primary_cat":"cs.CV","submitted_at":"2026-05-05T04:13:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Training-free adaptive reuse of stable visual state in video VLMs reduces follow-up latency by 15-36x on Qwen2.5-VL while preserving correctness on VideoMME, with smaller first-query speedups via pruning.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"operating point, and reports Qwen's keep-rate boundary instead of claiming a broad sparse backend. Long-horizon reuse and systems co-design are directly relevant. QuickVideo [ 26] explicitly co- designs decode and preﬁll. ReKV [ 28], VLCache [ 41], StreamingVLM [ 36], SparseVILA [ 32], and HERMES [ 21] study cached visual or KV state over long contexts, streaming video, or multi-turn multimodal inference. SGLang's RadixAttention [ 30], vLLM/PagedAttention [ 42], and standard preﬁx caching show that shared-preﬁx KV reuse is already a serving systems primitive; C-PERSIST should not be read as inventing that primitive. 
StreamingLLM [ 35] is also adjacent as a text-only long-context KV-retention result built around attention sinks, though it is not a visual state-reuse method. CacheBlend [ 4] is even closer to the repair pattern: it reuses bulk cached KV for RAG"},{"citing_arxiv_id":"2605.02821","ref_index":14,"ref_count":2,"confidence":0.9,"is_internal_anchor":true,"paper_title":"When Is the Same Model Not the Same Service? A Measurement Study of Hosted Open-Weight LLM APIs","primary_cat":"cs.PF","submitted_at":"2026-05-04T16:59:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Hosted open-weight LLM APIs function as time-varying heterogeneous services rather than fixed model artifacts, with demand concentrated, supply-use mismatches, and task-specific routing yielding major cost and throughput gains.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.01060","ref_index":53,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"SURGE: SuperBatch Unified Resource-efficient GPU Encoding for Heterogeneous Partitioned Data","primary_cat":"cs.DC","submitted_at":"2026-05-01T19:51:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SURGE achieves fixed-batch throughput for GPU embedding generation on 800M texts across 40k partitions using 12.6x less memory, 68x faster time-to-first-output, and fault tolerance via a streaming two-threshold policy with an analytical cost model accurate to 2%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.27476","ref_index":14,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"EdgeFM: Efficient Edge Inference for Vision-Language 
Models","primary_cat":"cs.CV","submitted_at":"2026-04-30T06:18:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"EdgeFM is an agent-driven framework that strips non-essential features from VLMs and packages reusable optimized kernels, achieving up to 1.49x speedup over TensorRT-Edge-LLM on NVIDIA Orin while enabling first end-to-end deployment on Horizon Journey hardware.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.26039","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"RaMP: Runtime-Aware Megakernel Polymorphism for Mixture-of-Experts","primary_cat":"cs.LG","submitted_at":"2026-04-28T18:20:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RaMP uses a hardware-derived performance region analysis and a four-parameter wave cost model to select optimal polymorphic kernel configurations for MoE inference from runtime expert histograms, delivering 1.22x kernel and 1.30x end-to-end speedups with 0.93% mean regret after brief profiling.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.25724","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Scalable Inference Architectures for Compound AI Systems: A Production Deployment Study","primary_cat":"cs.AI","submitted_at":"2026-04-28T14:53:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"A deployed modular inference architecture for compound AI systems cut tail latency over 50%, boosted throughput up to 3.9x, and reduced costs 30-40% while handling multi-model agent 
workloads.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.20560","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"LLM StructCore: Schema-Guided Reasoning Condensation and Deterministic Compilation","primary_cat":"cs.CL","submitted_at":"2026-04-22T13:42:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Two-stage Schema-Guided Reasoning with LLM condensation and deterministic compilation achieves macro-F1 of 0.63 on dyspnea CRF filling task and is language-agnostic.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.16864","ref_index":49,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"HieraSparse: Hierarchical Semi-Structured Sparse KV Attention","primary_cat":"cs.DC","submitted_at":"2026-04-18T06:28:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"HieraSparse delivers a hierarchical semi-structured sparse KV attention system that achieves 1.2x KV compression and 4.57x decode attention speedup versus prior unstructured sparsity methods at equivalent sparsity, plus up to 1.85x prefill speedup and 1.37x/1.77x speedups with magnitude pruning and ","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"achieve the expected throughput. Additionally, we have a split- KV design where each thread block only processes a subset of the key and value blocks to increase parallelism. Each block outputs a partial output together with its own log-sum-exp, which are later combined in a lightweight post-processing kernel. The kernel was also optimized for Grouped-Query Attention (GQA) [49], [50], where multiple queries attending Fig. 
4: The performance gain of different optimizations for prefill kernel, measured with32Kcontext and batch size of8 withLlama-3.1-8B-Instructattention setting. to the same KV Cache head are viewed as a short duplicated sequence and reduce the padding overhead. For the prefill phase, three main additional optimizations are implemented to"},{"citing_arxiv_id":"2604.16838","ref_index":17,"ref_count":2,"confidence":0.9,"is_internal_anchor":true,"paper_title":"enclawed: A Configurable, Sector-Neutral Hardening Framework for Single-User AI Assistant Gateways","primary_cat":"cs.CR","submitted_at":"2026-04-18T05:10:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"enclawed is a sector-neutral hardening framework for AI gateways providing signed modules, audit trails, peer attestation, and a 356-case test suite for regulated deployments.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"3 Threat model We assume a single-tenant, single-user gateway running inside a high-trust enclave. The user holds the deploying organization's highest applicable trust tier. The system must: 1. never egress to the public Internet or any external channel/provider; 2. use only locally-hosted inference (e.g. Ollama [18], vLLM [16], LM Studio [19], SGLang [17], local NVIDIA Inference Microservice (NIM)); 3. refuse to render or echo data above the user's authorized tier; 4. refuse to write data below its origin classification (no-write-down); 5. produce a tamper-evident audit trail of every model interaction; 6. 
encrypt every persistent artifact at rest with a key bound to organization-controlled hard- ware;"},{"citing_arxiv_id":"2604.16736","ref_index":36,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"When Agents Go Quiet: Output Generation Capacity and Format-Cost Separation for LLM Document Synthesis","primary_cat":"cs.AI","submitted_at":"2026-04-17T22:48:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLM agents avoid output stalling and reduce generation tokens by 48-72% via deferred template rendering guided by Output Generation Capacity and a Format-Cost Separation Theorem.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05219","ref_index":38,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Sparse Prefix Caching for Hybrid and Recurrent LLM Serving","primary_cat":"cs.LG","submitted_at":"2026-04-17T09:24:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Sparse prefix caching via dynamic programming for optimal checkpoint placement under overlap distributions improves the Pareto frontier for recurrent and hybrid LLM serving on shared-prefix data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.14847","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"TrigReason: Trigger-Based Collaboration between Small and Large Reasoning Models","primary_cat":"cs.AI","submitted_at":"2026-04-16T10:33:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"TrigReason matches large reasoning model accuracy on math and science benchmarks by delegating most steps to small models and intervening selectively on three triggers, cutting 
latency by 43.9% and cost by 73.3%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.15379","ref_index":26,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Fleet: Hierarchical Task-based Abstraction for Megakernels on Multi-Die GPUs","primary_cat":"cs.AR","submitted_at":"2026-04-15T21:49:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Fleet adds a Chiplet-task level to GPU task models, enabling per-chiplet scheduling and cooperative cache reuse in persistent megakernels, yielding 1.3-1.5x lower LLM decode latency and up to 37% less HBM traffic on AMD MI350 hardware.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.11943","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ProbeLogits: Kernel-Level LLM Inference Primitives for AI-Native Operating Systems","primary_cat":"cs.OS","submitted_at":"2026-04-13T18:32:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ProbeLogits performs single-pass logit reading inside the kernel to classify LLM agent actions as safe or dangerous, reaching 97-99% block rates on HarmBench and F1 parity or better than Llama Guard 3 at 2.5x lower latency.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.11554","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Relax: An Asynchronous Reinforcement Learning Engine for Omni-Modal Post-Training at Scale","primary_cat":"cs.CL","submitted_at":"2026-04-13T14:42:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Relax is a new RL training engine with omni-native design and 
async execution that delivers up to 2x speedups over baselines like veRL while converging to equivalent reward levels on Qwen3 models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.10235","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"CodeComp: Structural KV Cache Compression for Agentic Coding","primary_cat":"cs.CL","submitted_at":"2026-04-11T14:38:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CodeComp uses Joern-extracted Code Property Graph priors for training-free structural KV cache compression, outperforming attention-only baselines on bug localization and code generation while matching full-context patch quality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.09852","ref_index":39,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"MEMENTO: Teaching LLMs to Manage Their Own Context","primary_cat":"cs.AI","submitted_at":"2026-04-10T19:30:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MEMENTO trains LLMs to segment reasoning into blocks, generate mementos as dense summaries, and reason forward using only mementos and KV states, cutting peak KV cache by ~2.5x while preserving benchmark accuracy.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.28342","ref_index":34,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Kernel-Smith: A Unified Recipe for Evolutionary Kernel Optimization","primary_cat":"cs.CL","submitted_at":"2026-03-30T12:12:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Kernel-Smith combines evolutionary search with 
RL post-training to generate optimized GPU kernels, achieving SOTA speedups on KernelBench that beat Gemini-3.0-pro and Claude-4.6-opus on NVIDIA Triton and generalize to MetaX MACA.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.03270","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Knowledge Packs: Zero-Token Knowledge Delivery via KV Cache Injection","primary_cat":"cs.CL","submitted_at":"2026-03-22T11:55:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Knowledge Packs deliver knowledge via pre-computed KV caches with exact equivalence under causal masking, achieving zero divergences on tested questions and enabling value-based steering without training.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.09557","ref_index":58,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SPEED-Bench: A Unified and Diverse Benchmark for Speculative Decoding","primary_cat":"cs.DC","submitted_at":"2026-02-10T16:19:56+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.12967","ref_index":31,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Sutradhara: An Intelligent Orchestrator-Engine Co-design for Tool-based Agentic Inference","primary_cat":"cs.DC","submitted_at":"2026-01-19T11:26:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Sutradhara co-designs orchestrator and LLM serving to overlap tool execution with prefill, stream tool dispatch during decode, and use semantic hints for cache management, yielding up to 77% 
higher load at fixed median FTR latency or 15% lower median FTR at fixed load.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.14098","ref_index":51,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Cornfigurator: Automated Planning for Any-to-Any Multimodal Model Serving","primary_cat":"cs.LG","submitted_at":"2025-12-16T05:14:41+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Cornfigurator is the first automated deployment planner for generic any-to-any multimodal models that explores the full range of colocation-to-disaggregation strategies and delivers 1.12x to 6.32x higher goodput than existing systems or expert plans.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.21686","ref_index":16,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Matrix: Peer-to-Peer Multi-Agent Synthetic Data Generation Framework","primary_cat":"cs.CL","submitted_at":"2025-11-26T18:59:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Matrix provides a peer-to-peer multi-agent system for synthetic data generation that scales to tens of thousands of workflows and delivers 2-15x higher throughput than centralized designs without quality loss.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.03092","ref_index":30,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SnapStream: Efficient Long Sequence Decoding on Dataflow Accelerators","primary_cat":"cs.AI","submitted_at":"2025-11-05T00:38:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SnapStream deploys sparse KV attention in a 
production inference system on dataflow accelerators, delivering 4x on-chip memory savings for DeepSeek-671B at 128k context with up to 1832 tokens/sec and minimal accuracy loss on LongBench-v2, AIME24, and LiveCodeBench.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.19669","ref_index":20,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DiffAdapt: Difficulty-Adaptive Reasoning for Token-Efficient LLM Inference","primary_cat":"cs.CL","submitted_at":"2025-10-22T15:16:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"DiffAdapt detects problem difficulty via entropy in reasoning traces and applies one of three fixed inference strategies per question, cutting token usage up to 22.4% with comparable or better accuracy across five models and eight benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.10129","ref_index":14,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"CacheClip: Accelerating RAG with Effective KV Cache Reuse","primary_cat":"cs.LG","submitted_at":"2025-10-11T09:28:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CacheClip accelerates RAG prefill by up to 3.33x via auxiliary-model-guided selective KV recomputation while retaining 85-91% of full-attention quality on NIAH and LongBench.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.09999","ref_index":55,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ServeGen: Workload Characterization and Generation of Large Language Model Serving in 
Production","primary_cat":"cs.DC","submitted_at":"2025-05-15T06:24:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ServeGen characterizes production LLM inference workloads across model types and generates realistic per-client composed workloads that reduce under-provisioning by 50% in a production validation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2504.09775","ref_index":70,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MIST: A Co-Design Framework for Heterogeneous, Multi-Stage LLM Inference","primary_cat":"cs.AR","submitted_at":"2025-04-14T00:29:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MIST is a new simulator for heterogeneous multi-stage LLM inference that combines hardware traces with analytical models to explore configuration trade-offs in hybrid CPU-accelerator systems.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2501.01005","ref_index":16,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving","primary_cat":"cs.DC","submitted_at":"2025-01-02T02:02:20+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FlashInfer delivers a customizable attention kernel that reduces inter-token latency by 29-69% in LLM serving benchmarks via optimized KV-cache storage and load-balanced scheduling compatible with CUDA graphs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2411.15594","ref_index":222,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"A Survey on 
LLM-as-a-Judge","primary_cat":"cs.CL","submitted_at":"2024-11-23T16:03:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A survey on LLM-as-a-Judge that reviews reliability strategies, proposes evaluation methods, and introduces a novel benchmark for assessing such systems.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Alternatively, it can be continuous, ranging from 0 to 1 or 0 to 100 [174]. The simplest way to score is through the context, setting the range of scores and the main criteria for scoring. For example, \"Please rate the helpfulness, relevance, accuracy, level of details of their responses. Each assistant receives an overall score on a scale of 1 to 10, where a higher score indicates better overall performance\" [222]. A slightly more complex way is to provide more detailed scoring criteria. More complex scoring situations can be as Language-Model-as-an-Examiner [8], which use Likert scale scoring functions as an absolute evaluative measure. 
The evaluator assigns scores to a given response along predefined dimensions, including accuracy, coherence, factuality, and"},{"citing_arxiv_id":"2407.21787","ref_index":69,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Large Language Monkeys: Scaling Inference Compute with Repeated Sampling","primary_cat":"cs.LG","submitted_at":"2024-07-31T17:57:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Repeated sampling scales problem coverage log-linearly with sample count, improving SWE-bench Lite performance from 15.9% to 56% using 250 samples.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2407.11550","ref_index":35,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference","primary_cat":"cs.CL","submitted_at":"2024-07-16T09:53:32+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Ada-KV is the first head-wise adaptive KV cache budget allocator for LLMs, using a theoretical loss upper bound to allocate eviction differently per attention head and yielding higher quality than uniform methods on long-context benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2404.14294","ref_index":52,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"A Survey on Efficient Inference for Large Language Models","primary_cat":"cs.CL","submitted_at":"2024-04-22T15:53:08+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":3.0,"formal_verification":"none","one_line_summary":"The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, 
with comparative experiments on representative methods.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"As a result, APAR achieves an average 1.4∼2.0× speed-up on benchmarks and cases a negligible impact on the answer quality. Furthermore, APAR combines their decoding approach with the speculative decoding technique (i.e., Medusa [50]) and serving system (i.e. vLLM [51]) to further improve the inference latency and system throughput, respectively. SGLang [52] introduces a domain-specific language (DSL) in Python featuring primitives that flexibly facilitate LLM programming. The core idea behind SGLang is to analyze dependencies among various generation calls automatically, and perform batch inference and KV cache sharing based on this analysis. With this language, users can implement various prompting strategies easily and benefit"}],"limit":50,"offset":0}