Xing, Joseph E
17 Pith papers cite this work.
citing papers explorer
- K12-KGraph: A Curriculum-Aligned Knowledge Graph for Benchmarking and Training Educational LLMs
K12-KGraph is a textbook-derived knowledge graph that powers a new benchmark revealing LLMs' poor curriculum cognition and a small training corpus that outperforms general instruction data on educational tasks.
- SAGE: A Service Agent Graph-guided Evaluation Benchmark
SAGE is a new multi-agent benchmark that formalizes service SOPs as dynamic dialogue graphs to measure LLM agents on logical compliance and path coverage, uncovering an execution gap and empathy resilience across 27 models in 6 scenarios.
- Analytical Provisioning for Attention-FFN Disaggregated LLM Serving under Stochastic Workloads
A renewal-reward analysis yields a closed-form mean-field rule for the optimal Attention/FFN provisioning ratio in disaggregated LLM serving that accounts for stochastic KV-cache growth and matches simulation optima within 10%.
- SuperInfer: SLO-Aware Rotary Scheduling and Memory Management for LLM Inference on Superchips
SuperInfer improves TTFT SLO attainment by up to 74.7% on GH200 Superchips via SLO-aware rotary scheduling (RotaSched) and full-duplex KV cache rotation (DuplexKV) over NVLink-C2C while preserving TBT and throughput.
- Test-Time Speculation
TTS adapts speculator models online via target model verifications to improve acceptance lengths by up to 72% over prior methods, with gains increasing for longer generations.
- Latent Adversarial Detection: Adaptive Probing of LLM Activations for Multi-Turn Attack Detection
Adversarial restlessness in LLM activations allows five scalar features to detect multi-turn prompt injections at 93.8% accuracy on synthetic data, with cross-model replication but source-dependent generalization to real-world chats.
- TwinGate: Stateful Defense against Decompositional Jailbreaks in Untraceable Traffic via Asymmetric Contrastive Learning
TwinGate deploys a stateful dual-encoder system with asymmetric contrastive learning to detect decompositional jailbreaks in untraceable LLM traffic at high recall and low false-positive rate with negligible latency.
- Flow-Controlled Scheduling for LLM Inference with Provable Stability Guarantees
A flow-control framework for LLM inference derives necessary and sufficient stability conditions and experimentally improves throughput, latency, and KV cache stability over common baselines.
- Language Model Goal Selection Differs from Humans' in a Self-Directed Learning Task
LLMs diverge from human goal selection in self-directed learning by exploiting single solutions with low variability across instances.
- Sandwich: Joint Configuration Search and Hot-Switching for Efficient CPU LLM Serving
Sandwich delivers 2.01x average end-to-end speedup and up to 3.4x latency reduction for CPU LLM serving via phase-wise hot-switching, TopoTree hardware abstraction, and fast-start dynamic kernel generation.
- LLMs Get Lost In Multi-Turn Conversation
LLMs drop 39% in performance during multi-turn conversations due to premature assumptions and inability to recover from early errors.
- Predictive Multi-Tier Memory Management for KV Cache in Large-Scale GPU Inference
A unified KV cache system with architecture-specific sizing, six-tier memory from GPU to filesystems, and Bayesian prediction delivers 7.4x higher batch sizes, 70-84% hit rates, and projected 1.7-2.9x throughput gains.
- NVIDIA Nemotron 3: Efficient and Open Intelligence
NVIDIA releases the Nemotron 3 model family with hybrid Mamba-Transformer architecture, LatentMoE, NVFP4 training, MTP layers, and multi-environment RL post-training for reasoning and agentic tasks.
- A Survey on LLM-as-a-Judge
This survey on LLM-as-a-Judge reviews reliability strategies, proposes evaluation methods, and introduces a novel benchmark for assessing such systems.
- Comparative Characterization of KV Cache Management Strategies for LLM Inference
Benchmarks of vLLM, InfiniGen, and H2O identify conditions under which each KV cache strategy delivers the best trade-off between memory consumption and inference performance.
- A Survey of Scaling in Large Language Model Reasoning
A survey categorizing scaling in LLM reasoning across input size, steps, rounds, training, and future directions, noting that scaling can negatively affect performance.
- Gemma 2: Improving Open Language Models at a Practical Size
Gemma 2 models achieve leading performance at their sizes by combining established Transformer modifications with knowledge distillation for the 2B and 9B variants.