hub

Proceedings of the 29th symposium on operating systems principles , pages=

Efficient memory management for large language model serving with pagedattention , author=

22 Pith papers cite this work. Polarity classification is still indexing.

22 Pith papers citing it

browse 22 citing papers

hub tools

JSON dossier citing papers JSON

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

FlowCompile: An Optimizing Compiler for Structured LLM Workflows

cs.CL · 2026-05-13 · unverdicted · novelty 8.0

FlowCompile performs compile-time design space exploration on structured LLM workflows to produce reusable high-quality configuration sets that outperform routing baselines with up to 6.4x speedup.

Lost in Translation: Do LVLM Judges Generalize Across Languages?

cs.CL · 2026-04-21 · unverdicted · novelty 8.0

MM-JudgeBench shows substantial cross-lingual performance variance in 22 LVLM judges, with model size and architecture as poor predictors of multilingual robustness.

ArgBench: Benchmarking LLMs on Computational Argumentation Tasks

cs.CL · 2026-04-19 · unverdicted · novelty 8.0

ArgBench unifies 33 existing datasets into a standardized benchmark for testing LLMs across 46 argumentation tasks and analyzes the impact of prompting techniques and model factors on performance.

Base Models Look Human To AI Detectors

cs.CL · 2026-05-19 · unverdicted · novelty 7.0

Base model text evades AI detectors better than instruction-tuned text, and the HIP method strengthens this trade-off across model sizes.

Rule2DRC: Benchmarking LLM Agents for DRC Script Synthesis with Execution-Guided Test Generation

cs.LG · 2026-05-15 · unverdicted · novelty 7.0

Rule2DRC is a benchmark for LLM agents synthesizing DRC scripts from natural language rules, paired with SplitTester that improves Best-of-N selection via execution-guided discriminative test generation.

LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues

cs.CL · 2026-05-12 · unverdicted · novelty 7.0

LongMemEval-V2 is a new benchmark where AgentRunbook-C reaches 72.5% accuracy on long-term agent memory tasks, beating RAG baselines at 48.5% and basic coding agents at 69.3%.

Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients

cs.CL · 2026-05-07 · unverdicted · novelty 7.0

POPO uses bounded importance sampling on positive rollouts and a siamese policy network to achieve implicit negative gradients and stable optimization, matching or exceeding GRPO on math benchmarks such as 36.67% on AIME 2025.

Transformers with Selective Access to Early Representations

cs.LG · 2026-05-05 · unverdicted · novelty 7.0 · 2 refs

SATFormer uses a context-dependent gate for selective reuse of early Transformer representations, improving validation loss and zero-shot accuracy especially on retrieval benchmarks.

POSTCONDBENCH: Benchmarking Correctness and Completeness in Formal Postcondition Inference

cs.SE · 2026-05-05 · unverdicted · novelty 7.0

POSTCONDBENCH is a new multilingual benchmark that evaluates LLM postcondition generation on real code using defect discrimination to assess completeness beyond surface matching.

From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents

cs.CL · 2026-05-14 · unverdicted · novelty 6.0

A dataset-agnostic framework converts text tool-calling benchmarks to paired audio evaluations via TTS, speaker variation and noise, then evaluates seven omni-modal models showing model- and task-dependent performance with small text-to-voice gaps.

Search Your Block Floating Point Scales!

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

ScaleSearch optimizes block floating point scales via fine-grained search to cut quantization error by 27% for NVFP4, improving PTQ by up to 15 points on MATH500 for Qwen3-8B and attention PPL by 0.77 on Llama 3.1 70B.

LegalCiteBench: Evaluating Citation Reliability in Legal Language Models

cs.CL · 2026-05-11 · unverdicted · novelty 6.0

LegalCiteBench reveals that current LLMs achieve under 7% accuracy on closed-book legal citation retrieval and completion tasks, with misleading answer rates above 94% for nearly all tested models.

CodeClinic: Evaluating Automation of Coding Skills for Clinical Reasoning Agents

cs.AI · 2026-05-10 · unverdicted · novelty 6.0

CodeClinic benchmark demonstrates that LLM-generated Python skill libraries from clinical guidelines enhance consistency and reduce token consumption by up to 40% compared to zero-shot approaches on MIMIC-IV based tasks.

Structured Recurrent Mixers for Massively Parallelized Sequence Generation

cs.CL · 2026-05-09 · conditional · novelty 6.0 · 2 refs

Structured Recurrent Mixers enable algebraic switching between parallel training and recurrent inference representations, yielding higher throughput, concurrency, and training efficiency than comparable linear-complexity models on language tasks.

ReFlect: An Effective Harness System for Complex Long-Horizon LLM Reasoning

cs.AI · 2026-05-07 · unverdicted · novelty 6.0

ReFlect is a harness that wraps LLMs to detect and recover from reasoning errors, achieving 7-29 pp gains over direct CoT on long-horizon tasks and improving code patch quality to 82-87%.

DASH-KV: Accelerating Long-Context LLM Inference via Asymmetric KV Cache Hashing

cs.CL · 2026-04-21 · unverdicted · novelty 6.0

DASH-KV accelerates long-context LLM inference to linear complexity via asymmetric KV cache hashing and mixed-precision retention, matching full attention performance on LongBench.

Are Large Language Models Economically Viable for Industry Deployment?

cs.CL · 2026-04-21 · unverdicted · novelty 6.0

Small LLMs under 2B parameters achieve better economic break-even, energy efficiency, and hardware density than larger models on legacy GPUs for industrial tasks.

HEALing Entropy Collapse: Enhancing Exploration in Few-Shot RLVR via Hybrid-Domain Entropy Dynamics Alignment

cs.LG · 2026-04-20 · unverdicted · novelty 6.0

HEAL mitigates entropy collapse in few-shot RLVR by selectively adding general-domain data and aligning trajectory-level entropy dynamics, matching full-shot performance with 32 target samples.

Position: Zeroth-Order Optimization in Deep Learning Is Underexplored, Not Underpowered

cs.LG · 2026-05-15 · unverdicted · novelty 5.0

Zeroth-order optimization is underexplored rather than underpowered in deep learning, with limitations stemming from full-space designs that can be addressed via subspace, spectral, and systems-aware approaches.

On Training Large Language Models for Long-Horizon Tasks: An Empirical Study of Horizon Length

cs.AI · 2026-05-04 · unverdicted · novelty 5.0

Longer action horizons bottleneck LLM agent training through instability, but training with reduced horizons stabilizes learning and enables better generalization to longer horizons.

PipeMax: Enhancing Offline LLM Inference on Commodity GPU Servers

cs.DC · 2026-05-04 · unverdicted · novelty 5.0

PipeMax integrates pipeline parallelism with offloading to achieve up to 2.51x higher throughput than vLLM for offline LLM inference on commodity 8-GPU servers.

When LLMs Stop Following Steps: A Diagnostic Study of Procedural Execution in Language Models

cs.CL · 2026-05-01

citing papers explorer

Showing 22 of 22 citing papers.

FlowCompile: An Optimizing Compiler for Structured LLM Workflows cs.CL · 2026-05-13 · unverdicted · none · ref 40
FlowCompile performs compile-time design space exploration on structured LLM workflows to produce reusable high-quality configuration sets that outperform routing baselines with up to 6.4x speedup.
Lost in Translation: Do LVLM Judges Generalize Across Languages? cs.CL · 2026-04-21 · unverdicted · none · ref 5
MM-JudgeBench shows substantial cross-lingual performance variance in 22 LVLM judges, with model size and architecture as poor predictors of multilingual robustness.
ArgBench: Benchmarking LLMs on Computational Argumentation Tasks cs.CL · 2026-04-19 · unverdicted · none · ref 93
ArgBench unifies 33 existing datasets into a standardized benchmark for testing LLMs across 46 argumentation tasks and analyzes the impact of prompting techniques and model factors on performance.
Base Models Look Human To AI Detectors cs.CL · 2026-05-19 · unverdicted · none · ref 35
Base model text evades AI detectors better than instruction-tuned text, and the HIP method strengthens this trade-off across model sizes.
Rule2DRC: Benchmarking LLM Agents for DRC Script Synthesis with Execution-Guided Test Generation cs.LG · 2026-05-15 · unverdicted · none · ref 4
Rule2DRC is a benchmark for LLM agents synthesizing DRC scripts from natural language rules, paired with SplitTester that improves Best-of-N selection via execution-guided discriminative test generation.
LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues cs.CL · 2026-05-12 · unverdicted · none · ref 55
LongMemEval-V2 is a new benchmark where AgentRunbook-C reaches 72.5% accuracy on long-term agent memory tasks, beating RAG baselines at 48.5% and basic coding agents at 69.3%.
Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients cs.CL · 2026-05-07 · unverdicted · none · ref 35
POPO uses bounded importance sampling on positive rollouts and a siamese policy network to achieve implicit negative gradients and stable optimization, matching or exceeding GRPO on math benchmarks such as 36.67% on AIME 2025.
Transformers with Selective Access to Early Representations cs.LG · 2026-05-05 · unverdicted · none · ref 9 · 2 links
SATFormer uses a context-dependent gate for selective reuse of early Transformer representations, improving validation loss and zero-shot accuracy especially on retrieval benchmarks.
POSTCONDBENCH: Benchmarking Correctness and Completeness in Formal Postcondition Inference cs.SE · 2026-05-05 · unverdicted · none · ref 76
POSTCONDBENCH is a new multilingual benchmark that evaluates LLM postcondition generation on real code using defect discrimination to assess completeness beyond surface matching.
From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents cs.CL · 2026-05-14 · unverdicted · none · ref 119
A dataset-agnostic framework converts text tool-calling benchmarks to paired audio evaluations via TTS, speaker variation and noise, then evaluates seven omni-modal models showing model- and task-dependent performance with small text-to-voice gaps.
Search Your Block Floating Point Scales! cs.LG · 2026-05-12 · unverdicted · none · ref 56
ScaleSearch optimizes block floating point scales via fine-grained search to cut quantization error by 27% for NVFP4, improving PTQ by up to 15 points on MATH500 for Qwen3-8B and attention PPL by 0.77 on Llama 3.1 70B.
LegalCiteBench: Evaluating Citation Reliability in Legal Language Models cs.CL · 2026-05-11 · unverdicted · none · ref 33
LegalCiteBench reveals that current LLMs achieve under 7% accuracy on closed-book legal citation retrieval and completion tasks, with misleading answer rates above 94% for nearly all tested models.
CodeClinic: Evaluating Automation of Coding Skills for Clinical Reasoning Agents cs.AI · 2026-05-10 · unverdicted · none · ref 19
CodeClinic benchmark demonstrates that LLM-generated Python skill libraries from clinical guidelines enhance consistency and reduce token consumption by up to 40% compared to zero-shot approaches on MIMIC-IV based tasks.
Structured Recurrent Mixers for Massively Parallelized Sequence Generation cs.CL · 2026-05-09 · conditional · none · ref 1 · 2 links
Structured Recurrent Mixers enable algebraic switching between parallel training and recurrent inference representations, yielding higher throughput, concurrency, and training efficiency than comparable linear-complexity models on language tasks.
ReFlect: An Effective Harness System for Complex Long-Horizon LLM Reasoning cs.AI · 2026-05-07 · unverdicted · none · ref 18
ReFlect is a harness that wraps LLMs to detect and recover from reasoning errors, achieving 7-29 pp gains over direct CoT on long-horizon tasks and improving code patch quality to 82-87%.
DASH-KV: Accelerating Long-Context LLM Inference via Asymmetric KV Cache Hashing cs.CL · 2026-04-21 · unverdicted · none · ref 16
DASH-KV accelerates long-context LLM inference to linear complexity via asymmetric KV cache hashing and mixed-precision retention, matching full attention performance on LongBench.
Are Large Language Models Economically Viable for Industry Deployment? cs.CL · 2026-04-21 · unverdicted · none · ref 45
Small LLMs under 2B parameters achieve better economic break-even, energy efficiency, and hardware density than larger models on legacy GPUs for industrial tasks.
HEALing Entropy Collapse: Enhancing Exploration in Few-Shot RLVR via Hybrid-Domain Entropy Dynamics Alignment cs.LG · 2026-04-20 · unverdicted · none · ref 58
HEAL mitigates entropy collapse in few-shot RLVR by selectively adding general-domain data and aligning trajectory-level entropy dynamics, matching full-shot performance with 32 target samples.
Position: Zeroth-Order Optimization in Deep Learning Is Underexplored, Not Underpowered cs.LG · 2026-05-15 · unverdicted · none · ref 118
Zeroth-order optimization is underexplored rather than underpowered in deep learning, with limitations stemming from full-space designs that can be addressed via subspace, spectral, and systems-aware approaches.
On Training Large Language Models for Long-Horizon Tasks: An Empirical Study of Horizon Length cs.AI · 2026-05-04 · unverdicted · none · ref 30
Longer action horizons bottleneck LLM agent training through instability, but training with reduced horizons stabilizes learning and enables better generalization to longer horizons.
PipeMax: Enhancing Offline LLM Inference on Commodity GPU Servers cs.DC · 2026-05-04 · unverdicted · none · ref 13
PipeMax integrates pipeline parallelism with offloading to achieve up to 2.51x higher throughput than vLLM for offline LLM inference on commodity 8-GPU servers.
When LLMs Stop Following Steps: A Diagnostic Study of Procedural Execution in Language Models cs.CL · 2026-05-01 · unreviewed · ref 36

Proceedings of the 29th symposium on operating systems principles , pages=

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer