TCM-Serve applies modality-aware scheduling to reduce average TTFT by 54% and 78.5% for latency-critical requests in MLLM inference.
Longbench: A bilingual, multitask benchmark for long context understanding, 2024
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
years
2026 2verdicts
UNVERDICTED 2representative citing papers
STS repurposes draft-model attention scores from speculative decoding to build token-and-head-wise sparsity masks, delivering 2.67x speedup at ~90% sparsity on NarrativeQA with negligible accuracy loss.
citing papers explorer
-
TCM-Serve: Modality-aware Scheduling for Multimodal Large Language Model Inference
TCM-Serve applies modality-aware scheduling to reduce average TTFT by 54% and 78.5% for latency-critical requests in MLLM inference.
-
STS: Efficient Sparse Attention with Speculative Token Sparsity
STS repurposes draft-model attention scores from speculative decoding to build token-and-head-wise sparsity masks, delivering 2.67x speedup at ~90% sparsity on NarrativeQA with negligible accuracy loss.