Gonzalez, Hao Zhang, and Ion Stoica

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E · 2023

5 Pith papers cite this work. Polarity classification is still indexing.

5 Pith papers citing it

browse 5 citing papers

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

DharmaOCR: Specialized Small Language Models for Structured OCR that outperform Open-Source and Commercial Baselines

cs.CV · 2026-04-15 · unverdicted · novelty 7.0

DharmaOCR models reach 0.925 and 0.911 extraction scores with 0.40% and 0.20% degeneration rates on a new benchmark covering printed, handwritten, and legal documents, outperforming open-source and commercial baselines via SFT plus DPO.

DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads

cs.CL · 2024-10-14 · conditional · novelty 7.0

DuoAttention identifies retrieval heads requiring full KV cache and streaming heads using constant-length cache to reduce memory and latency in long-context LLM inference.

PathCal: State-Aware Reflection-Marker Calibration for Efficient Reasoning

cs.AI · 2026-05-21 · unverdicted · novelty 6.0

PathCal calibrates reasoning paths by type-aware soft rebalancing of reflection-marker logits at uncertain states, yielding better efficiency-performance trade-offs on six benchmarks.

Uncovering Intra-expert Activation Sparsity for Efficient Mixture-of-Expert Model Execution

cs.LG · 2026-05-09 · conditional · novelty 6.0

Pre-trained MoE models exhibit up to 90% intra-expert activation sparsity that enables up to 2.5x faster MoE layer execution when exploited in the vLLM inference system.

MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction

cs.CL · 2026-04-30 · unverdicted · novelty 6.0

MiniCPM-o 4.5 uses the Omni-Flow streaming framework to deliver real-time full-duplex omni-modal interaction with proactive behavior in a 9B model that approaches Gemini 2.5 Flash performance.

citing papers explorer

Showing 5 of 5 citing papers.

DharmaOCR: Specialized Small Language Models for Structured OCR that outperform Open-Source and Commercial Baselines cs.CV · 2026-04-15 · unverdicted · none · ref 35
DharmaOCR models reach 0.925 and 0.911 extraction scores with 0.40% and 0.20% degeneration rates on a new benchmark covering printed, handwritten, and legal documents, outperforming open-source and commercial baselines via SFT plus DPO.
DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads cs.CL · 2024-10-14 · conditional · none · ref 28
DuoAttention identifies retrieval heads requiring full KV cache and streaming heads using constant-length cache to reduce memory and latency in long-context LLM inference.
PathCal: State-Aware Reflection-Marker Calibration for Efficient Reasoning cs.AI · 2026-05-21 · unverdicted · none · ref 29
PathCal calibrates reasoning paths by type-aware soft rebalancing of reflection-marker logits at uncertain states, yielding better efficiency-performance trade-offs on six benchmarks.
Uncovering Intra-expert Activation Sparsity for Efficient Mixture-of-Expert Model Execution cs.LG · 2026-05-09 · conditional · none · ref 20
Pre-trained MoE models exhibit up to 90% intra-expert activation sparsity that enables up to 2.5x faster MoE layer execution when exploited in the vLLM inference system.
MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction cs.CL · 2026-04-30 · unverdicted · none · ref 88
MiniCPM-o 4.5 uses the Omni-Flow streaming framework to deliver real-time full-duplex omni-modal interaction with proactive behavior in a 9B model that approaches Gemini 2.5 Flash performance.

Gonzalez, Hao Zhang, and Ion Stoica

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer