{"total":19,"items":[{"citing_arxiv_id":"2606.26861","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Cascaded Multi-Granularity Pruning for On-Device LLM Inference in Industrial IoT","primary_cat":"cs.CL","submitted_at":"2026-06-25T10:44:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Cascaded multi-granularity pruning reaches 13.8x compression on MHA+GELU LLMs for bearing fault diagnosis at 83.82% accuracy while causing ~74pp collapse on GQA+SwiGLU models that violate the formalized Structural Independence Assumption.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00523","ref_index":63,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ProactiveLLM: Learning Active Interaction for Streaming Large Language Models","primary_cat":"cs.CL","submitted_at":"2026-05-30T04:31:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ProactiveLLM enables active interaction in streaming LLMs by learning semantic sufficiency cues from partial inputs through mask-based modeling and synchronized privileged self-distillation without external supervision.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18331","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Prune, Update and Trim: Robust Structured Pruning for Large Language Models","primary_cat":"cs.LG","submitted_at":"2026-05-18T12:48:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Putri is a structured pruning technique for LLMs that compensates for pruning errors via weight updates and sequential processing while pruning at the attention-head level to reach 
state-of-the-art results at extreme sparsity.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17985","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SAFE-SVD: Sensitivity-Aware Fidelity-Enforcing SVD for Physics Foundation Models","primary_cat":"cs.LG","submitted_at":"2026-05-18T07:40:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SAFE-SVD introduces a sensitivity-aware fidelity-enforcing SVD framework for compressing physics foundation models that maintains higher accuracy than standard methods at greater compression ratios.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14738","ref_index":76,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"TAPIOCA: Why Task- Aware Pruning Improves OOD model Capability","primary_cat":"cs.LG","submitted_at":"2026-05-14T12:01:05+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08885","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Compact SO(3) Equivariant Atomistic Foundation Models via Structural Pruning","primary_cat":"cs.LG","submitted_at":"2026-05-09T11:07:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Structural pruning of SO(3) equivariant atomistic models from large checkpoints yields 1.5-4x fewer parameters and 2.5-4x less pre-training compute than small models trained from scratch, while outperforming them on most Matbench Discovery metrics and downstream 
tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Chapman and Hall/CRC, 2022. [13] Manish Gupta and Puneet Agrawal. Compression of deep learning models for text: A survey. ACM Transactions on Knowledge Discovery from Data (TKDD), 16(4):1-55, 2022. [14] Mengzhou Xia, Tianyu Gao, Zhiyuan Zeng, and Danqi Chen. Sheared llama: Accelerating language model pre-training via structured pruning. arXiv preprint arXiv:2310.06694, 2023. [15] Saleh Ashkboos, Maximilian L Croci, Marcelo Gennari do Nascimento, Torsten Hoefler, and James Hensman. Slicegpt: Compress large language models by deleting rows and columns. arXiv preprint arXiv:2401.15024, 2024. [16] Xin Men, Mingyu Xu, Qingyu Zhang, Qianhao Yuan, Bingning Wang, Hongyu Lin, Yaojie Lu, Xianpei Han, and Weipeng Chen. Shortgpt: Layers in large language models are more redundant than you expect."},{"citing_arxiv_id":"2605.08842","ref_index":38,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"XPERT: Expert Knowledge Transfer for Effective Training of Language Models","primary_cat":"cs.CL","submitted_at":"2026-05-09T09:53:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"XPERT extracts and reuses cross-domain expert knowledge from pre-trained MoE LLMs via inference analysis and tensor decomposition to improve performance and convergence in downstream language model training.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08738","ref_index":25,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training","primary_cat":"cs.LG","submitted_at":"2026-05-09T06:50:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Pruning pretrained MoE 
models outperforms training from scratch under fixed budget, different expert compression methods converge after continued training, and progressive pruning plus multi-token KD improves the final 23A2B model.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08568","ref_index":57,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Different Prompts, Different Ranks: Prompt-aware Dynamic Rank Selection for SVD-based LLM Compression","primary_cat":"cs.LG","submitted_at":"2026-05-09T00:02:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PARSE trains a prompt-aware linear router on dense-model outputs to select dynamic SVD ranks, improving accuracy up to 10% at 0.6 compression ratio on LLaMA-7B while delivering 2.5x prefill and 2.4x decode speedups.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Enabling unstructured sparse acceleration on structured sparse accelerators. Proceedings of Machine Learning and Systems, 7, 2025. [56] Shail Dave, Riyadh Baghdadi, Tony Nowatzki, Sasikanth Avancha, Aviral Shrivastava, and Baoxin Li. Hardware acceleration of sparse and irregular tensor computations of ml models: A survey and insights. Proceedings of the IEEE, 109(10):1706-1752, 2021. [57] Saleh Ashkboos, Maximilian L Croci, Marcelo Gennari do Nascimento, Torsten Hoefler, and James Hensman. Slicegpt: Compress large language models by deleting rows and columns. arXiv preprint arXiv:2401.15024, 2024. [58] Jialong Guo, Xinghao Chen, Yehui Tang, and Yunhe Wang. 
Slimllm: Accurate structured pruning for large language models. arXiv preprint arXiv:2505."},{"citing_arxiv_id":"2605.07271","ref_index":46,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Understanding Performance Collapse in Layer-Pruned Large Language Models via Decision Representation Transitions","primary_cat":"cs.CL","submitted_at":"2026-05-08T05:35:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Performance collapse in layer-pruned LLMs stems from disrupting the Silent Phase of decision-making, which blocks the transition to correct predictions, while the later Decisive Phase is robust to pruning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.01732","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"EGAD: Entropy-Guided Adaptive Distillation for Token-Level Knowledge Transfer","primary_cat":"cs.CL","submitted_at":"2026-05-03T06:05:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"EGAD adaptively distills LLM knowledge at the token level by using entropy to create a curriculum from low- to high-entropy tokens, adjust temperature, and switch between logits-only and feature-based branches.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"In recent years, large language models (LLMs) [41] have achieved remarkable progress across diverse domains, largely due to increasing model size and training data. Larger models improve both generation quality and task generalization, but their high computational and memory demands limit deployment in resource-constrained settings. To mitigate this, model compression techniques such as quantization [5], pruning [3, 38, 43], and knowledge distillation (KD) have been proposed. 
Quantization and pruning reduce costs by lowering weight precision or removing redundant parameters, while KD transfers knowledge from a large teacher model to a smaller student, effectively balancing performance and"},{"citing_arxiv_id":"2604.04493","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SLaB: Sparse-Lowrank-Binary Decomposition for Efficient Large Language Models","primary_cat":"cs.LG","submitted_at":"2026-04-06T07:36:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SLaB compresses LLM weights via sparse-lowrank-binary decomposition guided by activation-aware scores, achieving up to 36% lower perplexity than prior methods at 50% compression on Llama models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.03298","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ENEC: A Lossless AI Model Compression Method Enabling Fast Inference on Ascend NPUs","primary_cat":"cs.AR","submitted_at":"2026-03-28T16:11:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ENEC delivers 3.43X higher throughput than DietGPU and 1.12X better compression ratio than nvCOMP for lossless model weight compression on Ascend NPUs, yielding up to 6.3X end-to-end inference speedup.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.01997","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"On the Limits of Layer Pruning for Generative Reasoning in Large Language 
Models","primary_cat":"cs.LG","submitted_at":"2026-02-02T11:57:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Layer pruning preserves classification performance in LLMs but fundamentally limits recovery of generative reasoning capabilities even after extensive self-supervised finetuning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.22671","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Fragile Knowledge, Robust Instruction-Following: The Width Pruning Dichotomy in Llama-3.2","primary_cat":"cs.CL","submitted_at":"2025-12-27T18:09:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Width pruning in Llama-3.2 models reduces parametric knowledge while enhancing instruction-following and preserving reasoning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.12876","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MaskPro: Linear-Space Probabilistic Learning for Strict (N:M)-Sparsity on LLMs","primary_cat":"cs.LG","submitted_at":"2025-06-15T15:02:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MaskPro learns categorical distributions over groups of M weights to generate exact (N:M) sparsity via N-way sampling without replacement and stabilizes training with a moving average tracker of loss residuals.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.17138","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"RAP: Runtime Adaptive Pruning for LLM 
Inference","primary_cat":"cs.LG","submitted_at":"2025-05-22T06:12:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"RAP is a reinforcement learning framework for runtime-adaptive pruning of LLMs that jointly optimizes model weights and KV-cache usage under varying memory budgets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2502.04416","ref_index":26,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Analytical FFN-to-MoE Restructuring via Activation Pattern Analysis","primary_cat":"cs.LG","submitted_at":"2025-02-06T14:05:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"An analytical post-training method restructures FFNs into MoE by partitioning neurons based on activation patterns and building a router from statistics, achieving 1.17x speedup with minimal resources.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2404.14294","ref_index":179,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Survey on Efficient Inference for Large Language Models","primary_cat":"cs.CL","submitted_at":"2024-04-22T15:53:08+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":3.0,"formal_verification":"none","one_line_summary":"The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Attention [158], Reformer [159], Sparse Flash Attention [160], Routing Transformer [161], Sparse Sinkhorn Attention [162], H 2O [163], Diffuser [164] Weight Pruning SparseGPT [165], Wanda [166], ISC 
[167], Prune and Tune [168], OWL [169], BESA [170], oBERT [171], FastPruning [172], RIA [173], LLM-Pruner [174], Sheared LLaMA [175], ZipLM [176], LoRAPrune [177], LoRAShear [178], SliceGPT [179], PLATON [180], CoFi [181], SIMPLE [182], ExpertSparsity [183], SEER-MoE [184], Pruner-Zero [185], DSØT [186] Quantization Quantization-aware Training LLM-QAT [187], Norm Tweaking [188], QLoRA [189], QA-LoRA [190], LoftQ [191] Post-Training Quantization GPTQ [192], LUT-GEMM [193], AWQ [194], OWQ [195], SpQR [196], SqueezeLLM [197], QuIP [198], FineQuant [199], QuantEase [200],"}],"limit":50,"offset":0}