{"total":12,"items":[{"citing_arxiv_id":"2606.29223","ref_index":7,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Depth Exploration for LLM Decoding","primary_cat":"cs.LG","submitted_at":"2026-06-28T06:22:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DEX replaces single-depth selection with parallel exploration over multiple candidate depths, committing the final-depth token while collapsing reusable states to reduce per-token computation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.26538","ref_index":5,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"CascadeFormer: Depth-Tapered Transformers Motivated by Gradient Fan-in Asymmetry","primary_cat":"cs.LG","submitted_at":"2026-06-25T02:25:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CascadeFormer tapers Transformer width with depth based on gradient fan-in asymmetry to match uniform baselines in perplexity while cutting latency 8.6%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.20295","ref_index":138,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Token-Operations-Oriented Inference Optimization Techniques for Large Models","primary_cat":"cs.SE","submitted_at":"2026-06-18T14:33:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"The paper introduces a four-layer technical architecture for token-operations-oriented inference optimization in large models and reviews key technologies and industry status at each 
layer.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.05742","ref_index":22,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"AdaPLD: Adaptive Retrieval and Reuse for Efficient Model-Free Speculative Decoding","primary_cat":"cs.CL","submitted_at":"2026-06-04T06:09:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"AdaPLD adaptively mixes lexical and semantic retrieval with branched reuse to improve model-free speculative decoding and reports up to 3.10x speedup across benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.27965","ref_index":2,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"The Shape of Overthinking: Backtracking Bursts in Long Reasoning Traces","primary_cat":"cs.AI","submitted_at":"2026-05-27T05:01:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"On 6000 Qwen3-8B AIME traces, late-clustered moderate-to-severe backtracks are more common in incorrect outputs, enabling prefix-causal burst-aware filtering that outperforms fixed-length cutoffs at shallow and intermediate depths.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20998","ref_index":24,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Single-Pass, Depth-Selective Reading for Multi-Aspect Sentiment Analysis","primary_cat":"cs.CL","submitted_at":"2026-05-20T10:37:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DABS is a single-pass framework that builds a depth-ordered substrate from one Transformer encoding and performs lightweight aspect-conditioned readout, cutting 
computation by up to 60% on multi-aspect ATSA benchmarks while matching prior accuracy.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09165","ref_index":31,"ref_count":2,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Sparse Layers are Critical to Scaling Looped Language Models","primary_cat":"cs.LG","submitted_at":"2026-05-09T20:58:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Looped-MoE models scale better than dense looped or standard transformers because routing changes across loops, and they enable stronger compute-quality trade-offs via early exits at loop boundaries.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Together, these findings suggest that Looped-MoE models are a practical path toward language models that are cheaper to store, faster to run, and competitive in quality. Limitations. µP transfer does not hold when model depth changes, so we hold depth constant and scale width. Future work could validate with depth-scaling extensions such as CompleteP [31], though this adds another axis of comparison: the number of unique layers repeated by looping would also change with depth. 
Due to compute constraints, we did not scale our four architectures beyond 305M/711M (active/stored) parameters, but rely on the principle that compute-optimal scaling laws fitted at smaller scales predict larger model performance."},{"citing_arxiv_id":"2604.14612","ref_index":3,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"ConfLayers: Adaptive Confidence-based Layer Skipping for Self-Speculative Decoding","primary_cat":"cs.LG","submitted_at":"2026-04-16T04:39:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ConfLayers dynamically skips LLM layers based on confidence scores to create adaptive draft models for self-speculative decoding, reporting up to 1.4x speedup over standard generation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.12946","ref_index":25,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Parcae: Scaling Laws For Stable Looped Language Models","primary_cat":"cs.LG","submitted_at":"2026-04-14T16:43:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Parcae stabilizes looped LLMs via spectral norm constraints on injection parameters, enabling power-law scaling for training FLOPs and saturating exponential scaling at test time that improves quality over fixed-depth baselines under fixed parameter budgets.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"and Carole-Jean Wu. Layerskip: Enabling early exit inference and self-speculative decoding. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), page 12622-12642. Association for Computational Linguistics, 2024. doi: 10.18653/v1/2024.acl-long.681. URL http://dx.doi.org/10.18653/v1/2024.acl-long.681. 
[25] Katie Everett, Lechao Xiao, Mitchell Wortsman, Alexander A. Alemi, Roman Novak, Peter J. Liu, Izzeddin Gur, Jascha Sohl-Dickstein, Leslie Pack Kaelbling, Jaehoon Lee, and Jeffrey Pennington. Scaling exponents across parameterizations and optimizers, 2024. URL https://arxiv.org/abs/2407.05872. [26] Jonas Geiping and Tom Goldstein. Cramming: Training a language model on a single GPU in"},{"citing_arxiv_id":"2604.18592","ref_index":6,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Two-dimensional early exit optimisation of LLM inference","primary_cat":"cs.CL","submitted_at":"2026-03-27T15:27:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Coordinating layer-wise and sentence-wise early exits in LLMs produces multiplicative speedups of 1.4-2.3x over single-dimension early exit on sentiment classification tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.14004","ref_index":68,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models","primary_cat":"cs.CL","submitted_at":"2026-01-20T14:23:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"The survey organizes mechanistic interpretability techniques into a Locate-Steer-Improve framework to enable actionable improvements in LLM alignment, capability, and efficiency.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.15461","ref_index":9,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"All is Not Lost: LLM Recovery without 
Checkpoints","primary_cat":"cs.DC","submitted_at":"2025-06-18T13:48:33+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CheckFree recovers intermediate stage failures in pipeline-parallel LLM training via neighbor averaging; CheckFree+ adds out-of-order execution to handle first/last stages by copying neighbors, with small embedding storage, outperforming checkpointing and redundancy at 5-10% failure rates by up to  ","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}