{"total":49,"items":[{"citing_arxiv_id":"2607.01127","ref_index":32,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"$\\text{Log}_\\text{b}$Quant: Quantizing Language Models in Logarithmic Space","primary_cat":"cs.CL","submitted_at":"2026-07-01T16:13:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Log_b Quant is an adjustable-base logarithmic quantization technique that outperforms tensor-wise asymmetric linear quantization at 4-bit precision on language model benchmarks while providing memory savings.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.06448","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Agent Memory: Characterization and System Implications of Stateful Long-Horizon Workloads","primary_cat":"cs.AI","submitted_at":"2026-06-04T17:44:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"The paper delivers the first systems characterization of agent memory, with a four-axis taxonomy, phase-aware profiler, evaluation of ten systems on two benchmarks, and ten design recommendations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.05017","ref_index":19,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"GoldenFloat: A Phi-Derived Static-Split Floating-Point Family from GF4 to GF1024 with a Lucas-Exact Integer Identity","primary_cat":"cs.AR","submitted_at":"2026-06-03T15:41:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"GoldenFloat introduces a phi-derived rule for setting exponent and fraction widths across floating-point formats from 4 to 1024 bits, backed by open RTL generator, Lucas-exact 
accumulator, and FPGA implementation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.04115","ref_index":20,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"dMX: Differentiable Mixed-Precision Assignment for Low-Precision Floating-Point Formats","primary_cat":"cs.LG","submitted_at":"2026-06-02T18:23:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"dMX is a differentiable mixed-precision framework that learns per-layer MXFP bit-width assignments for LLMs and outperforms KL-based heuristics on perplexity and zero-shot accuracy under bit-width budgets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.06521","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"P-Cast Precision in FP8 Attention: Sink-Induced Collapse and the Optimality of S=2^8","primary_cat":"cs.AR","submitted_at":"2026-06-02T17:29:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Forward KV iteration in FP8 attention produces P-collapse under attention sink; reverse iteration with S=256 removes it and is optimal among bit-exact scales.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.04028","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Novel Aspects of IEEE SA P3109 Arithmetic Formats for Machine Learning","primary_cat":"cs.LG","submitted_at":"2026-06-01T19:27:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"partial","one_line_summary":"IEEE P3109 defines a family of adjustable low-precision floating-point formats for ML with decoding to extended reals, multiple rounding modes, block 
operations, kappa-approximation for approximations, and mechanical verification.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00539","ref_index":38,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"GNMR: Runtime Stability Control for Low-Precision Large Language Model Training","primary_cat":"cs.LG","submitted_at":"2026-05-30T05:11:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"GNMR is a gradient-norm-based controller that maps local stability signals to budgeted recovery actions to stabilize low-precision LLM training while preserving quality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00312","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Stochastic Rounding Increases Small Singular Values","primary_cat":"math.NA","submitted_at":"2026-05-29T19:36:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Stochastic rounding lifts clusters of small singular values even in constant aspect ratio matrices, extending its role as a spectral regularizer.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.31268","ref_index":50,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Mellum2 Technical Report","primary_cat":"cs.CL","submitted_at":"2026-05-29T13:01:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"Mellum 2 is a 12B MoE model with 2.5B active parameters, trained on 10.6T tokens with MoE, GQA, SWA, and MTP, then post-trained into Instruct and Thinking variants, claimed competitive with 4B-14B models at 2.5B 
compute.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.28704","ref_index":18,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Expressive Power of Floating-Point Neural Networks with Arbitrary Reduction Orders and Inexact Activation Implementations","primary_cat":"cs.LG","submitted_at":"2026-05-27T16:30:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Floating-point neural networks achieve universal representability for practical activations like ReLU, sigmoid, and tanh under arbitrary reduction orders and bounded ulp errors in activations via a new distinguishability condition.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.28691","ref_index":20,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"OSP-Next: Efficient High-Quality Video Generation with Sparse Sequence Parallelism, HiF8 Quantization, and Reinforcement Learning","primary_cat":"cs.CV","submitted_at":"2026-05-27T16:19:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"OSP-Next reports 83.73% VBench score and up to 2.27x speedup via hybrid sparse attention, SSP parallelism, HiF8 quantization, and Mix-GRPO on diffusion transformers.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.07581","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Training-Inference Kernel Contracts: Bounding Divergence in Post-Training and Deployment","primary_cat":"cs.LG","submitted_at":"2026-05-26T21:48:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Introduces kernel contracts framework with derived bounds 
on divergence from logit drift to reward drift, specialized for RL post-training under support and norm assumptions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.26189","ref_index":8,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Max-Window Scale Estimation for Near-Lossless HiF8 W8A8 Quantization-Aware Training","primary_cat":"cs.LG","submitted_at":"2026-05-25T09:19:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Identifies amax saturation and catastrophic forgetting in HiF8 W8A8 QAT for OpenPangu-Embedded-1B and mitigates them with 64-step max-window DTS and 500-step BF16 warmup at lr=1e-5 to achieve under 0.6% benchmark drops.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23656","ref_index":23,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Recursive Block-Diagonal Coupling for Resource-Efficient Training of Vision Models","primary_cat":"cs.CV","submitted_at":"2026-05-22T14:08:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RBDC trains wide vision models by recursive block-diagonal coupling of narrower pre-trained models, reducing training FLOPs by 30% at similar ImageNet accuracy for DeiT and ResNet while outperforming model growth baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20402","ref_index":28,"ref_count":3,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Decomposing MXFP4 quantization error for LLM reinforcement learning: reducible bias, recoverable deadzone, and an irreducible 
floor","primary_cat":"cs.LG","submitted_at":"2026-05-19T18:59:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MXFP4 quantization error decomposes into scale bias, deadzone truncation, and grid noise; mode-targeted corrections recover BF16 accuracy within 0.7% on Qwen2.5-3B and exceed it by 1.0% on Qwen3-30B-A3B.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18739","ref_index":51,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation","primary_cat":"cs.CV","submitted_at":"2026-05-18T17:57:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LongLive-2.0 delivers an NVFP4 parallel infrastructure that enables direct training of long multi-shot autoregressive diffusion video models and achieves up to 2.15x training and 1.84x inference speedups on Blackwell and other GPUs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18733","ref_index":29,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Advancing Narrative Long Video Generation via Training-Free Identity-Aware Memory","primary_cat":"cs.CV","submitted_at":"2026-05-18T17:54:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"IAMFlow is a training-free identity-aware memory system that tracks entities via LLM global ID assignment and VLM frame verification to reduce identity drift in narrative long video generation from shifting 
prompts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17745","ref_index":28,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"StatQAT: Statistical Quantizer Optimization for Deep Networks","primary_cat":"stat.ML","submitted_at":"2026-05-18T01:56:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A statistical error analysis framework yields iterative and analytic quantizers that improve accuracy and stability when incorporated into quantization-aware training for integer and floating-point formats.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13907","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"AIS: Adaptive Importance Sampling for Quantized RL","primary_cat":"stat.ML","submitted_at":"2026-05-13T03:36:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AIS adaptively corrects non-stationary policy gradient bias in quantized LLM RL, matching BF16 performance while retaining 1.5-2.76x FP8 rollout speedup.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12464","ref_index":144,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Search Your Block Floating Point Scales!","primary_cat":"cs.LG","submitted_at":"2026-05-12T17:50:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ScaleSearch optimizes block floating point scales via fine-grained search to cut quantization error by 27% for NVFP4, improving PTQ by up to 15 points on MATH500 for Qwen3-8B and attention PPL by 0.77 on Llama 3.1 
70B.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11999","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"The Illusion of Power Capping in LLM Decode: A Phase-Aware Energy Characterisation Across Attention Architectures","primary_cat":"cs.DC","submitted_at":"2026-05-12T11:48:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Power capping is illusory in LLM decode as memory-bound operation leaves power headroom untouched on 700 W GPUs, while SM clock locking saves up to 32% energy and three DVFS classes appear across attention types.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"default governor baselines rather than power capping, leaving open the question of whether the two mechanisms are even comparable for memory-bound decode. No prior work explicitly poses the question (is power capping effective for memory-bound LLM decode?), and none gives the negative answer. Equally absent is a cross-architecture energy comparison. Existing inference studies focus on throughput and latency [8,10,13,4,19], not energy. The few studies that do report inference energy or apply DVFS restrict themselves to
standard GQA/MHA transformers [16,11,20]; none examines novel attention replacements (MLA's compressed KV path, linear-attention hybrids such as GDN, or SSM-based models such as Mamba2), where qualitatively different kernel types may alter the DVFS response entirely."},{"citing_arxiv_id":"2605.11255","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"HEBATRON: A Hebrew-Specialized Open-Weight Mixture-of-Experts Language Model","primary_cat":"cs.CL","submitted_at":"2026-05-11T21:27:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Hebatron is the first open-weight Hebrew MoE LLM adapted from Nemotron-3, reaching 73.8% on Hebrew reasoning benchmarks while activating only 3B parameters per pass and supporting 65k-token context.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11111","ref_index":42,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ShardTensor: Domain Parallelism for Scientific Machine Learning","primary_cat":"cs.DC","submitted_at":"2026-05-11T18:20:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ShardTensor is a domain-parallelism system for SciML that enables flexible scaling of extreme-resolution spatial datasets by removing the constraint of batch size one per device.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"However, both methods can reduce GPU memory usage on high resolution data with no detrimental impacts on data resolution or model accuracy. Better still, these optimizations are only needed during training, and inference can proceed fully optimized. 
• Sparsity or Lower Dimensional Representations can enable alternative methods such as SparseConvNets [41], Minkowski Networks [42], FigConvNet [43] and other methods. In many cases, especially as spatial dimensionality rises, taking advantage of inherent structure and sparsity of the data structures of scientific data is crucial to achieving both accurate results and high performance for machine learning. III. RELATED WORK Other methods and techniques of parallelization for machine"},{"citing_arxiv_id":"2605.07245","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"TransDot: An Area-efficient Reconfigurable Floating-Point Unit for Trans-Precision Dot-Product Accumulation for FPGA AI Engines","primary_cat":"cs.AR","submitted_at":"2026-05-08T05:06:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"TransDot unifies SIMD FMA and trans-precision DPA in one reconfigurable FPU, achieving 2x FP16, 4x FP8, and 8x FP4 throughput with FP32 accumulation plus 1.46x to 2.92x area efficiency gains over the FPnew baseline.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"next-generation AMD Versal AI engines. Index Terms: Floating-point unit, dot-product accumulation, mixed-precision arithmetic, FPGA AI engine, reconfigurable datapath I. INTRODUCTION Modern AI accelerators increasingly rely on reduced-precision arithmetic to improve performance and energy efficiency. Low-precision formats such as FP16 [1], bfloat16, FP8 [2], [3], and FP4 [4] have become widely adopted in both training and inference, with computation dominated by repeated multiply-accumulate operations [5], [6]. 
However, accumulation needs to be performed at higher precision to preserve numerical stability, as each output accumulates many products and low-precision accumulation leads to excessive rounding error and degraded convergence [7], [8]."},{"citing_arxiv_id":"2605.06057","ref_index":29,"ref_count":2,"confidence":0.9,"is_internal_anchor":true,"paper_title":"FalconGEMM: Surpassing Hardware Peaks with Lower-Complexity Matrix Multiplication","primary_cat":"cs.DC","submitted_at":"2026-05-07T11:41:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FalconGEMM delivers a framework with deployment, group-parallel execution, and analytical decision modules that makes lower-complexity matrix multiplication practical, beating cuBLAS and similar libraries by 7.59-17.85% on LLM tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Workloads and Metrics. We extract linear layer shapes (N, K) from three open-source LLMs (DeepSeek-R1 [26], Qwen3.5-397B [27], and HunyuanVideo [28]). Supported data types across all platforms include FP32, BF16, FP16, and FP8, depending on hardware capabilities. For FP8, we adopt the full BF16-to-quantized-FP8 workflow with 1×128 block-wise scaling for FP8 E4M3 [29], consistent with widely used CUTLASS [7] and DeepGEMM [6]. Baselines. We select the state-of-the-art competitors as baselines. 
For standard GEMM, we use cuBLAS [4] (embedded in CUDA Toolkit) on NVIDIA GPUs, Intel MKL [5] on Intel x86, OpenBLAS [30] on AMD x86, and ACL [31] (v52.8) on ARM. AlphaTensor [3] (relying on JAX) serves as the state-of-the-art cross-platform LCMA competitor."},{"citing_arxiv_id":"2605.05683","ref_index":65,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Spectral Lens: Activation and Gradient Spectra as Diagnostics of LLM Optimization","primary_cat":"stat.ML","submitted_at":"2026-05-07T05:19:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Spectral analysis of activations and gradients provides new diagnostics that link batch size to representation geometry, early covariance tails to token efficiency, and spectral shifts to learning dynamics in decoder-only LLMs, backed by a mechanistic model.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"$(1-e^{-x_{r,B}(t)})^2 = \\eta_B^2 t^2 r^{-2q_B}(1+o(1))$, (64) which proves the tail formula and (61). The crossover statement is exactly the regime $x_{r,B}(t) \\asymp 1$. $\\square$ Theorem F.7 (Band-recruitment law). Assume (54)-(57), zero initialization, and monotone rates in rank. Fix a cutoff $R$ and a tolerance $\\delta \\in (0,1)$, and define $T_{R,\\delta}(B) := \\inf\\{t \\ge 0 : a_r(B,t) \\ge (1-\\delta)\\beta_r \\text{ for all } 1 \\le r \\le R\\}$. (65) Then $T_{R,\\delta}(B) = \\frac{\\log(1/\\delta)}{\\eta_B} R^{q_B} = \\frac{\\log(1/\\delta)}{\\eta_B} R^{(\\alpha_{\\mathrm{tail}}(B)-p)/2}$. (66) Consequently, the exact family-local objective is to maximize the tail-band rate $\\eta_B R^{-q_B}$. Proof. Because the rates are monotone decreasing in $r$, the slowest mode among $\\{1, \\ldots, R\\}$ is mode $R$. 
Under zero initialization, $a_R(B,t) = \\beta_R (1-e^{-\\eta_B t R^{-q_B}})$. (67) The condition $a_R(B,t) \\ge (1-\\delta)\\beta_R$ is equivalent to"},{"citing_arxiv_id":"2605.05331","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ViTok-v2: Scaling Native Resolution Auto-Encoders to 5 Billion Parameters","primary_cat":"cs.CV","submitted_at":"2026-05-06T18:03:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ViTok-v2 is a 5B-parameter native-resolution image autoencoder using NaFlex and DINOv3 loss that matches or exceeds prior tokenizers at 256p and outperforms them at 512p and above while advancing the Pareto frontier in joint scaling with generators.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.02568","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"StreamIndex: Memory-Bounded Compressed Sparse Attention via Streaming Top-k","primary_cat":"cs.LG","submitted_at":"2026-05-04T13:19:29+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Chunked streaming top-k enables CSA indexer execution at 1M sequence length with 6.21 GB peak memory and >=0.998 recall on synthetic V4-shaped inputs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.25759","ref_index":22,"ref_count":2,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Neural-Network-Based Variational Method in Nuclear Density Functional Theory: Application to the Extended Thomas-Fermi Model","primary_cat":"nucl-th","submitted_at":"2026-04-28T15:23:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Neural networks represent densities in a variational extended Thomas-Fermi model, yielding binding energies 
within 0.5% of prior ETF results and reproducing nuclear pasta phases.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"with the center-of-mass motion, we subtract it approximately as $E_{\\mathrm{cm}} \\approx E_{\\mathrm{kin}}/A$, $A = \\int_\\Omega dr (n_n + n_p)$, (20) and define the corrected kinetic energy as $E^{\\mathrm{eff}}_{\\mathrm{kin}} = E_{\\mathrm{kin}} - E_{\\mathrm{cm}}$. (21) This correction is employed here as a simple approximate treatment of the center-of-mass motion. 2.2.2. Interaction Energy The interaction energy is given by $E_{\\mathrm{int}} = \\int_\\Omega dr \\mathcal{E}_{\\mathrm{int}}(r)$, (22) where the interaction-energy density is written as $\\mathcal{E}_{\\mathrm{int}}(r) = \\sum_{q=n,p} W_q \\cdot J_q + B_1 n^2 + B_2 \\sum_{q=n,p} n_q^2 + B_5 n \\Delta n + B_6 \\sum_{q=n,p} n_q \\Delta n_q + B_7 n^{\\alpha+2} + B_8 n^{\\alpha} \\sum_{q=n,p} n_q^2$. (23) The spin-orbit field and the spin-current density are defined by $W_q = B_9 (\\nabla n + \\nabla n_q)$, (24) $J_q = -\\left(\\frac{2m_q}{\\hbar^2}\\right) \\frac{n_q W_q}{f_q}$. (25) Accordingly, the functional adopted here is expressed solely in"},{"citing_arxiv_id":"2604.24088","ref_index":37,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"TACO: Efficient Communication Compression of Intermediate Tensors for Scalable Tensor-Parallel LLM Training","primary_cat":"cs.DC","submitted_at":"2026-04-27T06:27:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"TACO compresses tensor-parallel intermediate tensors with an adaptive FP8 scheme and fused kernels, yielding up to 1.87X throughput gains on GPT and Qwen models with near-lossless accuracy.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"of ($3 \\times 10^{-4} \\to 3 \\times 10^{-5}$), with a global batch size of 64 for 10,000 iterations, to evaluate the generalization ability of the proposed method across different model families and data distributions. 
In Section 5.5, we conduct large-scale evaluations under a 3D parallel training configuration, where GPT-6.7B is trained from scratch on the Pile dataset using PyTorch [37] v2.5.1 and Megatron-LM [43], with parallelism configured as (TP = 4, PP = 2, DP = 2). Metrics. We report End-to-end throughput (TFLOPS) to measure efficiency, and Model quality (Validation/Test Loss) to evaluate convergence. Degradation (Deg.) is reported as the relative percentage increase in loss relative to the BF16 baseline. 5.2 Evaluation of Accuracy with TP"},{"citing_arxiv_id":"2604.15416","ref_index":26,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"StoSignSGD: Unbiased Structural Stochasticity Fixes SignSGD for Training Large Language Models","primary_cat":"cs.LG","submitted_at":"2026-04-16T17:55:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"StoSignSGD resolves SignSGD divergence on non-smooth objectives via structural stochasticity, matching optimal convex rates and improving non-convex bounds while delivering 1.44-2.14x speedups in FP8 LLM pretraining.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.10390","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"LLM-PRISM: Characterizing Silent Data Corruption from Permanent GPU Faults in LLM Training","primary_cat":"cs.AR","submitted_at":"2026-04-12T00:35:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLMs resist low-frequency permanent GPU faults but certain datapaths and precision formats trigger catastrophic training divergence even at moderate fault rates.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"a value by orders of magnitude, whereas the same spatial fault in a BF16 tensor, with a 
wider exponent and narrower mantissa, produces a different corruption profile; FP8 further alters this trade-off by aggressively compressing both range and precision. While increasing model scale is known to affect activation outlier distributions and thus quantization behavior [25], isolating the numerical format at a controlled scale allows us to rigorously probe a large space of hardware-level bit-flips and activation rates, yielding more direct, actionable insight into how representation choice shapes training-time fault sensitivity. C. Hardware, Software and Training Recipe Hardware: The fault injection training experiments were"},{"citing_arxiv_id":"2604.08826","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"HiFloat4 Format for Language Model Pre-training on Ascend NPUs","primary_cat":"cs.LG","submitted_at":"2026-04-09T23:50:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"HiFloat4 FP4 with stabilization techniques trains dense and MoE language models on Ascend NPUs at relative error within 1% of full-precision baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.06836","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"STQuant: Spatio-Temporal Adaptive Framework for Optimizer Quantization in Large Multimodal Model Training","primary_cat":"cs.LG","submitted_at":"2026-04-08T08:57:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"STQuant dynamically allocates quantization bits for optimizer states in multimodal model training, reducing memory by 84.4% to an average 5.1 bits while preserving quality on GPT-2 and 
ViT.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.02525","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"AdaHOP: Fast and Accurate Low-Precision Training via Outlier-Pattern-Aware Rotation","primary_cat":"cs.LG","submitted_at":"2026-04-02T21:24:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AdaHOP applies pattern-aware Hadamard transforms and selective outlier extraction to enable from-scratch MXFP4 training of LLMs at BF16 quality with up to 3.6X memory compression and 1.46X speedup.","context_count":1,"top_context_role":"other","top_context_polarity":"unclear","context_text":"[25] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization, 2019. URL https://arxiv.org/abs/1711.05101. [26] Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, et al. Mixed precision training. In International Conference on Learning Representations, 2018. [27] Paulius Micikevicius, Dusan Stosic, Neil Burgess, Marius Cornea, Pradeep Dubey, Richard Grisenthwaite, Sangwon Ha, Alexander Heinecke, Patrick Judd, John Kamalu, et al. FP8 formats for deep learning. arXiv preprint arXiv:2209.05433, 2022. 
[28] Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Ngoc-Quan Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández."},{"citing_arxiv_id":"2604.03298","ref_index":44,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ENEC: A Lossless AI Model Compression Method Enabling Fast Inference on Ascend NPUs","primary_cat":"cs.AR","submitted_at":"2026-03-28T16:11:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ENEC delivers 3.43X higher throughput than DietGPU and 1.12X better compression ratio than nvCOMP for lossless model weight compression on Ascend NPUs, yielding up to 6.3X end-to-end inference speedup.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.03279","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Rewriting TTS Inference Economics: Lightning V2 on Tenstorrent Achieves 4x Lower Cost Than NVIDIA L40S","primary_cat":"eess.AS","submitted_at":"2026-03-24T13:02:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"Lightning V2 achieves 4x lower on-prem accelerator cost for TTS inference on Tenstorrent hardware than NVIDIA L40S at equivalent throughput and production audio fidelity.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.10718","ref_index":30,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SnapMLA: Efficient Long-Context MLA Decoding via Hardware-Aware FP8 Quantized Pipelining","primary_cat":"cs.LG","submitted_at":"2026-02-11T10:24:42+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SnapMLA achieves up to 1.91x higher throughput in long-output MLA decoding using FP8 
quantization and specialized kernels while keeping benchmark quality near the BF16 baseline.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.05743","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Balancing FP8 Computation Accuracy and Efficiency on Digital CIM via Shift-Aware On-the-fly Aligned-Mantissa Bitwidth Prediction","primary_cat":"cs.AR","submitted_at":"2026-02-05T15:10:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A 28nm digital CIM accelerator for FP8 uses on-the-fly shift-aware bitwidth prediction, FIFO alignment, and scalable MACs to reach 20.4 TFLOPS/W and 2.8x better efficiency than prior work while supporting variable mantissa widths.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.20856","ref_index":31,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"NVIDIA Nemotron 3: Efficient and Open Intelligence","primary_cat":"cs.CL","submitted_at":"2025-12-24T00:24:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"NVIDIA releases the Nemotron 3 model family with hybrid Mamba-Transformer architecture, LatentMoE, NVFP4 training, MTP layers, and multi-environment RL post-training for reasoning and agentic tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.12131","ref_index":16,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"BOOST: BOttleneck-Optimized Scalable Training Framework for Low-Rank Large Language Models","primary_cat":"cs.LG","submitted_at":"2025-12-13T01:50:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"BOOST 
delivers 1.46-2.27x end-to-end speedups for low-rank bottleneck LLMs by redesigning tensor parallelism around the bottleneck structure plus supporting optimizations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.10909","ref_index":20,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Bit-Accurate Modeling of GPU Matrix Multiply-Accumulate Units: Demystifying Numerical Discrepancy and Accuracy","primary_cat":"cs.AR","submitted_at":"2025-11-14T02:45:15+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":8.0,"formal_verification":"none","one_line_summary":"The authors derive the first bit-accurate arithmetic models for matrix multiply-accumulate operations on ten GPU architectures spanning NVIDIA Volta to Blackwell and AMD CDNA1 to CDNA3.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.06838","ref_index":57,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"P3-LLM: An Integrated NPU-PIM Accelerator for Edge LLM Inference Using Hybrid Numerical Formats","primary_cat":"cs.AR","submitted_at":"2025-11-10T08:29:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"P3-LLM delivers 4.9x average speedup over HBM-PIM for edge LLM inference by pairing hybrid-format quantization with iso-area-optimized low-precision PIM compute units and operator fusion.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.04212","ref_index":16,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Why Low-Precision Transformer Training Fails: An Analysis on Flash 
Attention","primary_cat":"cs.LG","submitted_at":"2025-10-05T14:01:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Low-precision Flash Attention fails due to similar low-rank attention representations combined with biased rounding errors that accumulate and corrupt weight updates; a minimal fix to reduce rounding bias stabilizes training.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.03472","ref_index":32,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DPQuant: Efficient and Differentially-Private Model Training via Dynamic Quantization Scheduling","primary_cat":"cs.LG","submitted_at":"2025-09-03T16:51:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DPQuant uses epoch-wise probabilistic layer rotation and DP loss sensitivity to quantize only a changing subset of layers, reducing accuracy degradation from quantization noise in DP-SGD and delivering up to 2.21x throughput gains with under 2% accuracy drop.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2508.08822","ref_index":26,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"OISMA: On-the-fly In-memory Stochastic Multiplication Architecture for Matrix-Multiplication Workloads","primary_cat":"cs.AR","submitted_at":"2025-08-12T10:24:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"OISMA is an in-memory computing design using quasi-stochastic bent-pyramid computing to convert memory reads into multiplications, demonstrated in a 4-kB RRAM array with 0.789 TOPS/W at 50 MHz in 180-nm technology and projected gains at 
22-nm.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.11277","ref_index":32,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Analysis of Floating-Point Matrix Multiplication Computed via Integer Arithmetic","primary_cat":"math.NA","submitted_at":"2025-06-12T20:33:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Error analysis and cost estimator for recasting floating-point matrix multiplication as accumulated integer products on mixed-precision hardware.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2412.04468","ref_index":87,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"NVILA: Efficient Frontier Visual Language Models","primary_cat":"cs.CV","submitted_at":"2024-12-05T18:59:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"NVILA improves on VILA with a scale-then-compress visual token strategy and full-lifecycle efficiency optimizations, matching or exceeding leading VLMs on image and video benchmarks while reducing training cost 1.9-5.1x and latencies 1.2-2.8x.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"promising approaches have been proposed for selecting pretraining data for Large Language Models (LLMs), such as domain-mixing [85], sample-wise data selection [27, 86], and theory-driven optimal selection [28]. In this work, we specifically focus on pruning supervised fine-tuning (SFT) datasets for VLMs. Regarding low-precision training, FP8 training [87, 88] has gained popularity for LLMs, yet no prior work has demonstrated its feasibility for VLMs without sacrificing accuracy. 
Techniques such as pruning, distillation, and quantization are commonly applied to LLMs. [89, 90] apply pruning/distillation to LLM. However, their application to VLMs presents an open question: Should an LLM be pruned or distilled first before"},{"citing_arxiv_id":"2407.08608","ref_index":37,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision","primary_cat":"cs.LG","submitted_at":"2024-07-11T15:44:48+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"FlashAttention-3 achieves 1.5-2x speedup on H100 GPUs for attention, reaching 740 TFLOPs/s (75% utilization) in FP16 and near 1.2 PFLOPs/s in FP8 while cutting numerical error by 2.6x versus baseline FP8 attention.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}