{"total":83,"items":[{"citing_arxiv_id":"2606.00844","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MoEIoU: Rethinking Bounding-Box Regression as a Mixture of Experts","primary_cat":"cs.CV","submitted_at":"2026-05-30T18:34:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MoEIoU is a mixture-of-experts IoU loss using log-sum-exp aggregation and curriculum weighting that reports consistent gains over prior IoU losses on PASCAL VOC, HRIPCB, and MS COCO with YOLO models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.31175","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Towards Efficient LLMs Annealing with Principled Sample Selection","primary_cat":"cs.CL","submitted_at":"2026-05-29T11:42:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DiReCT reformulates LLM annealing sample selection as a constrained optimization problem that enforces per-sample gradient directions aligned with the loss landscape's curvature.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23061","ref_index":68,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Anytime Training with Schedule-Free Spectral Optimization","primary_cat":"cs.LG","submitted_at":"2026-05-21T21:50:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SF-NorMuon is a new schedule-free spectral optimizer that closes the gap with tuned AdamW on 125M-772M parameter models across 1-8x Chinchilla horizons while providing stationarity guarantees.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20866","ref_index":52,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LOSCAR-SGD: Local SGD with Communication-Computation Overlap and Delay-Corrected Sparse Model Averaging","primary_cat":"cs.LG","submitted_at":"2026-05-20T08:01:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LOSCAR-SGD combines local updates, sparse model averaging, and communication-computation overlap with a delay-corrected merge rule, providing convergence rates for smooth non-convex objectives under worker heterogeneity.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"≤cX r +η 2 r G2(BSN +DS Q).(50) This proves (42). Corollary A.1 (Accumulated disagreement bound). Assumex 0 i =x 0 for alli, so thatX 0 = 0, and assumeη r ≡ηfor allr. If c=q(1 +α)(1 +β)<1, then R−1X r=0 E[X r]≤R η2G2 BSN +DS Q \u0001 1−c (51) for everyR≥1. Proof.Letδ r :=E[X r]. Taking full expectation in (42), and usingη r ≡η, gives δr+1 (42) ≤cδ r +C,(52) whereC :=η 2G2(BSN +DS Q). Sincex 0 i =x 0 for alli, we have X0 = 0, δ 0 =E \u0002 X0\u0003 = 0. We now prove by induction that, for everyr≥0, δr ≤C r−1X j=0 cj,(53) where the sum is interpreted as 0 when r= 0 . For r= 0 , (53) gives δ0 ≤0, which holds since δ0 = 0. Suppose now that (53) holds for somer≥0. Then δr+1 (52) ≤cδ r +C (53) ≤cC r−1X j=0 cj +C=C rX"},{"citing_arxiv_id":"2605.20005","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Fine-Tuning Without Forgetting via Loss-Adaptive Learning Rates","primary_cat":"cs.LG","submitted_at":"2026-05-19T15:36:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"FINCH is a loss-adaptive learning-rate schedule that reduces forgetting by 93% on average during LLM fine-tuning while matching standard task performance across several benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19811","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LionMuon: Alternating Spectral and Sign Descent for Efficient Training","primary_cat":"cs.LG","submitted_at":"2026-05-19T13:07:59+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20302","ref_index":76,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Neural Collapse by Design: Learning Class Prototypes on the Hypersphere","primary_cat":"cs.LG","submitted_at":"2026-05-19T12:51:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Supervised classification reaches neural collapse by design via normalized prototype losses on the hypersphere, outperforming CE and SCL on ImageNet-1K and other benchmarks with faster convergence and better transfer.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18174","ref_index":50,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Ringmaster LMO: Asynchronous Linear Minimization Oracle Momentum Method","primary_cat":"cs.LG","submitted_at":"2026-05-18T10:18:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Ringmaster LMO extends delay-thresholding from ASGD to LMO-based momentum updates, providing convergence guarantees under (L0, L1)-smoothness and time-complexity bounds that recover optimal rates in the Euclidean case.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17546","ref_index":64,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Accelerating Redshift-Conditioned Galaxy Image Synthesis with One-step Generative Modeling","primary_cat":"astro-ph.IM","submitted_at":"2026-05-17T17:00:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"One-step pixel-MeanFlow models recover key galaxy morphology statistics at orders-of-magnitude lower computational cost than standard DDPM sampling while remaining weaker on fine-grained structure.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16017","ref_index":34,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Accelerated Gradient Descent for Faster Convergence with Minimal Overhead","primary_cat":"cs.LG","submitted_at":"2026-05-15T14:50:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"CT-AGD accelerates first-order optimization in deep learning by using finite-difference curvature estimates and noise-mitigation heuristics, achieving equivalent accuracy with 33% fewer training epochs and overhead comparable to Adam.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13434","ref_index":169,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Rescaled Asynchronous SGD: Optimal Distributed Optimization under Data and System Heterogeneity","primary_cat":"cs.LG","submitted_at":"2026-05-13T12:27:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Rescaled ASGD recovers convergence to the true global objective by rescaling worker stepsizes proportional to computation times, matching the known time lower bound in the leading term under non-convex smoothness and bounded heterogeneity.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12278","ref_index":19,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Hypernetworks for Dynamic Feature Selection","primary_cat":"cs.LG","submitted_at":"2026-05-12T15:37:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Hyper-DFS uses hypernetworks and Set Transformers to generate on-demand parameters for feature subsets in dynamic selection, outperforming prior methods on tabular data and showing stronger zero-shot generalization.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"This is analogous to training a shared feature extractor behind randomly initialised, rapidly changing projection heads, a scenario prone to representation collapse [Yu et al., 2020, Chen and He, 2021]. A per-batch mask budget.Restricting each mini-batch to a single subset (K= 1 , withS(b) =S for allb) givesθ(b) =θ S for allb, and Eq. (18) simplifies to ∇ηL= 1 B BX b=1 Jη ˜x(b) S \u0001⊤ Jf,θ h(b)\u0001⊤ ∇ˆy(b) θ L(b) θ=θS .(19) Every per-sample contribution is now routed through the same primary network, recovering the standard situation of a feature extractor trained jointly with a single downstream classifier; the same simplification applies to Eq.(17), where the outer sum collapses to the single termk= 1 and the bar fixesθ=θ S. A strict budget ofK= 1 removes the within-batch subset diversity needed by"},{"citing_arxiv_id":"2605.11870","ref_index":156,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Information theoretic underpinning of self-supervised learning by clustering","primary_cat":"cs.LG","submitted_at":"2026-05-12T09:50:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SSL clustering is derived as KL-divergence optimization where a teacher-distribution constraint normalizes via inverse cluster priors and simplifies to batch centering by Jensen's inequality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11530","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Multi-Narrow Transformation as a Single-Model Ensemble: Boundary Conditions, Mechanisms, and Failure Modes","primary_cat":"cs.LG","submitted_at":"2026-05-12T04:54:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Multi-narrow single-model ensembles outperform wide baselines in low-data image classification by learning diverse features but underperform in data-rich settings where training favors few paths.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"128 , and the weight decay was fixed at5 × 10 −2. The batch size was set to 128 by default, but adjusted in practice according to the memory usage of each model. When the batch size was changed, the maximum learning rate was scaled proportionally. This setting was chosen based on the common learning-rate scaling heuristic used in large-batch training [7, 24], together with preliminary experiments on CIFAR-100. The number of training epochs was adjusted to ensure a sufficient number of parameter updates under low-data conditions: epochs = { 200,IPC≥100, 200 × (100∕IPC) ,IPC<100. That is, models were trained for 200 epochs whenIPC≥ 100, and for proportionally longer when the data became scarcer. In all cases, we confirmed that training had con-"},{"citing_arxiv_id":"2605.09850","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Probing Routing-Conditional Calibration in Attention-Residual Transformers","primary_cat":"cs.CV","submitted_at":"2026-05-11T01:06:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Routing summaries and auxiliary features do not provide stable evidence of conditional miscalibration in AR transformers once confidence-matched baselines, capacity controls, and permutation nulls are applied.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08871","ref_index":47,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Rennala MVR: Improved Time Complexity for Parallel Stochastic Optimization via Momentum-Based Variance Reduction","primary_cat":"math.OC","submitted_at":"2026-05-09T10:46:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Rennala MVR improves time complexity over Rennala SGD for smooth nonconvex stochastic optimization in heterogeneous parallel systems under a mean-squared smoothness assumption.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08524","ref_index":44,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Unleashing Scalable Context Parallelism for Foundation Models Pre-Training via FCP","primary_cat":"cs.DC","submitted_at":"2026-05-08T22:16:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"FCP shards sequences at block level with flexible P2P communication and bin-packing to achieve near-linear scaling up to 256 GPUs and 1.13x-2.21x higher attention MFU in foundation model pre-training.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07815","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"OrScale: Orthogonalised Optimization with Layer-Wise Trust-Ratio Scaling","primary_cat":"cs.LG","submitted_at":"2026-05-08T14:47:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OrScale adds a Frobenius-norm trust-ratio layer-wise scaler to Muon’s orthogonalized updates, with per-layer calibration for language models, yielding higher CIFAR-10 accuracy and better language-model pre-training loss than Muon+Moonlight and AdamW.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"deep bidirectional transformers for language understanding, 2019. URL https://arxiv.org/ abs/1810.04805. [4] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch sgd: Training imagenet in 1 hour, 2018. URLhttps://arxiv.org/abs/1706.02677. [5] Keller Jordan, Jeremy Bernstein, Brendan Rappazzo, @fernbear.bsky.social, Boza Vlado, You Jiacheng, Franz Cesista, Braden Koszarsky, and @Grad62304977. modded-nanogpt: Speedrunning the nanogpt baseline, 2024. URL https://github.com/KellerJordan/ modded-nanogpt. [6] Keller Jordan, Yuchen Jin, Vlado Boza, Jiacheng You, Franz Cesista, Laker Newhouse, and"},{"citing_arxiv_id":"2605.07795","ref_index":110,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Scalable Distributed Stochastic Optimization via Bidirectional Compression: Beyond Pessimistic Limits","primary_cat":"math.OC","submitted_at":"2026-05-08T14:32:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Inkheart SGD and M4 use bidirectional compression to achieve time complexities in distributed SGD that improve with worker count n and surpass prior lower bounds under a necessary structural assumption.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07160","ref_index":40,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"TENNOR: Trustworthy Execution for Neural Networks through Obliviousness and Retrievals","primary_cat":"cs.CR","submitted_at":"2026-05-08T02:46:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"TENNOR enables efficient private training of wide neural networks in TEEs by recasting sparsification as doubly oblivious LSH retrievals and introducing MP-WTA to cut hash table memory by 50x while preserving accuracy.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"2017.Accu- rate, Large Minibatch SGD: Training ImageNet in 1 Hour. Technical Report arXiv:1706.02677. Facebook AI Research. https://arxiv.org/abs/1706.02677 [39] P. Grubbs, M. Lacharité, B. Minaud, and K. G. Paterson. 2019. Learning to Reconstruct: Statistical Learning Theory and Encrypted Database Attacks. In Proc. of the 40th IEEE S&P. 496-512. [40] Tianyao Gu, Yilei Wang, Afonso Tinoco, Bingnan Chen, Ke Yi, and Elaine Shi. 2025. Flexway O-Sort: Enclave-Friendly and Optimal Oblivious Sorting. In34th USENIX Security Symposium (USENIX Security 2025), Seattle, W A, USA, August 13- 15, 2025, Lujo Bauer and Giancarlo Pellegrino (Eds.). USENIX Association, 7563- 7582. https://www.usenix.org/conference/usenixsecurity25/presentation/gu-"},{"citing_arxiv_id":"2605.02853","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Trust, but Verify: Peeling Low-Bit Transformer Networks for Training Monitoring","primary_cat":"cs.LG","submitted_at":"2026-05-04T17:30:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A layer-wise peeling framework creates reference bounds to diagnose under-optimized layers in trained decoder-only transformers, including low-bit and quantized versions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10953","ref_index":49,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Parameter-Efficient Adaptation of Pre-Trained Vision Foundation Models for Active and Passive Seismic Data Denoising","primary_cat":"physics.geo-ph","submitted_at":"2026-04-30T15:28:51+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Adapting vision foundation models with LoRA and kurtosis-guided unsupervised test-time adaptation matches or exceeds domain-specific models for seismic denoising across multiple sites and unseen data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.27932","ref_index":48,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Dynamic Cluster Data Sampling for Efficient and Long-Tail-Aware Vision-Language Pre-training","primary_cat":"cs.CV","submitted_at":"2026-04-30T14:33:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DynamiCS dynamically scales semantic clusters per training epoch to reduce VLM pre-training compute while improving accuracy on long-tail concepts compared to static or flattening baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.27128","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Lightweight Distillation of SAM 3 and DINOv3 for Edge-Deployable Individual-Level Livestock Monitoring and Longitudinal Visual Analytics","primary_cat":"cs.CV","submitted_at":"2026-04-29T19:25:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Distilled SAM 3 and DINOv3 models deliver near-teacher accuracy in pig tracking (92.29% MOTA, 96.15% IDF1) and behavior classification while achieving 7.77x parameter reduction and fitting on Jetson Orin NX with headroom.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.26687","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"COPUS: Co-adaptive Parallelism and Batch Size Selection in Large Language Model Training","primary_cat":"cs.DC","submitted_at":"2026-04-29T13:52:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"COPUS co-adapts batch size and parallelism during LLM training via goodput to deliver 3.9-8% average faster convergence than fixing one while tuning the other.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Adaptive batch sizing.The GNS framework [27] motivated online batch-size adaptation based on the gradient noise- to-signal ratio. Follow-up estimators and heuristics include AdaScale [14], CABS [2], SimiGrad [37], AdaBatch [4], Ad- aBatchGrad [34], and AdAdaGrad [18]; large-batch studies further characterized optimization under changing batch sizes [9, 16, 42]. Branching-based methods estimate CBS more directly [29, 55]. All optimize𝐵𝑔 for statistical efficiency without modeling the 3D-parallelism throughput surface. Goodput-aware scheduling.Pollux [ 36] introduced good- put as the product of statistical efficiency and throughput, co-adapting batch size and resource allocation in a shared DP- only cluster."},{"citing_arxiv_id":"2604.26481","ref_index":61,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"A Provably Robust Multi-Jet Framework applied to Active Flow Control of an Airfoil in Weakly Compressible Flow","primary_cat":"physics.flu-dyn","submitted_at":"2026-04-29T09:47:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A new injective multi-jet framework for RL flow control provides jet-count-independent running cost upper bounds and enables superior coordinated jet strategies, achieving drag suppression beyond symmetric ideals on cylinders and aerodynamic efficiency gains from 53% to 73% on airfoils.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"A closer look at deep learning heuris- tics: Learning rate restarts, warmup and distillation.arXiv preprint arXiv:1810.13243, 2018. [60] C. Vignon, J. Rabault, and R. Vinuesa. Recent advances in applying deep reinforcement learning for flow control: Perspectives and future directions.Physics of Fluids, 35 (3):031301, mar 2023. ISSN 1070-6631, 1089-7666. doi: 10.1063/5.0143913. [61] Ricardo Vinuesa. Perspectives on predicting and control- ling turbulent flows through deep learning.Physics of Flu- ids, 36(3):031401, March 2024. ISSN 1070-6631, 1089- 7666. doi: 10.1063/5.0190452. 25"},{"citing_arxiv_id":"2604.24013","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"CommFuse: Hiding Tail Latency via Communication Decomposition and Fusion for Distributed LLM Training","primary_cat":"cs.LG","submitted_at":"2026-04-27T03:48:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CommFuse eliminates tail latency in communication-computation overlap for distributed LLM training by decomposing collective operations into P2P communications and fusing them with fine-grained computation scheduling.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.23098","ref_index":39,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"In-context modeling as a retrain-free paradigm for foundation models in computational science","primary_cat":"cs.CE","submitted_at":"2026-04-25T01:33:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"In-Context Modeling lets one trained model generalize across unseen materials, geometries, and conditions in computational physics by treating measurements as context for inference.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.21691","ref_index":266,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"There Will Be a Scientific Theory of Deep Learning","primary_cat":"stat.ML","submitted_at":"2026-04-23T13:58:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"A mechanics of the learning process is emerging in deep learning theory, characterized by dynamics, coarse statistics, and falsifiable predictions across idealized settings, limits, laws, hyperparameters, and universal behaviors.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.19902","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"MMCORE: MultiModal COnnection with Representation Aligned Latent Embeddings","primary_cat":"cs.CV","submitted_at":"2026-04-21T18:25:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"MMCORE transfers VLM reasoning into diffusion-based image generation and editing via aligned latent embeddings from learnable queries, outperforming baselines on text-to-image and editing tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"small to large by increasing the number of GPUs, where the batch token length of the MLLM.is expanded. Following the training of a diffusion head on top of the aligned visual tokens, we observe that performance improves consistently with scale, showing no signs of saturation within our maximum compute budget. This aligns with established scaling laws for large-model training [8, 14]. SFT for Improved Alignment.Our decoupled pipeline facilitates staged optimization. Following large-scale pre-training on noisy image-text data, we apply Supervised Fine-Tuning (SFT) using a curated multimodal instruction dataset. A brief SFT phase (e.g., ∼2K steps) significantly enhances text faithfulness and controllability, consistent with prior findings in instruction tuning [21, 41]."},{"citing_arxiv_id":"2604.19351","ref_index":41,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"DASH-KV: Accelerating Long-Context LLM Inference via Asymmetric KV Cache Hashing","primary_cat":"cs.CL","submitted_at":"2026-04-21T11:33:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DASH-KV accelerates long-context LLM inference to linear complexity via asymmetric KV cache hashing and mixed-precision retention, matching full attention performance on LongBench.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.10506","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"A Progressive Training Strategy for Vision-Language Models to Counteract Spatio-Temporal Hallucinations in Embodied Reasoning","primary_cat":"cs.AI","submitted_at":"2026-04-12T07:48:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A progressive training framework using spatiotemporal chain-of-thought data reduces the forward-backward temporal query performance gap in VLMs from over 70% to 6.53%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.08140","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Multimodal Reasoning with LLM for Encrypted Traffic Interpretation: A Benchmark","primary_cat":"cs.CR","submitted_at":"2026-04-09T11:56:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Creates the BGTD benchmark and mmTraffic architecture to enable explainable multimodal interpretation of encrypted network traffic using LLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.04736","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Sampling Parallelism for Fast and Efficient Bayesian Learning","primary_cat":"cs.LG","submitted_at":"2026-04-06T15:03:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Sampling parallelism distributes Bayesian sample evaluations across GPUs for near-perfect scaling, lower memory use, and faster convergence via per-GPU data augmentations, outperforming pure data parallelism in diversity.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2020. Generative adversarial networks.Commun. ACM63, 11 (2020), 139-144. [13] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. 2018. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. arXiv:1706.02677 [cs.CV] doi:10.48550/arXiv.1706.02677 [14] Alex Graves. 2011. Practical variational inference for neural networks. InPro- ceedings of the 25th International Conference on Neural Information Processing Systems(Granada, Spain)(NIPS'11). Curran Associates Inc., Red Hook, NY, USA, 2348-2356. [15] Hans Hersbach, Bill Bell, Paul Berrisford, Shoji Hirahara, András Horányi, Joaquín Muñoz-Sabater, Julien Nicolas, Carole Peubey, Raluca Radu, Dinand"},{"citing_arxiv_id":"2604.03688","ref_index":16,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Fusion and Alignment Enhancement with Large Language Models for Tail-item Sequential Recommendation","primary_cat":"cs.IR","submitted_at":"2026-04-04T11:19:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"FAERec fuses collaborative ID embeddings with LLM semantic embeddings using adaptive gating and dual-level alignment to enhance tail-item sequential recommendations.","context_count":1,"top_context_role":"background","top_context_polarity":"support","context_text":"[52], with item-level alignment and feature-level alignment pre- cisely corresponding to this hierarchy: the former is a simple point- to-point task, while the latter is a complex alignment task over a space structure. If both constraints are imposed simultaneously during early training, premature space structure constraints may introduce noise, leading to suboptimal performance [ 16]. There- fore, inspired by curriculum learning [5, 17] we propose a cosine- scheduled to dynamically balance the two alignment objectives. Table 1: The statistics of three datasets. Dataset # Users # Items Sparsity Avg.length Beauty 52,204 57,289 99.99% 7.56 Grocery 32,126 39,264 99.98% 8.57 Yelp 15,720 11,383 99.89% 12.23 Specifically, the weight of item-level alignment loss is defined as:"},{"citing_arxiv_id":"2604.02473","ref_index":44,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Analyzing Reverse Address Translation Overheads in Multi-GPU Scale-Up Pods","primary_cat":"cs.DC","submitted_at":"2026-04-02T19:08:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Simulation study shows cold TLB misses in reverse address translation dominate latency for small collectives in multi-GPU pods, causing up to 1.4x degradation, while larger ones see diminishing returns.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"tructure: inference spans tens to hundreds of GPUs, while training cutting-edge models can require thousands [10, 29, 59, 67, 87]. This scaling follows a two-tier model [35, 60]: vertical (intra-node) scal- ing via GPU interconnects such as NVLink [ 72], AMD Infinity Fabric Link [89], or Huawei HCCS [63], and horizontal (inter-node) scaling via GPU-NIC interconnects using RDMA [44, 56, 68, 83, 105]. While intra-node links offer terabits-per-second bandwidth and direct load/store semantics [7, 36], inter-node communication re- mains a bottleneck due to slower NIC-mediated bandwidth [35, 84]. To bridge this gap, new scale-up fabrics such as NVIDIA's NVLink network [26] and the recently ratified UALink 200G 1.0 standard [9, 19, 42, 66] introduce high-bandwidth, memory-semantic intercon-"},{"citing_arxiv_id":"2603.05116","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"FedBCD:Communication-Efficient Accelerated Block Coordinate Gradient Descent for Federated Learning","primary_cat":"cs.LG","submitted_at":"2026-03-05T12:37:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"FedBCGD reduces communication in federated learning by a factor of 1/N through block-wise parameter updates with accelerated convergence guarantees.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.16233","ref_index":37,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DistributedEstimator: Distributed Training of Quantum Neural Networks via Circuit Cutting","primary_cat":"cs.DC","submitted_at":"2026-02-18T07:17:29+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.12677","ref_index":21,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Fine-Tuning Causal LLMs for Text Classification: Embedding-Based vs. Instruction-Based Approaches","primary_cat":"cs.CL","submitted_at":"2025-12-14T13:02:06+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.02012","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Improved Mean Flows: On the Challenges of Fastforward Generative Models","primary_cat":"cs.CV","submitted_at":"2025-12-01T18:59:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Improved MeanFlow (iMF) reaches 1.72 FID on ImageNet 256x256 with one function evaluation by reformulating the training objective as a regression on instantaneous velocity and treating guidance as flexible conditioning variables.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.13720","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Back to Basics: Let Denoising Generative Models Denoise","primary_cat":"cs.CV","submitted_at":"2025-11-17T18:59:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Directly predicting clean data with large-patch pixel Transformers enables strong generative performance in diffusion models where noise prediction fails at high dimensions.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"JiT-B JiT-L JiT-H JiT-G architecture depth 12 24 32 40 hidden dim 768 1024 1280 1664 heads 12 16 16 16 image size 256 (other settings: 512, or 1024) patch size image size/ 16 bottleneck 128 (B/L), 256 (H/G) dropout 0 (B/L), 0.2 (H/G) in-context class tokens 32 (if used) in-context start block 4 8 10 10 training epochs 200 (ablation), 600 warmup epochs [17] 5 optimizer Adam [31],β 1, β2 = 0.9,0.95 batch size 1024 learning rate 2e-4 learning rate schedule constant weight decay 0 ema decay {0.9996, 0.9998, 0.9999} time sampler logit(t)∼N(µ, σ 2),µ= -0.8,σ= 0.8 noise scale 1.0×image size/ 256 clip of(1−t)in division 0.05 class token drop (for CFG) 0.1 sampling ODE solver Heun [20] ODE steps 50 time steps linear in [0."},{"citing_arxiv_id":"2511.08666","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Privacy Beyond Pixels: Latent Anonymization for Privacy-Preserving Video Understanding","primary_cat":"cs.CV","submitted_at":"2025-11-11T18:56:27+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A plug-and-play Anonymizing Adapter Module removes private information from video latent features using self-supervised privacy objectives and consistency losses while retaining utility on action recognition, temporal detection, and anomaly tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.04988","ref_index":17,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Adaptive Memory Momentum via a Model-Based Framework for Deep Learning Optimization","primary_cat":"cs.LG","submitted_at":"2025-10-06T16:24:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Presents a model-based proximal framework for adaptive momentum in first-order optimizers by using a two-plane approximation of the objective to dynamically set the memory coefficient online.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2508.21613","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Chameleon: Adaptive Fault Tolerance for Distributed Training via Real-time Policy Selection","primary_cat":"cs.DC","submitted_at":"2025-08-29T13:22:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Chameleon provides adaptive fault tolerance for distributed training by real-time selection of optimal recovery policies via a unified performance model, demonstrated with low overhead on a 32-card cluster.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.23315","ref_index":22,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Analysis of Hyperparameter Optimization Effects on Lightweight Deep Models for Real-Time Image Classification","primary_cat":"cs.CV","submitted_at":"2025-07-31T07:47:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"Hyperparameter tuning on seven lightweight models trained on a 90k-image ImageNet subset yields 1.5-3.5% top-1 accuracy gains, with RepVGG-A2 and MobileNetV3-L achieving sub-5ms latency and over 9800 FPS on GPU.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.00528","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Efficient GPU-Accelerated Training of a Neuroevolution Potential with Analytical Gradients","primary_cat":"cond-mat.dis-nn","submitted_at":"2025-07-01T07:44:22+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"GNEP trains neuroevolution potentials with analytical gradients and Adam optimizer, cutting fitting time by orders of magnitude for Sb-Te systems while matching DFT accuracy on equation of state and radial distribution functions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.00432","ref_index":243,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning","primary_cat":"cs.AI","submitted_at":"2025-07-01T05:23:05+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.13447","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Mean Flows for One-step Generative Modeling","primary_cat":"cs.LG","submitted_at":"2025-05-19T17:59:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MeanFlow uses a derived identity between average and instantaneous velocities to train one-step flow models, achieving FID 3.43 on ImageNet 256x256 with 1-NFE from scratch.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"[14] Zhengyang Geng, Ashwini Pokle, and J Zico Kolter. One-step diffusion distillation via deep equilibrium models. Neural Information Processing Systems (NeurIPS), 36, 2024. 2 [15] Zhengyang Geng, Ashwini Pokle, William Luo, Justin Lin, and J Zico Kolter. Consistency models made easy. arXiv preprint arXiv:2406.14548, 2024. 1, 2, 5, 6, 7, 8, 9, 15 [16] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017. 14 [17] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium."},{"citing_arxiv_id":"2503.19444","ref_index":41,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"AI Failures in the Eyes of the Downstream Developer: A First Look at Concerns, Practices, and Challenges","primary_cat":"cs.SE","submitted_at":"2025-03-25T08:35:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Mixed-methods study maps downstream developers' concerns, practices, and challenges with AI failures in PTM-based software.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"1 INTRODUCTION From autonomous vehicles to powerful chatbots, software engineers are adopting AI models to build AI-based software systems [75] in critical domains, including education [ 29], healthcare [ 115], and transportation [ 80]. With recent advancements in AI models, the expertise and resources required to train a model from scratch have increased [41, 88, 106]. This has led to a paradigm shift, with developers increasingly adopting pre-trained models (PTMs) - models that have been trained on large datasets for general tasks and can be adapted to specific applications - from platforms such as Hugging Face [56, 102]. Neglecting the safety of AI-based software has harmed individuals and broader societies [10]."},{"citing_arxiv_id":"2502.05171","ref_index":63,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach","primary_cat":"cs.LG","submitted_at":"2025-02-07T18:55:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}