{"total":209,"items":[{"citing_arxiv_id":"2606.26861","ref_index":27,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Cascaded Multi-Granularity Pruning for On-Device LLM Inference in Industrial IoT","primary_cat":"cs.CL","submitted_at":"2026-06-25T10:44:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Cascaded multi-granularity pruning reaches 13.8x compression on MHA+GELU LLMs for bearing fault diagnosis at 83.82% accuracy while causing ~74pp collapse on GQA+SwiGLU models that violate the formalized Structural Independence Assumption.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.17631","ref_index":25,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Better Queries, Cheaper Attention: Adapting Transformers for Efficient Sparse Reconstruction","primary_cat":"hep-ex","submitted_at":"2026-06-16T07:41:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A geometry-aware dynamic-query transformer decoder with Local Strided Cross-Attention raises track reconstruction efficiency from 94.1% to 98.1%, halves latency, and cuts memory use by over 10x versus fixed-query baselines in a simplified HL-LHC simulation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.14122","ref_index":16,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Beyond Perplexity: UTF-8 Validity in Byte-aware Language Models","primary_cat":"cs.CL","submitted_at":"2026-06-12T05:03:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A 355M-parameter byte-level LM on 80B multilingual tokens exhibits UTF-8 validity converging after 4.2B tokens versus 2.1B for perplexity, 
with higher validity on rare characters than common ones.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.12364","ref_index":44,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"On Subquadratic Architectures: From Applications to Principles","primary_cat":"cs.LG","submitted_at":"2026-06-10T17:33:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"xLSTM outperforms Mamba-2 and Gated DeltaNet on tasks with complex dependencies because its gating scheme enables more flexible and stable state tracking and memory accumulation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00892","ref_index":54,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"An Exploratory Study into using Machine-Learning for Fast Step-by-step Emulation of Numerical Mechanical Thrombectomy Simulations for Ischemic Stroke","primary_cat":"cs.LG","submitted_at":"2026-05-30T20:54:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"ML surrogates accurately emulate single steps of simplified thrombectomy simulations with speedups but lack stability over long times with complex geometries.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.31296","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"mRNAutilus: Multi-Objective-Guided Discrete Generation of mRNA with Optimized Therapeutic Properties","primary_cat":"q-bio.BM","submitted_at":"2026-05-29T13:32:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"mRNAutilus generates full-length therapeutic mRNAs via diffusion models and multi-objective 
guidance, achieving over 400-fold expression gains for luciferase and outperforming baselines for Spike and other targets in zero-shot tests.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.31268","ref_index":64,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Mellum2 Technical Report","primary_cat":"cs.CL","submitted_at":"2026-05-29T13:01:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"Mellum 2 is a 12B MoE model with 2.5B active parameters, trained on 10.6T tokens with MoE, GQA, SWA, and MTP, then post-trained into Instruct and Thinking variants, claimed competitive with 4B-14B models at 2.5B compute.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.07604","ref_index":49,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Contribution Weights: A Geometrical Analysis of Self-Attention Transformers","primary_cat":"cs.LG","submitted_at":"2026-05-29T09:40:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Contribution Weights combine attention, value magnitude, and directional alignment to measure token influence more faithfully than attention alone, and show attention sinks actively suppress information via a convex sink-rate to output-norm relationship.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.31035","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MixFP4: Enhancing NVFP4 with Adaptive FP4/INT4 Block Representations","primary_cat":"cs.AR","submitted_at":"2026-05-29T09:05:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MixFP4 extends 
NVFP4 by adaptively selecting between two FP4 micro-formats per block using repurposed scale sign bits and a unified E2M2 compute path, claiming better accuracy than standard NVFP4 at 3.1% area and 1.5% power overhead.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30750","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SLAP: The Semantic Least Action Principle for Variational Video-Language Modeling","primary_cat":"cs.CV","submitted_at":"2026-05-29T02:28:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"SLAP reframes video interpolation as a variational mechanics boundary value problem on a semantic manifold to enforce object persistence without pixel rendering.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30100","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Chess-World-Model: A 10M-Game Benchmark for Exact State Tracking from Chess Move Sequences","primary_cat":"cs.LG","submitted_at":"2026-05-28T15:43:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Introduces Chess-World-Model benchmark from 10M chess games showing recurrent models (SLiCE, Mamba-3, Gated DeltaNet) outperform Transformers on exact state tracking, with random-play split remaining hard at larger scales.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30080","ref_index":20,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Adaptive Targeted Dynamic Chunking for Tokenization-Free Hierarchical 
Model","primary_cat":"cs.CL","submitted_at":"2026-05-28T15:26:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ATDC applies curriculum learning to dynamically control chunk compression in hierarchical byte models, reporting competitive BPB on FineWeb-Edu 100B and more stable training than fixed-ratio baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30022","ref_index":26,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Give it Space! Explicit Disentangling of Positional and Semantic Representations in Encoders","primary_cat":"cs.CL","submitted_at":"2026-05-28T14:42:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Explicitly disentangling semantic and positional streams in a Transformer encoder reveals that absolute positional representations collapse to a 2D document-structure manifold, attention heads specialize by role, and the approach improves linguistic probing performance on 49 of 65 phenomena.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.29863","ref_index":21,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"STAP: A Shuffle-Tokenized App Predictor with Ultra Long Context for Vocabulary-Free Mobile App Prediction","primary_cat":"cs.LG","submitted_at":"2026-05-28T12:44:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A Transformer model with app-identity shuffling and ultra-long context achieves vocabulary-free next-app prediction with cross-dataset zero-shot capability and competitive cold-start 
performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.29714","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Leveraging Routing Dynamics in Mixture-of-Experts Models for Efficient Language Adaptation","primary_cat":"cs.CL","submitted_at":"2026-05-28T10:12:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Continual multilingual pre-training of an English-centric MoE model produces language-agnostic routing in early layers and specialization in final layers; updating only final-layer experts yields competitive multilingual performance while changing less than 2% of parameters.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23893","ref_index":27,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Complete-muE: Optimal Hyperparameter Transfer and Scaling for MoE Models","primary_cat":"cs.LG","submitted_at":"2026-05-22T17:56:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Complete-muE combines active-width μP and activated-expert scaling to transfer hyperparameters across dense FFN, dense MoE, and sparse MoE while covering changes in experts, capacity, width, depth, batch size, and duration.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23191","ref_index":30,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Expand More, Shrink Less: Shaping Effective-Rank Dynamics for Dense Scaling in Recommendation","primary_cat":"cs.LG","submitted_at":"2026-05-22T03:17:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RankElastor mitigates embedding 
collapse via spectrum-robust token mixing and GLU-based P-FFNs, yielding better performance and scaling on industrial recommendation datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21981","ref_index":26,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"RiT: Vanilla Diffusion Transformers Suffice in Representation Space","primary_cat":"cs.CV","submitted_at":"2026-05-21T04:21:43+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A vanilla Diffusion Transformer trained via x-prediction on frozen DINOv2 features reaches FID 1.14 on ImageNet 256x256 with fewer parameters and faster sampling than prior DiT variants.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21381","ref_index":67,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Disentangling Generation and Regression in Stochastic Interpolants for Controllable Image Restoration","primary_cat":"cs.CV","submitted_at":"2026-05-20T16:41:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DiSI disentangles stochastic interpolants into separate generation and regression paths, allowing controllable transitions between regression and generative image restoration with a unified few-step sampler.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20839","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Activation-Free Backbones for Image Recognition: Polynomial Alternatives within MetaFormer-Style Vision 
Models","primary_cat":"cs.CV","submitted_at":"2026-05-20T07:29:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Polynomial replacements for activations in MLPs, convolutions, and attention within MetaFormer yield PolyNeXt models that match or exceed standard performance on ImageNet, ADE20K, and robustness benchmarks while beating prior polynomial networks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20798","ref_index":60,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Most Transformer Modifications Still Do Not Transfer at 1-3B: A 2020-2026 Update to Narang et al. (2021) with Downstream Evaluation and a Noise Floor","primary_cat":"cs.LG","submitted_at":"2026-05-20T06:43:34+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Empirical update to prior work shows most of 20 recent Transformer modifications do not transfer at 1-3B scales when measured with downstream CLIMB-12 tasks, multi-seed noise floor, and cross-scale stability.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20613","ref_index":27,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"HRM-Text: Efficient Pretraining Beyond Scaling","primary_cat":"cs.CL","submitted_at":"2026-05-20T01:59:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A 1B-parameter hierarchical recurrent model pretrained on 40B instruction-response tokens achieves 60.7% MMLU and strong results on ARC-C, DROP, GSM8K, and MATH while using 100-900x fewer tokens than standard 
baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20005","ref_index":57,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Fine-Tuning Without Forgetting via Loss-Adaptive Learning Rates","primary_cat":"cs.LG","submitted_at":"2026-05-19T15:36:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"FINCH is a loss-adaptive learning-rate schedule that reduces forgetting by 93% on average during LLM fine-tuning while matching standard task performance across several benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19568","ref_index":34,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"m3BERT: A Modern, Multi-lingual, Matryoshka Bidirectional Encoder","primary_cat":"cs.CL","submitted_at":"2026-05-19T09:13:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"m3BERT uses a three-stage Matryoshka pretraining approach on a bidirectional encoder to support variable embedding sizes while outperforming prior models on large-scale retrieval tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19376","ref_index":47,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Generative Recursive Reasoning","primary_cat":"cs.AI","submitted_at":"2026-05-19T05:20:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GRAM is a latent-variable generative model that performs recursive reasoning via stochastic trajectories, trained with amortized variational inference to support multi-hypothesis reasoning and unconditional 
generation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18553","ref_index":40,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"StableHand: Quality-Aware Flow Matching for World-Space Dual-Hand Motion Estimation from Egocentric Video","primary_cat":"cs.CV","submitted_at":"2026-05-18T15:33:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"StableHand introduces a quality-aware flow matching framework conditioned on predicted four-channel per-frame hand observation quality to estimate dual-hand world-space motion from egocentric video, achieving SOTA results with 20-25% W-MPJPE reduction on HOT3D and ARCTIC benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18387","ref_index":30,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Graph Hierarchical Recurrence for Long-Range Generalization","primary_cat":"cs.LG","submitted_at":"2026-05-18T13:31:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"GHR uses hierarchical recurrence on pooled graph abstractions to improve long-range dependency capture and out-of-range generalization while using far fewer parameters than existing models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18106","ref_index":135,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE 
Routers","primary_cat":"math.OC","submitted_at":"2026-05-18T09:17:26+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"13416, 2025. [133] S. Schubert, P. Neubert, J. Pöschmann, and P. Protzel. Circular convolutional neural networks for panoramic images and laser data. In IEEE Intelligent Vehicles Symposium (IV). 2019. [134] A. Semenov, M. Pagliardini, and M. Jaggi. Benchmarking optimizers for large language model pretraining. arXiv preprint arXiv:2509.01440, 2025. 36 [135] N. Shazeer. GLU variants improve transformer. arXiv preprint arXiv:2002.05202, 2020. [136] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In International Conference on Learning Representations (ICLR). 2017. [137] N. Shazeer and M. Stern."},{"citing_arxiv_id":"2605.15871","ref_index":120,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Agentic Discovery of Neural Architectures: AIRA-Compose and AIRA-Design","primary_cat":"cs.AI","submitted_at":"2026-05-15T11:40:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Multi-agent LLM systems discover new Transformer and hybrid architectures that outperform Llama 3.2 at 1B scale and approach human SOTA on long-range benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15403","ref_index":33,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"$\\phi$-Balancing for Mixture-of-Experts 
Training","primary_cat":"cs.LG","submitted_at":"2026-05-14T20:39:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"φ-balancing is a convex optimization method for population-level expert balance in MoE training that derives an online EMA adjustment and outperforms heuristic baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14597","ref_index":72,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"VMU-Diff: A Coarse-to-fine Multi-source Data Fusion Framework for Precipitation Nowcasting","primary_cat":"cs.CV","submitted_at":"2026-05-14T09:05:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VMU-Diff improves precipitation nowcasting via coarse multi-source Vision Mamba fusion followed by residual conditional diffusion refinement.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14037","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Self-Pruned Key-Value Attention: Learning When to Write by Predicting Future Utility","primary_cat":"cs.LG","submitted_at":"2026-05-13T18:58:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SP-KV trains a utility predictor jointly with the LLM to dynamically prune low-utility KV cache entries, achieving 3-10x memory reduction during generation with negligible performance loss.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13989","ref_index":50,"ref_count":3,"confidence":0.98,"is_internal_anchor":true,"paper_title":"VectraYX-Nano: A 42M-Parameter Spanish Cybersecurity Language Model with Curriculum Learning and Native Tool 
Use","primary_cat":"cs.CL","submitted_at":"2026-05-13T18:03:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Trains a 42M-parameter Spanish cybersecurity LLM from scratch with curriculum phases and achieves 0.23 tool-selection accuracy after SFT mixture rebalancing to 1:21 tool-use ratio.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13807","ref_index":41,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Parallel Scan Recurrent Neural Quantum States for Scalable Variational Monte Carlo","primary_cat":"cond-mat.str-el","submitted_at":"2026-05-13T17:36:32+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PSR-NQS makes recurrent neural quantum states scalable for variational Monte Carlo by using parallel scan recurrence, reaching accurate results on 52x52 two-dimensional lattices.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"of the previous input and previous hidden state, which injects new information. From this perspective, the LRU keeps the recurrence itself linear, with a fixed diagonal propagator and linear input injection. Nonlinearity is added only outside this propagation step, through local nonlinear maps such as multilayer perceptrons (MLPs) or gated linear units (GLUs) [41], which can be viewed as input-dependent gates that modulate the transmitted signal. 
By contrast, in minGRU, the update gate $z_t = \\sigma(W_z x_{t-1} + b_z)$ (11) makes the recurrence input-dependent, so the nonlinearity appears directly inside the state update, $h_t = z_t \\odot h_{t-1} + (1 - z_t) \\odot (W_h x_{t-1} + b_h)$. (12) This viewpoint also helps situate other recent architec-"},{"citing_arxiv_id":"2605.13352","ref_index":42,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"GeoFlowVLM: Geometry-Aware Joint Uncertainty for Frozen Vision-Language Embedding","primary_cat":"cs.LG","submitted_at":"2026-05-13T11:12:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"GeoFlowVLM learns joint distributions of l2-normalized VLM embeddings on the product hypersphere via Riemannian flow matching to expose both aleatoric and epistemic uncertainty through derived entropy and typicality scores.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12492","ref_index":69,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Pion: A Spectrum-Preserving Optimizer via Orthogonal Equivalence Transformation","primary_cat":"cs.LG","submitted_at":"2026-05-12T17:59:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Pion is an optimizer that preserves the singular values of weight matrices in LLM training by applying orthogonal equivalence transformations.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"training practical LLMs demands additional design choices for greater stability. To this end, we explore the following design principles. We note that our exploration is by no means comprehensive, but rather represents an initial yet principled effort toward building a stable spectrum-preserving optimizer. 
For rapid prototyping, we perform all design explorations using a 60M-parameter LLaMA-based model [69, 76, 91], a common setup for ablation [26, 60, 92]. All the models in this section are trained on C4 [63] with sequence length 256 for 9.6B tokens, ensuring sufficient training. 2.4.1 Consistent Update [Figure 2: Inconsistent updates in Pion.] To train deep neural networks effectively, prior work [4, 24,"},{"citing_arxiv_id":"2605.12011","ref_index":45,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"CaloArt: Large-Patch x-Prediction Diffusion Transformers for High-Granularity Calorimeter Shower Generation","primary_cat":"physics.ins-det","submitted_at":"2026-05-12T12:00:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"CaloArt achieves top FPD, high-level, and classifier metrics on CaloChallenge datasets 2 and 3 while keeping single-GPU generation at 9-11 ms per shower by combining large-patch tokenization, x-prediction, and conditional flow matching.","context_count":1,"top_context_role":"method","top_context_polarity":"background","context_text":"Based on the ablation in Section 4.4, we use the ape+rope configuration as the default setting in the experiments below. Modern transformer refinements. CaloArt incorporates several backbone refinements commonly used in recent task-agnostic diffusion transformers. Following the LightningDiT modernization recipe [44], we replace the standard FFN with a SwiGLU FFN [45], replace LayerNorm with RMSNorm [46], and apply query-key normalization in the attention module [47]. Overall, CaloArt keeps the direct full-shower generation setup of CaloDiT-style models, but strengthens the backbone through practical updates to positional encoding, conditioning modulation, and transformer block components. 
The resulting model is intended as a"},{"citing_arxiv_id":"2605.11558","ref_index":63,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"A Composite Activation Function for Learning Stable Binary Representations","primary_cat":"cs.LG","submitted_at":"2026-05-12T05:41:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"HTAF is a sigmoid-tanh composite that approximates the Heaviside function to allow stable gradient training of binary activation networks, yielding ICBMs with stable discretization and competitive performance on image tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[61] Johannes Schmidt-Hieber. Nonparametric regression using deep neural networks with ReLU activation function. The Annals of Statistics, 48(4):1875 - 1897, 2020. [62] Abhronil Sengupta, Yuting Ye, Robert Wang, Chiao Liu, and Kaushik Roy. Going deeper in spiking neural networks: Vgg and residual architectures. Frontiers in neuroscience, 13:95, 2019. [63] Noam Shazeer. Glu variants improve transformer. arXiv preprint arXiv:2002.05202, 2020. [64] Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card. arXiv preprint arXiv:2601.03267, 2025. 
[65] Vasu Singla, Sahil Singla, Soheil Feizi, and David Jacobs."},{"citing_arxiv_id":"2605.11408","ref_index":42,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MaskTab: Scalable Masked Tabular Pretraining with Scaling Laws and Distillation for Industrial Classification","primary_cat":"cs.LG","submitted_at":"2026-05-12T01:56:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MaskTab is a masked pretraining method for industrial tabular data that delivers measurable gains in classification AUC and KS metrics while enabling effective distillation to smaller models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11327","ref_index":30,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Neural Statistical Functions","primary_cat":"cs.LG","submitted_at":"2026-05-11T23:25:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Neural statistical functions use prefix statistics to unify and directly predict statistical quantities over continuous ranges from pre-trained single-sample models without repeated sampling.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[27] Prajit Ramachandran, Barret Zoph, and Quoc V Le. Searching for activation functions. arXiv preprint arXiv:1710.05941, 2017. [28] Christian P Robert, George Casella, and George Casella. Monte Carlo statistical methods, volume 2. Springer, 2004. [29] Yulia Rubanova, Ricky TQ Chen, and David K Duvenaud. Latent ordinary differential equations for irregularly-sampled time series. NeurIPS, 2019. [30] Noam Shazeer. Glu variants improve transformer. arXiv preprint arXiv:2002.05202, 2020. [31] Leslie N Smith and Nicholay Topin. 
Super-convergence: Very fast training of neural networks using large learning rates. In Artificial intelligence and machine learning for multi-domain operations applications. SPIE, 2019. [32] Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever."},{"citing_arxiv_id":"2605.10938","ref_index":61,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ELF: Embedded Language Flows","primary_cat":"cs.CL","submitted_at":"2026-05-11T17:59:29+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Kim, Grigory Bartosh, Dmitry Molchanov, Sergey Markov, and Dmitry Vetrov. TEncDM: Understanding the properties of the diffusion model in the space of language model encodings. In AAAI, 2025. 3, 5, 15 [60] Alexander Shabalin, Simon Elistratov, Viacheslav Meshchaninov, Ildus Sadrtdinov, and Dmitry Vetrov. Why gaussian diffusion models fail on discrete data? arXiv preprint arXiv:2604.02028, 2026. 5 [61] Noam Shazeer. GLU variants improve Transformer. arXiv preprint arXiv:2002.05202, 2020. 24 [62] Junzhe Shen, Jieru Zhao, Ziwei He, and Zhouhan Lin. Codar: Continuous diffusion language models are more powerful than you think. arXiv preprint arXiv:2603.02547, 2026. 
2, 3, 15 [63] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli."},{"citing_arxiv_id":"2605.11061","ref_index":44,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"HiDream-O1-Image: A Natively Unified Image Generative Foundation Model with Pixel-level Unified Transformer","primary_cat":"cs.CV","submitted_at":"2026-05-11T17:59:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A pixel-space Diffusion Transformer with Unified Transformer architecture unifies image generation, editing, and personalization in an end-to-end model that maps all inputs to a shared token space and scales from 8B to over 200B parameters.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[42] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: CVPR (2022) [43] Seedream, T., Chen, Y., Gao, Y., Gong, L., Guo, M., Guo, Q., Guo, Z., Hou, X., Huang, W., Huang, Y., et al.: Seedream 4.0: Toward next-generation multimodal image generation. arXiv preprint arXiv:2509.20427 (2025) [44] Shazeer, N.: Glu variants improve transformer. arXiv preprint arXiv:2002.05202 (2020) [45] Somepalli, G., Singla, V., Goldblum, M., Geiping, J., Goldstein, T.: Diffusion art or digital forgery? investigating data replication in diffusion models. 
In: CVPR (2023) [46] Su, J., Ahmed, M., Lu, Y., Pan, S., Bo, W., Liu, Y.: Roformer: Enhanced transformer with"},{"citing_arxiv_id":"2605.10777","ref_index":40,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Locking Pretrained Weights via Deep Low-Rank Residual Distillation","primary_cat":"cs.LG","submitted_at":"2026-05-11T16:09:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DLR-Lock locks open-weight LLMs against unauthorized fine-tuning by swapping MLPs for deep low-rank residual networks that inflate backprop memory and complicate optimization, yet preserve original capabilities via module-wise distillation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10775","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"On the global convergence of gradient descent for wide shallow models with bounded nonlinearities","primary_cat":"math.OC","submitted_at":"2026-05-11T16:08:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Gradient descent on wide shallow models with bounded nonlinearities converges globally in the mean-field limit as non-global critical points are unstable under the dynamics.","context_count":1,"top_context_role":"extension","top_context_polarity":"extend","context_text":"when m → +∞, then (µ_{m,t})_{t⩾0} converges as m → +∞ to (µ_t)_{t⩾0}. Moreover, there exists C > 0 such that W_2(µ_t, µ_{m,t}) ⩽ e^{Ct} W_2(µ_0, µ_{m,0}) for every t ⩾ 0. The proof of Theorem 1 is given in Section B. Our proof strategy and our assumptions on the initialization are discussed below. Assumption on the initialization. We stress that our condition on the initialization is essentially weaker than that of [CB18, Proposition 2.5], which requires the w-marginal of the initial measure to have compact support. 
In particular, our condition covers the important case of Gaussian initializations. In Section 3 below, we focus on the case where Ω = R^{d_w} × R^{d_θ} and Φ : (w, θ) ↦ ϕ(θ)w where ϕ(θ) is a linear mapping from R^{d_w} to F. The w variable corresponds to the parameters of the output layer (in the case of fully"},{"citing_arxiv_id":"2605.10537","ref_index":16,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Mela: Test-Time Memory Consolidation based on Transformation Hypothesis","primary_cat":"cs.CL","submitted_at":"2026-05-11T13:20:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Mela is a Transformer variant with a dual-frequency Hierarchical Memory Module and MemStack that performs test-time memory consolidation, outperforming baselines on long contexts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10504","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Learning Less Is More: Premature Upper-Layer Attention Specialization Hurts Language Model Pretraining","primary_cat":"cs.CL","submitted_at":"2026-05-11T13:01:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Temporarily reducing the learning rate on upper-layer query and key projections during early GPT pretraining prevents premature attention specialization and improves model performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10391","ref_index":20,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Phoenix-VL 1.5 Medium Technical Report","primary_cat":"cs.CL","submitted_at":"2026-05-11T11:36:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"Phoenix-VL 1.5 Medium 
is a 123B-parameter natively multimodal model that reaches state-of-the-art results on Singapore multimodal, legal, and policy benchmarks after localized training on 1T+ tokens while staying competitive on global benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10288","ref_index":49,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"BROS: Bias-Corrected Randomized Subspaces for Memory-Efficient Single-Loop Bilevel Optimization","primary_cat":"cs.LG","submitted_at":"2026-05-11T09:50:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"BROS achieves memory-efficient single-loop stochastic bilevel optimization with O(ε^{-2}) sample complexity by performing updates in randomized subspaces and using Rademacher bi-probe correction for unbiased estimation.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"component by query scale rather than folding it into a backend-specific closed-form constant. Notably, RSO reports that activations can dominate trainable-side memory in large-batch regimes [8], making BROS's activation savings important when the lower-level task trains neural networks. We analyze one trainable Transformer decoder block with a SwiGLU feed-forward network [49] and standard multi-head attention [55], i.e., without grouped-query attention [1] (non-GQA). The component-wise derivation is provided in Appendix D.2. Let n denote the hidden size, s the sequence length, b the micro-batch size, and h the number of attention heads. 
We count scalar memory slots, omit upper-level variables and datatype constants, and use a peak-memory proxy that includes persistent lower/auxiliary states, saved activations,"},{"citing_arxiv_id":"2605.09949","ref_index":47,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"From Syntax to Semantics: Unveiling the Emergence of Chirality in SMILES Translation Models","primary_cat":"cs.LG","submitted_at":"2026-05-11T03:53:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Chirality emerges in SMILES translation models through an abrupt encoder-centered reorganization of representations after a long plateau, identified via checkpoint analysis and ablation.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"Specifically, the encoder receives a randomized SMILES, and the decoder is trained to output the corresponding canonical SMILES. This strategy promotes the learning of structure-related features beyond surface string syntax. Given the variance in sequence lengths, we devised a curriculum learning-style bucket sampling strategy to stabilize and accelerate training [47]. The selection probability of each bucket was dynamically adjusted throughout the training process. In the early stages, shorter sequences (ZINC20 (∼100), PubChem (∼100)) were preferentially sampled. 
As training progressed, the probability of sampling longer sequences gradually increased, culminating in equal sampling probabilities for all buckets by the final stage."},{"citing_arxiv_id":"2605.09386","ref_index":23,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Kinetic-Optimal Scheduling with Moment Correction for Metric-Induced Discrete Flow Matching in Zero-Shot Text-to-Speech","primary_cat":"eess.AS","submitted_at":"2026-05-10T07:24:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"GibbsTTS combines a training-free kinetic-optimal scheduler with finite-step moment correction in MI-DFM to deliver top naturalness and strong speaker similarity in zero-shot TTS.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"5 Model For conciseness, this section summarizes the main model design, while detailed training and inference procedures are provided in Appendix E. Specifically, Algorithms 3 and 4 describe the training and inference processes, respectively. Backbone. As shown in Fig. 1, we adopt a DiT [19] backbone and leverage RoPE position embedding [26], SwiGLU [23], and RMSNorm [38]. The timestep and language embeddings are concatenated and used as conditioning for the adaLN-Zero layers in DiT. Input construction. We improve the StableTTS text frontend [1] for text normalization and grapheme-to-phoneme conversion. Codec-token embeddings from all RVQ codebooks are concatenated along the channel dimension and linearly projected to a per-frame embedding, which is then concatenated"}],"limit":50,"offset":0}