CLAD is the first deep learning framework for log anomaly detection that operates directly on compressed byte streams using a dilated convolutional encoder, hybrid Transformer-mLSTM, and two-stage training, achieving 0.9909 average F1-score across five datasets.
super hub Mixed citations
GLU Variants Improve Transformer
Mixed citation behavior. Most common role is background (47%).
abstract
Gated Linear Units (arXiv:1612.08083) consist of the component-wise product of two linear projections, one of which is first passed through a sigmoid function. Variations on GLU are possible, using different nonlinear (or even linear) functions in place of sigmoid. We test these variants in the feed-forward sublayers of the Transformer (arXiv:1706.03762) sequence-to-sequence model, and find that some of them yield quality improvements over the typically-used ReLU or GELU activations.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract Gated Linear Units (arXiv:1612.08083) consist of the component-wise product of two linear projections, one of which is first passed through a sigmoid function. Variations on GLU are possible, using different nonlinear (or even linear) functions in place of sigmoid. We test these variants in the feed-forward sublayers of the Transformer (arXiv:1706.03762) sequence-to-sequence model, and find that some of them yield quality improvements over the typically-used ReLU or GELU activations.
authors
co-cited works
representative citing papers
Test-time training with KV binding reduces to learned linear attention.
LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.
Mamba is a linear-time sequence model using input-dependent selective SSMs that achieves SOTA results across modalities and matches twice-larger Transformers on language modeling with 5x higher inference throughput.
DTM-Codec achieves better reconstruction quality and intelligibility than fixed-frame-rate neural speech codecs at matched total bitrate via dynamic token masking and Path Length Equalization for variable frame rates.
PRA approximates sequential rollout training in parallel for pixel-space AR models via intermediate states and a pixel decoder, achieving FID 2.58 (135M params) and 1.94 (511M params) on ImageNet-1K 256x256, new SOTA among pixel-space AR models.
mRNAutilus generates full-length therapeutic mRNAs via diffusion models and multi-objective guidance, achieving over 400-fold expression gains for luciferase and outperforming baselines for Spike and other targets in zero-shot tests.
Introduces Chess-World-Model benchmark from 10M chess games showing recurrent models (SLiCE, Mamba-3, Gated DeltaNet) outperform Transformers on exact state tracking, with random-play split remaining hard at larger scales.
An in-vitro study with synthetic languages finds cross-lingual transfer depends more on tokenization preserving reusable substructure than on lexical similarity or balance, with transfer emerging in stages.
Bilingual fine-tuning on a new parallel Filipino-English dementia dataset yields Macro-F1 scores of 0.969-0.973 and eliminates cross-lingual degradation for all tested transformers.
MuCRASP prunes VLMs in a CoT-aware manner, outperforming baselines by preserving reasoning quality at 30-50% compression rates on models like Qwen2.5-VL-7B.
DiSI disentangles stochastic interpolants into separate generation and regression paths, allowing controllable transitions between regression and generative image restoration with a unified few-step sampler.
StableHand introduces a quality-aware flow matching framework conditioned on predicted four-channel per-frame hand observation quality to estimate dual-hand world-space motion from egocentric video, achieving SOTA results with 20-25% W-MPJPE reduction on HOT3D and ARCTIC benchmarks.
φ-balancing is a convex optimization method for population-level expert balance in MoE training that derives an online EMA adjustment and outperforms heuristic baselines.
VMU-Diff improves precipitation nowcasting via coarse multi-source Vision Mamba fusion followed by residual conditional diffusion refinement.
PSR-NQS makes recurrent neural quantum states scalable for variational Monte Carlo by using parallel scan recurrence, reaching accurate results on 52x52 two-dimensional lattices.
GeoFlowVLM learns joint distributions of l2-normalized VLM embeddings on the product hypersphere via Riemannian flow matching to expose both aleatoric and epistemic uncertainty through derived entropy and typicality scores.
Neural statistical functions use prefix statistics to unify and directly predict statistical quantities over continuous ranges from pre-trained single-sample models without repeated sampling.
DLR-Lock locks open-weight LLMs against unauthorized fine-tuning by swapping MLPs for deep low-rank residual networks that inflate backprop memory and complicate optimization, yet preserve original capabilities via module-wise distillation.
Temporarily reducing the learning rate on upper-layer query and key projections during early GPT pretraining prevents premature attention specialization and improves model performance.
Chirality emerges in SMILES translation models through an abrupt encoder-centered reorganization of representations after a long plateau, identified via checkpoint analysis and ablation.
GibbsTTS combines a training-free kinetic-optimal scheduler with finite-step moment correction in MI-DFM to deliver top naturalness and strong speaker similarity in zero-shot TTS.
BLT-D, BLT-S, and BLT-DV use block-wise diffusion training and speculative verification to enable parallel byte generation in byte-level LMs, cutting memory-bandwidth cost by over 50%.
Every fixed finite feedforward neural network definable in an o-minimal structure has finite sample complexity in the agnostic PAC setting.
citing papers explorer
-
DINOv2: Learning Robust Visual Features without Supervision
Pith review generated a malformed one-line summary.
-
Illuminating Unified Multimodal Model for Free-form Interleaved Text-Image Generation
ILLUME-X is a unified multimodal model that generates free-form interleaved text-image sequences via an expanded data pipeline, progressive self-adaptive training, and ILScore evaluation, claiming outperformance over prior unified models on style transfer, image decomposition, and storytelling.
-
On Subquadratic Architectures: From Applications to Principles
xLSTM outperforms Mamba-2 and Gated DeltaNet on tasks with complex dependencies because its gating scheme enables more flexible and stable state tracking and memory accumulation.
-
An Exploratory Study into using Machine-Learning for Fast Step-by-step Emulation of Numerical Mechanical Thrombectomy Simulations for Ischemic Stroke
ML surrogates accurately emulate single steps of simplified thrombectomy simulations with speedups but lack stability over long times with complex geometries.
-
SLAP: The Semantic Least Action Principle for Variational Video-Language Modeling
SLAP reframes video interpolation as a variational mechanics boundary value problem on a semantic manifold to enforce object persistence without pixel rendering.
-
General Covariant Action Modeling: Constructing Generalized Manifolds via Spatio-Temporal Decoupling
GAM framework uses arc-length parameterization for temporal invariance and schema-affine factorization for geometric invariance to build a covariant action manifold integrated into VLA models for improved generalization from sparse data.
-
PowLU: An Activation Function for Stable Pre-Training of LLMs
PowLU replaces SwiGLU with a rational-power activation to reduce outlier amplification and numerical instability during large-scale LLM pre-training while matching performance.
-
m3BERT: A Modern, Multi-lingual, Matryoshka Bidirectional Encoder
m3BERT uses a three-stage Matryoshka pretraining approach on a bidirectional encoder to support variable embedding sizes while outperforming prior models on large-scale retrieval tasks.
-
VectraYX-Nano: A 42M-Parameter Spanish Cybersecurity Language Model with Curriculum Learning and Native Tool Use
Trains a 42M-parameter Spanish cybersecurity LLM from scratch with curriculum phases and achieves 0.23 tool-selection accuracy after SFT mixture rebalancing to 1:21 tool-use ratio.
-
BARFI-Q: Quantum-Enhanced Block Attention Residual Fusion Framework for Multivariate Time-Series Forecasting in Atom Interferometry
BARFI-Q integrates patch-based embedding, dual-branch temporal modeling, hierarchical fusion, adaptive block-attention residuals, and quantum feature mapping to forecast atom interferometry time-series, outperforming baselines while representing targets in circular sine-cosine space.
-
RLDX-1 Technical Report
RLDX-1 outperforms frontier VLAs such as π0.5 and GR00T N1.6 on dexterous manipulation benchmarks, reaching 86.8% success on ALLEX humanoid tasks versus around 40% for the baselines.
-
GEGLU-Transformer for IMU-to-EMG Estimation with Few-Shot Adaptation
GEGLU-Transformer reconstructs EMG signals from IMU data with r=0.706 cross-subject and improves to r=0.761 after minimal adaptation under leave-one-subject-out testing.
-
Towards Intrinsic Interpretability of Large Language Models:A Survey of Design Principles and Architectures
This survey organizes intrinsic interpretability approaches for LLMs into five categories—functional transparency, concept alignment, representational decomposability, explicit modularization, and latent sparsity induction—while discussing challenges and future directions.
-
Joint Model Parameter Scaling and Universal-Domain Data Integration for E-commerce Search Ranking
UniScale couples entire-space data construction with a hierarchical fusion transformer to improve scaling behavior and deliver 1.70% purchase and 2.04% GMV lifts in large-scale e-commerce search A/B tests.
-
SAGE-32B: Agentic Reasoning via Iterative Distillation
SAGE-32B improves multi-tool agentic success rates over same-size baselines by combining iterative distillation with an inverse-reasoning meta-cognition head.
-
Predicting one-year clinical instability and mortality in heart failure patients using sequence modeling
Sequence models on EHR data from a Swedish heart failure cohort achieve AUPRCs of 0.555 to 0.854 for one-year instability and mortality predictions and support four care pathways.
-
Hybrid Architectures for Language Models: Systematic Analysis and Design Insights
This work systematically compares inter-layer and intra-layer hybridization strategies for combining self-attention and Mamba-style state space models, evaluating them on language modeling, downstream tasks, long-context performance, scaling, and efficiency to derive optimal design recipes.
-
Gemma: Open Models Based on Gemini Research and Technology
Gemma introduces open 2B and 7B LLMs derived from Gemini technology that beat comparable open models on 11 of 18 text tasks and come with safety assessments.
-
DeepSeek-VL: Towards Real-World Vision-Language Understanding
DeepSeek-VL develops open-source 1.3B and 7B vision-language models that achieve competitive or state-of-the-art results on real-world visual-language benchmarks through diverse data curation, a hybrid vision encoder, and pretraining that preserves language capabilities.
-
Yi: Open Foundation Models by 01.AI
Yi models are 6B and 34B open foundation models pretrained on 3.1T curated tokens that achieve strong benchmark results through data quality and targeted extensions like long context and vision alignment.
-
DeepSeek LLM: Scaling Open-Source Language Models with Longtermism
DeepSeek LLM 67B exceeds LLaMA-2 70B on code, mathematics and reasoning benchmarks after pre-training on 2 trillion tokens and alignment via SFT and DPO.
-
TinyLlama: An Open-Source Small Language Model
TinyLlama is a 1.1B-parameter open-source language model pretrained on 1 trillion tokens that outperforms other open-source models of similar size on downstream tasks.
-
Baichuan 2: Open Large-scale Language Models
Baichuan 2 presents 7B and 13B LLMs trained on 2.6T tokens that match or exceed similar open models on MMLU, CMMLU, GSM8K, HumanEval and excel in medicine and law.
-
Mellum2 Technical Report
Mellum 2 is a 12B MoE model with 2.5B active parameters, trained on 10.6T tokens with MoE, GQA, SWA, and MTP, then post-trained into Instruct and Thinking variants, claimed competitive with 4B-14B models at 2.5B compute.
-
CosmicFish-HRM: Adaptive Reasoning via Hierarchical Recurrent Mechanisms in Compact Language Models
Presents CosmicFish-HRM, a compact LM using hierarchical recurrent reasoning to adapt computation depth per input.
-
Phoenix-VL 1.5 Medium Technical Report
Phoenix-VL 1.5 Medium is a 123B-parameter natively multimodal model that reaches state-of-the-art results on Singapore multimodal, legal, and policy benchmarks after localized training on 1T+ tokens while staying competitive on global benchmarks.
-
LLMOrbit: A Circular Taxonomy of Large Language Models -From Scaling Walls to Agentic AI Systems
A survey taxonomy of LLMs identifies three scaling crises and six efficiency paradigms while tracing the shift from generation to tool-using agents.
-
Gemma 2: Improving Open Language Models at a Practical Size
Gemma 2 models achieve leading performance at their sizes by combining established Transformer modifications with knowledge distillation for the 2B and 9B variants.
-
A Survey of Large Language Models
This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.
-
A Comprehensive Overview of Large Language Models
A survey paper providing an overview of Large Language Models, their background, and recent advances in the field.
- Sparse Layers are Critical to Scaling Looped Language Models
- Depth Registers Unlock W4A4 on SwiGLU: A Reader/Generator Decomposition
- Capacity-Controlled Global Attention for Graph Transformers
- HAFM: Hierarchical Autoregressive Foundation Model for Music Accompaniment Generation
- CalM: A Self-Supervised Foundation Model for Population Dynamics in Calcium Imaging Data