GLU Variants Improve Transformer

Noam Shazeer · 2020 · cs.LG · arXiv 2002.05202

Mixed citation behavior. Most common role is background (47%).

235 Pith papers citing it

Background 47% of classified citations

abstract

Gated Linear Units (arXiv:1612.08083) consist of the component-wise product of two linear projections, one of which is first passed through a sigmoid function. Variations on GLU are possible, using different nonlinear (or even linear) functions in place of sigmoid. We test these variants in the feed-forward sublayers of the Transformer (arXiv:1706.03762) sequence-to-sequence model, and find that some of them yield quality improvements over the typically-used ReLU or GELU activations.
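
The gating described in the abstract can be sketched in a few lines. This is a minimal, dependency-free illustration, not the paper's implementation: the function names (`linear`, `glu_variant`, `ffn_glu`), the dictionary keys used as variant labels, the omission of bias terms, and the toy dimensions are all assumptions made for the example. Each variant differs only in the function applied to the first projection before the component-wise product.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gelu(z):
    # GELU written with the Gaussian CDF (erf form).
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))

# Gate functions applied to the first projection. The keys are
# illustrative labels chosen for this sketch.
GATES = {
    "glu": sigmoid,                       # original sigmoid gate
    "bilinear": lambda z: z,              # linear "gate" (no nonlinearity)
    "reglu": lambda z: max(z, 0.0),       # ReLU gate
    "geglu": gelu,                        # GELU gate
    "swiglu": lambda z: z * sigmoid(z),   # Swish gate
}

def linear(x, W):
    """Project vector x (length n) with matrix W (n rows, m columns)."""
    return [sum(xi * wi for xi, wi in zip(x, col)) for col in zip(*W)]

def glu_variant(x, W, V, gate="glu"):
    """Component-wise product of two projections; the gate acts on x @ W."""
    g = GATES[gate]
    return [g(a) * b for a, b in zip(linear(x, W), linear(x, V))]

def ffn_glu(x, W, V, W2, gate="glu"):
    """Gated feed-forward sublayer: project the gated hidden vector back."""
    return linear(glu_variant(x, W, V, gate), W2)

def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

# Toy sizes (assumed for illustration only).
rng = random.Random(0)
d_model, d_ff = 4, 6
x = [rng.gauss(0.0, 1.0) for _ in range(d_model)]
W = rand_matrix(d_model, d_ff, rng)
V = rand_matrix(d_model, d_ff, rng)
W2 = rand_matrix(d_ff, d_model, rng)

for name in GATES:
    y = ffn_glu(x, W, V, W2, gate=name)
    print(name, [round(v, 3) for v in y])
```

Note that a gated sublayer of this shape uses three weight matrices where a plain ReLU or GELU feed-forward sublayer uses two, so comparisons at equal parameter count would need a smaller hidden width.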

citation-role summary

background 30 · method 24 · dataset 2 · extension 1

citation-polarity summary

background 27 · use method 23 · unclear 4 · use dataset 2 · extend 1

authors

Noam Shazeer

representative citing papers

CLAD: Efficient Log Anomaly Detection Directly on Compressed Representations

cs.LG · 2026-04-14 · unverdicted · novelty 8.0

CLAD is the first deep learning framework for log anomaly detection that operates directly on compressed byte streams using a dilated convolutional encoder, hybrid Transformer-mLSTM, and two-stage training, achieving 0.9909 average F1-score across five datasets.

Test-Time Training with KV Binding Is Secretly Linear Attention

cs.LG · 2026-02-24 · conditional · novelty 8.0

Test-time training with KV binding reduces to learned linear attention.

Large Language Diffusion Models

cs.CL · 2025-02-14 · unverdicted · novelty 8.0

LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

cs.LG · 2023-12-01 · unverdicted · novelty 8.0

Mamba is a linear-time sequence model using input-dependent selective SSMs that achieves SOTA results across modalities and matches twice-larger Transformers on language modeling with 5x higher inference throughput.

DTM-Codec: Dynamic Token Masking for VFR Speech Coding with Efficient Boundary Selection

eess.AS · 2026-06-28 · unverdicted · novelty 7.0

DTM-Codec achieves better reconstruction quality and intelligibility than fixed-frame-rate neural speech codecs at matched total bitrate via dynamic token masking and Path Length Equalization for variable frame rates.

Parallel Rollout Approximation for Pixel-Space Autoregressive Image Generation

cs.CV · 2026-06-26 · unverdicted · novelty 7.0

PRA approximates sequential rollout training in parallel for pixel-space AR models via intermediate states and a pixel decoder, achieving FID 2.58 (135M params) and 1.94 (511M params) on ImageNet-1K 256x256, new SOTA among pixel-space AR models.

mRNAutilus: Multi-Objective-Guided Discrete Generation of mRNA with Optimized Therapeutic Properties

q-bio.BM · 2026-05-29 · unverdicted · novelty 7.0

mRNAutilus generates full-length therapeutic mRNAs via diffusion models and multi-objective guidance, achieving over 400-fold expression gains for luciferase and outperforming baselines for Spike and other targets in zero-shot tests.

Chess-World-Model: A 10M-Game Benchmark for Exact State Tracking from Chess Move Sequences

cs.LG · 2026-05-28 · unverdicted · novelty 7.0

Introduces Chess-World-Model benchmark from 10M chess games showing recurrent models (SLiCE, Mamba-3, Gated DeltaNet) outperform Transformers on exact state tracking, with random-play split remaining hard at larger scales.

An In-Vitro Study on Cross-Lingual Generalization in Language Models

cs.CL · 2026-05-26 · unverdicted · novelty 7.0

An in-vitro study with synthetic languages finds cross-lingual transfer depends more on tokenization preserving reusable substructure than on lexical similarity or balance, with transfer emerging in stages.

Forgotten Words: Benchmarking NeoBERT for Dementia Detection in Low-Resource Conversational Filipino and English Speech

cs.CL · 2026-05-25 · unverdicted · novelty 7.0

Bilingual fine-tuning on a new parallel Filipino-English dementia dataset yields Macro-F1 scores of 0.969-0.973 and eliminates cross-lingual degradation for all tested transformers.

MuCRASP: Multimodal Chain-of-thought Reasoning aware Structured Pruning

cs.AI · 2026-05-25 · unverdicted · novelty 7.0

MuCRASP prunes VLMs in a CoT-aware manner, outperforming baselines by preserving reasoning quality at 30-50% compression rates on models like Qwen2.5-VL-7B.

Disentangling Generation and Regression in Stochastic Interpolants for Controllable Image Restoration

cs.CV · 2026-05-20 · unverdicted · novelty 7.0

DiSI disentangles stochastic interpolants into separate generation and regression paths, allowing controllable transitions between regression and generative image restoration with a unified few-step sampler.

StableHand: Quality-Aware Flow Matching for World-Space Dual-Hand Motion Estimation from Egocentric Video

cs.CV · 2026-05-18 · unverdicted · novelty 7.0

StableHand introduces a quality-aware flow matching framework conditioned on predicted four-channel per-frame hand observation quality to estimate dual-hand world-space motion from egocentric video, achieving SOTA results with 20-25% W-MPJPE reduction on HOT3D and ARCTIC benchmarks.

$\phi$-Balancing for Mixture-of-Experts Training

cs.LG · 2026-05-14 · unverdicted · novelty 7.0

φ-balancing is a convex optimization method for population-level expert balance in MoE training that derives an online EMA adjustment and outperforms heuristic baselines.

VMU-Diff: A Coarse-to-fine Multi-source Data Fusion Framework for Precipitation Nowcasting

cs.CV · 2026-05-14 · unverdicted · novelty 7.0

VMU-Diff improves precipitation nowcasting via coarse multi-source Vision Mamba fusion followed by residual conditional diffusion refinement.

Parallel Scan Recurrent Neural Quantum States for Scalable Variational Monte Carlo

cond-mat.str-el · 2026-05-13 · conditional · novelty 7.0

PSR-NQS makes recurrent neural quantum states scalable for variational Monte Carlo by using parallel scan recurrence, reaching accurate results on 52x52 two-dimensional lattices.

GeoFlowVLM: Geometry-Aware Joint Uncertainty for Frozen Vision-Language Embedding

cs.LG · 2026-05-13 · unverdicted · novelty 7.0

GeoFlowVLM learns joint distributions of l2-normalized VLM embeddings on the product hypersphere via Riemannian flow matching to expose both aleatoric and epistemic uncertainty through derived entropy and typicality scores.

Neural Statistical Functions

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

Neural statistical functions use prefix statistics to unify and directly predict statistical quantities over continuous ranges from pre-trained single-sample models without repeated sampling.

Locking Pretrained Weights via Deep Low-Rank Residual Distillation

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

DLR-Lock locks open-weight LLMs against unauthorized fine-tuning by swapping MLPs for deep low-rank residual networks that inflate backprop memory and complicate optimization, yet preserve original capabilities via module-wise distillation.

Learning Less Is More: Premature Upper-Layer Attention Specialization Hurts Language Model Pretraining

cs.CL · 2026-05-11 · unverdicted · novelty 7.0

Temporarily reducing the learning rate on upper-layer query and key projections during early GPT pretraining prevents premature attention specialization and improves model performance.

From Syntax to Semantics: Unveiling the Emergence of Chirality in SMILES Translation Models

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

Chirality emerges in SMILES translation models through an abrupt encoder-centered reorganization of representations after a long plateau, identified via checkpoint analysis and ablation.

Kinetic-Optimal Scheduling with Moment Correction for Metric-Induced Discrete Flow Matching in Zero-Shot Text-to-Speech

eess.AS · 2026-05-10 · unverdicted · novelty 7.0

GibbsTTS combines a training-free kinetic-optimal scheduler with finite-step moment correction in MI-DFM to deliver top naturalness and strong speaker similarity in zero-shot TTS.

Fast Byte Latent Transformer

cs.CL · 2026-05-08 · unverdicted · novelty 7.0

BLT-D, BLT-S, and BLT-DV use block-wise diffusion training and speculative verification to enable parallel byte generation in byte-level LMs, cutting memory-bandwidth cost by over 50%.

Every Feedforward Neural Network Definable in an o-Minimal Structure Has Finite Sample Complexity

stat.ML · 2026-05-08 · unverdicted · novelty 7.0

Every fixed finite feedforward neural network definable in an o-minimal structure has finite sample complexity in the agnostic PAC setting.

citing papers explorer

Showing 50 of 235 citing papers.

BloombergGPT: A Large Language Model for Finance cs.LG · 2023-03-30 · conditional · none · ref 101 · internal anchor
BloombergGPT is a 50B parameter LLM trained on a 708B token mixed financial and general dataset that outperforms prior models on financial benchmarks while preserving general LLM performance.
PaLM: Scaling Language Modeling with Pathways cs.CL · 2022-04-05 · accept · none · ref 138 · internal anchor
PaLM 540B demonstrates continued scaling benefits by setting new few-shot SOTA results on hundreds of benchmarks and outperforming humans on BIG-bench.
LaMDA: Language Models for Dialog Applications cs.CL · 2022-01-20 · unverdicted · none · ref 93 · internal anchor
LaMDA shows that fine-tuning on human-value annotations and consulting external knowledge sources significantly improves safety and factual grounding in large dialog models beyond what scaling alone achieves.
How Much Knowledge Can You Pack Into the Parameters of a Language Model? cs.CL · 2020-02-10 · accept · none · ref 34 · internal anchor
Fine-tuned language models store knowledge in parameters to answer questions competitively with retrieval-based open-domain QA systems.
TuringViT: Making SOTA Vision Transformers Accessible to All cs.CV · 2026-06-23 · unverdicted · none · ref 32 · internal anchor
TuringViT claims a new ViT design with linear attention and curated data that matches SOTA performance using 10% of typical pretraining data while supporting dynamic resolutions and improving VLM integration.
SamatNext v0.2-B: An Exploratory Study of RMS-Normalized Hybrid Decoders for Curriculum Retention in Small Code Models cs.LG · 2026-06-20 · unverdicted · none · ref 4 · internal anchor
SamatNext v0.2-B reaches 100% on Stage 5 and retains 98.8% of Stage 3 behavior versus 97.6% and 6% for the Transformer baseline in a controlled curriculum setting.
Towards Engineering Scaling Laws with Pretraining Data Composition hep-ex · 2026-06-18 · unverdicted · none · ref 36 · internal anchor
Pretraining data composition can be used to engineer neural scaling laws in hadronic jet classification toward data-heavy rather than model-size-heavy regimes.
Beyond Perplexity: UTF-8 Validity in Byte-aware Language Models cs.CL · 2026-06-12 · unverdicted · none · ref 16 · internal anchor
A 355M-parameter byte-level LM on 80B multilingual tokens exhibits UTF-8 validity converging after 4.2B tokens versus 2.1B for perplexity, with higher validity on rare characters than common ones.
MixFP4: Enhancing NVFP4 with Adaptive FP4/INT4 Block Representations cs.AR · 2026-05-29 · unverdicted · none · ref 15 · internal anchor
MixFP4 extends NVFP4 by adaptively selecting between two FP4 micro-formats per block using repurposed scale sign bits and a unified E2M2 compute path, claiming better accuracy than standard NVFP4 at 3.1% area and 1.5% power overhead.
Adaptive Targeted Dynamic Chunking for Tokenization-Free Hierarchical Model cs.CL · 2026-05-28 · unverdicted · none · ref 20 · internal anchor
ATDC applies curriculum learning to dynamically control chunk compression in hierarchical byte models, reporting competitive BPB on FineWeb-Edu 100B and more stable training than fixed-ratio baselines.
PrunePath: Towards Highly Structured Sparse Language Models cs.CL · 2026-05-27 · unverdicted · none · ref 4 · internal anchor
PrunePath introduces budget-adaptive structured sparsification for FFN layers via softmax routing and cumulative-mass thresholds on top of MoEfication, with Triton kernels for inference speedups.
NITP: Next Implicit Token Prediction for LLM Pre-training cs.CL · 2026-05-24 · unverdicted · none · ref 41 · internal anchor
NITP adds dense supervision from shallow model layers to predict implicit next-token semantics, yielding consistent downstream gains on 0.5B-9B models with ~2% extra training FLOPs.
AssetGen: Deployable 3D Asset Generation at Interactive Speed cs.GR · 2026-05-22 · unverdicted · none · ref 19 · internal anchor
AssetGen is a system that produces deployable 3D assets including meshes, baked normals, and textures from a single reference image in under 30 seconds via a coarse-to-refine VecSet pipeline and co-designed optimizations.
Fine-Tuning Without Forgetting via Loss-Adaptive Learning Rates cs.LG · 2026-05-19 · unverdicted · none · ref 57 · internal anchor
FINCH is a loss-adaptive learning-rate schedule that reduces forgetting by 93% on average during LLM fine-tuning while matching standard task performance across several benchmarks.
Graph Hierarchical Recurrence for Long-Range Generalization cs.LG · 2026-05-18 · unverdicted · none · ref 30 · internal anchor
GHR uses hierarchical recurrence on pooled graph abstractions to improve long-range dependency capture and out-of-range generalization while using far fewer parameters than existing models.
Pion: A Spectrum-Preserving Optimizer via Orthogonal Equivalence Transformation cs.LG · 2026-05-12 · unverdicted · none · ref 69 · internal anchor
Pion is an optimizer that preserves the singular values of weight matrices in LLM training by applying orthogonal equivalence transformations.
CaloArt: Large-Patch x-Prediction Diffusion Transformers for High-Granularity Calorimeter Shower Generation physics.ins-det · 2026-05-12 · unverdicted · none · ref 45 · internal anchor
CaloArt achieves top FPD, high-level, and classifier metrics on CaloChallenge datasets 2 and 3 while keeping single-GPU generation at 9-11 ms per shower by combining large-patch tokenization, x-prediction, and conditional flow matching.
A Composite Activation Function for Learning Stable Binary Representations cs.LG · 2026-05-12 · unverdicted · none · ref 63 · internal anchor
HTAF is a sigmoid-tanh composite that approximates the Heaviside function to allow stable gradient training of binary activation networks, yielding ICBMs with stable discretization and competitive performance on image tasks.
MaskTab: Scalable Masked Tabular Pretraining with Scaling Laws and Distillation for Industrial Classification cs.LG · 2026-05-12 · unverdicted · none · ref 42 · internal anchor
MaskTab is a masked pretraining method for industrial tabular data that delivers measurable gains in classification AUC and KS metrics while enabling effective distillation to smaller models.
Mela: Test-Time Memory Consolidation based on Transformation Hypothesis cs.CL · 2026-05-11 · unverdicted · none · ref 16 · internal anchor
Mela is a Transformer variant with a dual-frequency Hierarchical Memory Module and MemStack that performs test-time memory consolidation, outperforming baselines on long contexts.
SimReg: Achieving Higher Performance in the Pretraining via Embedding Similarity Regularization cs.CL · 2026-05-09 · unverdicted · none · ref 10 · internal anchor
SimReg regularization accelerates LLM pretraining convergence by over 30% and raises average zero-shot performance by over 1% across benchmarks.
TIDE: Every Layer Knows the Token Beneath the Context cs.CL · 2026-05-07 · unverdicted · none · ref 47 · internal anchor
TIDE augments standard transformers with per-layer token embedding injection via an ensemble of memory blocks and a depth-conditioned router to mitigate rare-token undertraining and contextual collapse.
MDN: Parallelizing Stepwise Momentum for Delta Linear Attention cs.LG · 2026-05-07 · unverdicted · none · ref 22 · internal anchor
MDN parallelizes stepwise momentum for delta linear attention using geometric reordering and dynamical systems analysis, yielding performance gains over Mamba2 and GDN on 400M and 1.3B models.
ELAS: Efficient Pre-Training of Low-Rank Large Language Models via 2:4 Activation Sparsity cs.LG · 2026-05-05 · unverdicted · none · ref 10 · internal anchor
ELAS pre-trains low-rank LLMs by applying 2:4 activation sparsity after squared ReLU to cut memory and accelerate training with minimal performance loss.
Colinearity Decay: Training Quantization-Friendly ViTs with Outlier Decay cs.CV · 2026-05-02 · unverdicted · none · ref 33 · internal anchor
Colinearity-Decay regularizer trains ViTs that maintain or improve full-precision accuracy while delivering higher accuracy after low-bit quantization on ImageNet and COCO tasks.
Caracal: Causal Architecture via Spectral Mixing cs.LG · 2026-04-30 · unverdicted · none · ref 60 · internal anchor
Caracal is a Fourier-based sequence mixing architecture that achieves causal autoregressive modeling with standard operators and competitive performance on long sequences.
Nautile-370M: Spectral Memory Meets Attention in a Small Reasoning Model cs.LG · 2026-04-27 · unverdicted · none · ref 13 · internal anchor
Nautile-370M is a hybrid small language model using SeqCond Attention layers alternating with transformers, with a claimed proof that the spectral operator matches full self-attention expressiveness in the continuous limit.
When Does Removing LayerNorm Help? Activation Bounding as a Regime-Dependent Implicit Regularizer cs.LG · 2026-04-25 · unverdicted · none · ref 23 · internal anchor
DyT improves validation loss 27% at 64M params/1M tokens but worsens it 19% at 118M tokens, with saturation levels predicting the sign of the effect.
Uni-Encoder Meets Multi-Encoders: Representation Before Fusion for Brain Tumor Segmentation with Missing Modalities cs.CV · 2026-04-24 · unverdicted · none · ref 38 · internal anchor
UniME combines a pretrained unified ViT encoder with modality-specific CNN encoders to improve brain tumor segmentation performance when some MRI modalities are missing.
Sapiens2 cs.CV · 2026-04-23 · unverdicted · none · ref 24 · internal anchor
Sapiens2 improves pretraining, data scale, and architecture over its predecessor to set new state-of-the-art results on human pose estimation, body-part segmentation, normal estimation, and new tasks like pointmap and albedo estimation.
Hyperloop Transformers cs.LG · 2026-04-23 · unverdicted · none · ref 22 · internal anchor
Hyperloop Transformers outperform standard and mHC Transformers with roughly 50% fewer parameters by looping a middle block of layers and applying hyper-connections only after each loop.
Sessa: Selective State Space Attention cs.LG · 2026-04-20 · unverdicted · none · ref 32 · internal anchor
Sessa integrates attention within recurrent paths to achieve power-law memory tails and flexible non-decaying selective retrieval, outperforming baselines on long-context tasks.
LACE: Lattice Attention for Cross-thread Exploration cs.AI · 2026-04-16 · unverdicted · none · ref 30 · 3 links · internal anchor
LACE enables concurrent reasoning paths in LLMs to interact via lattice attention and a synthetic training pipeline, raising accuracy more than 7 points over independent parallel search.
Efficient Learned Data Compression via Dual-Stream Feature Decoupling cs.CL · 2026-04-08 · unverdicted · none · ref 3 · internal anchor
A dual-stream decoupler plus hierarchical refiner and parallel pipeline yields state-of-the-art compression ratio and throughput with lowest reported latency and memory in learned data compression.
GLENN: Neural network-enhanced computation of Ginzburg-Landau energy minimizers math.NA · 2026-03-19 · unverdicted · none · ref 32 · internal anchor
A hybrid neural network and finite element strategy computes Ginzburg-Landau energy minimizers across varying kappa values, using the network output either directly or as a starting guess for reliable classical optimization.
Improved Mean Flows: On the Challenges of Fastforward Generative Models cs.CV · 2025-12-01 · unverdicted · none · ref 41 · internal anchor
Improved MeanFlow (iMF) reaches 1.72 FID on ImageNet 256x256 with one function evaluation by reformulating the training objective as a regression on instantaneous velocity and treating guidance as flexible conditioning variables.
Rethinking Generative Image Pretraining: How Far Are We From Scaling Up Next-Pixel Prediction? cs.CV · 2025-11-11 · unverdicted · none · ref 8 · internal anchor
Scaling experiments on autoregressive next-pixel Transformers at 32x32 show task-dependent optimal data-to-model ratios and project compute as the dominant bottleneck for future high-resolution feasibility.
gpt-oss-120b & gpt-oss-20b Model Card cs.CL · 2025-08-08 · unverdicted · none · ref 9 · internal anchor
OpenAI releases two open-weight reasoning models, gpt-oss-120b and gpt-oss-20b, trained via distillation and RL with claimed strong results on math, coding, and safety benchmarks.
Hierarchical Reasoning Model cs.AI · 2025-06-26 · unverdicted · none · ref 57 · internal anchor
HRM is a recurrent architecture with high-level planning and low-level execution modules that reaches near-perfect accuracy on complex Sudoku, maze navigation, and ARC benchmarks using 27M parameters and 1000 samples without pre-training or CoT supervision.
Lightweight Domain Adaptation of a Large Language Model for Legal Assistance in the Indian Context cs.CL · 2025-05-28 · unverdicted · none · ref 19 · internal anchor
An 8B Llama model with RAG and prompt engineering scores 60.08% on the All-India Bar Examination, slightly above GPT-3.5 Turbo while claiming 22 times greater parameter efficiency via a new PEI metric.
Emerging Properties in Unified Multimodal Pretraining cs.CV · 2025-05-20 · unverdicted · none · ref 65 · internal anchor
BAGEL is a unified decoder-only model that develops emerging complex multimodal reasoning abilities after pretraining on large-scale interleaved data and outperforms prior open-source unified models.
Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference cs.CL · 2024-12-18 · unverdicted · none · ref 183 · internal anchor
ModernBERT is a new bidirectional encoder model achieving SOTA performance on diverse classification and retrieval benchmarks while offering superior speed and memory efficiency for long-context inference.
Movie Gen: A Cast of Media Foundation Models cs.CV · 2024-10-17 · unverdicted · none · ref 61 · internal anchor
A 30B-parameter transformer and related models generate high-quality videos and audio, claiming state-of-the-art results on text-to-video, video editing, personalization, and audio generation tasks.
mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models cs.CV · 2024-08-09 · unverdicted · none · ref 55 · internal anchor
mPLUG-Owl3 introduces hyper attention blocks to integrate vision and language for long image-sequence understanding and reports SOTA results on single-image, multi-image, and video benchmarks.
InternLM2 Technical Report cs.CL · 2024-03-26 · unverdicted · none · ref 217 · internal anchor
InternLM2 is a new open-source LLM that outperforms prior versions on 30 benchmarks and long-context tasks through scaled pre-training to 32k tokens and a conditional online RLHF alignment strategy.
MobileVLM : A Fast, Strong and Open Vision Language Assistant for Mobile Devices cs.CV · 2023-12-28 · unverdicted · none · ref 105 · internal anchor
MobileVLM achieves on-par performance with much larger vision-language models on standard benchmarks while delivering state-of-the-art inference speeds of 21.5 tokens per second on Snapdragon 888 CPU and 65.3 on Jetson Orin GPU.
mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration cs.CL · 2023-11-07 · unverdicted · none · ref 53 · internal anchor
mPLUG-Owl2 presents a modular MLLM architecture that enables modality collaboration via shared functional modules and modality-adaptive components, achieving SOTA on both text and multi-modal tasks with one generic model.
Revisiting Sentiment Analysis for Software Engineering in the Era of Large Language Models cs.SE · 2023-10-17 · unverdicted · none · ref 55 · internal anchor
bLLMs achieve state-of-the-art results on limited and imbalanced SE sentiment datasets even in zero-shot settings, but fine-tuned sLLMs outperform when ample balanced training data is available.
DINOv2: Learning Robust Visual Features without Supervision cs.CV · 2023-04-14 · unverdicted · none · ref 22 · internal anchor
Illuminating Unified Multimodal Model for Free-form Interleaved Text-Image Generation cs.CV · 2026-06-29 · unverdicted · none · ref 39 · internal anchor
ILLUME-X is a unified multimodal model that generates free-form interleaved text-image sequences via an expanded data pipeline, progressive self-adaptive training, and ILScore evaluation, claiming outperformance over prior unified models on style transfer, image decomposition, and storytelling.
