Long Short-Term Memory

Jürgen Schmidhuber, Sepp Hochreiter · 1997 · Neural Computation · DOI 10.1162/neco.1997.9.8.1735 · arXiv gov/9377276

Canonical reference. 74% of citing Pith papers cite this work as background.

140 Pith papers citing it

80.8k external citations · Crossref

Background 74% of classified citations

citation-role summary

background 15 · baseline 2 · method 2

citation-polarity summary

background 14 · baseline 2 · use method 2 · support 1

authors

Jürgen Schmidhuber Sepp Hochreiter

representative citing papers

HS-FNO: History-Space Fourier Neural Operator for Non-Markovian Partial Differential Equations

cs.LG · 2026-05-10 · conditional · novelty 8.0 · 2 refs

HS-FNO lifts the state to include history and decomposes updates into a learned future-slice predictor plus an exact shift-append transport, yielding lower rollout errors than standard or lag-stack FNO baselines on five non-Markovian PDE families.

Coupling Precipitation Forecasting and Early Warning with Reverse-Martingale Recurrent Neural Networks

stat.AP · 2026-07-01 · unverdicted · novelty 7.0

A reverse-martingale RNN matches standard precipitation forecast skill on data from four climates while generating drought early warnings ahead of SPI-3 in some regions via backward coherence defects.

RAVEN: A Regime-Aware Variable-context Expert Network for Financial Time Series Forecasting

cs.LG · 2026-06-23 · unverdicted · novelty 7.0

RAVEN proposes a regime-aware MoE architecture with cumulative importance thresholding and correlation-aware weighting to adaptively select temporal context for non-stationary financial forecasting.

ConTex: Reformulating Counterfactual Generation For Time Series Forecasting

cs.LG · 2026-06-16 · unverdicted · novelty 7.0

ConTex learns a global intervention strategy via a decomposed temporal-conditional encoder architecture to generate consistent, sparse counterfactuals for time series models in a single forward pass.

Causally Evaluating the Learnability of Formal Language Tasks

cs.CL · 2026-06-08 · unverdicted · novelty 7.0

Introduces the binning semiring and causal graphical models to show that correlational evaluation of learnability in formal language tasks leads to incorrect conclusions from confounders.

RESCAST-100K: A Comprehensive Dataset for Cross-Domain Residential Load and Indoor Temperature Forecasting

cs.LG · 2026-06-01 · unverdicted · novelty 7.0

RESCAST-100K is a large-scale benchmark dataset of simulated and real residential energy data for cross-domain load and temperature forecasting.

Repetition Mismatch: Why Data Mixture Experiments Don't Scale and How to Fix Them

cs.LG · 2026-05-29 · conditional · novelty 7.0

Repetition rate mismatch between small-scale proxies and target budgets is the main reason data mixture experiments do not scale; a subsampling procedure that equalizes repetition rates recovers optimal mixtures from 1/16-scale experiments.

Faithful Embeddings of Irregular and Asynchronous Data for Online Log-NCDEs

cs.LG · 2026-05-28 · unverdicted · novelty 7.0

Introduces a continuous injective embedding for Log-NCDEs that builds log-signatures from data increments without interpolation or imputation while preserving compact-set universality.

'Si'multaneous 'S'patial-'T'emporal Message Passing for Dynamic Graph Representation Learning

cs.LG · 2026-05-25 · unverdicted · novelty 7.0

SiST-GNN performs simultaneous spatial-temporal message passing on a temporally augmented graph and reports 109-277% gains in fixed-split dynamic link prediction over prior methods.

UWM-JEPA: Predictive World Models That Imagine in Belief Space

cs.LG · 2026-05-25 · unverdicted · novelty 7.0

UWM-JEPA uses a density-matrix latent and unitary predictor in JEPA to preserve joint-state spectrum during blind rollouts, achieving 0.77 accuracy on a five-step hidden-velocity task versus 0.53 for an LSTM baseline.

Streaming Reinforcement Learning under Partial Observability with Real-Time Recurrent Learning

cs.LG · 2026-05-23 · unverdicted · novelty 7.0

Recurrent trace units enable exact RTRL with linear time/memory for streaming RL under partial observability, sustaining performance on long-chain memory tasks where TBPTT baselines collapse.

PaperFit: Vision-in-the-Loop Typesetting Optimization for Scientific Documents

cs.AI · 2026-05-11 · unverdicted · novelty 7.0

PaperFit uses rendered page images in a closed loop to diagnose and repair typesetting defects in LaTeX documents, outperforming baselines on a new benchmark of 200 papers.

LLM-guided Semi-Supervised Approaches for Social Media Crisis Data Classification

cs.AI · 2026-05-08 · conditional · novelty 7.0

LG-CoTrain, an LLM-guided co-training method, outperforms classical semi-supervised baselines for crisis tweet classification in low-resource settings with 5-25 labeled examples per class.

SGC-RML: A reliable and interpretable longitudinal assessment for PD in real-world DNS

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

SGC-RML creates an 8D symptom atlas from multimodal PD data and integrates conformal calibration to deliver reliable, rejectable longitudinal assessments.

BadmintonGRF: A Multimodal Dataset and Benchmark for Markerless Ground Reaction Force Estimation in Badminton

cs.CV · 2026-05-03 · unverdicted · novelty 7.0

BadmintonGRF is a new public multimodal dataset and benchmark that pairs multi-view video with instrumented GRF for markerless load estimation in badminton.

Affinity Is Not Enough: Recovering the Free Energy Principle in Mixture-of-Experts

cs.LG · 2026-05-01 · conditional · novelty 7.0

Adding temporal memory via LIF, precision-weighted gating, and anticipatory prediction to MoE routers recovers effective expert selection at distribution transitions, with ablation confirming a super-additive beta-ant interaction.

AsmRAG: LLM-Driven Malware Detection by Retrieving Functionally Similar Assembly Code

cs.CR · 2026-04-25 · unverdicted · novelty 7.0

AsmRAG detects malware at 96% F1 and attributes families at 95% F1 by retrieving functionally similar assembly code via LLM embeddings and density-weighted anchor selection, remaining robust to metamorphic obfuscation.

Preconditioned DeltaNet: Curvature-aware Sequence Modeling for Linear Recurrences

cs.LG · 2026-04-22 · unverdicted · novelty 7.0

Preconditioned delta-rule models with a diagonal curvature approximation improve upon standard DeltaNet, GDN, and KDA by better approximating the test-time regression objective.

BRIDGE and TCH-Net: Heterogeneous Benchmark and Multi-Branch Baseline for Cross-Domain IoT Botnet Detection

cs.CR · 2026-04-13 · unverdicted · novelty 7.0

BRIDGE creates the first formal heterogeneous multi-dataset benchmark for IoT botnet detection with LODO evaluation, and TCH-Net achieves mean LODO F1 of 0.5577 while reaching F1 0.8296 on standard tests, outperforming twelve baselines.

FactorEngine: A Program-level Knowledge-Infused Factor Mining Framework for Quantitative Investment

cs.AI · 2026-03-17 · unverdicted · novelty 7.0

FactorEngine mines alpha factors as Turing-complete code via LLM-guided directional search, parameter separation, and a multi-agent pipeline that converts financial reports into executable programs, delivering higher IC/ICIR and Sharpe ratios than baselines in backtests.

Reduced-Order Surrogates for Forced Flexible Mesh Coastal-Ocean Models

cs.CE · 2026-02-05 · unverdicted · novelty 7.0

Koopman autoencoders with forcings and temporal unrolling deliver accurate year-long predictions for coastal-ocean models at 300-1400x speedup, outperforming POD in two of three cases.

Temporal Graph Networks for Deep Learning on Dynamic Graphs

cs.LG · 2020-06-18 · unverdicted · novelty 7.0

Temporal Graph Networks combine memory modules and graph operators to learn on dynamic graphs as timed event sequences, outperforming prior methods on transductive and inductive tasks while unifying earlier models as special cases.

Language Models as Knowledge Bases?

cs.CL · 2019-09-03 · accept · novelty 7.0

BERT stores relational knowledge extractable via cloze queries without fine-tuning and matches supervised baselines on open-domain QA tasks.

Mixed Precision Training

cs.AI · 2017-10-10 · accept · novelty 7.0

Mixed precision training uses FP16 for most computations, FP32 master weights for accumulation, and loss scaling to enable accurate training of large DNNs with halved memory usage.

citing papers explorer

Showing 50 of 53 citing papers after filters.

HS-FNO: History-Space Fourier Neural Operator for Non-Markovian Partial Differential Equations cs.LG · 2026-05-10 · conditional · none · ref 45 · 2 links
HS-FNO lifts the state to include history and decomposes updates into a learned future-slice predictor plus an exact shift-append transport, yielding lower rollout errors than standard or lag-stack FNO baselines on five non-Markovian PDE families.
RAVEN: A Regime-Aware Variable-context Expert Network for Financial Time Series Forecasting cs.LG · 2026-06-23 · unverdicted · none · ref 16
RAVEN proposes a regime-aware MoE architecture with cumulative importance thresholding and correlation-aware weighting to adaptively select temporal context for non-stationary financial forecasting.
ConTex: Reformulating Counterfactual Generation For Time Series Forecasting cs.LG · 2026-06-16 · unverdicted · none · ref 11
ConTex learns a global intervention strategy via a decomposed temporal-conditional encoder architecture to generate consistent, sparse counterfactuals for time series models in a single forward pass.
RESCAST-100K: A Comprehensive Dataset for Cross-Domain Residential Load and Indoor Temperature Forecasting cs.LG · 2026-06-01 · unverdicted · none · ref 28
RESCAST-100K is a large-scale benchmark dataset of simulated and real residential energy data for cross-domain load and temperature forecasting.
Repetition Mismatch: Why Data Mixture Experiments Don't Scale and How to Fix Them cs.LG · 2026-05-29 · conditional · none · ref 30
Repetition rate mismatch between small-scale proxies and target budgets is the main reason data mixture experiments do not scale; a subsampling procedure that equalizes repetition rates recovers optimal mixtures from 1/16-scale experiments.
Faithful Embeddings of Irregular and Asynchronous Data for Online Log-NCDEs cs.LG · 2026-05-28 · unverdicted · none · ref 1
Introduces a continuous injective embedding for Log-NCDEs that builds log-signatures from data increments without interpolation or imputation while preserving compact-set universality.
'Si'multaneous 'S'patial-'T'emporal Message Passing for Dynamic Graph Representation Learning cs.LG · 2026-05-25 · unverdicted · none · ref 4
SiST-GNN performs simultaneous spatial-temporal message passing on a temporally augmented graph and reports 109-277% gains in fixed-split dynamic link prediction over prior methods.
UWM-JEPA: Predictive World Models That Imagine in Belief Space cs.LG · 2026-05-25 · unverdicted · none · ref 32
UWM-JEPA uses a density-matrix latent and unitary predictor in JEPA to preserve joint-state spectrum during blind rollouts, achieving 0.77 accuracy on a five-step hidden-velocity task versus 0.53 for an LSTM baseline.
Streaming Reinforcement Learning under Partial Observability with Real-Time Recurrent Learning cs.LG · 2026-05-23 · unverdicted · none · ref 4
Recurrent trace units enable exact RTRL with linear time/memory for streaming RL under partial observability, sustaining performance on long-chain memory tasks where TBPTT baselines collapse.
SGC-RML: A reliable and interpretable longitudinal assessment for PD in real-world DNS cs.LG · 2026-05-08 · unverdicted · none · ref 39
SGC-RML creates an 8D symptom atlas from multimodal PD data and integrates conformal calibration to deliver reliable, rejectable longitudinal assessments.
Affinity Is Not Enough: Recovering the Free Energy Principle in Mixture-of-Experts cs.LG · 2026-05-01 · conditional · none · ref 36
Adding temporal memory via LIF, precision-weighted gating, and anticipatory prediction to MoE routers recovers effective expert selection at distribution transitions, with ablation confirming a super-additive beta-ant interaction.
Preconditioned DeltaNet: Curvature-aware Sequence Modeling for Linear Recurrences cs.LG · 2026-04-22 · unverdicted · none · ref 104
Preconditioned delta-rule models with a diagonal curvature approximation improve upon standard DeltaNet, GDN, and KDA by better approximating the test-time regression objective.
Temporal Graph Networks for Deep Learning on Dynamic Graphs cs.LG · 2020-06-18 · unverdicted · none · ref 113
Temporal Graph Networks combine memory modules and graph operators to learn on dynamic graphs as timed event sequences, outperforming prior methods on transductive and inductive tasks while unifying earlier models as special cases.
Estimation--Prediction Tradeoff in Causal Probabilistic Temporal Graphs cs.LG · 2026-06-26 · unverdicted · none · ref 92
Characterizes an estimation-prediction tradeoff in binary logistic models for causal probabilistic temporal graphs and proposes a framework to jointly evaluate temporal link prediction with causal parameter recovery via Cramér-Rao bounds.
Topological Out-of-Domain Generalization in Dynamical Systems Reconstruction cs.LG · 2026-06-22 · unverdicted · none · ref 27
Proposes feature splitting and a closed-form bound on extrapolation range to enable zero-shot topological out-of-domain generalization in dynamical systems reconstruction across tipping points.
A Hybrid LSTM--Vision Transformer Architecture for Predicting HRRR Forecast Errors cs.LG · 2026-06-17 · unverdicted · none · ref 12
Hybrid LSTM-ViT model using mesonet surface data and profiler vertical profiles improves HRRR forecast error prediction for precipitation, wind speed, and temperature, with roughly twofold skill gain for precipitation over baseline LSTM.
GeoGNN: Time Series Geo-Localization using Two-Tower Graph Neural Networks cs.LG · 2026-06-06 · unverdicted · none · ref 25
GeoGNN is a two-tower GNN that learns geographic cell embeddings from adjacency graphs and matches them to temporal representations via dot-product similarity plus classification, improving geolocalization accuracy by ~27% on electricity datasets.
Composing Non-Conjugate Factor Graphs with Closed-Form Variational Inference cs.LG · 2026-05-28 · unverdicted · none · ref 8
Models composed from bilinear factor, exponential link, Gamma prior, Gaussian likelihood, and equality node admit closed-form variational message passing under mean-field factorization.
Sequential Neural Probabilistic Amplitude Shaping: Learning the Channel's Language cs.LG · 2026-05-27 · unverdicted · none · ref 17
Introduces the first neural probabilistic amplitude shaping using a block-less sequential autoregressive encoder compatible with arithmetic distribution matching that claims reduced rate loss and higher information rates.
PIDM-DP: Physics-Informed Diffusion with Dormand-Prince Integration for Chaotic System Identification and State Reconstruction across Multiple Dynamical Regimes cs.LG · 2026-05-26 · unverdicted · none · ref 13
PIDM-DP integrates Dormand-Prince ODE solving into DDPM denoising with scheduled physics guidance to reconstruct chaotic states, reporting up to 15.4x RMSE gains over baselines on five systems including stiff cases.
Fast MoE Inference via Predictive Prefetching and Expert Replication cs.LG · 2026-05-12 · conditional · none · ref 7
Dynamic replication of predicted overloaded experts in MoE models achieves near-100% GPU utilization and up to 3x faster inference while retaining 90-95% of baseline performance.
What If We Let Forecasting Forget? A Sparse Bottleneck for Cross-Variable Dependencies cs.LG · 2026-05-08 · unverdicted · none · ref 96
MS-FLOW uses a capacity-limited sparse routing mechanism to model only critical inter-variable dependencies in time series data, achieving state-of-the-art accuracy on 12 benchmarks with fewer but more reliable connections.
MinMax Recurrent Neural Cascades cs.LG · 2026-05-07 · unverdicted · none · ref 4 · 3 links
MinMax RNCs are recurrent networks over the min-max semiring that achieve regular language expressivity, log-depth parallel scan, uniformly bounded states, and non-vanishing state gradients while showing competitive empirical performance.
Pretraining on Sleep Data Improves non-Sleep Biosignal Tasks cs.LG · 2026-05-04 · unverdicted · none · ref 27
Sleep-only contrastive pretraining improves results on non-sleep EEG and ECG tasks relative to training from scratch and matches or exceeds some specialized models.
ACT: Anti-Crosstalk Learning for Cross-Sectional Stock Ranking via Temporal Disentanglement and Structural Purification cs.LG · 2026-04-22 · unverdicted · none · ref 6
ACT disentangles temporal scales in stock sequences and purifies structural relations in graphs to achieve state-of-the-art cross-sectional stock ranking on CSI300 and CSI500 with up to 74.25% improvement.
Thermodynamic Liquid Manifold Networks: Physics-Bounded Deep Learning for Solar Forecasting in Autonomous Off-Grid Microgrids cs.LG · 2026-04-13 · unverdicted · none · ref 1
A new neural network architecture enforces celestial and thermodynamic constraints to deliver zero nocturnal error and high-accuracy solar forecasts for autonomous microgrids.
Time-Warping Recurrent Neural Networks for Transfer Learning cs.LG · 2026-04-02 · unverdicted · none · ref 15
Time-warping enables RNN transfer learning across time scales in physical systems by rescaling time in pretrained LSTMs, matching accuracy of other methods with minimal parameter changes.
Short window attention enables long-term memorization cs.LG · 2025-09-29 · unverdicted · none · ref 17
Short sliding windows in hybrid attention-xLSTM models boost long-context performance by encouraging long-term memory use, and stochastic window sizing improves both short and long tasks.
Logo-LLM: Local and Global Modeling with Large Language Models for Time Series Forecasting cs.LG · 2025-05-16 · unverdicted · none · ref 4
Logo-LLM improves time series forecasting by pulling local dynamics from shallow LLM layers and global trends from deeper layers, then aligning them via new Local-Mixer and Global-Mixer modules.
Scaling Laws for Transfer cs.LG · 2021-02-02 · unverdicted · none · ref 109
Effective data transferred from pre-training to fine-tuning is described by a power law in model parameter count and fine-tuning dataset size, acting like a multiplier on the fine-tuning data.
StateFlow: Dual-State Recurrent Modeling for Long-Horizon Time Series Forecasting cs.LG · 2026-06-30 · unverdicted · none · ref 8
StateFlow extends VARNN with dual hidden and residual-memory states plus a chunk decoder and two-stage training to enable competitive long-horizon time series forecasting while retaining a compact recurrent design.
Interpretable Kolmogorov-Arnold Network with Feature-Isolated Temporal Attention Mechanism for Electricity Load Forecasting cs.LG · 2026-06-22 · unverdicted · none · ref 32
LoadKAN combines feature-isolated temporal attention with KAN to produce competitive load forecasts on three U.S. markets and enables quantitative analysis of non-linear mobility-load relationships via learned activation functions.
Physics-Informed Graph Neural Network Surrogates for Turbulent Nanoparticle Dispersion in Dental Clinical Environments cs.LG · 2026-05-19 · unverdicted · none · ref 39
ELGIN is a graph-based physics-informed surrogate model that predicts carrier flow and polydisperse particle motion in dental aerosol scenarios, achieving lower tracking errors and 37x speedup versus full OpenFOAM CFD in a preliminary single-case test.
From Sparsity to Simplicity: Enabling Simpler Sequential Replacements via Sparse Attention Distillation cs.LG · 2026-05-15 · unverdicted · none · ref 11
Sparsity-guided distillation enables replacing attention layers in ViTs with simpler sequential modules, with sparser layers showing smaller performance drops.
Rethinking Random Transformers as Adaptive Sequence Smoothers for Sleep Staging cs.LG · 2026-05-11 · unverdicted · none · ref 90
Randomly initialized Transformers act as adaptive sequence smoothers for sleep staging via a Random Attention Prior Kernel, with gains mainly from inductive bias rather than training.
SurvBench: A Standardised Preprocessing Pipeline for Multi-Modal Electronic Health Record Survival Analysis cs.LG · 2025-11-14 · accept · none · ref 68
SurvBench supplies a configurable, open-source preprocessing pipeline that standardizes multi-modal EHR data from four critical-care databases for single-risk and competing-risk survival analysis.
AR-KAN: Autoregressive-Weight-Enhanced Kolmogorov-Arnold Network for Time Series Forecasting cs.LG · 2025-09-03 · unverdicted · none · ref 14
AR-KAN combines a pre-trained AR module with KAN to reduce redundancy while preserving temporal features, delivering lower probabilistic approximation error and stronger forecasting results on synthetic almost-periodic signals and real datasets.
WaveletInception Networks for on-board Vibration-Based Infrastructure Health Monitoring cs.LG · 2025-07-17 · unverdicted · none · ref 35
The WaveletInception-BiGRU network uses learnable wavelet packet transforms, 1D Inception-ResNet modules, and BiGRU layers to generate high-resolution, spatially mapped health profiles from variable-speed vibration data, outperforming prior methods on track stiffness and transition zone tasks.
Approximately Equivariant Recurrent Generative Models for Quasi-Periodic Time Series with a Progressive Training Scheme cs.LG · 2025-05-08 · unverdicted · none · ref 8
AEQ-RVAE-ST combines approximate equivariance and progressive sequence lengthening in a recurrent VAE to match or exceed prior generative models on quasi-periodic time series benchmarks.
Beyond Feedforward Networks: Reentry Neural Systems as the Fundamental Basis of Subjecthood and Intrinsic Safety of Next-Generation AGI cs.LG · 2026-06-24 · unverdicted · none · ref 16
A cycle-based reentry architecture is proposed to guarantee self-model emergence, self-preservation, and prompt-injection immunity in AGI via a D-I loop and a new S-measure of integrated information.
Understanding Key Features of Time Series Foundation Models from Epidemic Forecasting cs.LG · 2026-06-17 · unverdicted · none · ref 50
Mixture-of-experts fusing multiple pretrained forecasters achieves strongest performance on influenza time series, with pretraining gains largest at longer horizons when domain-aligned and LLM methods underperforming.
On Subquadratic Architectures: From Applications to Principles cs.LG · 2026-06-10 · unverdicted · none · ref 7
xLSTM outperforms Mamba-2 and Gated DeltaNet on tasks with complex dependencies because its gating scheme enables more flexible and stable state tracking and memory accumulation.
Dynamic Multi-Pair Trading Strategy in Cryptocurrency Markets with Deep Reinforcement Learning cs.LG · 2026-06-03 · unverdicted · none · ref 20
A hybrid DRL system for multi-pair crypto trading with deterministic risk shielding outperforms a heuristic baseline at 10% significance on Binance futures data.
Validation-Gated Multi-Agent Governance for Online Adaptation of Thermal-Hydraulic Surrogate Models under Operating-Regime Shift cs.LG · 2026-06-02 · unverdicted · none · ref 37
A validation-gated multi-agent framework enables online adaptation of thermal-hydraulic surrogates and reduces forecast error by 19% under regime shifts on experimental loop data.
High-fidelity Modeling of Full-scale Pressurized Water Reactor Flow Fields for Machine Learning Applications cs.LG · 2026-05-23 · unverdicted · none · ref 58
The paper generates high-fidelity CFD datasets of PWR lower-plenum and core-inlet flow and evaluates ML models for assembly-level mass-flow reconstruction and short-term autoregressive prediction.
Transformer-Based Wildlife Species Classification from Daily Movement Trajectories cs.LG · 2026-05-07 · unverdicted · none · ref 13
Transformer models classify seven wildlife species from daily GPS trajectories, outperforming LSTM, CNN, and TCN baselines by 8-22 percentage points in balanced accuracy under region-holdout evaluation.
Time Series Forecasting Through the Lens of Dynamics cs.LG · 2025-07-21 · unverdicted · none · ref 16
Proposes dynamics-based analysis of time series models showing partial dynamics learning and end-positioning as key to performance, plus a plug-and-play improvement method.
Tabular GANs for uneven distribution cs.LG · 2020-10-01 · unverdicted · none · ref 3
A modular framework for tabular data generation across GANs, diffusion models, and LLMs is introduced and tested on seven benchmarks, with GAN augmentation shown to boost performance under distribution shift.
Autoencoder Architectures for Athlete Performance Scoring from Wearable Telemetry cs.LG · 2026-06-26 · unverdicted · none · ref 18
Deep autoencoders outperform PCA and VAE variants on a composite of reconstruction MSE and interpretability metrics when reducing runner wearable data to a single latent performance score.
Lifelong In-Context Learning with Transformers Requires Parametric Forms of Attention cs.LG · 2026-06-24 · unverdicted · none · ref 21
Argues that parametric attention forms are necessary for lifelong in-context learning in transformers to maintain constant memory footprint over arbitrary sequence lengths.
