RoundPipe achieves near-zero-bubble pipeline parallelism for LLM training on consumer GPUs by dynamically dispatching computation stages round-robin, yielding 1.48-2.16x speedups and enabling 235B model fine-tuning on 8x RTX 4090.
PyTorch: An Imperative Style, High-Performance Deep Learning Library
Mixed citation behavior. Most common role is background (53%).
abstract
Deep learning frameworks have often focused on either usability or speed, but not both. PyTorch is a machine learning library that shows that these two goals are in fact compatible: it provides an imperative and Pythonic programming style that supports code as a model, makes debugging easy and is consistent with other popular scientific computing libraries, while remaining efficient and supporting hardware accelerators such as GPUs. In this paper, we detail the principles that drove the implementation of PyTorch and how they are reflected in its architecture. We emphasize that every aspect of PyTorch is a regular Python program under the full control of its user. We also explain how the careful and pragmatic implementation of the key components of its runtime enables them to work together to achieve compelling performance. We demonstrate the efficiency of individual subsystems, as well as the overall speed of PyTorch on several common benchmarks.
representative citing papers
Looped transformers with recall and outer normalization produce reachable, input-dependent fixed points with stable gradients, enabling generalization, while those without recall cannot; a new internal recall variant performs competitively or better.
First in-orbit demonstration of a DRL-trained AI satellite attitude controller that performs robust inertial pointing after sim-to-real transfer.
A two-pass optimization framework with polynomial-based simulation discovers heralded ballistic circuits for 3-5 qubit graph states achieving up to 7.5x higher success probabilities than fusion baselines, including first known circuits for some 5-qubit states.
Task vectors from weight differences allow arithmetic operations to edit pre-trained models, improving multiple tasks simultaneously and enabling analogical inference on unseen tasks.
Quantitative Bayesian inference using a deep-learning emulator detects 0.018-0.020 M_sun of helium in the Type Ic supernova 2014L.
ffortissimo is a JAX-based freeform forward-modeling pipeline that fits complex dust distributions and infers scattering properties in KLIP-reduced images of circumstellar disks such as HR 4796A.
A matrix-free, GPU-compatible PyTorch implementation of phase-field fracture with explicit dynamics, custom differentiable implicit damage solve, benchmarks on dynamic and quasi-static cases, and inverse recovery of fracture energy G_c via L-BFGS.
A Set-Transformer architecture with self-attention encodes Pauli-string correlations, optimizes via commutation objective, and finds symmetries with near-deterministic success on physical models like Ising and Toric code.
A new lattice method recasts SIGW integrals as FFT convolutions to compute fully non-Gaussian spectra in seconds with ~10% error on a radiation-dominated background.
Hybrid TimesFM plus ridge regression on covariates forecasts 1-MeV electron flux with average R² of 0.9 on out-of-sample 2024 data, outperforming linear regression, CNN, LSTM and Transformer models.
A neural network trained on simulations infers stripping times for Sagittarius stream stars from phase-space data, measuring a 0.3 dex/Gyr metallicity gradient and estimating ages for globular clusters such as Pal 12 and NGC 2419.
Events trigger on-the-fly LoRA module generation via hypernetworks over a shared team policy in MARL, paired with a Neural Manifold Diversity metric, enabling sequential role reassignment while preserving reward maximization.
Vibrational mode graphs from molecular dynamics enable sequence-free protein function prediction via graph neural networks, with entrainment improving signals for collective dynamics.
Dingo-Pop uses a transformer to perform amortized, end-to-end population inference from GW strain data in seconds, bypassing per-event Monte Carlo sampling.
Learning in low-rank RNNs reduces to an exact low-dimensional ODE system in overlap space, where loss-invisible overlaps encode training history without affecting function.
Dynamical magnetotropic susceptibility k(ω) acts as a probe of uniform spin and charge fluctuations, with its static scaling in α-RuCl3 arising specifically from dominant Kitaev interactions in the models examined.
Transformer networks sample up to 180x180 2D Ising systems and 64x64 Edwards-Anderson systems by generating spin groups with probability approximations, yielding ~20x higher effective sample size than prior neural samplers at criticality.
A single end-to-end Transformer model unifies stellar labels from heterogeneous spectroscopic surveys into a self-consistent scale without post-hoc recalibration.
Parametric neural networks learn likelihood ratios to infer top-philic scalar resonances from dip patterns caused by signal-background interference in hadron collider data.
A graph-conditioned meta-optimizer learns QAOA parameter trajectories from one problem class and transfers them to others, yielding better initializations than standard methods in an empirical study of 64 settings.
SMC forgets its initial condition geometrically in the jump chain and as 1/ℓ in continuous genetic distance, justifying independent-locus approximations.
Concept Graph Convolutions perform message passing on node concepts to increase interpretability of graph neural networks without losing task performance.
A neural network learns non-stationary anisotropic correlations from gridded CTM outputs and transfers the structure via LatticeKrig basis functions to station data for refined fine-scale NO2 predictions with uncertainty.
citing papers explorer
- Physics-Informed Graph Neural Network Surrogates for Turbulent Nanoparticle Dispersion in Dental Clinical Environments
ELGIN is a graph-based physics-informed surrogate model that predicts carrier flow and polydisperse particle motion in dental aerosol scenarios, achieving lower tracking errors and 37x speedup versus full OpenFOAM CFD in a preliminary single-case test.
- Search for pair production of additional neutral scalars within the Inert Doublet Model in a final state with two electrons or two muons in proton-proton collisions at $\sqrt{s}$ = 13 TeV and 13.6 TeV
No significant excess found; new exclusion limits reach m_H = 108 GeV for m_H - m_A = 78 GeV in the Inert Doublet Model.
- ERPPO: Entropy Regularization-based Proximal Policy Optimization
ERPPO adds a DSA-based ambiguity estimator to MAPPO and switches between L1 and L2 entropy regularization to improve exploration and stability in non-stationary multi-dimensional observations.
- Unveiling Hidden Lyman Alpha Emitters in the DESI DR1 Data
A CNN detects 19,685 LAEs at z=2-3.5 in DESI DR1 spectra with 95% purity and completeness.
- Latency Analysis and Optimization of Alpamayo 1 via Efficient Trajectory Generation
Redesigning Alpamayo 1 to single-reasoning and optimizing diffusion action generation cuts inference latency by 69.23% while preserving trajectory diversity and prediction quality.
- Compositional Quantum Heuristics for Max-Clique Detection
Compositional quantum circuits with symmetry-induced invariant losses produce trainable equivariant quantum GNNs that generalize on max-clique problems and improve hybrid recursive search accuracy and scalability.
- AdaMeZO: Adam-style Zeroth-Order Optimizer for LLM Fine-tuning Without Maintaining the Moments
AdaMeZO adapts Adam moment estimates to zeroth-order LLM fine-tuning without extra memory storage, outperforming MeZO with up to 70% fewer forward passes.
- Optimization of Model Splitting, Placement, and Chaining for Multi-hop Split Learning and Inference
An ILP model and BCD heuristic jointly optimize model splitting, node placement, and smashed-data routing in an SFC-based multi-hop split learning/inference architecture to minimize end-to-end latency.
- Toeplitz MLP Mixers are Low Complexity, Information-Rich Sequence Models
Toeplitz MLP Mixers replace attention with masked Toeplitz multiplications for sub-quadratic complexity while retaining more sequence information and outperforming on copying and in-context tasks.
- A New Adaptive Deep Learning based Reduced Order Model for Hybrid-Type Parabolic PDEs: Rigorous Error Analysis and Applications
Two new DOD-based reduced-order models (DOD-DL-ROM and DOD+DFNN) are introduced for hybrid-type parabolic PDEs, with rigorous error bounds linking performance to optimal map regularity and conditions for outperforming POD methods.
- Revisiting Neural Activation Coverage for Uncertainty Estimation
Neural activation coverage can be adapted to provide uncertainty estimates in regression that the authors' experiments show are more meaningful than Monte-Carlo Dropout.
- The swept-back multipolar magnetic field of neutron stars: Application to NICER MSP J0030+0451
A centered swept-back multipolar magnetic field up to octupole order reproduces the bolometric thermal X-ray light curve of MSP J0030+0451.
- TabEmb: Joint Semantic-Structure Embedding for Table Annotation
TabEmb decouples LLM-based semantic column embeddings from graph-based structural modeling to produce joint representations that improve table annotation tasks.
- HieraSparse: Hierarchical Semi-Structured Sparse KV Attention
HieraSparse delivers a hierarchical semi-structured sparse KV attention system that achieves 1.2x KV compression and 4.57x decode attention speedup versus prior unstructured sparsity methods at equivalent sparsity, plus up to 1.85x prefill speedup and 1.37x/1.77x speedups with magnitude pruning and
- PRiMeFlow: Capturing Complex Expression Heterogeneity in Perturbation Response Modelling
PRiMeFlow applies flow matching in gene expression space with a U-Net velocity field and pretraining-finetuning to model perturbation-induced heterogeneity, showing strong benchmark performance on PerturBench and the ARC Virtual Cell Challenge.
- Thermodynamic and Transport Properties of Quark-Gluon Plasma at Finite Chemical Potential with a DNN framework
A deep neural network emulates lattice QCD equation of state within a quasi-particle model to compute QGP speed of sound, specific heat, viscosity, and conductivity at finite baryon chemical potential.
- AnyUser: Translating Sketched User Intent into Domestic Robots
AnyUser translates free-form sketches on images plus optional language into executable robot actions for domestic tasks using multimodal fusion and a hierarchical policy.
- General Explicit Network (GEN): A novel deep learning architecture for solving partial differential equations
GEN is a neural network that solves PDEs by constructing explicit function approximations from basis functions based on prior PDE knowledge, yielding more robust and extensible solutions than standard PINNs.
- Machine learning for smell: Ordinal odor strength prediction of molecular perfumery components
The authors compile an ordinal odor strength dataset for over 2,000 molecules from public sources and demonstrate supervised ML prediction of intensity categories, identifying molecular size, polarity, rings, and branching as key drivers via SHAP analysis.
- Stochastic versus Deterministic in Stochastic Gradient Descent
Treating stochastic and deterministic gradients separately in mini-batch SGD yields faster convergence and smaller error radius than uniform treatment, with further gains under strong convexity.
- UAV-VL-R1: Generalizing Vision-Language Models via Supervised Fine-Tuning and Multi-Stage GRPO for UAV Visual Reasoning
UAV-VL-R1 combines SFT and multi-stage GRPO reinforcement learning on a new 50,019-sample HRVQA-VL dataset to deliver substantially higher zero-shot accuracy on UAV visual reasoning tasks than both its 2B baseline and a 72B-scale model.
- LogitSpec: Accelerating Retrieval-based Speculative Decoding via Next Next Token Speculation
LogitSpec accelerates retrieval-based speculative decoding by speculating the next-next token from the last logit and retrieving relevant references for both next and next-next tokens, reporting up to 2.61x speedup and 3.28 mean accepted tokens.
- Efficient compression of neural networks and datasets
Refined probabilistic and smooth l0 pruning techniques approximate minimum description length for neural networks, achieving high compression with minimal accuracy loss and empirically verifying better sample efficiency and generalization on image and text tasks.
- A mixed-integer framework for analyzing neural network-based controllers for piecewise affine systems with bounded disturbances
A mixed-integer framework represents neural network-controlled piecewise affine systems with bounded disturbances as MI linear constraints, enabling computation of robustly positively invariant sets via MI linear programs for stability and constraint certification.
- An Efficient Stochastic Subgradient Method for the Global Placement Problem in Very Large-Scale Integration Circuits
A ReLU-penalty formulation for VLSI global placement is solved via stochastic subgradient descent, with the first claimed convergence proof for ReLU-type nonsmooth nonconvex problems.
- Molecular Quantum Control Algorithm Design by Reinforcement Learning
Reinforcement learning designs quantum-logic pulse sequences to prepare H3O+ (130 states) and CaH+ molecular ions in pure states from thermal populations.
- Debiasing the Observed Fast Radio Burst Population with the CHIME/FRB Selection Function
Analysis of CHIME/FRB Catalog 2 with synthetic injections and a multidimensional selection function yields evidence for a slight downturn in the intrinsic scattering timescale distribution, though flat or rising distributions remain possible.
- Learning to model pediatric asthma exacerbation from multiple risk factors: a case study in coastal Virginia
A case study develops a sparse dictionary learning approach to model pediatric asthma exacerbations from multiple risk factors and reports consensus on relative risks across statistical and machine learning models.
- Identifying Gems from Roman RAPIDly
Machine learning models RuBR_comb, RuBR_loc, and RuBR_DA are presented for real-bogus classification of transients in the Roman RAPID pipeline, using combined simulated data and domain adaptation.
- Libra: Efficient Resource Management for Agentic RL Post-Training
Libra optimizes GPU allocation across rollout and training in agentic RL via an elastic hybrid pool and C-MLFQ scheduler based on tool-return causal signals, claiming up to 3.0x throughput and 2.5x faster reward convergence on 48 A800 GPUs.
- Low-Magnification SEM May Suffice: Interpretable Deep Learning for Multi-Scale Fracture-Cause Classification in Zirconia-Toughened Alumina
A fine-tuned ViT on 8493 SEM images classifies fracture causes in zirconia-toughened alumina at 0.907 accuracy and 0.888 macro-F1, with comparable performance at 50x versus higher magnifications.
- Kolmogorov-Arnold Networks as Implicit Regularizers: Noise Robustness and Interpretability for Stellar Classification
KAN noise robustness in star/galaxy/quasar classification arises from implicit C2-spline regularization rather than architecture, as weight-decay-tuned MLPs match performance on SDSS and DESI data.
- Scalable Dark Siren Cosmology with gwcosmo: GPU Acceleration, Validation and Systematics
GPU-accelerated gwcosmo enables 1000x faster dark-siren cosmological analyses for large GW catalogs.
- A Value-added Physical Properties Catalog for Low-redshift Galaxies from DESI Legacy Imaging Surveys DR10
A multimodal neural network trained on MPA-JHU references produces SFR, stellar mass, and metallicity estimates for 547 million low-redshift galaxies in DESI LS DR10.
- Text-RSIR: A Text-Guided Framework for Efficient Remote Sensing Image Transmission and Reconstruction
A text-guided framework for remote sensing image transmission uses low-res images and compact text to reduce data volume to 2%, with text-conditioned reconstruction achieving PSNRs of 16.36-27.41 dB on tested datasets.
- QuChaTeR: A Hybrid Quantum-Chaotic Temporal Framework for Earthquake Prediction
QuChaTeR hybridizes chaotic maps and variational quantum circuits with recurrent networks and wavelets to achieve faster convergence and better performance than classical and quantum-inspired baselines on real seismic datasets.
- VIGILant: an automatic classification pipeline for glitches in the Virgo detector
VIGILant applies tree-based models and a ResNet CNN to classify Virgo O3b glitches with 98% accuracy and has been deployed for daily use with an interactive dashboard.
- Multimodal Anomaly Detection for Human-Robot Interaction
MADRI detects anomalies in human-robot pick-and-place tasks by reconstructing multimodal feature vectors from video, internal sensors, and scene graphs, with multimodal versions outperforming vision-only on a custom dataset.
- GPAFormer: Graph-guided Patch Aggregation Transformer for Efficient 3D Medical Image Segmentation
GPAFormer with 1.81M parameters reports top Dice scores on BTCV (75.70%), Synapse (81.20%), ACDC (89.32%), and BraTS (82.74%) while running inference in under one second on consumer GPUs.
- PR3DICTR: A modular AI framework for medical 3D image-based detection and outcome prediction
PR3DICTR is a new open-access modular framework for 3D medical image classification and outcome prediction that works with as little as two lines of code.
- AI4EOSC: a Federated Cloud Platform for Artificial Intelligence in Scientific Research
AI4EOSC is a federated cloud platform that integrates modular AI development, serverless AI-as-a-Service, and distributed orchestration with built-in FAIR metadata and provenance tracking for scientific AI workloads in EOSC.
- Auto-encoder model for faster generation of effective one-body gravitational waveform approximations
Auto-encoder approximates SEOBNRv4 waveforms for four-parameter aligned-spin binaries, delivering 4 orders of magnitude speedup at median mismatch of 10^{-2}.
- Physics-informed neural network (PINN) modeling of charged particle multiplicity using the two-component framework in heavy-ion collisions: A comparison with data-driven neural networks
A PINN constrained by the two-component multiplicity model learns the hard-scattering fraction from Zr+Zr events and predicts N_ch more accurately than a data-driven NN on unseen Ru+Ru and Au+Au collisions.
- Identifying lopsidedness in spiral galaxies using a Deep Convolutional Neural Network
Transfer learning with a Zoobot CNN on SDSS DR18 data identifies 3,679 lopsided spiral galaxies at 87% test accuracy, with lopsided systems showing higher star formation, bluer colors, lower mass and concentration.
- Clinical utility of foundation models in musculoskeletal MRI for biomarker fidelity and predictive outcomes
Fine-tuned foundation models produce reliable MSK MRI biomarkers that support workload-reducing triage and calibrated 48-month prediction of knee replacement and incident OA.
- Dynamics-Encoded Deep Learning for Robust System Identification and Parameter Estimation
Dynamics-encoded deep learning approaches are developed for system identification and parameter estimation in dynamical systems using numerical discretization schemes.
- Validation of an AI-based end-to-end model for prostate pathology using long-term archived routine samples
GleasonAI achieves quadratic-weighted kappa of 0.86 on ISUP grading of 10,366 long-term archived prostate biopsy cores, with performance stable over 17 years and a clear prognostic gradient for cancer-specific mortality.
- Machine Learning Approaches for Improved Scalability of Metallic Magnetic Calorimeters
Machine learning methods are explored for pulse classification, artifact rejection, and shape analysis in metallic magnetic calorimeters to improve scalability over traditional signal processing.
- Modelling magnetic material properties with uncertainty-aware neural networks
Uncertainty-aware neural networks using Gaussian negative log-likelihood and dropout are applied to predict intrinsic magnetic properties and coercivity via graph neural networks in permanent magnet research.
- Jarvis-HEP: A lightweight Python framework for workflow composition and parameter scans in high-energy physics
Jarvis-HEP introduces a YAML-based Python framework for composing workflows and performing parameter scans in high-energy physics.