RoundPipe achieves near-zero-bubble pipeline parallelism for LLM training on consumer GPUs by dynamically dispatching computation stages round-robin, yielding 1.48-2.16x speedups and enabling 235B model fine-tuning on 8x RTX 4090.
PyTorch: An Imperative Style, High-Performance Deep Learning Library
Mixed citation behavior. Most common role is background (53%).
abstract
Deep learning frameworks have often focused on either usability or speed, but not both. PyTorch is a machine learning library that shows that these two goals are in fact compatible: it provides an imperative and Pythonic programming style that supports code as a model, makes debugging easy and is consistent with other popular scientific computing libraries, while remaining efficient and supporting hardware accelerators such as GPUs. In this paper, we detail the principles that drove the implementation of PyTorch and how they are reflected in its architecture. We emphasize that every aspect of PyTorch is a regular Python program under the full control of its user. We also explain how the careful and pragmatic implementation of the key components of its runtime enables them to work together to achieve compelling performance. We demonstrate the efficiency of individual subsystems, as well as the overall speed of PyTorch on several common benchmarks.
representative citing papers
Looped transformers with recall and outer normalization produce reachable, input-dependent fixed points with stable gradients, enabling generalization, while those without recall cannot; a new internal recall variant performs competitively or better.
First in-orbit demonstration of a DRL-trained AI satellite attitude controller that performs robust inertial pointing after sim-to-real transfer.
A two-pass optimization framework with polynomial-based simulation discovers heralded ballistic circuits for 3-5 qubit graph states achieving up to 7.5x higher success probabilities than fusion baselines, including first known circuits for some 5-qubit states.
Task vectors from weight differences allow arithmetic operations to edit pre-trained models, improving multiple tasks simultaneously and enabling analogical inference on unseen tasks.
Quantitative Bayesian inference using a deep-learning emulator detects 0.018-0.020 M_sun of helium in the Type Ic supernova 2014L.
ffortissimo is a JAX-based freeform forward-modeling pipeline that fits complex dust distributions and infers scattering properties in KLIP-reduced images of circumstellar disks such as HR 4796A.
A matrix-free, GPU-compatible PyTorch implementation of phase-field fracture with explicit dynamics, custom differentiable implicit damage solve, benchmarks on dynamic and quasi-static cases, and inverse recovery of fracture energy G_c via L-BFGS.
A Set-Transformer architecture with self-attention encodes Pauli-string correlations, optimizes via commutation objective, and finds symmetries with near-deterministic success on physical models like Ising and Toric code.
A new lattice method recasts SIGW integrals as FFT convolutions to compute fully non-Gaussian spectra in seconds with ~10% error on a radiation-dominated background.
Hybrid TimesFM plus ridge regression on covariates forecasts 1-MeV electron flux with average R² of 0.9 on out-of-sample 2024 data, outperforming linear regression, CNN, LSTM and Transformer models.
A neural network trained on simulations infers stripping times for Sagittarius stream stars from phase-space data, measuring a 0.3 dex/Gyr metallicity gradient and estimating ages for globular clusters such as Pal 12 and NGC 2419.
Events trigger on-the-fly LoRA module generation via hypernetworks over a shared team policy in MARL, paired with a Neural Manifold Diversity metric, enabling sequential role reassignment while preserving reward maximization.
Vibrational mode graphs from molecular dynamics enable sequence-free protein function prediction via graph neural networks, with entrainment improving signals for collective dynamics.
Dingo-Pop uses a transformer to perform amortized, end-to-end population inference from GW strain data in seconds, bypassing per-event Monte Carlo sampling.
Learning in low-rank RNNs reduces to an exact low-dimensional ODE system in overlap space, where loss-invisible overlaps encode training history without affecting function.
Dynamical magnetotropic susceptibility k(ω) acts as a probe of uniform spin and charge fluctuations, with its static scaling in α-RuCl3 arising specifically from dominant Kitaev interactions in the models examined.
Transformer networks sample up to 180x180 2D Ising systems and 64x64 Edwards-Anderson systems by generating spin groups with probability approximations, yielding ~20x higher effective sample size than prior neural samplers at criticality.
A single end-to-end Transformer model unifies stellar labels from heterogeneous spectroscopic surveys into a self-consistent scale without post-hoc recalibration.
Parametric neural networks learn likelihood ratios to infer top-philic scalar resonances from dip patterns caused by signal-background interference in hadron collider data.
A graph-conditioned meta-optimizer learns QAOA parameter trajectories from one problem class and transfers them to others, yielding better initializations than standard methods in an empirical study of 64 settings.
SMC forgets its initial condition geometrically in the jump chain and as 1/ℓ in continuous genetic distance, justifying independent-locus approximations.
Concept Graph Convolutions perform message passing on node concepts to increase interpretability of graph neural networks without losing task performance.
A neural network learns non-stationary anisotropic correlations from gridded CTM outputs and transfers the structure via LatticeKrig basis functions to station data for refined fine-scale NO2 predictions with uncertainty.
citing papers explorer
- What If We Let Forecasting Forget? A Sparse Bottleneck for Cross-Variable Dependencies
MS-FLOW uses a capacity-limited sparse routing mechanism to model only critical inter-variable dependencies in time series data, achieving state-of-the-art accuracy on 12 benchmarks with fewer but more reliable connections.
- Why Does Agentic Safety Fail to Generalize Across Tasks?
Agentic safety fails to generalize across tasks because the task-to-safe-controller mapping has a higher Lipschitz constant than the task-to-controller mapping alone, as proven in linear-quadratic control and demonstrated in quadcopter and LLM experiments.
- Decoding Alignment without Encoding Alignment: A critique of similarity analysis in neuroscience
Decoding alignment metrics can remain high and unchanged even when encoding manifold topology is causally altered, so they do not imply similar function or computation across neural populations.
- Stage Light is Sequence$^2$: Multi-Light Control via Imitation Learning
SeqLight maps music to multi-light HSV control via SkipBART for global color prediction followed by hybrid imitation learning in a goal-conditioned MDP to decompose colors across lights.
- ClarifySTL: An Interactive LLM Agent Framework for STL Transformation through Requirements Clarification
ClarifySTL uses LLM agents to interactively detect and resolve vagueness and ambiguity in natural language requirements via clarification queries before generating STL formulas, with evaluations on existing and new benchmarks showing effectiveness.
- Compressibility of micromagnetic solutions in tensor train format
Tensor-train compressed micromagnetic solutions for flux-closure states in soft-magnetic prisms scale as L^{1.8} and (1/a)^{1.2} by exploiting spatial sparsity in domain walls versus uniform domains.
- Towards Generalizable Mapping of Hedges and Linear Woody Features from Earth Observation Data: a national Product for Germany
A modular deep learning workflow maps linear woody features at national scale in Germany from three different resolution EO sources using a single trained model.
- Data-Driven Acceleration of Eccentricity Reduction for Binary Black Hole Simulations
A Gaussian Process Regression model trained on an archive of eccentricity-reduced binary black hole simulations predicts initial conditions that achieve low eccentricity with zero or one iteration.
- JAX-BEM: Gradient-Based Acoustic Shape Optimisation via a Differentiable Boundary Element Method
A JAX-based differentiable BEM solver matches traditional BEM accuracy on benchmarks and supports gradient-driven acoustic geometry optimization.
- TCL: Enabling Fast and Efficient Cross-Hardware Tensor Program Optimization via Continual Learning
TCL delivers 16.8x faster tuning on CPU and 12.48x on GPU with modestly lower inference latency by combining RDU active sampling, a lightweight Mamba cost model, and cross-platform continual knowledge distillation.
- Particle transformers for identifying Lorentz-boosted Higgs bosons decaying to a pair of W bosons
PaRT achieves >50% tagging efficiency for boosted H->WW jets at 1% background efficiency, decorrelated from jet mass, with data-to-simulation scale factors of 0.9-1.0 on 138 fb^{-1} of 13 TeV collisions.
- Minimising Willmore Energy via Neural Flow
Neural networks minimize Willmore energy on embedded surfaces, recovering the round sphere and Clifford torus while supplying a search procedure for genus-2 minimal surfaces.
- AEGIS: Scaling Long-Sequence Homomorphic Encrypted Transformer Inference via Hybrid Parallelism on Multi-GPU Systems
AEGIS reduces inter-GPU communication by up to 81.3% in self-attention and reaches 96.62% scaling efficiency with 3.86x speedup on four GPUs for 2048-token encrypted Transformer inference.
- Scalable Variational Bayesian Fine-Tuning of LLMs via Orthogonalized Low-Rank Adapters
PoLAR-VBLL combines orthogonalized low-rank adapters with variational Bayesian last-layer inference to enable scalable, well-calibrated uncertainty quantification in fine-tuned LLMs.
- Mitigating Data Scarcity in Spaceflight Applications for Offline Reinforcement Learning Using Physics-Informed Deep Generative Models
MI-VAE generates physics-constrained synthetic trajectories from scarce real data to improve offline RL policy performance on planetary lander tasks over standard VAEs.
- Deeper detection limits in astronomical imaging using self-supervised spatiotemporal denoising
ASTERIS, a self-supervised spatiotemporal denoising algorithm, improves astronomical detection limits by 1 magnitude at 90% completeness while identifying three times more redshift >9 galaxy candidates in JWST images.
- Physics and causally constrained discrete-time neural models of turbulent dynamical systems
A framework builds stable neural models of turbulent dynamics by enforcing energy-preserving nonlinearities and causal constraints in discrete-time flow maps, demonstrated on Charney-DeVore and Lorenz-96 systems.
- Optimizing Deep Learning Photometric Redshifts for the Roman Space Telescope with HST/CANDELS
PITA, a new semi-supervised deep learning algorithm, outperforms prior photo-z methods by using a triple-task loss on images, colors, and available redshifts to produce a smooth latent space.
- The hidden risks of temporal resampling in clinical reinforcement learning
Resampling clinical time series into uniform bins for offline RL reduces performance by up to 60% and causes retrospective evaluations to overestimate returns by 1.5-3x versus unprocessed data.
- Learning from Historical Activations in Graph Neural Networks
HISTOGRAPH applies unified layer-wise attention followed by node-wise attention over historical GNN activations to improve graph classification, especially in deep models.
- A window for water-hydrogen demixing on warm metal-rich sub-Neptunes
Water-hydrogen demixing occurs on warm sub-Neptunes with envelope metallicities of 150-700 times solar, including TOI-270 d, implying layered interiors and underestimated bulk metallicities when using fully-miscible models.
- Understanding the Staged Dynamics of Transformers in Learning Latent Structure
Transformers learn latent structure components in discrete stages during training, composing rules more robustly than decomposing complex examples, with identified layer plasticity windows.
- SilverTorch: A Unified Model-based System to Democratize Large-Scale Recommendation on GPUs
SilverTorch replaces standalone ANN indexing and filtering with a unified GPU model using a model-based Bloom index and fused Int8 ANN kernel, delivering up to 23.7x throughput and 13.35x cost efficiency gains on industry data.
- CoGate-LSTM: Prototype-Guided Feature-Space Gating for Mitigating Gradient Dilution in Imbalanced Toxic Comment Classification
CoGate-LSTM adds prototype-guided cosine feature-space gating to a character-level BiLSTM with multi-source embeddings and focal loss, reaching 0.881 macro-F1 on Jigsaw toxic comments while using 7.3M parameters and outperforming fine-tuned BERT by 6.9 points on minority labels.
- Image reconstruction with the JWST Interferometer
Dorito enables diffraction-limited image reconstruction from JWST AMI observations by deconvolving images or Fourier observables using maximum entropy and total variation regularization.
- Zero-Effort Image-to-Music Generation: An Interpretable RAG-based VLM Approach
A zero-training VLM framework generates music from images via ABC notation, multi-modal RAG, and self-refinement while providing text and visual explanations for the outputs.
- Differentiable Acoustic Radiance Transfer
DART adds differentiability to acoustic radiance transfer, enabling material optimization and improved performance on sparse acoustic field prediction tasks compared to signal processing and neural baselines.
- Optimizing Quantum Photonic Integrated Circuits using Differentiable Tensor Networks
Gradient-based optimization of quantum photonic circuits is achieved via differentiable tensor networks that model nonlinear unitary gates and stochastic losses at low photon numbers.
- Thermodynamically consistent machine learning model for excess Gibbs energy
HANNA is a thermodynamically consistent ML model for predicting excess Gibbs energy from molecular structures, trained on various binary mixture data and extended to multi-component mixtures using geometric projection.
- Scalable Equilibrium Propagation via Intermediate Error Signals for Deep Convolutional CRNNs
Introduces layer-wise learning signals combining knowledge distillation and local errors into Equilibrium Propagation, enabling scalable training of deep VGG-style CRNNs with SOTA results on CIFAR-10 and CIFAR-100.
- Stability-Constrained AC Optimal Power Flow--A Gaussian Process-Based Approach
A Gaussian Process surrogate for the stability exponent of generator dynamics is integrated into AC Optimal Power Flow to produce both cost-optimal and dynamically stable operating points.
- Neural simulation-based inference of the Higgs trilinear self-coupling via off-shell Higgs production
A hybrid NSBI technique is presented for inferring the Higgs trilinear coupling via off-shell production in SMEFT, achieving near-theoretical-optimum sensitivity with expected HL-LHC constraints.
- Characterizing control between interacting subsystems with deep Jacobian estimation
JacobianODE learns Jacobians from data to quantify directional control in nonlinear systems and shows sensory-to-cognitive control strengthening in a trained working-memory RNN.
- Unsupervised risk factor identification across cancer types and data modalities via explainable artificial intelligence
A new unsupervised method adapts the multivariate logrank statistic into a differentiable loss for training any neural network on any data modality to discover prognostically distinct patient clusters, demonstrated on myeloma lab data and lung cancer CT images with post-hoc explainability.
- Learning Encodings by Maximizing State Distinguishability: Variational Quantum Error Correction
VarQEC uses a distinguishability loss as a machine-learning objective to variationally discover resource-efficient encoding circuits optimized for given noise models.
- Neuralized Fermionic Tensor Networks for Quantum Many-Body Systems
NN-fTNS enhance fermionic tensor networks with neural parametrization to improve expressivity and achieve order-of-magnitude better energies than pure fTNS on Hubbard models while maintaining linear scaling.
- SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics
SmolVLA is a small efficient VLA model that achieves performance comparable to 10x larger models while training on one GPU and deploying on consumer hardware via community data and chunked asynchronous action prediction.
- Modal Decomposition and Identification for a Population of Structures Using Physics-Informed Graph Neural Networks and Transformers
A physics-informed GNN-transformer model performs unsupervised modal decomposition and identification for populations of structures from sparse dynamic measurements.
- Tensor-Programmable Quantum Circuits for Solving Differential Equations
A quantum solver for PDEs is introduced via flexible matrix product operator representations with mid-circuit measurements and state-dependent norm correction to handle non-unitary dynamics.
- Variational decision diagrams for quantum-inspired machine learning applications
The paper proposes variational decision diagrams (VDDs) for quantum state representation in QML and reports successful training without barren plateaus on transverse-field Ising and Heisenberg Hamiltonians.
- Pretrained Event Classification Model for High Energy Physics Analysis
A GNN pretrained on 120M simulated HEP events generalizes to unseen processes and ATLAS data; fine-tuning boosts accuracy especially with small datasets, with CKA showing preserved encoders but altered intermediate layers.
- Steering Llama 2 via Contrastive Activation Addition
Contrastive Activation Addition steers Llama 2 Chat by adding averaged residual-stream activation differences from contrastive example pairs to control targeted behaviors at inference time.
- MONAI: An open-source framework for deep learning in healthcare
MONAI is a community-supported PyTorch framework that extends deep learning to medical data with domain-specific architectures, transforms, and deployment tools.
- Euclid preparation. CosmoPostProcess: A simulation calibrated framework for weak lensing selection bias in richness-selected galaxy clusters
CosmoPostProcess delivers simulation-calibrated radial corrections for projection-induced selection bias (20-40% amplitude near 1 h^{-1} Mpc) and baryonic effects in Euclid richness-selected cluster weak lensing profiles.
- A Physics Informed Bayesian Neural Network for the Neutron Star Equation of State
A physics-informed Bayesian neural network learns neutron-star equations of state from theoretical priors and constraints, then generates posterior mass-radius and mass-tidal-deformability distributions consistent with NICER radii and 2-solar-mass limits.
- Continual Learning for Sequential Personalization of Small Language Models: A Stability Monitoring Analysis
Checkpoint monitoring during sequential LoRA adaptation of SLMs reveals instability patterns via reference set diagnostics that standard task metrics can miss.
- Reducing Experimental Testing in Space Propulsion Film Cooling Analyses by Pixelwise Generative Image Interpolation
A feed-forward neural network with positional encoding generates film cooling images from 30% fewer experimental measurements while achieving RMSE below 8% and SSIM above 93%.
- ESAM++: Efficient Online 3D Perception on the Edge
ESAM++ introduces a 3D Sparse Feature Pyramid Network for efficient online 3D scene perception on edge devices, claiming competitive accuracy with up to 3x faster inference and 2x smaller model size than ESAM on four benchmarks.
- Physics-Informed Graph Neural Network Surrogates for Turbulent Nanoparticle Dispersion in Dental Clinical Environments
ELGIN is a graph-based physics-informed surrogate model that predicts carrier flow and polydisperse particle motion in dental aerosol scenarios, achieving lower tracking errors and 37x speedup versus full OpenFOAM CFD in a preliminary single-case test.
- Search for pair production of additional neutral scalars within the Inert Doublet Model in a final state with two electrons or two muons in proton-proton collisions at $\sqrt{s}$ = 13 TeV and 13.6 TeV
No significant excess found; new exclusion limits reach m_H = 108 GeV for m_H - m_A = 78 GeV in the Inert Doublet Model.