RoundPipe achieves near-zero-bubble pipeline parallelism for LLM training on consumer GPUs by dynamically dispatching computation stages round-robin, yielding 1.48-2.16x speedups and enabling 235B model fine-tuning on 8x RTX 4090.
super hub Mixed citations
PyTorch: An Imperative Style, High-Performance Deep Learning Library
Mixed citation behavior. Most common role is background (53%).
abstract
Deep learning frameworks have often focused on either usability or speed, but not both. PyTorch is a machine learning library that shows that these two goals are in fact compatible: it provides an imperative and Pythonic programming style that supports code as a model, makes debugging easy and is consistent with other popular scientific computing libraries, while remaining efficient and supporting hardware accelerators such as GPUs. In this paper, we detail the principles that drove the implementation of PyTorch and how they are reflected in its architecture. We emphasize that every aspect of PyTorch is a regular Python program under the full control of its user. We also explain how the careful and pragmatic implementation of the key components of its runtime enables them to work together to achieve compelling performance. We demonstrate the efficiency of individual subsystems, as well as the overall speed of PyTorch on several common benchmarks.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract Deep learning frameworks have often focused on either usability or speed, but not both. PyTorch is a machine learning library that shows that these two goals are in fact compatible: it provides an imperative and Pythonic programming style that supports code as a model, makes debugging easy and is consistent with other popular scientific computing libraries, while remaining efficient and supporting hardware accelerators such as GPUs. In this paper, we detail the principles that drove the implementation of PyTorch and how they are reflected in its architecture. We emphasize that every aspect o
co-cited works
representative citing papers
Looped transformers with recall and outer normalization produce reachable, input-dependent fixed points with stable gradients, enabling generalization, while those without recall cannot; a new internal recall variant performs competitively or better.
First in-orbit demonstration of a DRL-trained AI satellite attitude controller that performs robust inertial pointing after sim-to-real transfer.
A two-pass optimization framework with polynomial-based simulation discovers heralded ballistic circuits for 3-5 qubit graph states achieving up to 7.5x higher success probabilities than fusion baselines, including first known circuits for some 5-qubit states.
Task vectors from weight differences allow arithmetic operations to edit pre-trained models, improving multiple tasks simultaneously and enabling analogical inference on unseen tasks.
Quantitative Bayesian inference using a deep-learning emulator detects 0.018-0.020 M_sun of helium in the Type Ic supernova 2014L.
ffortissimo is a JAX-based freeform forward-modeling pipeline that fits complex dust distributions and infers scattering properties in KLIP-reduced images of circumstellar disks such as HR 4796A.
A matrix-free, GPU-compatible PyTorch implementation of phase-field fracture with explicit dynamics, custom differentiable implicit damage solve, benchmarks on dynamic and quasi-static cases, and inverse recovery of fracture energy G_c via L-BFGS.
A Set-Transformer architecture with self-attention encodes Pauli-string correlations, optimizes via commutation objective, and finds symmetries with near-deterministic success on physical models like Ising and Toric code.
A new lattice method recasts SIGW integrals as FFT convolutions to compute fully non-Gaussian spectra in seconds with ~10% error on a radiation-dominated background.
Hybrid TimesFM plus ridge regression on covariates forecasts 1-MeV electron flux with average R² of 0.9 on out-of-sample 2024 data, outperforming linear regression, CNN, LSTM and Transformer models.
A neural network trained on simulations infers stripping times for Sagittarius stream stars from phase-space data, measuring a 0.3 dex/Gyr metallicity gradient and estimating ages for globular clusters such as Pal 12 and NGC 2419.
Events trigger on-the-fly LoRA module generation via hypernetworks over a shared team policy in MARL, paired with a Neural Manifold Diversity metric, enabling sequential role reassignment while preserving reward maximization.
Vibrational mode graphs from molecular dynamics enable sequence-free protein function prediction via graph neural networks, with entrainment improving signals for collective dynamics.
Dingo-Pop uses a transformer to perform amortized, end-to-end population inference from GW strain data in seconds, bypassing per-event Monte Carlo sampling.
Learning in low-rank RNNs reduces to an exact low-dimensional ODE system in overlap space, where loss-invisible overlaps encode training history without affecting function.
Dynamical magnetotropic susceptibility k(ω) acts as a probe of uniform spin and charge fluctuations, with its static scaling in α-RuCl3 arising specifically from dominant Kitaev interactions in the models examined.
Transformer networks sample up to 180x180 2D Ising systems and 64x64 Edwards-Anderson systems by generating spin groups with probability approximations, yielding ~20x higher effective sample size than prior neural samplers at criticality.
A single end-to-end Transformer model unifies stellar labels from heterogeneous spectroscopic surveys into a self-consistent scale without post-hoc recalibration.
Parametric neural networks learn likelihood ratios to infer top-philic scalar resonances from dip patterns caused by signal-background interference in hadron collider data.
A graph-conditioned meta-optimizer learns QAOA parameter trajectories from one problem class and transfers them to others, yielding better initializations than standard methods in an empirical study of 64 settings.
SMC forgets its initial condition geometrically in the jump chain and as 1/ℓ in continuous genetic distance, justifying independent-locus approximations.
Concept Graph Convolutions perform message passing on node concepts to increase interpretability of graph neural networks without losing task performance.
A neural network learns non-stationary anisotropic correlations from gridded CTM outputs and transfers the structure via LatticeKrig basis functions to station data for refined fine-scale NO2 predictions with uncertainty.
citing papers explorer
-
End-to-End Population Inference from Gravitational-Wave Strain using Transformers
Dingo-Pop uses a transformer to perform amortized, end-to-end population inference from GW strain data in seconds, bypassing per-event Monte Carlo sampling.
-
Dynamical magnetotropic susceptibility as a new probe of Kitaev materials and beyond
Dynamical magnetotropic susceptibility k(ω) acts as a probe of uniform spin and charge fluctuations, with its static scaling in α-RuCl3 arising specifically from dominant Kitaev interactions in the models examined.
-
Diffusion Models Beat GANs on Image Synthesis
Diffusion models with architecture improvements and classifier guidance achieve superior FID scores to GANs on unconditional and conditional ImageNet image synthesis.
-
Curvature-Aware Captioning:Leveraging Geodesic Attention for 3D Scene Understanding
A new framework combines self-attention on the Oblique manifold with bidirectional geodesic cross-attention on the Lorentz hyperboloid to improve both localization accuracy and descriptive coherence in 3D dense captioning.
-
Particle transformers for identifying Lorentz-boosted Higgs bosons decaying to a pair of W bosons
PaRT achieves >50% tagging efficiency for boosted H->WW jets at 1% background efficiency, decorrelated from jet mass, with data-to-simulation scale factors of 0.9-1.0 on 138 fb^{-1} of 13 TeV collisions.
-
Search for pair production of additional neutral scalars within the Inert Doublet Model in a final state with two electrons or two muons in proton-proton collisions at $\sqrt{s}$ = 13 TeV and 13.6 TeV
No significant excess found; new exclusion limits reach m_H = 108 GeV for m_H - m_A = 78 GeV in the Inert Doublet Model.
-
Latency Analysis and Optimization of Alpamayo 1 via Efficient Trajectory Generation
Redesigning Alpamayo 1 to single-reasoning and optimizing diffusion action generation cuts inference latency by 69.23% while preserving trajectory diversity and prediction quality.
-
A New Adaptive Deep Learning based Reduced Order Model for Hybrid-Type Parabolic PDEs: Rigorous Error Analysis and Applications
Two new DOD-based reduced-order models (DOD-DL-ROM and DOD+DFNN) are introduced for hybrid-type parabolic PDEs, with rigorous error bounds linking performance to optimal map regularity and conditions for outperforming POD methods.
-
AI4EOSC: a Federated Cloud Platform for Artificial Intelligence in Scientific Research
AI4EOSC is a federated cloud platform that integrates modular AI development, serverless AI-as-a-Service, and distributed orchestration with built-in FAIR metadata and provenance tracking for scientific AI workloads in EOSC.
-
Jarvis-HEP: A lightweight Python framework for workflow composition and parameter scans in high-energy physics
Jarvis-HEP introduces a YAML-based Python framework for composing workflows and performing parameter scans in high-energy physics.
-
Software and computing for Run 3 of the ATLAS experiment at the LHC
ATLAS reports on its Run 3 software infrastructure for data management, workflows, databases, validation, and physics analysis tools at the LHC.