hub Mixed citations

PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel

Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu · 2023 · cs.DC · arXiv 2304.11277

Mixed citation behavior. Most common role is background (50%).

95 Pith papers citing it

Background 50% of classified citations

open full Pith review browse 95 citing papers arXiv PDF

abstract

It is widely acknowledged that large models have the potential to deliver superior performance across a broad range of domains. Despite the remarkable progress made in the field of machine learning systems research, which has enabled the development and exploration of large models, such abilities remain confined to a small group of advanced users and industry leaders, resulting in an implicit technical barrier for the wider community to access and leverage these technologies. In this paper, we introduce PyTorch Fully Sharded Data Parallel (FSDP) as an industry-grade solution for large model training. FSDP has been closely co-designed with several key PyTorch core components including Tensor implementation, dispatcher system, and CUDA memory caching allocator, to provide non-intrusive user experiences and high training efficiency. Additionally, FSDP natively incorporates a range of techniques and settings to optimize resource utilization across a variety of hardware configurations. The experimental results demonstrate that FSDP is capable of achieving comparable performance to Distributed Data Parallel while providing support for significantly larger models with near-linear scalability in terms of TFLOPS.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 12 method 11 dataset 1

citation-polarity summary

background 12 use method 11 use dataset 1

claims ledger

abstract It is widely acknowledged that large models have the potential to deliver superior performance across a broad range of domains. Despite the remarkable progress made in the field of machine learning systems research, which has enabled the development and exploration of large models, such abilities remain confined to a small group of advanced users and industry leaders, resulting in an implicit technical barrier for the wider community to access and leverage these technologies. In this paper, we introduce PyTorch Fully Sharded Data Parallel (FSDP) as an industry-grade solution for large model tr

co-cited works

representative citing papers

OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models

cs.CV · 2026-04-05 · unverdicted · novelty 8.0

OP-GRPO is the first off-policy GRPO method for flow-matching models that reuses trajectories via replay buffer and importance sampling corrections, matching on-policy performance with 34.2% of the training steps.

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

cs.CV · 2026-01-15 · unverdicted · novelty 8.0

Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

cs.CV · 2024-09-25 · accept · novelty 8.0

Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.

ORPO: Monolithic Preference Optimization without Reference Model

cs.CL · 2024-03-12 · conditional · novelty 8.0

ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.

Transferability for General Reasoning: An Automated Curriculum for Multi-Domain RLVR

cs.AI · 2026-06-23 · unverdicted · novelty 7.0

TAC is a bandit curriculum for multi-domain RLVR that prioritizes domains whose gradient updates align with and benefit other domains, yielding up to 2.8-point macro accuracy gains over learnability-only baselines on Qwen3-1.7B and Llama3.2-3B.

Synthetic Benchmarks Overstate Forward-Forward Scaling: Real-Data Limits of Layer-Local Training

cs.CV · 2026-06-04 · conditional · novelty 7.0

DTG-FF reaches 91.8% on CIFAR-10 and 49.4% on ImageNet-100 224x224 but BP baselines beat it by 2.4-5.93 pp with gaps widening by class count on real data while reversing the synthetic trend.

Large Byte Model: Teaching Language Models About Compiled Code

cs.CR · 2026-06-01 · unverdicted · novelty 7.0

Presents a byte-native LLM with bespoke tokenizer achieving 69-98% accuracy on malware family and architecture classification from raw bytes.

SiDP: Memory-Efficient Data Parallelism for Offline LLM Inference

cs.DC · 2026-05-27 · unverdicted · novelty 7.0

SiDP distributes model weights across a DP group with WaS and CaS modes to increase KV cache capacity by up to 1.8x and end-to-end throughput by up to 1.5x over vLLM on H20/H200/B200 GPUs for offline LLM inference.

Heterogeneous Parallelism for Multimodal Large Language Model Training

cs.LG · 2026-05-26 · unverdicted · novelty 7.0

Heterogeneous parallelism decouples module layouts in multimodal LLM training via boundary communicators, yielding up to 49.3% TFLOPS/GPU gains in colocated mode and 13% throughput in non-colocated mode with convergence parity.

CEPO: RLVR Self-Distillation using Contrastive Evidence Policy Optimization

cs.LG · 2026-05-19 · conditional · novelty 7.0

CEPO sharpens token credit in RLVR by requiring tokens to be favored by the correct answer and disfavored by wrong answers drawn from rejected rollouts, delivering accuracy gains on five multimodal math benchmarks.

CausalCine: Real-Time Autoregressive Generation for Multi-Shot Video Narratives

cs.CV · 2026-05-12 · unverdicted · novelty 7.0

CausalCine enables real-time causal autoregressive multi-shot video generation via multi-shot training, content-aware memory routing for coherence, and distillation to few-step inference.

A satellite foundation model for improved wealth monitoring

cs.CY · 2026-04-25 · unverdicted · novelty 7.0

Tempov is a self-supervised satellite foundation model that predicts wealth levels and decadal changes at high resolution across Africa from Landsat imagery, outperforming baselines even with limited labels and generalizing temporally.

ALTO: Adaptive LoRA Tuning and Orchestration for Heterogeneous LoRA Training Workloads

cs.LG · 2026-04-07 · unverdicted · novelty 7.0

ALTO accelerates LoRA tuning up to 13.8x by monitoring loss trajectories for early stopping, using fused grouped GEMM with rank-local adapter parallelism, and combining intra- and inter-task scheduling for heterogeneous workloads without quality loss.

ChopGrad: Pixel-Wise Losses for Latent Video Diffusion via Truncated Backpropagation

cs.CV · 2026-03-18 · unverdicted · novelty 7.0

ChopGrad truncates backpropagation to local frame windows in video diffusion models, reducing memory from linear in frame count to constant while enabling pixel-wise loss fine-tuning.

Omni-Attribute: Open-vocabulary Attribute Encoder for Visual Concept Personalization

cs.CV · 2025-12-11 · unverdicted · novelty 7.0

Omni-Attribute is a new open-vocabulary image attribute encoder trained on semantically linked pairs with dual objectives to produce disentangled representations for personalization and compositional generation.

Training Agents Inside of Scalable World Models

cs.AI · 2025-09-29 · conditional · novelty 7.0

Dreamer 4 is the first agent to obtain diamonds in Minecraft from only offline data by reinforcement learning inside a scalable world model that accurately predicts game mechanics.

Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs

cs.CV · 2024-06-24 · unverdicted · novelty 7.0

Cambrian-1 is a vision-centric multimodal LLM family that evaluates over 20 vision encoders, introduces CV-Bench and the Spatial Vision Aggregator, and releases open models, code, and data achieving strong performance on visual grounding tasks.

Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

cs.CV · 2024-06-10 · conditional · novelty 7.0

Scaled vanilla autoregressive models based on Llama achieve 2.18 FID on ImageNet 256x256 image generation, beating popular diffusion models without visual inductive biases.

Scaling and evaluating sparse autoencoders

cs.LG · 2024-06-06 · unverdicted · novelty 7.0

K-sparse autoencoders with dead-latent fixes produce clean scaling laws and better feature quality metrics that improve with size, shown by training a 16-million-latent model on GPT-4 activations.

One Forward Beats Two: InnerZoom for Accurate and Efficient GUI Grounding

cs.CV · 2026-06-29 · unverdicted · novelty 6.0

InnerZoom bridges cross-layer evidence in one forward pass to achieve SOTA GUI grounding accuracy on six benchmarks while cutting latency up to 31.8% versus two-pass baselines.

InteractiveAvatar: Real-Time Streaming Video Generation for Consistent and Intent-Aware Avatars

cs.CV · 2026-06-22 · unverdicted · novelty 6.0

InteractiveAvatar is a real-time infinite-streaming avatar video generation system using autoregressive distillation, Long-Short Visual Memory for consistency, and a Reasoning-Reaction Module for intent-aware interactions.

Multi-Faceted Interactivity Alignment in Full-Duplex Speech Models

cs.CL · 2026-06-09 · unverdicted · novelty 6.0

A multi-axis RL alignment technique improves pause handling, turn-taking, backchanneling, and interruption response in full-duplex spoken dialogue models by optimizing axis-specific rewards derived from human audio segments.

Breaking the Bubble: Asynchronous Pipeline Parallel Training with Bounded Weight Inconsistency

cs.LG · 2026-06-05 · unverdicted · novelty 6.0

PACI enables bubble-free asynchronous pipeline training by bounding version drift via local gradient accumulation, matching synchronous stability with higher throughput and no extra memory.

SGMD: Score Gradient Matching Distillation for Few-Step Video Diffusion Distillation

cs.CV · 2026-05-28 · unverdicted · novelty 6.0

SGMD uses fake-score optimization toward the teacher with stop-gradient Fisher objective and NR/RC dual potentials to deliver ~3x training speedup and better motion dynamics in 4-step video diffusion models.

citing papers explorer

Showing 50 of 95 citing papers.

OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models cs.CV · 2026-04-05 · unverdicted · none · ref 47 · internal anchor
OP-GRPO is the first off-policy GRPO method for flow-matching models that reuses trajectories via replay buffer and importance sampling corrections, matching on-policy performance with 34.2% of the training steps.
Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding cs.CV · 2026-01-15 · unverdicted · none · ref 185 · internal anchor
Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models cs.CV · 2024-09-25 · accept · none · ref 135 · internal anchor
Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.
ORPO: Monolithic Preference Optimization without Reference Model cs.CL · 2024-03-12 · conditional · none · ref 59 · internal anchor
ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.
Transferability for General Reasoning: An Automated Curriculum for Multi-Domain RLVR cs.AI · 2026-06-23 · unverdicted · none · ref 62 · internal anchor
TAC is a bandit curriculum for multi-domain RLVR that prioritizes domains whose gradient updates align with and benefit other domains, yielding up to 2.8-point macro accuracy gains over learnability-only baselines on Qwen3-1.7B and Llama3.2-3B.
Synthetic Benchmarks Overstate Forward-Forward Scaling: Real-Data Limits of Layer-Local Training cs.CV · 2026-06-04 · conditional · none · ref 8 · internal anchor
DTG-FF reaches 91.8% on CIFAR-10 and 49.4% on ImageNet-100 224x224 but BP baselines beat it by 2.4-5.93 pp with gaps widening by class count on real data while reversing the synthetic trend.
Large Byte Model: Teaching Language Models About Compiled Code cs.CR · 2026-06-01 · unverdicted · none · ref 14 · internal anchor
Presents a byte-native LLM with bespoke tokenizer achieving 69-98% accuracy on malware family and architecture classification from raw bytes.
SiDP: Memory-Efficient Data Parallelism for Offline LLM Inference cs.DC · 2026-05-27 · unverdicted · none · ref 36 · internal anchor
SiDP distributes model weights across a DP group with WaS and CaS modes to increase KV cache capacity by up to 1.8x and end-to-end throughput by up to 1.5x over vLLM on H20/H200/B200 GPUs for offline LLM inference.
Heterogeneous Parallelism for Multimodal Large Language Model Training cs.LG · 2026-05-26 · unverdicted · none · ref 22 · internal anchor
Heterogeneous parallelism decouples module layouts in multimodal LLM training via boundary communicators, yielding up to 49.3% TFLOPS/GPU gains in colocated mode and 13% throughput in non-colocated mode with convergence parity.
CEPO: RLVR Self-Distillation using Contrastive Evidence Policy Optimization cs.LG · 2026-05-19 · conditional · none · ref 27 · internal anchor
CEPO sharpens token credit in RLVR by requiring tokens to be favored by the correct answer and disfavored by wrong answers drawn from rejected rollouts, delivering accuracy gains on five multimodal math benchmarks.
CausalCine: Real-Time Autoregressive Generation for Multi-Shot Video Narratives cs.CV · 2026-05-12 · unverdicted · none · ref 59 · internal anchor
CausalCine enables real-time causal autoregressive multi-shot video generation via multi-shot training, content-aware memory routing for coherence, and distillation to few-step inference.
A satellite foundation model for improved wealth monitoring cs.CY · 2026-04-25 · unverdicted · none · ref 46 · internal anchor
Tempov is a self-supervised satellite foundation model that predicts wealth levels and decadal changes at high resolution across Africa from Landsat imagery, outperforming baselines even with limited labels and generalizing temporally.
ALTO: Adaptive LoRA Tuning and Orchestration for Heterogeneous LoRA Training Workloads cs.LG · 2026-04-07 · unverdicted · none · ref 29 · internal anchor
ALTO accelerates LoRA tuning up to 13.8x by monitoring loss trajectories for early stopping, using fused grouped GEMM with rank-local adapter parallelism, and combining intra- and inter-task scheduling for heterogeneous workloads without quality loss.
ChopGrad: Pixel-Wise Losses for Latent Video Diffusion via Truncated Backpropagation cs.CV · 2026-03-18 · unverdicted · none · ref 79 · internal anchor
ChopGrad truncates backpropagation to local frame windows in video diffusion models, reducing memory from linear in frame count to constant while enabling pixel-wise loss fine-tuning.
Omni-Attribute: Open-vocabulary Attribute Encoder for Visual Concept Personalization cs.CV · 2025-12-11 · unverdicted · none · ref 75 · internal anchor
Omni-Attribute is a new open-vocabulary image attribute encoder trained on semantically linked pairs with dual objectives to produce disentangled representations for personalization and compositional generation.
Training Agents Inside of Scalable World Models cs.AI · 2025-09-29 · conditional · none · ref 44 · internal anchor
Dreamer 4 is the first agent to obtain diamonds in Minecraft from only offline data by reinforcement learning inside a scalable world model that accurately predicts game mechanics.
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs cs.CV · 2024-06-24 · unverdicted · none · ref 150 · internal anchor
Cambrian-1 is a vision-centric multimodal LLM family that evaluates over 20 vision encoders, introduces CV-Bench and the Spatial Vision Aggregator, and releases open models, code, and data achieving strong performance on visual grounding tasks.
Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation cs.CV · 2024-06-10 · conditional · none · ref 44 · internal anchor
Scaled vanilla autoregressive models based on Llama achieve 2.18 FID on ImageNet 256x256 image generation, beating popular diffusion models without visual inductive biases.
Scaling and evaluating sparse autoencoders cs.LG · 2024-06-06 · unverdicted · none · ref 71 · internal anchor
K-sparse autoencoders with dead-latent fixes produce clean scaling laws and better feature quality metrics that improve with size, shown by training a 16-million-latent model on GPT-4 activations.
One Forward Beats Two: InnerZoom for Accurate and Efficient GUI Grounding cs.CV · 2026-06-29 · unverdicted · none · ref 146 · internal anchor
InnerZoom bridges cross-layer evidence in one forward pass to achieve SOTA GUI grounding accuracy on six benchmarks while cutting latency up to 31.8% versus two-pass baselines.
InteractiveAvatar: Real-Time Streaming Video Generation for Consistent and Intent-Aware Avatars cs.CV · 2026-06-22 · unverdicted · none · ref 48 · internal anchor
InteractiveAvatar is a real-time infinite-streaming avatar video generation system using autoregressive distillation, Long-Short Visual Memory for consistency, and a Reasoning-Reaction Module for intent-aware interactions.
Multi-Faceted Interactivity Alignment in Full-Duplex Speech Models cs.CL · 2026-06-09 · unverdicted · none · ref 57 · internal anchor
A multi-axis RL alignment technique improves pause handling, turn-taking, backchanneling, and interruption response in full-duplex spoken dialogue models by optimizing axis-specific rewards derived from human audio segments.
Breaking the Bubble: Asynchronous Pipeline Parallel Training with Bounded Weight Inconsistency cs.LG · 2026-06-05 · unverdicted · none · ref 28 · internal anchor
PACI enables bubble-free asynchronous pipeline training by bounding version drift via local gradient accumulation, matching synchronous stability with higher throughput and no extra memory.
SGMD: Score Gradient Matching Distillation for Few-Step Video Diffusion Distillation cs.CV · 2026-05-28 · unverdicted · none · ref 11 · internal anchor
SGMD uses fake-score optimization toward the teacher with stop-gradient Fisher objective and NR/RC dual potentials to deliver ~3x training speedup and better motion dynamics in 4-step video diffusion models.
Convex Optimization for Alignment and Preference Learning on a Single GPU cs.LG · 2026-05-22 · unverdicted · none · ref 40 · internal anchor
COALA applies convex optimization reformulations of neural networks to direct preference optimization, claiming single-GPU training with ~18% of DPO's TFLOPs and competitive performance on multiple datasets and models up to 8B parameters.
STELLAR: Scaling 3D Perception Large Models for Autonomous Driving cs.CV · 2026-05-19 · unverdicted · none · ref 40 · internal anchor
STELLAR trains up to 500M-parameter multi-modal models on 50M driving scenes and reports empirical scaling trends plus new state-of-the-art results on the Waymo Open Dataset.
EmbGen: Teaching with Reassembled Corpora cs.CL · 2026-05-19 · unverdicted · none · ref 39 · internal anchor
EmbGen creates synthetic QA data by entity decomposition, embedding-based reassembly into clusters, and multi-level sampling with cluster-specific prompts, yielding up to 88.9% higher Binary Accuracy than baselines on heterogeneous datasets under fixed token budgets.
EPIC: Abstraction and Polymorphism of In-Network Collectives on Ethernet cs.DC · 2026-05-18 · unverdicted · none · ref 93 · internal anchor
EPIC defines a unified abstraction for in-network collectives on Ethernet with polymorphic implementations and modular design to support incremental hardware evolution.
How Off-Policy Can GRPO Be? Mu-GRPO for Efficient LLM Reinforcement Learning cs.LG · 2026-05-17 · conditional · none · ref 33 · internal anchor
Mu-GRPO enables substantially more off-policy GRPO training for LLMs via relaxed clipping and negative-advantage veto in large staged batches, matching standard GRPO performance at ~2x training speed.
Runtime-Orchestrated Second-Order Optimization for Scalable LLM Training cs.DC · 2026-05-15 · unverdicted · none · ref 10 · internal anchor
Asteria is a runtime system that enables second-order optimization for LLMs by dynamically distributing optimizer state across GPU, CPU, and NVMe while using asynchronous inverse-root computations and bounded-staleness synchronization.
A Few GPUs, A Whole Lotta Scale: Faithful LLM Training Emulation with PrismLLM cs.DC · 2026-05-15 · conditional · none · ref 36 · internal anchor
PrismLLM constructs a sliced execution graph and uses hybrid emulation to faithfully reproduce performance and memory behavior of up to 8192-GPU LLM training runs on fewer than 1% of the original GPUs.
Performance-Driven Policy Optimization for Speculative Decoding with Adaptive Windowing cs.CL · 2026-05-14 · unverdicted · none · ref 33 · internal anchor
PPOW uses window-level RL with cost-aware speedup and proximity rewards plus adaptive divergence-aware windowing to reach 6.29-6.52 acceptance lengths and 3.39-4.36x speedups in speculative decoding.
DynaTrain: Fast Online Parallelism Switching for Elastic LLM Training cs.LG · 2026-05-12 · unverdicted · none · ref 17 · internal anchor
DynaTrain introduces a Virtual Parameter Space abstraction to enable sub-second online parallelism reconfiguration for elastic LLM training on models up to 235B parameters.
ReCoVer: Resilient LLM Pre-Training System via Fault-Tolerant Collective and Versatile Workload cs.DC · 2026-05-11 · unverdicted · none · ref 41 · 2 links · internal anchor
ReCoVer maintains constant microbatch counts per iteration via fault-tolerant collectives, in-step recovery, and versatile workload redistribution to preserve training trajectory on up to 512 GPUs despite losing 256, yielding 2.23× higher effective throughput than checkpoint-restart.
ShardTensor: Domain Parallelism for Scientific Machine Learning cs.DC · 2026-05-11 · unverdicted · none · ref 29 · internal anchor
ShardTensor is a domain-parallelism system for SciML that enables flexible scaling of extreme-resolution spatial datasets by removing the constraint of batch size one per device.
LoKA: Low-precision Kernel Applications for Recommendation Models At Scale cs.LG · 2026-05-11 · unverdicted · none · ref 86 · 2 links · internal anchor
LoKA enables practical FP8 use in numerically sensitive large recommendation models via online profiling of activations, reusable model modifications for stability, and dynamic kernel dispatching.
HELLoRA: Hot Experts Layer-Level Low-Rank Adaptation for Mixture-of-Experts Models cs.LG · 2026-05-11 · unverdicted · none · ref 10 · internal anchor
HELLoRA selectively applies LoRA adapters to hot experts in MoE layers, using as little as 15.7% of standard LoRA parameters while improving accuracy by 9.2% on OlMoE across math, code, and alignment tasks.
DisagMoE: Computation-Communication overlapped MoE Training via Disaggregated AF-Pipe Parallelism cs.LG · 2026-05-10 · unverdicted · none · ref 34 · internal anchor
DisagMoE achieves up to 1.8x faster MoE training by disaggregating attention and FFN layers into disjoint GPU groups with a multi-stage uni-directional pipeline and roofline-based bandwidth balancing.
MegaScale-Omni: A Hyper-Scale, Workload-Resilient System for MultiModal LLM Training in Production cs.DC · 2026-05-09 · unverdicted · none · ref 66 · internal anchor
MegaScale-Omni delivers 1.27x-7.57x higher throughput for dynamic multimodal LLM training by decoupling encoder and LLM parallelism, using unified colocation, and applying adaptive workload balancing.
FlashEvolve: Accelerating Agent Self-Evolution with Asynchronous Stage Orchestration cs.LG · 2026-05-08 · unverdicted · none · ref 36 · internal anchor
FlashEvolve accelerates LLM agent self-evolution via asynchronous stage orchestration and inspectable language-space staleness handling, reporting 3.5-4.9x proposal throughput gains over synchronous baselines on GEPA workloads.
AsymTalker: Identity-Consistent Long-Term Talking Head Generation via Asymmetric Distillation cs.LG · 2026-05-01 · unverdicted · none · ref 21 · 2 links · internal anchor
AsymTalker uses temporal reference encoding and asymmetric knowledge distillation to produce identity-consistent talking head videos up to 600 seconds long at 66 FPS.
LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning cs.RO · 2026-04-30 · unverdicted · none · ref 88 · 2 links · internal anchor
LaST-R1 introduces a RL post-training method called LAPO that optimizes latent Chain-of-Thought reasoning in vision-language-action models, yielding 99.9% success on LIBERO and up to 22.5% real-world gains.
JigsawRL: Assembling RL Pipelines for Efficient LLM Post-Training cs.LG · 2026-04-26 · unverdicted · none · ref 65 · internal anchor
JigsawRL achieves up to 1.85x higher throughput in LLM RL pipelines via pipeline multiplexing, sub-stage graphs, and look-ahead scheduling compared to prior systems.
MARS$^2$: Scaling Multi-Agent Tree Search via Reinforcement Learning for Code Generation cs.AI · 2026-04-16 · unverdicted · none · ref 33 · internal anchor
MARS² integrates multi-agent collaboration with tree-structured search in RL to boost code generation by increasing exploratory diversity and using path-level group advantages for credit assignment.
Nucleus-Image: Sparse MoE for Image Generation cs.CV · 2026-04-14 · unverdicted · none · ref 38 · internal anchor
A 17B-parameter sparse MoE diffusion transformer activates 2B parameters per pass and reaches competitive quality on image generation benchmarks without post-training.
OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation cs.CV · 2026-04-13 · unverdicted · none · ref 75 · internal anchor
OmniShow unifies text, image, audio, and pose conditions into an end-to-end model for high-quality human-object interaction video generation and introduces the HOIVG-Bench benchmark, claiming state-of-the-art results.
Relax: An Asynchronous Reinforcement Learning Engine for Omni-Modal Post-Training at Scale cs.CL · 2026-04-13 · unverdicted · none · ref 19 · internal anchor
Relax is a new RL training engine with omni-native design and async execution that delivers up to 2x speedups over baselines like veRL while converging to equivalent reward levels on Qwen3 models.
Continuous Adversarial Flow Models cs.LG · 2026-04-13 · unverdicted · none · ref 82 · internal anchor
Continuous adversarial flow models replace MSE in flow matching with adversarial training via a discriminator, improving guidance-free FID on ImageNet from 8.26 to 3.63 for SiT and similar gains for JiT and text-to-image benchmarks.
DeepStack: Scalable and Accurate Design Space Exploration for Distributed 3D-Stacked AI Accelerators cs.AR · 2026-04-06 · conditional · none · ref 122 · internal anchor
DeepStack introduces a fast performance model and hierarchical search method for co-optimizing 3D DRAM stacking, interconnects, and distributed scheduling in AI accelerators, delivering up to 9.5x throughput gains over baselines.
veScale-FSDP: Flexible and High-Performance FSDP at Scale cs.DC · 2026-02-25 · unverdicted · none · ref 35 · internal anchor
veScale-FSDP uses RaggedShard and structure-aware planning to support block-wise quantization and non-element-wise optimizers while delivering 5-66% higher throughput and 16-30% lower memory than prior FSDP systems at massive scale.

PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel

hub tools

citation-role summary

citation-polarity summary

claims ledger

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer