PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel

Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu · 2023 · cs.DC · arXiv 2304.11277

Mixed citation behavior. Most common role is background (50%).

79 Pith papers citing it

Background 50% of classified citations

abstract

It is widely acknowledged that large models have the potential to deliver superior performance across a broad range of domains. Despite the remarkable progress made in the field of machine learning systems research, which has enabled the development and exploration of large models, such abilities remain confined to a small group of advanced users and industry leaders, resulting in an implicit technical barrier for the wider community to access and leverage these technologies. In this paper, we introduce PyTorch Fully Sharded Data Parallel (FSDP) as an industry-grade solution for large model training. FSDP has been closely co-designed with several key PyTorch core components including Tensor implementation, dispatcher system, and CUDA memory caching allocator, to provide non-intrusive user experiences and high training efficiency. Additionally, FSDP natively incorporates a range of techniques and settings to optimize resource utilization across a variety of hardware configurations. The experimental results demonstrate that FSDP is capable of achieving comparable performance to Distributed Data Parallel while providing support for significantly larger models with near-linear scalability in terms of TFLOPS.

citation-role summary

background 12 · method 11 · dataset 1

citation-polarity summary

background 12 · use method 11 · use dataset 1

representative citing papers

OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models

cs.CV · 2026-04-05 · unverdicted · novelty 8.0

OP-GRPO is the first off-policy GRPO method for flow-matching models that reuses trajectories via replay buffer and importance sampling corrections, matching on-policy performance with 34.2% of the training steps.

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

cs.CV · 2026-01-15 · unverdicted · novelty 8.0

Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

cs.CV · 2024-09-25 · accept · novelty 8.0

Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.

ORPO: Monolithic Preference Optimization without Reference Model

cs.CL · 2024-03-12 · conditional · novelty 8.0

ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.

CEPO: RLVR Self-Distillation using Contrastive Evidence Policy Optimization

cs.LG · 2026-05-19 · conditional · novelty 7.0

CEPO sharpens token credit in RLVR by requiring tokens to be favored by the correct answer and disfavored by wrong answers drawn from rejected rollouts, delivering accuracy gains on five multimodal math benchmarks.

CausalCine: Real-Time Autoregressive Generation for Multi-Shot Video Narratives

cs.CV · 2026-05-12 · unverdicted · novelty 7.0

CausalCine enables real-time causal autoregressive multi-shot video generation via multi-shot training, content-aware memory routing for coherence, and distillation to few-step inference.

A satellite foundation model for improved wealth monitoring

cs.CY · 2026-04-25 · unverdicted · novelty 7.0

Tempov is a self-supervised satellite foundation model that predicts wealth levels and decadal changes at high resolution across Africa from Landsat imagery, outperforming baselines even with limited labels and generalizing temporally.

ALTO: Adaptive LoRA Tuning and Orchestration for Heterogeneous LoRA Training Workloads

cs.LG · 2026-04-07 · unverdicted · novelty 7.0

ALTO accelerates LoRA tuning up to 13.8x by monitoring loss trajectories for early stopping, using fused grouped GEMM with rank-local adapter parallelism, and combining intra- and inter-task scheduling for heterogeneous workloads without quality loss.

ChopGrad: Pixel-Wise Losses for Latent Video Diffusion via Truncated Backpropagation

cs.CV · 2026-03-18 · unverdicted · novelty 7.0

ChopGrad truncates backpropagation to local frame windows in video diffusion models, reducing memory from linear in frame count to constant while enabling pixel-wise loss fine-tuning.

Omni-Attribute: Open-vocabulary Attribute Encoder for Visual Concept Personalization

cs.CV · 2025-12-11 · unverdicted · novelty 7.0

Omni-Attribute is a new open-vocabulary image attribute encoder trained on semantically linked pairs with dual objectives to produce disentangled representations for personalization and compositional generation.

Training Agents Inside of Scalable World Models

cs.AI · 2025-09-29 · conditional · novelty 7.0

Dreamer 4 is the first agent to obtain diamonds in Minecraft from only offline data by reinforcement learning inside a scalable world model that accurately predicts game mechanics.

Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs

cs.CV · 2024-06-24 · unverdicted · novelty 7.0

Cambrian-1 is a vision-centric multimodal LLM family that evaluates over 20 vision encoders, introduces CV-Bench and the Spatial Vision Aggregator, and releases open models, code, and data achieving strong performance on visual grounding tasks.

Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

cs.CV · 2024-06-10 · conditional · novelty 7.0

Scaled vanilla autoregressive models based on Llama achieve 2.18 FID on ImageNet 256x256 image generation, beating popular diffusion models without visual inductive biases.

Scaling and evaluating sparse autoencoders

cs.LG · 2024-06-06 · unverdicted · novelty 7.0

K-sparse autoencoders with dead-latent fixes produce clean scaling laws and better feature quality metrics that improve with size, shown by training a 16-million-latent model on GPT-4 activations.

Convex Optimization for Alignment and Preference Learning on a Single GPU

cs.LG · 2026-05-22 · unverdicted · novelty 6.0

COALA applies convex optimization reformulations of neural networks to direct preference optimization, claiming single-GPU training with ~18% of DPO's TFLOPs and competitive performance on multiple datasets and models up to 8B parameters.

STELLAR: Scaling 3D Perception Large Models for Autonomous Driving

cs.CV · 2026-05-19 · unverdicted · novelty 6.0

STELLAR trains up to 500M-parameter multi-modal models on 50M driving scenes and reports empirical scaling trends plus new state-of-the-art results on the Waymo Open Dataset.

EmbGen: Teaching with Reassembled Corpora

cs.CL · 2026-05-19 · unverdicted · novelty 6.0

EmbGen creates synthetic QA data by entity decomposition, embedding-based reassembly into clusters, and multi-level sampling with cluster-specific prompts, yielding up to 88.9% higher Binary Accuracy than baselines on heterogeneous datasets under fixed token budgets.

EPIC: Abstraction and Polymorphism of In-Network Collectives on Ethernet

cs.DC · 2026-05-18 · unverdicted · novelty 6.0

EPIC defines a unified abstraction for in-network collectives on Ethernet with polymorphic implementations and modular design to support incremental hardware evolution.

How Off-Policy Can GRPO Be? Mu-GRPO for Efficient LLM Reinforcement Learning

cs.LG · 2026-05-17 · conditional · novelty 6.0

Mu-GRPO enables substantially more off-policy GRPO training for LLMs via relaxed clipping and negative-advantage veto in large staged batches, matching standard GRPO performance at ~2x training speed.

Runtime-Orchestrated Second-Order Optimization for Scalable LLM Training

cs.DC · 2026-05-15 · unverdicted · novelty 6.0

Asteria is a runtime system that enables second-order optimization for LLMs by dynamically distributing optimizer state across GPU, CPU, and NVMe while using asynchronous inverse-root computations and bounded-staleness synchronization.

A Few GPUs, A Whole Lotta Scale: Faithful LLM Training Emulation with PrismLLM

cs.DC · 2026-05-15 · conditional · novelty 6.0

PrismLLM constructs a sliced execution graph and uses hybrid emulation to faithfully reproduce performance and memory behavior of up to 8192-GPU LLM training runs on fewer than 1% of the original GPUs.

Performance-Driven Policy Optimization for Speculative Decoding with Adaptive Windowing

cs.CL · 2026-05-14 · unverdicted · novelty 6.0

PPOW uses window-level RL with cost-aware speedup and proximity rewards plus adaptive divergence-aware windowing to reach 6.29-6.52 acceptance lengths and 3.39-4.36x speedups in speculative decoding.

DynaTrain: Fast Online Parallelism Switching for Elastic LLM Training

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

DynaTrain introduces a Virtual Parameter Space abstraction to enable sub-second online parallelism reconfiguration for elastic LLM training on models up to 235B parameters.

ReCoVer: Resilient LLM Pre-Training System via Fault-Tolerant Collective and Versatile Workload

cs.DC · 2026-05-11 · unverdicted · novelty 6.0 · 2 refs

ReCoVer maintains constant microbatch counts per iteration via fault-tolerant collectives, in-step recovery, and versatile workload redistribution to preserve training trajectory on up to 512 GPUs despite losing 256, yielding 2.23× higher effective throughput than checkpoint-restart.

citing papers explorer

Showing 50 of 79 citing papers.

OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models
cs.CV · 2026-04-05 · unverdicted · none · ref 47 · internal anchor
OP-GRPO is the first off-policy GRPO method for flow-matching models that reuses trajectories via replay buffer and importance sampling corrections, matching on-policy performance with 34.2% of the training steps.

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding
cs.CV · 2026-01-15 · unverdicted · none · ref 185 · internal anchor
Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models
cs.CV · 2024-09-25 · accept · none · ref 135 · internal anchor
Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.

ORPO: Monolithic Preference Optimization without Reference Model
cs.CL · 2024-03-12 · conditional · none · ref 59 · internal anchor
ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.

CEPO: RLVR Self-Distillation using Contrastive Evidence Policy Optimization
cs.LG · 2026-05-19 · conditional · none · ref 27 · internal anchor
CEPO sharpens token credit in RLVR by requiring tokens to be favored by the correct answer and disfavored by wrong answers drawn from rejected rollouts, delivering accuracy gains on five multimodal math benchmarks.

CausalCine: Real-Time Autoregressive Generation for Multi-Shot Video Narratives
cs.CV · 2026-05-12 · unverdicted · none · ref 59 · internal anchor
CausalCine enables real-time causal autoregressive multi-shot video generation via multi-shot training, content-aware memory routing for coherence, and distillation to few-step inference.

A satellite foundation model for improved wealth monitoring
cs.CY · 2026-04-25 · unverdicted · none · ref 46 · internal anchor
Tempov is a self-supervised satellite foundation model that predicts wealth levels and decadal changes at high resolution across Africa from Landsat imagery, outperforming baselines even with limited labels and generalizing temporally.

ALTO: Adaptive LoRA Tuning and Orchestration for Heterogeneous LoRA Training Workloads
cs.LG · 2026-04-07 · unverdicted · none · ref 29 · internal anchor
ALTO accelerates LoRA tuning up to 13.8x by monitoring loss trajectories for early stopping, using fused grouped GEMM with rank-local adapter parallelism, and combining intra- and inter-task scheduling for heterogeneous workloads without quality loss.

ChopGrad: Pixel-Wise Losses for Latent Video Diffusion via Truncated Backpropagation
cs.CV · 2026-03-18 · unverdicted · none · ref 79 · internal anchor
ChopGrad truncates backpropagation to local frame windows in video diffusion models, reducing memory from linear in frame count to constant while enabling pixel-wise loss fine-tuning.

Omni-Attribute: Open-vocabulary Attribute Encoder for Visual Concept Personalization
cs.CV · 2025-12-11 · unverdicted · none · ref 75 · internal anchor
Omni-Attribute is a new open-vocabulary image attribute encoder trained on semantically linked pairs with dual objectives to produce disentangled representations for personalization and compositional generation.

Training Agents Inside of Scalable World Models
cs.AI · 2025-09-29 · conditional · none · ref 44 · internal anchor
Dreamer 4 is the first agent to obtain diamonds in Minecraft from only offline data by reinforcement learning inside a scalable world model that accurately predicts game mechanics.

Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
cs.CV · 2024-06-24 · unverdicted · none · ref 150 · internal anchor
Cambrian-1 is a vision-centric multimodal LLM family that evaluates over 20 vision encoders, introduces CV-Bench and the Spatial Vision Aggregator, and releases open models, code, and data achieving strong performance on visual grounding tasks.

Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation
cs.CV · 2024-06-10 · conditional · none · ref 44 · internal anchor
Scaled vanilla autoregressive models based on Llama achieve 2.18 FID on ImageNet 256x256 image generation, beating popular diffusion models without visual inductive biases.

Scaling and evaluating sparse autoencoders
cs.LG · 2024-06-06 · unverdicted · none · ref 71 · internal anchor
K-sparse autoencoders with dead-latent fixes produce clean scaling laws and better feature quality metrics that improve with size, shown by training a 16-million-latent model on GPT-4 activations.

Convex Optimization for Alignment and Preference Learning on a Single GPU
cs.LG · 2026-05-22 · unverdicted · none · ref 40 · internal anchor
COALA applies convex optimization reformulations of neural networks to direct preference optimization, claiming single-GPU training with ~18% of DPO's TFLOPs and competitive performance on multiple datasets and models up to 8B parameters.

STELLAR: Scaling 3D Perception Large Models for Autonomous Driving
cs.CV · 2026-05-19 · unverdicted · none · ref 40 · internal anchor
STELLAR trains up to 500M-parameter multi-modal models on 50M driving scenes and reports empirical scaling trends plus new state-of-the-art results on the Waymo Open Dataset.

EmbGen: Teaching with Reassembled Corpora
cs.CL · 2026-05-19 · unverdicted · none · ref 39 · internal anchor
EmbGen creates synthetic QA data by entity decomposition, embedding-based reassembly into clusters, and multi-level sampling with cluster-specific prompts, yielding up to 88.9% higher Binary Accuracy than baselines on heterogeneous datasets under fixed token budgets.

EPIC: Abstraction and Polymorphism of In-Network Collectives on Ethernet
cs.DC · 2026-05-18 · unverdicted · none · ref 93 · internal anchor
EPIC defines a unified abstraction for in-network collectives on Ethernet with polymorphic implementations and modular design to support incremental hardware evolution.

How Off-Policy Can GRPO Be? Mu-GRPO for Efficient LLM Reinforcement Learning
cs.LG · 2026-05-17 · conditional · none · ref 33 · internal anchor
Mu-GRPO enables substantially more off-policy GRPO training for LLMs via relaxed clipping and negative-advantage veto in large staged batches, matching standard GRPO performance at ~2x training speed.

Runtime-Orchestrated Second-Order Optimization for Scalable LLM Training
cs.DC · 2026-05-15 · unverdicted · none · ref 10 · internal anchor
Asteria is a runtime system that enables second-order optimization for LLMs by dynamically distributing optimizer state across GPU, CPU, and NVMe while using asynchronous inverse-root computations and bounded-staleness synchronization.

A Few GPUs, A Whole Lotta Scale: Faithful LLM Training Emulation with PrismLLM
cs.DC · 2026-05-15 · conditional · none · ref 36 · internal anchor
PrismLLM constructs a sliced execution graph and uses hybrid emulation to faithfully reproduce performance and memory behavior of up to 8192-GPU LLM training runs on fewer than 1% of the original GPUs.

Performance-Driven Policy Optimization for Speculative Decoding with Adaptive Windowing
cs.CL · 2026-05-14 · unverdicted · none · ref 33 · internal anchor
PPOW uses window-level RL with cost-aware speedup and proximity rewards plus adaptive divergence-aware windowing to reach 6.29-6.52 acceptance lengths and 3.39-4.36x speedups in speculative decoding.

DynaTrain: Fast Online Parallelism Switching for Elastic LLM Training
cs.LG · 2026-05-12 · unverdicted · none · ref 17 · internal anchor
DynaTrain introduces a Virtual Parameter Space abstraction to enable sub-second online parallelism reconfiguration for elastic LLM training on models up to 235B parameters.

ReCoVer: Resilient LLM Pre-Training System via Fault-Tolerant Collective and Versatile Workload
cs.DC · 2026-05-11 · unverdicted · none · ref 41 · 2 links · internal anchor
ReCoVer maintains constant microbatch counts per iteration via fault-tolerant collectives, in-step recovery, and versatile workload redistribution to preserve training trajectory on up to 512 GPUs despite losing 256, yielding 2.23× higher effective throughput than checkpoint-restart.

ShardTensor: Domain Parallelism for Scientific Machine Learning
cs.DC · 2026-05-11 · unverdicted · none · ref 29 · internal anchor
ShardTensor is a domain-parallelism system for SciML that enables flexible scaling of extreme-resolution spatial datasets by removing the constraint of batch size one per device.

LoKA: Low-precision Kernel Applications for Recommendation Models At Scale
cs.LG · 2026-05-11 · unverdicted · none · ref 86 · 2 links · internal anchor
LoKA enables practical FP8 use in numerically sensitive large recommendation models via online profiling of activations, reusable model modifications for stability, and dynamic kernel dispatching.

HELLoRA: Hot Experts Layer-Level Low-Rank Adaptation for Mixture-of-Experts Models
cs.LG · 2026-05-11 · unverdicted · none · ref 10 · internal anchor
HELLoRA selectively applies LoRA adapters to hot experts in MoE layers, using as little as 15.7% of standard LoRA parameters while improving accuracy by 9.2% on OlMoE across math, code, and alignment tasks.

DisagMoE: Computation-Communication overlapped MoE Training via Disaggregated AF-Pipe Parallelism
cs.LG · 2026-05-10 · unverdicted · none · ref 34 · internal anchor
DisagMoE achieves up to 1.8x faster MoE training by disaggregating attention and FFN layers into disjoint GPU groups with a multi-stage uni-directional pipeline and roofline-based bandwidth balancing.

MegaScale-Omni: A Hyper-Scale, Workload-Resilient System for MultiModal LLM Training in Production
cs.DC · 2026-05-09 · unverdicted · none · ref 66 · internal anchor
MegaScale-Omni delivers 1.27x-7.57x higher throughput for dynamic multimodal LLM training by decoupling encoder and LLM parallelism, using unified colocation, and applying adaptive workload balancing.

FlashEvolve: Accelerating Agent Self-Evolution with Asynchronous Stage Orchestration
cs.LG · 2026-05-08 · unverdicted · none · ref 36 · internal anchor
FlashEvolve accelerates LLM agent self-evolution via asynchronous stage orchestration and inspectable language-space staleness handling, reporting 3.5-4.9x proposal throughput gains over synchronous baselines on GEPA workloads.

AsymTalker: Identity-Consistent Long-Term Talking Head Generation via Asymmetric Distillation
cs.LG · 2026-05-01 · unverdicted · none · ref 21 · 2 links · internal anchor
AsymTalker uses temporal reference encoding and asymmetric knowledge distillation to produce identity-consistent talking head videos up to 600 seconds long at 66 FPS.

LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning
cs.RO · 2026-04-30 · unverdicted · none · ref 88 · 2 links · internal anchor
LaST-R1 introduces a RL post-training method called LAPO that optimizes latent Chain-of-Thought reasoning in vision-language-action models, yielding 99.9% success on LIBERO and up to 22.5% real-world gains.

JigsawRL: Assembling RL Pipelines for Efficient LLM Post-Training
cs.LG · 2026-04-26 · unverdicted · none · ref 65 · internal anchor
JigsawRL achieves up to 1.85x higher throughput in LLM RL pipelines via pipeline multiplexing, sub-stage graphs, and look-ahead scheduling compared to prior systems.

MARS²: Scaling Multi-Agent Tree Search via Reinforcement Learning for Code Generation
cs.AI · 2026-04-16 · unverdicted · none · ref 33 · internal anchor
MARS² integrates multi-agent collaboration with tree-structured search in RL to boost code generation by increasing exploratory diversity and using path-level group advantages for credit assignment.

Nucleus-Image: Sparse MoE for Image Generation
cs.CV · 2026-04-14 · unverdicted · none · ref 38 · internal anchor
A 17B-parameter sparse MoE diffusion transformer activates 2B parameters per pass and reaches competitive quality on image generation benchmarks without post-training.

OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation
cs.CV · 2026-04-13 · unverdicted · none · ref 75 · internal anchor
OmniShow unifies text, image, audio, and pose conditions into an end-to-end model for high-quality human-object interaction video generation and introduces the HOIVG-Bench benchmark, claiming state-of-the-art results.

Relax: An Asynchronous Reinforcement Learning Engine for Omni-Modal Post-Training at Scale
cs.CL · 2026-04-13 · unverdicted · none · ref 19 · internal anchor
Relax is a new RL training engine with omni-native design and async execution that delivers up to 2x speedups over baselines like veRL while converging to equivalent reward levels on Qwen3 models.

Continuous Adversarial Flow Models
cs.LG · 2026-04-13 · unverdicted · none · ref 82 · internal anchor
Continuous adversarial flow models replace MSE in flow matching with adversarial training via a discriminator, improving guidance-free FID on ImageNet from 8.26 to 3.63 for SiT and similar gains for JiT and text-to-image benchmarks.

DeepStack: Scalable and Accurate Design Space Exploration for Distributed 3D-Stacked AI Accelerators
cs.AR · 2026-04-06 · conditional · none · ref 122 · internal anchor
DeepStack introduces a fast performance model and hierarchical search method for co-optimizing 3D DRAM stacking, interconnects, and distributed scheduling in AI accelerators, delivering up to 9.5x throughput gains over baselines.

veScale-FSDP: Flexible and High-Performance FSDP at Scale
cs.DC · 2026-02-25 · unverdicted · none · ref 35 · internal anchor
veScale-FSDP uses RaggedShard and structure-aware planning to support block-wise quantization and non-element-wise optimizers while delivering 5-66% higher throughput and 16-30% lower memory than prior FSDP systems at massive scale.

BOOST: BOttleneck-Optimized Scalable Training Framework for Low-Rank Large Language Models
cs.LG · 2025-12-13 · unverdicted · none · ref 29 · internal anchor
BOOST delivers 1.46-2.27x end-to-end speedups for low-rank bottleneck LLMs by redesigning tensor parallelism around the bottleneck structure plus supporting optimizations.

DMA-Latte: Expanding the Reach of DMA Offloads to Latency-bound ML Communication
cs.DC · 2025-11-10 · unverdicted · none · ref 35 · internal anchor
DMA offloads on AMD MI300X GPUs are extended to latency-bound ML communication using untapped hardware features, closing up to 4.5x performance gap versus RCCL in collectives and delivering up to 1.5x lower latency and 1.9x higher throughput in LLM inference over vLLM.

Large Scale Diffusion Distillation via Score-Regularized Continuous-Time Consistency
cs.CV · 2025-10-09 · conditional · none · ref 28 · internal anchor
The work introduces rCM, a score-regularized continuous-time consistency model that matches DMD2 quality on large models up to 14B parameters while improving diversity and enabling 1-4 step sampling.

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
cs.CV · 2025-08-25 · unverdicted · none · ref 183 · internal anchor
InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and agentic tasks.

Lizard: An Efficient Linearization Framework for Large Language Models
cs.CL · 2025-07-11 · unverdicted · none · ref 24 · internal anchor
Lizard linearizes Transformer LLMs via subquadratic attention and adaptive learnable modules, recovering near-original performance while outperforming prior linearization methods on MMLU and associative recall.

VLA-RL: Towards Masterful and General Robotic Manipulation with Scalable Reinforcement Learning
cs.RO · 2025-05-24 · conditional · none · ref 90 · internal anchor
VLA-RL applies online RL to pretrained VLAs, yielding a 4.5% gain over strong baselines on 40 LIBERO manipulation tasks and matching commercial models like π₀-FAST.

MAGI-1: Autoregressive Video Generation at Scale
cs.CV · 2025-05-19 · unverdicted · none · ref 49 · internal anchor
MAGI-1 is a 24B-parameter autoregressive video world model that predicts denoised frame chunks sequentially with increasing noise to enable causal, scalable, streaming generation up to 4M token contexts.

Seedream 2.0: A Native Chinese-English Bilingual Image Generation Foundation Model
cs.CV · 2025-03-10 · unverdicted · none · ref 44 · internal anchor
Seedream 2.0 is a native Chinese-English bilingual diffusion model that integrates a self-developed LLM text encoder, Glyph-Aligned ByT5, and Scaled ROPE to reach claimed state-of-the-art results in prompt following, aesthetics, text rendering, and human preference alignment via RLHF.

DOLLAR: Few-Step Video Generation via Distillation and Latent Reward Optimization
cs.CV · 2024-12-20 · unverdicted · none · ref 72 · internal anchor
DOLLAR combines variational score and consistency distillation for few-step video generation plus latent reward optimization, reporting 82.57 VBench score and up to 278x speedup over the teacher diffusion model for 128-frame 10-second videos.

MetaMorph: Multimodal Understanding and Generation via Instruction Tuning
cs.CV · 2024-12-18 · unverdicted · none · ref 196 · internal anchor
VPiT enables pretrained LLMs to perform both visual understanding and generation by predicting discrete text tokens and continuous visual tokens, with understanding data proving more effective than generation-specific data.
