arXiv preprint arXiv:2405.15613 (2024)

Huy V V o, Vasil Khalidov, Timoth ´ee Darcet, Th ´eo Moutakanni, Nikita Smetanin, Marc Szafraniec, Hugo Touvron, Camille Couprie, Maxime Oquab, Armand Joulin, et al · 2024 · arXiv 2405.15613

5 Pith papers cite this work. Polarity classification is still indexing.

5 Pith papers citing it

read on arXiv browse 5 citing papers

citation-role summary

background 1 baseline 1

citation-polarity summary

background 1 baseline 1

representative citing papers

OmniShotCut: Holistic Relational Shot Boundary Detection with Shot-Query Transformer

cs.CV · 2026-04-27 · unverdicted · novelty 7.0

OmniShotCut treats shot boundary detection as structured relational prediction via a shot-query Transformer, uses fully synthetic transitions for training data, and releases OmniShotCutBench for evaluation.

SigLino: Efficient Multi-Teacher Distillation for Agglomerative Vision Foundation Models

cs.CV · 2025-12-23 · conditional · novelty 6.0

SigLino distills SigLIP2 and DINOv3 into efficient vision models via asymmetric relation-knowledge distillation, token-balanced batching, and hierarchical data sampling on a new 200M-image corpus, yielding better transfer to grounding VLMs than training from scratch.

LeJEPA: Provable and Scalable Self-Supervised Learning Without the Heuristics

cs.LG · 2025-11-11 · conditional · novelty 6.0

LeJEPA derives an optimal isotropic Gaussian target for embeddings and enforces it via sketched regularization to deliver scalable, heuristics-free self-supervised pretraining with 79% ImageNet linear accuracy on ViT-H/14.

MegaStyle: Constructing Diverse and Scalable Style Dataset via Consistent Text-to-Image Style Mapping

cs.CV · 2026-04-09 · unverdicted · novelty 5.0

A scalable pipeline generates an intra-consistent, inter-diverse 1.4M style image dataset from text-to-image models and uses it to train a style encoder and generalizable style transfer model.

Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer

cs.CV · 2025-11-27 · unverdicted · novelty 5.0

Z-Image is an efficient 6B-parameter foundation model for image generation that rivals larger commercial systems in photorealism and bilingual text rendering through a new single-stream diffusion transformer and streamlined training.

citing papers explorer

Showing 5 of 5 citing papers.

OmniShotCut: Holistic Relational Shot Boundary Detection with Shot-Query Transformer cs.CV · 2026-04-27 · unverdicted · none · ref 34
OmniShotCut treats shot boundary detection as structured relational prediction via a shot-query Transformer, uses fully synthetic transitions for training data, and releases OmniShotCutBench for evaluation.
SigLino: Efficient Multi-Teacher Distillation for Agglomerative Vision Foundation Models cs.CV · 2025-12-23 · conditional · none · ref 32
SigLino distills SigLIP2 and DINOv3 into efficient vision models via asymmetric relation-knowledge distillation, token-balanced batching, and hierarchical data sampling on a new 200M-image corpus, yielding better transfer to grounding VLMs than training from scratch.
LeJEPA: Provable and Scalable Self-Supervised Learning Without the Heuristics cs.LG · 2025-11-11 · conditional · none · ref 56
LeJEPA derives an optimal isotropic Gaussian target for embeddings and enforces it via sketched regularization to deliver scalable, heuristics-free self-supervised pretraining with 79% ImageNet linear accuracy on ViT-H/14.
MegaStyle: Constructing Diverse and Scalable Style Dataset via Consistent Text-to-Image Style Mapping cs.CV · 2026-04-09 · unverdicted · none · ref 47
A scalable pipeline generates an intra-consistent, inter-diverse 1.4M style image dataset from text-to-image models and uses it to train a style encoder and generalizable style transfer model.
Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer cs.CV · 2025-11-27 · unverdicted · none · ref 71
Z-Image is an efficient 6B-parameter foundation model for image generation that rivals larger commercial systems in photorealism and bilingual text rendering through a new single-stream diffusion transformer and streamlined training.

arXiv preprint arXiv:2405.15613 (2024)

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer