Conceptual 12M: Pushing web-scale image-text pre-training to recognize long-tail visual concepts
2 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
fields: cs.CV (2)
years: 2026 (2)
verdicts: UNVERDICTED (2)
roles: dataset (2)
polarities: use dataset (2)
representative citing papers
MONET is an open 104.9M image-text pair dataset created via safety filtering, deduplication, and multi-VLM recaptioning from 2.9B raw pairs, validated by training a competitive 4B-parameter latent diffusion model.
citing papers explorer
- Beyond the Last Layer: Multi-Layer Representation Fusion for Visual Tokenization
DRoRAE adaptively fuses multi-layer features from vision encoders via energy-constrained routing to enrich visual tokens, cutting rFID from 0.57 to 0.29 and generation FID from 1.74 to 1.65 on ImageNet-256 while revealing a log-linear scaling law with fusion capacity.
- MONET: A Massive, Open, Non-redundant and Enriched Text-to-image dataset
MONET is an open 104.9M image-text pair dataset created via safety filtering, deduplication, and multi-VLM recaptioning from 2.9B raw pairs, validated by training a competitive 4B-parameter latent diffusion model.