arxiv: 2605.03245 · 2 revisions
Text-Conditional JEPA for Learning Semantically Rich Visual Representations