7 Pith papers cite this work.
Citing papers explorer
-
OcclusionFormer: Arranging Z-Order for Layout-Grounded Image Generation
OcclusionFormer adds explicit Z-order modeling via a new SA-Z dataset and volume-rendering compositing in a diffusion transformer to resolve occlusion ambiguities in layout-grounded image synthesis.
-
Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation
A new shared video-image tokenizer enables large language models to surpass diffusion models on standard visual generation benchmarks.
-
Prototype-Based Test-Time Adaptation of Vision-Language Models
PTA adapts VLMs at test time by maintaining and updating class-specific knowledge prototypes from test samples, achieving higher accuracy than cache-based methods with a much smaller loss of inference speed.
-
MARCO: Navigating the Unseen Space of Semantic Correspondence
MARCO achieves state-of-the-art semantic correspondence on SPair-71k, AP-10K, and PF-PASCAL by combining coarse-to-fine refinement with self-distillation on DINOv2. Its gains are largest at fine thresholds and on unseen keypoints and categories, and it uses 3x fewer parameters and runs 10x faster.
-
Vision Transformers Need Registers
Adding register tokens to Vision Transformers eliminates high-norm background artifacts and raises state-of-the-art performance on dense visual prediction tasks.
-
CARE: Class-Adaptive Expert Consensus for Reliable Learning with Long-Tailed Noisy Labels
CARE is a parameter-efficient framework that aggregates predictions from noisy labels, VLM text embeddings, and visual features with class-frequency-based agreement thresholds to rectify labels in long-tailed noisy datasets.
-
Layer by Layer: Uncovering Hidden Representations in Language Models
Intermediate layers in LLMs consistently provide stronger features than final layers across tasks and architectures, as quantified by a new framework of information-theoretic, geometric, and invariance metrics.