Big Transfer (BiT): General Visual Representation Learning
5 Pith papers cite this work.
citing papers explorer
- Silhouette Loss: Differentiable Global Structure Learning for Deep Representations
  Soft Silhouette Loss offers a batch-global differentiable metric to promote intra-class compactness and inter-class separation in learned representations, boosting performance when hybridized with cross-entropy and contrastive losses.
- Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
  Swin Transformer reaches 87.3% ImageNet accuracy and sets new records on COCO detection and ADE20K segmentation by replacing global self-attention with shifted-window local attention inside a hierarchical pyramid.
- DexHoldem: Playing Texas Hold'em with Dexterous Embodied System
  DexHoldem is a new benchmark providing 1,470 teleoperated demonstrations across 14 manipulation primitives, plus standardized tests for dexterous policy execution and agentic perception in a physical Texas Hold'em setting.
- Florence: A New Foundation Model for Computer Vision
  Florence is a new vision foundation model that learns universal visual-language representations from web-scale data and reports state-of-the-art results on 44 benchmarks including 83.74% zero-shot ImageNet top-1 accuracy.
- A Large-Scale Study on the Accuracy vs Cost Trade-offs of Training and Evaluation Settings in Fine-Grained Image Recognition
  Large-scale experiments demonstrate that data-aware augmentations applied only during training allow fine-grained image models to reach high accuracy without using discriminative crops at inference, lowering costs.