{"paper":{"title":"Sigmoid Loss for Language Image Pre-Training","license":"http://creativecommons.org/licenses/by/4.0/","headline":"A pairwise sigmoid loss for image-text pre-training achieves 84.5% zero-shot ImageNet accuracy using only four TPU chips in two days.","cross_cats":["cs.AI"],"primary_cat":"cs.CV","authors_text":"Alexander Kolesnikov, Basil Mustafa, Lucas Beyer, Xiaohua Zhai","submitted_at":"2023-03-27T15:53:01Z","abstract_excerpt":"We propose a simple pairwise Sigmoid loss for Language-Image Pre-training (SigLIP). Unlike standard contrastive learning with softmax normalization, the sigmoid loss operates solely on image-text pairs and does not require a global view of the pairwise similarities for normalization. The sigmoid loss simultaneously allows further scaling up the batch size, while also performing better at smaller batch sizes. Combined with Locked-image Tuning, with only four TPUv4 chips, we train a SigLiT model that achieves 84.5% ImageNet zero-shot accuracy in two days. 
The disentanglement of the batch size fr"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Combined with Locked-image Tuning, with only four TPUv4 chips, we train a SigLiT model that achieves 84.5% ImageNet zero-shot accuracy in two days.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the sigmoid loss, which forgoes global batch normalization, will continue to produce high-quality representations when scaled to new datasets or model sizes without additional hyper-parameter tuning.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"SigLIP replaces softmax-based contrastive loss with a simple pairwise sigmoid loss for vision-language pre-training, decoupling batch size from normalization and reaching strong zero-shot performance with limited compute.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"A pairwise sigmoid loss for image-text pre-training achieves 84.5% zero-shot ImageNet accuracy using only four TPU chips in two days.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"d80ef83f6df86463442678c0aff55eadcfaffab928244d2d897958b5ce0d08b4"},"source":{"id":"2303.15343","kind":"arxiv","version":4},"verdict":{"id":"7220d1d9-574d-4d8a-a61f-c6040846dd57","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-16T13:00:19.151560Z","strongest_claim":"Combined with Locked-image Tuning, with only four TPUv4 chips, we train a SigLiT model that achieves 84.5% ImageNet zero-shot accuracy in two days.","one_line_summary":"SigLIP replaces softmax-based contrastive loss with a simple pairwise sigmoid loss for vision-language pre-training, decoupling batch size from normalization and reaching strong 
zero-shot performance with limited compute.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the sigmoid loss, which forgoes global batch normalization, will continue to produce high-quality representations when scaled to new datasets or model sizes without additional hyper-parameter tuning.","pith_extraction_headline":"A pairwise sigmoid loss for image-text pre-training achieves 84.5% zero-shot ImageNet accuracy using only four TPU chips in two days."},"references":{"count":60,"sample":[{"doi":"","year":2023,"title":"Getting vit in shape: Scaling laws for compute-optimal model design","work_id":"3d85a12f-9454-4f5c-bca9-b96d474ddde2","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2019,"title":"ObjectNet: A large-scale bias-controlled dataset for pushing the limits of object recognition models","work_id":"2f994755-ea11-439a-a510-79be7aa13443","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2020,"title":"Are we done with imagenet?","work_id":"9efae043-283b-44ae-8324-207d3747f93f","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2022,"title":"Better plain vit baselines for imagenet-1k, 2022","work_id":"8b408975-f8cf-4010-ace0-d6cd6ac702ec","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2022,"title":"Lucas Beyer, Xiaohua Zhai, and Alexander Kolesnikov. Big vision. https://github.com/google-research/big_vision, 2022. 
","work_id":"96277c24-f45f-4e02-b1c2-2e713c7788c7","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":60,"snapshot_sha256":"beed77e7aaad0bc528e634f9078674503ab1cdd1b8cf0caad84708ba4148c8e0","internal_anchors":9},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}