{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2026:GCGY5GYL2WNFEH7SUV3CPVEPVC","short_pith_number":"pith:GCGY5GYL","schema_version":"1.0","canonical_sha256":"308d8e9b0bd59a521ff2a57627d48fa89f245c840d23165effe84d74f3930f9e","source":{"kind":"arxiv","id":"2602.07026","version":3},"attestation_state":"computed","paper":{"title":"Modality Gap-Driven Subspace Alignment Training Paradigm For Multimodal Large Language Models","license":"http://creativecommons.org/licenses/by-nc-nd/4.0/","headline":"ReAlign aligns text embeddings to image distributions via a training-free three-step process using unpaired data, letting MLLMs pretrain without paired image-text examples.","cross_cats":["cs.AI","cs.MM"],"primary_cat":"cs.CV","authors_text":"Chengwei Qin, Chen Liu, Chonghan Liu, Hanzhen Zhao, Hao Tang, Hui Xiong, Shuicheng Yan, Wenjie Zhang, Xiaobin Hu, Xiaomin Yu, Xiaoxing Hu, Yi Xin, Yuhui Zhang, Yu Qiao, Ziyue Qiao","submitted_at":"2026-02-02T13:59:39Z","abstract_excerpt":"Despite the success of multimodal contrastive learning in aligning visual and linguistic representations, a persistent geometric anomaly, the Modality Gap, remains: embeddings of distinct modalities expressing identical semantics occupy systematically offset regions. Prior approaches to bridge this gap are largely limited by oversimplified isotropic assumptions, hindering their application in large-scale scenarios. In this paper, we address these limitations by precisely characterizing the geometric shape of the modality gap and leveraging it for efficient model scaling. First, we propose the "},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":false,"formal_links_present":true},"canonical_record":{"source":{"id":"2602.07026","kind":"arxiv","version":3},"metadata":{"license":"http://creativecommons.org/licenses/by-nc-nd/4.0/","primary_cat":"cs.CV","submitted_at":"2026-02-02T13:59:39Z","cross_cats_sorted":["cs.AI","cs.MM"],"title_canon_sha256":"a3a004a856edd2a7c835034529bccf638be46badc1f9285774d0bc71a3b3d631","abstract_canon_sha256":"396849b50901b5c2d18b42f454bf27b4f63b4e4caa7b19452aa641d8a0379828"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-06-08T01:03:58.233186Z","signature_b64":"YNShjXCeCwHdKyQFKlGddnG7wj3J6lyaYnAk1qlCQKzfKmIZ1/7wxnR1DbNdtpIHb75E6QZDvDqAyA+qtpj2Bg==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"308d8e9b0bd59a521ff2a57627d48fa89f245c840d23165effe84d74f3930f9e","last_reissued_at":"2026-06-08T01:03:58.232218Z","signature_status":"signed_v1","first_computed_at":"2026-06-08T01:03:58.232218Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Modality Gap-Driven Subspace Alignment Training Paradigm For Multimodal Large Language Models","license":"http://creativecommons.org/licenses/by-nc-nd/4.0/","headline":"ReAlign aligns text embeddings to image distributions via a training-free three-step process using unpaired data, letting MLLMs pretrain without paired image-text examples.","cross_cats":["cs.AI","cs.MM"],"primary_cat":"cs.CV","authors_text":"Chengwei Qin, Chen Liu, Chonghan Liu, Hanzhen Zhao, Hao Tang, Hui Xiong, Shuicheng Yan, Wenjie Zhang, Xiaobin Hu, Xiaomin Yu, Xiaoxing Hu, Yi Xin, Yuhui Zhang, Yu Qiao, Ziyue Qiao","submitted_at":"2026-02-02T13:59:39Z","abstract_excerpt":"Despite the success of multimodal contrastive learning in aligning visual and linguistic representations, a persistent geometric anomaly, the Modality Gap, remains: embeddings of distinct modalities expressing identical semantics occupy systematically offset regions. Prior approaches to bridge this gap are largely limited by oversimplified isotropic assumptions, hindering their application in large-scale scenarios. In this paper, we address these limitations by precisely characterizing the geometric shape of the modality gap and leveraging it for efficient model scaling. First, we propose the "},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"ReAlign, a training-free three-step procedure (Anchor, Trace, Centroid Alignment) that uses statistics from massive unpaired data, explicitly rectifies geometric misalignment so that unpaired text can substitute for paired image-text data in MLLM pretraining.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The Fixed-frame Modality Gap Theory assumes that the decomposition into stable biases and anisotropic residuals remains valid when the reference frame is frozen and that the statistics computed from unpaired data accurately capture the target image distribution without introducing new distortions.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"ReAlign corrects the modality gap in unpaired data to let MLLMs learn visual distributions from text alone before instruction tuning, reducing dependence on expensive paired corpora.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"ReAlign aligns text embeddings to image distributions via a training-free three-step process using unpaired data, letting MLLMs pretrain without paired image-text examples.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"cad15353c14c8479783c40301ad97efd885cb51fa026216fe79707e44a091861"},"source":{"id":"2602.07026","kind":"arxiv","version":3},"verdict":{"id":"9468bec5-563c-4687-9028-109af0a86830","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-16T08:15:12.284635Z","strongest_claim":"ReAlign, a training-free three-step procedure (Anchor, Trace, Centroid Alignment) that uses statistics from massive unpaired data, explicitly rectifies geometric misalignment so that unpaired text can substitute for paired image-text data in MLLM pretraining.","one_line_summary":"ReAlign corrects the modality gap in unpaired data to let MLLMs learn visual distributions from text alone before instruction tuning, reducing dependence on expensive paired corpora.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The Fixed-frame Modality Gap Theory assumes that the decomposition into stable biases and anisotropic residuals remains valid when the reference frame is frozen and that the statistics computed from unpaired data accurately capture the target image distribution without introducing new distortions.","pith_extraction_headline":"ReAlign aligns text embeddings to image distributions via a training-free three-step process using unpaired data, letting MLLMs pretrain without paired image-text examples."},"integrity":{"clean":true,"summary":{"advisory":0,"critical":0,"by_detector":{},"informational":0},"endpoint":"/pith/2602.07026/integrity.json","findings":[],"available":true,"detectors_run":[],"snapshot_sha256":"c28c3603d3b5d939e8dc4c7e95fa8dfce3d595e45f758748cecf8e644a296938"},"references":{"count":0,"sample":[],"resolved_work":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57","internal_anchors":0},"formal_canon":{"evidence_count":1,"snapshot_sha256":"2972064ae0599586b3ab7db6127b910b08909e25048ddafe049462bbdefebb14"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2602.07026","created_at":"2026-06-08T01:03:58.232326+00:00"},{"alias_kind":"arxiv_version","alias_value":"2602.07026v3","created_at":"2026-06-08T01:03:58.232326+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2602.07026","created_at":"2026-06-08T01:03:58.232326+00:00"},{"alias_kind":"pith_short_12","alias_value":"GCGY5GYL2WNF","created_at":"2026-06-08T01:03:58.232326+00:00"},{"alias_kind":"pith_short_16","alias_value":"GCGY5GYL2WNFEH7S","created_at":"2026-06-08T01:03:58.232326+00:00"},{"alias_kind":"pith_short_8","alias_value":"GCGY5GYL","created_at":"2026-06-08T01:03:58.232326+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":6,"internal_anchor_count":6,"sample":[{"citing_arxiv_id":"2605.16889","citing_title":"Controlling Decision Drift in Multimodal Sentiment Analysis with Missing Modalities","ref_index":31,"is_internal_anchor":true},{"citing_arxiv_id":"2605.08245","citing_title":"When Language Overwrites Vision: Over-Alignment and Geometric Debiasing in Vision-Language Models","ref_index":21,"is_internal_anchor":true},{"citing_arxiv_id":"2605.08245","citing_title":"When Language Overwrites Vision: Over-Alignment and Geometric Debiasing in Vision-Language Models","ref_index":21,"is_internal_anchor":true},{"citing_arxiv_id":"2605.08245","citing_title":"When Language Overwrites Vision: Over-Alignment and Geometric Debiasing in Vision-Language Models","ref_index":21,"is_internal_anchor":true},{"citing_arxiv_id":"2604.20318","citing_title":"UniCVR: From Alignment to Reranking for Unified Zero-Shot Composed Visual Retrieval","ref_index":7,"is_internal_anchor":true},{"citing_arxiv_id":"2605.07825","citing_title":"Anisotropic Modality Align","ref_index":17,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":1,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/GCGY5GYL2WNFEH7SUV3CPVEPVC","json":"https://pith.science/pith/GCGY5GYL2WNFEH7SUV3CPVEPVC.json","graph_json":"https://pith.science/api/pith-number/GCGY5GYL2WNFEH7SUV3CPVEPVC/graph.json","events_json":"https://pith.science/api/pith-number/GCGY5GYL2WNFEH7SUV3CPVEPVC/events.json","paper":"https://pith.science/paper/GCGY5GYL"},"agent_actions":{"view_html":"https://pith.science/pith/GCGY5GYL2WNFEH7SUV3CPVEPVC","download_json":"https://pith.science/pith/GCGY5GYL2WNFEH7SUV3CPVEPVC.json","view_paper":"https://pith.science/paper/GCGY5GYL","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2602.07026&json=true","fetch_graph":"https://pith.science/api/pith-number/GCGY5GYL2WNFEH7SUV3CPVEPVC/graph.json","fetch_events":"https://pith.science/api/pith-number/GCGY5GYL2WNFEH7SUV3CPVEPVC/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/GCGY5GYL2WNFEH7SUV3CPVEPVC/action/timestamp_anchor","attest_storage":"https://pith.science/pith/GCGY5GYL2WNFEH7SUV3CPVEPVC/action/storage_attestation","attest_author":"https://pith.science/pith/GCGY5GYL2WNFEH7SUV3CPVEPVC/action/author_attestation","sign_citation":"https://pith.science/pith/GCGY5GYL2WNFEH7SUV3CPVEPVC/action/citation_signature","submit_replication":"https://pith.science/pith/GCGY5GYL2WNFEH7SUV3CPVEPVC/action/replication_record"}},"created_at":"2026-06-08T01:03:58.232326+00:00","updated_at":"2026-06-08T01:03:58.232326+00:00"}