{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2023:MUDDTA3EGAO3LWODE3IBSEHJDO","short_pith_number":"pith:MUDDTA3E","schema_version":"1.0","canonical_sha256":"6506398364301db5d9c326d01910e91b8e6f1cf78cfa923d524c2e204befdf59","source":{"kind":"arxiv","id":"2312.17090","version":1},"attestation_state":"computed","paper":{"title":"Q-Align: Teaching LMMs for Visual Scoring via Discrete Text-Defined Levels","license":"http://creativecommons.org/licenses/by-nc-sa/4.0/","headline":"LMMs achieve better visual scoring by predicting discrete text-defined rating levels instead of numerical scores.","cross_cats":["cs.CL","cs.LG"],"primary_cat":"cs.CV","authors_text":"Annan Wang, Chaofeng Chen, Chunyi Li, Erli Zhang, Guangtao Zhai, Haoning Wu, Liang Liao, Qiong Yan, Weisi Lin, Weixia Zhang, Wenxiu Sun, Xiongkuo Min, Yixuan Gao, Zicheng Zhang","submitted_at":"2023-12-28T16:10:25Z","abstract_excerpt":"The explosion of visual content available online underscores the requirement for an accurate machine assessor to robustly evaluate scores across diverse types of visual contents. While recent studies have demonstrated the exceptional potentials of large multi-modality models (LMMs) on a wide range of related fields, in this work, we explore how to teach them for visual rating aligned with human opinions. Observing that human raters only learn and judge discrete text-defined levels in subjective studies, we propose to emulate this subjective process and teach LMMs with text-defined rating level"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2312.17090","kind":"arxiv","version":1},"metadata":{"license":"http://creativecommons.org/licenses/by-nc-sa/4.0/","primary_cat":"cs.CV","submitted_at":"2023-12-28T16:10:25Z","cross_cats_sorted":["cs.CL","cs.LG"],"title_canon_sha256":"c390943399b25c80248b0569a0af41814140229c9ae0d4d47a6ae22dfe712147","abstract_canon_sha256":"2d275bc65e5e2f8fe7a4b0e5f050b1ca381a4fb351da30dfaac05792e27d2383"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:50.870465Z","signature_b64":"d/rxXih6L5zLNlFt2MSHFqiMK0h8AGz3j/9JK5d9AUFH2Tp4dGvMCdD2V4Xj873CvhvZrh0Iugzocu+ke9+KBw==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"6506398364301db5d9c326d01910e91b8e6f1cf78cfa923d524c2e204befdf59","last_reissued_at":"2026-05-17T23:38:50.870033Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:50.870033Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Q-Align: Teaching LMMs for Visual Scoring via Discrete Text-Defined Levels","license":"http://creativecommons.org/licenses/by-nc-sa/4.0/","headline":"LMMs achieve better visual scoring by predicting discrete text-defined rating levels instead of numerical scores.","cross_cats":["cs.CL","cs.LG"],"primary_cat":"cs.CV","authors_text":"Annan Wang, Chaofeng Chen, Chunyi Li, Erli Zhang, Guangtao Zhai, Haoning Wu, Liang Liao, Qiong Yan, Weisi Lin, Weixia Zhang, Wenxiu Sun, Xiongkuo Min, Yixuan Gao, Zicheng Zhang","submitted_at":"2023-12-28T16:10:25Z","abstract_excerpt":"The explosion of visual content available online underscores the requirement for an accurate machine assessor to robustly evaluate scores across diverse types of visual contents. While recent studies have demonstrated the exceptional potentials of large multi-modality models (LMMs) on a wide range of related fields, in this work, we explore how to teach them for visual rating aligned with human opinions. Observing that human raters only learn and judge discrete text-defined levels in subjective studies, we propose to emulate this subjective process and teach LMMs with text-defined rating level"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"The proposed Q-Align achieves state-of-the-art performance on image quality assessment (IQA), image aesthetic assessment (IAA), as well as video quality assessment (VQA) tasks under the original LMM structure. With the syllabus, we further unify the three tasks into one model, termed the OneAlign.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That training LMMs with discrete text-defined levels emulates human subjective judgment processes more effectively than direct numerical score regression, leading to better performance without architectural changes or extra data.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Q-Align trains LMMs on discrete text-defined levels for visual scoring, achieving SOTA on IQA, IAA, and VQA while unifying the tasks in OneAlign.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"LMMs achieve better visual scoring by predicting discrete text-defined rating levels instead of numerical scores.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"1a1861089861b0b862db8716832fe34666ffb4075a526d038849951b686c6111"},"source":{"id":"2312.17090","kind":"arxiv","version":1},"verdict":{"id":"ef03bab8-e2e6-48e8-be61-5f777a6671b1","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T16:31:49.640259Z","strongest_claim":"The proposed Q-Align achieves state-of-the-art performance on image quality assessment (IQA), image aesthetic assessment (IAA), as well as video quality assessment (VQA) tasks under the original LMM structure. With the syllabus, we further unify the three tasks into one model, termed the OneAlign.","one_line_summary":"Q-Align trains LMMs on discrete text-defined levels for visual scoring, achieving SOTA on IQA, IAA, and VQA while unifying the tasks in OneAlign.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That training LMMs with discrete text-defined levels emulates human subjective judgment processes more effectively than direct numerical score regression, leading to better performance without architectural changes or extra data.","pith_extraction_headline":"LMMs achieve better visual scoring by predicting discrete text-defined rating levels instead of numerical scores."},"references":{"count":293,"sample":[{"doi":"","year":null,"title":"FirstName LastName , title =","work_id":"d9cab501-317f-4237-9e32-b5ead5964402","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"FirstName Alpher , title =","work_id":"42297990-8783-41a1-b0fa-8ccdbf630852","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Journal of Foo , volume = 13, number = 1, pages =","work_id":"65a8b3d0-af84-4f68-87eb-101c85ab18b2","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Journal of Foo , volume = 14, number = 1, pages =","work_id":"b3089947-bd36-4a24-9199-cc535e299537","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"FirstName Alpher and FirstName Gamow , title =","work_id":"caed320b-7cdc-41ca-bb08-00fb14feec62","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":293,"snapshot_sha256":"681418912eaa42e4611d22f6581da11de675b162729c34822c93850dc341d530","internal_anchors":9},"formal_canon":{"evidence_count":3,"snapshot_sha256":"7b9b449cf34fb4622e264d8a1a0b046365f58616194452f8a731628681551281"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2312.17090","created_at":"2026-05-17T23:38:50.870096+00:00"},{"alias_kind":"arxiv_version","alias_value":"2312.17090v1","created_at":"2026-05-17T23:38:50.870096+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2312.17090","created_at":"2026-05-17T23:38:50.870096+00:00"},{"alias_kind":"pith_short_12","alias_value":"MUDDTA3EGAO3","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"MUDDTA3EGAO3LWOD","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"MUDDTA3E","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":25,"internal_anchor_count":25,"sample":[{"citing_arxiv_id":"2509.22414","citing_title":"LucidFlux: Caption-Free Photo-Realistic Image Restoration via a Large-Scale Diffusion Transformer","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2511.00503","citing_title":"Diff4Splat: Controllable 4D Scene Generation with Latent Dynamic Reconstruction Models","ref_index":87,"is_internal_anchor":true},{"citing_arxiv_id":"2512.04677","citing_title":"Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length","ref_index":44,"is_internal_anchor":true},{"citing_arxiv_id":"2512.07584","citing_title":"LongCat-Image Technical Report","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2603.02210","citing_title":"HiFi-Inpaint: Towards High-Fidelity Reference-Based Inpainting for Generating Detail-Preserving Human-Product Images","ref_index":50,"is_internal_anchor":true},{"citing_arxiv_id":"2603.05947","citing_title":"LucidNFT: LR-Anchored Multi-Reward Preference Optimization for Flow-Based Real-World Super-Resolution","ref_index":38,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12957","citing_title":"GTA: Advancing Image-to-3D World Generation via Geometry Then Appearance Video Diffusion","ref_index":83,"is_internal_anchor":true},{"citing_arxiv_id":"2604.02409","citing_title":"LumiVideo: An Intelligent Agentic System for Video Color Grading","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2605.11541","citing_title":"GeoR-Bench: Evaluating Geoscience Visual Reasoning","ref_index":29,"is_internal_anchor":true},{"citing_arxiv_id":"2605.10576","citing_title":"SenseBench: A Benchmark for Remote Sensing Low-Level Visual Perception and Description in Large Vision-Language Models","ref_index":34,"is_internal_anchor":true},{"citing_arxiv_id":"2605.06969","citing_title":"Bringing Multimodal Large Language Models to Infrared-Visible Image Fusion Quality Assessment","ref_index":25,"is_internal_anchor":true},{"citing_arxiv_id":"2604.24123","citing_title":"FDIM: A Feature-distance-based Generic Video Quality Metric for Versatile Codecs","ref_index":38,"is_internal_anchor":true},{"citing_arxiv_id":"2605.01272","citing_title":"GameScope: A Multi-Attribute, Multi-Codec Benchmark Dataset for Gaming Video Quality Assessment","ref_index":24,"is_internal_anchor":true},{"citing_arxiv_id":"2605.00719","citing_title":"Unpaired Image Deraining Using Reward-Guided Self-Reinforcement Strategy","ref_index":73,"is_internal_anchor":true},{"citing_arxiv_id":"2604.12175","citing_title":"Redefining Quality Criteria and Distance-Aware Score Modeling for Image Editing Assessment","ref_index":27,"is_internal_anchor":true},{"citing_arxiv_id":"2604.10578","citing_title":"Rein3D: Reinforced 3D Indoor Scene Generation with Panoramic Video Diffusion Models","ref_index":50,"is_internal_anchor":true},{"citing_arxiv_id":"2604.08172","citing_title":"On the Global Photometric Alignment for Low-Level Vision","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2604.07427","citing_title":"Personalizing Text-to-Image Generation to Individual Taste","ref_index":55,"is_internal_anchor":true},{"citing_arxiv_id":"2605.07477","citing_title":"ReasonEdit: Towards Interpretable Image Editing Evaluation via Reinforcement Learning","ref_index":68,"is_internal_anchor":true},{"citing_arxiv_id":"2605.07457","citing_title":"EditRefiner: A Human-Aligned Agentic Framework for Image Editing Refinement","ref_index":54,"is_internal_anchor":true},{"citing_arxiv_id":"2605.06969","citing_title":"Bringing Multimodal Large Language Models to Infrared-Visible Image Fusion Quality Assessment","ref_index":25,"is_internal_anchor":true},{"citing_arxiv_id":"2604.14268","citing_title":"HY-World 2.0: A Multi-Modal World Model for Reconstructing, Generating, and Simulating 3D Worlds","ref_index":73,"is_internal_anchor":true},{"citing_arxiv_id":"2604.16858","citing_title":"Q-DeepSight: Incentivizing Thinking with Images for Image Quality Assessment and Refinement","ref_index":47,"is_internal_anchor":true},{"citing_arxiv_id":"2604.21400","citing_title":"You Only Gaussian Once: Controllable 3D Gaussian Splatting for Ultra-Densely Sampled Scenes","ref_index":31,"is_internal_anchor":true},{"citing_arxiv_id":"2605.01799","citing_title":"Embody4D: A Generalist 4D World Model for Embodied AI","ref_index":54,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":3,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/MUDDTA3EGAO3LWODE3IBSEHJDO","json":"https://pith.science/pith/MUDDTA3EGAO3LWODE3IBSEHJDO.json","graph_json":"https://pith.science/api/pith-number/MUDDTA3EGAO3LWODE3IBSEHJDO/graph.json","events_json":"https://pith.science/api/pith-number/MUDDTA3EGAO3LWODE3IBSEHJDO/events.json","paper":"https://pith.science/paper/MUDDTA3E"},"agent_actions":{"view_html":"https://pith.science/pith/MUDDTA3EGAO3LWODE3IBSEHJDO","download_json":"https://pith.science/pith/MUDDTA3EGAO3LWODE3IBSEHJDO.json","view_paper":"https://pith.science/paper/MUDDTA3E","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2312.17090&json=true","fetch_graph":"https://pith.science/api/pith-number/MUDDTA3EGAO3LWODE3IBSEHJDO/graph.json","fetch_events":"https://pith.science/api/pith-number/MUDDTA3EGAO3LWODE3IBSEHJDO/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/MUDDTA3EGAO3LWODE3IBSEHJDO/action/timestamp_anchor","attest_storage":"https://pith.science/pith/MUDDTA3EGAO3LWODE3IBSEHJDO/action/storage_attestation","attest_author":"https://pith.science/pith/MUDDTA3EGAO3LWODE3IBSEHJDO/action/author_attestation","sign_citation":"https://pith.science/pith/MUDDTA3EGAO3LWODE3IBSEHJDO/action/citation_signature","submit_replication":"https://pith.science/pith/MUDDTA3EGAO3LWODE3IBSEHJDO/action/replication_record"}},"created_at":"2026-05-17T23:38:50.870096+00:00","updated_at":"2026-05-17T23:38:50.870096+00:00"}