{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2016:TJ32TYBWQHZYVVBAQTORLM22VX","short_pith_number":"pith:TJ32TYBW","schema_version":"1.0","canonical_sha256":"9a77a9e03681f38ad42084dd15b35aadc68dfc7c488e353b96b3ebfb5424ef41","source":{"kind":"arxiv","id":"1612.00837","version":3},"attestation_state":"computed","paper":{"title":"Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"","cross_cats":["cs.AI","cs.CL","cs.LG"],"primary_cat":"cs.CV","authors_text":"Devi Parikh, Dhruv Batra, Douglas Summers-Stay, Tejas Khot, Yash Goyal","submitted_at":"2016-12-02T20:57:07Z","abstract_excerpt":"Problems at the intersection of vision and language are of significant importance both as challenging research questions and for the rich set of applications they enable. However, inherent structure in our world and bias in our language tend to be a simpler signal for learning than visual modalities, resulting in models that ignore visual information, leading to an inflated sense of their capability.\n  We propose to counter these language priors for the task of Visual Question Answering (VQA) and make vision (the V in VQA) matter! Specifically, we balance the popular VQA dataset by collecting "},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":false,"formal_links_present":false},"canonical_record":{"source":{"id":"1612.00837","kind":"arxiv","version":3},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.CV","submitted_at":"2016-12-02T20:57:07Z","cross_cats_sorted":["cs.AI","cs.CL","cs.LG"],"title_canon_sha256":"14595f4ed3c0cd39ffaafc83305a207ae39cb353b949359abf408b313d7a2cb3","abstract_canon_sha256":"f670ee70b32267205c9baf3d34c9227c01b4ef0dc3ab7c807e683b230cb88a9d"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-18T00:44:33.590684Z","signature_b64":"9v4F5ytLvs8If2N4qa/OT98yU6QawjtdoHdYZTHsEHUnShI4Uo0io+1hlME4oaK5O8F9tsTT1k3HW2l3F+dfAg==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"9a77a9e03681f38ad42084dd15b35aadc68dfc7c488e353b96b3ebfb5424ef41","last_reissued_at":"2026-05-18T00:44:33.590233Z","signature_status":"signed_v1","first_computed_at":"2026-05-18T00:44:33.590233Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"","cross_cats":["cs.AI","cs.CL","cs.LG"],"primary_cat":"cs.CV","authors_text":"Devi Parikh, Dhruv Batra, Douglas Summers-Stay, Tejas Khot, Yash Goyal","submitted_at":"2016-12-02T20:57:07Z","abstract_excerpt":"Problems at the intersection of vision and language are of significant importance both as challenging research questions and for the rich set of applications they enable. However, inherent structure in our world and bias in our language tend to be a simpler signal for learning than visual modalities, resulting in models that ignore visual information, leading to an inflated sense of their capability.\n  We propose to counter these language priors for the task of Visual Question Answering (VQA) and make vision (the V in VQA) matter! Specifically, we balance the popular VQA dataset by collecting "},"claims":{"count":0,"items":[],"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"source":{"id":"1612.00837","kind":"arxiv","version":3},"verdict":{"id":null,"model_set":{},"created_at":null,"strongest_claim":"","one_line_summary":"","pipeline_version":null,"weakest_assumption":"","pith_extraction_headline":""},"references":{"count":0,"sample":[],"resolved_work":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57","internal_anchors":0},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"1612.00837","created_at":"2026-05-18T00:44:33.590296+00:00"},{"alias_kind":"arxiv_version","alias_value":"1612.00837v3","created_at":"2026-05-18T00:44:33.590296+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.1612.00837","created_at":"2026-05-18T00:44:33.590296+00:00"},{"alias_kind":"pith_short_12","alias_value":"TJ32TYBWQHZY","created_at":"2026-05-18T12:30:44.179134+00:00"},{"alias_kind":"pith_short_16","alias_value":"TJ32TYBWQHZYVVBA","created_at":"2026-05-18T12:30:44.179134+00:00"},{"alias_kind":"pith_short_8","alias_value":"TJ32TYBW","created_at":"2026-05-18T12:30:44.179134+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":4,"internal_anchor_count":1,"sample":[{"citing_arxiv_id":"2406.11354","citing_title":"Preserving Knowledge in Large Language Model with Model-Agnostic Self-Decompression","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2604.16930","citing_title":"CoGR-MoE: Concept-Guided Expert Routing with Consistent Selection and Flexible Reasoning for Visual Question Answering","ref_index":8,"is_internal_anchor":false},{"citing_arxiv_id":"2604.18803","citing_title":"LLM-as-Judge Framework for Evaluating Tone-Induced Hallucination in Vision-Language Models","ref_index":11,"is_internal_anchor":false},{"citing_arxiv_id":"2604.22851","citing_title":"EgoDyn-Bench: Evaluating Ego-Motion Understanding in Vision-Centric Foundation Models for Autonomous Driving","ref_index":9,"is_internal_anchor":false}]},"formal_canon":{"evidence_count":0,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/TJ32TYBWQHZYVVBAQTORLM22VX","json":"https://pith.science/pith/TJ32TYBWQHZYVVBAQTORLM22VX.json","graph_json":"https://pith.science/api/pith-number/TJ32TYBWQHZYVVBAQTORLM22VX/graph.json","events_json":"https://pith.science/api/pith-number/TJ32TYBWQHZYVVBAQTORLM22VX/events.json","paper":"https://pith.science/paper/TJ32TYBW"},"agent_actions":{"view_html":"https://pith.science/pith/TJ32TYBWQHZYVVBAQTORLM22VX","download_json":"https://pith.science/pith/TJ32TYBWQHZYVVBAQTORLM22VX.json","view_paper":"https://pith.science/paper/TJ32TYBW","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=1612.00837&json=true","fetch_graph":"https://pith.science/api/pith-number/TJ32TYBWQHZYVVBAQTORLM22VX/graph.json","fetch_events":"https://pith.science/api/pith-number/TJ32TYBWQHZYVVBAQTORLM22VX/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/TJ32TYBWQHZYVVBAQTORLM22VX/action/timestamp_anchor","attest_storage":"https://pith.science/pith/TJ32TYBWQHZYVVBAQTORLM22VX/action/storage_attestation","attest_author":"https://pith.science/pith/TJ32TYBWQHZYVVBAQTORLM22VX/action/author_attestation","sign_citation":"https://pith.science/pith/TJ32TYBWQHZYVVBAQTORLM22VX/action/citation_signature","submit_replication":"https://pith.science/pith/TJ32TYBWQHZYVVBAQTORLM22VX/action/replication_record"}},"created_at":"2026-05-18T00:44:33.590296+00:00","updated_at":"2026-05-18T00:44:33.590296+00:00"}