{"paper":{"title":"Heuristic Pathologies and Further Variance Reduction via Uncertainty Propagation in the AIVAT Family of Techniques","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Fix the heuristic value function before seeing evaluation data to avoid setting AIVAT sample variance pathologically low or enabling p-hacking via gradient descent on the test statistic.","cross_cats":["cs.GT"],"primary_cat":"cs.AI","authors_text":"Juho Kim, Tuomas Sandholm","submitted_at":"2026-05-14T02:04:26Z","abstract_excerpt":"How should an agent's performance in a multiagent environment be evaluated when there is a limited sample size or a high cost of running a trial? The AIVAT family of variance reduction techniques was proposed to address this challenge by introducing unbiased low-variance estimators of agents' expected payoffs. An important component of AIVAT is a heuristic value function that discriminates between potentially low- and high-value counterfactual histories. A notable gap in the literature is that there is little to no constraint or guideline on how the heuristic value function should be chosen or"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"The heuristic value function should be fixed prior to observing the evaluation data to prevent setting sample variance pathologically low or p-hacking via gradient descent; uncertainty propagation then enables further variance reduction via inverse-variance weighted averaging, yielding a 43.0% reduction in samples needed on 10,000 poker hands.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the heuristic uncertainty can be quantified and propagated in a way that produces meaningful further variance reduction without introducing biases or errors that invalidate the overall estimator, and that the poker dataset and parameterization choices generalize beyond the specific 
experiments.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"AIVAT heuristics can be gamed for pathological low variance or p-hacking unless fixed before data observation, and uncertainty propagation yields additional variance reduction at possible cost to unbiasedness.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Fix the heuristic value function before seeing evaluation data to avoid setting AIVAT sample variance pathologically low or enabling p-hacking via gradient descent on the test statistic.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"0eb2c595a219774f41dbaf587879ac75238a7c8c4b24e2cd014d281c676440aa"},"source":{"id":"2605.14261","kind":"arxiv","version":1},"verdict":{"id":"31915378-a19d-4330-bcb5-42870937a410","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T02:25:12.937142Z","strongest_claim":"The heuristic value function should be fixed prior to observing the evaluation data to prevent setting sample variance pathologically low or p-hacking via gradient descent; uncertainty propagation then enables further variance reduction via inverse-variance weighted averaging, yielding a 43.0% reduction in samples needed on 10,000 poker hands.","one_line_summary":"AIVAT heuristics can be gamed for pathological low variance or p-hacking unless fixed before data observation, and uncertainty propagation yields additional variance reduction at possible cost to unbiasedness.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the heuristic uncertainty can be quantified and propagated in a way that produces meaningful further variance reduction without introducing biases or errors that invalidate the overall estimator, and that the poker dataset and parameterization 
choices generalize beyond the specific experiments.","pith_extraction_headline":"Fix the heuristic value function before seeing evaluation data to avoid setting AIVAT sample variance pathologically low or enabling p-hacking via gradient descent on the test statistic."},"references":{"count":15,"sample":[{"doi":"","year":2013,"title":"N. Bard, J. Hawkin, J. Rubin, and M. Zinkevich. The annual computer poker competition. AI Magazine, 34(2):112–114, 2013","work_id":"5aaa6f67-ee08-45eb-b004-b264fe5e3c40","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2006,"title":"D. Billings and M. Kan. A tool for the direct assessment of poker decisions. ICGA Journal, 29(3):119–142, 2006","work_id":"a228e639-4ad0-487f-be26-1b5fa900237b","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2008,"title":"M. Bowling, M. Johanson, N. Burch, and D. Szafron. Strategy evaluation in extensive games with importance sampling. In Proceedings of the International Conference on Machine Learning (ICML), 2008","work_id":"47b9e524-79ae-4bdf-8263-e69204421d12","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2018,"title":"N. Brown and T. Sandholm. Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science, 359(6374):418–424, 2018","work_id":"ef7b56a3-56e6-44b7-91f5-2599930721da","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2019,"title":"N. Brown and T. Sandholm. 
Superhuman AI for multiplayer poker. Science, 365(6456):885–890, 2019","work_id":"deee810e-0e04-44e2-a5ee-100c7117dc2d","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":15,"snapshot_sha256":"0983a5c845ad995a549d0df9964cc8955e22861b0059fc30ff04cfb60e23a397","internal_anchors":0},"formal_canon":{"evidence_count":1,"snapshot_sha256":"061b76ccae3d6150967142fc882c726f96a2c584a9135cfb9ff84827a2f5100b"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}