ECPO: Evidence-Coupled Policy Optimization for Evidence-Certified Candidate Ranking
Pith reviewed 2026-05-22 06:27 UTC · model grok-4.3
The pith
Coupling ranking policy optimization with evidence certificate validity enables verifiable decision support in candidate ranking.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By treating the action as the joint object of ranking and evidence certificate and optimizing under coupled rewards that include a label-free verifier for evidence-cycle reconstruction, ECPO produces Top-K lists whose cited spans allow independent recovery of the decision, shifting the objective from standard ranking metrics to CertNDCG and decision-evidence coupling.
What carries the argument
Evidence-Coupled Policy Optimization (ECPO), a listwise policy-optimization objective that couples ranking utility, span-level certificate validity, and an evidence-cycle reward computed by a deterministic verifier on claim-stripped spans.
If this is right
- Rankings produced under ECPO come with evidence certificates sufficient to reconstruct the decision from cited spans alone.
- The evaluation framework compares ECPO to zero-shot, SFT, GRPO, RM-only scoring, and post-hoc rationalization across roster settings.
- Optimization under the coupled rewards yields higher CertNDCG than maximizing ordinary NDCG alone.
- The approach uses skeleton-aligned trajectory supervision and hard negatives with fixed upstream extraction.
Where Pith is reading between the lines
- Similar coupling of utility and verifiability could extend to other high-stakes ranking domains where audit trails matter.
- If the verifier generalizes, it suggests evidence attachment need not rely on post-hoc rationalization or full context access.
- Testing whether CertNDCG gains persist when the verifier is replaced by a learned one would probe the necessity of the deterministic component.
Load-bearing premise
The label-free deterministic verifier can reliably reconstruct candidate support from claim-stripped cited spans without access to the original labels or full context.
What would settle it
If, on a held-out test set, the verifier fails to recover the correct candidate support from ECPO-generated certificates significantly more often than it does from baseline certificates, the central claim would be falsified.
Original abstract
Ranking systems used in decision-support settings should not only order candidates but also expose evidence that can be independently checked. We study evidence-certified candidate ranking: given an intent_id, a predefined plan skeleton, a window-local candidate roster, and text-derived candidate trajectories with span provenance, a system must output a Top-K list together with doc_id:span evidence certificates whose cited spans are sufficient to recover the decision. We instantiate this task on MAVEN-ERE and RAMS with fixed upstream extraction, window-local randomized candidate identifiers, skeleton-aligned trajectory supervision, hard negatives, and audit references. We introduce Evidence-Coupled Policy Optimization (ECPO), a listwise policy-optimization objective whose action is the joint object of ranking and evidence certificate. ECPO first learns an interpretable trajectory reward from skeleton alignment, argument consistency, and optional graph features; it then optimizes a constrained policy with three coupled rewards: listwise ranking utility, span-level certificate validity, and an evidence-cycle reward computed by a label-free deterministic verifier that reconstructs candidate support from claim-stripped cited spans. This reframes the goal from maximizing ordinary NDCG alone to maximizing CertNDCG and decision-evidence coupling. The evaluation compares ECPO against zero-shot, SFT, and GRPO policies, RM-only scoring with deterministic evidence attachment, grammar/JSON-constrained decoding, validator retry, best-of-N RM selection, and post-hoc evidence rationalization under closed-roster, predicted-roster, and hybrid-roster settings.
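The three coupled rewards named in the abstract can be pictured with a minimal sketch. The abstract does not say how the terms are combined, so the weighted sum, the `CoupledRewardWeights` name, and the default weights below are all assumptions made for illustration, not the paper's objective.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoupledRewardWeights:
    # Hypothetical weights: the abstract does not state how the terms are mixed.
    ranking: float = 1.0
    certificate: float = 1.0
    evidence_cycle: float = 1.0


def coupled_reward(ranking_utility, certificate_validity, evidence_cycle,
                   weights=CoupledRewardWeights()):
    """Combine the three reward terms (each assumed to lie in [0, 1]) as a weighted sum."""
    terms = {
        "ranking_utility": ranking_utility,
        "certificate_validity": certificate_validity,
        "evidence_cycle": evidence_cycle,
    }
    for name, value in terms.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return (weights.ranking * ranking_utility
            + weights.certificate * certificate_validity
            + weights.evidence_cycle * evidence_cycle)
```

Under this reading, a list that ranks well but cites spans the verifier cannot use earns no evidence-cycle credit, which is the coupling the abstract describes.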
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper defines evidence-certified candidate ranking on MAVEN-ERE and RAMS, where a system must return a Top-K list together with doc_id:span certificates whose cited spans suffice to recover the decision. It introduces ECPO, a listwise policy-optimization objective whose action is the joint ranking-plus-certificate object; the method first learns a trajectory reward from skeleton alignment and argument consistency, then optimizes a constrained policy under three coupled rewards (listwise ranking utility, span-level certificate validity, and an evidence-cycle reward produced by a label-free deterministic verifier that reconstructs candidate support from claim-stripped spans). The evaluation compares ECPO to zero-shot, SFT, GRPO, RM-only, grammar-constrained, best-of-N, and post-hoc rationalization baselines under closed-, predicted-, and hybrid-roster conditions, with the central claim being that the approach improves CertNDCG and decision-evidence coupling over ordinary NDCG maximization.
Significance. If the reported CertNDCG gains are robust and the verifier reconstruction proves reliable, the work would supply a concrete mechanism for coupling ranking utility with independently verifiable evidence certificates, a direction of clear practical value for decision-support systems that must support audit and downstream verification.
major comments (2)
- [ECPO objective and evidence-cycle reward] The evidence-cycle reward (described in the ECPO objective) relies on a label-free deterministic verifier that reconstructs candidate support solely from claim-stripped cited spans. No explicit mechanism is given for disambiguating incomplete spans or trajectories shared by multiple candidates; if reconstruction accuracy is low, the reward becomes noisy and the joint optimization no longer enforces verifiable certificates, directly undermining the central CertNDCG claim.
- [Evaluation and results] The evaluation section reports comparisons under closed-roster, predicted-roster, and hybrid-roster settings but supplies no quantitative CertNDCG deltas, confidence intervals, or ablation isolating the contribution of the evidence-cycle reward versus the other two coupled terms; without these numbers it is impossible to assess whether the reframing to CertNDCG actually produces measurable gains.
minor comments (2)
- [Method] The abstract states that the verifier is 'deterministic' yet the full description of its reconstruction rules is deferred; a short pseudocode or formal definition in the method section would improve reproducibility.
- [Task definition] Notation for CertNDCG is introduced without an explicit equation; adding a definition parallel to standard NDCG would clarify how the certificate validity term modifies the metric.
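To make the last comment concrete, here is a sketch of what a definition parallel to standard NDCG could look like. The `dcg` and `ndcg` functions follow the usual log2-discounted form; `cert_ndcg` is a hypothetical certificate-gated variant written only for illustration, since the paper's actual CertNDCG definition is not given in the material reviewed here. As a simplification, the ideal ordering is computed over the returned list rather than over the full candidate roster.

```python
import math


def dcg(gains):
    """Discounted cumulative gain with the standard log2(rank + 1) discount."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg(gains):
    """Standard NDCG over the returned list (ideal = gains sorted descending)."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


def cert_ndcg(gains, cert_valid):
    """Hypothetical certificate-gated NDCG (an assumption, not the paper's formula).

    A position contributes its gain only when its certificate is valid;
    the normalizer is the ungated ideal DCG, so invalid certificates can
    only lower the score.
    """
    if len(gains) != len(cert_valid):
        raise ValueError("gains and cert_valid must have the same length")
    gated = [g if ok else 0.0 for g, ok in zip(gains, cert_valid)]
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gated) / ideal if ideal > 0 else 0.0
```

With all certificates valid, this variant reduces to ordinary NDCG, which is the kind of parallel the referee asks the authors to state explicitly.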
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our paper. We address each of the major comments below and describe the revisions planned for the manuscript.
-
Referee: [ECPO objective and evidence-cycle reward] The evidence-cycle reward (described in the ECPO objective) relies on a label-free deterministic verifier that reconstructs candidate support solely from claim-stripped cited spans. No explicit mechanism is given for disambiguating incomplete spans or trajectories shared by multiple candidates; if reconstruction accuracy is low, the reward becomes noisy and the joint optimization no longer enforces verifiable certificates, directly undermining the central CertNDCG claim.
Authors: We thank the referee for this observation. The evidence-cycle reward is computed by a deterministic verifier that takes claim-stripped cited spans and reconstructs the supporting evidence for each candidate using the available trajectory information and span provenance from the datasets. While the current description focuses on the overall objective, we recognize that details on handling incomplete or shared spans are not fully elaborated. To address this, we will expand the method section with a step-by-step description of the verifier, including how it resolves ambiguities through consistency checks with the plan skeleton. Additionally, we will include empirical results on the verifier's reconstruction accuracy to show that the reward remains reliable and supports the CertNDCG improvements. revision: yes
-
Referee: [Evaluation and results] The evaluation section reports comparisons under closed-roster, predicted-roster, and hybrid-roster settings but supplies no quantitative CertNDCG deltas, confidence intervals, or ablation isolating the contribution of the evidence-cycle reward versus the other two coupled terms; without these numbers it is impossible to assess whether the reframing to CertNDCG actually produces measurable gains.
Authors: We agree with the referee that quantitative details are necessary to substantiate the claims. The manuscript presents comparative results but omits specific numerical values for CertNDCG deltas and does not report confidence intervals or dedicated ablations for the evidence-cycle reward. In the revised version, we will update the evaluation section with tables containing the exact CertNDCG scores for all methods and settings, along with deltas, 95% confidence intervals based on multiple experimental runs, and an ablation analysis that compares the full ECPO objective against variants without the evidence-cycle reward. This will allow a clear assessment of the contribution of each component to the observed gains in evidence-certified ranking. revision: yes
Circularity Check
No significant circularity; derivation remains self-contained
The paper defines ECPO via an interpretable trajectory reward learned from skeleton alignment, argument consistency, and graph features, followed by constrained optimization over three explicitly coupled but independently specified rewards (listwise ranking utility, span-level certificate validity, and evidence-cycle reward from a label-free deterministic verifier). CertNDCG is presented as a reframing of NDCG that incorporates evidence certificates, with the verifier operating on claim-stripped spans as an external reconstruction step rather than a quantity fitted to the policy outputs. No equations or steps reduce the target objective to its own fitted parameters or self-referential supervision by construction; the central claims rest on the stated independence of the verifier and alignment signals, which are described as external to the optimization loop.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Upstream extraction is fixed and reliable
invented entities (1)
- CertNDCG: no independent evidence
Reference graph
Works this paper leans on
-
[1]
Tobias Brockhoff, Malte Heithoff, Istvan Koren, Judith Michael, Jerome Pfeiffer, Bernhard Rumpe, Merih Seran Uysal, Wil M. P. Van Der Aalst, and Andreas Wortmann. Process Prediction with Digital Twins. In 2021 ACM/IEEE International Conference on Model Driven Engineering Languages and Systems Companion (MODELS-C), pages 182–187, Fukuoka, Japan, October 2021. IEEE. ISBN 978-1-6654-2484-4. doi: 10.1109/MODELS-C53483.2021.00032. URL https://ieeexplore.ieee.org/document/9643680/
-
[3]
C. Burges. From RankNet to LambdaRank to LambdaMART: An Overview. Technical Report MSR-TR-2010-82, Microsoft Research, June 2010. URL https://www.semanticscholar.org/paper/From-RankNet-to-LambdaRank-to-LambdaMART%3A-An-Burges/0df9c70875783a73ce1e933079f328e8cf5e9ea2
-
[4]
Seth Ebner, Patrick Xia, Ryan Culkin, Kyle Rawlins, and Benjamin Van Durme. Multi-Sentence Argument Linking. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault, editors, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8057–8077, Online, July 2020. Association for Computational Linguistics. doi:...
-
[5]
Benjamin W. K. Hung and Anura P. Jayasumana. Investigative Simulation: Towards Utilizing Graph Pattern Matching for Investigative Search, August 2016. URL http://arxiv.org/abs/1608.01760. arXiv:1608.01760 [cs].
-
[6]
Alon Jacovi and Yoav Goldberg. Towards Faithfully Interpretable NLP Systems: How Should We Define and Evaluate Faithfulness? In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault, editors, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4198–4205, Online, July 2020. Association for Computational Li...
-
[7]
Discounted Cumulated Gain Based Evaluation of Multiple-Query IR Sessions
Kalervo Järvelin and Jaana Kekäläinen. Cumulated gain-based evaluation of IR techniques. ACM Trans. Inf. Syst., 20(4):422–446, October 2002. ISSN 1046-8188. doi: 10.1145/582415.582418. URL https://doi.org/10.1145/582415.582418
-
[8]
On Faithfulness and Factuality in Abstractive Summarization
Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. On Faithfulness and Factuality in Abstractive Summarization. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault, editors, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1906–1919, Online, July 2020. Association for Computational Li...
-
[9]
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback,...
-
[10]
Gyunam Park and Wil M.P. Van Der Aalst. Realizing A Digital Twin of An Organization Using Action-oriented Process Mining. In 2021 3rd International Conference on Process Mining (ICPM), pages 104–111, Eindhoven, Netherlands, October 2021. IEEE. ISBN 978-1-6654-3514-7. doi: 10.1109/ICPM53251.2021.9576846. URL https://ieeexplore.ieee.org/document/9576846/
-
[11]
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct Preference Optimization: Your Language Model is Secretly a Reward Model, July 2024. URL http://arxiv.org/abs/2305.18290. arXiv:2305.18290 [cs]
-
[12]
Miquel Ramírez and Hector Geffner. Plan recognition as planning. In Proceedings of the 21st International Joint Conference on Artificial Intelligence, IJCAI’09, pages 1778–1783, San Francisco, CA, USA, July 2009. Morgan Kaufmann Publishers Inc
-
[13]
Probabilistic plan recognition using off-the-shelf classical planners
Miquel Ramírez and Hector Geffner. Probabilistic plan recognition using off-the-shelf classical planners. In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, AAAI’10, pages 1121–1126, Atlanta, Georgia, July 2010. AAAI Press
-
[14]
"Why Should I Trust You?": Explaining the Predictions of Any Classifier
Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why Should I Trust You?": Explaining the Predictions of Any Classifier, August 2016. URL http://arxiv.org/abs/1602.04938. arXiv:1602.04938 [cs]
-
[15]
doi:10.18653/v1/2021.emnlp-main.779
Torsten Scholak, Nathan Schucher, and Dzmitry Bahdanau. PICARD: Parsing Incrementally for Constrained Auto-Regressive Decoding from Language Models. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih, editors, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 9895–9901, Online and Punt...
-
[16]
Proximal Policy Optimization Algorithms
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal Policy Optimization Algorithms, August 2017. URL http://arxiv.org/abs/1707.06347. arXiv:1707.06347 [cs]
-
[17]
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models, April 2024. URL http://arxiv.org/abs/2402.03300. arXiv:2402.03300 [cs].
-
[18]
August Daniel Sutmuller, Marielle den Hengst, Ana Isabel Barros, and Pieter van Gelder. Getting the Perpetrator Incorporated and Prioritized in Homicide Investigations: The Development and Evaluation of a Case-Specific Element Library (C-SEL). International Journal of Environmental Research and Public Health, 17(17), September 2020. ISSN 1660-4601. doi: 10...
-
[19]
MAVEN: A Massive General Domain Event Detection Dataset
Xiaozhi Wang, Ziqi Wang, Xu Han, Wangyi Jiang, Rong Han, Zhiyuan Liu, Juanzi Li, Peng Li, Yankai Lin, and Jie Zhou. MAVEN: A Massive General Domain Event Detection Dataset. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1652–1671, Online, 2020. Association for Computational Linguistics. doi: 10...
-
[20]
Xiaozhi Wang, Yulin Chen, Ning Ding, Hao Peng, Zimu Wang, Yankai Lin, Xu Han, Lei Hou, Juanzi Li, Zhiyuan Liu, Peng Li, and Jie Zhou. MAVEN-ERE: A Unified Large-scale Dataset for Event Coreference, Temporal, Causal, and Subevent Relation Extraction. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 926–941, A...
-
[21]
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...
-
[22]
Brian D. Ziebart, Andrew Maas, J. Andrew Bagnell, and Anind K. Dey. Maximum entropy inverse reinforcement learning. In Proceedings of the 23rd national conference on Artificial intelligence - Volume 3, AAAI’08, pages 1433–1438, Chicago, Illinois, July 2008. AAAI Press. ISBN 978-1-57735-368-3.
A Benchmark and Reproducibility Details
Reader roadmap. For re...
-
[23]
Locate the sentence containing the trigger via resolve_sent_id: use the annotated sentence ID when available; otherwise fall back to trigger-text matching
-
[24]
Take sentences in the index range [sent_id − context_radius, sent_id + context_radius] with context_radius = 1, yielding at most three sentences
-
[25]
If the range is empty, fall back to the trigger sentence; if still empty, fall back to the concatenation of all non-empty sentences in the document; if still empty, fall back to the trigger text. The final context is the space-joined concatenation of the selected sentences (a coarse token count is computed by whitespace splitting). This procedure is imple...
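The context-selection steps above can be sketched as follows. The function name `select_context`, its signature, and the choice to treat whitespace-only sentences as empty are assumptions made for illustration; only the radius of 1, the fallback order, the space-joined output, and the whitespace token count come from the text.

```python
def select_context(sentences, sent_id, trigger_text, context_radius=1):
    """Pick a local context window around the trigger sentence, with fallbacks."""

    def non_empty(items):
        # Assumption: a sentence counts as empty when it is blank after stripping.
        return [s for s in items if s.strip()]

    # Window [sent_id - r, sent_id + r], clipped to valid sentence indices.
    window = [
        sentences[i]
        for i in range(sent_id - context_radius, sent_id + context_radius + 1)
        if 0 <= i < len(sentences)
    ]
    selected = non_empty(window)

    # Fallback 1: the trigger sentence alone.
    if not selected and 0 <= sent_id < len(sentences):
        selected = non_empty([sentences[sent_id]])
    # Fallback 2: every non-empty sentence in the document.
    if not selected:
        selected = non_empty(sentences)
    # Fallback 3: the trigger text itself.
    if not selected:
        return trigger_text
    return " ".join(selected)


def coarse_token_count(context):
    """Coarse token count by whitespace splitting, as described in the text."""
    return len(context.split())
```

For example, `select_context(["A.", "B.", "C.", "D."], 1, "B")` joins the three sentences centered on index 1, while an out-of-range `sent_id` falls through to the whole-document fallback.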
-
[26]
Split documents into train/dev/test by doc_id
-
[27]
Train extractor only on train documents
-
[28]
Run the fixed extractor on train/dev/test documents to produce predicted event records
-
[29]
Aggregate predicted event records into window-scoped candidate trajectories
-
[30]
Use train annotations to construct reward-learning positives, hard (near-neighbor) negatives, and preference pairs
-
[31]
Use dev annotations only for model selection and diagnostic evaluation
-
[32]
Use test annotations only for final labels and audit references
-
[33]
Train RM and policy only with train-split supervision
-
[34]
Evaluate all ranking methods on predicted dev/test trajectory inputs.
MAVEN-ERE SFT data scale. The full processed MAVEN-ERE source yields 91,222 possible instruction-tuning instances for trigger/argument extraction before splitting. After document-level splitting, extractor optimization uses only the training-split instances. Dev/test instances are not ...
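The leakage-control idea behind the split procedure above (split by doc_id, then draw all training supervision from train documents only) can be illustrated with a short sketch. The hash-based assignment, the 10% dev and test fractions, and the function names are hypothetical; the source says only that documents are split by doc_id.

```python
import hashlib


def assign_split(doc_id, dev_frac=0.1, test_frac=0.1):
    """Deterministically map a doc_id to train/dev/test.

    Hypothetical scheme: hash the doc_id into one of 1000 buckets so that
    every instance from the same document lands in the same split.
    """
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    bucket = (int.from_bytes(digest[:8], "big") % 1000) / 1000.0
    if bucket < test_frac:
        return "test"
    if bucket < test_frac + dev_frac:
        return "dev"
    return "train"


def training_instances(instances):
    """Keep only instances whose document is in the train split."""
    return [x for x in instances if assign_split(x["doc_id"]) == "train"]
```

Because the split is a pure function of doc_id, an extractor, reward model, or policy trained on `training_instances(...)` never sees supervision from a dev or test document, which is the property the numbered steps are meant to guarantee.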
-
[35]
Prefer match(k−1, t−1) over miss/skip when its score is tied
-
[36]
Prefer skip(k, t−1) over miss(k−1, t) when tied, because skipping a noisy observed event preserves the possibility of matching the current skeleton step later
-
[37]
As a final tie-breaker, prefer smaller temporal gaps (if timestamps are available) to produce more temporally coherent alignments. We record the chosen transition to enable backtracking.
B.3.3 Backtracking and step-to-evidence mapping
Backtracking yields an alignment path and a mapping πalign : {1, …, M} → {1, …, T} ∪ {⊥}, where πalign(k) = t indi...
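The tie-breaking order and backtracking described above can be sketched as a small dynamic program. The scoring is an assumption: `match_score` is a caller-supplied compatibility function, and the miss and skip penalties are hypothetical values. The temporal-gap tie-breaker is omitted. Indices are 0-based here, with `None` standing in for ⊥.

```python
def align(skeleton, events, match_score, miss_penalty=-1.0, skip_penalty=-0.5):
    """Align M skeleton steps to T observed events with match/skip/miss moves.

    Ties are broken in the order match > skip > miss, following the text.
    Returns (total_score, mapping), where mapping[k] is the matched event
    index for skeleton step k, or None if the step was missed.
    """
    M, T = len(skeleton), len(events)
    neg_inf = float("-inf")
    score = [[neg_inf] * (T + 1) for _ in range(M + 1)]
    back = [[None] * (T + 1) for _ in range(M + 1)]
    score[0][0] = 0.0

    for k in range(M + 1):
        for t in range(T + 1):
            if k == 0 and t == 0:
                continue
            options = []  # (score, tie-break priority, move name)
            if k > 0 and t > 0:
                s = score[k - 1][t - 1] + match_score(skeleton[k - 1], events[t - 1])
                options.append((s, 2, "match"))
            if t > 0:  # skip a spurious observed event
                options.append((score[k][t - 1] + skip_penalty, 1, "skip"))
            if k > 0:  # miss a skeleton step
                options.append((score[k - 1][t] + miss_penalty, 0, "miss"))
            best = max(options, key=lambda o: (o[0], o[1]))
            score[k][t], back[k][t] = best[0], best[2]  # record the chosen transition

    # Backtrack from (M, T) to recover the step-to-event mapping.
    mapping = [None] * M
    k, t = M, T
    while k > 0 or t > 0:
        move = back[k][t]
        if move == "match":
            mapping[k - 1] = t - 1
            k, t = k - 1, t - 1
        elif move == "skip":
            t -= 1
        else:
            k -= 1
    return score[M][T], mapping
```

Preferring skip over miss on ties keeps the current skeleton step open, so a noisy event in the middle of a trajectory does not force a later, valid event to go unmatched.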
-
[38]
Traceability: fraction of cited spans that pass document, span-bound, evidence-kind/role, and trajectory-overlap checks against τw(c), without consulting serialized event identifiers
-
[39]
Step coverage: fraction of skeleton steps with at least one valid traceable evidence item for candidate c
-
[40]
Role satisfaction: fraction of required roles supported by traceable argument spans with normalized role agreement
-
[41]
Precedence consistency: fraction of supported step pairs that respect available order or timestamp constraints
-
[42]
Bad-span penalty: duplicate, off-window, non-overlapping, wrong-role, type-incompatible, or multi-candidate-ambiguous evidence citations.
The default score is the fixed weighted sum defined in Section 3.3. A simple unweighted version can be reported as a sensitivity check. Candidate assignments must pass both a support threshold and a margin over the s...
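A minimal sketch of the unweighted sensitivity variant of this score follows. The Section 3.3 weights are not available here, so none are used; the per-citation penalty of 0.1, the threshold of 0.5, and all names are hypothetical. The margin condition is left out because its definition is truncated in the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SupportComponents:
    traceability: float            # fraction of cited spans passing traceability checks
    step_coverage: float           # fraction of skeleton steps with valid evidence
    role_satisfaction: float       # fraction of required roles supported
    precedence_consistency: float  # fraction of supported step pairs in valid order
    bad_span_count: int            # number of penalized citations


def unweighted_support(c, bad_span_penalty=0.1):
    """Mean of the four fractions minus a per-citation penalty, floored at 0.

    Hypothetical form: the penalty size and the floor are assumptions.
    """
    positive = (c.traceability + c.step_coverage
                + c.role_satisfaction + c.precedence_consistency) / 4.0
    return max(0.0, positive - bad_span_penalty * c.bad_span_count)


def passes_support_threshold(c, threshold=0.5):
    """Check only the support threshold; the margin check is not modeled here."""
    return unweighted_support(c) >= threshold
```

Comparing this unweighted score against the default weighted one is the sensitivity check the text mentions: if rankings of candidates change sharply between the two, the result depends heavily on the chosen weights.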
-
[43]
Schema validity: strict JSON parsing; required keys (window_id, topk, certificates); correct types
-
[44]
Candidate id validity: each id in topk belongs to the window’s candidate_ids; no duplicates; length exactly Kw = min(K, Nw)
-
[45]
Certificate coverage: certificates must have the same length as topk. The j-th certificate is the claimed support for the j-th topk candidate during validation, but the certificate object itself contains no candidate_id. Each certificate must contain one step object for every step_id in the window skeleton Sw; missing or unsupported steps are serialized...
-
[46]
Skeleton-step consistency: for each returned candidate position, the set and order of certificates[*].steps[*].step_id must match the skeleton steps in Sw, and each serialized etype must equal the corresponding skeleton-step stage label. A step marked matched:true must provide a non-null event_id and cite at least one valid evidence span. A step marked mat...
-
[47]
Document validity: each cited doc_id exists in the window doc_ids. Window exposure for compact snippets or document-local context units is enforced indirectly through the trajectory-traceability rule rather than by a separate snippet-bound validator
-
[48]
Span bounds: each span satisfies 0 ≤ l < r ≤ Ldoc_id, where Ldoc_id is stored in doc_meta.jsonl. If span is serialized as a string "l-r", the validator parses it into integers (l, r) and interprets it as the same half-open interval [l, r)
-
[49]
Event-id and trajectory traceability: for feasibility validation, if event_id is non-null, it must identify an event in the trajectory τw,ak of the candidate occupying the same rank position in topk. Each cited trigger span must overlap the trigger span of that event. Each cited argument span must overlap one of that event’s argument spans. If kind="arg", ...
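Three of the validator rules above (candidate id validity, span bounds, and span overlap for traceability) are mechanical enough to sketch. The function names and argument shapes are assumptions; the half-open interval reading, the "l-r" string form, and the length rule min(K, Nw) come from the text.

```python
def parse_span(span):
    """Accept a span as an (l, r) pair or an "l-r" string; return integers (l, r)."""
    if isinstance(span, str):
        left, sep, right = span.partition("-")
        if not sep:
            raise ValueError(f"span string must look like 'l-r', got {span!r}")
        return int(left), int(right)
    l, r = span
    return int(l), int(r)


def span_in_bounds(span, doc_length):
    """Check 0 <= l < r <= doc_length for the half-open interval [l, r)."""
    l, r = parse_span(span)
    return 0 <= l < r <= doc_length


def spans_overlap(a, b):
    """True when two half-open intervals share at least one position."""
    (al, ar), (bl, br) = parse_span(a), parse_span(b)
    return max(al, bl) < min(ar, br)


def topk_is_valid(topk, candidate_ids, k):
    """Ids must come from the window roster, be unique, and number min(K, Nw)."""
    expected_len = min(k, len(candidate_ids))
    return (len(topk) == expected_len
            and len(set(topk)) == len(topk)
            and all(c in candidate_ids for c in topk))
```

Under the half-open convention, spans that merely touch, such as "0-5" and "5-8", do not overlap, so a citation that ends exactly where a trigger begins would fail the traceability check.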
-
[50]
Traceability: each cited span overlaps with at least one event mention (trigger or argument span) in the candidate trajectory
-
[51]
Step consistency: for step sk, the traced event must be stage-compatible, i.e., etypek ∈ skeleton_hits(e), and must satisfy required roles under the DP alignment for that candidate.
Table 21: Aligner-independent ReExtractFaithfulness@10. This is an auxiliary audit metric rather than a main optimization target. The verifier is trained only on training...
work page 2013
-
[52]
Institutional review board (IRB) approvals or equivalent for research with human subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or ...