ECPO: Evidence-Coupled Policy Optimization for Evidence-Certified Candidate Ranking

Bokun Wang; Daren Zha; Jun Xiao; Miaobo Hu; Shuhao Hu; Xiaobo Guo; Xin Wang; Yina Sa

arxiv: 2605.21993 · v1 · pith:WEICU6JZnew · submitted 2026-05-21 · 💻 cs.AI · cs.LG

ECPO: Evidence-Coupled Policy Optimization for Evidence-Certified Candidate Ranking

Miaobo Hu , Shuhao Hu , BoKun Wang , Yina Sa , Xin Wang , Xiaobo Guo , Daren Zha , Jun Xiao This is my paper

Pith reviewed 2026-05-22 06:27 UTC · model grok-4.3

classification 💻 cs.AI cs.LG

keywords evidence-certified rankingpolicy optimizationCertNDCGdecision-evidence couplingcandidate rankingverifiable evidencelistwise optimization

0 comments

The pith

Coupling ranking policy optimization with evidence certificate validity enables verifiable decision support in candidate ranking.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines evidence-certified candidate ranking as producing a Top-K list together with doc_id:span evidence certificates whose cited spans suffice to recover the decision. It introduces Evidence-Coupled Policy Optimization (ECPO) as a listwise policy-optimization objective whose action is the joint ranking and evidence certificate. ECPO first learns a trajectory reward from skeleton alignment and argument consistency, then optimizes a constrained policy using three coupled rewards: listwise ranking utility, span-level certificate validity, and an evidence-cycle reward from a label-free deterministic verifier. This reframes the goal from ordinary NDCG to CertNDCG and decision-evidence coupling, and the method is tested on MAVEN-ERE and RAMS with fixed upstream extraction under closed-roster, predicted-roster, and hybrid-roster settings.

Core claim

By treating the action as the joint object of ranking and evidence certificate and optimizing under coupled rewards that include a label-free verifier for evidence-cycle reconstruction, ECPO produces Top-K lists whose cited spans allow independent recovery of the decision, shifting the objective from standard ranking metrics to CertNDCG and decision-evidence coupling.

What carries the argument

Evidence-Coupled Policy Optimization (ECPO), a listwise policy-optimization objective that couples ranking utility, span-level certificate validity, and an evidence-cycle reward computed by a deterministic verifier on claim-stripped spans.

If this is right

Rankings produced under ECPO come with evidence certificates sufficient to reconstruct the decision from cited spans alone.
The evaluation framework compares ECPO to zero-shot, SFT, GRPO, RM-only scoring, and post-hoc rationalization across roster settings.
Optimization under the coupled rewards yields higher CertNDCG than maximizing ordinary NDCG alone.
The approach uses skeleton-aligned trajectory supervision and hard negatives with fixed upstream extraction.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar coupling of utility and verifiability could extend to other high-stakes ranking domains where audit trails matter.
If the verifier generalizes, it suggests evidence attachment need not rely on post-hoc rationalization or full context access.
Testing whether CertNDCG gains persist when the verifier is replaced by a learned one would probe the necessity of the deterministic component.

Load-bearing premise

The label-free deterministic verifier can reliably reconstruct candidate support from claim-stripped cited spans without access to the original labels or full context.

What would settle it

If the verifier applied to ECPO-generated certificates fails to recover the correct candidate support on a held-out test set at a rate significantly higher than baselines, the central claim would be falsified.

Figures

Figures reproduced from arXiv: 2605.21993 by Bokun Wang, Daren Zha, Jun Xiao, Miaobo Hu, Shuhao Hu, Xiaobo Guo, Xin Wang, Yina Sa.

**Figure 1.** Figure 1: ECPO training loop. The policy samples a joint ranking/certificate object. The validator enforces schema and span traceability, while the deterministic evidence-only verifier reconstructs candidates from claim- and event-id-stripped cited spans and supplies the evidence-cycle reward. with one or more doc_id:span references, or an explicit unmatched step with empty evidence. A valid output must have Ly = Kw… view at source ↗

read the original abstract

Ranking systems used in decision-support settings should not only order candidates but also expose evidence that can be independently checked. We study evidence-certified candidate ranking: given an intent_id, a predefined plan skeleton, a window-local candidate roster, and text-derived candidate trajectories with span provenance, a system must output a Top-K list together with doc_id:span evidence certificates whose cited spans are sufficient to recover the decision. We instantiate this task on MAVEN-ERE and RAMS with fixed upstream extraction, window-local randomized candidate identifiers, skeleton-aligned trajectory supervision, hard negatives, and audit references. We introduce Evidence-Coupled Policy Optimization (ECPO), a listwise policy-optimization objective whose action is the joint object of ranking and evidence certificate. ECPO first learns an interpretable trajectory reward from skeleton alignment, argument consistency, and optional graph features; it then optimizes a constrained policy with three coupled rewards: listwise ranking utility, span-level certificate validity, and an evidence-cycle reward computed by a label-free deterministic verifier that reconstructs candidate support from claim-stripped cited spans. This reframes the goal from maximizing ordinary NDCG alone to maximizing CertNDCG and decision-evidence coupling. The evaluation compares ECPO against zero-shot, SFT, and GRPO policies, RM-only scoring with deterministic evidence attachment, grammar/JSON-constrained decoding, validator retry, best-of-N RM selection, and post-hoc evidence rationalization under closed-roster, predicted-roster, and hybrid-roster settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper defines evidence-certified candidate ranking on MAVEN-ERE and RAMS, where a system must return a Top-K list together with doc_id:span certificates whose cited spans suffice to recover the decision. It introduces ECPO, a listwise policy-optimization objective whose action is the joint ranking-plus-certificate object; the method first learns a trajectory reward from skeleton alignment and argument consistency, then optimizes a constrained policy under three coupled rewards (listwise ranking utility, span-level certificate validity, and an evidence-cycle reward produced by a label-free deterministic verifier that reconstructs candidate support from claim-stripped spans). The evaluation compares ECPO to zero-shot, SFT, GRPO, RM-only, grammar-constrained, best-of-N, and post-hoc rationalization baselines under closed-, predicted-, and hybrid-roster conditions, with the central claim being that the approach improves CertNDCG and decision-evidence coupling over ordinary NDCG maximization.

Significance. If the reported CertNDCG gains are robust and the verifier reconstruction proves reliable, the work would supply a concrete mechanism for coupling ranking utility with independently verifiable evidence certificates, a direction of clear practical value for decision-support systems that must support audit and downstream verification.

major comments (2)

[ECPO objective and evidence-cycle reward] The evidence-cycle reward (described in the ECPO objective) relies on a label-free deterministic verifier that reconstructs candidate support solely from claim-stripped cited spans. No explicit mechanism is given for disambiguating incomplete spans or trajectories shared by multiple candidates; if reconstruction accuracy is low, the reward becomes noisy and the joint optimization no longer enforces verifiable certificates, directly undermining the central CertNDCG claim.
[Evaluation and results] The evaluation section reports comparisons under closed-roster, predicted-roster, and hybrid-roster settings but supplies no quantitative CertNDCG deltas, confidence intervals, or ablation isolating the contribution of the evidence-cycle reward versus the other two coupled terms; without these numbers it is impossible to assess whether the reframing to CertNDCG actually produces measurable gains.

minor comments (2)

[Method] The abstract states that the verifier is 'deterministic' yet the full description of its reconstruction rules is deferred; a short pseudocode or formal definition in the method section would improve reproducibility.
[Task definition] Notation for CertNDCG is introduced without an explicit equation; adding a definition parallel to standard NDCG would clarify how the certificate validity term modifies the metric.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our paper. We address each of the major comments below and describe the revisions planned for the manuscript.

read point-by-point responses

Referee: [ECPO objective and evidence-cycle reward] The evidence-cycle reward (described in the ECPO objective) relies on a label-free deterministic verifier that reconstructs candidate support solely from claim-stripped cited spans. No explicit mechanism is given for disambiguating incomplete spans or trajectories shared by multiple candidates; if reconstruction accuracy is low, the reward becomes noisy and the joint optimization no longer enforces verifiable certificates, directly undermining the central CertNDCG claim.

Authors: We thank the referee for this observation. The evidence-cycle reward is computed by a deterministic verifier that takes claim-stripped cited spans and reconstructs the supporting evidence for each candidate using the available trajectory information and span provenance from the datasets. While the current description focuses on the overall objective, we recognize that details on handling incomplete or shared spans are not fully elaborated. To address this, we will expand the method section with a step-by-step description of the verifier, including how it resolves ambiguities through consistency checks with the plan skeleton. Additionally, we will include empirical results on the verifier's reconstruction accuracy to show that the reward remains reliable and supports the CertNDCG improvements. revision: yes
Referee: [Evaluation and results] The evaluation section reports comparisons under closed-roster, predicted-roster, and hybrid-roster settings but supplies no quantitative CertNDCG deltas, confidence intervals, or ablation isolating the contribution of the evidence-cycle reward versus the other two coupled terms; without these numbers it is impossible to assess whether the reframing to CertNDCG actually produces measurable gains.

Authors: We agree with the referee that quantitative details are necessary to substantiate the claims. The manuscript presents comparative results but omits specific numerical values for CertNDCG deltas and does not report confidence intervals or dedicated ablations for the evidence-cycle reward. In the revised version, we will update the evaluation section with tables containing the exact CertNDCG scores for all methods and settings, along with deltas, 95% confidence intervals based on multiple experimental runs, and an ablation analysis that compares the full ECPO objective against variants without the evidence-cycle reward. This will allow a clear assessment of the contribution of each component to the observed gains in evidence-certified ranking. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation remains self-contained

full rationale

The paper defines ECPO via an interpretable trajectory reward learned from skeleton alignment, argument consistency, and graph features, followed by constrained optimization over three explicitly coupled but independently specified rewards (listwise ranking utility, span-level certificate validity, and evidence-cycle reward from a label-free deterministic verifier). CertNDCG is presented as a reframing of NDCG that incorporates evidence certificates, with the verifier operating on claim-stripped spans as an external reconstruction step rather than a quantity fitted to the policy outputs. No equations or steps reduce the target objective to its own fitted parameters or self-referential supervision by construction; the central claims rest on the stated independence of the verifier and alignment signals, which are described as external to the optimization loop.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Review is limited to the abstract; no explicit free parameters, axioms, or invented entities are detailed beyond high-level mentions of rewards and verifier.

axioms (1)

domain assumption Upstream extraction is fixed and reliable
Abstract states 'with fixed upstream extraction'

invented entities (1)

CertNDCG no independent evidence
purpose: Metric combining ranking utility and evidence certificate validity
Introduced as the target metric in place of ordinary NDCG

pith-pipeline@v0.9.0 · 5822 in / 1327 out tokens · 51427 ms · 2026-05-22T06:27:48.721670+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages · 7 internal anchors

[1]

Tobias Brockhoff, Malte Heithoff, Istvan Koren, Judith Michael, Jerome Pfeiffer, Bernhard Rumpe, Merih Seran Uysal, Wil M. P. Van Der Aalst, and Andreas Wortmann. Process Prediction with Digital Twins. In2021 ACM/IEEE International Conference on Model Driven Engineering Languages and Systems Companion (MODELS-C), pages 182–187, Fukuoka, Japan, October

work page
[2]

ISBN 978-1-6654-2484-4

IEEE. ISBN 978-1-6654-2484-4. doi: 10.1109/MODELS-C53483.2021.00032. URL https://ieeexplore.ieee.org/document/9643680/

work page doi:10.1109/models-c53483.2021.00032 2021
[3]

C. Burges. From RankNet to LambdaRank to LambdaMART: An Overview. Tech- nical Report MSR-TR-2010-82, Microsoft Research, June 2010. URL https://www. semanticscholar.org/paper/From-RankNet-to-LambdaRank-to-LambdaMART% 3A-An-Burges/0df9c70875783a73ce1e933079f328e8cf5e9ea2

work page 2010
[4]

In: Zong, C., Xia, F., Li, W., Navigli, R

Seth Ebner, Patrick Xia, Ryan Culkin, Kyle Rawlins, and Benjamin Van Durme. Multi-Sentence Argument Linking. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault, editors, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8057–8077, Online, July 2020. Association for Computational Linguistics. doi:...

work page doi:10.18653/v1/ 2020
[5]

Benjamin W. K. Hung and Anura P. Jayasumana. Investigative Simulation: Towards Utilizing Graph Pattern Matching for Investigative Search, August 2016. URL http://arxiv.org/ abs/1608.01760. arXiv:1608.01760 [cs]. 9

work page internal anchor Pith review Pith/arXiv arXiv 2016
[6]

Alon Jacovi and Yoav Goldberg. Towards Faithfully Interpretable NLP Systems: How Should We Define and Evaluate Faithfulness? In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault, editors,Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4198–4205, Online, July 2020. Association for Computational Li...

work page doi:10.18653/v1/2020.acl-main.386 2020
[7]

Discounted Cumulated Gain Based Evaluation of Multiple-Query IR Sessions

Kalervo Järvelin and Jaana Kekäläinen. Cumulated gain-based evaluation of IR techniques.ACM Trans. Inf. Syst., 20(4):422–446, October 2002. ISSN 1046-8188. doi: 10.1145/582415.582418. URLhttps://doi.org/10.1145/582415.582418

work page doi:10.1145/582415.582418 2002
[8]

On Faithfulness and Factuality in Abstractive Summarization

Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. On Faithfulness and Factuality in Abstractive Summarization. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault, editors,Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1906–1919, Online, July 2020. Association for Computational Li...

work page doi:10.18653/v1/2020.acl-main.173 1906
[9]

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback,...

work page internal anchor Pith review Pith/arXiv arXiv 2022
[10]

Van Der Aalst

Gyunam Park and Wil M.P. Van Der Aalst. Realizing A Digital Twin of An Organization Using Action-oriented Process Mining. In2021 3rd International Conference on Process Mining (ICPM), pages 104–111, Eindhoven, Netherlands, October 2021. IEEE. ISBN 978-1-6654- 3514-7. doi: 10.1109/ICPM53251.2021.9576846. URL https://ieeexplore.ieee.org/ document/9576846/

work page doi:10.1109/icpm53251.2021.9576846 2021
[11]

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct Preference Optimization: Your Language Model is Secretly a Reward Model, July 2024. URLhttp://arxiv.org/abs/2305.18290. arXiv:2305.18290 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2024
[12]

Plan recognition as planning

Miquel Ramírez and Hector Geffner. Plan recognition as planning. InProceedings of the 21st International Joint Conference on Artificial Intelligence, IJCAI’09, pages 1778–1783, San Francisco, CA, USA, July 2009. Morgan Kaufmann Publishers Inc

work page 2009
[13]

Probabilistic plan recognition using off-the-shelf classical planners

Miquel Ramírez and Hector Geffner. Probabilistic plan recognition using off-the-shelf classical planners. InProceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, AAAI’10, pages 1121–1126, Atlanta, Georgia, July 2010. AAAI Press

work page 2010
[14]

"Why Should I Trust You?": Explaining the Predictions of Any Classifier

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why Should I Trust You?": Ex- plaining the Predictions of Any Classifier, August 2016. URLhttp://arxiv.org/abs/1602. 04938. arXiv:1602.04938 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2016
[15]

doi:10.18653/v1/2021.emnlp-main.779

Torsten Scholak, Nathan Schucher, and Dzmitry Bahdanau. PICARD: Parsing Incrementally for Constrained Auto-Regressive Decoding from Language Models. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih, editors,Proceedings of the 2021 Con- ference on Empirical Methods in Natural Language Processing, pages 9895–9901, Online and Punt...

work page doi:10.18653/v1/2021.emnlp-main.779 2021
[16]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal Policy Optimization Algorithms, August 2017. URL http://arxiv.org/abs/1707.06347. arXiv:1707.06347 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2017
[17]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y . K. Li, Y . Wu, and Daya Guo. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models, April 2024. URL http://arxiv.org/ abs/2402.03300. arXiv:2402.03300 [cs]. 10

work page internal anchor Pith review Pith/arXiv arXiv 2024
[18]

August Daniel Sutmuller, Marielle den Hengst, Ana Isabel Barros, and Pieter van Gelder. Getting the Perpetrator Incorporated and Prioritized in Homicide Investigations: The Development and Evaluation of a Case-Specific Element Library (C-SEL).International Journal of Environmental Research and Public Health, 17(17), September 2020. ISSN 1660-4601. doi: 10...

work page 2020
[19]

MA VEN: A Massive General Domain Event Detection Dataset

Xiaozhi Wang, Ziqi Wang, Xu Han, Wangyi Jiang, Rong Han, Zhiyuan Liu, Juanzi Li, Peng Li, Yankai Lin, and Jie Zhou. MA VEN: A Massive General Domain Event Detection Dataset. InProceedings of the 2020 Conference on Empirical Methods in Natural Language Pro- cessing (EMNLP), pages 1652–1671, Online, 2020. Association for Computational Linguis- tics. doi: 10...

work page doi:10.18653/v1/2020.emnlp-main.129 2020
[20]

MA VEN-ERE: A Unified Large-scale Dataset for Event Coreference, Temporal, Causal, and Subevent Relation Extraction

Xiaozhi Wang, Yulin Chen, Ning Ding, Hao Peng, Zimu Wang, Yankai Lin, Xu Han, Lei Hou, Juanzi Li, Zhiyuan Liu, Peng Li, and Jie Zhou. MA VEN-ERE: A Unified Large-scale Dataset for Event Coreference, Temporal, Causal, and Subevent Relation Extraction. InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 926–941, A...

work page 2022
[21]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[22]

window_id

Brian D. Ziebart, Andrew Maas, J. Andrew Bagnell, and Anind K. Dey. Maximum entropy inverse reinforcement learning. InProceedings of the 23rd national conference on Artificial intelligence - Volume 3, AAAI’08, pages 1433–1438, Chicago, Illinois, July 2008. AAAI Press. ISBN 978-1-57735-368-3. 11 A Benchmark and Reproducibility Details Reader roadmap.For re...

work page 2008
[23]

Locate the sentence containing the trigger via resolve_sent_id: use the annotated sen- tence ID when available; otherwise fall back to trigger-text matching

work page
[24]

Take sentences in the index range [sent_id−context_radius,sent_id+ context_radius]withcontext_radius=1, yielding at most three sentences

work page
[25]

doc_text

If the range is empty, fall back to the trigger sentence; if still empty, fall back to the concatenation of all non-empty sentences in the document; if still empty, fall back to the trigger text. The final context is the space-joined concatenation of the selected sentences (a coarse token count is computed by whitespace splitting). This procedure is imple...

work page 2021
[26]

Split documents into train/dev/test by doc_id

work page
[27]

Train extractor only on train documents

work page
[28]

Run the fixed extractor on train/dev/test documents to produce predicted event records

work page
[29]

Aggregate predicted event records into window-scoped candidate trajectories

work page
[30]

Use train annotations to construct reward-learning positives, hard (near-neighbor) negatives, and preference pairs

work page
[31]

Use dev annotations only for model selection and diagnostic evaluation

work page
[32]

Use test annotations only for final labels and audit references

work page
[33]

Train RM and policy only with train-split supervision

work page
[34]

miss”), and (ii) spurious observed events (“skip

Evaluate all ranking methods on predicted dev/test trajectory inputs. MA VEN-ERE SFT data scale.The full processed MA VEN-ERE source yields 91,222 possible instruction-tuning instances for trigger/argument extraction before splitting. After document-level splitting, extractor optimization uses only the training-split instances. Dev/test instances are not ...

work page
[35]

Prefermatch(k−1, t−1) over miss/skip when its score is tied

work page
[36]

Preferskip( k, t−1) over miss (k−1, t) when tied, because skipping a noisy observed event preserves the possibility of matching the current skeleton step later

work page
[37]

We record the chosen transition to enable backtracking

As a final tie-breaker, prefer smaller temporal gaps (if timestamps are available) to produce more temporally coherent alignments. We record the chosen transition to enable backtracking. B.3.3 Backtracking and step-to-evidence mapping Backtracking yields an alignment path and a mapping πalign :{1, . . . , M} → {1, . . . , T} ∪ {⊥}, where πalign(k) =t indi...

work page
[38]

Traceability: fraction of cited spans that pass document, span-bound, evidence-kind/role, and trajectory-overlap checks against τw(c), without consulting serialized event identifiers

work page
[39]

Step coverage: fraction of skeleton steps with at least one valid traceable evidence item for candidatec

work page
[40]

Role satisfaction: fraction of required roles supported by traceable argument spans with normalized role agreement

work page
[41]

Precedence consistency: fraction of supported step pairs that respect available order or timestamp constraints

work page
[42]

$schema":

Bad-span penalty: duplicate, off-window, non-overlapping, wrong-role, type-incompatible, or multi-candidate-ambiguous evidence citations. 37 The default score is the fixed weighted sum defined in Section 3.3. A simple unweighted version can be reported as a sensitivity check. Candidate assignments must pass both a support threshold and a margin over the s...

work page 2020
[43]

Schema validity: strict JSON parsing; required keys (window_id, topk, certificates); correct types

work page
[44]

Candidate id validity: each id in topk belongs to the window’s candidate_ids; no duplicates; length exactlyK w = min(K, Nw)

work page
[45]

The j-th certifi- cate is the claimed support for the j-th topk candidate during validation, but the certificate object itself contains no candidate_id

Certificate coverage: certificates must have the same length as topk. The j-th certifi- cate is the claimed support for the j-th topk candidate during validation, but the certificate object itself contains no candidate_id. Each certificate must contain one step object for every step_id in the window skeleton Sw; missing or unsupported steps are serialized...

work page
[46]

A step marked matched:true must provide a non-null event_id and cite at least one valid evidence span

Skeleton-step consistency: for each returned candidate position, the set and order of certificates[*].steps[*].step_id must match the skeleton steps in Sw, and each serialized etype must equal the corresponding skeleton-step stage label. A step marked matched:true must provide a non-null event_id and cite at least one valid evidence span. A step markedmat...

work page
[47]

Document validity: each cited doc_id exists in the window doc_ids. Window exposure for compact snippets or document-local context units is enforced indirectly through the trajectory-traceability rule rather than by a separate snippet-bound validator

work page
[48]

If span is serialized as a string "l-r", the validator parses it into integers(l, r)and interprets it as the same half-open interval[l, r)

Span bounds: each span satisfies 0≤l < r≤L doc_id, where Ldoc_id is stored in doc_meta.jsonl. If span is serialized as a string "l-r", the validator parses it into integers(l, r)and interprets it as the same half-open interval[l, r)

work page
[49]

arg", the evidence object must include a normalizedrole, and the overlapped argument mention must have the same normalized role. If kind=

Event-id and trajectory traceability: for feasibility validation, if event_id is non-null, it must identify an event in the trajectory τw,ak of the candidate occupying the same rank position in topk. Each cited trigger span must overlap the trigger span of that event. Each cited argument span must overlap one of that event’s argument spans. Ifkind="arg", ...

work page 2000
[50]

Traceability: each cited span overlaps with at least one event mention (trigger or argument span) in the candidate trajectory

work page
[51]

Limitations

Step consistency: for step sk, the traced event must be stage-compatible, i.e., etypek ∈ skeleton_hits(e), and must satisfy required roles under the DP alignment for that candi- date. 46 Table 21: Aligner-independent ReExtractFaithfulness@10. This is an auxiliary audit metric rather than a main optimization target. The verifier is trained only on training...

work page 2013
[52]

Institutional review board (IRB) approvals or equivalent for research with human subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or ...

work page

[1] [1]

Tobias Brockhoff, Malte Heithoff, Istvan Koren, Judith Michael, Jerome Pfeiffer, Bernhard Rumpe, Merih Seran Uysal, Wil M. P. Van Der Aalst, and Andreas Wortmann. Process Prediction with Digital Twins. In2021 ACM/IEEE International Conference on Model Driven Engineering Languages and Systems Companion (MODELS-C), pages 182–187, Fukuoka, Japan, October

work page

[2] [2]

ISBN 978-1-6654-2484-4

IEEE. ISBN 978-1-6654-2484-4. doi: 10.1109/MODELS-C53483.2021.00032. URL https://ieeexplore.ieee.org/document/9643680/

work page doi:10.1109/models-c53483.2021.00032 2021

[3] [3]

C. Burges. From RankNet to LambdaRank to LambdaMART: An Overview. Tech- nical Report MSR-TR-2010-82, Microsoft Research, June 2010. URL https://www. semanticscholar.org/paper/From-RankNet-to-LambdaRank-to-LambdaMART% 3A-An-Burges/0df9c70875783a73ce1e933079f328e8cf5e9ea2

work page 2010

[4] [4]

In: Zong, C., Xia, F., Li, W., Navigli, R

Seth Ebner, Patrick Xia, Ryan Culkin, Kyle Rawlins, and Benjamin Van Durme. Multi-Sentence Argument Linking. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault, editors, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8057–8077, Online, July 2020. Association for Computational Linguistics. doi:...

work page doi:10.18653/v1/ 2020

[5] [5]

Benjamin W. K. Hung and Anura P. Jayasumana. Investigative Simulation: Towards Utilizing Graph Pattern Matching for Investigative Search, August 2016. URL http://arxiv.org/ abs/1608.01760. arXiv:1608.01760 [cs]. 9

work page internal anchor Pith review Pith/arXiv arXiv 2016

[6] [6]

Alon Jacovi and Yoav Goldberg. Towards Faithfully Interpretable NLP Systems: How Should We Define and Evaluate Faithfulness? In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault, editors,Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4198–4205, Online, July 2020. Association for Computational Li...

work page doi:10.18653/v1/2020.acl-main.386 2020

[7] [7]

Discounted Cumulated Gain Based Evaluation of Multiple-Query IR Sessions

Kalervo Järvelin and Jaana Kekäläinen. Cumulated gain-based evaluation of IR techniques.ACM Trans. Inf. Syst., 20(4):422–446, October 2002. ISSN 1046-8188. doi: 10.1145/582415.582418. URLhttps://doi.org/10.1145/582415.582418

work page doi:10.1145/582415.582418 2002

[8] [8]

On Faithfulness and Factuality in Abstractive Summarization

Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. On Faithfulness and Factuality in Abstractive Summarization. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault, editors,Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1906–1919, Online, July 2020. Association for Computational Li...

work page doi:10.18653/v1/2020.acl-main.173 1906

[9] [9]

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback,...

work page internal anchor Pith review Pith/arXiv arXiv 2022

[10] [10]

Van Der Aalst

Gyunam Park and Wil M.P. Van Der Aalst. Realizing A Digital Twin of An Organization Using Action-oriented Process Mining. In2021 3rd International Conference on Process Mining (ICPM), pages 104–111, Eindhoven, Netherlands, October 2021. IEEE. ISBN 978-1-6654- 3514-7. doi: 10.1109/ICPM53251.2021.9576846. URL https://ieeexplore.ieee.org/ document/9576846/

work page doi:10.1109/icpm53251.2021.9576846 2021

[11] [11]

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct Preference Optimization: Your Language Model is Secretly a Reward Model, July 2024. URLhttp://arxiv.org/abs/2305.18290. arXiv:2305.18290 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2024

[12] [12]

Plan recognition as planning

Miquel Ramírez and Hector Geffner. Plan recognition as planning. InProceedings of the 21st International Joint Conference on Artificial Intelligence, IJCAI’09, pages 1778–1783, San Francisco, CA, USA, July 2009. Morgan Kaufmann Publishers Inc

work page 2009

[13] [13]

Probabilistic plan recognition using off-the-shelf classical planners

Miquel Ramírez and Hector Geffner. Probabilistic plan recognition using off-the-shelf classical planners. InProceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, AAAI’10, pages 1121–1126, Atlanta, Georgia, July 2010. AAAI Press

work page 2010

[14] [14]

"Why Should I Trust You?": Explaining the Predictions of Any Classifier

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why Should I Trust You?": Ex- plaining the Predictions of Any Classifier, August 2016. URLhttp://arxiv.org/abs/1602. 04938. arXiv:1602.04938 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2016

[15] [15]

doi:10.18653/v1/2021.emnlp-main.779

Torsten Scholak, Nathan Schucher, and Dzmitry Bahdanau. PICARD: Parsing Incrementally for Constrained Auto-Regressive Decoding from Language Models. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih, editors,Proceedings of the 2021 Con- ference on Empirical Methods in Natural Language Processing, pages 9895–9901, Online and Punt...

work page doi:10.18653/v1/2021.emnlp-main.779 2021

[16] [16]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal Policy Optimization Algorithms, August 2017. URL http://arxiv.org/abs/1707.06347. arXiv:1707.06347 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2017

[17] [17]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y . K. Li, Y . Wu, and Daya Guo. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models, April 2024. URL http://arxiv.org/ abs/2402.03300. arXiv:2402.03300 [cs]. 10

work page internal anchor Pith review Pith/arXiv arXiv 2024

[18] [18]

August Daniel Sutmuller, Marielle den Hengst, Ana Isabel Barros, and Pieter van Gelder. Getting the Perpetrator Incorporated and Prioritized in Homicide Investigations: The Development and Evaluation of a Case-Specific Element Library (C-SEL).International Journal of Environmental Research and Public Health, 17(17), September 2020. ISSN 1660-4601. doi: 10...

work page 2020

[19] [19]

MA VEN: A Massive General Domain Event Detection Dataset

Xiaozhi Wang, Ziqi Wang, Xu Han, Wangyi Jiang, Rong Han, Zhiyuan Liu, Juanzi Li, Peng Li, Yankai Lin, and Jie Zhou. MA VEN: A Massive General Domain Event Detection Dataset. InProceedings of the 2020 Conference on Empirical Methods in Natural Language Pro- cessing (EMNLP), pages 1652–1671, Online, 2020. Association for Computational Linguis- tics. doi: 10...

work page doi:10.18653/v1/2020.emnlp-main.129 2020

[20] [20]

MA VEN-ERE: A Unified Large-scale Dataset for Event Coreference, Temporal, Causal, and Subevent Relation Extraction

Xiaozhi Wang, Yulin Chen, Ning Ding, Hao Peng, Zimu Wang, Yankai Lin, Xu Han, Lei Hou, Juanzi Li, Zhiyuan Liu, Peng Li, and Jie Zhou. MA VEN-ERE: A Unified Large-scale Dataset for Event Coreference, Temporal, Causal, and Subevent Relation Extraction. InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 926–941, A...

work page 2022

[21] [21]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[22] [22]

window_id

Brian D. Ziebart, Andrew Maas, J. Andrew Bagnell, and Anind K. Dey. Maximum entropy inverse reinforcement learning. InProceedings of the 23rd national conference on Artificial intelligence - Volume 3, AAAI’08, pages 1433–1438, Chicago, Illinois, July 2008. AAAI Press. ISBN 978-1-57735-368-3. 11 A Benchmark and Reproducibility Details Reader roadmap.For re...

work page 2008

[23] [23]

Locate the sentence containing the trigger via resolve_sent_id: use the annotated sen- tence ID when available; otherwise fall back to trigger-text matching

work page

[24] [24]

Take sentences in the index range [sent_id−context_radius,sent_id+ context_radius]withcontext_radius=1, yielding at most three sentences

work page

[25] [25]

doc_text

If the range is empty, fall back to the trigger sentence; if still empty, fall back to the concatenation of all non-empty sentences in the document; if still empty, fall back to the trigger text. The final context is the space-joined concatenation of the selected sentences (a coarse token count is computed by whitespace splitting). This procedure is imple...

work page 2021

[26] [26]

Split documents into train/dev/test by doc_id

work page

[27] [27]

Train extractor only on train documents

work page

[28] [28]

Run the fixed extractor on train/dev/test documents to produce predicted event records

work page

[29] [29]

Aggregate predicted event records into window-scoped candidate trajectories

work page

[30] [30]

Use train annotations to construct reward-learning positives, hard (near-neighbor) negatives, and preference pairs

work page

[31] [31]

Use dev annotations only for model selection and diagnostic evaluation

work page

[32] [32]

Use test annotations only for final labels and audit references

work page

[33] [33]

Train RM and policy only with train-split supervision

work page

[34] [34]

miss”), and (ii) spurious observed events (“skip

Evaluate all ranking methods on predicted dev/test trajectory inputs. MA VEN-ERE SFT data scale.The full processed MA VEN-ERE source yields 91,222 possible instruction-tuning instances for trigger/argument extraction before splitting. After document-level splitting, extractor optimization uses only the training-split instances. Dev/test instances are not ...

work page

[35] [35]

Prefermatch(k−1, t−1) over miss/skip when its score is tied

work page

[36] [36]

Preferskip( k, t−1) over miss (k−1, t) when tied, because skipping a noisy observed event preserves the possibility of matching the current skeleton step later

work page

[37] [37]

We record the chosen transition to enable backtracking

As a final tie-breaker, prefer smaller temporal gaps (if timestamps are available) to produce more temporally coherent alignments. We record the chosen transition to enable backtracking. B.3.3 Backtracking and step-to-evidence mapping Backtracking yields an alignment path and a mapping πalign :{1, . . . , M} → {1, . . . , T} ∪ {⊥}, where πalign(k) =t indi...

work page

[38] [38]

Traceability: fraction of cited spans that pass document, span-bound, evidence-kind/role, and trajectory-overlap checks against τw(c), without consulting serialized event identifiers

work page

[39] [39]

Step coverage: fraction of skeleton steps with at least one valid traceable evidence item for candidatec

work page

[40] [40]

Role satisfaction: fraction of required roles supported by traceable argument spans with normalized role agreement

work page

[41] [41]

Precedence consistency: fraction of supported step pairs that respect available order or timestamp constraints

work page

[42] [42]

$schema":

Bad-span penalty: duplicate, off-window, non-overlapping, wrong-role, type-incompatible, or multi-candidate-ambiguous evidence citations. 37 The default score is the fixed weighted sum defined in Section 3.3. A simple unweighted version can be reported as a sensitivity check. Candidate assignments must pass both a support threshold and a margin over the s...

work page 2020

[43] [43]

Schema validity: strict JSON parsing; required keys (window_id, topk, certificates); correct types

work page

[44] [44]

Candidate id validity: each id in topk belongs to the window’s candidate_ids; no duplicates; length exactlyK w = min(K, Nw)

work page

[45] [45]

The j-th certifi- cate is the claimed support for the j-th topk candidate during validation, but the certificate object itself contains no candidate_id

Certificate coverage: certificates must have the same length as topk. The j-th certifi- cate is the claimed support for the j-th topk candidate during validation, but the certificate object itself contains no candidate_id. Each certificate must contain one step object for every step_id in the window skeleton Sw; missing or unsupported steps are serialized...

work page

[46] [46]

A step marked matched:true must provide a non-null event_id and cite at least one valid evidence span

Skeleton-step consistency: for each returned candidate position, the set and order of certificates[*].steps[*].step_id must match the skeleton steps in Sw, and each serialized etype must equal the corresponding skeleton-step stage label. A step marked matched:true must provide a non-null event_id and cite at least one valid evidence span. A step markedmat...

work page

[47] [47]

Document validity: each cited doc_id exists in the window doc_ids. Window exposure for compact snippets or document-local context units is enforced indirectly through the trajectory-traceability rule rather than by a separate snippet-bound validator

work page

[48] [48]

If span is serialized as a string "l-r", the validator parses it into integers(l, r)and interprets it as the same half-open interval[l, r)

Span bounds: each span satisfies 0≤l < r≤L doc_id, where Ldoc_id is stored in doc_meta.jsonl. If span is serialized as a string "l-r", the validator parses it into integers(l, r)and interprets it as the same half-open interval[l, r)

work page

[49] [49]

arg", the evidence object must include a normalizedrole, and the overlapped argument mention must have the same normalized role. If kind=

Event-id and trajectory traceability: for feasibility validation, if event_id is non-null, it must identify an event in the trajectory τw,ak of the candidate occupying the same rank position in topk. Each cited trigger span must overlap the trigger span of that event. Each cited argument span must overlap one of that event’s argument spans. Ifkind="arg", ...

work page 2000

[50] [50]

Traceability: each cited span overlaps with at least one event mention (trigger or argument span) in the candidate trajectory

work page

[51] [51]

Limitations

Step consistency: for step sk, the traced event must be stage-compatible, i.e., etypek ∈ skeleton_hits(e), and must satisfy required roles under the DP alignment for that candi- date. 46 Table 21: Aligner-independent ReExtractFaithfulness@10. This is an auxiliary audit metric rather than a main optimization target. The verifier is trained only on training...

work page 2013

[52] [52]

Institutional review board (IRB) approvals or equivalent for research with human subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or ...

work page