DTO: a Differentiable Training Objective for Effective Counterfactual Story Rewriting

Amelia Girard; Massimo Piccardi

arxiv: 2605.24885 · v1 · pith:3AQO5H76new · submitted 2026-05-24 · 💻 cs.CL

Pith reviewed 2026-06-30 12:18 UTC · model grok-4.3

keywords: counterfactual story rewriting · differentiable training objective · transformer fine-tuning · narrative consistency · localized edits · TimeTravel dataset · ART dataset

The pith

A joint differentiable loss on rewrite fidelity and narrative consistency improves localized edits in counterfactual story rewriting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a differentiable training objective for updating a story to reflect a new event while leaving unaffected parts intact. Standard maximum-likelihood training tends to miss the small required changes, and reinforcement learning approaches are slow and difficult to set up. The method fine-tunes a transformer end-to-end by backpropagating a loss that simultaneously rewards matching a reference rewrite and preserving semantic consistency with the original story. Evaluation on the TimeTravel and ART datasets shows the approach beating a maximum-likelihood baseline and a preference-based method while performing competitively against two contemporary large language models on all evaluation metrics. A sympathetic reader would care because it offers a simpler, fully differentiable route to controlled text changes without complex reinforcement learning.

Core claim

The central claim is that a transformer model fine-tuned via end-to-end backpropagation against a fully differentiable loss function that jointly rewards fidelity to the reference rewrite and semantic consistency with the source narrative produces better counterfactual story rewrites than maximum-likelihood or preference-based training and remains competitive with contemporary large language models on the TimeTravel and ART datasets.

What carries the argument

The DTO loss, a fully differentiable objective combining a fidelity term to the reference rewrite with a semantic consistency term to the source narrative, optimized through backpropagation during fine-tuning.

Load-bearing premise

The joint loss steers the model toward precisely localized edits, and the model does not discover unintended shortcuts that score well while failing to produce the intended story changes.

What would settle it

A test set where models trained with the DTO loss achieve high fidelity and consistency scores yet still alter story elements that the reference rewrites leave unchanged.
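One way to make this check concrete is sketched below. It is only an illustration: the function name `altered_unchanged_sentences`, the position-aligned sentence lists, and the toy story are all assumptions introduced here, not part of the paper or its datasets. The sketch flags sentences that the reference rewrite keeps verbatim but the model output changes.

```python
def altered_unchanged_sentences(original, reference, output):
    """Return indices of sentences the reference keeps verbatim but the output changes.

    All three arguments are lists of sentences aligned by position.
    That alignment is an assumption; real endings may need a sentence
    aligner before a check like this is meaningful.
    """
    if not (len(original) == len(reference) == len(output)):
        raise ValueError("sentence lists must be position-aligned and equal in length")
    violations = []
    for i, (src, ref, out) in enumerate(zip(original, reference, output)):
        reference_kept_it = src.strip() == ref.strip()
        output_changed_it = src.strip() != out.strip()
        if reference_kept_it and output_changed_it:
            violations.append(i)
    return violations


# Toy, made-up story ending (not taken from TimeTravel or ART).
original = ["She packed an umbrella.", "It rained all day.", "She stayed dry."]
reference = ["She packed sunglasses.", "It rained all day.", "She got soaked."]
output = ["She packed sunglasses.", "It was sunny all day.", "She got soaked."]

violations = altered_unchanged_sentences(original, reference, output)
print(violations)
```

A non-empty result on examples that also score well on fidelity and consistency would be the kind of evidence this section describes. Exact string matching is deliberately strict; a softer comparison would be a separate design choice.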

Figures

Figures reproduced from arXiv: 2605.24885 by Amelia Girard, Massimo Piccardi.

Original abstract

Counterfactual story rewriting is a natural language processing task that requires updating an existing story to reflect a chosen alternative event, yet preserving all the unaffected storyline elements and overall coherence. While large language models have recently made remarkable progress on this task, it still remains challenging since the required modifications are typically very small in size and highly localized. As a consequence, models trained in a conventional manner with the maximum-likelihood training objective tend to overlook these nuances. At the same time, more sophisticated training approaches based on reinforcement learning are notoriously slow and difficult to set up. For these reasons, our paper proposes a novel, differentiable training objective (DTO) that directly optimizes for the requisite counterfactual improvements. In our approach, a transformer model is fine-tuned via end-to-end backpropagation against a fully differentiable loss function that jointly rewards (i) fidelity to the reference rewrite and (ii) semantic consistency with the source narrative. The empirical evaluation on the TimeTravel and ART datasets shows that the proposed DTO approach has been able to surpass a maximum-likelihood baseline and a preference-based approach, and perform competitively against two contemporary large language models in all evaluation metrics. These findings substantiate the effectiveness of task-specific differentiable objectives for nuanced, controlled text-generation tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

DTO gives a straightforward differentiable loss for localized story edits that beats MLE on two datasets but leaves the exact formulation and stats unclear from the abstract.

The key point is that the paper introduces DTO, a fully differentiable objective for fine-tuning transformers on counterfactual story rewriting. It backpropagates through a joint loss on fidelity to reference rewrites plus semantic consistency with the source story, and reports better results than maximum-likelihood training or a preference baseline while remaining competitive with two LLMs on TimeTravel and ART.

What is actually new is the targeted use of this joint differentiable setup for a task where edits must stay small and localized. Standard MLE tends to ignore those nuances, and RL alternatives are cumbersome, so DTO positions itself as a simpler middle path that still optimizes the right signals end-to-end.

The paper does a clean job framing the problem and showing the method on two established datasets with multiple metrics. That gives a practical signal that the approach can work without heavy RL machinery.

The soft spots are real but not fatal. The abstract supplies no explicit loss formula, no description of how the fidelity and consistency terms are scaled or balanced, and no error bars or statistical testing details. Without those, it is hard to judge whether the formulation is mathematically distinct from other differentiable consistency losses or whether the gains are robust. The assumption that the joint objective avoids shortcuts also needs the full experiments to confirm.

This is for people working on controlled story generation or similar editing tasks who want training options short of RL. A reader already running transformer fine-tunes on narrative data could pick up the idea and test it quickly if the math is filled in.

The central empirical claim is testable and the method is grounded enough to warrant referee time, even if revisions will be needed for the missing specifics.

Referee Report

1 major / 0 minor

Summary. The paper proposes a differentiable training objective (DTO) for counterfactual story rewriting. A transformer is fine-tuned end-to-end via backpropagation on a loss that jointly rewards fidelity to reference rewrites and semantic consistency with the source narrative. On the TimeTravel and ART datasets the method is reported to surpass MLE and preference-based baselines while remaining competitive with two contemporary LLMs across all evaluation metrics.

Significance. If the empirical results prove robust under proper statistical controls, the work would illustrate that task-specific differentiable objectives can serve as a practical, faster alternative to reinforcement learning for controlled text-generation problems that demand highly localized edits. This could encourage further exploration of custom losses in nuanced NLP generation tasks.

major comments (1)

[Abstract] The loss is described only qualitatively as jointly rewarding "(i) fidelity to the reference rewrite and (ii) semantic consistency with the source narrative" with no explicit functional form, definition of either term, balancing coefficient, or normalization procedure supplied. This formulation is load-bearing for the central claim that the objective is fully differentiable and produces the stated performance gains.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment. We address the single major point below and will update the abstract in the revised manuscript.

Referee: [Abstract] The loss is described only qualitatively as jointly rewarding "(i) fidelity to the reference rewrite and (ii) semantic consistency with the source narrative" with no explicit functional form, definition of either term, balancing coefficient, or normalization procedure supplied. This formulation is load-bearing for the central claim that the objective is fully differentiable and produces the stated performance gains.

Authors: We agree that the abstract presents the DTO loss at a high level. The explicit functional form (including the fidelity term based on token-level cross-entropy to the reference rewrite, the semantic consistency term implemented via a differentiable sentence-embedding cosine similarity, the balancing coefficient λ, and the normalization) is fully specified in Section 3.2 together with the end-to-end differentiability proof. To strengthen the abstract, we will insert a single sentence that names the two loss components, states that they are combined with a scalar λ, and notes that both are differentiable, thereby making the central claim self-contained while preserving the abstract's length constraints.

Revision: yes
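The two components named in this simulated rebuttal can be sketched in a few lines. The sketch below is an assumption-laden illustration, not the paper's Section 3.2: the function names, the tiny three-token vocabulary, the embedding table, and the additive form `fidelity + lam * consistency` are all hypothetical. The consistency term uses the expected token embedding under the model's output distribution, one common way to avoid a non-differentiable argmax, so the loss stays a smooth function of the token probabilities.

```python
import math


def cross_entropy(token_probs, reference_ids):
    """Mean negative log-probability assigned to the reference tokens."""
    total = sum(math.log(probs[t]) for probs, t in zip(token_probs, reference_ids))
    return -total / len(reference_ids)


def soft_sentence_embedding(token_probs, embedding_table):
    """Average over positions of the expected token embedding at each position."""
    dim = len(embedding_table[0])
    sentence = [0.0] * dim
    for probs in token_probs:
        for token_id, p in enumerate(probs):
            for d in range(dim):
                sentence[d] += p * embedding_table[token_id][d]
    return [value / len(token_probs) for value in sentence]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def joint_loss(token_probs, reference_ids, source_embedding, embedding_table, lam=0.5):
    """Hypothetical joint loss: cross-entropy plus lam times (1 - cosine similarity)."""
    fidelity = cross_entropy(token_probs, reference_ids)
    soft = soft_sentence_embedding(token_probs, embedding_table)
    consistency = 1.0 - cosine(soft, source_embedding)
    return fidelity + lam * consistency


# Toy inputs: 3-token vocabulary, 2-dimensional embeddings, 2 output positions.
embedding_table = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
token_probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
reference_ids = [0, 1]
source_embedding = [1.0, 1.0]

loss = joint_loss(token_probs, reference_ids, source_embedding, embedding_table, lam=0.5)
print(loss)
```

In a real fine-tuning run the same computation would be written with an automatic-differentiation library so that gradients flow back into the transformer; the plain-Python version only shows how the terms combine. Setting `lam=0.0` recovers ordinary cross-entropy training.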

Circularity Check

0 steps flagged

No significant circularity

The paper proposes DTO as an explicitly constructed differentiable loss that jointly optimizes fidelity to external reference rewrites and semantic consistency with the source narrative. The central claim is an empirical result (outperformance on TimeTravel and ART) obtained by training a transformer with this loss; no derivation, uniqueness theorem, or prediction is asserted that reduces by construction to fitted parameters, self-citations, or renamed inputs. The objective is defined against independent external measures, making the reported results testable and non-circular.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

The approach rests on the unstated premise that a linear or additive combination of two external reward signals is both sufficient and stable for gradient-based optimization of localized edits.

free parameters (1)

balancing coefficient between fidelity and consistency terms
Any joint reward of the form alpha * fidelity + (1-alpha) * consistency requires choosing or fitting alpha; the abstract gives no value or selection procedure.
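To see why the choice of alpha matters, the sketch below scores two candidate rewrites under the convex combination described above. The candidate names and their scores are made-up numbers for illustration only, and `combined` and `preferred` are hypothetical helpers, not anything taken from the paper.

```python
def combined(fidelity, consistency, alpha):
    """Convex combination of two scores in [0, 1]; higher is better."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * fidelity + (1.0 - alpha) * consistency


# Hypothetical scores for two candidate rewrites (invented for illustration).
candidates = {
    "heavy_edit": {"fidelity": 0.90, "consistency": 0.60},
    "light_edit": {"fidelity": 0.70, "consistency": 0.95},
}


def preferred(alpha):
    """Name of the candidate with the highest combined score at this alpha."""
    return max(candidates, key=lambda name: combined(alpha=alpha, **candidates[name]))


for alpha in (0.2, 0.5, 0.8):
    print(alpha, preferred(alpha))
```

With these numbers the preferred candidate flips between low and high alpha, which is why a reported value or selection procedure for the coefficient is needed to reproduce the results.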

pith-pipeline@v0.9.1-grok · 5740 in / 1111 out tokens · 27646 ms · 2026-06-30T12:18:09.466030+00:00 · methodology

Reference graph

Works this paper leans on

52 extracted references · 6 canonical work pages · 3 internal anchors

[1]

Dictionary of Philosophy

A. Angeles, Dictionary of Philosophy. Barnes & Noble Books, 1981

1981
[2]

Informal logic and the theory of reasoning,

M. A. Finocchiaro, “Informal logic and the theory of reasoning,” Informal Logic, vol. 6, no. 2, 1984

1984
[3]

Critical thinking as argument analysis,

T. Govier, “Critical thinking as argument analysis,” Argumentation, vol. 3, no. 2, pp. 115–126, 1989

1989
[4]

Thinking, fast and slow

D. Kahneman, Thinking, fast and slow. Macmillan, 2011

2011
[5]

P. C. Wason and P. N. Johnson-Laird, Psychology of Reasoning: Structure and Content, vol. 86. Harvard University Press, 1972

1972
[6]

Philosophical reasoning,

J. A. Passmore, “Philosophical reasoning,” 1961

1961
[7]

The Book of Why: The New Science of Cause and Effect

J. Pearl and D. Mackenzie, The Book of Why: The New Science of Cause and Effect. Basic Books, 2018

2018
[8]

Can large language models infer causation from correlation?,

Z. Jin, J. Liu, Z. Lyu, S. Poff, M. Sachan, R. Mihalcea, M. Diab, and B. Schölkopf, “Can large language models infer causation from correlation?,” 2024

2024
[9]

Investigating causal understanding in llms,

M. Hobbhahn, T. Lieberum, and D. Seiler, “Investigating causal understanding in llms,” NeurIPS ML Safety Workshop, 2022

2022
[10]

Towards understanding how machines can learn causal overhypotheses,

E. Kosoy, D. M. Chan, A. Liu, J. Collins, B. Kaufmann, et al., “Towards understanding how machines can learn causal overhypotheses,” arXiv preprint arXiv:2206.10591, 2022

work page arXiv 2022
[11]

CommonGen: A constrained text generation challenge for generative commonsense reasoning,

B. Y. Lin, W. Zhou, M. Shen, P. Zhou, C. Bhagavatula, Y. Choi, and X. Ren, “CommonGen: A constrained text generation challenge for generative commonsense reasoning,” in Findings of the Association for Computational Linguistics: EMNLP 2020 (T. Cohn, Y. He, and Y. Liu, eds.), (Online), pp. 1823–1840, Association for Computational Linguistics, Nov. 2020

2020
[12]

Counterfactual story reasoning and generation,

L. Qin, A. Bosselut, A. Holtzman, C. Bhagavatula, E. Clark, and Y. Choi, “Counterfactual story reasoning and generation,” in Proceedings of EMNLP-IJCNLP, 2019

2019
[13]

Unsupervised editing for counterfactual stories,

J. Chen, C. Gan, S. Cheng, H. Zhou, Y. Xiao, and L. Li, “Unsupervised editing for counterfactual stories,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2022

2022
[14]

Sketch and customize: A counterfactual story generator,

Y. Hao, X. Lin, X. Zhou, and M. Huang, “Sketch and customize: A counterfactual story generator,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 11684–11692, 2021

2021
[15]

A causal approach for counterfactual reasoning in narratives,

X. Mu and Q. Li, “A causal approach for counterfactual reasoning in narratives,” in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), 2024

2024
[16]

Beyond what if: Advancing counterfactual text generation with structural causal modeling,

X. Wang, Y. Zhang, J. Wu, and W. X. Zhao, “Beyond what if: Advancing counterfactual text generation with structural causal modeling,” in Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 2024

2024
[17]

Bartscore: Evaluating generated text as text generation,

W. Yuan, G. Neubig, and P. Liu, “Bartscore: Evaluating generated text as text generation,” NeurIPS, 2021

2021
[18]

Differentiable expected bleu for text generation,

W. Wang, Z. Hu, Z. Yang, H. Shi, and E. P. Xing, “Differentiable expected bleu for text generation,” in International Conference on Learning Representations, 2019

2019
[19]

BERTTune: Fine-tuning neural machine translation with BERTScore,

I. Jauregi Unanue, J. Parnell, and M. Piccardi, “BERTTune: Fine-tuning neural machine translation with BERTScore,” in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 915–924, 2021

2021
[20]

Don’t take it literally: An edit-invariant sequence loss for text generation,

S. Liu, Y. Wang, N. Goyal, J. Wei, and G. Neubig, “Don’t take it literally: An edit-invariant sequence loss for text generation,” arXiv preprint arXiv:2205.12684, 2022

work page arXiv 2022
[21]

Simple statistical gradient-following algorithms for connectionist reinforcement learning,

R. J. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning,” Machine Learning, vol. 8, pp. 229–256, 1992

1992
[22]

Harnessing the power of llms in practice: A survey on chatgpt and beyond,

J. Yang, H. Jin, R. Tang, X. Han, Q. Feng, H. Jiang, S. Zhong, B. Yin, and X. Hu, “Harnessing the power of llms in practice: A survey on chatgpt and beyond,” ACM Transactions on Knowledge Discovery from Data, vol. 18, no. 6, pp. 1–32, 2024

2024
[23]

Bleu: a method for automatic evaluation of machine translation,

K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: a method for automatic evaluation of machine translation,” in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), (Philadelphia, PA, USA), pp. 311–318, Association for Computational Linguistics, 2002

2002
[24]

ROUGE: A package for automatic evaluation of summaries,

C.-Y. Lin, “ROUGE: A package for automatic evaluation of summaries,” in Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, (Barcelona, Spain), pp. 74–81, 2004

2004
[25]

Re-evaluating the role of BLEU in machine translation research,

C. Callison-Burch, M. Osborne, and P. Koehn, “Re-evaluating the role of BLEU in machine translation research,” in Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics, pp. 249–256, 2006

2006
[26]

Why we need new evaluation metrics for NLG,

J. Novikova, O. Dušek, and V. Rieser, “Why we need new evaluation metrics for NLG,” in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 2241–2252, 2017

2017
[27]

Meteor: An automatic metric for mt evaluation with improved correlation with human judgments,

S. Banerjee and A. Lavie, “Meteor: An automatic metric for mt evaluation with improved correlation with human judgments,” in Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pp. 65–72, 2005

2005
[28]

Comet: A neural framework for mt evaluation,

R. Rei, A. Farinha, A. Lavie, and L. Specia, “Comet: A neural framework for mt evaluation,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 2685–2702, 2020

2020
[29]

BERTScore: Evaluating text generation with BERT,

T. Zhang, V. Kishore, F. Wu, K. Weinberger, and Y. Artzi, “BERTScore: Evaluating text generation with BERT,” in International Conference on Learning Representations (ICLR), 2020

2020
[30]

Bleurt: Learning robust metrics for text generation,

T. Sellam, D. Das, and A. Parikh, “Bleurt: Learning robust metrics for text generation,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 7881–7892, 2020

2020
[31]

Neural machine translation by jointly learning to align and translate,

D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” in 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings (Y. Bengio and Y. LeCun, eds.), 2015

2015
[32]

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

R. Rafailov, A. Sharma, E. Mitchell, S. Ermon, C. D. Manning, and C. Finn, “Direct preference optimization: Your language model is secretly a reward model,” CoRR, vol. abs/2305.18290, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[33]

Contrastive preference optimization: Pushing the boundaries of llm performance in machine translation,

H. Xu, A. Sharaf, Y. Chen, W. Tan, L. Shen, B. Van Durme, K. Murray, and Y. J. Kim, “Contrastive preference optimization: Pushing the boundaries of llm performance in machine translation,” arXiv preprint arXiv:2401.08417, 2024

work page arXiv 2024
[34]

Training objectives and evaluation metrics for counterfactual story rewriting,

A. Girard, I. J. Unanue, and M. Piccardi, “Training objectives and evaluation metrics for counterfactual story rewriting,” ACM Transactions on Asian and Low-Resource Language Information Processing, pp. 1–11, 2026

2026
[35]

Learning transferable visual models from natural language supervision,

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervision,” in Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event (M. Meila and T. Zhang, eds.), ...

2021
[36]

Categorical reparameterization with Gumbel-softmax,

E. Jang, S. Gu, and B. Poole, “Categorical reparameterization with Gumbel-softmax,” in 5th International Conference on Learning Representations, ICLR 2017, 2017

2017
[37]

XVD: cross-vocabulary differentiable training for generative adversarial attacks,

T. Roth, I. J. Unanue, A. Abuadbba, and M. Piccardi, “XVD: cross-vocabulary differentiable training for generative adversarial attacks,” in Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC/COLING 2024, pp. 17753–17763, 2024

2024
[38]

Backpropagation through the void: Optimizing control variates for black-box gradient estimation,

W. Grathwohl, D. Choi, Y. Wu, G. Roeder, and D. Duvenaud, “Backpropagation through the void: Optimizing control variates for black-box gradient estimation,” in 6th International Conference on Learning Representations, ICLR 2018, 2018

2018
[39]

R. S. Sutton and A. G. Barto, Reinforcement learning - an introduction. MIT Press, 1998

1998
[40]

Abductive commonsense reasoning,

C. Bhagavatula, R. L. Bras, C. Malaviya, K. Sakaguchi, A. Holtzman, H. Rashkin, D. Downey, W. tau Yih, and Y. Choi, “Abductive commonsense reasoning,” in Proceedings of the Eighth International Conference on Learning Representations (ICLR), 2020. OpenReview ID: Byg1v1HKDB

2020
[41]

A corpus and cloze evaluation for deeper understanding of commonsense stories,

N. Mostafazadeh, N. Chambers, X. He, D. Parikh, D. Batra, L. Vanderwende, P. Kohli, and J. Allen, “A corpus and cloze evaluation for deeper understanding of commonsense stories,” in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 839–849, 2016

2016
[42]

BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension

M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer, “Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension,” arXiv preprint arXiv:1910.13461, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1910
[43]

A call for clarity in reporting BLEU scores,

M. Post, “A call for clarity in reporting BLEU scores,” in Proceedings of the Third Conference on Machine Translation: Research Papers (O. Bojar, R. Chatterjee, C. Federmann, M. Fishel, Y. Graham, B. Haddow, M. Huck, A. J. Yepes, P. Koehn, C. Monz, M. Negri, A. Névéol, M. Neves, M. Post, L. Specia, M. Turchi, and K. Verspoor, eds.), (Brussels, Belgium), pp...

2018
[44]

Back to the future: Unsupervised backprop-based decoding for counterfactual and abductive commonsense reasoning,

L. Qin, V. Shwartz, P. West, C. Bhagavatula, J. D. Hwang, R. Le Bras, A. Bosselut, and Y. Choi, “Back to the future: Unsupervised backprop-based decoding for counterfactual and abductive commonsense reasoning,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (B. Webber, T. Cohn, Y. He, and Y. Liu, eds.)...

2020
[45]

The hitchhiker’s guide to testing statistical significance in natural language processing,

R. Dror, G. Baumer, S. Shlomov, and R. Reichart, “The hitchhiker’s guide to testing statistical significance in natural language processing,” in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (I. Gurevych and Y. Miyao, eds.), (Melbourne, Australia), pp. 1383–1392, Association for Computationa...

2018
[46]

Proximal Policy Optimization Algorithms

J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” CoRR, vol. abs/1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[47]

Free vs paid ChatGPT token limits: A comprehensive guide,

Tactiq, “Free vs paid ChatGPT token limits: A comprehensive guide,”
[48]

winner” and “loser

Accessed: 2024-07-18. APPENDIX A: DIRECT AND CONTRASTIVE PREFERENCE OPTIMIZATION. Rafailov et al. in [32] reformulated reinforcement learning from human feedback (RLHF) as a pairwise differentiable training objective. Given an input, x, and two candidate outputs, a preferred output y_w and a dispreferred output y_l (informally referred to as “winner” and “loser”, hen...

2024
[49]

APPENDIX B: MODEL HYPERPARAMETERS. All our experiments have used the pretrained facebook/bart-large-cnn model as the base model

The key difference between our proposed objectives and CPO’s is that we directly optimize a desirable evaluation metric rather than a generic likelihood, increasing the chances that our trained model will perform along the expectations of the targeted task. APPENDIX B: MODEL HYPERPARAMETERS. All our experiments have used the pretrained facebook/bart-large-cnn...
[50]

The edited ending should remain as close as possible to the original ending

Minimal Intervention: Adjust the story’s original ending with minimal changes needed to align it with the counterfactual event. The edited ending should remain as close as possible to the original ending
[51]

Narrative Insight: Understand the story structure and make changes essential for maintaining the story’s coherence and thematic consistency, avoiding unnecessary alterations
[52]

Counterfactual Adaptability: Adapt the story’s course in response to the counterfactual event that diverges from the initial event. Premise: {test_data['premise']} Initial event: {test_data['initial']} Original ending: {test_data['original_ending']} Counterfactual event: {test_data['counterfactual']} Now, generate the adapted ending: This prompt is design...

[1] [1]

Angeles,Dictionary of Philosophy

A. Angeles,Dictionary of Philosophy. Barnes & Noble Books, 1981

1981

[2] [2]

Informal logic and the theory of reasoning,

M. A. Finocchiaro, “Informal logic and the theory of reasoning,”Informal Logic, vol. 6, no. 2, 1984

1984

[3] [3]

Critical thinking as argument analysis,

T. Govier, “Critical thinking as argument analysis,”Argumentation, vol. 3, no. 2, pp. 115–126, 1989

1989

[4] [4]

Kahneman,Thinking, fast and slow

D. Kahneman,Thinking, fast and slow. Macmillan, 2011

2011

[5] [5]

P. C. Wason and P. N. Johnson-Laird,Psychology of Reasoning: Structure and Content, vol. 86. Harvard University Press, 1972

1972

[6] [6]

Philosophical reasoning,

J. A. Passmore, “Philosophical reasoning,” 1961

1961

[7] [7]

Pearl and D

J. Pearl and D. Mackenzie,The Book of Why: The New Science of Cause and Effect. Basic Books, 2018

2018

[8] [8]

Can large language models infer causation from correlation?,

Z. Jin, J. Liu, Z. Lyu, S. Poff, M. Sachan, R. Mihalcea, M. Diab, and B. Schölkopf, “Can large language models infer causation from correlation?,” 2024

2024

[9] [9]

Investigating causal under- standing in llms,

M. Hobbhahn, T. Lieberum, and D. Seiler, “Investigating causal under- standing in llms,”NeurIPS ML Safety Workshop, 2022

2022

[10] [10]

Towards understanding how machines can learn causal overhypotheses,

E. Kosoy, D. M. Chan, A. Liu, J. Collins, B. Kaufmann,et al., “Towards understanding how machines can learn causal overhypotheses,”arXiv preprint arXiv:2206.10591, 2022

work page arXiv 2022

[11] [11]

CommonGen: A constrained text generation challenge for generative commonsense reasoning,

B. Y . Lin, W. Zhou, M. Shen, P. Zhou, C. Bhagavatula, Y . Choi, and X. Ren, “CommonGen: A constrained text generation challenge for generative commonsense reasoning,” inFindings of the Association for Computational Linguistics: EMNLP 2020(T. Cohn, Y . He, and Y . Liu, eds.), (Online), pp. 1823–1840, Association for Computational Linguistics, Nov. 2020

2020

[12] [12]

Counterfactual story reasoning and generation,

L. Qin, A. Bosselut, A. Holtzman, C. Bhagavatula, E. Clark, and Y . Choi, “Counterfactual story reasoning and generation,” inProceedings of EMNLP-IJCNLP, 2019

2019

[13] [13]

Unsupervised editing for counterfactual stories,

J. Chen, C. Gan, S. Cheng, H. Zhou, Y . Xiao, and L. Li, “Unsupervised editing for counterfactual stories,” inProceedings of the AAAI Conference on Artificial Intelligence, 2022

2022

[14] [14]

Sketch and customize: A counterfactual story generator,

Y . Hao, X. Lin, X. Zhou, and M. Huang, “Sketch and customize: A counterfactual story generator,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 11684–11692, 2021

2021

[15] [15]

A causal approach for counterfactual reasoning in narratives,

X. Mu and Q. Li, “A causal approach for counterfactual reasoning in narratives,” inProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), 2024

2024

[16] [16]

Beyond what if: Advancing counterfactual text generation with structural causal modeling,

X. Wang, Y . Zhang, J. Wu, and W. X. Zhao, “Beyond what if: Advancing counterfactual text generation with structural causal modeling,” inProceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 2024

2024

[17] [17]

Bartscore: Evaluating generated text as text generation,

W. Yuan, G. Neubig, and P. Liu, “Bartscore: Evaluating generated text as text generation,”NeurIPS, 2021

2021

[18] [18]

Differentiable expected bleu for text generation,

W. Wang, Z. Hu, Z. Yang, H. Shi, and E. P. Xing, “Differentiable expected bleu for text generation,” inInternational Conference on Learning Representations, 2019

2019

[19] [19]

BERTTune: Fine-tuning neural machine translation with BERTScore,

I. Jauregi Unanue, J. Parnell, and M. Piccardi, “BERTTune: Fine-tuning neural machine translation with BERTScore,” inProceedings of the 59th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 915–924, 2021

2021

[20] [20]

Don’t take it literally: An edit-invariant sequence loss for text generation,

S. Liu, Y . Wang, N. Goyal, J. Wei, and G. Neubig, “Don’t take it literally: An edit-invariant sequence loss for text generation,”arXiv preprint arXiv:2205.12684, 2022

work page arXiv 2022

[21] [21]

Simple statistical gradient-following algorithms for connectionist reinforcement learning,

R. J. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning,”Machine Learning, vol. 8, pp. 229– 256, 1992

1992

[22] [22]

Harnessing the power of llms in practice: A survey on chatgpt and beyond,

J. Yang, H. Jin, R. Tang, X. Han, Q. Feng, H. Jiang, S. Zhong, B. Yin, and X. Hu, “Harnessing the power of llms in practice: A survey on chatgpt and beyond,”ACM Transactions on Knowledge Discovery from Data, vol. 18, no. 6, pp. 1–32, 2024

2024

[23] [23]

Bleu: a method for automatic evaluation of machine translation,

K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: a method for automatic evaluation of machine translation,” inProceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), (Philadelphia, PA, USA), pp. 311–318, Association for Computational Linguistics, 2002

2002

[24] [24]

ROUGE: A package for automatic evaluation of summaries,

C.-Y . Lin, “ROUGE: A package for automatic evaluation of summaries,” inText Summarization Branches Out: Proceedings of the ACL-04 Workshop, (Barcelona, Spain), pp. 74–81, 2004

2004

[25] [25]

Re-evaluating the role of BLEU in machine translation research,

C. Callison-Burch, M. Osborne, and P. Koehn, “Re-evaluating the role of BLEU in machine translation research,” inProceedings of the 11th 9 conference of the European chapter of the Association for Computational Linguistics, pp. 249–256, 2006

2006

[26] [26]

Why we need new evaluation metrics for NLG,

J. Novikova, O. Dušek, and V . Rieser, “Why we need new evaluation metrics for NLG,” inProceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 2241–2252, 2017

2017

[27] [27]

Meteor: An automatic metric for mt evaluation with improved correlation with human judgments,

S. Banerjee and A. Lavie, “Meteor: An automatic metric for mt evaluation with improved correlation with human judgments,” inProceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pp. 65–72, 2005

2005

[28] [28]

COMET: A neural framework for MT evaluation,

R. Rei, A. Farinha, A. Lavie, and L. Specia, “COMET: A neural framework for MT evaluation,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 2685–2702, 2020

2020

[29] [29]

BERTScore: Evaluating text generation with BERT,

T. Zhang, V. Kishore, F. Wu, K. Weinberger, and Y. Artzi, “BERTScore: Evaluating text generation with BERT,” in International Conference on Learning Representations (ICLR), 2020

2020

[30] [30]

BLEURT: Learning robust metrics for text generation,

T. Sellam, D. Das, and A. Parikh, “BLEURT: Learning robust metrics for text generation,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 7881–7892, 2020

2020

[31] [31]

Neural machine translation by jointly learning to align and translate,

D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” in 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings (Y. Bengio and Y. LeCun, eds.), 2015

2015

[32] [32]

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

R. Rafailov, A. Sharma, E. Mitchell, S. Ermon, C. D. Manning, and C. Finn, “Direct preference optimization: Your language model is secretly a reward model,” CoRR, vol. abs/2305.18290, 2023

2023

[33] [33]

Contrastive preference optimization: Pushing the boundaries of LLM performance in machine translation,

H. Xu, A. Sharaf, Y. Chen, W. Tan, L. Shen, B. Van Durme, K. Murray, and Y. J. Kim, “Contrastive preference optimization: Pushing the boundaries of LLM performance in machine translation,” arXiv preprint arXiv:2401.08417, 2024

2024

[34] [34]

Training objectives and evaluation metrics for counterfactual story rewriting,

A. Girard, I. J. Unanue, and M. Piccardi, “Training objectives and evaluation metrics for counterfactual story rewriting,” ACM Transactions on Asian and Low-Resource Language Information Processing, pp. 1–11, 2026

2026

[35] [35]

Learning transferable visual models from natural language supervision,

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervision,” in Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event (M. Meila and T. Zhang, eds.), ...

2021

[36] [36]

Categorical reparameterization with Gumbel-softmax,

E. Jang, S. Gu, and B. Poole, “Categorical reparameterization with Gumbel-softmax,” in 5th International Conference on Learning Representations, ICLR 2017, 2017

2017

[37] [37]

XVD: cross-vocabulary differentiable training for generative adversarial attacks,

T. Roth, I. J. Unanue, A. Abuadbba, and M. Piccardi, “XVD: cross-vocabulary differentiable training for generative adversarial attacks,” in Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC/COLING 2024, pp. 17753–17763, 2024

2024

[38] [38]

Backpropagation through the void: Optimizing control variates for black-box gradient estimation,

W. Grathwohl, D. Choi, Y. Wu, G. Roeder, and D. Duvenaud, “Backpropagation through the void: Optimizing control variates for black-box gradient estimation,” in 6th International Conference on Learning Representations, ICLR 2018, 2018

2018

[39] [39]

R. S. Sutton and A. G. Barto, Reinforcement learning - an introduction. MIT Press, 1998

1998

[40] [40]

Abductive commonsense reasoning,

C. Bhagavatula, R. Le Bras, C. Malaviya, K. Sakaguchi, A. Holtzman, H. Rashkin, D. Downey, W.-t. Yih, and Y. Choi, “Abductive commonsense reasoning,” in Proceedings of the Eighth International Conference on Learning Representations (ICLR), 2020. OpenReview ID: Byg1v1HKDB

2020

[41] [41]

A corpus and cloze evaluation for deeper understanding of commonsense stories,

N. Mostafazadeh, N. Chambers, X. He, D. Parikh, D. Batra, L. Vanderwende, P. Kohli, and J. Allen, “A corpus and cloze evaluation for deeper understanding of commonsense stories,” in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 839–849, 2016

2016

[42] [42]

BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension

M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer, “BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension,” arXiv preprint arXiv:1910.13461, 2019

2019

[43] [43]

A call for clarity in reporting BLEU scores,

M. Post, “A call for clarity in reporting BLEU scores,” in Proceedings of the Third Conference on Machine Translation: Research Papers (O. Bojar, R. Chatterjee, C. Federmann, M. Fishel, Y. Graham, B. Haddow, M. Huck, A. J. Yepes, P. Koehn, C. Monz, M. Negri, A. Névéol, M. Neves, M. Post, L. Specia, M. Turchi, and K. Verspoor, eds.), (Brussels, Belgium), pp...

2018

[44] [44]

Back to the future: Unsupervised backprop-based decoding for counterfactual and abductive commonsense reasoning,

L. Qin, V. Shwartz, P. West, C. Bhagavatula, J. D. Hwang, R. Le Bras, A. Bosselut, and Y. Choi, “Back to the future: Unsupervised backprop-based decoding for counterfactual and abductive commonsense reasoning,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (B. Webber, T. Cohn, Y. He, and Y. Liu, eds.)...

2020

[45] [45]

The hitchhiker’s guide to testing statistical significance in natural language processing,

R. Dror, G. Baumer, S. Shlomov, and R. Reichart, “The hitchhiker’s guide to testing statistical significance in natural language processing,” in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (I. Gurevych and Y. Miyao, eds.), (Melbourne, Australia), pp. 1383–1392, Association for Computationa...

2018

[46] [46]

Proximal Policy Optimization Algorithms

J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” CoRR, vol. abs/1707.06347, 2017

2017

[47] [47]

Free vs paid ChatGPT token limits: A comprehensive guide,

Tactiq, “Free vs paid ChatGPT token limits: A comprehensive guide.” Accessed: 2024-07-18.

APPENDIX A DIRECT AND CONTRASTIVE PREFERENCE OPTIMIZATION

Rafailov et al. in [32] reformulated reinforcement learning from human feedback (RLHF) as a pairwise differentiable training objective. Given an input, x, and two candidate outputs, a preferred output y_w and a dispreferred output y_l (informally referred to as “winner” and “loser”, hen...
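As an illustration of the pairwise objective of [32], the sketch below computes the standard DPO loss for a single (winner, loser) pair from sequence log-probabilities. It is a minimal sketch, not the authors' implementation: the function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions introduced here, and the log-probabilities are assumed to be summed over the tokens of each output under the trained policy and under a frozen reference model.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (winner, loser) pair.

    logp_w, logp_l: log-probabilities of the preferred and dispreferred
        outputs under the policy being trained.
    ref_logp_w, ref_logp_l: the same quantities under the frozen reference model.
    beta: scaling factor (the value 0.1 is an illustrative assumption).
    """
    # Scaled difference of the policy-to-reference log-ratios of winner and loser.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Example: the policy favors the winner more than the reference does,
# so the loss falls below log(2), its value when the margin is zero.
example = dpo_loss(logp_w=-1.0, logp_l=-5.0, ref_logp_w=-2.0, ref_logp_l=-2.0)
```

The loss decreases as the policy raises the winner's likelihood relative to the loser's, measured against the reference model, which is what makes the objective differentiable without sampling-based reinforcement learning.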

The key difference between our proposed objectives and CPO’s is that we directly optimize a desirable evaluation metric rather than a generic likelihood, increasing the chances that our trained model will perform in line with the expectations of the targeted task.

APPENDIX B MODEL HYPERPARAMETERS

All our experiments have used the pretrained facebook/bart-large-cnn model as the base model...

Minimal Intervention: Adjust the story’s original ending with minimal changes needed to align it with the counterfactual event. The edited ending should remain as close as possible to the original ending.

Narrative Insight: Understand the story structure and make changes essential for maintaining the story’s coherence and thematic consistency, avoiding unnecessary alterations.

Counterfactual Adaptability: Adapt the story’s course in response to the counterfactual event that diverges from the initial event.

Premise: {test_data['premise']}
Initial event: {test_data['initial']}
Original ending: {test_data['original_ending']}
Counterfactual event: {test_data['counterfactual']}
Now, generate the adapted ending:

This prompt is design...
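The part of the prompt template shown above can be assembled as in the following sketch. It is illustrative only: the helper name `build_prompt` and the `CRITERIA` constant are hypothetical, any opening instructions of the prompt that are not shown in this section are omitted, and only the `test_data` keys that appear in the template are used.

```python
# Hypothetical helper: assembles only the portion of the prompt visible in this appendix.
CRITERIA = (
    "Minimal Intervention: Adjust the story's original ending with minimal changes "
    "needed to align it with the counterfactual event. The edited ending should "
    "remain as close as possible to the original ending.",
    "Narrative Insight: Understand the story structure and make changes essential "
    "for maintaining the story's coherence and thematic consistency, avoiding "
    "unnecessary alterations.",
    "Counterfactual Adaptability: Adapt the story's course in response to the "
    "counterfactual event that diverges from the initial event.",
)

def build_prompt(test_data: dict) -> str:
    """Fill the template fields from one test example and return the prompt text."""
    fields = (
        f"Premise: {test_data['premise']}",
        f"Initial event: {test_data['initial']}",
        f"Original ending: {test_data['original_ending']}",
        f"Counterfactual event: {test_data['counterfactual']}",
    )
    return "\n".join(CRITERIA + fields) + "\nNow, generate the adapted ending:"

# Usage with a made-up example record (not taken from any dataset).
prompt = build_prompt({
    "premise": "P",
    "initial": "I",
    "original_ending": "O",
    "counterfactual": "C",
})
```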