arxiv: 2605.24885 · v1 · pith:3AQO5H76new · submitted 2026-05-24 · 💻 cs.CL

DTO: a Differentiable Training Objective for Effective Counterfactual Story Rewriting

Pith reviewed 2026-06-30 12:18 UTC · model grok-4.3

classification 💻 cs.CL
keywords counterfactual story rewriting · differentiable training objective · transformer fine-tuning · narrative consistency · localized edits · TimeTravel dataset · ART dataset

The pith

A joint differentiable loss on rewrite fidelity and narrative consistency improves localized edits in counterfactual story rewriting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a differentiable training objective for updating a story to reflect a new event while leaving unaffected parts intact. Standard maximum-likelihood training tends to miss the small required changes, and reinforcement learning approaches are slow and difficult to set up. The method fine-tunes a transformer end-to-end by backpropagating a loss that simultaneously rewards matching a reference rewrite and preserving semantic consistency with the original story. Evaluation on the TimeTravel and ART datasets shows that the approach outperforms a maximum-likelihood baseline and a preference-based method, and performs competitively with two large language models on all reported metrics. A sympathetic reader would care because it offers a simpler, fully differentiable route to controlled text changes without complex reinforcement learning.

Core claim

The central claim is that a transformer model fine-tuned via end-to-end backpropagation against a fully differentiable loss function that jointly rewards fidelity to the reference rewrite and semantic consistency with the source narrative produces better counterfactual story rewrites than maximum-likelihood or preference-based training and remains competitive with contemporary large language models on the TimeTravel and ART datasets.

What carries the argument

The DTO loss, a fully differentiable objective combining a fidelity term to the reference rewrite with a semantic consistency term to the source narrative, optimized through backpropagation during fine-tuning.
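
The abstract does not give the functional form of the loss, so the following is only a minimal sketch of what such a joint objective could look like, using the token-level cross-entropy and embedding cosine similarity named in the simulated rebuttal. The function `joint_loss`, the additive weighting by `lam`, and the probability-weighted "soft" sentence vector are all assumptions made for illustration, not the paper's definitions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one decoding step."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def joint_loss(step_logits, reference_ids, embeddings, source_vec, lam=0.5):
    """Hypothetical joint loss: fidelity + lam * consistency (assumed form)."""
    probs = [softmax(logits) for logits in step_logits]
    # Fidelity term: mean token-level cross-entropy against the reference rewrite.
    fidelity = -sum(math.log(p[t]) for p, t in zip(probs, reference_ids)) / len(reference_ids)
    # Soft sentence vector: probability-weighted average of token embeddings.
    # No argmax or sampling is used, so every operation stays differentiable.
    dim = len(embeddings[0])
    soft = [0.0] * dim
    for p in probs:
        for tok, weight in enumerate(p):
            for d in range(dim):
                soft[d] += weight * embeddings[tok][d] / len(probs)
    # Consistency term: cosine distance to a vector representing the source story.
    consistency = 1.0 - cosine(soft, source_vec)
    return fidelity + lam * consistency
```

In a real training setup the same computation would be expressed in an automatic-differentiation framework so that gradients flow from both terms back into the model parameters; the plain-Python version only shows which quantities are combined.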

Load-bearing premise

The joint loss will steer the model toward precisely localized edits, and the model will not discover unintended shortcuts that score well on both terms while failing to produce the intended story changes.

What would settle it

A test set where models trained with the DTO loss achieve high fidelity and consistency scores yet still alter story elements that the reference rewrites leave unchanged.
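
One way to make this check concrete is to count how often a model output alters tokens of the original ending that the reference rewrite leaves untouched. The sketch below does this with a word-level diff; the function names and the metric itself (`unneeded_edit_rate`) are hypothetical and are not evaluation measures taken from the paper.

```python
from difflib import SequenceMatcher

def changed_positions(original, rewrite):
    """Indices of tokens in `original` that `rewrite` replaces or deletes."""
    changed = set()
    matcher = SequenceMatcher(a=original, b=rewrite, autojunk=False)
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        # Pure insertions touch no original token, so they are not counted here.
        if tag in ("replace", "delete"):
            changed.update(range(i1, i2))
    return changed

def unneeded_edit_rate(original, reference, output):
    """Fraction of original tokens the reference keeps but the output alters."""
    orig = original.split()
    kept_by_reference = set(range(len(orig))) - changed_positions(orig, reference.split())
    altered_by_output = changed_positions(orig, output.split())
    if not kept_by_reference:
        return 0.0
    return len(kept_by_reference & altered_by_output) / len(kept_by_reference)
```

A model whose outputs score well on fidelity and consistency but show a high rate on a check of this kind would be evidence that the premise above does not hold.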

Figures

Figures reproduced from arXiv: 2605.24885 by Amelia Girard, Massimo Piccardi.

Figure 1: An example of the training loss for the score in Equation …

Original abstract

Counterfactual story rewriting is a natural language processing task that requires updating an existing story to reflect a chosen alternative event, yet preserving all the unaffected storyline elements and overall coherence. While large language models have recently made remarkable progress on this task, it still remains challenging since the required modifications are typically very small in size and highly localized. As a consequence, models trained in a conventional manner with the maximum-likelihood training objective tend to overlook these nuances. At the same time, more sophisticated training approaches based on reinforcement learning are notoriously slow and difficult to set up. For these reasons, our paper proposes a novel, differentiable training objective (DTO) that directly optimizes for the requisite counterfactual improvements. In our approach, a transformer model is fine-tuned via end-to-end backpropagation against a fully differentiable loss function that jointly rewards (i) fidelity to the reference rewrite and (ii) semantic consistency with the source narrative. The empirical evaluation on the TimeTravel and ART datasets shows that the proposed DTO approach has been able to surpass a maximum-likelihood baseline and a preference-based approach, and perform competitively against two contemporary large language models in all evaluation metrics. These findings substantiate the effectiveness of task-specific differentiable objectives for nuanced, controlled text-generation tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper proposes a differentiable training objective (DTO) for counterfactual story rewriting. A transformer is fine-tuned end-to-end via backpropagation on a loss that jointly rewards fidelity to reference rewrites and semantic consistency with the source narrative. On the TimeTravel and ART datasets the method is reported to surpass MLE and preference-based baselines while remaining competitive with two contemporary LLMs across all evaluation metrics.

Significance. If the empirical results prove robust under proper statistical controls, the work would illustrate that task-specific differentiable objectives can serve as a practical, faster alternative to reinforcement learning for controlled text-generation problems that demand highly localized edits. This could encourage further exploration of custom losses in nuanced NLP generation tasks.
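
As an illustration of what "proper statistical controls" could involve, the sketch below implements a paired bootstrap comparison of two systems scored on the same test items. It is one common choice among several, the report does not say which test it has in mind, and the function name and defaults are assumptions.

```python
import random

def paired_bootstrap(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Fraction of resamples in which system A does not beat system B on total score."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same test items")
    rng = random.Random(seed)
    n = len(scores_a)
    # Per-item differences keep the pairing between the two systems.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    not_better = 0
    for _ in range(n_resamples):
        # Resample test items with replacement and compare on the resample.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        if sum(sample) <= 0:
            not_better += 1
    return not_better / n_resamples
```

A small returned value suggests that the observed advantage of system A is unlikely to be an artifact of which test items happened to be drawn.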

major comments (1)
  1. [Abstract] The loss is described only qualitatively as jointly rewarding "(i) fidelity to the reference rewrite and (ii) semantic consistency with the source narrative" with no explicit functional form, definition of either term, balancing coefficient, or normalization procedure supplied. This formulation is load-bearing for the central claim that the objective is fully differentiable and produces the stated performance gains.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment. We address the single major point below and will update the abstract in the revised manuscript.

Point-by-point responses
  1. Referee: [Abstract] The loss is described only qualitatively as jointly rewarding "(i) fidelity to the reference rewrite and (ii) semantic consistency with the source narrative" with no explicit functional form, definition of either term, balancing coefficient, or normalization procedure supplied. This formulation is load-bearing for the central claim that the objective is fully differentiable and produces the stated performance gains.

    Authors: We agree that the abstract presents the DTO loss at a high level. The explicit functional form (including the fidelity term based on token-level cross-entropy to the reference rewrite, the semantic consistency term implemented via a differentiable sentence-embedding cosine similarity, the balancing coefficient λ, and the normalization) is fully specified in Section 3.2 together with the end-to-end differentiability proof. To strengthen the abstract, we will insert a single sentence that names the two loss components, states that they are combined with a scalar λ, and notes that both are differentiable, thereby making the central claim self-contained while preserving the abstract’s length constraints.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity

The paper proposes DTO as an explicitly constructed differentiable loss that jointly optimizes fidelity to external reference rewrites and semantic consistency with the source narrative. The central claim is an empirical result (outperformance on TimeTravel and ART) obtained by training a transformer with this loss; no derivation, uniqueness theorem, or prediction is asserted that reduces by construction to fitted parameters, self-citations, or renamed inputs. The objective is defined against independent external measures, making the reported results testable and non-circular.

Axiom & Free-Parameter Ledger

1 free parameters · 0 axioms · 0 invented entities

The approach rests on the unstated premise that a linear or additive combination of two external reward signals is both sufficient and stable for gradient-based optimization of localized edits.

free parameters (1)
  • balancing coefficient between fidelity and consistency terms
    Any joint reward of the form alpha * fidelity + (1-alpha) * consistency requires choosing or fitting alpha; the abstract gives no value or selection procedure.
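
Written out, the generic form mentioned in this ledger entry and its gradient look as follows. The symbols F (fidelity), C (consistency), and θ (model parameters) are placeholders chosen here for illustration, and the sign convention (loss as negated reward) is an assumption; none of this is notation taken from the paper.

```latex
\mathcal{L}_{\alpha}(\theta) = -\bigl[\alpha\,F(\theta) + (1-\alpha)\,C(\theta)\bigr],
\qquad 0 \le \alpha \le 1,
\\[4pt]
\nabla_{\theta}\mathcal{L}_{\alpha}(\theta)
  = -\bigl[\alpha\,\nabla_{\theta}F(\theta) + (1-\alpha)\,\nabla_{\theta}C(\theta)\bigr].
```

At α = 1 the consistency term contributes no gradient, and at α = 0 the fidelity term contributes none, so the value of α directly sets how much each signal shapes every parameter update. This is why an unreported α counts as a free parameter.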

