pith. machine review for the scientific record.

arxiv: 2605.04920 · v1 · submitted 2026-05-06 · 💻 cs.LG · cs.CL

Recognition: 2 theorem links

Reinforcement Learning for Compositional Generalization with Outcome-Level Optimization

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 18:12 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords compositional generalization · reinforcement learning · outcome-level optimization · policy optimization · language models · supervised fine-tuning · generalization benchmarks

The pith

Outcome-level reinforcement learning improves compositional generalization over supervised fine-tuning by optimizing whole outputs rather than imitating token-by-token targets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Compositional generalization means correctly handling novel combinations of familiar elements, which current supervised training struggles with because it rewards exact token imitation. The paper tests whether shifting to reinforcement learning that scores only the final result can capture the needed global structure instead. Using Group Relative Policy Optimization, models receive either a simple binary reward for correct answers or a composite reward that adds composition signals. Experiments across benchmarks show clear gains, with analysis indicating that supervised models overfit common training patterns while the reinforcement approach redistributes probability mass toward more complex unseen forms.

Core claim

The paper establishes that outcome-level optimization with Group Relative Policy Optimization, using binary or composite rewards on final outputs, produces stronger compositional generalization than token-level supervised fine-tuning; the improvement arises because reinforcement learning reshapes the output distribution away from frequent training compositions and toward more complex novel ones.

What carries the argument

Group Relative Policy Optimization applied at the outcome level, where a reward signal evaluates the correctness or compositional quality of the complete generated response.
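The group-relative mechanics can be sketched as follows: a minimal illustration assuming a binary exact-match reward and the standard mean/std group normalization. Function names and the toy SCAN-style sequences are hypothetical, not the paper's code.

```python
# Sketch of outcome-level scoring in the spirit of GRPO (hypothetical helper
# names; the paper's exact formulation may differ). For each input, K candidate
# outputs are sampled, each receives a scalar reward on the complete output,
# and advantages are normalized within the sampled group.

def binary_outcome_reward(candidate: str, gold: str) -> float:
    """1.0 iff the complete output exactly matches the gold sequence."""
    return 1.0 if candidate.strip() == gold.strip() else 0.0

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalize rewards within the group: A_i = (r_i - mean) / std."""
    k = len(rewards)
    mean = sum(rewards) / k
    std = (sum((r - mean) ** 2 for r in rewards) / k) ** 0.5
    if std == 0.0:  # all candidates equally good or bad: no learning signal
        return [0.0] * k
    return [(r - mean) / std for r in rewards]

# Toy group of K=4 sampled action sequences for one instruction.
candidates = ["JUMP JUMP LTURN", "JUMP LTURN", "JUMP JUMP LTURN", "LTURN"]
gold = "JUMP JUMP LTURN"
rewards = [binary_outcome_reward(c, gold) for c in candidates]
advantages = group_relative_advantages(rewards)
```

Because the reward is computed only on the finished sequence, no token-level target enters the update; correct candidates are pushed up and incorrect ones down relative to their own group.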

If this is right

  • Supervised fine-tuning models overfit to frequent compositions seen during training.
  • Reinforcement learning reshapes output distributions to favor more complex composition types.
  • Both binary outcome rewards and composite rewards that add explicit composition feedback yield measurable gains.
  • The advantage of outcome-level optimization holds across multiple standard compositional generalization benchmarks.
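The binary versus composite contrast in the bullets above can be sketched as follows. This is a hypothetical composite reward: the weight `alpha` and the primitive-overlap term are assumptions for illustration, not the paper's definition.

```python
# Hypothetical composite reward: exact-match outcome signal plus a
# primitive-level overlap term computed on the completed output.
# The weighting (alpha) is illustrative, not taken from the paper.

def primitive_overlap(candidate: str, gold: str) -> float:
    """Fraction of gold primitives (tokens) recovered, order-insensitive."""
    cand, ref = candidate.split(), gold.split()
    if not ref:
        return 0.0
    matched, remaining = 0, list(cand)
    for tok in ref:
        if tok in remaining:
            remaining.remove(tok)
            matched += 1
    return matched / len(ref)

def composite_reward(candidate: str, gold: str, alpha: float = 0.5) -> float:
    exact = 1.0 if candidate.strip() == gold.strip() else 0.0
    return exact + alpha * primitive_overlap(candidate, gold)

r_exact = composite_reward("JUMP JUMP LTURN", "JUMP JUMP LTURN")
r_partial = composite_reward("JUMP LTURN", "JUMP JUMP LTURN")
```

The point of the composite variant is that a near-miss output still earns a graded signal for correct primitives, whereas the binary reward scores it identically to a total failure.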

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Outcome-level training may reduce reliance on training sets that exhaustively enumerate all possible combinations.
  • The same reward-based reshaping could apply to other structural generalization problems such as mathematical or logical reasoning.
  • Comparing token-level versus outcome-level signals points to a broader design choice in how generative models explore solution spaces.

Load-bearing premise

The chosen binary or composite rewards on final outputs supply an unbiased and sufficient signal for global compositional structure, and the observed gains are not driven by new failure modes or hidden tuning effects.

What would settle it

A new compositional benchmark on which outcome-level reinforcement learning produces no accuracy gain over supervised fine-tuning and shows no measurable shift in output probabilities for complex composition types.

Figures

Figures reproduced from arXiv: 2605.04920 by Wei Liu, Xiyan Fu.

Figure 1: Illustration of compositional generalization and training paradigms. Top: Example of compositional generalization where the model must correctly compose previously seen primitives (e.g., jump twice and turn left) to produce the correct action sequence for a novel instruction. Bottom: Comparison of training signals. Token-level optimization relies on supervised targets and cross-entropy loss, whereas outc…

Figure 2: Overview of Compositional Group Relative Policy Optimization. The LLM samples a group of candidate label sequences (y1, y2, …, yK) based on inputs. Each candidate is then evaluated using two complementary reward signals: (i) a binary reward that measures exact match with the gold sequence, and (ii) a compositional reward that assesses primitive correctness and structural composition patterns. Rewar…

Figure 3: Average training trigram frequency of incorrect predictions from SFT and GRPO. Trigram frequencies are computed with respect to the training data, excluding trigrams appearing in ground-truth outputs.

Figure 4: Performance on the SCAN-Length split across various target output lengths. Examples are grouped into bins by output length.
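The copying-behavior analysis summarized in the Figure 3 caption can be approximated as follows; the data structures and function names are assumed for illustration, not taken from the authors' code.

```python
# Sketch of the trigram-frequency analysis: count trigram frequencies over
# the training outputs, then average the training frequency of trigrams
# appearing in incorrect predictions, excluding trigrams that also occur
# in the ground-truth outputs.
from collections import Counter

def trigrams(seq: list[str]) -> list[tuple]:
    return [tuple(seq[i:i + 3]) for i in range(len(seq) - 2)]

def avg_training_trigram_freq(predictions, golds, train_outputs):
    """Mean training-set frequency of trigrams in incorrect predictions,
    excluding trigrams that appear in the gold outputs."""
    train_counts = Counter(t for out in train_outputs for t in trigrams(out))
    gold_trigrams = {t for g in golds for t in trigrams(g)}
    freqs = []
    for pred, gold in zip(predictions, golds):
        if pred == gold:
            continue  # only incorrect predictions enter the analysis
        freqs.extend(train_counts[t] for t in trigrams(pred)
                     if t not in gold_trigrams)
    return sum(freqs) / len(freqs) if freqs else 0.0
```

A high average under this measure would indicate that a model's errors recycle trigrams that were frequent in training, i.e. the copying behavior the analysis attributes to SFT.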
read the original abstract

Compositional generalization refers to correctly interpreting novel combinations of known primitives, which remains a major challenge. Existing approaches often rely on supervised fine-tuning, which encourages models to imitate target outputs. This token-level training paradigm fails to capture the global compositional structure required for generalizing to unseen combinations. In this work, we investigate whether compositional generalization can instead be improved through outcome-level reinforcement learning. We adopt Group Relative Policy Optimization to optimize models based on feedback on their final outputs. Within this framework, we explore both a simple binary outcome reward and a composite reward that provides additional composition feedback. Experiments on multiple compositional benchmarks show that reinforcement learning improves compositional generalization compared to supervised fine-tuning. Further analysis reveals that supervised models tend to overfit frequent training compositions, whereas reinforcement learning improves compositional generalization by reshaping the output distribution, particularly for more complex composition types.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. The paper claims that outcome-level reinforcement learning via Group Relative Policy Optimization (GRPO) improves compositional generalization over supervised fine-tuning on multiple benchmarks. It explores binary outcome rewards and composite rewards with additional composition feedback, arguing that RL reshapes output distributions to reduce overfitting to frequent training compositions and better handle complex unseen combinations.

Significance. If the results hold after addressing comparison confounds, the work would provide evidence that shifting from token-level imitation to outcome-level optimization can better capture global compositional structure, offering a practical alternative to SFT for generalization challenges in language models.

major comments (1)
  1. [Abstract and Methods] Abstract and reward definition (likely §3): The composite reward 'provides additional composition feedback' while the SFT baseline uses only token-level imitation. If the composition signal derives from ground-truth parses, trees, or primitive-level correctness (standard in these benchmarks), the RL policy receives explicit global structure supervision absent from SFT. This makes it impossible to attribute gains specifically to outcome-level RL or GRPO without an ablation that supplies equivalent composition signals to an SFT model (e.g., via auxiliary loss). The analysis of output-distribution reshaping and complex-composition gains inherits the same ambiguity.
minor comments (3)
  1. [Methods] Provide the exact mathematical definition of the composite reward and how it is computed from model outputs versus ground truth.
  2. [Experiments] Include full experimental details: number of runs, statistical tests, error bars, and hyperparameter sensitivity for GRPO and reward weighting.
  3. [Results] Clarify whether the binary reward alone (without composite) already outperforms SFT, to separate the effect of outcome-level optimization from the richer reward.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the thoughtful and constructive feedback, particularly the identification of a potential confound in comparing our RL setups to SFT. We address this concern directly and outline targeted revisions to improve clarity without overstating our claims.

read point-by-point responses
  1. Referee: [Abstract and Methods] Abstract and reward definition (likely §3): The composite reward 'provides additional composition feedback' while the SFT baseline uses only token-level imitation. If the composition signal derives from ground-truth parses, trees, or primitive-level correctness (standard in these benchmarks), the RL policy receives explicit global structure supervision absent from SFT. This makes it impossible to attribute gains specifically to outcome-level RL or GRPO without an ablation that supplies equivalent composition signals to an SFT model (e.g., via auxiliary loss). The analysis of output-distribution reshaping and complex-composition gains inherits the same ambiguity.

    Authors: We agree this is an important point for causal attribution. Our binary outcome reward is computed exclusively from final-output correctness (task success or exact match against the benchmark target), with no explicit primitive, parse, or composition signals supplied to the policy. The composite reward adds terms that score compositional elements in the completed output using the same ground-truth evaluation functions already used for benchmark scoring; these are still post-hoc outcome verifiers rather than injected structure or auxiliary targets during generation. Nevertheless, we acknowledge that the composite signal is richer than pure binary outcome feedback and could partially explain some gains. To address the concern, we will revise the abstract and §3 to provide precise mathematical definitions of both rewards, foreground the binary-reward results as the core evidence for outcome-level optimization, and expand the analysis section to show that distribution reshaping and complex-composition improvements are already visible under the binary reward alone. We will also add a brief discussion of the suggested auxiliary-loss ablation on SFT as a limitation and direction for future work (or include a preliminary version if time permits during revision). revision: partial

Circularity Check

0 steps flagged

No circularity: purely empirical comparison without derivations or self-referential reductions

full rationale

The paper reports experimental results on compositional benchmarks, comparing outcome-level RL (with binary or composite rewards) against supervised fine-tuning. No equations, first-principles derivations, or predictions are claimed; the central claims rest on observed performance differences and post-hoc analysis of output distributions. The composite reward is described as providing 'additional composition feedback,' but this is an explicit design choice in the experimental setup rather than a hidden self-definition or fitted quantity renamed as a result. No self-citations are invoked to justify uniqueness or load-bearing premises, and the work does not reduce any finding to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no explicit free parameters, axioms, or invented entities are stated. Standard RL hyperparameters (learning rate, group size, reward scaling) are implicitly present but not detailed.

pith-pipeline@v0.9.0 · 5431 in / 1094 out tokens · 38460 ms · 2026-05-08T18:12:06.437214+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

300 extracted references · 180 canonical work pages · 7 internal anchors

  1. [1] Connectionism and cognitive architecture: A critical analysis. Cognition, 1988.
  2. [2] Building machines that learn and think like people. Behavioral and Brain Sciences, 2017.
  3. [3] Compositionality decomposed: How do neural networks generalise? Journal of Artificial Intelligence Research, 2020.
  4. [4] Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks. International Conference on Machine Learning, 2018.
  5. [5] Saparov, Abulhair; Pang, Richard Yuanzhe; Padmakumar, Vishakh; Joshi, Nitish; Kazemi, Mehran; Kim, Najoung; He, He. Testing the General Deductive Reasoning Capacity of Large Language Models Using… 2023.
  6. [6] Human-like systematic generalization through a meta-learning neural network. Nature, 2023.
  7. [7] Fu, Xiyan; Frank, Anette. Dynamic MOdularized Reasoning for Compositional Structured Explanation Generation. Proceedings of the 13th Joint Conference on Lexical and Computational Semantics (*SEM 2024), 2024.
  8. [8] Characterizing intrinsic compositionality in transformers with Tree Projections. The Eleventh International Conference on Learning Representations.
  9. [9] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv preprint arXiv:2501.12948.
  10. [10] Chu, Tianzhe; Zhai, Yuexiang; Yang, Jihan; Tong, Shengbang; Xie, Saining; Schuurmans, Dale; Le, Quoc V.; Levine, Sergey; Ma, Yi. 2025.
  11. [11] Understanding the Effects of RLHF on LLM Generalisation and Diversity. Proceedings of the International Conference on Learning Representations (ICLR).
  12. [12] Learning What Reinforcement Learning Can't: Interleaved Online Fine-Tuning for Hardest Questions. arXiv preprint arXiv:2506.07527, 2025.
  13. [13] RL Fine-Tuning Heals OOD Forgetting in SFT. arXiv preprint arXiv:2509.12235, 2025.
  14. [14] On Calibration of Modern Neural Networks. Proceedings of the 34th International Conference on Machine Learning, 2017.
  15. [15] Team OLMo; Walsh, Pete; Soldaini, Luca; Groeneveld, Dirk; Lo, Kyle; Arora, Shane; Bhagia, Akshita; Gu, Yuling; Huang, Shengyi; Jordan, Matt; Lambert, Nathan; Schwenk, Dustin; Tafjord, Oyvind; et al.
  16. [16] On the generalization of language models from in-context learning and finetuning: a controlled study. arXiv preprint arXiv:2505.00661.
  17. [17] Professor forcing: A new algorithm for training recurrent networks. Advances in Neural Information Processing Systems, 2016.
  18. [18] Hierarchical Poset Decoding for Compositional Generalization in Language. Advances in Neural Information Processing Systems (NeurIPS).
  19. [19] Shao, Zhihong; Wang, Peiyi; Zhu, Qihao; Xu, Runxin; Song, Junxiao; Zhang, Mingchuan; Li, Y. K.; Wu, Y.; Guo, Daya. CoRR, 2024.
  20. [20] Measuring Compositional Generalization: A Comprehensive Method on Realistic Data. International Conference on Learning Representations.
  21. [21] Team Qwen. Qwen2.5: A Party of Foundation Models.
  22. [22] The Llama 3 Herd of Models. arXiv preprint arXiv:2407.21783.
  23. [23] Ouyang, Long; Wu, Jeffrey; Jiang, Xu; Almeida, Diogo; Wainwright, Carroll; Mishkin, Pamela; Zhang, Chong; Agarwal, Sandhini; Slama, Katarina; Ray, Alex; Schulman, John; Hilton, Jacob; et al. Training language models to follow instructions with human feedback.
  24. [24] GPT-4 Technical Report. arXiv preprint arXiv:2303.08774.
  25. [25] Huang, Shulin; Ding, Yiran; Pan, Junshu; Zhang, Yue. Beyond English-Centric Training: How Reinforcement Learning Improves Cross-Lingual Reasoning in… 2026.
  26. [26] RAG-RL: Advancing Retrieval-Augmented Generation via RL and Curriculum Learning. arXiv preprint arXiv:2503.12759.
  27. [27] RL-finetuning LLMs from on- and off-policy data with a single algorithm. arXiv preprint arXiv:2503.19612.
  28. [28] Chu, Xiangxiang; Huang, Hailang; Zhang, Xiao; Wei, Fei; Wang, Yong. 2026.
  29. [29] Wei, Yuxiang; Duchenne, Olivier; Copet, Jade; Carbonneaux, Quentin; Zhang, Lingming; Fried, Daniel; Synnaeve, Gabriel; Singh, Rishabh; Wang, Sida. 2025.
  30. [30] Overfitting in adversarially robust deep learning. Proceedings of the 37th International Conference on Machine Learning, 2020.
  31. [31] DeepSeek LLM: Scaling Open-Source Language Models with Longtermism. arXiv preprint arXiv:2401.02954.
  32. [32] Tulu 3: Pushing Frontiers in Open Language Model Post-Training. Second Conference on Language Modeling.
  33. [33] Reinforcement Learning for Reasoning in Large Language Models with One Training Example. The Thirty-ninth Annual Conference on Neural Information Processing Systems.
  34. [34] Zhang, Wenhao; Xie, Yuexiang; Sun, Yuchang; Chen, Yanxi; Wang, Guoyin; Li, Yaliang; Ding, Bolin; Zhou, Jingren. On-Policy… 2026.
  35. [35] Fu, Xiyan; Frank, Anette. Exploring Continual Learning of Compositional Generalization in NLI. Transactions of the Association for Computational Linguistics, 2024. doi:10.1162/tacl_a_00680.

  36. [36]

    Proceedings of the Ninth Conference on Machine Translation. 2024. doi:10.18653/v1/2024.wmt-1.0

  37. [37]

    Findings of the WMT 24 General Machine Translation Shared Task: The LLM Era Is Here but MT Is Not Solved Yet

    Kocmi, Tom and Avramidis, Eleftherios and Bawden, Rachel and Bojar, Ond rej and Dvorkovich, Anton and Federmann, Christian and Fishel, Mark and Freitag, Markus and Gowda, Thamme and Grundkiewicz, Roman and Haddow, Barry and Karpinska, Marzena and Koehn, Philipp and Marie, Benjamin and Monz, Christof and Murray, Kenton and Nagata, Masaaki and Popel, Martin...

  38. [38]

    Are LLM s Breaking MT Metrics? Results of the WMT 24 Metrics Shared Task

    Freitag, Markus and Mathur, Nitika and Deutsch, Daniel and Lo, Chi-Kiu and Avramidis, Eleftherios and Rei, Ricardo and Thompson, Brian and Blain, Frederic and Kocmi, Tom and Wang, Jiayi and Adelani, David Ifeoluwa and Buchicchio, Marianna and Zerva, Chrysoula and Lavie, Alon. Are LLM s Breaking MT Metrics? Results of the WMT 24 Metrics Shared Task. Procee...

  39. [39]

    De Souza, Jos \'e G

    Zerva, Chrysoula and Blain, Frederic and C. De Souza, Jos\'e G. and Kanojia, Diptesh and Deoghare, Sourabh and Guerreiro, Nuno M. and Attanasio, Giuseppe and Rei, Ricardo and Orasan, Constantin and Negri, Matteo and Turchi, Marco and Chatterjee, Rajen and Bhattacharyya, Pushpak and Freitag, Markus and Martins, Andr\'e. Findings of the Quality Estimation S...

  40. [40]

    Findings of the WMT 2024 Shared Task of the Open Language Data Initiative

    Maillard, Jean and Burchell, Laurie and Anastasopoulos, Antonios and Federmann, Christian and Koehn, Philipp and Wang, Skyler. Findings of the WMT 2024 Shared Task of the Open Language Data Initiative. Proceedings of the Ninth Conference on Machine Translation. 2024. doi:10.18653/v1/2024.wmt-1.4

  41. [41]

    Results of the WAT / WMT 2024 Shared Task on Patent Translation

    Higashiyama, Shohei. Results of the WAT / WMT 2024 Shared Task on Patent Translation. Proceedings of the Ninth Conference on Machine Translation. 2024. doi:10.18653/v1/2024.wmt-1.5

  42. [42]

    Findings of the WMT 2024 Biomedical Translation Shared Task: Test Sets on Abstract Level

    Neves, Mariana and Grozea, Cristian and Thomas, Philippe and Roller, Roland and Bawden, Rachel and N\'ev\'eol, Aur\'elie and Castle, Steffen and Bonato, Vanessa and Di Nunzio, Giorgio Maria and Vezzani, Federica and Vicente Navarro, Maika and Yeganova, Lana and Jimeno Yepes, Antonio. Findings of the WMT 2024 Biomedical Translation Shared Task: Test Sets o...

  43. [43]

    MSLC 24 Submissions to the General Machine Translation Task

    Larkin, Samuel and Lo, Chi-Kiu and Knowles, Rebecca. MSLC 24 Submissions to the General Machine Translation Task. Proceedings of the Ninth Conference on Machine Translation. 2024. doi:10.18653/v1/2024.wmt-1.7

  44. [44]

    IOL Research Machine Translation Systems for WMT 24 General Machine Translation Shared Task

    Zhang, Wenbo. IOL Research Machine Translation Systems for WMT 24 General Machine Translation Shared Task. Proceedings of the Ninth Conference on Machine Translation. 2024. doi:10.18653/v1/2024.wmt-1.8

  45. [45]

    Choose the Final Translation from NMT and LLM Hypotheses Using MBR Decoding: HW - TSC `s Submission to the WMT 24 General MT Shared Task

    Wu, Zhanglin and Wei, Daimeng and Li, Zongyao and Shang, Hengchao and Guo, Jiaxin and Li, Shaojun and Rao, Zhiqiang and Luo, Yuanchang and Xie, Ning and Yang, Hao. Choose the Final Translation from NMT and LLM Hypotheses Using MBR Decoding: HW - TSC 's Submission to the WMT 24 General MT Shared Task. Proceedings of the Ninth Conference on Machine Translat...

  46. [46]

    C ycle GN : A Cycle Consistent Approach for Neural Machine Translation

    C ycle GN : A Cycle Consistent Approach for Neural Machine Translation. Proceedings of the Ninth Conference on Machine Translation. 2024. doi:10.18653/v1/2024.wmt-1.10

  47. [47]

    U v A - MT `s Participation in the WMT 24 General Translation Shared Task

    Tan, Shaomu and Stap, David and Aycock, Seth and Monz, Christof and Wu, Di. U v A - MT 's Participation in the WMT 24 General Translation Shared Task. Proceedings of the Ninth Conference on Machine Translation. 2024. doi:10.18653/v1/2024.wmt-1.11

  48. [48]

    Rei, Ricardo and Pombal, Jose and Guerreiro, Nuno M. and Alves, Jo\ ao and Martins, Pedro Henrique and Fernandes, Patrick and Wu, Helena and Vaz, Tania and Alves, Duarte and Farajian, Amin and Agrawal, Sweta and Farinhas, Antonio and C. De Souza, Jos\'e G. and Martins, Andr\'e. Tower v2: Unbabel- IST 2024 Submission for the General MT Shared Task. Proceed...

  49. [49]

    TSU HITS `s Submissions to the WMT 2024 General Machine Translation Shared Task

    Mynka, Vladimir and Mikhaylovskiy, Nikolay. TSU HITS 's Submissions to the WMT 2024 General Machine Translation Shared Task. Proceedings of the Ninth Conference on Machine Translation. 2024. doi:10.18653/v1/2024.wmt-1.13

  50. [50]

    Document-level Translation with LLM Reranking: Team- J at WMT 2024 General Translation Task

    Kudo, Keito and Deguchi, Hiroyuki and Morishita, Makoto and Fujii, Ryo and Ito, Takumi and Ozaki, Shintaro and Natsumi, Koki and Sato, Kai and Yano, Kazuki and Takahashi, Ryosuke and Kimura, Subaru and Hara, Tomomasa and Sakai, Yusuke and Suzuki, Jun. Document-level Translation with LLM Reranking: Team- J at WMT 2024 General Translation Task. Proceedings ...

  51. [51]

    DLUT and GTCOM `s Neural Machine Translation Systems for WMT 24

    Zong, Hao and Bei, Chao and Liu, Huan and Yuan, Conghu and Chen, Wentao and Huang, Degen. DLUT and GTCOM 's Neural Machine Translation Systems for WMT 24. Proceedings of the Ninth Conference on Machine Translation. 2024. doi:10.18653/v1/2024.wmt-1.15

  52. [52]

    CUNI at WMT 24 General Translation Task: LLM s, ( Q ) L o RA , CPO and Model Merging

    Hrabal, Miroslav and Jon, Josef and Popel, Martin and Luu, Nam and Semin, Danil and Bojar, Ond rej. CUNI at WMT 24 General Translation Task: LLM s, ( Q ) L o RA , CPO and Model Merging. Proceedings of the Ninth Conference on Machine Translation. 2024. doi:10.18653/v1/2024.wmt-1.16

  53. [53]

    From General LLM to Translation: How We Dramatically Improve Translation Quality Using Human Evaluation Data for LLM Finetuning

    Elshin, Denis and Karpachev, Nikolay and Gruzdev, Boris and Golovanov, Ilya and Ivanov, Georgy and Antonov, Alexander and Skachkov, Nickolay and Latypova, Ekaterina and Layner, Vladimir and Enikeeva, Ekaterina and Popov, Dmitry and Chekashev, Anton and Negodin, Vladislav and Frantsuzova, Vera and Chernyshev, Alexander and Denisov, Kirill. From General LLM...

  54. [54]

    Cogs in a Machine, Doing What They`re Meant to Do -- the AMI Submission to the WMT 24 General Translation Task

    Jasonarson, Atli and Hafsteinsson, Hinrik and \'Armannsson, Bjarki and Steingr\' msson, Steinth\'or. Cogs in a Machine, Doing What They're Meant to Do -- the AMI Submission to the WMT 24 General Translation Task. Proceedings of the Ninth Conference on Machine Translation. 2024. doi:10.18653/v1/2024.wmt-1.18

  55. [55]

    IKUN for WMT 24 General MT Task: LLM s Are Here for Multilingual Machine Translation

    Liao, Baohao and Herold, Christian and Khadivi, Shahram and Monz, Christof. IKUN for WMT 24 General MT Task: LLM s Are Here for Multilingual Machine Translation. Proceedings of the Ninth Conference on Machine Translation. 2024. doi:10.18653/v1/2024.wmt-1.19

  56. [56]

    NTTSU at WMT 2024 General Translation Task

    Kondo, Minato and Fukuda, Ryo and Wang, Xiaotian and Chousa, Katsuki and Nishimura, Masato and Buma, Kosei and Kano, Takatomo and Utsuro, Takehito. NTTSU at WMT 2024 General Translation Task. Proceedings of the Ninth Conference on Machine Translation. 2024. doi:10.18653/v1/2024.wmt-1.20

  57. [57]

    SCIR - MT `s Submission for WMT 24 General Machine Translation Task

    Li, Baohang and Ye, Zekai and Huang, Yichong and Feng, Xiaocheng and Qin, Bing. SCIR - MT 's Submission for WMT 24 General Machine Translation Task. Proceedings of the Ninth Conference on Machine Translation. 2024. doi:10.18653/v1/2024.wmt-1.21

  58. [58]

    AIST AIRC Systems for the WMT 2024 Shared Tasks

    Rikters, Matiss and Miwa, Makoto. AIST AIRC Systems for the WMT 2024 Shared Tasks. Proceedings of the Ninth Conference on Machine Translation. 2024. doi:10.18653/v1/2024.wmt-1.22

  59. [59]

    Occiglot at WMT 24: E uropean Open-source Large Language Models Evaluated on Translation

    Occiglot at WMT 24: E uropean Open-source Large Language Models Evaluated on Translation. Proceedings of the Ninth Conference on Machine Translation. 2024. doi:10.18653/v1/2024.wmt-1.23

  60. [60]

    C o ST of breaking the LLM s

    Mukherjee, Ananya and Yadav, Saumitra and Shrivastava, Manish. C o ST of breaking the LLM s. Proceedings of the Ninth Conference on Machine Translation. 2024. doi:10.18653/v1/2024.wmt-1.24

  61. [61]

    WMT 24 Test Suite: Gender Resolution in Speaker-Listener Dialogue Roles

    Dawkins, Hillary and Nejadgholi, Isar and Lo, Chi-Kiu. WMT 24 Test Suite: Gender Resolution in Speaker-Listener Dialogue Roles. Proceedings of the Ninth Conference on Machine Translation. 2024. doi:10.18653/v1/2024.wmt-1.25

  62. [62]

    The G ender Q ueer Test Suite

    Friidhriksd\'ottir, Steinunn Rut. The G ender Q ueer Test Suite. Proceedings of the Ninth Conference on Machine Translation. 2024. doi:10.18653/v1/2024.wmt-1.26

  63. [63]

    Domain Dynamics: Evaluating Large Language Models in E nglish- H indi Translation

    Bhattacharjee, Soham and Gain, Baban and Ekbal, Asif. Domain Dynamics: Evaluating Large Language Models in E nglish- H indi Translation. Proceedings of the Ninth Conference on Machine Translation. 2024. doi:10.18653/v1/2024.wmt-1.27

  64. [64]

    Investigating the Linguistic Performance of Large Language Models in Machine Translation

    Investigating the Linguistic Performance of Large Language Models in Machine Translation. Proceedings of the Ninth Conference on Machine Translation. 2024. doi:10.18653/v1/2024.wmt-1.28

  65. [65]

    I so C hrono M eter: A Simple and Effective Isochronic Translation Evaluation Metric

    Rozanov, Nikolai and Pankov, Vikentiy and Mukhutdinov, Dmitrii and Vypirailenko, Dima. I so C hrono M eter: A Simple and Effective Isochronic Translation Evaluation Metric. Proceedings of the Ninth Conference on Machine Translation. 2024. doi:10.18653/v1/2024.wmt-1.29

  66. [66]

    A Test Suite of Prompt Injection Attacks for LLM -based Machine Translation

    Miceli Barone, Antonio Valerio and Sun, Zhifan. A Test Suite of Prompt Injection Attacks for LLM -based Machine Translation. Proceedings of the Ninth Conference on Machine Translation. 2024. doi:10.18653/v1/2024.wmt-1.30

  67. [67]

    Killing Two Flies with One Stone: An Attempt to Break LLMs Using English-Icelandic Idioms and Proper Names

    Ármannsson, Bjarki and Hafsteinsson, Hinrik and Jasonarson, Atli and Steingrimsson, Steinthor. Killing Two Flies with One Stone: An Attempt to Break LLMs Using English-Icelandic Idioms and Proper Names. Proceedings of the Ninth Conference on Machine Translation. 2024. doi:10.18653/v1/2024.wmt-1.31

  68. [68]

    MetaMetrics-MT: Tuning Meta-Metrics for Machine Translation via Human Preference Calibration

    Anugraha, David and Kuwanto, Garry and Susanto, Lucky and Wijaya, Derry Tanti and Winata, Genta. MetaMetrics-MT: Tuning Meta-Metrics for Machine Translation via Human Preference Calibration. Proceedings of the Ninth Conference on Machine Translation. 2024. doi:10.18653/v1/2024.wmt-1.32

  69. [69]

    chrF-S: Semantics Is All You Need

    Mukherjee, Ananya and Shrivastava, Manish. chrF-S: Semantics Is All You Need. Proceedings of the Ninth Conference on Machine Translation. 2024. doi:10.18653/v1/2024.wmt-1.33

  70. [70]

    MSLC24: Further Challenges for Metrics on a Wide Landscape of Translation Quality

    Knowles, Rebecca and Larkin, Samuel and Lo, Chi-Kiu. MSLC24: Further Challenges for Metrics on a Wide Landscape of Translation Quality. Proceedings of the Ninth Conference on Machine Translation. 2024. doi:10.18653/v1/2024.wmt-1.34

  71. [71]

    MetricX-24: The Google Submission to the WMT 2024 Metrics Shared Task

    Juraska, Juraj and Deutsch, Daniel and Finkelstein, Mara and Freitag, Markus. MetricX-24: The Google Submission to the WMT 2024 Metrics Shared Task. Proceedings of the Ninth Conference on Machine Translation. 2024. doi:10.18653/v1/2024.wmt-1.35

  72. [72]

    Evaluating WMT 2024 Metrics Shared Task Submissions on AfriMTE (the African Challenge Set)

    Wang, Jiayi and Adelani, David Ifeoluwa and Stenetorp, Pontus. Evaluating WMT 2024 Metrics Shared Task Submissions on AfriMTE (the African Challenge Set). Proceedings of the Ninth Conference on Machine Translation. 2024. doi:10.18653/v1/2024.wmt-1.36

  73. [73]

    Machine Translation Metrics Are Better in Evaluating Linguistic Errors on LLMs than on Encoder-Decoder Systems

    Machine Translation Metrics Are Better in Evaluating Linguistic Errors on LLMs than on Encoder-Decoder Systems. Proceedings of the Ninth Conference on Machine Translation. 2024. doi:10.18653/v1/2024.wmt-1.37

  74. [74]

    TMU-HIT's Submission for the WMT24 Quality Estimation Shared Task: Is GPT-4 a Good Evaluator for Machine Translation?

    Sato, Ayako and Nakajima, Kyotaro and Kim, Hwichan and Chen, Zhousi and Komachi, Mamoru. TMU-HIT's Submission for the WMT24 Quality Estimation Shared Task: Is GPT-4 a Good Evaluator for Machine Translation?. Proceedings of the Ninth Conference on Machine Translation. 2024. doi:10.18653/v1/2024.wmt-1.38

  75. [75]

    HW-TSC 2024 Submission for the Quality Estimation Shared Task

    Shan, Weiqiao and Zhu, Ming and Li, Yuang and Piao, Mengyao and Zhao, Xiaofeng and Su, Chang and Zhang, Min and Yang, Hao and Jiang, Yanfei. HW-TSC 2024 Submission for the Quality Estimation Shared Task. Proceedings of the Ninth Conference on Machine Translation. 2024. doi:10.18653/v1/2024.wmt-1.39

  76. [76]

    HW-TSC's Participation in the WMT 2024 QEAPE Task

    Yu, Jiawei and Zhao, Xiaofeng and Zhang, Min and Yanqing, Zhao and Li, Yuang and Chang, Su and Qiao, Xiaosong and Miaomiao, Ma and Yang, Hao. HW-TSC's Participation in the WMT 2024 QEAPE Task. Proceedings of the Ninth Conference on Machine Translation. 2024. doi:10.18653/v1/2024.wmt-1.40

  77. [77]

    Perez-Ortiz, Juan Antonio and Sánchez-Martínez, Felipe and Sánchez-Cartagena, Víctor M. and Esplà-Gomis, Miquel and Galiano Jimenez, Aaron and Oliver, Antoni and Aventín-Boya, Claudi and Pardos, Alejandro and Valdés, Cristina and Sans Socasau, Jusèp Loís and Martínez, Juan Pablo. Expanding the FLORES+ Multilingual Benchmark with Trans...

  78. [78]

    The Bangla/Bengali Seed Dataset Submission to the WMT24 Open Language Data Initiative Shared Task

    Ahmed, Firoz and Venkateswaran, Nitin and Moeller, Sarah. The Bangla/Bengali Seed Dataset Submission to the WMT24 Open Language Data Initiative Shared Task. Proceedings of the Ninth Conference on Machine Translation. 2024. doi:10.18653/v1/2024.wmt-1.42

  79. [79]

    A High-quality Seed Dataset for Italian Machine Translation

    Ferrante, Edoardo. A High-quality Seed Dataset for Italian Machine Translation. Proceedings of the Ninth Conference on Machine Translation. 2024. doi:10.18653/v1/2024.wmt-1.43

  80. [80]

    Correcting FLORES Evaluation Dataset for Four African Languages

    Abdulmumin, Idris and Mkhwanazi, Sthembiso and Mbooi, Mahlatse and Muhammad, Shamsuddeen Hassan and Ahmad, Ibrahim Said and Putini, Neo and Mathebula, Miehleketo and Shingange, Matimba and Gwadabe, Tajuddeen and Marivate, Vukosi. Correcting FLORES Evaluation Dataset for Four African Languages. Proceedings of the Ninth Conference on Machine Translation. 2...

Showing first 80 references.