pith. machine review for the scientific record.

arxiv: 2605.04920 · v1 · submitted 2026-05-06 · 💻 cs.LG · cs.CL

Recognition: 2 theorem links

Reinforcement Learning for Compositional Generalization with Outcome-Level Optimization

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 18:12 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords compositional generalization · reinforcement learning · outcome-level optimization · policy optimization · language models · supervised fine-tuning · generalization benchmarks

The pith

Outcome-level reinforcement learning improves compositional generalization over supervised fine-tuning by optimizing whole outputs rather than imitating token-by-token targets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Compositional generalization means correctly handling novel combinations of familiar elements, which current supervised training struggles with because it rewards exact token imitation. The paper tests whether shifting to reinforcement learning that scores only the final result can capture the needed global structure instead. Using Group Relative Policy Optimization, models receive either a simple binary reward for correct answers or a composite reward that adds composition signals. Experiments across benchmarks show clear gains, with analysis indicating that supervised models overfit common training patterns while the reinforcement approach redistributes probability mass toward more complex unseen forms.

Core claim

The paper establishes that outcome-level optimization with Group Relative Policy Optimization, using binary or composite rewards on final outputs, produces stronger compositional generalization than token-level supervised fine-tuning; the improvement arises because reinforcement learning reshapes the output distribution away from frequent training compositions and toward more complex novel ones.

What carries the argument

Group Relative Policy Optimization applied at the outcome level, where a reward signal evaluates the correctness or compositional quality of the complete generated response.
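The group-relative mechanics can be sketched as follows: a minimal illustration assuming a binary exact-match reward and the standard mean/std group normalization. Function names and the toy SCAN-style sequences are hypothetical, not the paper's code.

```python
# Sketch of outcome-level scoring in the spirit of GRPO (hypothetical helper
# names; the paper's exact formulation may differ). For each input, K candidate
# outputs are sampled, each receives a scalar reward on the complete output,
# and advantages are normalized within the sampled group.

def binary_outcome_reward(candidate: str, gold: str) -> float:
    """1.0 iff the complete output exactly matches the gold sequence."""
    return 1.0 if candidate.strip() == gold.strip() else 0.0

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalize rewards within the group: A_i = (r_i - mean) / std."""
    k = len(rewards)
    mean = sum(rewards) / k
    std = (sum((r - mean) ** 2 for r in rewards) / k) ** 0.5
    if std == 0.0:  # all candidates equally good or bad: no learning signal
        return [0.0] * k
    return [(r - mean) / std for r in rewards]

# Toy group of K=4 sampled action sequences for one instruction.
candidates = ["JUMP JUMP LTURN", "JUMP LTURN", "JUMP JUMP LTURN", "LTURN"]
gold = "JUMP JUMP LTURN"
rewards = [binary_outcome_reward(c, gold) for c in candidates]
advantages = group_relative_advantages(rewards)
```

Because the reward is computed only on the finished sequence, no token-level target enters the update; correct candidates are pushed up and incorrect ones down relative to their own group.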

If this is right

  • Supervised fine-tuning models overfit to frequent compositions seen during training.
  • Reinforcement learning reshapes output distributions to favor more complex composition types.
  • Both binary outcome rewards and composite rewards that add explicit composition feedback yield measurable gains.
  • The advantage of outcome-level optimization holds across multiple standard compositional generalization benchmarks.
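The binary versus composite contrast in the bullets above can be sketched as follows. This is a hypothetical composite reward: the weight `alpha` and the primitive-overlap term are assumptions for illustration, not the paper's definition.

```python
# Hypothetical composite reward: exact-match outcome signal plus a
# primitive-level overlap term computed on the completed output.
# The weighting (alpha) is illustrative, not taken from the paper.

def primitive_overlap(candidate: str, gold: str) -> float:
    """Fraction of gold primitives (tokens) recovered, order-insensitive."""
    cand, ref = candidate.split(), gold.split()
    if not ref:
        return 0.0
    matched, remaining = 0, list(cand)
    for tok in ref:
        if tok in remaining:
            remaining.remove(tok)
            matched += 1
    return matched / len(ref)

def composite_reward(candidate: str, gold: str, alpha: float = 0.5) -> float:
    exact = 1.0 if candidate.strip() == gold.strip() else 0.0
    return exact + alpha * primitive_overlap(candidate, gold)

r_exact = composite_reward("JUMP JUMP LTURN", "JUMP JUMP LTURN")
r_partial = composite_reward("JUMP LTURN", "JUMP JUMP LTURN")
```

The point of the composite variant is that a near-miss output still earns a graded signal for correct primitives, whereas the binary reward scores it identically to a total failure.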

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Outcome-level training may reduce reliance on training sets that exhaustively enumerate all possible combinations.
  • The same reward-based reshaping could apply to other structural generalization problems such as mathematical or logical reasoning.
  • Comparing token-level versus outcome-level signals points to a broader design choice in how generative models explore solution spaces.

Load-bearing premise

The chosen binary or composite rewards on final outputs supply an unbiased and sufficient signal for global compositional structure, and the observed gains are not driven by new failure modes or hidden tuning effects.

What would settle it

A new compositional benchmark on which outcome-level reinforcement learning produces no accuracy gain over supervised fine-tuning and shows no measurable shift in output probabilities for complex composition types.

Figures

Figures reproduced from arXiv: 2605.04920 by Wei Liu, Xiyan Fu.

Figure 1: Illustration of compositional generalization and training paradigms. Top: Example of compositional generalization where the model must correctly compose previously seen primitives (e.g., jump twice and turn left) to produce the correct action sequence for a novel instruction. Bottom: Comparison of training signals. Token-level optimization relies on supervised targets and cross-entropy loss, whereas outc…

Figure 2: Overview of Compositional Group Relative Policy Optimization. The LLM samples a group of candidate label sequences (y1, y2, …, yK) based on inputs. Each candidate is then evaluated using two complementary reward signals: (i) a binary reward that measures exact match with the gold sequence, and (ii) a compositional reward that assesses primitive correctness and structural composition patterns. Rewar…

Figure 3: Average training trigram frequency of incorrect predictions from SFT and GRPO. Trigram frequencies are computed with respect to the training data, excluding trigrams appearing in ground-truth outputs.

Figure 4: Performance on the SCAN-Length split across various target output lengths. Examples are grouped into bins by output length.
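The copying-behavior analysis summarized in the Figure 3 caption can be approximated as follows; the data structures and function names are assumed for illustration, not taken from the authors' code.

```python
# Sketch of the trigram-frequency analysis: count trigram frequencies over
# the training outputs, then average the training frequency of trigrams
# appearing in incorrect predictions, excluding trigrams that also occur
# in the ground-truth outputs.
from collections import Counter

def trigrams(seq: list[str]) -> list[tuple]:
    return [tuple(seq[i:i + 3]) for i in range(len(seq) - 2)]

def avg_training_trigram_freq(predictions, golds, train_outputs):
    """Mean training-set frequency of trigrams in incorrect predictions,
    excluding trigrams that appear in the gold outputs."""
    train_counts = Counter(t for out in train_outputs for t in trigrams(out))
    gold_trigrams = {t for g in golds for t in trigrams(g)}
    freqs = []
    for pred, gold in zip(predictions, golds):
        if pred == gold:
            continue  # only incorrect predictions enter the analysis
        freqs.extend(train_counts[t] for t in trigrams(pred)
                     if t not in gold_trigrams)
    return sum(freqs) / len(freqs) if freqs else 0.0
```

A high average under this measure would indicate that a model's errors recycle trigrams that were frequent in training, i.e. the copying behavior the analysis attributes to SFT.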
read the original abstract

Compositional generalization refers to correctly interpreting novel combinations of known primitives, which remains a major challenge. Existing approaches often rely on supervised fine-tuning, which encourages models to imitate target outputs. This token-level training paradigm fails to capture the global compositional structure required for generalizing to unseen combinations. In this work, we investigate whether compositional generalization can instead be improved through outcome-level reinforcement learning. We adopt Group Relative Policy Optimization to optimize models based on feedback on their final outputs. Within this framework, we explore both a simple binary outcome reward and a composite reward that provides additional composition feedback. Experiments on multiple compositional benchmarks show that reinforcement learning improves compositional generalization compared to supervised fine-tuning. Further analysis reveals that supervised models tend to overfit frequent training compositions, whereas reinforcement learning improves compositional generalization by reshaping the output distribution, particularly for more complex composition types.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. The paper claims that outcome-level reinforcement learning via Group Relative Policy Optimization (GRPO) improves compositional generalization over supervised fine-tuning on multiple benchmarks. It explores binary outcome rewards and composite rewards with additional composition feedback, arguing that RL reshapes output distributions to reduce overfitting to frequent training compositions and better handle complex unseen combinations.

Significance. If the results hold after addressing comparison confounds, the work would provide evidence that shifting from token-level imitation to outcome-level optimization can better capture global compositional structure, offering a practical alternative to SFT for generalization challenges in language models.

major comments (1)
  1. [Abstract and Methods] Abstract and reward definition (likely §3): The composite reward 'provides additional composition feedback' while the SFT baseline uses only token-level imitation. If the composition signal derives from ground-truth parses, trees, or primitive-level correctness (standard in these benchmarks), the RL policy receives explicit global structure supervision absent from SFT. This makes it impossible to attribute gains specifically to outcome-level RL or GRPO without an ablation that supplies equivalent composition signals to an SFT model (e.g., via auxiliary loss). The analysis of output-distribution reshaping and complex-composition gains inherits the same ambiguity.
minor comments (3)
  1. [Methods] Provide the exact mathematical definition of the composite reward and how it is computed from model outputs versus ground truth.
  2. [Experiments] Include full experimental details: number of runs, statistical tests, error bars, and hyperparameter sensitivity for GRPO and reward weighting.
  3. [Results] Clarify whether the binary reward alone (without composite) already outperforms SFT, to separate the effect of outcome-level optimization from the richer reward.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the thoughtful and constructive feedback, particularly the identification of a potential confound in comparing our RL setups to SFT. We address this concern directly and outline targeted revisions to improve clarity without overstating our claims.

read point-by-point responses
  1. Referee: [Abstract and Methods] Abstract and reward definition (likely §3): The composite reward 'provides additional composition feedback' while the SFT baseline uses only token-level imitation. If the composition signal derives from ground-truth parses, trees, or primitive-level correctness (standard in these benchmarks), the RL policy receives explicit global structure supervision absent from SFT. This makes it impossible to attribute gains specifically to outcome-level RL or GRPO without an ablation that supplies equivalent composition signals to an SFT model (e.g., via auxiliary loss). The analysis of output-distribution reshaping and complex-composition gains inherits the same ambiguity.

    Authors: We agree this is an important point for causal attribution. Our binary outcome reward is computed exclusively from final-output correctness (task success or exact match against the benchmark target), with no explicit primitive, parse, or composition signals supplied to the policy. The composite reward adds terms that score compositional elements in the completed output using the same ground-truth evaluation functions already used for benchmark scoring; these are still post-hoc outcome verifiers rather than injected structure or auxiliary targets during generation. Nevertheless, we acknowledge that the composite signal is richer than pure binary outcome feedback and could partially explain some gains. To address the concern, we will revise the abstract and §3 to provide precise mathematical definitions of both rewards, foreground the binary-reward results as the core evidence for outcome-level optimization, and expand the analysis section to show that distribution reshaping and complex-composition improvements are already visible under the binary reward alone. We will also add a brief discussion of the suggested auxiliary-loss ablation on SFT as a limitation and direction for future work (or include a preliminary version if time permits during revision). revision: partial

Circularity Check

0 steps flagged

No circularity: purely empirical comparison without derivations or self-referential reductions

full rationale

The paper reports experimental results on compositional benchmarks, comparing outcome-level RL (with binary or composite rewards) against supervised fine-tuning. No equations, first-principles derivations, or predictions are claimed; the central claims rest on observed performance differences and post-hoc analysis of output distributions. The composite reward is described as providing 'additional composition feedback,' but this is an explicit design choice in the experimental setup rather than a hidden self-definition or fitted quantity renamed as a result. No self-citations are invoked to justify uniqueness or load-bearing premises, and the work does not reduce any finding to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no explicit free parameters, axioms, or invented entities are stated. Standard RL hyperparameters (learning rate, group size, reward scaling) are implicitly present but not detailed.

pith-pipeline@v0.9.0 · 5431 in / 1094 out tokens · 38460 ms · 2026-05-08T18:12:06.437214+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

300 extracted references · 180 canonical work pages · 7 internal anchors

  1. [1] Connectionism and cognitive architecture: A critical analysis. Cognition, 1988.
  2. [2] Building machines that learn and think like people. Behavioral and Brain Sciences, 2017.
  3. [3] Compositionality decomposed: How do neural networks generalise? Journal of Artificial Intelligence Research, 2020.
  4. [4] Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks. International Conference on Machine Learning, 2018.
  5. [5] Saparov, Abulhair; Pang, Richard Yuanzhe; Padmakumar, Vishakh; Joshi, Nitish; Kazemi, Mehran; Kim, Najoung; He, He. Testing the General Deductive Reasoning Capacity of Large Language Models Using… 2023.
  6. [6] Human-like systematic generalization through a meta-learning neural network. Nature, 2023.
  7. [7] Fu, Xiyan; Frank, Anette. Dynamic MOdularized Reasoning for Compositional Structured Explanation Generation. Proceedings of the 13th Joint Conference on Lexical and Computational Semantics (*SEM 2024), 2024.
  8. [8] Characterizing intrinsic compositionality in transformers with Tree Projections. The Eleventh International Conference on Learning Representations.
  9. [9] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv preprint arXiv:2501.12948.
  10. [10] Chu, Tianzhe; Zhai, Yuexiang; Yang, Jihan; Tong, Shengbang; Xie, Saining; Schuurmans, Dale; Le, Quoc V.; Levine, Sergey; Ma, Yi. 2025.
  11. [11] Understanding the Effects of RLHF on LLM Generalisation and Diversity. Proceedings of the International Conference on Learning Representations (ICLR).
  12. [12] Learning What Reinforcement Learning Can't: Interleaved Online Fine-Tuning for Hardest Questions. arXiv preprint arXiv:2506.07527, 2025.
  13. [13] RL Fine-Tuning Heals OOD Forgetting in SFT. arXiv preprint arXiv:2509.12235, 2025.
  14. [14] On Calibration of Modern Neural Networks. Proceedings of the 34th International Conference on Machine Learning, 2017.
  15. [15] Team OLMo; Walsh, Pete; Soldaini, Luca; Groeneveld, Dirk; Lo, Kyle; Arora, Shane; Bhagia, Akshita; Gu, Yuling; Huang, Shengyi; Jordan, Matt; Lambert, Nathan; Schwenk, Dustin; Tafjord, Oyvind; et al.
  16. [16] On the generalization of language models from in-context learning and finetuning: a controlled study. arXiv preprint arXiv:2505.00661.
  17. [17] Professor forcing: A new algorithm for training recurrent networks. Advances in Neural Information Processing Systems, 2016.
  18. [18] Hierarchical Poset Decoding for Compositional Generalization in Language. Advances in Neural Information Processing Systems (NeurIPS).
  19. [19] Shao, Zhihong; Wang, Peiyi; Zhu, Qihao; Xu, Runxin; Song, Junxiao; Zhang, Mingchuan; Li, Y. K.; Wu, Y.; Guo, Daya. CoRR, 2024.
  20. [20] Measuring Compositional Generalization: A Comprehensive Method on Realistic Data. International Conference on Learning Representations.
  21. [21] Team Qwen. Qwen2.5: A Party of Foundation Models.
  22. [22] The Llama 3 Herd of Models. arXiv preprint arXiv:2407.21783.
  23. [23] Ouyang, Long; Wu, Jeffrey; Jiang, Xu; Almeida, Diogo; Wainwright, Carroll; Mishkin, Pamela; Zhang, Chong; Agarwal, Sandhini; Slama, Katarina; Ray, Alex; Schulman, John; Hilton, Jacob; et al. Training language models to follow instructions with human feedback.
  24. [24] GPT-4 Technical Report. arXiv preprint arXiv:2303.08774.
  25. [25] Huang, Shulin; Ding, Yiran; Pan, Junshu; Zhang, Yue. Beyond English-Centric Training: How Reinforcement Learning Improves Cross-Lingual Reasoning in… 2026.
  26. [26] RAG-RL: Advancing Retrieval-Augmented Generation via RL and Curriculum Learning. arXiv preprint arXiv:2503.12759.
  27. [27] RL-finetuning LLMs from on- and off-policy data with a single algorithm. arXiv preprint arXiv:2503.19612.
  28. [28] Chu, Xiangxiang; Huang, Hailang; Zhang, Xiao; Wei, Fei; Wang, Yong. 2026.
  29. [29] Wei, Yuxiang; Duchenne, Olivier; Copet, Jade; Carbonneaux, Quentin; Zhang, Lingming; Fried, Daniel; Synnaeve, Gabriel; Singh, Rishabh; Wang, Sida. 2025.
  30. [30] Overfitting in adversarially robust deep learning. Proceedings of the 37th International Conference on Machine Learning, 2020.
  31. [31] DeepSeek LLM: Scaling Open-Source Language Models with Longtermism. arXiv preprint arXiv:2401.02954.
  32. [32] Tulu 3: Pushing Frontiers in Open Language Model Post-Training. Second Conference on Language Modeling.
  33. [33] Reinforcement Learning for Reasoning in Large Language Models with One Training Example. The Thirty-ninth Annual Conference on Neural Information Processing Systems.
  34. [34] Zhang, Wenhao; Xie, Yuexiang; Sun, Yuchang; Chen, Yanxi; Wang, Guoyin; Li, Yaliang; Ding, Bolin; Zhou, Jingren. On-Policy… 2026.
  35. [35] Fu, Xiyan; Frank, Anette. Exploring Continual Learning of Compositional Generalization in NLI. Transactions of the Association for Computational Linguistics, 2024. doi:10.1162/tacl_a_00680.

  36. [36]

    Proceedings of the Ninth Conference on Machine Translation. 2024. doi:10.18653/v1/2024.wmt-1.0

  37. [37]

    Findings of the WMT 24 General Machine Translation Shared Task: The LLM Era Is Here but MT Is Not Solved Yet

    Kocmi, Tom and Avramidis, Eleftherios and Bawden, Rachel and Bojar, Ond rej and Dvorkovich, Anton and Federmann, Christian and Fishel, Mark and Freitag, Markus and Gowda, Thamme and Grundkiewicz, Roman and Haddow, Barry and Karpinska, Marzena and Koehn, Philipp and Marie, Benjamin and Monz, Christof and Murray, Kenton and Nagata, Masaaki and Popel, Martin...

  38. [38]

    Are LLM s Breaking MT Metrics? Results of the WMT 24 Metrics Shared Task

    Freitag, Markus and Mathur, Nitika and Deutsch, Daniel and Lo, Chi-Kiu and Avramidis, Eleftherios and Rei, Ricardo and Thompson, Brian and Blain, Frederic and Kocmi, Tom and Wang, Jiayi and Adelani, David Ifeoluwa and Buchicchio, Marianna and Zerva, Chrysoula and Lavie, Alon. Are LLM s Breaking MT Metrics? Results of the WMT 24 Metrics Shared Task. Procee...

  39. [39]

    De Souza, Jos \'e G

    Zerva, Chrysoula and Blain, Frederic and C. De Souza, Jos\'e G. and Kanojia, Diptesh and Deoghare, Sourabh and Guerreiro, Nuno M. and Attanasio, Giuseppe and Rei, Ricardo and Orasan, Constantin and Negri, Matteo and Turchi, Marco and Chatterjee, Rajen and Bhattacharyya, Pushpak and Freitag, Markus and Martins, Andr\'e. Findings of the Quality Estimation S...

  40. [40]

    Findings of the WMT 2024 Shared Task of the Open Language Data Initiative

    Maillard, Jean and Burchell, Laurie and Anastasopoulos, Antonios and Federmann, Christian and Koehn, Philipp and Wang, Skyler. Findings of the WMT 2024 Shared Task of the Open Language Data Initiative. Proceedings of the Ninth Conference on Machine Translation. 2024. doi:10.18653/v1/2024.wmt-1.4

  41. [41]

    Results of the WAT / WMT 2024 Shared Task on Patent Translation

    Higashiyama, Shohei. Results of the WAT / WMT 2024 Shared Task on Patent Translation. Proceedings of the Ninth Conference on Machine Translation. 2024. doi:10.18653/v1/2024.wmt-1.5

  42. [42]

    Findings of the WMT 2024 Biomedical Translation Shared Task: Test Sets on Abstract Level

    Neves, Mariana and Grozea, Cristian and Thomas, Philippe and Roller, Roland and Bawden, Rachel and N\'ev\'eol, Aur\'elie and Castle, Steffen and Bonato, Vanessa and Di Nunzio, Giorgio Maria and Vezzani, Federica and Vicente Navarro, Maika and Yeganova, Lana and Jimeno Yepes, Antonio. Findings of the WMT 2024 Biomedical Translation Shared Task: Test Sets o...

  43. [43]

    MSLC 24 Submissions to the General Machine Translation Task

    Larkin, Samuel and Lo, Chi-Kiu and Knowles, Rebecca. MSLC 24 Submissions to the General Machine Translation Task. Proceedings of the Ninth Conference on Machine Translation. 2024. doi:10.18653/v1/2024.wmt-1.7

  44. [44]

    IOL Research Machine Translation Systems for WMT 24 General Machine Translation Shared Task

    Zhang, Wenbo. IOL Research Machine Translation Systems for WMT 24 General Machine Translation Shared Task. Proceedings of the Ninth Conference on Machine Translation. 2024. doi:10.18653/v1/2024.wmt-1.8

  45. [45]

    Choose the Final Translation from NMT and LLM Hypotheses Using MBR Decoding: HW - TSC `s Submission to the WMT 24 General MT Shared Task

    Wu, Zhanglin and Wei, Daimeng and Li, Zongyao and Shang, Hengchao and Guo, Jiaxin and Li, Shaojun and Rao, Zhiqiang and Luo, Yuanchang and Xie, Ning and Yang, Hao. Choose the Final Translation from NMT and LLM Hypotheses Using MBR Decoding: HW - TSC 's Submission to the WMT 24 General MT Shared Task. Proceedings of the Ninth Conference on Machine Translat...

  46. [46]

    C ycle GN : A Cycle Consistent Approach for Neural Machine Translation

    C ycle GN : A Cycle Consistent Approach for Neural Machine Translation. Proceedings of the Ninth Conference on Machine Translation. 2024. doi:10.18653/v1/2024.wmt-1.10

  47. [47]

    U v A - MT `s Participation in the WMT 24 General Translation Shared Task

    Tan, Shaomu and Stap, David and Aycock, Seth and Monz, Christof and Wu, Di. U v A - MT 's Participation in the WMT 24 General Translation Shared Task. Proceedings of the Ninth Conference on Machine Translation. 2024. doi:10.18653/v1/2024.wmt-1.11

  48. [48]

    Rei, Ricardo and Pombal, Jose and Guerreiro, Nuno M. and Alves, Jo\ ao and Martins, Pedro Henrique and Fernandes, Patrick and Wu, Helena and Vaz, Tania and Alves, Duarte and Farajian, Amin and Agrawal, Sweta and Farinhas, Antonio and C. De Souza, Jos\'e G. and Martins, Andr\'e. Tower v2: Unbabel- IST 2024 Submission for the General MT Shared Task. Proceed...

  49. [49]

    TSU HITS `s Submissions to the WMT 2024 General Machine Translation Shared Task

    Mynka, Vladimir and Mikhaylovskiy, Nikolay. TSU HITS 's Submissions to the WMT 2024 General Machine Translation Shared Task. Proceedings of the Ninth Conference on Machine Translation. 2024. doi:10.18653/v1/2024.wmt-1.13

  50. [50]

    Document-level Translation with LLM Reranking: Team- J at WMT 2024 General Translation Task

    Kudo, Keito and Deguchi, Hiroyuki and Morishita, Makoto and Fujii, Ryo and Ito, Takumi and Ozaki, Shintaro and Natsumi, Koki and Sato, Kai and Yano, Kazuki and Takahashi, Ryosuke and Kimura, Subaru and Hara, Tomomasa and Sakai, Yusuke and Suzuki, Jun. Document-level Translation with LLM Reranking: Team- J at WMT 2024 General Translation Task. Proceedings ...

  51. [51]

    DLUT and GTCOM `s Neural Machine Translation Systems for WMT 24

    Zong, Hao and Bei, Chao and Liu, Huan and Yuan, Conghu and Chen, Wentao and Huang, Degen. DLUT and GTCOM 's Neural Machine Translation Systems for WMT 24. Proceedings of the Ninth Conference on Machine Translation. 2024. doi:10.18653/v1/2024.wmt-1.15

  52. [52]

    CUNI at WMT 24 General Translation Task: LLM s, ( Q ) L o RA , CPO and Model Merging

    Hrabal, Miroslav and Jon, Josef and Popel, Martin and Luu, Nam and Semin, Danil and Bojar, Ond rej. CUNI at WMT 24 General Translation Task: LLM s, ( Q ) L o RA , CPO and Model Merging. Proceedings of the Ninth Conference on Machine Translation. 2024. doi:10.18653/v1/2024.wmt-1.16

  53. [53]

    From General LLM to Translation: How We Dramatically Improve Translation Quality Using Human Evaluation Data for LLM Finetuning

    Elshin, Denis and Karpachev, Nikolay and Gruzdev, Boris and Golovanov, Ilya and Ivanov, Georgy and Antonov, Alexander and Skachkov, Nickolay and Latypova, Ekaterina and Layner, Vladimir and Enikeeva, Ekaterina and Popov, Dmitry and Chekashev, Anton and Negodin, Vladislav and Frantsuzova, Vera and Chernyshev, Alexander and Denisov, Kirill. From General LLM...

  54. [54]

    Cogs in a Machine, Doing What They`re Meant to Do -- the AMI Submission to the WMT 24 General Translation Task

    Jasonarson, Atli and Hafsteinsson, Hinrik and \'Armannsson, Bjarki and Steingr\' msson, Steinth\'or. Cogs in a Machine, Doing What They're Meant to Do -- the AMI Submission to the WMT 24 General Translation Task. Proceedings of the Ninth Conference on Machine Translation. 2024. doi:10.18653/v1/2024.wmt-1.18

  55. [55]

    IKUN for WMT 24 General MT Task: LLM s Are Here for Multilingual Machine Translation

    Liao, Baohao and Herold, Christian and Khadivi, Shahram and Monz, Christof. IKUN for WMT 24 General MT Task: LLM s Are Here for Multilingual Machine Translation. Proceedings of the Ninth Conference on Machine Translation. 2024. doi:10.18653/v1/2024.wmt-1.19

  56. [56]

    NTTSU at WMT 2024 General Translation Task

    Kondo, Minato and Fukuda, Ryo and Wang, Xiaotian and Chousa, Katsuki and Nishimura, Masato and Buma, Kosei and Kano, Takatomo and Utsuro, Takehito. NTTSU at WMT 2024 General Translation Task. Proceedings of the Ninth Conference on Machine Translation. 2024. doi:10.18653/v1/2024.wmt-1.20

  57. [57]

    SCIR - MT `s Submission for WMT 24 General Machine Translation Task

    Li, Baohang and Ye, Zekai and Huang, Yichong and Feng, Xiaocheng and Qin, Bing. SCIR - MT 's Submission for WMT 24 General Machine Translation Task. Proceedings of the Ninth Conference on Machine Translation. 2024. doi:10.18653/v1/2024.wmt-1.21

  58. [58]

    AIST AIRC Systems for the WMT 2024 Shared Tasks

    Rikters, Matiss and Miwa, Makoto. AIST AIRC Systems for the WMT 2024 Shared Tasks. Proceedings of the Ninth Conference on Machine Translation. 2024. doi:10.18653/v1/2024.wmt-1.22

  59. [59]

    Occiglot at WMT 24: E uropean Open-source Large Language Models Evaluated on Translation

    Occiglot at WMT 24: E uropean Open-source Large Language Models Evaluated on Translation. Proceedings of the Ninth Conference on Machine Translation. 2024. doi:10.18653/v1/2024.wmt-1.23

  60. [60]

    C o ST of breaking the LLM s

    Mukherjee, Ananya and Yadav, Saumitra and Shrivastava, Manish. C o ST of breaking the LLM s. Proceedings of the Ninth Conference on Machine Translation. 2024. doi:10.18653/v1/2024.wmt-1.24

  61. [61]

    WMT 24 Test Suite: Gender Resolution in Speaker-Listener Dialogue Roles

    Dawkins, Hillary and Nejadgholi, Isar and Lo, Chi-Kiu. WMT 24 Test Suite: Gender Resolution in Speaker-Listener Dialogue Roles. Proceedings of the Ninth Conference on Machine Translation. 2024. doi:10.18653/v1/2024.wmt-1.25

  62. [62]

    The G ender Q ueer Test Suite

    Friidhriksd\'ottir, Steinunn Rut. The G ender Q ueer Test Suite. Proceedings of the Ninth Conference on Machine Translation. 2024. doi:10.18653/v1/2024.wmt-1.26

  63. [63]

    Domain Dynamics: Evaluating Large Language Models in E nglish- H indi Translation

    Bhattacharjee, Soham and Gain, Baban and Ekbal, Asif. Domain Dynamics: Evaluating Large Language Models in E nglish- H indi Translation. Proceedings of the Ninth Conference on Machine Translation. 2024. doi:10.18653/v1/2024.wmt-1.27

  64. [64]

    Investigating the Linguistic Performance of Large Language Models in Machine Translation

    Investigating the Linguistic Performance of Large Language Models in Machine Translation. Proceedings of the Ninth Conference on Machine Translation. 2024. doi:10.18653/v1/2024.wmt-1.28

  65. [65]

    I so C hrono M eter: A Simple and Effective Isochronic Translation Evaluation Metric

    Rozanov, Nikolai and Pankov, Vikentiy and Mukhutdinov, Dmitrii and Vypirailenko, Dima. I so C hrono M eter: A Simple and Effective Isochronic Translation Evaluation Metric. Proceedings of the Ninth Conference on Machine Translation. 2024. doi:10.18653/v1/2024.wmt-1.29

  66. [66]

    A Test Suite of Prompt Injection Attacks for LLM -based Machine Translation

    Miceli Barone, Antonio Valerio and Sun, Zhifan. A Test Suite of Prompt Injection Attacks for LLM -based Machine Translation. Proceedings of the Ninth Conference on Machine Translation. 2024. doi:10.18653/v1/2024.wmt-1.30

  67. [67]

    Killing Two Flies with One Stone: An Attempt to Break LLMs Using English-Icelandic Idioms and Proper Names

    Ármannsson, Bjarki and Hafsteinsson, Hinrik and Jasonarson, Atli and Steingrimsson, Steinthor. Killing Two Flies with One Stone: An Attempt to Break LLMs Using English-Icelandic Idioms and Proper Names. Proceedings of the Ninth Conference on Machine Translation. 2024. doi:10.18653/v1/2024.wmt-1.31

  68. [68]

    MetaMetrics-MT: Tuning Meta-Metrics for Machine Translation via Human Preference Calibration

    Anugraha, David and Kuwanto, Garry and Susanto, Lucky and Wijaya, Derry Tanti and Winata, Genta. MetaMetrics-MT: Tuning Meta-Metrics for Machine Translation via Human Preference Calibration. Proceedings of the Ninth Conference on Machine Translation. 2024. doi:10.18653/v1/2024.wmt-1.32

  69. [69]

    chrF-S: Semantics Is All You Need

    Mukherjee, Ananya and Shrivastava, Manish. chrF-S: Semantics Is All You Need. Proceedings of the Ninth Conference on Machine Translation. 2024. doi:10.18653/v1/2024.wmt-1.33

  70. [70]

    MSLC24: Further Challenges for Metrics on a Wide Landscape of Translation Quality

    Knowles, Rebecca and Larkin, Samuel and Lo, Chi-Kiu. MSLC24: Further Challenges for Metrics on a Wide Landscape of Translation Quality. Proceedings of the Ninth Conference on Machine Translation. 2024. doi:10.18653/v1/2024.wmt-1.34

  71. [71]

    MetricX-24: The Google Submission to the WMT 2024 Metrics Shared Task

    Juraska, Juraj and Deutsch, Daniel and Finkelstein, Mara and Freitag, Markus. MetricX-24: The Google Submission to the WMT 2024 Metrics Shared Task. Proceedings of the Ninth Conference on Machine Translation. 2024. doi:10.18653/v1/2024.wmt-1.35

  72. [72]

    Evaluating WMT 2024 Metrics Shared Task Submissions on AfriMTE (the African Challenge Set)

    Wang, Jiayi and Adelani, David Ifeoluwa and Stenetorp, Pontus. Evaluating WMT 2024 Metrics Shared Task Submissions on AfriMTE (the African Challenge Set). Proceedings of the Ninth Conference on Machine Translation. 2024. doi:10.18653/v1/2024.wmt-1.36

  73. [73]

    Machine Translation Metrics Are Better in Evaluating Linguistic Errors on LLMs than on Encoder-Decoder Systems

    Machine Translation Metrics Are Better in Evaluating Linguistic Errors on LLMs than on Encoder-Decoder Systems. Proceedings of the Ninth Conference on Machine Translation. 2024. doi:10.18653/v1/2024.wmt-1.37

  74. [74]

    TMU-HIT's Submission for the WMT24 Quality Estimation Shared Task: Is GPT-4 a Good Evaluator for Machine Translation?

    Sato, Ayako and Nakajima, Kyotaro and Kim, Hwichan and Chen, Zhousi and Komachi, Mamoru. TMU-HIT's Submission for the WMT24 Quality Estimation Shared Task: Is GPT-4 a Good Evaluator for Machine Translation?. Proceedings of the Ninth Conference on Machine Translation. 2024. doi:10.18653/v1/2024.wmt-1.38

  75. [75]

    HW-TSC 2024 Submission for the Quality Estimation Shared Task

    Shan, Weiqiao and Zhu, Ming and Li, Yuang and Piao, Mengyao and Zhao, Xiaofeng and Su, Chang and Zhang, Min and Yang, Hao and Jiang, Yanfei. HW-TSC 2024 Submission for the Quality Estimation Shared Task. Proceedings of the Ninth Conference on Machine Translation. 2024. doi:10.18653/v1/2024.wmt-1.39

  76. [76]

    HW-TSC's Participation in the WMT 2024 QEAPE Task

    Yu, Jiawei and Zhao, Xiaofeng and Zhang, Min and Yanqing, Zhao and Li, Yuang and Chang, Su and Qiao, Xiaosong and Miaomiao, Ma and Yang, Hao. HW-TSC's Participation in the WMT 2024 QEAPE Task. Proceedings of the Ninth Conference on Machine Translation. 2024. doi:10.18653/v1/2024.wmt-1.40

  77. [77]

    Perez-Ortiz, Juan Antonio and Sánchez-Martínez, Felipe and Sánchez-Cartagena, Víctor M. and Esplà-Gomis, Miquel and Galiano Jimenez, Aaron and Oliver, Antoni and Aventín-Boya, Claudi and Pardos, Alejandro and Valdés, Cristina and Sans Socasau, Jusèp Loís and Martínez, Juan Pablo. Expanding the FLORES+ Multilingual Benchmark with Trans...

  78. [78]

    The Bangla/Bengali Seed Dataset Submission to the WMT24 Open Language Data Initiative Shared Task

    Ahmed, Firoz and Venkateswaran, Nitin and Moeller, Sarah. The Bangla/Bengali Seed Dataset Submission to the WMT24 Open Language Data Initiative Shared Task. Proceedings of the Ninth Conference on Machine Translation. 2024. doi:10.18653/v1/2024.wmt-1.42

  79. [79]

    A High-quality Seed Dataset for Italian Machine Translation

    Ferrante, Edoardo. A High-quality Seed Dataset for Italian Machine Translation. Proceedings of the Ninth Conference on Machine Translation. 2024. doi:10.18653/v1/2024.wmt-1.43

  80. [80]

    Correcting FLORES Evaluation Dataset for Four African Languages

    Abdulmumin, Idris and Mkhwanazi, Sthembiso and Mbooi, Mahlatse and Muhammad, Shamsuddeen Hassan and Ahmad, Ibrahim Said and Putini, Neo and Mathebula, Miehleketo and Shingange, Matimba and Gwadabe, Tajuddeen and Marivate, Vukosi. Correcting FLORES Evaluation Dataset for Four African Languages. Proceedings of the Ninth Conference on Machine Translation. 2...

Showing first 80 references.