{"work":{"id":"28f4db09-e48a-458a-967b-755399c7d663","openalex_id":null,"doi":null,"arxiv_id":"2402.01306","raw_key":null,"title":"KTO: Model Alignment as Prospect Theoretic Optimization","authors":null,"authors_text":"Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, Douwe Kiela","year":2024,"venue":"cs.LG","abstract":"Kahneman & Tversky's $\\textit{prospect theory}$ tells us that humans perceive random variables in a biased but well-defined manner (1992); for example, humans are famously loss-averse. We show that objectives for aligning LLMs with human feedback implicitly incorporate many of these biases -- the success of these objectives (e.g., DPO) over cross-entropy minimization can partly be ascribed to them belonging to a family of loss functions that we call $\\textit{human-aware losses}$ (HALOs). However, the utility functions these methods attribute to humans still differ from those in the prospect theory literature. Using a Kahneman-Tversky model of human utility, we propose a HALO that directly maximizes the utility of generations instead of maximizing the log-likelihood of preferences, as current methods do. We call this approach KTO, and it matches or exceeds the performance of preference-based methods at scales from 1B to 30B, despite only learning from a binary signal of whether an output is desirable. More broadly, our work suggests that there is no one HALO that is universally superior; the best loss depends on the inductive biases most appropriate for a given setting, an oft-overlooked consideration.","external_url":"https://arxiv.org/abs/2402.01306","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-25T06:45:25.561403+00:00","pith_arxiv_id":"2402.01306","created_at":"2026-05-09T02:29:41.040131+00:00","updated_at":"2026-05-25T06:45:25.561403+00:00","title_quality_ok":true,"display_title":"KTO: Model Alignment as Prospect Theoretic Optimization","render_title":"KTO: Model Alignment as Prospect Theoretic Optimization"},"hub":{"state":{"work_id":"28f4db09-e48a-458a-967b-755399c7d663","tier":"hub","tier_reason":"10+ Pith inbound or 1,000+ external citations","pith_inbound_count":71,"external_cited_by_count":null,"distinct_field_count":13,"first_pith_cited_at":"2024-02-29T13:53:35+00:00","last_pith_cited_at":"2026-05-22T05:25:00+00:00","author_build_status":"not_needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"not_needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-05-26T03:36:11.098097+00:00","tier_text":"hub"},"tier":"hub","role_counts":[{"context_role":"background","n":13},{"context_role":"method","n":2}],"polarity_counts":[{"context_polarity":"background","n":11},{"context_polarity":"unclear","n":2},{"context_polarity":"extend","n":1},{"context_polarity":"use_method","n":1}],"runs":{"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-23T04:44:21.469199+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"Proximal Policy Optimization Algorithms","work_id":"240c67fe-d14d-4520-91c1-38a4e272ca19","shared_citers":36},{"title":"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models","work_id":"c5006563-f3ec-438a-9e35-b7b484f34828","shared_citers":32},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":23},{"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","shared_citers":22},{"title":"Training Verifiers to Solve Math Word Problems","work_id":"acab1aa8-b4d6-40e0-a3ee-25341701dca2","shared_citers":19},{"title":"Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback","work_id":"a1f2574b-a899-4713-be60-c87ba332656c","shared_citers":17},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":16},{"title":"Constitutional AI: Harmlessness from AI Feedback","work_id":"faaaa4e0-2676-4fac-a0b4-99aef10d2095","shared_citers":15},{"title":"DAPO: An Open-Source LLM Reinforcement Learning System at Scale","work_id":"64019d00-0b11-4bbd-b173-b46c8fad0157","shared_citers":15},{"title":"The Llama 3 Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","shared_citers":13},{"title":"Fine-Tuning Language Models from Human Preferences","work_id":"4f54aad1-f3b6-404f-b9c7-e21ba0a33b99","shared_citers":12},{"title":"Llama 2: Open Foundation and Fine-Tuned Chat Models","work_id":"68a5177f-d644-44c1-bd4f-4e5278c22f5d","shared_citers":12},{"title":"ORPO: Monolithic Preference Optimization without Reference Model","work_id":"93ad9e7b-6db2-4249-a0fa-8a96ef124d53","shared_citers":12},{"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","shared_citers":9},{"title":"Evaluating Large Language Models Trained on Code","work_id":"042493e9-b26f-4b4e-bbde-382072ca9b08","shared_citers":8},{"title":"Group Sequence Policy Optimization","work_id":"3a98b53b-9f52-4d95-adf7-89353c0a9a65","shared_citers":8},{"title":"In Advances in Neural Information Processing Systems (NeurIPS), volume 37","work_id":"09ac9a7a-d953-4f17-acda-ac268871a79c","shared_citers":8},{"title":"Tulu 3: Pushing Frontiers in Open Language Model Post-Training","work_id":"28c9dbea-056a-48c2-8000-85f809827e45","shared_citers":8},{"title":"UltraFeedback: Boosting Language Models with Scaled AI Feedback","work_id":"5f1e5704-a4ce-482a-b124-b3692397be11","shared_citers":8},{"title":"Understanding R1-Zero-Like Training: A Critical Perspective","work_id":"ec354f3b-9484-4a0c-94c8-92d4d0260835","shared_citers":8},{"title":"Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators","work_id":"ef25adcf-addb-445e-b3b5-858eeb9883ca","shared_citers":7},{"title":"Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge","work_id":"28ea1282-d657-4c61-a83c-f1249be6d6b1","shared_citers":7},{"title":"arXiv preprint arXiv:2310.12036 , year=","work_id":"44673d8e-2cc2-4818-86d3-24bc812aa41c","shared_citers":6},{"title":"arXiv preprint arXiv:2405.00675 , year=","work_id":"9d5b91f5-dd34-4c04-925b-ae80101224d5","shared_citers":6}],"time_series":[{"n":6,"year":2024},{"n":18,"year":2025},{"n":43,"year":2026}],"dependency_candidates":[{"n":1,"role":"method","polarity":"extend","paper_title":"Beyond Pairs: Your Language Model is Secretly Optimizing a Preference Graph","primary_cat":"cs.LG","context_text":"policy log-probability ratios against pairwise preference data relative to a fixed reference model. This reformulation reduces alignment to a stable classification-style objective while retaining strong em- pirical performance. As a result, DPO has inspired a growing family of reference-based, reward-free alignment methods, including IPO [11], KTO [12], SimPO [13], ORPO [14], and iterative or online variants such as SPIN [15]. Preprint. arXiv:2605.08037v1 [cs.LG] 8 May 2026 The pairwise and listwise bottleneck.Despite these advances, most DPO-style methods fundamen- tally operate onisolated pairwise comparisons. This assumption is often mismatched with practical data collection regimes. In many real-world and benchmark settings, preference data is obtained by","citing_arxiv_id":"2605.08037"},{"n":1,"role":"method","polarity":"use_method","paper_title":"Controlling Distributional Bias in Multi-Round LLM Generation via KL-Optimized Fine-Tuning","primary_cat":"cs.CL","context_text":"ate a base set of N responses, denoted as Dbase = {(x, ri, ai)}N i=1, where ri is the textual response anda i ∈ Ais the corresponding attribute. To embed the target distributionP∗ into the train- ing data, we explicitly control the generation fre- quency such that the count Nk of responses exhibit- ing attributea k satisfies: Nk =round(N·P ∗(ak|x))(6) For instance, given a target distribution of {Male: 0.99, Female: 0.01} and N= 100 , Dbase will contain 99 responses with the Male attribute and 1 with the Female attribute. Transformation to Preference PairsTo apply our hybrid loss function, we transform Dbase into a preference dataset Dpref ={(x, y w, yl)}. For each sample (x, r, a)from the base dataset, we construct","citing_arxiv_id":"2604.05756"}]},"error":null,"updated_at":"2026-05-23T04:44:15.397286+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"items":[{"title":"Qwen3 Technical Report","outcome":"unchanged","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"counts":{"fixed":0,"merged":0,"unchanged":1,"quarantined":0,"needs_external_resolution":0},"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-23T04:44:29.762787+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"KTO: Model Alignment as Prospect Theoretic Optimization","claims":[{"claim_text":"Kahneman & Tversky's $\\textit{prospect theory}$ tells us that humans perceive random variables in a biased but well-defined manner (1992); for example, humans are famously loss-averse. We show that objectives for aligning LLMs with human feedback implicitly incorporate many of these biases -- the success of these objectives (e.g., DPO) over cross-entropy minimization can partly be ascribed to them belonging to a family of loss functions that we call $\\textit{human-aware losses}$ (HALOs). However, the utility functions these methods attribute to humans still differ from those in the prospect th","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"ate a base set of N responses, denoted as Dbase = {(x, ri, ai)}N i=1, where ri is the textual response anda i ∈ Ais the corresponding attribute. To embed the target distributionP∗ into the train- ing data, we explicitly control the generation fre- quency such that the count Nk of responses exhibit- ing attributea k satisfies: Nk =round(N·P ∗(ak|x))(6) For instance, given a target distribution of {Male: 0.99, Female: 0.01} and N= 100 , Dbase will contain 99 responses with the Male attribute and 1","claim_type":"method","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"policy log-probability ratios against pairwise preference data relative to a fixed reference model. This reformulation reduces alignment to a stable classification-style objective while retaining strong em- pirical performance. As a result, DPO has inspired a growing family of reference-based, reward-free alignment methods, including IPO [11], KTO [12], SimPO [13], ORPO [14], and iterative or online variants such as SPIN [15]. Preprint. arXiv:2605.08037v1 [cs.LG] 8 May 2026 The pairwise and list","claim_type":"method","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"non-linear optimization problems involving phys- ical dynamics. We follow a scalable backtrans- lation based synthetic data generation strategy described in Section 3.2. 2.3. RL for Reasoning and Code Generation Group Relative Policy Optimization (GRPO) [31] eliminates the critic model from PPO [32] by sampling groups of outputs and normalizing ad- vantages within each group; DeepSeek-R1 [33] showed that complex reasoning strategies emerge from GRPO with verifiable rewards alone, and Dr. GRPO [3","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"Orpo: Monolithic preference optimization without reference model. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 11170-11189, 2024. [63] Yu Meng, Mengzhou Xia, and Danqi Chen. Simpo: Simple preference optimization with a reference-free reward.Advances in Neural Information Processing Systems, 37:124198- 124235, 2024. [64] Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. Kto: Model alignment as prospect theoretic opti","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"[172] Chongyu Fan, Yihua Zhang, Jinghan Jia, Alfred Hero, and Sijia Liu. Cyclicreflex: Im- proving large reasoning models via cyclical reflection token scheduling. arXiv preprint arXiv:2506.11077, 2025. [173] Siqi Fan, Peng Han, Shuo Shang, Yequan Wang, and Aixin Sun. Cothink: Token-efficient reasoning via instruct models guiding reasoning models. arXiv preprint arXiv:2505.22017, 2025. [174] Tiantian Fan, Lingjun Liu, Yu Yue, Jiaze Chen, Chengyi Wang, Qiying Yu, Chi Zhang, Zhiqi Lin, Ruofei Zhu,","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"[59] proposed a two-stage strat- egy combining SFT and Feasibility-and-Optimality-Aware Reinforcement Learning (FOARL) to guide LLMs and improve solution quality. 3.2.2 Reinforcement Learning RL strategies are introduced to enhance model robustness. To address hallucina- tion issues in LLMs, Jiang et al. [60] incorporated Kahneman-Tversky Optimization (KTO) [61] along with self-correction mechanisms, and proposed LLMOPT, which has been validated across six real-world datasets spanning 20 domains","claim_type":"background","confidence":0.85,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks KTO: Model Alignment as Prospect Theoretic Optimization because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (11 contexts).","role_counts":[{"n":11,"context_role":"background"},{"n":2,"context_role":"method"}]},"error":null,"updated_at":"2026-05-23T04:44:15.403442+00:00"}},"summary":{"title":"KTO: Model Alignment as Prospect Theoretic Optimization","claims":[{"claim_text":"Kahneman & Tversky's $\\textit{prospect theory}$ tells us that humans perceive random variables in a biased but well-defined manner (1992); for example, humans are famously loss-averse. We show that objectives for aligning LLMs with human feedback implicitly incorporate many of these biases -- the success of these objectives (e.g., DPO) over cross-entropy minimization can partly be ascribed to them belonging to a family of loss functions that we call $\\textit{human-aware losses}$ (HALOs). However, the utility functions these methods attribute to humans still differ from those in the prospect th","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"ate a base set of N responses, denoted as Dbase = {(x, ri, ai)}N i=1, where ri is the textual response anda i ∈ Ais the corresponding attribute. To embed the target distributionP∗ into the train- ing data, we explicitly control the generation fre- quency such that the count Nk of responses exhibit- ing attributea k satisfies: Nk =round(N·P ∗(ak|x))(6) For instance, given a target distribution of {Male: 0.99, Female: 0.01} and N= 100 , Dbase will contain 99 responses with the Male attribute and 1","claim_type":"method","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"policy log-probability ratios against pairwise preference data relative to a fixed reference model. This reformulation reduces alignment to a stable classification-style objective while retaining strong em- pirical performance. As a result, DPO has inspired a growing family of reference-based, reward-free alignment methods, including IPO [11], KTO [12], SimPO [13], ORPO [14], and iterative or online variants such as SPIN [15]. Preprint. arXiv:2605.08037v1 [cs.LG] 8 May 2026 The pairwise and list","claim_type":"method","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"non-linear optimization problems involving phys- ical dynamics. We follow a scalable backtrans- lation based synthetic data generation strategy described in Section 3.2. 2.3. RL for Reasoning and Code Generation Group Relative Policy Optimization (GRPO) [31] eliminates the critic model from PPO [32] by sampling groups of outputs and normalizing ad- vantages within each group; DeepSeek-R1 [33] showed that complex reasoning strategies emerge from GRPO with verifiable rewards alone, and Dr. GRPO [3","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"Orpo: Monolithic preference optimization without reference model. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 11170-11189, 2024. [63] Yu Meng, Mengzhou Xia, and Danqi Chen. Simpo: Simple preference optimization with a reference-free reward.Advances in Neural Information Processing Systems, 37:124198- 124235, 2024. [64] Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. Kto: Model alignment as prospect theoretic opti","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"[172] Chongyu Fan, Yihua Zhang, Jinghan Jia, Alfred Hero, and Sijia Liu. Cyclicreflex: Im- proving large reasoning models via cyclical reflection token scheduling. arXiv preprint arXiv:2506.11077, 2025. [173] Siqi Fan, Peng Han, Shuo Shang, Yequan Wang, and Aixin Sun. Cothink: Token-efficient reasoning via instruct models guiding reasoning models. arXiv preprint arXiv:2505.22017, 2025. [174] Tiantian Fan, Lingjun Liu, Yu Yue, Jiaze Chen, Chengyi Wang, Qiying Yu, Chi Zhang, Zhiqi Lin, Ruofei Zhu,","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"[59] proposed a two-stage strat- egy combining SFT and Feasibility-and-Optimality-Aware Reinforcement Learning (FOARL) to guide LLMs and improve solution quality. 3.2.2 Reinforcement Learning RL strategies are introduced to enhance model robustness. To address hallucina- tion issues in LLMs, Jiang et al. [60] incorporated Kahneman-Tversky Optimization (KTO) [61] along with self-correction mechanisms, and proposed LLMOPT, which has been validated across six real-world datasets spanning 20 domains","claim_type":"background","confidence":0.85,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks KTO: Model Alignment as Prospect Theoretic Optimization because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (11 contexts).","role_counts":[{"n":11,"context_role":"background"},{"n":2,"context_role":"method"}]},"graph":{"co_cited":[{"title":"Proximal Policy Optimization Algorithms","work_id":"240c67fe-d14d-4520-91c1-38a4e272ca19","shared_citers":36},{"title":"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models","work_id":"c5006563-f3ec-438a-9e35-b7b484f34828","shared_citers":32},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":23},{"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","shared_citers":22},{"title":"Training Verifiers to Solve Math Word Problems","work_id":"acab1aa8-b4d6-40e0-a3ee-25341701dca2","shared_citers":19},{"title":"Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback","work_id":"a1f2574b-a899-4713-be60-c87ba332656c","shared_citers":17},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":16},{"title":"Constitutional AI: Harmlessness from AI Feedback","work_id":"faaaa4e0-2676-4fac-a0b4-99aef10d2095","shared_citers":15},{"title":"DAPO: An Open-Source LLM Reinforcement Learning System at Scale","work_id":"64019d00-0b11-4bbd-b173-b46c8fad0157","shared_citers":15},{"title":"The Llama 3 Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","shared_citers":13},{"title":"Fine-Tuning Language Models from Human Preferences","work_id":"4f54aad1-f3b6-404f-b9c7-e21ba0a33b99","shared_citers":12},{"title":"Llama 2: Open Foundation and Fine-Tuned Chat Models","work_id":"68a5177f-d644-44c1-bd4f-4e5278c22f5d","shared_citers":12},{"title":"ORPO: Monolithic Preference Optimization without Reference Model","work_id":"93ad9e7b-6db2-4249-a0fa-8a96ef124d53","shared_citers":12},{"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","shared_citers":9},{"title":"Evaluating Large Language Models Trained on Code","work_id":"042493e9-b26f-4b4e-bbde-382072ca9b08","shared_citers":8},{"title":"Group Sequence Policy Optimization","work_id":"3a98b53b-9f52-4d95-adf7-89353c0a9a65","shared_citers":8},{"title":"In Advances in Neural Information Processing Systems (NeurIPS), volume 37","work_id":"09ac9a7a-d953-4f17-acda-ac268871a79c","shared_citers":8},{"title":"Tulu 3: Pushing Frontiers in Open Language Model Post-Training","work_id":"28c9dbea-056a-48c2-8000-85f809827e45","shared_citers":8},{"title":"UltraFeedback: Boosting Language Models with Scaled AI Feedback","work_id":"5f1e5704-a4ce-482a-b124-b3692397be11","shared_citers":8},{"title":"Understanding R1-Zero-Like Training: A Critical Perspective","work_id":"ec354f3b-9484-4a0c-94c8-92d4d0260835","shared_citers":8},{"title":"Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators","work_id":"ef25adcf-addb-445e-b3b5-858eeb9883ca","shared_citers":7},{"title":"Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge","work_id":"28ea1282-d657-4c61-a83c-f1249be6d6b1","shared_citers":7},{"title":"arXiv preprint arXiv:2310.12036 , year=","work_id":"44673d8e-2cc2-4818-86d3-24bc812aa41c","shared_citers":6},{"title":"arXiv preprint arXiv:2405.00675 , year=","work_id":"9d5b91f5-dd34-4c04-925b-ae80101224d5","shared_citers":6}],"time_series":[{"n":6,"year":2024},{"n":18,"year":2025},{"n":43,"year":2026}],"dependency_candidates":[{"n":1,"role":"method","polarity":"extend","paper_title":"Beyond Pairs: Your Language Model is Secretly Optimizing a Preference Graph","primary_cat":"cs.LG","context_text":"policy log-probability ratios against pairwise preference data relative to a fixed reference model. This reformulation reduces alignment to a stable classification-style objective while retaining strong em- pirical performance. As a result, DPO has inspired a growing family of reference-based, reward-free alignment methods, including IPO [11], KTO [12], SimPO [13], ORPO [14], and iterative or online variants such as SPIN [15]. Preprint. arXiv:2605.08037v1 [cs.LG] 8 May 2026 The pairwise and listwise bottleneck.Despite these advances, most DPO-style methods fundamen- tally operate onisolated pairwise comparisons. This assumption is often mismatched with practical data collection regimes. In many real-world and benchmark settings, preference data is obtained by","citing_arxiv_id":"2605.08037"},{"n":1,"role":"method","polarity":"use_method","paper_title":"Controlling Distributional Bias in Multi-Round LLM Generation via KL-Optimized Fine-Tuning","primary_cat":"cs.CL","context_text":"ate a base set of N responses, denoted as Dbase = {(x, ri, ai)}N i=1, where ri is the textual response anda i ∈ Ais the corresponding attribute. To embed the target distributionP∗ into the train- ing data, we explicitly control the generation fre- quency such that the count Nk of responses exhibit- ing attributea k satisfies: Nk =round(N·P ∗(ak|x))(6) For instance, given a target distribution of {Male: 0.99, Female: 0.01} and N= 100 , Dbase will contain 99 responses with the Male attribute and 1 with the Female attribute. Transformation to Preference PairsTo apply our hybrid loss function, we transform Dbase into a preference dataset Dpref ={(x, y w, yl)}. For each sample (x, r, a)from the base dataset, we construct","citing_arxiv_id":"2604.05756"}]},"authors":[]}}