pith. machine review for the scientific record.

arxiv: 2604.17197 · v1 · submitted 2026-04-19 · 💻 cs.CL

Recognition: unknown

Learning to Control Summaries with Score Ranking

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 06:38 UTC · model grok-4.3

classification 💻 cs.CL
keywords text summarization · controllable generation · score ranking · fine-grained evaluation · quality trade-offs · language model training

The pith

A ranking loss on fine-grained evaluator scores lets summarization models control trade-offs between quality dimensions like conciseness and faithfulness.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a loss function that trains models to align their summary outputs with per-dimension scores from an external evaluator. This produces summaries whose overall quality matches current top methods while letting a user raise or lower the emphasis on a single dimension without retraining. The approach directly tackles inherent conflicts between criteria, such as the tendency for gains in conciseness to come at the cost of completeness. A reader would care because many practical uses of summaries require tuning for specific priorities rather than accepting a single blended optimum.

Core claim

The authors show that a score-ranking loss aligning generated summaries with fine-grained evaluator scores improves overall quality and, at the same time, permits explicit control over individual criteria. By adjusting target scores for dimensions such as completeness, conciseness, and faithfulness, the model can be steered toward desired trade-offs. Experiments on LLaMA, Qwen, and Mistral confirm that the resulting summaries reach performance levels comparable to state-of-the-art systems while adding this controllability.

What carries the argument

The score-ranking loss that encourages generated summaries to follow the relative ordering of fine-grained quality scores supplied by an external evaluator model.
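
To make the mechanism concrete, here is a minimal sketch of one plausible form such an objective could take: a BRIO-style pairwise margin ranking loss over K candidate summaries, ordered by an external evaluator's score on a single dimension. This is an illustration under stated assumptions, not the paper's exact loss; the function name, margin scheme, and length normalization are all assumptions.

```python
import torch

def score_ranking_loss(log_probs: torch.Tensor,
                       eval_scores: torch.Tensor,
                       margin: float = 0.001) -> torch.Tensor:
    """Hypothetical pairwise ranking loss, not the paper's exact objective.

    log_probs:   (K,) length-normalized log-likelihoods the model assigns
                 to K candidate summaries of one document.
    eval_scores: (K,) fine-grained scores for those candidates from an
                 external evaluator (e.g., a FineSurE-style completeness
                 or conciseness score).
    """
    # Order candidates from best to worst according to the evaluator.
    order = torch.argsort(eval_scores, descending=True)
    lp = log_probs[order]
    k = lp.shape[0]
    loss = lp.new_zeros(())
    # For each pair (i, j) with i ranked above j, push the model to assign
    # the better summary a likelihood higher by a margin that grows with
    # the rank gap, so model preference tracks the evaluator's ordering.
    for i in range(k):
        for j in range(i + 1, k):
            loss = loss + torch.clamp(margin * (j - i) - (lp[i] - lp[j]), min=0.0)
    return loss
```

Training with one such term per controlled dimension, weighted by the user's priorities, would be one way to obtain the reported trade-off control; the paper's precise formulation should be taken from the source.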

If this is right

  • Summaries reach quality levels comparable to current state-of-the-art systems on standard benchmarks.
  • Users can selectively prioritize one criterion over others by changing the target scores during training or inference; a prompt-level sketch follows this list.
  • The controllability holds across multiple base models including LLaMA, Qwen, and Mistral.
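
Figure 2 indicates that control is exposed through dimension-specific prompts (Com↑, Con↑, Bal). The snippet below sketches what that interface might look like at inference time; the prompt wording and the build_input helper are hypothetical, not the paper's actual prompts.

```python
# Hypothetical control prompts mirroring the Com↑/Con↑/Bal scheme of
# Figure 2; the paper's exact prompt text is not reproduced here.
CONTROL_PROMPTS = {
    "Com": "Summarize the document, prioritizing completeness: cover all key facts.",
    "Con": "Summarize the document, prioritizing conciseness: keep only essential facts.",
    "Bal": "Summarize the document, balancing completeness and conciseness.",
}

def build_input(document: str, priority: str = "Bal") -> str:
    """Prepend the dimension-specific control prompt to the source document."""
    return f"{CONTROL_PROMPTS[priority]}\n\nDocument:\n{document}\n\nSummary:"

# Usage: feed build_input(doc, "Con") to the fine-tuned model to request
# a summary that trades completeness for brevity.
```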

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same ranking objective could be tested on other conditional generation tasks that face similar quality trade-offs.
  • If evaluator scores contain systematic biases, the control mechanism may amplify those biases rather than true human preferences.
  • Practical deployment would require checking whether the induced control remains stable when the summaries are used in downstream applications.

Load-bearing premise

Fine-grained scores from evaluator models serve as reliable, unbiased targets that let training achieve genuine control over trade-offs without the scores themselves being gamed or introducing new inconsistencies.

What would settle it

A controlled experiment in which human raters judge summaries produced under different target scores. The claim would fail if raters detect no measurable shift in the intended dimension, or if the evaluator scores show low agreement with human ratings on the same outputs.
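
The second half of that test is cheap to run once human ratings exist. The sketch below shows the agreement check between evaluator scores and human judgments; the data arrays are invented purely for demonstration.

```python
from scipy.stats import spearmanr

# Illustrative values only: per-summary evaluator scores (e.g., a
# FineSurE-style conciseness score) and 1-5 human ratings of the
# same controlled outputs.
evaluator_scores = [0.81, 0.64, 0.92, 0.55, 0.73]
human_ratings = [4, 3, 5, 2, 4]

rho, p_value = spearmanr(evaluator_scores, human_ratings)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
# Low correlation would undercut the load-bearing premise that evaluator
# scores are reliable control targets; high correlation supports it.
```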

Figures

Figures reproduced from arXiv: 2604.17197 by Hongye Liu, Liang Ding, Ricardo Henao.

Figure 1: Summaries of the abstract prioritizing completeness (Com↑), conciseness (Con↑), and balance (Bal). view at source ↗
Figure 2: Model architecture and loss functions. The model takes an input document along with a specific prompt to generate K summaries, which are used to compute different loss components. The prompts for Com↑, Con↑, and Bal are designed to prioritize a more complete, a more concise, or a balanced summary, respectively. view at source ↗
Figure 3: Model performance across different control settings. Each point represents the mean of HM(Ỹ) (y-axis) and R(Ỹ) (x-axis) across all test cases in the FeedSum test set. Models are grouped by color into four categories: baseline models (blue), our methods (green), SummLLaMA (orange), and commercial models (red). The three panels show performance under different control priorities: Com↑ prioritizes completenes… view at source ↗
Figure 4: Distributions of R(Ỹ) (x) and HM(Ỹ) (y) metrics. Com↑ prioritizes completeness, Con↑ prioritizes conciseness, and Bal balances them. Blue dashed lines mark the mean of the metrics, and the arrows point in the direction in which metrics are better or worse. view at source ↗
Figure 5: Distribution of Spearman correlations between model likelihood and model-based scores across different… view at source ↗
Figure 6: Scatter plot of model performance across different control settings using G-Eval+. Each point represents the mean of HM(Ỹ) (y-axis) and R(Ỹ) (x-axis) across all test cases in the FeedSum test set. Models are grouped by color into four categories: baseline models (blue), our methods (green), SummLLaMA (orange), and commercial models (red). The three panels show performance under different control priorities… view at source ↗
Figure 7: Contour plot showing the distribution of models with respect to control ability (x-axis:… view at source ↗
Figure 8: Contour plot showing the distribution of models with respect to control ability (x-axis:… view at source ↗
Figure 9: Scatter plot of model performance testing on out-of-domain data (MeQSum; OpoSum) across different… view at source ↗
Figure 10: Scatter plot of model performance separated by domain (DialogSum; CNN/DM; WikiHow)… view at source ↗
Figure 11: Model performance across different control settings. Each point represents the mean of HM(Ỹ) (y-axis) and R(Ỹ) (x-axis) across all test cases in the FeedSum test set. The three panels show performance under different control priorities: Com↑ prioritizes completeness, Fai↑ prioritizes faithfulness, and Bal aims to balance both. The vertical dashed line at R(Ỹ) = 0 represents the controllability target (r… view at source ↗
Figure 12: Distributions of R(Ỹ) (x) and HM(Ỹ) (y) metrics. Com↑ prioritizes completeness, Fai↑ prioritizes faithfulness, and Bal balances them. Blue dashed lines mark the mean of the metrics, and the arrows point in the direction in which metrics are better or worse. view at source ↗
Original abstract

Recent advances in summarization research focus on improving summary quality across multiple criteria, such as completeness, conciseness, and faithfulness, by jointly optimizing these dimensions. However, these efforts largely overlook the challenge of controlling summary generation with respect to individual criteria, especially in the presence of their inherent trade-offs. For example, enhancing conciseness can compromise completeness, and vice versa. In this work, we address this gap by proposing a loss function that aligns model outputs with fine-grained, model-based evaluation scores (e.g., from FineSurE), enabling both improvement in summary quality and dimension-specific control. Our approach improves the overall quality of summaries while maintaining the ability to selectively prioritize one criterion over others. Experiments on three pretrained models (LLaMA, Qwen, and Mistral) demonstrate that our method achieves performance comparable to state-of-the-art summarizers, while uniquely offering strong controllability over individual quality dimensions.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated author's rebuttal, a circularity audit, and an axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes a loss function for summarization models that aligns generated outputs with fine-grained model-based evaluation scores (e.g., from FineSurE) to improve overall summary quality across dimensions such as completeness, conciseness, and faithfulness while enabling selective control over individual dimensions to manage inherent trade-offs. Experiments on LLaMA, Qwen, and Mistral claim performance comparable to state-of-the-art summarizers with the added benefit of strong controllability.

Significance. If the results hold with proper validation, the work could meaningfully advance controllable summarization by offering a direct alignment approach to handle multi-criteria trade-offs, potentially more flexible than joint optimization techniques. The multi-model evaluation adds some generalizability, though the absence of detailed quantitative support limits assessment of impact.

major comments (3)
  1. Abstract: The central claims of 'performance comparable to state-of-the-art summarizers' and 'strong controllability over individual quality dimensions' are unsupported by any reported metrics, baselines, statistical tests, or quantification of control (e.g., how trade-offs are measured or prioritized). This is load-bearing for the paper's contribution as the abstract provides the only high-level evidence summary.
  2. Experiments section: No details are given on how controllability is evaluated, such as ablation studies varying dimension priorities, human validation of controlled outputs, or comparisons against alternative evaluators, leaving the uniqueness of the controllability claim unverified.
  3. Method section: Reliance on FineSurE-style scores for the loss creates a circularity risk if the evaluator shares model families, training data, or failure modes with LLaMA/Qwen/Mistral; the paper must demonstrate independence or provide held-out human judgments to confirm that control addresses genuine trade-offs rather than evaluator artifacts.
minor comments (2)
  1. The abstract would be strengthened by including at least one key quantitative result (e.g., a metric value or improvement delta) to ground the claims.
  2. Notation for the loss function and score dimensions could be clarified with an explicit equation or table for reproducibility.
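
For illustration, one hedged candidate form such an equation might take, assuming a pairwise margin formulation over candidates sorted by evaluator score (the symbols λ, w_d, and s_d and the rank-gap margin are illustrative, not taken from the paper):

```latex
% Candidates y_1, ..., y_K for source x, sorted so that the evaluator's
% score on dimension d satisfies s_d(y_1) \ge \cdots \ge s_d(y_K).
\mathcal{L}_{\mathrm{rank}}
  = \sum_{d \in \mathcal{D}} w_d
    \sum_{i < j}
    \max\!\Bigl(0,\; \lambda\,(j - i)
      - \bigl[\log p_\theta(y_i \mid x) - \log p_\theta(y_j \mid x)\bigr]\Bigr)
```

Here w_d weights each controlled dimension and λ scales the margin with the rank gap; raising w_d for one dimension is one plausible way "target scores" could translate into control.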

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We have reviewed each major comment carefully and provide point-by-point responses below, indicating where we will revise the manuscript to address the concerns.

Point-by-point responses
  1. Referee: Abstract: The central claims of 'performance comparable to state-of-the-art summarizers' and 'strong controllability over individual quality dimensions' are unsupported by any reported metrics, baselines, statistical tests, or quantification of control (e.g., how trade-offs are measured or prioritized). This is load-bearing for the paper's contribution as the abstract provides the only high-level evidence summary.

    Authors: We agree that the abstract would be strengthened by including specific quantitative support. The experiments section of the manuscript reports detailed comparisons across LLaMA, Qwen, and Mistral using FineSurE scores and other metrics, demonstrating comparability to SOTA methods along with controllability via dimension-specific alignments. In the revised version, we will update the abstract to incorporate key numerical highlights from the results tables, including overall quality improvements and examples of score changes under different priority settings. revision: yes

  2. Referee: Experiments section: No details are given on how controllability is evaluated, such as ablation studies varying dimension priorities, human validation of controlled outputs, or comparisons against alternative evaluators, leaving the uniqueness of the controllability claim unverified.

    Authors: We will expand the Experiments section to provide explicit details on the controllability evaluation. This includes describing the ablation studies that vary dimension priority weights in the loss function, quantitative reporting of resulting trade-offs (e.g., increases in one dimension's score at the expense of others), and human validation of a subset of controlled outputs. We will also clarify comparisons to alternative control approaches where relevant. revision: yes

  3. Referee: Method section: Reliance on FineSurE-style scores for the loss creates a circularity risk if the evaluator shares model families, training data, or failure modes with LLaMA/Qwen/Mistral; the paper must demonstrate independence or provide held-out human judgments to confirm that control addresses genuine trade-offs rather than evaluator artifacts.

    Authors: We acknowledge this important point on potential circularity. FineSurE operates as an independent fine-grained evaluator with distinct architectural and training characteristics from the generation models. In the revised manuscript, we will add a subsection clarifying these differences to establish independence. We will also incorporate correlation analysis with held-out human judgments on controlled outputs to confirm that the dimension-specific controls reflect genuine quality trade-offs. revision: partial

Circularity Check

0 steps flagged

No circularity: loss function uses external evaluator scores as independent targets

Full rationale

The paper introduces a loss that aligns generated summaries to scores from the separate FineSurE evaluator and reports experimental results on LLaMA, Qwen, and Mistral. This is a standard supervised training setup with an external signal; the controllability claim follows directly from the training objective rather than reducing to a self-referential definition, fitted prediction, or self-citation chain. No equations or steps in the described method equate the output controllability to the input scores by construction. The derivation remains self-contained against the external benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the method depends on pre-existing evaluation models such as FineSurE whose internals are not detailed here.

pith-pipeline@v0.9.0 · 5450 in / 1069 out tokens · 48169 ms · 2026-05-10T06:38:01.302125+00:00 · methodology


Reference graph

Works this paper leans on

60 extracted references · 31 canonical work pages · 11 internal anchors

  1. BRIO: Bringing Order to Abstractive Summarization. arXiv preprint arXiv:2203.16804.
  2. ReFeed: Multi-dimensional Summarization Refinement with Reflective Reasoning on Feedback. arXiv preprint arXiv:2503.21332.
  3. GECSum: Generative Evaluation-Driven Sequence Level Contrastive Learning for Abstractive Summarization. Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024).
  4. Learning to Summarize from LLM-generated Feedback. arXiv preprint arXiv:2410.13116.
  5. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv preprint arXiv:2402.03300.
  6. FineSurE: Fine-grained Summarization Evaluation using LLMs. arXiv preprint arXiv:2407.00908.
  7. UniSumEval: Towards Unified, Fine-Grained, Multi-Dimensional Summarization Evaluation for LLMs. arXiv preprint arXiv:2409.19898.
  8. Learning to Substitute Words with Model-based Score Ranking. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers).
  9. Advances in Large Margin Classifiers. 2000.
  10. On Learning to Summarize with Large Language Models as References. arXiv preprint arXiv:2305.14239.
  11. Improving Factuality of Abstractive Summarization via Contrastive Reward Learning. arXiv preprint arXiv:2307.04507.
  12. Direct Preference Optimization: Your Language Model is Secretly a Reward Model. Advances in Neural Information Processing Systems.
  13. Language Model Alignment with Elastic Reset. Advances in Neural Information Processing Systems.
  14. Controllable Preference Optimization: Toward Controllable Multi-objective Alignment. arXiv preprint arXiv:2402.19085.
  15. DialogSum: A Real-Life Scenario Dialogue Summarization Dataset. arXiv preprint arXiv:2105.06762.
  16. Koupaee and Wang. WikiHow: A Large Scale Text Summarization Dataset. arXiv preprint arXiv:1810.09305.
  17. Abstractive Text Summarization Using Sequence-to-Sequence RNNs and Beyond. arXiv preprint arXiv:1602.06023.
  18. On the Summarization of Consumer Health Questions. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.
  19. Summarizing Opinions: Aspect Extraction Meets Sentiment Prediction and They Are Both Weakly Supervised. arXiv preprint arXiv:1808.08858.
  20. Re-evaluating Evaluation in Text Summarization. arXiv preprint arXiv:2010.07100.
  21. The Harpy Speech Recognition System. 1976.
  22. The Challenging Task of Summary Evaluation: An Overview. Language Resources and Evaluation, 2018.
  23. Reinforcement Learning: A Survey. Journal of Artificial Intelligence Research.
  24. Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347.
  25. LLaMA: Open and Efficient Foundation Language Models. arXiv preprint arXiv:2302.13971.
  26. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv preprint arXiv:2307.09288.
  27. The Llama 3 Herd of Models. arXiv preprint arXiv:2407.21783.
  28. Qwen2.5 Technical Report. arXiv preprint arXiv:2412.15115.
  29. Mistral 7B. 2023.
  30. LoRA: Low-Rank Adaptation of Large Language Models. ICLR.
  31. SummEdits: Measuring LLM Ability at Factual Reasoning through the Lens of Summarization. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing.
  32. Toward Human-Like Evaluation for Natural Language Generation with Error Analysis. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
  33. A Survey Automatic Text Summarization. PressAcademia Procedia, 2007.
  34. Text Summarization. The Oxford Handbook of Computational Linguistics, 2015.
  35. Neural Summarization by Extracting Sentences and Words. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
  36. Neural Text Summarization: A Critical Evaluation. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).
  37. ROUGE: A Package for Automatic Evaluation of Summaries. Text Summarization Branches Out.
  38. BERTScore: Evaluating Text Generation with BERT. arXiv preprint arXiv:1904.09675.
  39. Training Language Models to Follow Instructions with Human Feedback. Advances in Neural Information Processing Systems.
  40. Gemini: A Family of Highly Capable Multimodal Models. arXiv preprint arXiv:2312.11805.
  41. G-Eval: NLG Evaluation Using GPT-4 with Better Human Alignment. arXiv preprint arXiv:2303.16634.
  42. On Faithfulness and Factuality in Abstractive Summarization. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.
  43. Understanding Factuality in Abstractive Summarization with FRANK: A Benchmark for Factuality Metrics. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
  44. Back to Basics: Revisiting REINFORCE-Style Optimization for Learning from Human Feedback in LLMs. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
  45. Length Desensitization in Direct Preference Optimization. arXiv preprint arXiv:2409.06411, 2024.
  46. Reward-Robust RLHF in LLMs. arXiv preprint arXiv:2409.15360.
  47. Self-Preference Bias in LLM-as-a-Judge. arXiv preprint arXiv:2410.21819, 2025.
  48. Chenghao Yang, Sida Li, and Ari Holtzman. Pride and Prejudice: LLM Amplifies Self-Bias in Self-Refinement. arXiv preprint arXiv:2402.11436.
  49. Defining and Characterizing Reward Gaming. Advances in Neural Information Processing Systems.
  50. The Curious Case of Neural Text Degeneration. arXiv preprint arXiv:1904.09751.
  51. This One or That One? A Study on Accessibility via Demonstratives with Multimodal Large Language Models. Language Resources and Evaluation Conference 2026.
  52. STRIDE-ED: A Strategy-Grounded Stepwise Reasoning Framework for Empathetic Dialogue Systems. 2026.
  53. Guanran Luo, Zhongquan Jian, Wentao Qiu, Meihong Wang, and Qingqiang Wu. DTCRS: Dynamic Tree Construction for Recursive Summarization. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025. doi:10.18653/v1/2025.acl-long.536.
  54. GCoT-Decoding: Unlocking Deep Reasoning Paths for Universal Question Answering. 2026.
  55. AGSC: Adaptive Granularity and Semantic Clustering for Uncertainty Quantification in Long-text Generation. 2026.
  56. F²Bench: An Open-ended Fairness Evaluation Benchmark for LLMs with Factuality Considerations. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing.
  57. McBE: A Multi-task Chinese Bias Evaluation Benchmark for Large Language Models. Findings of the Association for Computational Linguistics: ACL 2025.
  58. Think-While-Generating: On-the-Fly Reasoning for Personalized Long-Form Generation. arXiv preprint arXiv:2512.06690, 2025.
  59. CreativeBench: Benchmarking and Enhancing Machine Creativity via Self-Evolving Challenges. arXiv preprint arXiv:2603.11863.
  60. Shijia Xu, Yu Wang, Xiaolong Jia, Zhou Wu, Kai Liu, and April Xiaowen Dong. RCBSF: A Multi-Agent Framework for Automated Contract Revision via Stackelberg Game. arXiv:2604.10740.