pith. the verified trust layer for science.

arXiv: 2511.15408 · v2 · pith:SWUJDETK · submitted 2025-11-19 · cs.CL · cs.AI · cs.IR · cs.MA · cs.NE

Chinese Short-Form Creative Content Generation via Explanation-Oriented Multi-Objective Optimization

Pith reviewed 2026-05-17 20:46 UTC · model grok-4.3

classification: cs.CL, cs.AI, cs.IR, cs.MA, cs.NE
keywords: Chinese short-form creative generation, multi-objective optimization, multi-agent framework, explanation verification, personalized constraints, baby naming, LLM creative writing, natural language generation

The pith

Formalizing Chinese short-form creative tasks as joint optimization of constraints and explanation reliability produces more trustworthy personalized outputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that brief Chinese creative texts such as names or ads supply too few observable cues to check whether complex user constraints have been met. It therefore treats the generation task as a heterogeneous multi-objective optimization problem that must improve both the creative output and the reliability of its accompanying explanation. A training-free multi-agent system called MAGIC-HMO iterates between creating the text, writing the explanation, and verifying both against the constraints. This setup matters because standard outcome-only methods break down when the final text is too short to judge on its own, while explanations can supply the missing verification signals. If the approach works, it gives a practical route to more dependable LLM results in culturally dense, constraint-heavy creative domains without any model retraining.

Core claim

The paper formalizes Chinese short-form CNLG as a heterogeneous multi-objective optimization problem in which multiple personalized constraints and explanation reliability are optimized jointly. It introduces MAGIC-HMO, a training-free multi-agent framework that performs iterative generation and verification under an explanation-oriented multi-objective strategy. Experiments on the Chinese Baby Naming benchmark are reported to show that MAGIC-HMO significantly outperforms six strong baselines across various LLM backbones.
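
One way to picture the joint objective is sketched below. This is illustrative notation only, not the paper's own formulation: $x$ stands for the short creative text, $e$ for its explanation, $c_1, \dots, c_k$ for the personalized constraints, and the scoring functions $f_i$ and $r$ are hypothetical placeholders for constraint satisfaction and explanation reliability.

```latex
\max_{(x,\,e)} \; \mathbf{F}(x, e)
  = \bigl( f_1(x, e \mid c_1),\; \dots,\; f_k(x, e \mid c_k),\; r(e \mid x) \bigr)
```

Under this reading, a candidate $(x, e)$ is Pareto-optimal when no other candidate scores at least as well on every component of $\mathbf{F}$ and strictly better on at least one. How the paper actually weights or balances the components is not stated in the material reproduced here.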

What carries the argument

MAGIC-HMO, a training-free multi-agent framework that iterates between generating creative content and its explanation while verifying both against multiple personalized constraints.
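
The generate, explain, verify cycle described above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: `Candidate`, `iterate`, `toy_generate`, and `toy_verify` are hypothetical names, and the toy agents are deterministic stand-ins for what would be LLM calls in a real system.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Candidate:
    text: str          # the short creative output, e.g. a name
    explanation: str   # the rationale that accompanies it


# Hypothetical agent signatures; a real system would wrap LLM calls here.
Generator = Callable[[List[str], List[str]], Candidate]
Verifier = Callable[[Candidate, List[str]], List[str]]


def iterate(constraints: List[str], generate: Generator, verify: Verifier,
            max_rounds: int = 3) -> Tuple[Candidate, List[str], int]:
    """Regenerate until the verifier reports no open issues or rounds run out."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    feedback: List[str] = []
    best_cand, best_issues, rounds_used = None, None, 0
    for round_no in range(1, max_rounds + 1):
        rounds_used = round_no
        cand = generate(constraints, feedback)
        issues = verify(cand, constraints)
        # Keep the candidate with the fewest unresolved issues seen so far.
        if best_issues is None or len(issues) < len(best_issues):
            best_cand, best_issues = cand, issues
        if not issues:
            break
        feedback = issues  # unresolved constraints drive the next round
    return best_cand, best_issues, rounds_used


def toy_generate(constraints: List[str], feedback: List[str]) -> Candidate:
    # Toy stand-in: explains the first constraint plus anything flagged earlier.
    covered = [constraints[0]] + [c for c in feedback if c != constraints[0]]
    explanation = "; ".join(f"addresses {c}" for c in covered)
    return Candidate(text="candidate-name", explanation=explanation)


def toy_verify(cand: Candidate, constraints: List[str]) -> List[str]:
    # Toy stand-in: a constraint counts as verified if the explanation names it.
    return [c for c in constraints if c not in cand.explanation]


result, open_issues, rounds = iterate(
    ["gender", "season", "surname"], toy_generate, toy_verify
)
print(result.explanation, open_issues, rounds)
```

The sketch keeps the best candidate seen so far, so a later round that introduces new problems cannot silently replace an earlier, cleaner result. Whether MAGIC-HMO does the same is not stated in the summary.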

If this is right

  • Short creative outputs can meet diverse personalized constraints more consistently when explanations are optimized alongside the text itself.
  • LLM-generated explanations become usable verification cues rather than additional sources of error.
  • The same iterative verification process works across different LLM backbones without requiring fine-tuning.
  • The method supplies a general template for other short-form creative tasks that suffer from limited direct observability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same explanation-oriented loop could be tested on other compact languages or on short advertising copy under similar constraint sets.
  • Adding an external human review step on the generated explanations might further tighten the multi-objective balance.
  • The framework suggests that explanation quality can serve as a scalable proxy signal for constraint achievement in future automated creative systems.

Load-bearing premise

Iterative multi-agent generation and verification can reliably reduce hallucination, incompleteness, and ambiguity in explanations under complex personalized constraints without introducing new failure modes.

What would settle it

A head-to-head evaluation on the Chinese Baby Naming benchmark in which MAGIC-HMO fails to outperform the six baselines on combined measures of constraint satisfaction and explanation quality would falsify the central performance claim.

Figures

Figures reproduced from arXiv: 2511.15408 by Jianxun Lian, Laks V. S. Lakshmanan, Shanlin Zhou, Xiaoyuan Yi, Xinpeng Wang, Yongtao Hao, Zhenghao Liu.

Figure 1: Example of Chinese Baby Naming (NCB). Different colors indicate…
Figure 2: Overview of NAMEGEN. Steps 1.1 and 1.2 constitute the multi-objective information preparation process, which is primarily handled by MOM and MOE. The dynamic iterative objective optimization process includes Steps 2 and 3: Step 2 is managed by MOG, while Step 3 reflects MOE's role in evaluating the generation results. The green block at the bottom right illustrates the complete pipeline of NAMeGEn.
Figure 3: Comparison of balance performance on EUOs, IIOs, and their overall…
Figure 4: Fine-grained comparison of explicit and implicit objective com…
Figure 5: Pairwise IIO comparisons. Blue lines show Pareto front; red line…
Figure 7: Kernel density estimation (KDE) of interaction distributions across LLMs using our method. (a) shows API request counts over the full process; (b)…
Figure 8: Comparison of dynamic changes during NAMeGEn's iteration process…
Figure 9: Comparison of methods on the same query and backbone (DeepSeek). (a) NCB task results. (b) Slogan design task results. Red highlights factual or…
Original abstract

Chinese demonstrates high semantic compactness and rich metaphorical expressiveness, enabling limited text to convey dense meanings while increasing the difficulty of generation and verification, particularly in short-form creative natural language generation (CNLG). In the real world, users often require personalized, fine-grained creative constraints, making reliable verification critical to guiding optimization. According to Brunswik's Lens Model from psychology, constraints' achievement can be inferred from sufficient observable cues. Existing studies are mainly outcome-oriented, implicitly assuming that the outcome itself provides adequate cues for verification. However, this assumption breaks down in Chinese short-form CNLG (e.g., naming or advertising) with diverse personalized constraints, where extremely brief outcomes inherently offer limited information. Explanations can naturally serve as extra cues. Nevertheless, under complex constraints, LLMs' explanations may suffer from hallucination, incompleteness, or ambiguity. To address these, we novelly formalize the Chinese short-form CNLG task as a heterogeneous multi-objective optimization (HMO) issue that needs to jointly optimize multiple personalized constraints and explanation reliability. We further propose MAGIC-HMO, a training-free multi-agent framework that optimizes these objectives through iterative generation and verification under an explanation-oriented multi-objective strategy. Experiments on Chinese Baby Naming, a challenging benchmark, demonstrate that MAGIC-HMO significantly outperforms six strong baselines across various LLM backbones. Relevant data and codes are available at https://github.com/foolfun/MAGIC_HMO.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that Chinese short-form creative NLG under complex personalized constraints can be formalized as a heterogeneous multi-objective optimization (HMO) problem. It proposes MAGIC-HMO, a training-free multi-agent framework that performs iterative generation and verification of both outputs and explanations, drawing on Brunswik's Lens Model to treat explanations as additional observable cues. Experiments on the Chinese Baby Naming benchmark are reported to show that MAGIC-HMO significantly outperforms six strong baselines across multiple LLM backbones.

Significance. If the reported gains are substantiated with detailed metrics and controls, the work would offer a practical, training-free route to improving reliability in subjective creative generation tasks where outcomes are too brief to serve as self-sufficient verification signals. The explicit multi-objective framing and emphasis on explanation quality distinguish it from purely outcome-oriented prompting methods.

major comments (2)
  1. [Experiments] Experiments section: the central claim of significant outperformance on Chinese Baby Naming is stated without effect sizes, confidence intervals, statistical significance tests, or per-metric breakdowns (e.g., hallucination rate, completeness, or human-rated explanation quality). This absence makes it impossible to evaluate whether the iterative verification loop produces the claimed reductions in hallucination and ambiguity or merely shifts failure modes.
  2. [MAGIC-HMO Framework] MAGIC-HMO framework description: the iterative generation-verification procedure is presented as reliably mitigating hallucination, incompleteness, and ambiguity under personalized constraints, yet no quantitative diagnostics (per-iteration hallucination rates, inter-agent agreement, or ablation of the verification agent) are supplied. In a domain without objective ground truth, this leaves the load-bearing assumption untested.
minor comments (2)
  1. [Abstract] Abstract: the phrase 'significantly outperforms' is used without defining the metric or referencing the statistical procedure that supports the adverb.
  2. [Introduction / Problem Formulation] Notation: the distinction between the heterogeneous objectives and the explanation-reliability objective is introduced but not given explicit mathematical formulation or weighting scheme in the early sections.
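
The statistics requested in the first major comment (an effect size and a confidence interval for a paired comparison) could be computed along these lines. This is a minimal standard-library sketch: the function name `paired_summary` is hypothetical, and the scores are made-up placeholders, not results from the paper.

```python
import random
import statistics
from typing import Dict, List


def paired_summary(ours: List[float], baseline: List[float],
                   n_boot: int = 2000, seed: int = 0) -> Dict[str, object]:
    """Mean paired difference, Cohen's d_z, and a bootstrap 95% percentile CI."""
    if len(ours) != len(baseline) or len(ours) < 2:
        raise ValueError("need two equally long samples with at least 2 items")
    diffs = [a - b for a, b in zip(ours, baseline)]
    mean_diff = statistics.mean(diffs)
    sd = statistics.stdev(diffs)
    # Cohen's d_z: mean of the paired differences over their standard deviation.
    d_z = mean_diff / sd if sd > 0 else float("inf")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    boot_means = sorted(
        statistics.mean(rng.choices(diffs, k=len(diffs))) for _ in range(n_boot)
    )
    lower = boot_means[int(0.025 * n_boot)]
    upper = boot_means[int(0.975 * n_boot) - 1]
    return {"mean_diff": mean_diff, "cohens_dz": d_z, "ci95": (lower, upper)}


# Made-up per-query scores for illustration only (not from the paper).
ours_scores = [0.80, 0.70, 0.90, 0.60, 0.75]
baseline_scores = [0.60, 0.65, 0.70, 0.55, 0.70]
summary = paired_summary(ours_scores, baseline_scores)
print(summary)
```

A percentile bootstrap is used here because it makes no normality assumption about the paired differences. With real benchmark data, a paired t-test or Wilcoxon signed-rank test would typically be reported alongside it.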

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment point by point below, indicating the specific revisions planned for the next version of the manuscript.

Point-by-point responses
  1. Referee: [Experiments] Experiments section: the central claim of significant outperformance on Chinese Baby Naming is stated without effect sizes, confidence intervals, statistical significance tests, or per-metric breakdowns (e.g., hallucination rate, completeness, or human-rated explanation quality). This absence makes it impossible to evaluate whether the iterative verification loop produces the claimed reductions in hallucination and ambiguity or merely shifts failure modes.

    Authors: We agree that the current presentation of results would be strengthened by additional statistical details and metric breakdowns. In the revised manuscript we will report effect sizes, 95% confidence intervals, and statistical significance tests (paired t-tests or Wilcoxon signed-rank tests as appropriate) for all main comparisons across LLM backbones. We will also add explicit per-metric tables covering hallucination rate, completeness, and human-rated explanation quality. These analyses, drawn from evaluation data already collected for the Chinese Baby Naming benchmark, show consistent reductions in hallucination and ambiguity rather than simple failure-mode shifts; the revised tables and figures will make this evidence directly accessible to readers.
    Revision: yes

  2. Referee: [MAGIC-HMO Framework] MAGIC-HMO framework description: the iterative generation-verification procedure is presented as reliably mitigating hallucination, incompleteness, and ambiguity under personalized constraints, yet no quantitative diagnostics (per-iteration hallucination rates, inter-agent agreement, or ablation of the verification agent) are supplied. In a domain without objective ground truth, this leaves the load-bearing assumption untested.

    Authors: We accept that quantitative diagnostics are needed to support the claims about the iterative loop. The revised manuscript will include per-iteration hallucination-rate curves, inter-agent agreement statistics (Cohen's kappa on verification decisions), and a dedicated ablation that removes the verification agent while keeping all other components fixed. Although the creative-naming domain lacks objective ground truth, the benchmark relies on expert human judgments; we will expand the description of these judgments and their reliability in the revision. These additions will directly test the contribution of the verification step and the explanation-oriented multi-objective strategy.
    Revision: yes
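
The inter-agent agreement statistic named in the second response, Cohen's kappa, compares observed agreement with the agreement expected by chance. A minimal sketch follows, assuming two verifiers each emit one categorical decision per item; the function name and the pass/fail labels are hypothetical, and the data is invented for illustration.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Observed agreement: share of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Invented verification decisions from two agents on four items.
agent_1 = ["pass", "pass", "fail", "fail"]
agent_2 = ["pass", "fail", "fail", "fail"]
kappa = cohens_kappa(agent_1, agent_2)
print(kappa)  # 0.5
```

Here the agents agree on three of four items (0.75), chance agreement is 0.5, and kappa is therefore (0.75 - 0.5) / (1 - 0.5) = 0.5.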

Circularity Check

0 steps flagged

No significant circularity; empirical framework evaluated externally

The paper formalizes Chinese short-form CNLG as a heterogeneous multi-objective optimization problem drawing on Brunswik's Lens Model and proposes the training-free MAGIC-HMO multi-agent framework using iterative generation and verification. All reported gains are measured via experiments on the external Chinese Baby Naming benchmark against six independent baselines across LLM backbones. No equations, fitted parameters, or self-citations reduce the claimed outperformance to quantities defined by the authors' own inputs or prior work; the derivation chain remains self-contained with success determined by external comparison rather than construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the domain assumption that LLMs can produce usable explanations under iterative verification and on the empirical claim that the multi-agent loop improves constraint satisfaction; no free parameters or new physical entities are introduced.

axioms (2)
  • domain assumption: LLM-generated explanations can serve as observable cues for constraint achievement, per Brunswik's Lens Model.
    Invoked in the abstract to justify adding explanations as verification signals.
  • ad hoc to paper: Iterative multi-agent generation and verification reduces hallucination and ambiguity without new failure modes.
    Core premise of the MAGIC-HMO strategy.


Reference graph

Works this paper leans on

88 extracted references · 88 canonical work pages · 13 internal anchors

  1. [1]

    Hello gpt-4o,

    OpenAI, “Hello gpt-4o,” https://openai.com/index/hello-gpt-4o/, 2024, accessed: 2025-01-29

  2. [2]

    Introducing openai o1,

    ——, “Introducing openai o1,” https://openai.com/o1/, 2024, accessed: 2024-10-28

  3. [3]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini, R. Anil, S. Borgeaud, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millican, D. Silver, M. Johnson, I. Antonoglou, J. Schrittwieser, A. Glaese, J. Chen, E. Pitler, T. Lillicrap, A. Lazaridou, O. Firat, J. Molloyet al., “Gemini: A family of highly capable multimodal models,” 2024. [Online]. Available: https://arxiv.org/ab...

  4. [4]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Biet al., “Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning,”arXiv preprint arXiv:2501.12948, 2025

  5. [5]

    MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

    Y . Wang, X. Ma, G. Zhang, Y . Ni, A. Chandra, S. Guo, W. Ren, A. Arulraj, X. He, Z. Jiang, T. Li, M. Ku, K. Wang, A. Zhuang, R. Fan, X. Yue, and W. Chen, “MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark,” Nov. 2024, arXiv:2406.01574 [cs]. [Online]. Available: http://arxiv.org/abs/2406.01574

  6. [6]

    Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi,

    X. Yue, Y . Ni, K. Zhang, T. Zheng, R. Liu, G. Zhang, S. Stevens, D. Jiang, W. Ren, Y . Sunet al., “Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 9556–9567

  7. [7]

    MathPrompter: Mathematical reasoning using large language models,

    S. Imani, L. Du, and H. Shrivastava, “MathPrompter: Mathematical reasoning using large language models,” inProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 5: Industry Track), S. Sitaram, B. Beigman Klebanov, and J. D. Williams, Eds. Toronto, Canada: Association for Computational Linguistics, Jul. 2023, pp. 3...

  8. [8]

    Hi-tom: A benchmark for evaluating higher-order theory of mind reasoning in large language models,

    Y . He, Y . Wu, Y . Jia, R. Mihalcea, Y . Chen, and N. Deng, “Hi-tom: A benchmark for evaluating higher-order theory of mind reasoning in large language models,”arXiv preprint arXiv:2310.16755, 2023

  9. [9]

    Understanding social reasoning in language models with language models,

    K. Gandhi, J.-P. Fraenken, T. Gerstenberg, and N. Goodman, “Understanding social reasoning in language models with language models,” inAdvances in Neural Information Processing Systems, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, Eds., vol. 36. Curran Associates, Inc., 2023, pp. 13 518–13 529. [Online]. Available: https://proceedi...

  10. [10]

    Educating lan- guage models as promoters: Multi-aspect instruction alignment with self- augmentation,

    X. Sun, K. Shi, H. Tang, D. Wang, G. Xu, and Q. Li, “Educating lan- guage models as promoters: Multi-aspect instruction alignment with self- augmentation,”IEEE Transactions on Knowledge and Data Engineering, vol. 37, no. 8, pp. 4564–4577, 2025

  11. [11]

    Controllable text generation for open-domain creativity and fairness,

    N. Peng, “Controllable text generation for open-domain creativity and fairness,”arXiv preprint arXiv:2209.12099, 2022

  12. [12]

    Creative natural language generation,

    T. Chakrabarty, V . Padmakumar, H. He, and N. Peng, “Creative natural language generation,” inProceedings of the 2023 Conference on Empir- ical Methods in Natural Language Processing: Tutorial Abstracts, 2023, pp. 34–40

  13. [13]

    CTRL: A Conditional Transformer Language Model for Controllable Generation

    N. S. Keskar, B. McCann, L. R. Varshney, C. Xiong, and R. Socher, “Ctrl: A conditional transformer language model for controllable gen- eration,”arXiv preprint arXiv:1909.05858, 2019

  14. [14]

    Controllable natural language generation with contrastive prefixes,

    J. Qian, L. Dong, Y . Shen, F. Wei, and W. Chen, “Controllable natural language generation with contrastive prefixes,” inFindings of the Association for Computational Linguistics: ACL 2022, S. Muresan, P. Nakov, and A. Villavicencio, Eds. Dublin, Ireland: Association for Computational Linguistics, May 2022, pp. 2912–2924. [Online]. Available: https://acla...

  15. [15]

    A survey on evaluation of large language models,

    Y . Chang, X. Wang, J. Wang, Y . Wu, L. Yang, K. Zhu, H. Chen, X. Yi, C. Wang, Y . Wanget al., “A survey on evaluation of large language models,”ACM Transactions on Intelligent Systems and Technology, vol. 15, no. 3, pp. 1–45, 2024

  16. [16]

    ” it felt like having a second mind

    Q. Wan, S. Hu, Y . Zhang, P. Wang, B. Wen, and Z. Lu, “” it felt like having a second mind”: Investigating human-ai co-creativity in prewriting with large language models,”Proceedings of the ACM on Human-Computer Interaction, vol. 8, no. CSCW1, pp. 1–26, 2024

  17. [17]

    Cre- ativity support in the age of large language models: An empirical study involving emerging writers,

    T. Chakrabarty, V . Padmakumar, F. Brahman, and S. Muresan, “Cre- ativity support in the age of large language models: An empirical study involving emerging writers,”arXiv preprint arXiv:2309.12570, 2023

  18. [18]

    Jiuge: A human-machine collaborative chinese classical poetry generation system,

    G. Zhipeng, X. Yi, M. Sun, W. Li, C. Yang, J. Liang, H. Chen, Y . Zhang, and R. Li, “Jiuge: A human-machine collaborative chinese classical poetry generation system,” inProceedings of the 57th annual meeting of the association for computational linguistics: system demonstrations, 2019, pp. 25–30

  19. [19]

    Charpoet: A chinese classical poetry generation system based on token-free llm,

    C. Yu, L. Zang, J. Wang, C. Zhuang, and J. Gu, “Charpoet: A chinese classical poetry generation system based on token-free llm,” inProceed- ings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), 2024, pp. 315–325

  20. [20]

    Poetry in rags: Modern greek inter- war poetry generation using rag and contrastive training,

    S. Chatzikyriakidis and A. Natsina, “Poetry in rags: Modern greek inter- war poetry generation using rag and contrastive training,” inProceedings of the 5th International Conference on Natural Language Processing for Digital Humanities, 2025, pp. 257–264

  21. [21]

    Collabstory: Multi-llm collaborative story generation and authorship analysis,

    S. Venkatraman, N. I. Tripto, and D. Lee, “Collabstory: Multi-llm collaborative story generation and authorship analysis,”arXiv preprint arXiv:2406.12665, 2024

  22. [22]

    Seed- story: Multimodal long story generation with large language model,

    S. Yang, Y . Ge, Y . Li, Y . Chen, Y . Ge, Y . Shan, and Y . Chen, “Seed- story: Multimodal long story generation with large language model,” arXiv preprint arXiv:2407.08683, 2024

  23. [23]

    Summary of a haystack: A challenge to long-context llms and rag systems,

    P. Laban, A. R. Fabbri, C. Xiong, and C.-S. Wu, “Summary of a haystack: A challenge to long-context llms and rag systems,” in Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024, pp. 9885–9903

  24. [24]

    A comprehensive survey on process-oriented automatic text summarization with exploration of llm-based methods.arXiv preprint arXiv:2403.02901, 2024

    H. Jin, Y . Zhang, D. Meng, J. Wang, and J. Tan, “A comprehensive sur- vey on process-oriented automatic text summarization with exploration of llm-based methods,”arXiv preprint arXiv:2403.02901, 2024

  25. [25]

    Unified multi-scenario summarization evaluation and explanation,

    S. Shang, Z. Yao, H. Fu, C. Tao, X. Chen, F. Wang, Y . Wang, Z. Ren, and S. Gao, “Unified multi-scenario summarization evaluation and explanation,”IEEE Transactions on Knowledge and Data Engineering, vol. 37, no. 2, pp. 991–1003, 2025

  26. [26]

    Enhancing coherence and diversity in multi-class slogan generation systems,

    P. N. Ahmad, Y . Liu, I. Ullah, and M. Shabaz, “Enhancing coherence and diversity in multi-class slogan generation systems,”ACM Transactions on Asian and Low-Resource Language Information Processing, vol. 23, no. 8, pp. 1–24, 2024

  27. [27]

    Deep poetry: A chinese classical poetry generation system,

    Y . Liu, D. Liu, and J. Lv, “Deep poetry: A chinese classical poetry generation system,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 09, 2020, pp. 13 626–13 627

  28. [28]

    Chae: Fine-grained con- trollable story generation with characters, actions and emotions,

    X. Wang, H. Jiang, Z. Wei, and S. Zhou, “Chae: Fine-grained con- trollable story generation with characters, actions and emotions,” in Proceedings of the 29th International Conference on Computational Linguistics, 2022, pp. 6426–6435

  29. [29]

    Weaver: Foundation models for creative writing,

    T. Wang, J. Chen, Q. Jia, S. Wang, R. Fang, H. Wang, Z. Gao, C. Xie, C. Xu, J. Daiet al., “Weaver: Foundation models for creative writing,” arXiv preprint arXiv:2401.17268, 2024

  30. [30]

    Llm discussion: Enhancing the creativity of large language models via discussion framework and role-play,

    L.-C. Lu, S.-J. Chen, T.-M. Pai, C.-H. Yu, H.-y. Lee, and S.-H. Sun, “Llm discussion: Enhancing the creativity of large language models via discussion framework and role-play,”arXiv preprint arXiv:2405.06373, 2024

  31. [31]

    Controlled text gen- eration as continuous optimization with multiple constraints,

    S. Kumar, E. Malmi, A. Severyn, and Y . Tsvetkov, “Controlled text gen- eration as continuous optimization with multiple constraints,”Advances in Neural Information Processing Systems, vol. 34, pp. 14 542–14 554, 2021

  32. [32]

    Position: A roadmap to pluralistic alignment,

    T. Sorensen, J. Moore, J. Fisher, M. L. Gordon, N. Mireshghallah, C. M. Rytting, A. Ye, L. Jiang, X. Lu, N. Dziriet al., “Position: A roadmap to pluralistic alignment,” inForty-first International Conference on Machine Learning, 2024

  33. [33]

    Suri: Multi-constraint instruction following in long-form text generation,

    C. M. Pham, S. Sun, and M. Iyyer, “Suri: Multi-constraint instruction following in long-form text generation,” inFindings of the Association for Computational Linguistics: EMNLP 2024, Y . Al- Onaizan, M. Bansal, and Y .-N. Chen, Eds. Miami, Florida, USA: Association for Computational Linguistics, Nov. 2024, pp. 1722–1753. [Online]. Available: https://acla...

  34. [34]

    Followbench: A multi-level fine- grained constraints following benchmark for large language models,

    Y . Jiang, Y . Wang, X. Zeng, W. Zhong, L. Li, F. Mi, L. Shang, X. Jiang, Q. Liu, and W. Wang, “Followbench: A multi-level fine- grained constraints following benchmark for large language models,” arXiv preprint arXiv:2310.20410, 2023

  35. [35]

    What’s old about new ideas,

    T. B. Ward, “What’s old about new ideas,”The creative cognition approach, pp. 157–178, 1995

  36. [36]

    R. A. Finke, T. B. Ward, and S. M. Smith,Creative cognition: Theory, research, and applications. MIT press, 1996

  37. [37]

    Implicit motives and basic psychological needs,

    J. Sch ¨uler, N. Baumann, A. Chasiotis, M. Bender, and I. Baum, “Implicit motives and basic psychological needs,”Journal of personality, vol. 87, no. 1, pp. 37–55, 2019

  38. [38]

    On the creativity of large language models,

    G. Franceschelli and M. Musolesi, “On the creativity of large language models,”AI & SOCIETY, pp. 1–11, 2024

  39. [39]

    Art or artifice? large language models and the false promise of creativity,

    T. Chakrabarty, P. Laban, D. Agarwal, S. Muresan, and C.-S. Wu, “Art or artifice? large language models and the false promise of creativity,” JOURNAL OF LATEX CLASS FILES. 12 inProceedings of the 2024 CHI Conference on Human Factors in Computing Systems, 2024, pp. 1–34

  40. [40]

    Generative ai lacks the human creativity to achieve scientific discovery from scratch,

    A. W. Ding and S. Li, “Generative ai lacks the human creativity to achieve scientific discovery from scratch,”Scientific Reports, vol. 15, no. 1, p. 9587, 2025

  41. [41]

    Mixpoet: Diverse poetry generation via learning controllable mixed latent space,

    X. Yi, R. Li, C. Yang, W. Li, and M. Sun, “Mixpoet: Diverse poetry generation via learning controllable mixed latent space,” inProceedings of the AAAI conference on artificial intelligence, vol. 34, no. 05, 2020, pp. 9450–9457

  42. [42]

    Evaluating creative short story generation in humans and large language models,

    M. Ismayilzada, C. Stevenson, and L. van der Plas, “Evaluating creative short story generation in humans and large language models,”arXiv preprint arXiv:2411.02316, 2024

  43. [43]

    Small language models can out- perform humans in short creative writing: A study comparing slms with humans and llms,

    G. Marco, L. Rello, and J. Gonzalo, “Small language models can out- perform humans in short creative writing: A study comparing slms with humans and llms,” inProceedings of the 31st International Conference on Computational Linguistics, 2025, pp. 6552–6570

  44. [44]

    ProxyQA: An alternative framework for evaluating long-form text generation with large language models,

    H. Tan, Z. Guo, Z. Shi, L. Xu, Z. Liu, Y . Feng, X. Li, Y . Wang, L. Shang, Q. Liu, and L. Song, “ProxyQA: An alternative framework for evaluating long-form text generation with large language models,” inProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), L.-W. Ku, A. Martins, and V . Srikumar, ...

  45. [45]

    Reasoning-enhanced self-training for long-form personalized text generation, 2025

    A. Salemi, C. Li, M. Zhang, Q. Mei, W. Kong, T. Chen, Z. Li, M. Ben- dersky, and H. Zamani, “Reasoning-enhanced self-training for long-form personalized text generation,”arXiv preprint arXiv:2501.04167, 2025

  46. [46]

    A distributional approach to controlled text generation,

    M. Khalifa, H. Elsahar, and M. Dymetman, “A distributional approach to controlled text generation,” inInternational Conference on Learning Representations, 2020

  47. [47]

    A distributional lens for multi-aspect controllable text generation,

    Y . Gu, X. Feng, S. Ma, L. Zhang, H. Gong, and B. Qin, “A distributional lens for multi-aspect controllable text generation,” inProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022, pp. 1023–1043

  48. [48]

    An extensible plug-and-play method for multi-aspect controllable text generation,

    X. Huang, Z. Liu, P. Li, T. Li, M. Sun, and Y . Liu, “An extensible plug-and-play method for multi-aspect controllable text generation,” inProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023, pp. 15 233– 15 256

  49. [49]

    Maclasa: Multi-aspect controllable text generation via efficient sampling from compact latent space,

    H. Ding, L. Pang, Z. Wei, H. Shen, X. Cheng, and T.-S. Chua, “Maclasa: Multi-aspect controllable text generation via efficient sampling from compact latent space,” inFindings of the Association for Computational Linguistics: EMNLP 2023, 2023, pp. 4424–4436

  50. [50]

    Controllable text generation via probability density estimation in the latent space,

    Y . Gu, X. Feng, S. Ma, L. Zhang, H. Gong, W. Zhong, and B. Qin, “Controllable text generation via probability density estimation in the latent space,” inProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), A. Rogers, J. Boyd-Graber, and N. Okazaki, Eds. Toronto, Canada: Association for Computati...

  51. [51]

    Tara: Token-level attribute relation adaptation for multi-attribute controllable text generation,

    Y. Cao, J. Zhao, R. Zhang, H. Zou, and W. Mao, “Tara: Token-level attribute relation adaptation for multi-attribute controllable text generation,” in Findings of the Association for Computational Linguistics: EMNLP 2024, 2024, pp. 12 570–12 579

  52. [52]

    A review of multi-objective optimization: Methods and its applications,

    N. Gunantara, “A review of multi-objective optimization: Methods and its applications,” Cogent Engineering, vol. 5, no. 1, p. 1502242, 2018

  53. [53]

    Mix and match: Learning-free controllable text generation using energy language models,

    F. Mireshghallah, K. Goyal, and T. Berg-Kirkpatrick, “Mix and match: Learning-free controllable text generation using energy language models,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), S. Muresan, P. Nakov, and A. Villavicencio, Eds. Dublin, Ireland: Association for Computational Ling...

  54. [54]

    Cold decoding: energy-based constrained text generation with langevin dynamics,

    L. Qin, S. Welleck, D. Khashabi, and Y. Choi, “Cold decoding: energy-based constrained text generation with langevin dynamics,” in Proceedings of the 36th International Conference on Neural Information Processing Systems, 2022, pp. 9538–9551

  55. [55]

    Prefix-tuning: Optimizing continuous prompts for generation,

    X. L. Li and P. Liang, “Prefix-tuning: Optimizing continuous prompts for generation,” in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), C. Zong, F. Xia, W. Li, and R. Navigli, Eds. Online: Association for Computationa...

  56. [56]

    Suri: Multi-constraint instruction following for long-form text generation

    C. M. Pham, S. Sun, and M. Iyyer, “Suri: Multi-constraint instruction following for long-form text generation,” arXiv preprint arXiv:2406.19371, 2024

  57. [57]

    Benchmarking complex instruction-following with multiple constraints composition,

    B. Wen, P. Ke, X. Gu, L. Wu, H. Huang, J. Zhou, W. Li, B. Hu, W. Gao, J. Xu et al., “Benchmarking complex instruction-following with multiple constraints composition,” Advances in Neural Information Processing Systems, vol. 37, pp. 137 610–137 645, 2024

  58. [58]

    From complex to simple: Enhancing multi-constraint complex instruction following ability of large language models,

    Q. He, J. Zeng, Q. He, J. Liang, and Y. Xiao, “From complex to simple: Enhancing multi-constraint complex instruction following ability of large language models,” arXiv preprint arXiv:2404.15846, 2024

  59. [59]

    GPT-4 Technical Report

    J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al., “Gpt-4 technical report,” arXiv preprint arXiv:2303.08774, 2023

  60. [60]

    CAMEL: Communicative agents for “mind” exploration of large language model society,

    G. Li, H. A. A. K. Hammoud, H. Itani, D. Khizbullin, and B. Ghanem, “CAMEL: Communicative agents for “mind” exploration of large language model society,” in Thirty-seventh Conference on Neural Information Processing Systems, 2023. [Online]. Available: https://openreview.net/forum?id=3IyL2XWDkG

  61. [61]

    Autogen: Enabling next-gen LLM applications via multi-agent conversation,

    Q. Wu, G. Bansal, J. Zhang, Y. Wu, B. Li, E. Zhu, L. Jiang, X. Zhang, S. Zhang, J. Liu, A. H. Awadallah, R. W. White, D. Burger, and C. Wang, “Autogen: Enabling next-gen LLM applications via multi-agent conversation,” 2024. [Online]. Available: https://openreview.net/forum?id=tEAF9LBdgu

  62. [62]

    MetaGPT: Meta programming for a multi-agent collaborative framework,

    S. Hong, M. Zhuge, J. Chen, X. Zheng, Y. Cheng, J. Wang, C. Zhang, Z. Wang, S. K. S. Yau, Z. Lin, L. Zhou, C. Ran, L. Xiao, C. Wu, and J. Schmidhuber, “MetaGPT: Meta programming for a multi-agent collaborative framework,” in The Twelfth International Conference on Learning Representations, 2024. [Online]. Available: https://openreview.net/forum?id=VtmBAGCN7o

  63. [63]

    Rethinking the role of demonstrations: What makes in-context learning work?

    S. Min, X. Lyu, A. Holtzman, M. Artetxe, M. Lewis, H. Hajishirzi, and L. Zettlemoyer, “Rethinking the role of demonstrations: What makes in-context learning work?” in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Y. Goldberg, Z. Kozareva, and Y. Zhang, Eds. Abu Dhabi, United Arab Emirates: Association for Computa...

  64. [64]

    React: Synergizing reasoning and acting in language models,

    S. Yao, J. Zhao, D. Yu, I. Shafran, K. R. Narasimhan, and Y. Cao, “React: Synergizing reasoning and acting in language models,” in NeurIPS 2022 Foundation Models for Decision Making Workshop, 2022

  65. [65]

    Reflexion: language agents with verbal reinforcement learning,

    N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao, “Reflexion: language agents with verbal reinforcement learning,” in Proceedings of the 37th International Conference on Neural Information Processing Systems, 2023, pp. 8634–8652

  66. [66]

    Rest meets react: Self-improvement for multi-step reasoning llm agent,

    R. Aksitov, S. Miryoosefi, Z. Li, D. Li, S. Babayan, K. Kopparapu, Z. Fisher, R. Guo, S. Prakash, P. Srinivasan et al., “Rest meets react: Self-improvement for multi-step reasoning llm agent,” in ICLR 2024 Workshop on Large Language Model (LLM) Agents, 2024

  67. [67]

    Retrieval-augmented generation for knowledge-intensive nlp tasks,

    P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel et al., “Retrieval-augmented generation for knowledge-intensive nlp tasks,” Advances in neural information processing systems, vol. 33, pp. 9459–9474, 2020

  68. [68]

    Gaussian process optimization for adaptable multi-objective text generation using linearly-weighted language models,

    M. M. Abdollah Pour, A. Pesaranghader, E. Cohen, and S. Sanner, “Gaussian process optimization for adaptable multi-objective text generation using linearly-weighted language models,” in Findings of the Association for Computational Linguistics: NAACL 2024, K. Duh, H. Gomez, and S. Bethard, Eds. Mexico City, Mexico: Association for Computational Linguistics...

  69. [69]

    Conifer: Improving complex constrained instruction-following ability of large language models,

    H. Sun, L. Liu, J. Li, F. Wang, B. Dong, R. Lin, and R. Huang, “Conifer: Improving complex constrained instruction-following ability of large language models,” arXiv preprint arXiv:2404.02823, 2024

  70. [70]

    Improving factuality and reasoning in language models through multiagent debate,

    Y. Du, S. Li, A. Torralba, J. B. Tenenbaum, and I. Mordatch, “Improving factuality and reasoning in language models through multiagent debate,” in Proceedings of the 41st International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, R. Salakhutdinov, Z. Kolter, K. Heller, A. Weller, N. Oliver, J. Scarlett, and F. Berkenkam...

  71. [71]

    Generative agents: Interactive simulacra of human behavior,

    J. S. Park, J. O’Brien, C. J. Cai, M. R. Morris, P. Liang, and M. S. Bernstein, “Generative agents: Interactive simulacra of human behavior,” in Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, 2023, pp. 1–22

  72. [72]

    Exploring Large Language Models for Communication Games: An Empirical Study on Werewolf

    Y. Xu, S. Wang, P. Li, F. Luo, X. Wang, W. Liu, and Y. Liu, “Exploring large language models for communication games: An empirical study on werewolf,” arXiv preprint arXiv:2309.04658, 2023

  73. [73]

    Unleashing the emergent cognitive synergy in large language models: A task-solving agent through multi-persona self-collaboration,

    Z. Wang, S. Mao, W. Wu, T. Ge, F. Wei, and H. Ji, “Unleashing the emergent cognitive synergy in large language models: A task-solving agent through multi-persona self-collaboration,” in Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technolog...

  74. [74]

    A survey on llm-based multi-agent system: Recent advances and new frontiers in application,

    S. Chen, Y. Liu, W. Han, W. Zhang, and T. Liu, “A survey on llm-based multi-agent system: Recent advances and new frontiers in application,” [Online]. Available: https://arxiv.org/abs/2412.17481

  76. [76]

    Large language models are zero-shot reasoners,

    T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa, “Large language models are zero-shot reasoners,” Advances in neural information processing systems, vol. 35, pp. 22 199–22 213, 2022

  77. [77]

    Large Language Models as Optimizers

    C. Yang, X. Wang, Y. Lu, H. Liu, Q. V. Le, D. Zhou, and X. Chen, “Large language models as optimizers,” arXiv preprint arXiv:2309.03409, 2023

  78. [78]

    Language models are few-shot learners,

    T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., “Language models are few-shot learners,” Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020

  79. [79]

    Query expansion by prompting large language models

    R. Jagerman, H. Zhuang, Z. Qin, X. Wang, and M. Bendersky, “Query expansion by prompting large language models,” arXiv preprint arXiv:2305.03653, 2023

  80. [80]

    ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate

    C.-M. Chan, W. Chen, Y. Su, J. Yu, W. Xue, S. Zhang, J. Fu, and Z. Liu, “Chateval: Towards better llm-based evaluators through multi-agent debate,” arXiv preprint arXiv:2308.07201, 2023
