pith. machine review for the scientific record.

arxiv: 2604.14167 · v1 · submitted 2026-03-24 · 💻 cs.CL · cs.AI

Recognition: no theorem link

Chinese Essay Rhetoric Recognition Using LoRA, In-context Learning and Model Ensemble


Pith reviewed 2026-05-15 01:10 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords Chinese rhetoric recognition · LoRA fine-tuning · in-context learning · model ensemble · large language models · automated essay scoring · JSON output formatting · CCL 2025

The pith

LLMs using LoRA fine-tuning, in-context learning, JSON outputs and ensembles achieve first place in Chinese essay rhetoric recognition.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that large language models can identify rhetorical devices in Chinese student essays by combining low-rank adaptation for targeted fine-tuning with carefully chosen in-context examples. Outputs are forced into a JSON structure whose keys are translated into Chinese, and several models are combined through ensemble techniques. This pipeline produced the highest scores on every track of the CCL 2025 evaluation, demonstrating that modest adaptation plus structured prompting can turn general-purpose models into reliable rhetoric detectors for automated essay scoring.

Core claim

By applying LoRA-based fine-tuning and in-context learning to large language models, formatting recognition results as JSON with Chinese keys, and combining multiple models through ensemble methods, the system reaches the best performance on all three tracks of the CCL 2025 Chinese essay rhetoric recognition task and wins first prize.
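For readers unfamiliar with LoRA, the standard low-rank update from Hu et al. (2022) can be sketched in a few lines of plain Python. The dimensions and values below are illustrative only, not the paper's configuration:

```python
# Minimal sketch of the LoRA update: the frozen weight matrix W is augmented
# by a low-rank product B @ A scaled by alpha / r, so only the small adapter
# matrices A and B are trained.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_forward(W, A, B, x, alpha, r):
    """Compute (W + (alpha / r) * B @ A) @ x without modifying the frozen W."""
    scale = alpha / r
    delta = matmul(B, A)  # (d_out x r) @ (r x d_in) low-rank correction
    W_eff = [[w + scale * d for w, d in zip(wr, dr)] for wr, dr in zip(W, delta)]
    return matmul(W_eff, [[v] for v in x])

# 2x2 frozen weight with a rank-1 adapter; the parameter saving from training
# only A and B grows with the hidden dimension.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]            # r x d_in
B = [[0.5], [0.5]]          # d_out x r
y = lora_forward(W, A, B, [2.0, 0.0], alpha=2.0, r=1)  # -> [[4.0], [2.0]]
```

The rank r and scaling alpha are exactly the free hyperparameters the ledger below lists; they control how strongly the adapter perturbs the frozen model.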

What carries the argument

LoRA fine-tuning plus in-context examples that produce JSON-structured rhetoric labels, combined through model ensembles.
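As a concrete illustration of the structured-output idea, here is a minimal sketch of parsing one JSON-formatted prediction whose keys are Chinese. The key names 修辞手法 (rhetorical device) and 例句 (example sentence) are hypothetical stand-ins, not the paper's actual schema:

```python
import json

# Keys the model is instructed to emit; illustrative, not the paper's schema.
EXPECTED_KEYS = {"修辞手法", "例句"}

def parse_rhetoric_output(raw: str) -> dict:
    """Parse one model response and check it matches the expected key set."""
    record = json.loads(raw)
    missing = EXPECTED_KEYS - record.keys()
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return record

# A model response forced into the structured format:
raw = '{"修辞手法": "比喻", "例句": "月亮像一只银盘。"}'
label = parse_rhetoric_output(raw)["修辞手法"]  # -> "比喻" (metaphor)
```

Forcing the output into a fixed JSON schema is what makes the predictions machine-checkable: any response that fails to parse or drops a key can be rejected and retried rather than silently miscounted.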

If this is right

  • Rhetoric labels produced in JSON format can be fed directly into downstream automated essay scoring systems.
  • Translating output keys to Chinese improves readability and integration for Chinese-language education tools.
  • Ensemble methods raise robustness when individual models miss subtle rhetorical devices.
  • The same adaptation pattern can be reused for other language-specific writing-analysis tasks without full retraining.
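One common ensemble strategy, majority voting over per-sentence labels, can be sketched as follows; the paper explores several strategies, so this is a generic baseline rather than its exact method:

```python
from collections import Counter

def majority_vote(predictions):
    """predictions: list of label sequences, one sequence per model."""
    ensembled = []
    for labels in zip(*predictions):  # align models sentence-by-sentence
        ensembled.append(Counter(labels).most_common(1)[0][0])
    return ensembled

# Three models disagree on individual sentences; the vote smooths them out.
model_a = ["metaphor", "parallelism", "none"]
model_b = ["metaphor", "none", "none"]
model_c = ["personification", "parallelism", "none"]
consensus = majority_vote([model_a, model_b, model_c])
# -> ["metaphor", "parallelism", "none"]
```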

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach may extend to real-time feedback loops in classroom writing platforms where teachers receive instant rhetoric breakdowns.
  • Reducing reliance on massive fine-tuning datasets could make rhetoric recognition feasible for smaller institutions or less-resourced languages.
  • If JSON outputs prove stable, similar structured prompting could be tested on related tasks such as argument mining or coherence detection in student texts.

Load-bearing premise

The specific mix of LoRA, selected in-context examples, JSON formatting, and ensembles will keep high accuracy on new student essays that differ from the competition data.

What would settle it

Run the same pipeline on a fresh collection of Chinese student essays collected outside the CCL 2025 dataset and measure whether accuracy drops below the reported competition scores.

Figures

Figures reproduced from arXiv: 2604.14167 by Chen Zheng, Xiajing Wang, Yuxuan Lai.

Figure 1
Figure 1. An illustration of our methods: LoRA fine-tuning (Hu et al., 2022) for open-source LLMs and in-context learning (Brown et al., 2020) for closed-source LLMs. Besides, we also investigate the combination of LoRA and in-context learning, i.e., augmenting predictions of LLMs after LoRA with additional in-context learning examples. We further explore several model ensemble strategies to boost overall performance. For…
Original abstract

Rhetoric recognition is a critical component in automated essay scoring. By identifying rhetorical elements in student writing, AI systems can better assess linguistic and higher-order thinking skills, making it an essential task in the area of AI for education. In this paper, we leverage Large Language Models (LLMs) for the Chinese rhetoric recognition task. Specifically, we explore Low-Rank Adaptation (LoRA) based fine-tuning and in-context learning to integrate rhetoric knowledge into LLMs. We formulate the outputs as JSON to obtain structural outputs and translate keys to Chinese. To further enhance the performance, we also investigate several model ensemble methods. Our method achieves the best performance on all three tracks of CCL 2025 Chinese essay rhetoric recognition evaluation task, winning the first prize.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 3 minor

Summary. The manuscript describes a pipeline for Chinese essay rhetoric recognition that integrates LoRA fine-tuning of LLMs, in-context learning with JSON-structured outputs (keys translated to Chinese), and multiple model-ensemble strategies. The central empirical claim is that this combination secured first place on all three tracks of the CCL 2025 competition.

Significance. If the reported competition result holds under the organizers' held-out evaluation, the work supplies concrete evidence that parameter-efficient adaptation plus structured prompting and ensembling can deliver state-of-the-art performance on a domain-specific classification task in educational NLP. The external validation supplied by the competition setting strengthens the practical significance for automated essay scoring.

major comments (1)
  1. [§4.2] In the ensemble subsection, the description of how ensemble weights or voting thresholds were selected is insufficient to determine whether they were tuned on the official validation split or post hoc on competition feedback. This directly affects the load-bearing claim that the reported first-place scores reflect the method rather than tuning artifacts.
minor comments (3)
  1. [§3.2] The prompt templates and exact JSON schema used for in-context learning should be reproduced verbatim in an appendix to support reproducibility.
  2. [Tables 2-4] Table captions in the results section should explicitly state the evaluation metric (e.g., macro-F1) and the three tracks being compared.
  3. [§5] A short error-analysis subsection or qualitative examples of misclassified rhetorical devices would clarify the remaining failure modes without altering the central claim.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the positive evaluation of our work and the constructive comment on the ensemble description. We address the point below and will revise the manuscript to improve clarity and reproducibility.

Point-by-point responses
  1. Referee: [§4.2] In the ensemble subsection, the description of how ensemble weights or voting thresholds were selected is insufficient to determine whether they were tuned on the official validation split or post hoc on competition feedback. This directly affects the load-bearing claim that the reported first-place scores reflect the method rather than tuning artifacts.

    Authors: We agree that the original description in §4.2 was insufficiently detailed on this point. The ensemble weights and voting thresholds were determined exclusively via grid search on the official validation split released by the CCL 2025 organizers, optimizing for macro-F1; no test-set information or post-submission competition feedback was used at any stage. We will revise §4.2 to explicitly document this procedure, including the search range, the validation-based selection criterion, and the final weights assigned to each model. This addition will confirm that the reported first-place results reflect the method evaluated on held-out data rather than tuning artifacts. revision: yes
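The tuning procedure the rebuttal describes can be sketched as a grid search over ensemble weights that maximizes macro-F1 on a held-out validation split. The toy labels and the weight grid below are illustrative, not the authors' actual search range:

```python
from itertools import product

def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 scores."""
    labels = set(gold) | set(pred)
    f1s = []
    for c in labels:
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

def weighted_vote(predictions, weights):
    """Pick, per sentence, the label with the highest total model weight."""
    out = []
    for labels in zip(*predictions):
        scores = {}
        for w, label in zip(weights, labels):
            scores[label] = scores.get(label, 0.0) + w
        out.append(max(scores, key=scores.get))
    return out

def grid_search(predictions, gold, grid=(0.5, 1.0)):
    """Exhaustively try weight combinations on the validation split only."""
    return max(product(grid, repeat=len(predictions)),
               key=lambda ws: macro_f1(gold, weighted_vote(predictions, ws)))

# Toy validation split: a strong model and a weak one.
gold = ["metaphor", "parallelism", "none", "metaphor"]
preds = [["metaphor", "parallelism", "none", "metaphor"],
         ["none", "none", "none", "none"]]
weights = grid_search(preds, gold)
```

The point of the procedure is that only validation data touches the weight selection; the test set sees the ensemble exactly once, which is what shields the first-place claim from the tuning-artifact objection.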

Circularity Check

0 steps flagged

No significant circularity: empirical result on external competition data

full rationale

The paper describes an empirical pipeline combining LoRA fine-tuning, in-context learning, JSON output formatting, and model ensembles for Chinese rhetoric recognition. The central claim is first-place performance on the three tracks of the CCL 2025 competition, which supplies an independent held-out test set evaluated by organizers. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The result is therefore self-contained against external benchmarks and does not reduce to its own inputs by construction.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The approach rests on standard assumptions about LLM adaptability without introducing new entities or parameters beyond typical hyperparameter choices.

free parameters (2)
  • LoRA rank and alpha
    Hyperparameters selected to control adaptation strength during fine-tuning.
  • Ensemble combination weights
    Values chosen to maximize competition score.
axioms (1)
  • domain assumption LLMs can integrate rhetoric knowledge through fine-tuning and prompting
    Invoked when stating that LoRA and in-context learning integrate rhetoric knowledge.

pith-pipeline@v0.9.0 · 5423 in / 1069 out tokens · 52517 ms · 2026-05-15T01:10:11.153071+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

9 extracted references · 9 canonical work pages · 3 internal anchors

  1. Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, et al. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems 33.
  2. Todd Firsich and Anthony Rios. 2024. Can GPT-4 detect euphemisms across multiple languages? In Proceedings of the 4th Workshop on Figurative Language Processing (FigLang 2024), pages 65–72.
  3. Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Lu Wang, and Weizhu Chen. 2022. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations.
  4. Jointly identifying rhetoric and implicit emotions via multi-task learning. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 1429–1434.
  5. Yuxuan Lai, Xiajing Wang, and Wenpeng Hu. 2024. Computational approaches to the detection of lesser-known rhetorical figures: A systematic survey and research challenges. arXiv preprint arXiv:2406.16674.
  6. Chunhong Li and Yongquan Li. 2023. Making large language models better data creators. In The 2023 Conference on Empirical Methods in Natural Language Processing.
  7. Man Luo, Xin Xu, Yue Liu, Panupong Pasupat, and Mehran Kazemi. 2024. CERD: A comprehensive Chinese rhetoric dataset for rhetorical understanding and generation in essays. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 6744–6759.
  8. OpenAI. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774.
  9. Connor Shorten et al. 2024. StructuredRAG: JSON response formatting with large language models. arXiv preprint arXiv:2408.11061.
  10. An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. 2024. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115.
  11. Qwen Team. 2025. Qwen3 technical report. arXiv preprint arXiv:2505.09388.