Semantic Grading of Written Answers in Low-Resource Language Bangla Using a Fine-Tuned Lightweight Language Model

Aniket Joarder; Mahmudul Hasan; Md. Mosaddek Khan; Meherun Farzana

arxiv: 2606.11931 · v1 · pith:NXPL5SBDnew · submitted 2026-06-10 · 💻 cs.CL

Pith reviewed 2026-06-27 09:30 UTC · model grok-4.3

classification 💻 cs.CL

keywords semantic grading · Bangla NLP · automatic assessment · fine-tuned language model · low-resource languages · educational feedback · QLoRA tuning

The pith

A QLoRA-tuned Qwen3-8B model grades Bangla student answers with strong agreement to human scores by assessing semantic correctness.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to demonstrate that fine-tuning a lightweight language model on a synthetic bilingual dataset can produce numeric scores and feedback for Bangla written answers that align closely with human judgment. This matters in low-resource settings where qualified teachers are scarce and manual grading is slow and inconsistent. The work constructs a bilingual evaluation system that inputs the question, reference answer, and student response, then compares the fine-tuned model against other proprietary and open-source LLMs under one protocol. Results show the tuned model yields the highest resistance to answer leakage in synthetic tests and the closest match to human ratings in a dedicated study.

Core claim

The paper claims that its QLoRA-tuned Qwen3-8B produces the most leakage-resistant feedback with RoRa of 0.819 in synthetic evaluation and the strongest agreement with human scores at rho of 0.936 and MAE of 0.725 in a human study, outperforming other models when grading Bangla answers for semantic correctness rather than surface overlap.
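The two human-agreement numbers in this claim are rank correlation (Spearman's rho) and mean absolute error (MAE). As a minimal sketch of how such figures are typically computed, assuming paired teacher and model scores on a 0 to 10 scale (the scale and the toy scores below are illustrative assumptions, not data from the paper):

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x, y):
    """Spearman's rho is the Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def mean_absolute_error(x, y):
    """Average absolute gap between paired scores."""
    return mean(abs(a - b) for a, b in zip(x, y))


# Hypothetical scores (illustrative only, not taken from the paper).
teacher = [9, 4, 7, 2, 10, 6]
model = [8, 5, 7, 1, 9, 3]

rho = spearman_rho(teacher, model)
mae = mean_absolute_error(teacher, model)
print(f"rho={rho:.3f}  MAE={mae:.3f}")
```

Rho rewards getting the ordering of answers right, while MAE measures how far individual scores drift, so a grader can do well on one and poorly on the other. That is presumably why the paper reports both.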

What carries the argument

The QLoRA-tuned Qwen3-8B model, which receives the question, reference answer, and student answer as input and outputs a numeric score plus concise feedback.
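The input and output contract described here can be sketched as follows. The prompt template, the JSON reply format, the 0 to 10 score range, and the names `GradingItem`, `build_prompt`, and `parse_reply` are all hypothetical. The source only states that the model receives the three texts and returns a score plus feedback.

```python
import json
from dataclasses import dataclass
from typing import Tuple


@dataclass
class GradingItem:
    question: str
    reference_answer: str
    student_answer: str


def build_prompt(item: GradingItem, max_score: int = 10) -> str:
    """Assemble a grading prompt from the three inputs (hypothetical template)."""
    return (
        "Grade the student answer for semantic correctness against the reference.\n"
        f"Question: {item.question}\n"
        f"Reference answer: {item.reference_answer}\n"
        f"Student answer: {item.student_answer}\n"
        f'Reply with JSON only: {{"score": <number from 0 to {max_score}>, "feedback": "<one or two sentences>"}}'
    )


def parse_reply(reply: str, max_score: int = 10) -> Tuple[float, str]:
    """Parse a JSON reply and clamp the score into the valid range."""
    data = json.loads(reply)
    score = min(max(float(data["score"]), 0.0), float(max_score))
    return score, str(data["feedback"]).strip()


# A bilingual example: Bangla question and reference, English student answer.
item = GradingItem(
    question="সালোকসংশ্লেষণ কী?",
    reference_answer="উদ্ভিদ সূর্যের আলো ব্যবহার করে খাদ্য তৈরি করে।",
    student_answer="Plants make food from sunlight.",
)
prompt = build_prompt(item)

# A stand-in reply; in practice this string would come from the tuned model.
score, feedback = parse_reply('{"score": 9, "feedback": "Correct idea, briefly stated."}')
print(score, feedback)
```

Clamping the parsed score is a defensive choice for this sketch, so that a malformed or out-of-range reply cannot leak into downstream aggregates.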

If this is right

Grading systems for Bangla can now prioritize meaning over exact wording and still reach high human agreement.
Lightweight models become viable for deployment in remote regions once tuned on synthetic data.
The same fine-tuning protocol can be applied across other open-source models to produce comparable feedback.
Bilingual reference materials improve grading consistency when student answers mix languages.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could extend to other low-resource languages by repeating the synthetic data construction step.
Integration into mobile apps for teachers would allow faster initial scoring before final human review.
Future work might test whether the same model maintains performance when questions come from different subjects.

Load-bearing premise

The synthetic bilingual dataset captures enough of the variety found in actual student answers for the measured performance to hold in real classrooms.

What would settle it

A follow-up study with several hundred real Bangla classroom answers graded independently by multiple teachers that shows agreement dropping below rho of 0.8.

Figures

Figures reproduced from arXiv: 2606.11931 by Aniket Joarder, Mahmudul Hasan, Md. Mosaddek Khan, Meherun Farzana.

**Figure 2.** Teacher–model score difference distribution across graders. The x-axis shows the score difference (teacher−model) and the y-axis shows the number of graded instances. Values near 0 indicate closer agreement with teacher scoring, while heavier tails indicate more frequent large deviations.

**Figure 3.** Answer input: (a) A student can either type or upload a handwritten response image; (b) the integrated Bangla HTR module extracts text for downstream grading.

**Figure 5.** Exam-level summary: Aggregated performance over an exam session.

Original abstract

Bangla is among the world's most widely spoken languages, yet it remains underserved in educational NLP research. In many remote and rural regions, access to qualified subject teachers is limited, and written answers are consequently graded largely by hand, restricting timely and consistent feedback. Automatic assessment is challenging because semantically correct responses can vary substantially in surface form. We present a bilingual (Bangla-English) evaluation system designed for low-resource educational settings that prioritizes semantic correctness over lexical overlap. Our approach fine-tunes a lightweight language model to grade each response using the question, reference answer, and student answer, producing a numeric score and concise, context-grounded feedback suitable for classroom deployment. We also construct a synthetic bilingual dataset to enable controlled training and evaluation. Across proprietary and open-source LLMs evaluated under a unified protocol, our QLoRA-tuned Qwen3-8B confirms consistent improvement by producing the most leakage-resistant feedback (RoRa = 0.819) in synthetic evaluation and the strongest agreement with human scores (rho = 0.936, MAE = 0.725) in a dedicated human study.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Qwen3-8B fine-tuned on synthetic Bangla data reports strong metrics, but the representativeness of the synthetic data is unverified.

The paper applies QLoRA fine-tuning to Qwen3-8B for semantic grading of Bangla student answers. It builds a bilingual synthetic dataset and reports RoRa of 0.819 on leakage resistance plus 0.936 Spearman correlation and 0.725 MAE with human scores.

This is a straightforward extension of existing LLM tuning techniques to an underserved language. The choice of a lightweight model and the unified evaluation across several LLMs are practical steps that make the work relevant for low-resource educational settings where quick feedback matters.

The results look concrete on the numbers given. The focus on semantic correctness rather than surface overlap fits the problem of variable student phrasing.

The soft spot is the synthetic dataset. The abstract describes its construction but provides no overlap statistics, error-type coverage, or length distributions comparing it to real classroom answers. Without that check, the leakage metric and human agreement are hard to trust for actual deployment. The human study is described as "dedicated", yet the abstract gives no sample size, rater count, or inter-rater agreement, which leaves the rho value difficult to interpret.
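For concreteness, the kind of check meant here could look like the sketch below, which compares answer lengths and vocabulary overlap between a synthetic set and a real set. Whitespace tokenization, Jaccard overlap as the statistic, and the toy English answers are all assumptions made for illustration. The paper does not specify any such procedure.

```python
from statistics import mean, median


def tokenize(text):
    """Lowercase and split on whitespace (a crude, assumed tokenizer)."""
    return text.lower().split()


def length_summary(answers):
    """Summarize answer lengths in tokens."""
    lengths = [len(tokenize(a)) for a in answers]
    return {"mean": mean(lengths), "median": median(lengths), "max": max(lengths)}


def vocabulary_jaccard(answers_a, answers_b):
    """Jaccard overlap between the vocabularies of two answer sets."""
    vocab_a = {tok for a in answers_a for tok in tokenize(a)}
    vocab_b = {tok for b in answers_b for tok in tokenize(b)}
    return len(vocab_a & vocab_b) / len(vocab_a | vocab_b)


# Toy data (hypothetical, not from the paper).
synthetic = ["plants make food using sunlight", "water boils at 100 degrees"]
real = ["plants use sunlight to make food", "ice melts at 0 degrees"]

print("synthetic lengths:", length_summary(synthetic))
print("real lengths:", length_summary(real))
print("vocabulary overlap:", vocabulary_jaccard(synthetic, real))
```

A large gap in either statistic would be a signal that the synthetic answers do not look like what students actually write, though neither statistic says anything about error types.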

This is for people working on multilingual educational NLP or edtech tools aimed at South Asia and similar regions. A reader looking for concrete metrics on Bangla grading would find the reported comparisons useful.

It deserves peer review. The application is specific enough and the numbers invite direct scrutiny on data construction and evaluation design.

Referee Report

2 major / 2 minor

Summary. The paper presents a bilingual (Bangla-English) system for semantic grading of student written answers in low-resource settings. It constructs a synthetic dataset, fine-tunes Qwen3-8B via QLoRA to score responses and generate feedback based on question, reference answer, and student answer, and reports that this model outperforms other LLMs with RoRa=0.819 leakage resistance on synthetic evaluation and strongest human agreement (rho=0.936, MAE=0.725) in a dedicated study, prioritizing semantic correctness.

Significance. If the synthetic dataset proves representative and the human-study results generalize, the work offers a deployable lightweight tool for consistent, timely feedback in Bangla classrooms where qualified teachers are scarce. The emphasis on leakage-resistant feedback and open-source model choice are practical strengths for low-resource NLP in education.

major comments (2)

[Dataset construction] Dataset construction section: no quantitative validation (overlap statistics, error-type coverage, length/topic distributions) is provided between the synthetic bilingual answers and real student responses from Bangla classrooms. This directly undermines transferability of both the RoRa=0.819 leakage result and the human-agreement metrics to the claimed classroom use case.
[Human study] Human study section: the study is labeled only as 'dedicated' with no reported scale (number of answers/graders), inter-rater agreement (e.g., Cohen's kappa), sampling frame, or exclusion criteria. These omissions make the rho=0.936 / MAE=0.725 figures impossible to interpret for reliability or statistical significance.
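The inter-rater statistic named in the second comment, Cohen's kappa, corrects raw agreement between two graders for the agreement expected by chance. A minimal sketch follows, assuming scores have been binned into categorical bands. The bands and the two graders' labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical score bands from two graders (illustrative only).
grader_1 = ["high", "low", "mid", "high", "low", "mid", "high", "low"]
grader_2 = ["high", "low", "mid", "mid", "low", "mid", "high", "mid"]

kappa = cohens_kappa(grader_1, grader_2)
print(f"kappa={kappa:.3f}")
```

Unweighted kappa treats every disagreement as equally severe. For ordinal scores, a weighted variant that penalizes distant bands more heavily may be the better fit.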

minor comments (2)

[Abstract] Abstract: the 'unified protocol' for comparing proprietary and open-source LLMs is not summarized; adding one sentence on prompt format, temperature, and scoring rubric would improve clarity.
[Evaluation metrics] Evaluation: RoRa is introduced without an explicit formula or reference; a short definition or equation in the metrics subsection would prevent reader confusion.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. The comments identify clear gaps in reporting that affect interpretability and generalizability. We address each point below and will revise the manuscript to incorporate the requested information where feasible.

Referee: [Dataset construction] Dataset construction section: no quantitative validation (overlap statistics, error-type coverage, length/topic distributions) is provided between the synthetic bilingual answers and real student responses from Bangla classrooms. This directly undermines transferability of both the RoRa=0.819 leakage result and the human-agreement metrics to the claimed classroom use case.

Authors: We acknowledge that the current manuscript provides no quantitative comparison between the synthetic dataset and real Bangla classroom responses. The synthetic data was generated to control for specific semantic variations and error patterns, enabling the leakage-resistance evaluation (RoRa). However, the referee is correct that this limits claims about transfer to real classrooms. In the revised manuscript we will add a dedicated subsection under Dataset Construction that reports overlap statistics, error-type coverage, length and topic distributions, using any available real student answer samples. If real data access is limited, we will explicitly note the scope of the comparison.

revision: yes

Referee: [Human study] Human study section: the study is labeled only as 'dedicated' with no reported scale (number of answers/graders), inter-rater agreement (e.g., Cohen's kappa), sampling frame, or exclusion criteria. These omissions make the rho=0.936 / MAE=0.725 figures impossible to interpret for reliability or statistical significance.

Authors: We agree that the human-study description is insufficient. The manuscript reports only the aggregate metrics without the underlying study parameters. In the revision we will expand the Human Study section to include the number of answers and graders, inter-rater agreement (Cohen's kappa or equivalent), sampling frame, exclusion criteria, and any power or significance considerations for the reported rho and MAE. These details were collected during the study and will be added for transparency.

revision: yes

Circularity Check

0 steps flagged

No circularity; purely empirical evaluation

The paper reports fine-tuning results (QLoRA on Qwen3-8B) and empirical metrics (RoRa on synthetic data, Spearman rho and MAE on human scores) with no equations, derivations, or parameter-fitting steps that reduce to inputs by construction. Dataset construction and evaluation are described as standard supervised training plus held-out testing; no load-bearing self-citations, uniqueness theorems, or smuggled ansatzes appear. The central claims rest on direct comparison to external human judgments and leakage metrics rather than any definitional or fitted-input reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Central claim rests on transfer from synthetic data to real answers and on the chosen metrics (RoRa, rho, MAE) adequately capturing semantic correctness; no explicit free parameters, axioms, or invented entities are stated in the abstract.

