BamiBERT: A New BERT-based Language Model for Vietnamese

Chi Tran; Dat Quoc Nguyen; Linh The Nguyen; Thinh Pham

arxiv: 2607.02259 · v1 · pith:SL54BS7Cnew · submitted 2026-07-02 · 💻 cs.CL

Pith reviewed 2026-07-03 14:26 UTC · model grok-4.3

classification 💻 cs.CL

keywords Vietnamese · BERT · language model · pre-training · natural language processing · benchmarks · raw text input · context length

The pith

BamiBERT sets a new state of the art for base-sized Vietnamese encoders by training on raw text with a 2048-token context.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces BamiBERT as a BERT-based model for Vietnamese that improves on prior work by training from scratch on a 129GB general-domain corpus for 20 epochs. It processes raw input directly without external word segmentation and handles contexts up to 2048 tokens. Across eight benchmarks it records the top score on 11 of 15 metrics and second place on three others, while also showing strong results on text from varied domains. A sympathetic reader would care because this points to simpler, higher-performing language tools for Vietnamese that avoid extra preprocessing stages.

Core claim

BamiBERT is a new pre-trained language model for Vietnamese that achieves the best score on 11 of 15 metrics and second-best on three others across eight benchmarks, setting a new state of the art among base-sized Vietnamese encoders while demonstrating strong cross-domain generalization; it is trained from scratch on 129GB of general-domain text for 20 epochs, supports up to 2048 tokens, and operates directly on raw input without external word segmentation.
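The headline counts in the core claim can be unpacked with a little arithmetic. The sketch below only restates the numbers given above (15 metrics, 11 best, 3 second-best); the function name `tally` is illustrative and not from the paper.

```python
from fractions import Fraction


def tally(total_metrics: int, best: int, second_best: int) -> dict:
    """Summarize how the reported metrics split across rank buckets."""
    if best + second_best > total_metrics:
        raise ValueError("rank counts exceed the number of metrics")
    return {
        "best_share": Fraction(best, total_metrics),
        "top_two_share": Fraction(best + second_best, total_metrics),
        "outside_top_two": total_metrics - best - second_best,
    }


# Counts taken from the claim: 11 best and 3 second-best out of 15 metrics.
summary = tally(total_metrics=15, best=11, second_best=3)
print(summary)
```

Read this way, the claim leaves exactly one of the 15 metrics on which BamiBERT is neither best nor second-best.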

What carries the argument

The BamiBERT model, a BERT architecture pre-trained directly on raw Vietnamese text with extended 2048-token context length.

If this is right

Vietnamese NLP applications can reach higher accuracy without relying on separate word segmentation tools.
Extended context windows enable better handling of longer Vietnamese documents in tasks like summarization or question answering.
Training on large general-domain raw text produces models that generalize across different text domains.
The released model allows direct use in downstream Vietnamese language tasks without additional pre-processing pipelines.
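To make the point about extended context windows concrete, here is a minimal sketch of the sliding-window splitting that a shorter-context encoder needs for a long document. The 256-token limit, the 32-token overlap, and the helper name `split_into_windows` are assumptions chosen for illustration. Only the 2048-token figure comes from the paper, and the sketch ignores special tokens.

```python
from typing import List


def split_into_windows(token_ids: List[int], max_len: int, stride: int) -> List[List[int]]:
    """Split a token sequence into overlapping windows of at most max_len tokens.

    stride is the number of tokens shared by two consecutive windows.
    """
    if max_len <= 0 or not 0 <= stride < max_len:
        raise ValueError("need max_len > 0 and 0 <= stride < max_len")
    if len(token_ids) <= max_len:
        return [list(token_ids)]
    step = max_len - stride
    windows = []
    start = 0
    while True:
        windows.append(list(token_ids[start:start + max_len]))
        if start + max_len >= len(token_ids):
            break
        start += step
    return windows


# Stand-in token ids for an 1800-token document (not real tokenizer output).
doc = list(range(1800))
short_windows = split_into_windows(doc, max_len=256, stride=32)   # hypothetical short limit
long_windows = split_into_windows(doc, max_len=2048, stride=32)   # limit reported for BamiBERT
print(len(short_windows), len(long_windows))
```

With a short limit, the document is spread over several windows whose predictions then have to be merged. Under the 2048-token limit it fits in a single pass.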

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same raw-text training approach without segmentation could be tested on other languages that currently depend on word boundary tools.
If the performance holds, larger raw corpora might reduce the need for language-specific linguistic resources in model development.
The 2048-token context opens the possibility of document-level tasks that previous shorter-context Vietnamese models could not address directly.

Load-bearing premise

The benchmarks and metrics used are fair, unbiased, and representative of real Vietnamese language understanding with equivalent comparison conditions to prior models.

What would settle it

Re-running all evaluations with PhoBERT under the exact same training and inference conditions as BamiBERT and finding that PhoBERT matches or exceeds the reported scores would falsify the superiority claim.

Original abstract

In this paper, we introduce BamiBERT, a new BERT-based pre-trained language model for Vietnamese that addresses key limitations of PhoBERT -- the current de facto Vietnamese text encoder. Trained from scratch on a 129GB corpus of general-domain Vietnamese text for 20 epochs, BamiBERT supports an extended context length of up to 2048 tokens and operates directly on raw input, eliminating the need for external word segmentation. Across 8 Vietnamese benchmarks, it achieves the best score on 11 of 15 metrics and the second-best on 3 others, setting a new state of the art among "base"-sized Vietnamese encoders and demonstrating strong cross-domain generalization. We release BamiBERT at: https://huggingface.co/Qualcomm-AI-Research/BamiBERT
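A minimal sketch of how the released checkpoint might be used on raw sentences with the Hugging Face `transformers` library. It assumes, without confirmation from the abstract, that the repository loads through `AutoTokenizer` and `AutoModel` with no extra options, and that taking the first-token vector is a reasonable sentence representation. The helper names `repo_id_from_url` and `embed` are hypothetical.

```python
from urllib.parse import urlparse

# URL as given in the abstract.
MODEL_URL = "https://huggingface.co/Qualcomm-AI-Research/BamiBERT"


def repo_id_from_url(url: str) -> str:
    """Extract the 'organization/name' repository id from a model URL."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) < 2:
        raise ValueError("URL does not contain an organization and model name")
    return "/".join(parts[:2])


def embed(sentences, repo_id=repo_id_from_url(MODEL_URL), max_length=2048):
    """Encode raw sentences and return one first-token vector per sentence.

    Calling this downloads the model. Loading via the Auto* classes is an
    assumption, not something the abstract states.
    """
    # Imported lazily so the URL helper works without these packages installed.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModel.from_pretrained(repo_id)
    model.eval()
    # Raw text goes straight to the tokenizer: no external word segmentation.
    batch = tokenizer(list(sentences), padding=True, truncation=True,
                      max_length=max_length, return_tensors="pt")
    with torch.no_grad():
        output = model(**batch)
    return output.last_hidden_state[:, 0]


print(repo_id_from_url(MODEL_URL))
```

The `max_length=2048` default mirrors the context length reported in the abstract. The model card should be checked for the exact loading instructions.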

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

BamiBERT is a standard BERT retrain on Vietnamese data with raw input and 2048-token context, but the SOTA claims rest on abstract-level benchmark numbers without methods details.

The main point is that this is a BERT model trained from scratch on 129GB of Vietnamese text for 20 epochs, with two practical tweaks: it takes raw input without external segmentation and supports up to 2048 tokens. It reports the best score on 11 of 15 metrics across 8 benchmarks, positions itself against PhoBERT, and is released on Hugging Face.

The useful part is the scale of the corpus and the release itself. Vietnamese NLP has fewer resources than English, so a new base encoder that removes the segmentation step and handles longer sequences can save people time on downstream tasks. The cross-domain generalization note is also straightforward to check if the numbers hold.

The soft spot is the evaluation. The abstract gives no hyperparameters, no description of how baselines were implemented, no preprocessing steps, and no error bars or significance tests. The stress-test concern about tokenization is worth taking seriously here: if PhoBERT needs segmented input and BamiBERT does not, any comparison needs to show the test sets were prepared identically, otherwise the gains could partly come from that difference rather than model quality. Without those controls spelled out, the performance edge is hard to assess.
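One way to make the tokenization control checkable is to verify that the raw and word-segmented copies of a test set contain the same underlying text. The sketch below assumes the common convention for Vietnamese segmenters in which the syllables of a multi-syllable word are joined by underscores and punctuation is split off by spaces. The example sentence and the helper names are illustrative, not taken from the paper's data.

```python
import unicodedata


def canonical(text: str) -> str:
    """Reduce a sentence to a form that ignores segmentation and spacing."""
    text = unicodedata.normalize("NFC", text)   # unify composed/decomposed diacritics
    text = text.replace("_", " ")               # undo underscore-joined syllables
    return "".join(text.split())                # drop all whitespace


def same_underlying_text(raw: str, segmented: str) -> bool:
    """True when the two inputs differ only in segmentation and spacing."""
    return canonical(raw) == canonical(segmented)


raw_example = "Chúng tôi là những nghiên cứu viên."
segmented_example = "Chúng_tôi là những nghiên_cứu_viên ."
print(same_underlying_text(raw_example, segmented_example))
```

A check like this only confirms that both models saw the same sentences. It treats literal underscores in the raw text as spaces, and it says nothing about how much of a score difference is due to segmentation quality.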

This paper is for people who work on Vietnamese or other lower-resource languages and need an updated encoder for their pipelines. Readers focused on practical improvements rather than new pretraining methods will find the most value.

It deserves a serious referee because a public model with claimed gains on a language that is underrepresented in the literature is worth verifying, even if the current write-up is thin on methods. The full paper would need to add the missing experimental details to make the central claims convincing.

Referee Report

2 major / 1 minor

Summary. The paper introduces BamiBERT, a BERT-based pre-trained language model for Vietnamese trained from scratch on a 129GB general-domain corpus for 20 epochs. It supports an extended context length of 2048 tokens and operates directly on raw input without external word segmentation, in contrast to PhoBERT. The central empirical claim is that across 8 Vietnamese benchmarks it achieves the best score on 11 of 15 metrics and second-best on 3 others, establishing a new state of the art among base-sized Vietnamese encoders with strong cross-domain generalization. The model is released publicly on Hugging Face.

Significance. If the benchmark results hold under equivalent evaluation conditions, BamiBERT would provide a stronger base encoder for Vietnamese NLP, particularly for tasks benefiting from longer context and reduced preprocessing. The public model release is a clear strength that enables direct reproducibility and downstream use.

major comments (2)

[Abstract] The central SOTA claim rests on benchmark wins, yet the manuscript supplies no details on training hyperparameters, data preprocessing pipelines, baseline re-implementations, statistical testing, or error bars. This prevents verification that the reported gains reflect model quality rather than uncontrolled variables.
[Abstract] The evaluation contrasts BamiBERT's raw-input operation with PhoBERT's segmentation requirement, but provides no information on whether test sets were prepared identically for both models (e.g., pre-segmented or raw). If tokenization differences contribute to the 11/15 metric wins, the cross-model SOTA conclusion does not follow.
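The statistical testing asked for in the first major comment could take several forms. One common choice, shown here purely as a sketch, is a paired bootstrap over per-example correctness of two systems on the same test set. Nothing in the abstract says the authors use this test, and the inputs below are synthetic placeholders rather than results from the paper.

```python
import random
from typing import Sequence


def paired_bootstrap(correct_a: Sequence[int], correct_b: Sequence[int],
                     n_resamples: int = 2000, seed: int = 0) -> float:
    """Estimate how often system A fails to beat system B under resampling.

    correct_a[i] and correct_b[i] are 1 if the system got test example i right,
    else 0. Returns the fraction of bootstrap resamples in which A's number of
    correct answers is not strictly higher than B's.
    """
    if len(correct_a) != len(correct_b) or len(correct_a) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    rng = random.Random(seed)
    n = len(correct_a)
    not_better = 0
    for _ in range(n_resamples):
        # Resample test examples with replacement, keeping the pairing intact.
        idx = [rng.randrange(n) for _ in range(n)]
        diff = sum(correct_a[i] - correct_b[i] for i in idx)
        if diff <= 0:
            not_better += 1
    return not_better / n_resamples


# Synthetic placeholder outcomes, only to show the call shape.
always_right = [1] * 50
always_wrong = [0] * 50
print(paired_bootstrap(always_right, always_wrong))
```

A small returned value suggests the gap is unlikely to be a sampling artifact of the test set. It does not account for variance across training seeds, which would need multiple fine-tuning runs per model.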

minor comments (1)

The abstract states the corpus size and epoch count but does not indicate the exact training objective, optimizer schedule, or hardware used; adding a brief methods paragraph would improve clarity without altering the core claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments below and will revise the manuscript to improve clarity and reproducibility.

Referee: [Abstract] The central SOTA claim rests on benchmark wins, yet the manuscript supplies no details on training hyperparameters, data preprocessing pipelines, baseline re-implementations, statistical testing, or error bars. This prevents verification that the reported gains reflect model quality rather than uncontrolled variables.

Authors: We agree that the current version lacks sufficient detail for full verification. In the revised manuscript we will add a dedicated experimental setup subsection that reports all training hyperparameters, the complete data preprocessing pipeline, how each baseline was sourced or re-implemented, and any statistical testing or error bars obtained from multiple runs.
Revision: yes

Referee: [Abstract] The evaluation contrasts BamiBERT's raw-input operation with PhoBERT's segmentation requirement, but provides no information on whether test sets were prepared identically for both models (e.g., pre-segmented or raw). If tokenization differences contribute to the 11/15 metric wins, the cross-model SOTA conclusion does not follow.

Authors: We acknowledge the need to document evaluation conditions explicitly. The revised version will include a paragraph clarifying that every model (including baselines) received test inputs prepared under the protocol appropriate to that model, with raw text supplied to BamiBERT and segmented text supplied to PhoBERT; we will also discuss the contribution of the segmentation difference to the observed gains.
Revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical training and benchmark evaluation

The paper introduces BamiBERT via standard pre-training on a 129GB corpus followed by evaluation on external Vietnamese benchmarks. It contains no derivation chain, no equations, no fitted parameters renamed as predictions, and no load-bearing self-citations. The central claim (SOTA on 11/15 metrics) rests on reported benchmark numbers, which do not reduce to the paper's own inputs. This is the expected non-finding for an applied NLP model paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the standard transformer architecture and pretraining objectives being effective for Vietnamese, plus the assumption that the chosen benchmarks measure genuine capability gains.

axioms (1)

domain assumption: Standard BERT masked language modeling and next-sentence prediction objectives transfer effectively to Vietnamese text.
The paper applies BERT pretraining without introducing new objectives or proving transfer.

[1] [1]

Proceedings of LREC

Minh, Nguyen and Tran, Vu Hoang and Hoang, Vu and Ta, Huy Duc and Bui, Trung Huu and Truong, Steven Quoc Hung. Proceedings of LREC. 2022

2022

[2] [2]

Morris and Sarath Chandar , year=

Lola Le Breton and Quentin Fournier and Mariam El Mezouar and John X. Morris and Sarath Chandar , year=

[3] [3]

Proceedings of the ACL

Warner, Benjamin and Chaffin, Antoine and Clavi. Proceedings of the ACL. 2025

2025

[4] [4]

2020 , pages=

Chau, Chieu-Nguyen and Nguyen, Truong-Son and Nguyen, Le-Minh , booktitle=. 2020 , pages=

2020

[5] [5]

Proceedings of PACLIC

Bui, The Viet and Tran, Thi Oanh and Le-Hong, Phuong. Proceedings of PACLIC. 2020

2020

[6] [6]

Bai, Shuai and Chen, Keqin and Liu, Xuejing and Wang, Jialin and Ge, Wenbin and Song, Sibo and Dang, Kai and Wang, Peng and Wang, Shijie and Tang, Jun and Zhong, Humen and Zhu, Yuanzhi and Yang, Mingkun and Li, Zhaohai and Wan, Jianqiang and Wang, Pengfei and Ding, Wei and Fu, Zheren and Xu, Yiheng and Ye, Jiabo and Zhang, Xi and Xie, Tianbao and Cheng, Z...

[7] [7]

Aaron van den Oord and Yazhe Li and Oriol Vinyals , year=

[8] [8]

Wang, Liang and Yang, Nan and Huang, Xiaolong and Jiao, Binxing and Yang, Linjun and Jiang, Daxin and Majumder, Rangan and Wei, Furu , journal=

[9] [9]

Li, Zehan and Zhang, Xin and Zhang, Yanzhao and Long, Dingkun and Xie, Pengjun and Zhang, Meishan , journal=

[10] [10]

Proceedings of EMNLP-IJCNLP

Reimers, Nils and Gurevych, Iryna. Proceedings of EMNLP-IJCNLP. 2019

2019

[11] [11]

Wang, Peng and Bai, Shuai and Tan, Sinan and Wang, Shijie and Fan, Zhihao and Bai, Jinze and Chen, Keqin and Liu, Xuejing and Wang, Jialin and Ge, Wenbin and Fan, Yang and Dang, Kai and Du, Mengfei and Ren, Xuancheng and Men, Rui and Liu, Dayiheng and Zhou, Chang and Zhou, Jingren and Lin, Junyang , journal=

[12] [12]

2023 , pages =

Zhai, Xiaohua and Mustafa, Basil and Kolesnikov, Alexander and Beyer, Lucas , booktitle =. 2023 , pages =

2023

[13] [13]

SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features , author=. arXiv preprint arXiv:2502.14786 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[14] [14]

International conference on machine learning , pages=

Learning transferable visual models from natural language supervision , author=. International conference on machine learning , pages=. 2021 , organization=

2021

[15] [15]

arXiv preprint arXiv:2205.12522 , year=

Crossmodal-3600: A massively multilingual multimodal evaluation dataset , author=. arXiv preprint arXiv:2205.12522 , year=

work page arXiv

[16] [16]

Wissam Antoun and Francis Kulumba and Rian Touchent and Éric de la Clergerie and Benoît Sagot and Djamé Seddah , journal=

[17] [17]

arXiv preprint arXiv:2401.08100 , year=

KTVIC: A Vietnamese Image Captioning Dataset on the Life Domain , author=. arXiv preprint arXiv:2401.08100 , year=

work page arXiv

[18] [18]

30VNFoods: A Dataset for Vietnamese Foods Recognition , year=

Do, Trong-Hop and Nguyen, Duc-Duy-Anh and Dang, Hoang-Quan and Nguyen, Hoang-Nhan and Pham, Phu-Phuoc and Nguyen, Duc-Tri , booktitle=. 30VNFoods: A Dataset for Vietnamese Foods Recognition , year=

[19] [19]

Decoupled Weight Decay Regularization

Decoupled weight decay regularization , author=. arXiv preprint arXiv:1711.05101 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[20] [20]

SC20: International Conference for High Performance Computing, Networking, Storage and Analysis , pages=

Zero: Memory optimizations toward training trillion parameter models , author=. SC20: International Conference for High Performance Computing, Networking, Storage and Analysis , pages=. 2020 , organization=

2020

[21] [21]

Radford, Alec and Kim, Jong Wook and Hallacy, Chris and Ramesh, Aditya and Goh, Gabriel and Agarwal, Sandhini and Sastry, Girish and Askell, Amanda and Mishkin, Pamela and Clark, Jack and Krueger, Gretchen and Sutskever, Ilya , booktitle =

[22] [22]

Proceedings of ICLR , year=

Decoupled Weight Decay Regularization , author=. Proceedings of ICLR , year=

[23] [23]

Thomas Wolf and Lysandre Debut and Victor Sanh and Julien Chaumond and Clement Delangue and Anthony Moi and Pierric Cistac and Tim Rault and Rémi Louf and Morgan Funtowicz and Joe Davison and Sam Shleifer and Patrick von Platen and Clara Ma and Yacine Jernite and Julien Plu and Canwen Xu and Teven Le Scao and Sylvain Gugger and Mariama Drame and Quentin L...

2020

[24] [24]

2021 , volume =

Van Thin, Dang and Nguyen, Ngan Luu-Thuy and Truong, Tri Minh and Le, Lac Si and Vo, Duy Tin , title =. 2021 , volume =

2021

[25] [25]

Proceedings of KSEM

Luc Phan, Luong and Huynh Pham, Phuc and Thi-Thanh Nguyen, Kim and Khai Huynh, Sieu and Thi Nguyen, Tham and Thanh Nguyen, Luan and Van Huynh, Tin and Van Nguyen, Kiet. Proceedings of KSEM. 2021

2021

[26] [26]

Proceedings of COLING

Huynh, Tin Van and Nguyen, Kiet Van and Nguyen, Ngan Luu-Thuy. Proceedings of COLING. 2022

2022

[27] Co Van Dinh, Son T. Luu, and Anh Gia-Tuan Nguyen. 2022. In Proceedings of ACIIDS

[28] Kiet Van Nguyen, Vu Duc Nguyen, Phu X. V. Nguyen, Tham T. H. Truong, and Ngan Luu-Thuy Nguyen. 2018

[29] Thinh Hung Truong, Mai Hoang Dao, and Dat Quoc Nguyen

[30] Dat Quoc Nguyen, Dai Quoc Nguyen, Thanh Vu, Mark Dras, and Mark Johnson. 2018. In Proceedings of LREC 2018

[31] Thanh Vu, Dat Quoc Nguyen, Dai Quoc Nguyen, Mark Dras, and Mark Johnson. 2018. VnCoreNLP: A Vietnamese Natural Language Processing Toolkit. In Proceedings of NAACL: Demonstrations

[32] Cong Dao Tran, Nhut Huy Pham, Anh Tuan Nguyen, Truong Son Hy, and Tu Vu. 2023. In Findings of EACL 2023

[33] Nam Nguyen, Thang Phan, Duc-Vu Nguyen, and Kiet Nguyen. 2023. In Proceedings of EMNLP

[34] Dat Quoc Nguyen and Anh Tuan Nguyen. 2020

[35] Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzm. arXiv preprint

[36] Dat Quoc Nguyen, Linh The Nguyen, Chi Tran, Dung Ngoc Nguyen, Dinh Phung, and Hung Bui

[37] Oshin Agarwal and Ani Nenkova. 2022. Transactions of the ACL

[38] Phong Nguyen-Thuan Do, Son Quoc Tran, Phu Gia Hoang, Kiet Van Nguyen, and Ngan Luu-Thuy Nguyen. 2024. In Findings of NAACL

[39] Manh Tran-Tien, Huu-Loi Le, Dang Nhat Minh, T. Tran Khang, Huy-The Vu, and Nguyen Minh-Tien. 2023. In Proceedings of PACLIC

[40] Wenqi Fan, Yujuan Ding, Liangbo Ning, Shijie Wang, Hengyun Li, Dawei Yin, Tat-Seng Chua, and Qing Li. 2024

[41] Fei Sun, Jun Liu, Jian Wu, Changhua Pei, Xiao Lin, Wenwu Ou, and Peng Jiang. 2019

[42] Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019

[43] Diederik P. Kingma and Jimmy Ba. In Proceedings of ICLR

[44] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov

[45] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. In Proceedings of NAACL

[46] Alfred V. Aho and Jeffrey D. Ullman. 1972

[47] Publications Manual. 1983

[48] Ashok K. Chandra, Dexter C. Kozen, and Larry J. Stockmeyer. 1981. doi:10.1145/322234.322243

[49] Galen Andrew and Jianfeng Gao. Scalable training of

[50] Dan Gusfield. 1997

[51] Mohammad Sadegh Rasooli and Joel R. Tetreault. 2015. Computing Research Repository

[52] Rie Kubota Ando and Tong Zhang. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data. Journal of Machine Learning Research, December

[53] MiniCPM-V: A GPT-4V Level MLLM on Your Phone. arXiv preprint arXiv:2408.01800

[54] Guanhua Chen, Lu Hou, Yun Chen, Wenliang Dai, Lifeng Shang, Xin Jiang, Qun Liu, Jia Pan, and Wenping Wang. 2023. mCLIP: Multilingual CLIP via Cross-lingual Transfer. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). doi:10.18653/v1/2023.acl-long.728

[55] Chinese CLIP: Contrastive vision-language pretraining in Chinese. arXiv preprint arXiv:2211.01335

[56] Cross-lingual and multilingual CLIP. In Proceedings of the Thirteenth Language Resources and Evaluation Conference

[57] CAPIVARA: Cost-efficient approach for improving multilingual CLIP performance on low-resource languages. arXiv preprint arXiv:2310.13683

[58] AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities. 2023. In Findings of the Association for Computational Linguistics: ACL 2023

[59] Attention is all you need. Advances in Neural Information Processing Systems

[60] DeBERTa: Decoding-enhanced BERT with Disentangled Attention. arXiv preprint arXiv:2006.03654

[61] Lola Le Breton, Quentin Fournier, Mariam El Mezouar, John X. Morris, and Sarath Chandar. 2025. NeoBERT: A Next-Generation BERT

[62] The Viet Bui, Thi Oanh Tran, and Phuong Le-Hong. 2020. Improving Sequence Tagging for Vietnamese Text using Transformer-based Neural Models. In Proceedings of PACLIC, pages 13--20

[63] Chieu-Nguyen Chau, Truong-Son Nguyen, and Le-Minh Nguyen. 2020. VNLawBERT: A Vietnamese Legal Answer Selection Approach Using BERT Language Model. In Proceedings of NICS, pages 298--301

[64] Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Unsupervised Cross-lingual Representation Learning at Scale. arXiv preprint, arXiv:1911.02116

[65] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of NAACL, pages 4171--4186

[66] Phong Nguyen-Thuan Do, Son Quoc Tran, Phu Gia Hoang, Kiet Van Nguyen, and Ngan Luu-Thuy Nguyen. 2024. VLUE: A New Benchmark and Multi-task Knowledge Transfer Learning for Vietnamese Natural Language Understanding. In Findings of NAACL, pages 211--222

[67] Wenqi Fan, Yujuan Ding, Liangbo Ning, Shijie Wang, Hengyun Li, Dawei Yin, Tat-Seng Chua, and Qing Li. 2024. A Survey on RAG Meeting LLMs: Towards Retrieval-Augmented Large Language Models. In Proceedings of KDD, pages 6491--6501

[68] Tin Van Huynh, Kiet Van Nguyen, and Ngan Luu-Thuy Nguyen. 2022. ViNLI: A Vietnamese Corpus for Studies on Open-Domain Natural Language Inference. In Proceedings of COLING, pages 3858--3872

[69] Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In Proceedings of ICLR

[70] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint, arXiv:1907.11692

[71] Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In Proceedings of ICLR. https://openreview.net/forum?id=Bkg6RiCqY7

[72] Luong Luc Phan, Phuc Huynh Pham, Kim Thi-Thanh Nguyen, Sieu Khai Huynh, Tham Thi Nguyen, Luan Thanh Nguyen, Tin Van Huynh, and Kiet Van Nguyen. 2021. SA2SL: From Aspect-Based Sentiment Analysis to Social Listening System for Business Intelligence. In Proceedings of KSEM, pages 647--658

[73] Nguyen Minh, Vu Hoang Tran, Vu Hoang, Huy Duc Ta, Trung Huu Bui, and Steven Quoc Hung Truong. 2022. ViHealthBERT: Pre-trained Language Models for Vietnamese in Health Text Mining. In Proceedings of LREC, pages 328--337

[74] Dat Quoc Nguyen and Anh Tuan Nguyen. 2020. PhoBERT: Pre-trained language models for Vietnamese. In Findings of EMNLP 2020, pages 1037--1042

[75] Dat Quoc Nguyen, Dai Quoc Nguyen, Thanh Vu, Mark Dras, and Mark Johnson. 2018a. A Fast and Accurate Vietnamese Word Segmenter. In Proceedings of LREC 2018, pages 2582--2587

[76] Dat Quoc Nguyen, Linh The Nguyen, Chi Tran, Dung Ngoc Nguyen, Dinh Phung, and Hung Bui. 2023a. PhoGPT: Generative Pre-training for Vietnamese. arXiv preprint, arXiv:2311.02945

[77] Kiet Van Nguyen, Vu Duc Nguyen, Phu X. V. Nguyen, Tham T. H. Truong, and Ngan Luu-Thuy Nguyen. 2018b. UIT-VSFC: Vietnamese Students' Feedback Corpus for Sentiment Analysis. In Proceedings of KSE, pages 19--24

[78] Nam Nguyen, Thang Phan, Duc-Vu Nguyen, and Kiet Nguyen. 2023b. ViSoBERT: A Pre-Trained Language Model for Vietnamese Social Media Text Processing. In Proceedings of EMNLP, pages 5191--5207

[79] Cong Dao Tran, Nhut Huy Pham, Anh Tuan Nguyen, Truong Son Hy, and Tu Vu. 2023. ViDeBERTa: A powerful pre-trained language model for Vietnamese. In Findings of EACL 2023, pages 1071--1078

[80] Manh Tran-Tien, Huu-Loi Le, Dang Nhat Minh, T. Tran Khang, Huy-The Vu, and Nguyen Minh-Tien. 2023. ViPubmedDeBERTa: A Pre-trained Model for Vietnamese Biomedical Text. In Proceedings of PACLIC, pages 831--840