
arxiv: 2607.02259 · v1 · pith:SL54BS7Cnew · submitted 2026-07-02 · 💻 cs.CL

BamiBERT: A New BERT-based Language Model for Vietnamese

Pith reviewed 2026-07-03 14:26 UTC · model grok-4.3

classification 💻 cs.CL
keywords Vietnamese · BERT · language model · pre-training · natural language processing · benchmarks · raw text input · context length

The pith

BamiBERT sets a new state of the art for base-sized Vietnamese encoders by training on raw text with a 2048-token context.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces BamiBERT as a BERT-based model for Vietnamese that improves on prior work by training from scratch on a 129GB general-domain corpus for 20 epochs. It processes raw input directly without external word segmentation and handles contexts up to 2048 tokens. Across eight benchmarks it records the top score on 11 of 15 metrics and second place on three others, while also showing strong results on text from varied domains. A sympathetic reader would care because this points to simpler, higher-performing language tools for Vietnamese that avoid extra preprocessing stages.

Core claim

BamiBERT is a new pre-trained language model for Vietnamese that achieves the best score on 11 of 15 metrics and second-best on three others across eight benchmarks, setting a new state of the art among base-sized Vietnamese encoders while demonstrating strong cross-domain generalization; it is trained from scratch on 129GB of general-domain text for 20 epochs, supports up to 2048 tokens, and operates directly on raw input without external word segmentation.
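A headline such as "best on 11 of 15 metrics, second-best on 3 others" is a rank tally over a per-metric score table. The sketch below shows one way such a tally could be computed. The model names, metric names, and scores are placeholders invented for illustration and are not results from the paper; only the 11, 3, and 15 in the final line come from the claim above.

```python
from collections import Counter

def rank_tally(scores, model):
    """Count how often `model` places 1st, 2nd, ... across metrics.

    Assumes higher is better for every metric. Ties are broken by the
    insertion order of the inner dict, which a real comparison should
    handle explicitly.
    """
    tally = Counter()
    for by_model in scores.values():
        ordered = sorted(by_model, key=by_model.get, reverse=True)
        tally[ordered.index(model) + 1] += 1
    return tally

# Placeholder numbers, NOT results from the paper.
toy_scores = {
    "metric_a": {"model_x": 0.91, "model_y": 0.89, "model_z": 0.85},
    "metric_b": {"model_x": 0.78, "model_y": 0.80, "model_z": 0.75},
    "metric_c": {"model_x": 0.66, "model_y": 0.61, "model_z": 0.64},
}
tally = rank_tally(toy_scores, "model_x")

# The reported tally: 11 firsts plus 3 seconds over 15 metrics.
top_two_share = (11 + 3) / 15
print(dict(tally), round(top_two_share, 3))
```

Read this way, the claim places BamiBERT in the top two on 14 of 15 metrics, leaving one metric where it ranks third or lower.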

What carries the argument

The BamiBERT model, a BERT architecture pre-trained directly on raw Vietnamese text with extended 2048-token context length.

If this is right

  • Vietnamese NLP applications can reach higher accuracy without relying on separate word segmentation tools.
  • Extended context windows enable better handling of longer Vietnamese documents in tasks like summarization or question answering.
  • Training on large general-domain raw text produces models that generalize across different text domains.
  • The released model allows direct use in downstream Vietnamese language tasks without additional pre-processing pipelines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same raw-text training approach without segmentation could be tested on other languages that currently depend on word boundary tools.
  • If the performance holds, larger raw corpora might reduce the need for language-specific linguistic resources in model development.
  • The 2048-token context opens the possibility of document-level tasks that previous shorter-context Vietnamese models could not address directly.
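To make the context-length point concrete, the sketch below counts how many overlapping windows a document needs under a given token limit. It is a minimal illustration, not the paper's procedure: the 256-token limit and the stride of 128 are assumed values standing in for a shorter-context encoder, and special tokens are ignored.

```python
import math

def num_windows(n_tokens, max_len, stride):
    """Number of overlapping windows needed to cover n_tokens tokens."""
    if n_tokens <= max_len:
        return 1
    return 1 + math.ceil((n_tokens - max_len) / stride)

def windows(tokens, max_len, stride):
    """Yield slices of at most max_len tokens, advancing by stride."""
    start = 0
    while True:
        yield tokens[start:start + max_len]
        if start + max_len >= len(tokens):
            break
        start += stride

doc = list(range(1800))  # stand-in for an 1800-token document

# Assumed shorter limit (256) versus the 2048-token limit stated in the paper.
short_context = num_windows(len(doc), max_len=256, stride=128)
long_context = num_windows(len(doc), max_len=2048, stride=128)
print(short_context, long_context)
```

Under these assumptions the shorter limit splits the document into many windows whose predictions must then be merged, while the 2048-token limit covers it in a single pass.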

Load-bearing premise

The benchmarks and metrics used are fair, unbiased, and representative of real Vietnamese language understanding, and prior models were evaluated under conditions equivalent to those used for BamiBERT.

What would settle it

Re-running all evaluations with PhoBERT under the exact same training and inference conditions as BamiBERT and finding that PhoBERT matches or exceeds the reported scores would falsify the superiority claim.

Original abstract

In this paper, we introduce BamiBERT, a new BERT-based pre-trained language model for Vietnamese that addresses key limitations of PhoBERT -- the current de facto Vietnamese text encoder. Trained from scratch on a 129GB corpus of general-domain Vietnamese text for 20 epochs, BamiBERT supports an extended context length of up to 2048 tokens and operates directly on raw input, eliminating the need for external word segmentation. Across 8 Vietnamese benchmarks, it achieves the best score on 11 of 15 metrics and the second-best on 3 others, setting a new state of the art among "base"-sized Vietnamese encoders and demonstrating strong cross-domain generalization. We release BamiBERT at: https://huggingface.co/Qualcomm-AI-Research/BamiBERT
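A minimal sketch of how the released checkpoint might be used on raw text, assuming it loads through the generic `AutoTokenizer` and `AutoModel` classes of the Hugging Face `transformers` library. The abstract does not confirm this, and the repository may require different classes or options. The helper name `embed` and the choice of the first-token hidden state as a sentence vector are illustrative, not taken from the paper.

```python
MODEL_ID = "Qualcomm-AI-Research/BamiBERT"  # repo ID taken from the URL in the abstract
MAX_LENGTH = 2048  # context length stated in the abstract

def embed(texts):
    """Return one vector per raw input string (first-token hidden state).

    Hypothetical helper: assumes the checkpoint is compatible with
    AutoTokenizer / AutoModel, which the source does not confirm.
    """
    # Imported lazily so this module can be imported without these packages.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModel.from_pretrained(MODEL_ID)
    model.eval()

    # Raw strings go straight to the tokenizer; no word segmentation step.
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=MAX_LENGTH, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state
    return hidden[:, 0]
```

Calling `embed(["Tôi là sinh viên đại học"])` would download the checkpoint on first use, so it needs network access and installed `torch` and `transformers` packages.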

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces BamiBERT, a BERT-based pre-trained language model for Vietnamese trained from scratch on a 129GB general-domain corpus for 20 epochs. It supports an extended context length of 2048 tokens and operates directly on raw input without external word segmentation, in contrast to PhoBERT. The central empirical claim is that across 8 Vietnamese benchmarks it achieves the best score on 11 of 15 metrics and second-best on 3 others, establishing a new state of the art among base-sized Vietnamese encoders with strong cross-domain generalization. The model is released publicly on Hugging Face.

Significance. If the benchmark results hold under equivalent evaluation conditions, BamiBERT would provide a stronger base encoder for Vietnamese NLP, particularly for tasks benefiting from longer context and reduced preprocessing. The public model release is a clear strength that enables direct reproducibility and downstream use.

major comments (2)
  1. [Abstract] The central SOTA claim rests on benchmark wins, yet the manuscript supplies no details on training hyperparameters, data preprocessing pipelines, baseline re-implementations, statistical testing, or error bars. This prevents verification that the reported gains reflect model quality rather than uncontrolled variables.
  2. [Abstract] The evaluation contrasts BamiBERT's raw-input operation with PhoBERT's segmentation requirement, but provides no information on whether test sets were prepared identically for both models (e.g., pre-segmented or raw). If tokenization differences contribute to the 11/15 metric wins, the cross-model SOTA conclusion does not follow.
minor comments (1)
  1. The abstract states the corpus size and epoch count but does not indicate the exact training objective, optimizer schedule, or hardware used; adding a brief methods paragraph would improve clarity without altering the core claim.
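The error bars and statistical testing the referee asks for could take the following shape: a mean and standard deviation over fine-tuning seeds, plus a paired bootstrap over test examples. This is a sketch of one common approach, not the authors' protocol, and every number below is a placeholder rather than a reported result.

```python
import random
import statistics

def mean_and_std(seed_scores):
    """Mean and sample standard deviation of one metric across seeds."""
    return statistics.mean(seed_scores), statistics.stdev(seed_scores)

def paired_bootstrap_win_rate(correct_a, correct_b, n_resamples=1000, seed=0):
    """Fraction of resampled test sets on which system A beats system B.

    correct_a[i] and correct_b[i] are 1 if the system got test example i
    right and 0 otherwise; both systems are scored on the same examples.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both systems must be scored on the same examples")
    rng = random.Random(seed)
    n = len(correct_a)
    wins = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(correct_a[i] for i in idx) > sum(correct_b[i] for i in idx):
            wins += 1
    return wins / n_resamples

# Placeholder data, NOT taken from the paper.
mean, std = mean_and_std([0.90, 0.92, 0.91])
win_rate = paired_bootstrap_win_rate([1] * 8 + [0] * 2, [1] * 6 + [0] * 4)
print(f"{mean:.3f} +/- {std:.3f}, A beats B in {win_rate:.0%} of resamples")
```

A win rate close to 1.0 suggests the gap is unlikely to be an artifact of which test examples happened to be sampled; a rate near 0.5 suggests the two systems are hard to tell apart on that benchmark.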

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments below and will revise the manuscript to improve clarity and reproducibility.

Point-by-point responses
  1. Referee: [Abstract] The central SOTA claim rests on benchmark wins, yet the manuscript supplies no details on training hyperparameters, data preprocessing pipelines, baseline re-implementations, statistical testing, or error bars. This prevents verification that the reported gains reflect model quality rather than uncontrolled variables.

    Authors: We agree that the current version lacks sufficient detail for full verification. In the revised manuscript we will add a dedicated experimental setup subsection that reports all training hyperparameters, the complete data preprocessing pipeline, how each baseline was sourced or re-implemented, and any statistical testing or error bars obtained from multiple runs.
    Revision: yes

  2. Referee: [Abstract] The evaluation contrasts BamiBERT's raw-input operation with PhoBERT's segmentation requirement, but provides no information on whether test sets were prepared identically for both models (e.g., pre-segmented or raw). If tokenization differences contribute to the 11/15 metric wins, the cross-model SOTA conclusion does not follow.

    Authors: We acknowledge the need to document evaluation conditions explicitly. The revised version will include a paragraph clarifying that every model (including baselines) received test inputs prepared under the protocol appropriate to that model, with raw text supplied to BamiBERT and segmented text supplied to PhoBERT; we will also discuss the contribution of the segmentation difference to the observed gains.
    Revision: yes
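The per-model input protocol described in the second response can be pictured as a small dispatch step in the evaluation harness. The sketch below is illustrative only: `toy_segment` is a hypothetical greedy longest-match stand-in for a real Vietnamese word segmenter, the two-entry lexicon is invented, and joining the syllables of a word with underscores is assumed here as the segmented-input convention.

```python
def toy_segment(text, lexicon):
    """Greedy longest-match: join syllables of known words with '_'.

    A toy stand-in for a real word segmenter, for illustration only.
    """
    syllables = text.split()
    longest = max((len(word.split()) for word in lexicon), default=1)
    out, i = [], 0
    while i < len(syllables):
        for span in range(min(longest, len(syllables) - i), 0, -1):
            candidate = " ".join(syllables[i:i + span])
            if span == 1 or candidate in lexicon:
                out.append(candidate.replace(" ", "_"))
                i += span
                break
    return " ".join(out)

def prepare_input(text, needs_segmentation, lexicon):
    """Give each model the input form it expects."""
    return toy_segment(text, lexicon) if needs_segmentation else text

LEXICON = {"sinh viên", "đại học"}  # invented two-word lexicon
sentence = "Tôi là sinh viên đại học"

raw_input = prepare_input(sentence, needs_segmentation=False, lexicon=LEXICON)
segmented_input = prepare_input(sentence, needs_segmentation=True, lexicon=LEXICON)
print(raw_input)
print(segmented_input)
```

Because the two models see different token sequences for the same sentence, token-level tasks also need labels realigned to each form, which is one place where an uncontrolled difference between the compared systems could enter.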

Circularity Check

0 steps flagged

No circularity: purely empirical training and benchmark evaluation

full rationale

The paper introduces BamiBERT via standard pre-training on a 129GB corpus followed by evaluation on external Vietnamese benchmarks. No derivation chain, equations, fitted parameters renamed as predictions, or self-citation load-bearing premises exist. The central claim (SOTA on 11/15 metrics) rests on reported benchmark numbers, which are independent of any internal reduction to the paper's own inputs. This is the expected non-finding for an applied NLP model paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the standard transformer architecture and pretraining objectives being effective for Vietnamese, plus the assumption that the chosen benchmarks measure genuine capability gains.

axioms (1)
  • domain assumption: Standard BERT masked language modeling and next-sentence prediction objectives transfer effectively to Vietnamese text.
    The paper applies BERT pretraining without introducing new objectives or proving transfer.
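For reference, the masked language modeling objective named in this axiom corrupts its inputs as in the sketch below. The 15% selection rate and the 80/10/10 split between mask token, random token, and unchanged token follow the original BERT recipe; whether BamiBERT uses these exact settings is not stated in the material above, so treat them as assumptions.

```python
import random

def mask_tokens(tokens, vocab, mask_token="[MASK]", mask_prob=0.15, seed=0):
    """Corrupt a token sequence in the style of the original BERT recipe.

    Returns (inputs, labels). labels[i] holds the original token at
    positions selected for prediction and None elsewhere.
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            labels.append(token)
            roll = rng.random()
            if roll < 0.8:
                inputs.append(mask_token)         # 80%: replace with mask token
            elif roll < 0.9:
                inputs.append(rng.choice(vocab))  # 10%: replace with random token
            else:
                inputs.append(token)              # 10%: keep the original token
        else:
            labels.append(None)
            inputs.append(token)
    return inputs, labels

tokens = "Tôi là sinh viên đại học".split()
vocab = sorted(set(tokens))
inputs, labels = mask_tokens(tokens, vocab)
```

The model is trained to recover `labels[i]` at every position where it is not `None`, using the corrupted `inputs` as context.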

pith-pipeline@v0.9.1-grok · 5665 in / 1136 out tokens · 30881 ms · 2026-07-03T14:26:46.104281+00:00 · methodology

