JFinTEB: Japanese Financial Text Embedding Benchmark
Pith reviewed 2026-05-10 07:52 UTC · model grok-4.3
The pith
JFinTEB is the first benchmark tailored to Japanese financial text embeddings, pairing retrieval and classification tasks drawn from realistic financial scenarios.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce JFinTEB, the first comprehensive benchmark specifically designed for evaluating Japanese financial text embeddings. Existing embedding benchmarks provide limited coverage of language-specific and domain-specific aspects found in Japanese financial texts. Our benchmark encompasses diverse task categories including retrieval and classification tasks that reflect realistic and well-defined financial text processing scenarios.
What carries the argument
The JFinTEB benchmark: retrieval tasks built from instruction-following datasets and financial text generation queries, plus classification tasks covering sentiment analysis, document categorization, and domain-specific classification derived from economic survey data.
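Classification tasks in embedding benchmarks of this kind are typically scored by fitting a lightweight classifier (commonly k-nearest neighbours or logistic regression) on frozen embeddings. The sketch below illustrates that protocol with a cosine-similarity kNN on toy vectors; it assumes nothing about JFinTEB's actual evaluation code, and all names are hypothetical.

```python
import numpy as np

def knn_accuracy(train_emb, train_labels, test_emb, test_labels, k=3):
    """Accuracy of a cosine-similarity kNN classifier on frozen embeddings."""
    tr = train_emb / np.linalg.norm(train_emb, axis=1, keepdims=True)
    te = test_emb / np.linalg.norm(test_emb, axis=1, keepdims=True)
    sims = te @ tr.T                           # (n_test, n_train) cosine similarities
    nn = np.argsort(-sims, axis=1)[:, :k]      # indices of the k nearest neighbours
    votes = train_labels[nn]                   # (n_test, k) neighbour labels
    pred = np.array([np.bincount(v).argmax() for v in votes])
    return float(np.mean(pred == test_labels))

# Toy data: two well-separated Gaussian clusters standing in for two classes.
rng = np.random.default_rng(0)
emb0 = rng.normal(loc=+2.0, size=(40, 16))
emb1 = rng.normal(loc=-2.0, size=(40, 16))
X = np.vstack([emb0, emb1])
y = np.array([0] * 40 + [1] * 40)
acc = knn_accuracy(X[::2], y[::2], X[1::2], y[1::2], k=3)
assert acc > 0.9  # well-separated clusters should classify near-perfectly
```

Because the classifier is deliberately weak, differences in accuracy mostly reflect the quality of the embeddings rather than the classifier, which is the point of this style of evaluation.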
If this is right
- Researchers can directly compare Japanese-specific embedding models of varying sizes against multilingual models and commercial services on the same financial tasks.
- The public datasets enable development of new embeddings that better handle Japanese financial terminology and structures.
- A standardized evaluation protocol becomes available for the Japanese financial text mining community.
- Downstream applications such as financial document search and economic survey analysis can select models using JFinTEB scores.
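The model comparison envisioned above follows the standard MTEB-style retrieval recipe: embed queries and documents, rank documents by cosine similarity, and score with nDCG@k. The following is a minimal illustration on toy vectors, not the JFinTEB evaluation framework itself; the function names and the binary-relevance simplification are assumptions.

```python
import numpy as np

def ndcg_at_k(relevance_in_ranked_order, n_relevant, k=10):
    """Binary nDCG@k: relevance_in_ranked_order lists 0/1 labels of the top
    documents as ranked by the model; n_relevant is the total number of
    relevant documents for the query."""
    rel = np.asarray(relevance_in_ranked_order, dtype=float)[:k]
    discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))
    dcg = float(np.sum(rel * discounts))
    idcg = float(np.sum(discounts[:min(n_relevant, k)]))  # ideal ranking's DCG
    return dcg / idcg if idcg > 0 else 0.0

def rank_by_cosine(query_vec, doc_matrix):
    """Return document indices sorted by cosine similarity to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    return np.argsort(-(d @ q))

# Toy example: 4 documents, of which doc 0 and doc 2 are relevant.
rng = np.random.default_rng(0)
docs = rng.normal(size=(4, 8))
query = docs[0] + 0.1 * rng.normal(size=8)   # query vector close to doc 0
order = rank_by_cosine(query, docs)
labels = [1 if i in (0, 2) else 0 for i in order]
score = ndcg_at_k(labels, n_relevant=2, k=4)
assert 0.0 < score <= 1.0
```

Running every candidate model through the same ranking-and-scoring loop is what makes scores comparable across Japanese-specific, multilingual, and commercial embeddings.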
Where Pith is reading between the lines
- The same task-construction approach could fill similar gaps for financial texts in other languages.
- Performance differences on JFinTEB may highlight which models better preserve meaning in Japanese compound financial terms.
- Future extensions could add time-series or multi-document financial reasoning tasks to test deeper domain understanding.
Load-bearing premise
Existing embedding benchmarks provide limited coverage of language-specific and domain-specific aspects found in Japanese financial texts, and the chosen retrieval and classification tasks reflect realistic financial text processing scenarios.
What would settle it
A test showing that models ranked highest on JFinTEB produce no measurable gain in accuracy when used for actual Japanese financial document retrieval or sentiment classification in live financial systems.
Original abstract
We introduce JFinTEB, the first comprehensive benchmark specifically designed for evaluating Japanese financial text embeddings. Existing embedding benchmarks provide limited coverage of language-specific and domain-specific aspects found in Japanese financial texts. Our benchmark encompasses diverse task categories including retrieval and classification tasks that reflect realistic and well-defined financial text processing scenarios. The retrieval tasks leverage instruction-following datasets and financial text generation queries, while classification tasks cover sentiment analysis, document categorization, and domain-specific classification challenges derived from economic survey data. We conduct extensive evaluations across a wide range of embedding models, including Japanese-specific models of various sizes, multilingual models, and commercial embedding services. We publicly release JFinTEB datasets and evaluation framework at https://github.com/retarfi/JFinTEB to facilitate future research and provide a standardized evaluation protocol for the Japanese financial text mining community. This work addresses a critical gap in Japanese financial text processing resources and establishes a foundation for advancing domain-specific embedding research.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces JFinTEB as the first comprehensive benchmark for Japanese financial text embeddings. It defines retrieval tasks based on instruction-following datasets and financial text generation queries, plus classification tasks covering sentiment analysis, document categorization, and economic survey data. The authors evaluate a range of Japanese-specific, multilingual, and commercial embedding models and publicly release the datasets and evaluation framework.
Significance. If the tasks are shown to be realistic, leakage-free, and aligned with actual Japanese financial NLP use cases, this benchmark would fill a clear gap in domain- and language-specific embedding evaluation resources. The public release of datasets and code is a concrete strength that supports reproducibility and community adoption.
major comments (2)
- [Abstract and §3] Abstract and §3 (Benchmark Construction): the central claim that retrieval and classification tasks 'reflect realistic and well-defined financial text processing scenarios' is unsupported by any stated selection criteria, expert validation steps, or mapping to documented Japanese financial workflows (e.g., regulatory filings, analyst reports). This premise is load-bearing for the 'comprehensive benchmark' assertion.
- [§4 and abstract] §4 (Experiments) and abstract: no dataset sizes, annotation protocols, leakage checks, or statistical significance tests are reported for the tasks or model comparisons. Without these, the 'extensive evaluations' cannot be verified and the benchmark's soundness remains unestablished.
minor comments (1)
- [Abstract] The abstract and introduction could include a concise table summarizing the number of tasks, query types, and label distributions to improve immediate clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript introducing JFinTEB. We address each major comment point by point below, clarifying aspects of the benchmark construction and experiments while committing to revisions that strengthen the paper without overstating what is currently present.
Point-by-point responses
Referee: [Abstract and §3] Abstract and §3 (Benchmark Construction): the central claim that retrieval and classification tasks 'reflect realistic and well-defined financial text processing scenarios' is unsupported by any stated selection criteria, expert validation steps, or mapping to documented Japanese financial workflows (e.g., regulatory filings, analyst reports). This premise is load-bearing for the 'comprehensive benchmark' assertion.
Authors: We appreciate this observation, as the realism of the tasks is indeed central to the benchmark's value. In §3, the tasks are described as derived from instruction-following datasets, financial text generation queries, sentiment analysis, document categorization, and economic survey data, all sourced from Japanese financial contexts. However, we acknowledge that explicit selection criteria, formal expert validation steps, and direct mappings to workflows such as regulatory filings or analyst reports are not detailed in the current manuscript. To address this, we will revise §3 to include a dedicated subsection on task construction rationale. This will explain how the chosen datasets align with common Japanese financial NLP applications (e.g., processing earnings call transcripts and market reports) and reference prior literature on these data sources. We note that no external expert validation panel was used; the selections were guided by the authors' review of publicly available Japanese financial corpora. This addition will better substantiate the claim while remaining faithful to the manuscript's content.
Revision: yes
Referee: [§4 and abstract] §4 (Experiments) and abstract: no dataset sizes, annotation protocols, leakage checks, or statistical significance tests are reported for the tasks or model comparisons. Without these, the 'extensive evaluations' cannot be verified and the benchmark's soundness remains unestablished.
Authors: We agree that these elements are essential for establishing the benchmark's soundness and enabling verification. Dataset sizes are currently summarized in a table within §3 but will be explicitly stated and expanded in the main text of both §3 and §4 in the revision. The tasks primarily reuse existing labeled datasets (e.g., pre-annotated economic survey data and sentiment corpora), so annotation protocols will be clarified by describing the original data creation processes and any preprocessing steps we applied. Leakage checks, including deduplication and train-test split verification, were conducted during benchmark construction but not reported; we will add a description of these procedures in §4. For statistical significance, we will incorporate paired statistical tests (e.g., bootstrap resampling or McNemar's test) with p-values for key model comparisons in the revised experiments section. These changes will be made to support the 'extensive evaluations' claim and improve reproducibility.
Revision: yes
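The paired significance testing promised in the rebuttal can be sketched as a paired bootstrap over per-example scores: resample examples with replacement and measure how often model B fails to beat model A. This is an illustrative sketch with synthetic 0/1 accuracy labels, not the actual JFinTEB evaluation code; the function name and parameters are hypothetical.

```python
import numpy as np

def paired_bootstrap_p(scores_a, scores_b, n_resamples=10_000, seed=0):
    """One-sided paired bootstrap: estimated p-value for the hypothesis
    that model B's mean score is not higher than model A's."""
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    assert a.shape == b.shape
    diff = b - a                                   # per-example improvement of B
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, diff.size, size=(n_resamples, diff.size))
    resampled_means = diff[idx].mean(axis=1)       # bootstrap distribution
    # Fraction of resamples where B shows no improvement over A.
    return float(np.mean(resampled_means <= 0.0))

# Synthetic example: B is correct on a strict superset of A's examples.
rng = np.random.default_rng(1)
a = (rng.random(500) < 0.70).astype(int)                  # ~70% accuracy
b = np.maximum(a, (rng.random(500) < 0.20).astype(int))   # A's hits plus extras
p = paired_bootstrap_p(a, b)
assert p < 0.05  # the improvement is unlikely to be chance
```

Pairing the resamples (drawing the same example indices for both models) is what gives the test its power; an unpaired comparison would conflate per-example difficulty with model differences.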
Circularity Check
No circularity: benchmark introduction is self-contained empirical work
full rationale
The paper introduces new datasets and an evaluation framework for Japanese financial text embeddings rather than deriving any fitted quantity, prediction, or result from prior inputs. No equations, parameters, or load-bearing self-citations appear in the provided abstract or description that reduce the central claim to the authors' own earlier outputs by construction. The work is an empirical resource creation with independent content, making it self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
- [1] Chung-Chi Chen, Hen-Hsen Huang, Yow-Ting Shiue, and Hsin-Hsi Chen. 2018. Numeral understanding in financial tweets for fine-grained crowd-based forecasting. In 2018 IEEE/WIC/ACM International Conference on Web Intelligence (WI). 136–143.
- [2] Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, et al. 2020. Unsupervised Cross-lingual Representation Learning at Scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 8440–8451. doi:10.18653/v1/2020.acl-main.747
- [3] Kenneth Enevoldsen, Isaac Chung, Imene Kerboua, Márton Kardos, Ashwin Mathur, et al. 2025. MMTEB: Massive Multilingual Text Embedding Benchmark. In The Thirteenth International Conference on Learning Representations.
- [4] Masanori Hirano. 2024. Construction of a Japanese Financial Benchmark for Large Language Models. In Proceedings of the Joint Workshop of the 7th Financial Technology and Natural Language Processing, the 5th Knowledge Discovery from Unstructured Data in Financial Services, and the 4th Workshop on Economics and Natural Language Processing. Association for Co...
- [5] Masanori Hirano and Kentaro Imajo. 2025. pfmt-bench-fin-ja: Preferred Multi-turn Benchmark for Finance in Japanese. In 18th IIAI International Congress on Advanced Applied Informatics (IIAI AAI). 273–279.
- [6] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. 2022. LoRA: Low-Rank Adaptation of Large Language Models. In International Conference on Learning Representations.
- [7] Rasmus Jørgensen, Oliver Brandt, Mareike Hartmann, Xiang Dai, Christian Igel, and Desmond Elliott. 2023. MultiFin: A Dataset for Multilingual Financial NLP. In Findings of the Association for Computational Linguistics: EACL 2023. 894–909. doi:10.18653/v1/2023.findings-eacl.66
- [8] Yasutomo Kimura, Eisaku Sato, Kazuma Kadowaki, and Hokuto Ototake. 2025. Overview of the NTCIR-18 U4 Task. In Proceedings of the 18th NTCIR Conference on Evaluation of Information Access Technologies, Vol. 6.
- [9] Shengzhe Li, Masaya Ohagi, and Ryokan Ri. 2024. JMTEB: Japanese Massive Text Embedding Benchmark. https://github.com/sbintuitions/JMTEB
- [10] Niklas Muennighoff, Nouamane Tazi, Loic Magne, and Nils Reimers. 2023. MTEB: Massive Text Embedding Benchmark. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics. 2014–2037. doi:10.18653/v1/2023.eacl-main.148
- [11] Sosuke Nishikawa, Ryokan Ri, et al. 2022. EASE: Entity-Aware Contrastive Learning of Sentence Embedding. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 3870–3885. doi:10.18653/v1/2022.naacl-main.284
- [12]
- [13] Masahiro Suzuki and Hiroki Sakaji. 2025. Economy Watchers Survey Provides Datasets and Tasks for Japanese Financial Domain. In Companion Proceedings of the ACM on Web Conference 2025. 805–808. doi:10.1145/3701716.3715304
- [14] Masahiro Suzuki, Hiroki Sakaji, Masanori Hirano, and Kiyoshi Izumi. 2023. Constructing and analyzing domain-specific language model for financial text mining. Information Processing & Management 60, 2 (2023). doi:10.1016/j.ipm.2022.103194
- [15] Hiroki Nakayama and Takahiro Kubo. 2018. chABSA: Aspect Based Sentiment Analysis dataset in Japanese. https://github.com/chakki-works/chABSA-dataset
- [16]
- [17]
- [18] Hayato Tsukagoshi, Shengzhe Li, Akihiko Fukuchi, and Tomohide Shibata. 2025. ModernBERT-Ja. https://huggingface.co/collections/sbintuitions/modernbert-ja-67b68fe891132877cf67aa0a
- [19]
- [20] Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, et al. 2024. Multilingual E5 Text Embeddings: A Technical Report. arXiv. https://arxiv.org/abs/2402.05672
- [21] Ikuya Yamada, Akari Asai, Hiroyuki Shindo, Hideaki Takeda, and Yuji Matsumoto
- [22] LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 6442–6454. doi:10.18653/v1/2020.emnlp-main.523