Semantic insurance pricing with large language models

Christopher Blier-Wong; Derek Kusmenko

arxiv: 2606.29371 · v1 · pith:U4YRYUERnew · submitted 2026-06-28 · 📊 stat.AP

Pith reviewed 2026-06-30 02:10 UTC · model grok-4.3

classification 📊 stat.AP

keywords insurance pricing · large language models · embeddings · generalized linear models · claim frequency · actuarial modeling · Poisson regression · French motor data

The pith

Embeddings from a pre-trained large language model can replace hand-crafted features as inputs to a standard actuarial pricing model for Poisson claim-frequency regression.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether embeddings computed from natural-language policyholder descriptions can substitute for manually engineered risk factors in classical insurance pricing. On French motor third-party liability data, a generalized linear model fed with the embedding inputs yields lower prediction error than one fed with the usual hand-crafted features, with the largest gains appearing when the training sample is small. At larger sample sizes the advantage becomes sensitive to model choice and embedding dimension. The authors also show that insurance-specific fine-tuning of the embeddings helps and that the pipeline's output changes when any out-of-template field is appended to the prompt.

Core claim

Embeddings from a pre-trained large language model, computed from a natural-language description of each policyholder, can replace hand-crafted features as inputs to a standard actuarial pricing model. Using French motor third-party liability data the embedding-based model outperforms the generalized linear model, especially when data are scarce, whereas at larger sample sizes the comparison is model- and dimension-dependent. Insurance-specific fine-tuning further improves the embeddings, and a prompt-sensitivity diagnostic shows that the pipeline reacts to any appended out-of-template field, making controlled prompts a governance requirement.
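To make the prompt-sensitivity diagnostic concrete, here is a minimal sketch of the idea: render a policyholder record with a fixed template, append one out-of-template field, and compare the predicted frequencies. Everything in it is an assumption for illustration. The template, the field names, the hashed bag-of-tokens `toy_embed` (a deterministic stand-in for an LLM embedder), and the random coefficients `beta` are hypothetical and are not taken from the paper.

```python
import hashlib
import numpy as np

def render_prompt(record, extra=None):
    """Render a structured record as a templated description (hypothetical template)."""
    lines = [f"- {key}: {value}" for key, value in record.items()]
    if extra is not None:
        # An out-of-template field appended after the controlled fields.
        lines.append(f"- {extra[0]}: {extra[1]}")
    return "Policyholder profile:\n" + "\n".join(lines)

def toy_embed(text, dim=16):
    """Deterministic stand-in for an LLM embedder: hashed token counts, L2-normalised."""
    vec = np.zeros(dim)
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        vec[digest[0] % dim] += 1.0
    return vec / np.linalg.norm(vec)

# Hypothetical record and pretend GLM coefficients on the embedding coordinates.
record = {"driver age": 34, "vehicle age": 6, "region": "Bretagne"}
rng = np.random.default_rng(1)
beta = rng.normal(scale=0.5, size=16)

base_prompt = render_prompt(record)
perturbed_prompt = render_prompt(record, extra=("favourite colour", "blue"))

# Log-link prediction: frequency = exp(beta . embedding).
base_freq = np.exp(beta @ toy_embed(base_prompt))
perturbed_freq = np.exp(beta @ toy_embed(perturbed_prompt))
relative_shift = perturbed_freq / base_freq - 1.0
print(f"relative shift in predicted frequency: {relative_shift:+.3%}")
```

With a real embedder the same comparison would be run over many policies and summarised as a distribution of shifts. The point of the sketch is only that an irrelevant appended field still moves the embedding, and therefore the price, unless the prompt is controlled.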

What carries the argument

Pre-trained language-model embeddings treated as fixed covariates inside a generalized linear model for Poisson claim-frequency regression.
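As a rough illustration of what "embeddings as fixed covariates" could look like in code, the sketch below fits a log-link Poisson GLM with a log-exposure offset by iteratively reweighted least squares. Everything here is an assumption for illustration: the random matrix `E` stands in for (reduced) LLM embeddings, the claim counts are simulated, and `fit_poisson_glm` and `predict_frequency` are hypothetical helper names rather than the authors' code.

```python
import numpy as np

def fit_poisson_glm(X, y, exposure, n_iter=50, ridge=1e-6, tol=1e-8):
    """Fit a log-link Poisson GLM with offset log(exposure) by IRLS.

    Returns the coefficient vector with the intercept first.
    """
    n = X.shape[0]
    Xd = np.column_stack([np.ones(n), X])
    offset = np.log(exposure)
    beta = np.zeros(Xd.shape[1])
    beta[0] = np.log(y.sum() / exposure.sum())  # start at the overall frequency
    for _ in range(n_iter):
        eta = Xd @ beta + offset
        mu = np.exp(eta)
        z = (eta - offset) + (y - mu) / mu      # working response, offset removed
        A = Xd.T @ (Xd * mu[:, None]) + ridge * np.eye(Xd.shape[1])
        beta_new = np.linalg.solve(A, Xd.T @ (mu * z))
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta

def predict_frequency(beta, X):
    """Expected claims per unit of exposure."""
    return np.exp(beta[0] + X @ beta[1:])

# Simulated stand-ins: E plays the role of fixed embedding covariates.
rng = np.random.default_rng(0)
n, k = 5000, 8
E = rng.normal(size=(n, k))
exposure = rng.uniform(0.1, 1.0, size=n)
true_beta = np.concatenate([[np.log(0.1)], rng.normal(scale=0.2, size=k)])
y = rng.poisson(exposure * np.exp(true_beta[0] + E @ true_beta[1:]))

beta_hat = fit_poisson_glm(E, y, exposure)
fitted_claims = exposure * predict_frequency(beta_hat, E)
print("observed claims:", y.sum(), "fitted claims:", round(fitted_claims.sum(), 3))
```

Because the model has an intercept and uses the canonical log link, total fitted claims match total observed claims on the training data (up to the tiny ridge term), which is the balance property actuaries usually check first.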

Load-bearing premise

The supplied natural-language policy descriptions already contain the risk-relevant information that the embeddings extract without systematic bias or loss of actuarial signal.

What would settle it

A new motor-insurance dataset in which the embedding-based generalized linear model shows no improvement over the hand-crafted-feature version on small training samples would falsify the performance claim.
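Any such comparison needs an out-of-sample score, and the figure captions on this page refer to test mean Poisson deviance. A minimal sketch of that metric follows; the unweighted average and the function name `mean_poisson_deviance` are assumptions here, and the paper may weight by exposure or define the score slightly differently.

```python
import numpy as np

def mean_poisson_deviance(y, mu):
    """Mean Poisson deviance: 2 * mean(y * log(y / mu) - (y - mu)).

    Uses the convention y * log(y / mu) = 0 when y = 0; mu must be positive.
    """
    y = np.asarray(y, dtype=float)
    mu = np.asarray(mu, dtype=float)
    ratio = np.where(y > 0, y / mu, 1.0)  # log(1) = 0 handles the y = 0 case
    return 2.0 * np.mean(y * np.log(ratio) - (y - mu))

# Tiny example: only the first policy is mispredicted.
observed = np.array([0, 1, 2])
predicted = np.array([0.5, 1.0, 2.0])
print(mean_poisson_deviance(observed, predicted))
```

Lower is better, and the score is zero only when every predicted mean equals the observed count, so two covariate sets can be compared by fitting each on the same training sample and scoring both on the same held-out policies.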

Figures

Figures reproduced from arXiv: 2606.29371 by Christopher Blier-Wong, Derek Kusmenko.

**Figure 2.** Default instruction-prefixed underwriting-list prompt for generating LLM embeddings

**Figure 3.** Learning curves comparing the baseline GLM (48 hand-engineered features) against …

**Figure 4.** Effect of embedding dimension and reduction method on test deviance across embedding …

**Figure 5.** Test mean Poisson deviance for two ways of combining Qwen embeddings with the …

**Figure 6.** Figure 6 shows the main effect of fine-tuning for the two strongest adapters in the evaluated grid. Fine-tuning shifts the Qwen PCA curve downward at every tested dimension, with the largest gain at K = 48. The 100,000-pair, two-epoch adapter provides the strongest low-dimensional embedding-only result, while the 200,000-pair, two-epoch adapter gives the best high-dimensional result. Adding the raw GLM covariates t…

**Figure 7.** Learning curves for Qwen fine-tuning at K = 48. The left panel shows the full range from n = 1,000 to n = 500,000; the right panel zooms in on the stable medium- and large-sample regime. Raw-plus-LoRA appends the 48 GLM covariates to the 48 fine-tuned PCA components. The 100,000-pair and 200,000-pair adapters shown are the two-epoch variants, matching the labelling of …

**Figure 8.** Distribution summaries of individual-level predicted frequency shifts for the five prompt …

**Figure 9.** UMAP projection of the Qwen3-Embedding-0.6B policyholder embeddings, colored by …

**Figure 10.** Per-band feature profile. Each cell prints the mean of that feature over the policies in the …

**Figure 11.** Exposure-weighted claim frequency by band, computed on the full …

**Figure 12.** Prose narrative prompt. The prose format ( …

**Figure 13.** Minimal key-value prompt. The minimal format ( …

**Figure 14.** Task-instruction list prompt. This format ( …

**Figure 15.** Qualitative descriptor prompt. This format ( …

**Figure 16.** Long underwriting prompt. The long underwriting prompt ( …

Original abstract

Classical actuarial pricing models, such as the generalized linear model, are valued for transparency and ease of governance, but they use interactions among risk factors only when these are supplied through explicit feature engineering. We study whether embeddings from a pre-trained large language model, computed from a natural-language description of each policyholder, can replace hand-crafted features as inputs to a standard actuarial pricing model, taking Poisson claim-frequency regression as the main example. The language model is used only to construct deterministic embedding covariates; pricing is performed by a standard generalized linear model. Using French motor third-party liability data, the embedding-based model outperforms the generalized linear model, especially when data are scarce, whereas at larger sample sizes the comparison is model- and dimension-dependent. Insurance-specific fine-tuning further improves the embeddings, and a prompt-sensitivity diagnostic shows that the pipeline reacts to any appended out-of-template field, making controlled prompts a governance requirement.
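The dimension dependence mentioned in the abstract, together with the PCA components and K = 48 in the figure captions, suggests that embeddings are reduced before they enter the GLM. The sketch below shows one common way to do that: fit PCA on the training embeddings only and reuse the same projection for held-out policies. The raw dimension of 128, the simulated embeddings, and the helper names `fit_pca` and `apply_pca` are assumptions for illustration; the paper's exact reduction procedure is not described on this page.

```python
import numpy as np

def fit_pca(train_emb, k):
    """Return the training mean and the top-k principal directions (rows)."""
    mean = train_emb.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(train_emb - mean, full_matrices=False)
    return mean, vt[:k]

def apply_pca(emb, mean, components):
    """Project embeddings onto the directions learned from the training set."""
    return (emb - mean) @ components.T

# Simulated stand-ins for raw embeddings (hypothetical raw dimension 128).
rng = np.random.default_rng(42)
train_emb = rng.normal(size=(300, 128))
test_emb = rng.normal(size=(50, 128))

K = 48  # the dimension highlighted in the figure captions
mean, components = fit_pca(train_emb, K)
Z_train = apply_pca(train_emb, mean, components)
Z_test = apply_pca(test_emb, mean, components)
print(Z_train.shape, Z_test.shape)
```

Fitting the projection on the training split alone keeps the embedding covariates a deterministic function of the prompt and the frozen training-time artefacts, so held-out policies do not influence the covariates they are scored on.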

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

LLM embeddings from natural-language policy descriptions can replace hand-crafted features in a GLM for claim frequency on French motor data, with gains clearest at small samples, but the abstract gives no numbers to size the effect.

The key takeaway is that pre-trained LLM embeddings from natural-language policyholder descriptions can replace hand-crafted features in a standard GLM for Poisson claim frequency on French motor data, with better performance when data are scarce.

The paper keeps the pricing step as a transparent GLM and uses the LLM only to generate fixed covariates. It reports that insurance-specific fine-tuning improves results and that the pipeline is sensitive to prompt changes, which leads the authors to treat controlled prompts as a governance requirement. This is a solid new application of embeddings to actuarial work rather than a methodological advance.

It does well by testing the low-data regime explicitly and by highlighting the prompt issue for practical use. The comparison is direct and stays within the GLM framework that actuaries already use.

The main soft spot is the absence of any numbers, error bars, or significance tests in the abstract, which makes the outperformance claim hard to evaluate. A second concern is whether the descriptions actually hold new actuarial signal or merely rephrase existing features; this needs to be addressed in the full paper. If the descriptions are derived from the same structured data, the gain might come from the embedding's non-linear mapping rather than from additional information.

This paper is for actuaries and data scientists in insurance who are considering text inputs for pricing models. A reader interested in keeping interpretability while adding unstructured data would find the sample-size dependence and the diagnostic useful.

It deserves a serious referee because the experiment is on real data and the governance angle is relevant. I would recommend sending it to peer review so the quantitative results and data details can be examined.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes replacing hand-crafted features with deterministic embeddings from a pre-trained LLM applied to natural-language policyholder descriptions as covariates in a standard GLM for Poisson claim-frequency regression. On French motor third-party liability data, the embedding-based GLM outperforms the baseline GLM (especially in low-data regimes), with further gains from insurance-specific fine-tuning; a prompt-sensitivity check indicates the need for controlled prompts.

Significance. If the performance gains are robust and attributable to extractable actuarial signal in the text, the approach offers a way to incorporate unstructured data into transparent, governance-friendly pricing models without abandoning GLMs. The deterministic embedding step and retention of the GLM are strengths that keep the method within existing actuarial workflows. The low-data regime result, if confirmed with proper controls, would be particularly relevant for new lines or small portfolios.

major comments (2)

[Abstract, §3] Abstract and §3 (data description): the central claim that embeddings 'can replace hand-crafted features' and outperform them rests on the assumption that the natural-language descriptions encode risk-relevant information not already captured by the baseline covariates. The manuscript provides no details on whether these descriptions are free-form applicant text, templated conversions of the structured fields, or richer external text; without this, outperformance cannot be unambiguously attributed to semantic extraction rather than implicit non-linear feature engineering or data leakage.
[§4] §4 (experimental results): the abstract states outperformance 'especially when data are scarce' and 'model- and dimension-dependent' at larger sizes, yet the provided text contains no quantitative metrics, confidence intervals, data-split details, or statistical significance tests. This absence makes the load-bearing empirical claim unverifiable from the given information and requires explicit tables or figures with effect sizes.

minor comments (2)

[Abstract] Abstract: the claim of outperformance should be accompanied by at least one quantitative metric (e.g., relative improvement in Poisson deviance or log-likelihood) even in the abstract.
[§2] Notation: clarify whether the GLM is fit on the raw embeddings or on a reduced-dimensional projection, and state the exact dimension choice and any regularization.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important areas for clarification and strengthening of the empirical presentation. We address each major comment below and will revise the manuscript accordingly.

Referee: [Abstract, §3] Abstract and §3 (data description): the central claim that embeddings 'can replace hand-crafted features' and outperform them rests on the assumption that the natural-language descriptions encode risk-relevant information not already captured by the baseline covariates. The manuscript provides no details on whether these descriptions are free-form applicant text, templated conversions of the structured fields, or richer external text; without this, outperformance cannot be unambiguously attributed to semantic extraction rather than implicit non-linear feature engineering or data leakage.

Authors: We agree that the source and construction of the natural-language descriptions must be specified to support attribution of performance gains to semantic content rather than leakage or implicit engineering. In the revised manuscript we will expand the data description in §3 to state explicitly how the descriptions are obtained (free-form applicant text versus any templating or conversion from structured fields) and will add a short discussion of how this affects interpretation of the results.

revision: yes

Referee: [§4] §4 (experimental results): the abstract states outperformance 'especially when data are scarce' and 'model- and dimension-dependent' at larger sizes, yet the provided text contains no quantitative metrics, confidence intervals, data-split details, or statistical significance tests. This absence makes the load-bearing empirical claim unverifiable from the given information and requires explicit tables or figures with effect sizes.

Authors: We accept that the current manuscript version does not present the quantitative results with sufficient detail for independent verification. In the revision we will add tables and figures in §4 that report the relevant performance metrics together with confidence intervals, explicit data-split and cross-validation procedures, and statistical significance tests comparing the embedding-based and baseline GLMs.

revision: yes

Circularity Check

0 steps flagged

No circularity: direct empirical comparison of covariate sets in standard GLM

The paper's central claim is an empirical performance comparison: LLM embeddings computed from natural-language policy descriptions are substituted as covariates into an otherwise standard Poisson GLM and evaluated against a baseline GLM using hand-crafted features on French motor data. No equations, uniqueness theorems, or predictions are derived; the pipeline is a deterministic embedding step followed by off-the-shelf GLM fitting and out-of-sample evaluation. No self-citations are load-bearing for the result, no fitted parameters are relabeled as predictions, and no ansatz is smuggled via prior work. The derivation chain is self-contained against external benchmarks and does not reduce to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the assumption that LLM embeddings preserve actuarial signal from text without post-hoc selection or bias; no free parameters, axioms, or invented entities are explicitly introduced beyond standard GLM and pre-trained model usage.

pith-pipeline@v0.9.1-grok · 5676 in / 1092 out tokens · 23533 ms · 2026-06-30T02:10:17.663787+00:00 · methodology


[24] [24]

Babakhin, Yauhen and Osmulski, Radek and Ak, Ronay and Moreira, Gabriel and Xu, Mengyao and Schifferer, Benedikt and Liu, Bo and Oldridge, Even , journal =. Llama-

[25] [25]

Kenneth Enevoldsen and Isaac Chung and Imene Kerboua and Márton Kardos and Ashwin Mathur and David Stap and Jay Gala and Wissam Siblini and Dominik Krzemiński and Genta Indra Winata and Saba Sturua and Saiteja Utpala and Mathieu Ciancone and Marion Schaeffer and Gabriel Sequeira and Diganta Misra and Shreeya Dhakal and Jonathan Rystrøm and Roman Solomatin...

[26] [26]

2023 , journal =

The use of autoencoders for training neural networks with mixed categorical and numerical features , author =. 2023 , journal =. doi:10.1017/asb.2023.15 , urldate =

work page doi:10.1017/asb.2023.15 2023

[27] [27]

Text Mining in Insurance: From Unstructured Data to Meaning , shorttitle =

Zappa, Diego and Borrelli, Mattia and Clemente, Gian Paolo and Savelli, Nino , year = 2021, journal =. Text Mining in Insurance: From Unstructured Data to Meaning , shorttitle =

2021

[28] [28]

Annals of Actuarial Science , volume =

On Clustering Levels of a Hierarchical Categorical Risk Factor , author =. Annals of Actuarial Science , volume =. doi:10.1017/S1748499523000283 , urldate =

work page doi:10.1017/s1748499523000283

[29] [29]

Advanced Applications of Generative

Hatzesberger, Simon and Nonneman, Iris , journal =. Advanced Applications of Generative

[30] [30]

arXiv preprint arXiv:2206.02014 , year =

Actuarial Applications of Natural Language Processing Using Transformers , author =. arXiv preprint arXiv:2206.02014 , year =

work page arXiv

[31] [31]

Operationalizing

Balona, Caesar , year = 2025, pages =. Operationalizing

2025

[32] [32]

Hegselmann, Stefan and Buendia, Alejandro and Lang, Hunter and Agrawal, Monica and Jiang, Xiaoyi and Sontag, David , booktitle=. Tab. 2023 , publisher=

2023

[33] [33]

Dinh, Tuan and Zeng, Yuchen and Zhang, Ruisu and Lin, Ziqian and Gira, Michael and Rajput, Shashank and Sohn, Jy-yong and Papailiopoulos, Dimitris and Lee, Kangwook , booktitle=

[34] [34]

Large Language Models (

Fang, Xi and Xu, Weijie and Tan, Fiona Anting and Zhang, Jiani and Hu, Ziqing and Qi, Yanjun and Nickleach, Scott and Sber, Diego and Gorbachev, Artem and Hou, Ellie , journal=. Large Language Models (

[35] [35]

, title =

Ono, Kyoka and Lee, Simon A. , title =. Proceedings of the 41st International Conference on Machine Learning,. 2024 , eprint =

2024

[36] [36]

Koloski, Boshko and Perdih, Timen and Pollak, Senja , journal=

[37] [37]

Villalobos Carballo, Karina and Joshi, Shaan and Ren, Haoran and Lanusse, Fran. Tab. arXiv preprint arXiv:2206.10381 , year=

work page arXiv

[38] [38]

Enriching Tabular Data with Contextual

Kasneci, Enkelejda and Kasneci, Gjergji , journal=. Enriching Tabular Data with Contextual

[39] [39]

Latte: Transferring

Shi, Han and Gao, Jiahui and Xu, Hang and Liang, Xiaodan and Li, Zhenguo , journal=. Latte: Transferring

[40] [40]

arXiv preprint arXiv:2406.12031 , year=

Large Scale Transfer Learning for Tabular Data via Language Modeling , author=. arXiv preprint arXiv:2406.12031 , year=

work page arXiv

[41] [41]

arXiv preprint arXiv:2602.15844 , year=

Language Model Representations for Efficient Few-Shot Tabular Classification , author=. arXiv preprint arXiv:2602.15844 , year=

work page arXiv

[42] [42]

Wang, Ruijie and Wang, Yumo and Li, Ondrej , journal=. Uni

[43] [43]

arXiv preprint arXiv:2310.07338 , year=

From Supervised to Generative: A Novel Paradigm for Tabular Deep Learning with Large Language Models , author=. arXiv preprint arXiv:2310.07338 , year=

work page arXiv

[44] [44]

Unlock the Potential of Large Language Models for Predictive Tabular Tasks in Data Science with Table-Specific Pretraining

Unleashing the Potential of Large Language Models for Predictive Tabular Tasks in Data Science , author=. arXiv preprint arXiv:2403.20208 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[45] [45]

Haque, Radiah and Goh, Hui-Ngo and Ting, Choo-Yee and Quek, Albert and Hasan, M. D. Rakibul , title =. Computers and Education: Artificial Intelligence , year =

[46] [46]

Large Language Models for Automated Data Science: Introducing

Hollmann, Noah and M. Large Language Models for Automated Data Science: Introducing. NeurIPS 2023 Workshop on Synthetic Data Generation with Generative AI , year=

2023

[47] [47]

Proceedings of the 41st International Conference on Machine Learning , year=

Large Language Models Can Automatically Engineer Features for Few-Shot Tabular Learning , author=. Proceedings of the 41st International Conference on Machine Learning , year=

[48] [48]

, title =

Abhyankar, Nikhil and Shojaee, Parshin and Reddy, Chandan K. , title =. 2025 , eprint =

2025

[49] [49]

Nature , volume=

Accurate Predictions on Small Data with a Tabular Foundation Model , author=. Nature , volume=. 2025 , publisher=

2025

[50] [50]

Ma, Junwei and Nie, Valentin Thomas and Ri, Taro and Dyer, Chris , journal=. Tab

[51] [51]

Huang, Xin and Khetan, Ashish and Cella, Milan and Dhir, Sarthak , booktitle=. Tab

[52] [52]

Advances in Neural Information Processing Systems , volume=

Revisiting Deep Learning Models for Tabular Data , author=. Advances in Neural Information Processing Systems , volume=

[53] [53]

Advances in Neural Information Processing Systems , volume=

Why Do Tree-Based Models Still Outperform Deep Learning on Typical Tabular Data? , author=. Advances in Neural Information Processing Systems , volume=

[54] [54]

Wang, Zifeng and Sun, Jimeng , journal=. Trans

[55] [55]

Kim, Myung Jun and Feuerriegel, Stefan and Hatt, Tobias , journal=

[56] [56]

Sentence-

Reimers, Nils and Gurevych, Iryna , booktitle=. Sentence-

[57] [57]

Text Embeddings by Weakly-Supervised Contrastive Pre-training

Text Embeddings by Weakly-Supervised Contrastive Pre-training , author=. arXiv preprint arXiv:2212.03533 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[58] [58]

Findings of the Association for Computational Linguistics: ACL 2023 , pages=

One Embedder, Any Task: Instruction-Finetuned Text Embeddings , author=. Findings of the Association for Computational Linguistics: ACL 2023 , pages=

2023

[59] [59]

Advances in Neural Information Processing Systems , volume=

Matryoshka Representation Learning , author=. Advances in Neural Information Processing Systems , volume=

[60] [60]

, journal=

Huang, Xiang and Peng, Hao and Zou, Dongcheng and Liu, Zhiwei and Li, Jianxin and Liu, Kay and Wu, Jia and Su, Jianlin and Yu, Philip S. , journal=. Co

[61] [61]

and Shen, Yelong and Wallis, Phillip and Allen-Zhu, Zeyuan and Li, Yuanzhi and Wang, Shean and Wang, Lu and Chen, Weizhu , journal=

Hu, Edward J. and Shen, Yelong and Wallis, Phillip and Allen-Zhu, Zeyuan and Li, Yuanzhi and Wang, Shean and Wang, Lu and Chen, Weizhu , journal=

[62] [62]

Liu, Shih-Yang and Wang, Chien-Yi and Yin, Hongxu and Molchanov, Pavlo and Wang, Yu-Chiang Frank and Cheng, Kwang-Ting and Chen, Min-Hung , journal=

[63] [63]

Available at SSRN 3491790 , year=

From Generalized Linear Models to Neural Networks, and Back , author=. Available at SSRN 3491790 , year=

[64] [64]

2023 , publisher=

Statistical Foundations of Actuarial Learning and its Applications , author=. 2023 , publisher=

2023

[65] [65]

Entity Embeddings of Categorical Variables

Entity Embeddings of Categorical Variables , author=. arXiv preprint arXiv:1604.06737 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[66] [66]

arXiv preprint arXiv:1910.03072 , year=

Sequence Embeddings Help to Identify Fraudulent Cases in Healthcare Insurance , author=. arXiv preprint arXiv:1910.03072 , year=

work page arXiv 1910

[67] [67]

Dutang, Christophe and Charpentier, Arthur , journal=. fre. 2020 , note=

2020

[68] [68]

IEEE Transactions on Pattern Analysis and Machine Intelligence , volume=

Representation Learning: A Review and New Perspectives , author=. IEEE Transactions on Pattern Analysis and Machine Intelligence , volume=

[69] [69]

Enhancing Auto Insurance Risk Evaluation with Transformer and

Sun, Fengyi and Chen, Rui and Wang, Yanyan , journal=. Enhancing Auto Insurance Risk Evaluation with Transformer and

[70] [70]

arXiv preprint , year=

Large Language Models for Insurance Intelligence , author=. arXiv preprint , year=

[71] [71]

arXiv preprint , year=

Assessing Insurers' Litigation Risk: Claim Dispute Prediction with Actionable Interpretations Using Machine Learning , author=. arXiv preprint , year=

[72] [72]

Proceedings of the Fourth ACM International Conference on AI in Finance , pages=

Large language models in finance: A survey , author=. Proceedings of the Fourth ACM International Conference on AI in Finance , pages=