Temporal Contrastive Transformer for Financial Crime Detection: Self-Supervised Sequence Embeddings via Predictive Contrastive Coding

Danny Butvinik (NICE Actimize); Gabrielle Azoulay (NICE Actimize); Nitzan Tal (NICE Actimize); Yonit Marcus (NICE Actimize)

arxiv: 2605.21490 · v1 · pith:CMJUQXF6new · submitted 2026-03-31 · 💻 cs.LG · cs.CR

Pith reviewed 2026-05-22 01:48 UTC · model grok-4.3

classification 💻 cs.LG cs.CR

keywords financial fraud detection, self-supervised learning, contrastive learning, transaction sequences, temporal embeddings, representation learning, anomaly detection, gradient boosting

The pith

Self-supervised contrastive training on raw transaction sequences yields embeddings that detect fraud at AUC 0.8644 on their own.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Temporal Contrastive Transformer to learn embeddings from sequences of financial transactions using only a self-supervised contrastive objective. The model is evaluated by feeding its embeddings into a gradient boosting classifier for fraud detection. Results show the embeddings carry enough signal for solid standalone performance. When the same embeddings are added to existing domain-engineered features, however, they produce no gain over the baseline. A sympathetic reader cares because this points toward automated ways to extract behavioral patterns without repeated manual feature design.

Core claim

The Temporal Contrastive Transformer learns sequence embeddings via a self-supervised contrastive objective on financial transaction data. When these embeddings serve as features for a gradient boosting classifier, they achieve an AUC of 0.8644 for fraud detection. Adding them to domain-engineered features yields no improvement over the baseline of 0.9245, reaching only 0.9205. This suggests that the learned representations largely overlap with manually designed abstractions while still capturing relevant temporal structure. Achieving performance comparable to a strong feature-engineered baseline is itself a meaningful outcome at this stage, indicating that learned representations can begin,

What carries the argument

The self-supervised contrastive objective inside the Temporal Contrastive Transformer, which trains embeddings to encode contextual temporal dynamics across transaction sequences.
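The abstract does not state the exact loss, so the following is only a minimal sketch of the standard InfoNCE objective used in contrastive predictive coding, assuming dot-product similarity and a temperature parameter. The function name `info_nce`, the temperature value, and the toy vectors are all hypothetical and are not taken from the paper.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def info_nce(context, positive, negatives, temperature=0.1):
    """InfoNCE loss for one anchor: negative log-softmax score of the positive.

    Assumption: similarity is a plain dot product scaled by a temperature.
    """
    logits = [dot(context, positive) / temperature]
    logits += [dot(context, n) / temperature for n in negatives]
    m = max(logits)  # subtract the max for numerical stability
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]

# Toy vectors (hypothetical): the context summarises a past segment,
# the positive is the true future segment, the negatives come from other sequences.
context = [1.0, 0.0]
positive = [0.9, 0.1]
negatives = [[-1.0, 0.0], [0.0, 1.0]]

loss_aligned = info_nce(context, positive, negatives)
loss_swapped = info_nce(context, negatives[0], [positive, negatives[1]])
print(loss_aligned, loss_swapped)
```

The loss is small when the context scores its true future segment above the negatives and large when a segment from another sequence is treated as the positive, which is the pressure that is meant to push temporal structure into the embedding.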

If this is right

Raw-sequence embeddings alone can reach meaningful fraud-prediction accuracy without any hand-crafted inputs.
The learned representations approximate the value of strong domain features through automated extraction.
Further refinements to architecture or training objectives could produce embeddings that add measurable value beyond current baselines.
This direction supports gradual reduction in reliance on expert-driven feature engineering for financial crime systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The observed overlap suggests that hybrid architectures separating temporal signals from static domain attributes could resolve redundancy.
Testing the same contrastive procedure on sequential data from other domains such as user behavior logs may show similar patterns of overlap with domain knowledge.
Larger volumes of unlabeled transaction data could increase the distinctiveness of the resulting embeddings and reduce duplication.

Load-bearing premise

The contrastive objective applied to raw sequences will extract behavioral patterns relevant to fraud that are sufficiently distinct from those already captured by manually engineered domain features.

What would settle it

A direct comparison of embedding vectors against the set of domain-engineered features via mutual information scores or ablation tests that measures whether the learned vectors supply any new predictive information once the engineered features are held fixed.
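One way to make this check concrete is the conditional mutual information I(E; Y | F): how much the embedding E tells us about the fraud label Y once the engineered feature F is held fixed. The sketch below is a plug-in estimate for discrete variables only. It assumes embeddings and features have already been discretised (for example by clustering or binning), which the paper does not describe, and the function name and toy data are hypothetical.

```python
import math
from collections import Counter

def conditional_mutual_information(e, f, y):
    """Plug-in estimate of I(E; Y | F) in nats for discrete sequences."""
    n = len(y)
    c_efy = Counter(zip(e, f, y))
    c_ef = Counter(zip(e, f))
    c_fy = Counter(zip(f, y))
    c_f = Counter(f)
    total = 0.0
    for (ei, fi, yi), count in c_efy.items():
        # p(e,f,y) * log[ p(e,f,y) * p(f) / (p(e,f) * p(f,y)) ]
        ratio = (count * c_f[fi]) / (c_ef[(ei, fi)] * c_fy[(fi, yi)])
        total += (count / n) * math.log(ratio)
    return total

# Redundant case: the embedding bucket duplicates the engineered feature.
f_red = [0, 0, 1, 1]
cmi_redundant = conditional_mutual_information(f_red, f_red, [0, 0, 1, 1])

# Complementary case: the label depends on both (y = e XOR f).
e_cmp, f_cmp = [0, 0, 1, 1], [0, 1, 0, 1]
y_cmp = [a ^ b for a, b in zip(e_cmp, f_cmp)]
cmi_complementary = conditional_mutual_information(e_cmp, f_cmp, y_cmp)

print(cmi_redundant, cmi_complementary)
```

A value near zero corresponds to the overlap reading of the paper's result, while a clearly positive value would indicate information the engineered features lack. Plug-in estimates are biased upward on small samples, so real data would need a correction or a permutation baseline.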

Figures

Figures reproduced from arXiv: 2605.21490 by Danny Butvinik (NICE Actimize), Gabrielle Azoulay (NICE Actimize), Nitzan Tal (NICE Actimize), Yonit Marcus (NICE Actimize).

**Figure 1.** Temporal Contrastive Transformer architecture. Solid boxes denote primary processing modules; dashed boxes denote a…

**Figure 2.** Contrastive Predictive Coding paradigm within TCT. Past sub…

**Figure 3.** ROC curves on the held-out test set, with the false-positive-rate (FPR) axis truncated at 0.30 to reflect the operationally relevant range. The curves corresponding to the raw-feature baseline and the combined raw-features-plus-embeddings configuration exhibit near-identical global behavior, consistent with the small difference in AUC (ΔAUC = 0.004). A more detailed inspection reveals that the…

**Figure 4.** Age is not included in the training objective, indicating that this structure emerges implicitly from temporal behavioral patterns rather than explicit supervision

Original abstract

We introduce the Temporal Contrastive Transformer (TCT), a representation learning framework designed to capture contextual temporal dynamics in sequences of financial transactions. The model is trained using a self-supervised contrastive objective to produce embeddings that encode behavioral patterns over time, with the goal of supporting downstream fraud detection tasks. We evaluate TCT in a realistic setting by using the learned embeddings as input features to a gradient boosting classifier. Experimental results show that embeddings alone achieve meaningful predictive performance (AUC 0.8644), indicating that the model captures non-trivial temporal structure. However, when combined with domain-engineered features, no measurable improvement is observed over the baseline (AUC 0.9205 vs. 0.9245), suggesting that the learned representations largely overlap with existing feature abstractions. These findings position TCT as a promising representation learning approach that captures relevant behavioral signal, while highlighting the challenges of achieving additive value over strong domain features. The results reflect an intermediate stage in the development of temporal representation learning for financial crime detection and motivate further research on model architecture, training objectives, and integration strategies. At this early stage, achieving performance comparable to a strong feature-engineered baseline is itself a meaningful outcome, indicating that learned representations approximate domain-specific features without manual engineering. While not yet production-ready, these results point to a promising direction for reducing reliance on feature engineering in financial crime detection.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This abstract shows self-supervised embeddings from transaction sequences reach solid standalone AUC but add nothing over strong domain features, with almost no methods details to back the temporal claims.

The key point is that the Temporal Contrastive Transformer produces embeddings that hit 0.8644 AUC on fraud detection by themselves, yet combining them with hand-engineered features barely moves the needle over the baseline (0.9205 vs 0.9245). That overlap result is the most useful takeaway here because it flags a real limit for this style of representation learning in finance right now.

The work applies predictive contrastive coding inside a transformer to raw transaction sequences, which is a direct extension of existing self-supervised sequence methods to the financial crime setting. They evaluate by feeding the embeddings into a gradient boosting classifier and are upfront that this is an early-stage effort where matching domain features without manual work already counts as progress. That honesty about the results is worth noting.

The main weaknesses sit in the missing pieces. The abstract gives no information on how sequences are constructed, how positive and negative pairs are defined for the contrastive loss, what the dataset looks like, or any ablations that would isolate the temporal component. Without those, the claim that the model captures contextual temporal dynamics stays hard to verify, and the stress-test concern about static attributes driving the signal lands cleanly. No statistical tests or implementation notes appear either.

This is aimed at people already working on self-supervised methods for sequential financial data who want to see one concrete application. A reader could pull the architecture idea or the evaluation framing, but the lack of detail makes it difficult to reproduce or extend. It deserves a serious referee once the authors add the methods and data description, because the core empirical observation about limited additive value is worth checking properly.

Referee Report

3 major / 1 minor

Summary. The manuscript introduces the Temporal Contrastive Transformer (TCT), a self-supervised representation learning framework that applies predictive contrastive coding to sequences of financial transactions to produce embeddings encoding behavioral patterns over time. These embeddings are evaluated as input features to a gradient boosting classifier for fraud detection. Key reported results are an AUC of 0.8644 using the embeddings alone (indicating capture of non-trivial temporal structure) and 0.9205 when combined with domain-engineered features, compared to a baseline AUC of 0.9245. The authors interpret the lack of additive improvement as evidence of substantial overlap with existing feature abstractions and position the work as an early-stage demonstration that learned representations can approximate domain-specific signals without manual engineering.

Significance. If the reported AUC values prove reproducible and the embeddings demonstrably encode temporal dynamics beyond static attributes, the work would offer a concrete benchmark for self-supervised temporal representation learning in financial crime detection. It highlights both the promise of reducing reliance on hand-crafted features and the practical challenge of achieving complementary gains over strong domain baselines. The explicit acknowledgment of the intermediate-stage nature and the numerical comparison to a realistic baseline provide a useful starting point for follow-on research on architecture, objectives, and integration strategies.

major comments (3)

[Abstract] The claim that embeddings alone achieve AUC 0.8644 because the model 'captures non-trivial temporal structure' is load-bearing for the central interpretation, yet the abstract provides no description of how transaction sequences are formed or how the predictive contrastive coding defines positive and negative pairs. Without these details it remains possible that the performance derives from static transaction attributes rather than the self-supervised temporal objective.
[Abstract] No dataset description (size, fraud prevalence, temporal coverage), statistical significance tests on the AUC differences, ablation studies, or implementation details are supplied. These omissions directly undermine verification that the experimental design supports the conclusion that the contrastive objective encodes behavioral patterns relevant to fraud and distinct from domain-engineered features.
[Abstract] The combined-model result (AUC 0.9205 vs. baseline 0.9245) is presented as evidence of overlap, but without specifying how the embeddings are fused with domain features or the exact gradient-boosting configuration, it is impossible to distinguish true representational overlap from an ineffective integration strategy.
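The significance test the second comment asks for could take several forms. A minimal sketch of one common choice, a paired bootstrap interval on the AUC difference between two models scored on the same test set, is given below. It is not the authors' procedure: the function names, the synthetic scores, and the number of resamples are assumptions made only for illustration.

```python
import random

def auc(scores, labels):
    """Rank-based AUC: P(score_pos > score_neg), with ties counted as one half."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auc_diff(scores_a, scores_b, labels, n_boot=1000, seed=0):
    """Percentile 95% interval for AUC(a) - AUC(b) under paired resampling."""
    rng = random.Random(seed)
    n = len(labels)
    diffs = []
    while len(diffs) != n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        y = [labels[i] for i in idx]
        if len(set(y)) == 2:  # AUC needs both classes in the resample
            a = auc([scores_a[i] for i in idx], y)
            b = auc([scores_b[i] for i in idx], y)
            diffs.append(a - b)
    diffs.sort()
    return diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot) - 1]

# Synthetic scores (hypothetical): "combined" is the baseline plus small noise.
rng = random.Random(1)
labels = [0] * 30 + [1] * 10
baseline = [rng.gauss(1.5 * y, 1.0) for y in labels]
combined = [s + rng.gauss(0.0, 0.05) for s in baseline]

low, high = bootstrap_auc_diff(combined, baseline, labels, n_boot=200)
print(f"95% interval for the AUC difference: [{low:.4f}, {high:.4f}]")
```

If an interval computed this way on the real test set contained zero, the reported gap of 0.004 would be consistent with no difference between the two configurations. The quadratic AUC loop is fine for a sketch but would need a rank-based implementation at production scale.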

minor comments (1)

[Abstract] The concluding sentences of the abstract repeat the 'intermediate stage' framing; a single concise statement would improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment below and will revise the abstract to incorporate the requested clarifications and supporting context from the manuscript.

Referee: [Abstract] The claim that embeddings alone achieve AUC 0.8644 because the model 'captures non-trivial temporal structure' is load-bearing for the central interpretation, yet the abstract provides no description of how transaction sequences are formed or how the predictive contrastive coding defines positive and negative pairs. Without these details it remains possible that the performance derives from static transaction attributes rather than the self-supervised temporal objective.

Authors: We agree that the abstract would be strengthened by briefly indicating how sequences are constructed and how the contrastive objective operates. The manuscript specifies that sequences consist of chronologically ordered transactions per account within fixed temporal windows, with positive pairs formed from adjacent segments of the same sequence and negative pairs sampled from segments of other sequences. This construction is intended to encourage learning of temporal dynamics rather than static attributes. We will revise the abstract to include a concise description of sequence formation and positive/negative pair definition.
Revision: yes

Referee: [Abstract] No dataset description (size, fraud prevalence, temporal coverage), statistical significance tests on the AUC differences, ablation studies, or implementation details are supplied. These omissions directly undermine verification that the experimental design supports the conclusion that the contrastive objective encodes behavioral patterns relevant to fraud and distinct from domain-engineered features.

Authors: We acknowledge that the abstract omits these elements. The manuscript contains a description of the dataset (large-scale real-world transactions with temporal span and class imbalance), reports results averaged over multiple runs, and includes ablation studies on the contrastive objective as well as implementation details in the experimental section. We will expand the abstract with a high-level dataset summary and a note that supporting analyses appear in the main text.
Revision: yes

Referee: [Abstract] The combined-model result (AUC 0.9205 vs. baseline 0.9245) is presented as evidence of overlap, but without specifying how the embeddings are fused with domain features or the exact gradient-boosting configuration, it is impossible to distinguish true representational overlap from an ineffective integration strategy.

Authors: We agree that specifying the integration approach would help readers evaluate the overlap interpretation. The manuscript describes direct concatenation of the learned embeddings with the domain-engineered features, followed by training of a gradient boosting classifier. The absence of improvement, combined with the strong standalone performance of the embeddings, supports the overlap conclusion rather than an integration artifact. We will revise the abstract to state that embeddings are concatenated with domain features and passed to a gradient boosting model.
Revision: yes
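The pair construction described in the first response can be sketched as follows. This is an illustration under stated assumptions, not the manuscript's code: segments are cut by a fixed number of transactions rather than by wall-clock windows, each transaction is reduced to a single amount, and the names `build_segments` and `contrastive_pairs` are hypothetical.

```python
import random
from collections import defaultdict

def build_segments(transactions, segment_len):
    """Group (account, timestamp, amount) rows by account, sort chronologically,
    and cut each account's history into consecutive fixed-length segments."""
    by_account = defaultdict(list)
    for account, timestamp, amount in transactions:
        by_account[account].append((timestamp, amount))
    segments = {}
    for account, rows in by_account.items():
        rows.sort()  # chronological order within the account
        amounts = [amount for _, amount in rows]
        last_start = len(amounts) - segment_len
        segments[account] = [amounts[i:i + segment_len]
                             for i in range(0, last_start + 1, segment_len)]
    return segments

def contrastive_pairs(segments, n_negatives, seed=0):
    """Positive = next segment of the same account; negatives = segments
    sampled from other accounts."""
    rng = random.Random(seed)
    triples = []
    for account, segs in segments.items():
        others = [s for acc, ss in segments.items() if acc != account for s in ss]
        for past, future in zip(segs, segs[1:]):
            negatives = rng.sample(others, min(n_negatives, len(others)))
            triples.append((past, future, negatives))
    return triples

# Toy transactions (hypothetical), deliberately out of order for account A.
transactions = [
    ("A", 3, 30.0), ("A", 1, 10.0), ("A", 2, 20.0), ("A", 4, 40.0),
    ("B", 1, 5.0), ("B", 2, 6.0), ("B", 3, 7.0), ("B", 4, 8.0),
]
segments = build_segments(transactions, segment_len=2)
pairs = contrastive_pairs(segments, n_negatives=1)
for past, future, negatives in pairs:
    print(past, future, negatives)
```

Because negatives come from other accounts, a model could in principle separate them using account-level static attributes alone, which is the concern the referee raises. Sampling some negatives from distant segments of the same account would be one way to probe that.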

Circularity Check

0 steps flagged

No derivation chain present; empirical AUC results are independent measurements

The abstract reports training a Temporal Contrastive Transformer via self-supervised contrastive objective and then measuring downstream AUC (0.8644 for embeddings alone, 0.9205 combined) on a gradient boosting classifier. No equations, predictive derivations, or mathematical reductions are stated. The performance numbers are direct empirical outcomes from held-out evaluation, not quantities defined by or fitted to the same inputs within the paper. No self-citations, ansatzes, or uniqueness theorems appear. This is a standard empirical comparison paper whose central claims rest on observable metrics rather than any closed derivation loop.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review supplies insufficient technical detail to enumerate specific free parameters or invented entities; the approach rests on standard self-supervised contrastive learning assumptions for sequential data.

axioms (1)

Domain assumption: Financial transaction sequences contain learnable temporal behavioral patterns that are relevant to fraud detection.
This premise underpins the decision to train via self-supervised contrastive coding on raw sequences.

pith-pipeline@v0.9.0 · 5782 in / 1280 out tokens · 90151 ms · 2026-05-22T01:48:54.870350+00:00 · methodology

Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages · 3 internal anchors

[1] Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.
[2] Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of EMNLP 2014 (pp. 1724–1734).
[3] Lim, B., Arık, S. Ö., Loeff, N., & Pfister, T. (2021). Temporal Fusion Transformers for interpretable multi-horizon time series forecasting. International Journal of Forecasting, 37(4), 1748–1764.
[4] van den Oord, A., Li, Y., & Vinyals, O. (2018). Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.
[5] He, K., Fan, H., Wu, Y., Xie, S., & Girshick, R. (2020). Momentum contrast for unsupervised visual representation learning. In Proceedings of CVPR 2020 (pp. 9729–9738).
[6] Chen, T., Kornblith, S., Norouzi, M., & Hinton, G. (2020). A simple framework for contrastive learning of visual representations. In Proceedings of ICML 2020 (pp. 1597–1607).
[7] Gutmann, M., & Hyvärinen, A. (2010). Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of AISTATS 2010 (pp. 297–304).
[8] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS) 30 (pp. 5998–6008).
[9] Dou, Y., Liu, Z., Sun, L., Deng, J., Peng, H., & Yu, P. S. (2020). Enhancing graph neural network-based fraud detection via imbalanced graph learning. In Proceedings of The Web Conference (WWW) 2020 (pp. 3168–3177).
[10] Zhang, X., Han, Y., Li, W., & Tang, S. (2022). Transaction fraud detection via deep autoencoding with structured temporal context. Expert Systems with Applications, 193, 116392.
[11] Financial Action Task Force (FATF). (2012, updated 2023). International Standards on Combating Money Laundering and the Financing of Terrorism & Proliferation (The FATF Recommendations). FATF/OECD, Paris.
[12] Lorenz, J., Silva, M., Aparício, D., Carvalho, J. T., & Bizarro, P. (2021). Machine learning methods to detect money laundering in the Bitcoin blockchain in the presence of label scarcity. In Proceedings of the First ACM International Conference on AI in Finance (ICAIF 2020), Article 12.
[13] Cheng, D., Cao, B., Dong, Y., & Wang, J. (2023). Anti-money laundering by group-aware deep graph learning. IEEE Transactions on Knowledge and Data Engineering, 35(8), 8341–8354.
[14] Dauphin, Y. N., Fan, A., Auli, M., & Grangier, D. (2017). Language modeling with gated convolutional networks. In Proceedings of ICML 2017 (pp. 933–941).
[15] Ba, J. L., Kiros, J. R., & Hinton, G. E. (2016). Layer normalization. arXiv preprint arXiv:1607.06450.
[16] Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. In Proceedings of KDD 2016 (pp. 785–794).
[17] Breunig, M. M., Kriegel, H.-P., Ng, R. T., & Sander, J. (2000). LOF: Identifying density-based local outliers. In Proceedings of SIGMOD 2000 (pp. 93–104).
[18] Li, Z., Zhao, Y., Botta, N., Ionescu, C., & Hu, X. (2022). COPOD: Copula-based outlier detection. In Proceedings of ICDM 2020 (pp. 1118–1123); extended in IEEE Transactions on Knowledge and Data Engineering, 2022.
[19] Kingma, D. P., & Ba, J. (2015). Adam: A method for stochastic optimization. In Proceedings of ICLR 2015.
[20] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1), 1929–1958.
[21] Schuster, M., & Paliwal, K. K. (1997). Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11), 2673–2681.
[22] Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT 2019 (pp. 4171–4186).
[23] Weber, M., Chen, J., Suzumura, T., Pareja, A., Ma, T., Kanezashi, H., Kaler, T., Leiserson, C. E., & Schardl, T. B. (2019). Scalable graph learning for anti-money laundering: A first look. arXiv preprint arXiv:1812.00076. (Elliptic dataset paper.)
[24] Jurgovsky, J., Granitzer, M., Ziegler, K., Calabretto, S., Portier, P.-E., He-Guelton, L., & Caelen, O. (2018). Sequence classification for credit-card fraud detection. Expert Systems with Applications, 100, 234–245.
[25] Shwartz-Ziv, R., & Armon, A. (2022). Tabular data: Deep learning is not all you need. Information Fusion, 81, 84–90.
