pith. the verified trust layer for science.

arxiv: 2511.19693 · v3 · submitted 2025-11-24 · 💻 cs.LG · cs.AI

TREASURE: The Visa Payment Foundation Model for High-Volume Transaction Understanding

Pith reviewed 2026-05-17 05:24 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords foundation model · transformer · transaction data · payment networks · abnormal behavior detection · recommendation systems · consumer behavior · fraud detection

The pith

A transformer model for payment transactions captures both consumer patterns and network signals to improve fraud detection and recommendations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TREASURE as a transformer-based foundation model built specifically for high-volume payment transaction records. It processes both individual consumer behavior and payment system details such as response codes to support tasks like spotting unusual activity and generating personalized suggestions. A reader would care because better modeling of this data could make commerce safer and more tailored to users. The architecture includes separate input handling for unchanging and time-varying attributes plus a training approach suited to many possible category values. Industry dataset tests show the model raises abnormal behavior detection performance by 111 percent over current production systems when used alone and boosts recommendation models by 104 percent when supplying embeddings.

Core claim

TREASURE is a multipurpose transformer-based foundation model for transaction data that simultaneously captures consumer behavior and payment network signals, featuring an input module with dedicated sub-modules for static and dynamic attributes, an efficient training paradigm for predicting high-cardinality categorical attributes, and demonstrated effectiveness as both a standalone model that increases abnormal behavior detection performance by 111% over production systems and an embedding provider that enhances recommendation models by 104%.
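To make the headline percentages concrete: a relative improvement of 111% means the new metric value is 2.11 times the baseline, not 1.11 times. The sketch below works through that arithmetic. The baseline value of 0.10 is a made-up illustration, since this summary does not say which metric the figures refer to.

```python
def relative_improvement(baseline: float, new: float) -> float:
    """Return the relative improvement of `new` over `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (new - baseline) / baseline * 100.0


def apply_improvement(baseline: float, percent: float) -> float:
    """Return the metric value implied by a relative improvement in percent."""
    return baseline * (1.0 + percent / 100.0)


# Hypothetical baseline score; the real metric and its value are not given.
baseline_score = 0.10
implied = apply_improvement(baseline_score, 111.0)
print(round(implied, 4))  # 0.211
```

The same percentage can correspond to very different absolute gains depending on the baseline, which is why the choice of metric and baseline matters when reading these numbers.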

What carries the argument

The TREASURE transformer model with dedicated sub-modules for static and dynamic transaction attributes and an efficient training paradigm for high-cardinality categorical attributes.
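One way to picture the static/dynamic split: attributes that do not change within a card's history are stored once per sequence, while attributes that change with each transaction are stored once per step. The sketch below is a data-layout illustration only, not the TREASURE input module. The field names (`card_id`, `issuer_country`, `amount`, `response_code`) and the choice of which fields count as static are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Any

# Assumed for illustration: attributes that stay constant across a card's history.
STATIC_FIELDS = ("card_id", "issuer_country")


@dataclass
class CardSequence:
    static: dict[str, Any]
    dynamic: list[dict[str, Any]] = field(default_factory=list)


def group_by_card(rows: list[dict[str, Any]]) -> dict[str, CardSequence]:
    """Group raw rows by card, keeping static fields once and dynamic fields per step."""
    sequences: dict[str, CardSequence] = {}
    for row in rows:
        key = row["card_id"]
        if key not in sequences:
            sequences[key] = CardSequence(static={f: row[f] for f in STATIC_FIELDS})
        # Everything that is not static is treated as a per-transaction attribute.
        step = {k: v for k, v in row.items() if k not in STATIC_FIELDS}
        sequences[key].dynamic.append(step)
    return sequences


rows = [
    {"card_id": "A", "issuer_country": "US", "amount": 12.5, "response_code": "00"},
    {"card_id": "A", "issuer_country": "US", "amount": 80.0, "response_code": "05"},
    {"card_id": "B", "issuer_country": "GB", "amount": 3.2, "response_code": "00"},
]
sequences = group_by_card(rows)
print(len(sequences["A"].dynamic))  # 2
```

Storing static attributes once per sequence avoids repeating them at every step, which is one plausible reading of why a dedicated static sub-module could reduce training and inference cost.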

If this is right

  • Abnormal behavior detection performance increases substantially over existing production systems.
  • Recommendation systems gain accuracy when using embeddings generated by the model.
  • Training and inference become more efficient due to the specialized input module and training paradigm.
  • A single model representation combines consumer behavior signals with payment network details such as response codes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same architecture could be retrained on transaction data from other payment networks to test transferability.
  • Similar input and training designs might apply to other high-volume sequential records such as user activity logs.
  • Real-time versions of the model could support immediate monitoring of incoming transactions.
  • Public benchmarks on open datasets would clarify how much the gains depend on the original Visa data characteristics.

Load-bearing premise

The performance gains depend on proprietary industry-grade datasets whose selection, labeling, and train-test splits are not described in detail.

What would settle it

Evaluating TREASURE on an independent public transaction dataset and finding no gain over standard production baselines would show the improvements do not hold more generally.

Figures

Figures reproduced from arXiv: 2511.19693 by Chin-Chia Michael Yeh, Jiarui Sun, Junpeng Wang, Liang Wang, Mahashweta Das, Menghai Pan, Shubham Jain, Uday Singh Saini, Vineeth Rakesh, Xin Dai, Xiran Fan, Yan Zheng, Yingtong Dou, Yujie Fan, Yuzhong Chen.

Figure 1: Example of raw transaction data showing inter…
Figure 2: Grouped transactions from the same card, demon…
Figure 3: The overall model architecture of TREASURE.
Figure 4: The detailed input module of TREASURE. 𝐻 represents the input to the Transformer decoder block. Numerical and categorical attributes are processed differently. Numerical attributes are first transformed to a logarithmic scale, as all numerical features in our dataset (e.g., transaction amounts, time differences between transactions) exhibit long-tail distributions. These log-scaled numerical attributes ar…
Figure 5: The detailed output module of TREASURE. Two…
Figure 6: Efficiency improvement through shared negative…
Figure 7: Embeddings generated by TREASURE demonstrate…
Figure 9: We developed a GUI to explore the embedding space.
Figure 10: Model performance scales with dataset size, with…
Figure 11: Performance scaling with model size, using 16-bit…
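
The Figure 4 caption states that numerical attributes are log-transformed because they are long-tailed. A minimal sketch of that step follows. The use of `log1p` (so that zero values are handled) and the sample amounts are assumptions, as the caption does not say which logarithm variant is used.

```python
import math


def log_scale(values: list[float]) -> list[float]:
    """Compress long-tailed, non-negative values with log(1 + x)."""
    if any(v < 0 for v in values):
        raise ValueError("log_scale expects non-negative values")
    return [math.log1p(v) for v in values]


# Hypothetical transaction amounts spanning several orders of magnitude.
amounts = [0.0, 9.0, 99.0, 9999.0]
scaled = log_scale(amounts)
print([round(s, 3) for s in scaled])  # [0.0, 2.303, 4.605, 9.21]
```

After the transform, amounts that differ by a factor of a thousand land within a single-digit range, so a few very large transactions no longer dominate the input scale.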
Original abstract

Payment networks form the backbone of modern commerce, generating high volumes of transaction records from daily activities. Properly modeling this data can enable applications such as abnormal behavior detection and consumer-level insights for hyper-personalized experiences, ultimately improving people's lives. In this paper, we present TREASURE, TRansformer Engine As Scalable Universal transaction Representation Encoder, a multipurpose transformer-based foundation model specifically designed for transaction data. The model simultaneously captures both consumer behavior and payment network signals (such as response codes and system flags), providing comprehensive information necessary for applications like accurate recommendation systems and abnormal behavior detection. Verified with industry-grade datasets, TREASURE features three key capabilities: 1) an input module with dedicated sub-modules for static and dynamic attributes, enabling more efficient training and inference; 2) an efficient and effective training paradigm for predicting high-cardinality categorical attributes; and 3) demonstrated effectiveness as both a standalone model that increases abnormal behavior detection performance by 111% over production systems and an embedding provider that enhances recommendation models by 104%. We present key insights from extensive ablation studies, benchmarks against production models, and case studies, highlighting valuable knowledge gained from developing TREASURE.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces TREASURE, a transformer-based foundation model for high-volume payment transaction data. It proposes a specialized input module with sub-modules for static and dynamic attributes, an efficient training objective for high-cardinality categorical attributes, and reports that the model improves abnormal behavior detection by 111% over production systems when used standalone and boosts recommendation performance by 104% when used to provide embeddings. These results are supported by ablation studies, benchmarks against production models, and case studies on industry-grade Visa datasets.

Significance. If the reported gains prove robust under detailed scrutiny, the work could meaningfully advance foundation-model approaches in financial transaction modeling, particularly for fraud detection and personalization tasks that rely on mixed static/dynamic categorical features. The emphasis on scalable handling of high-cardinality attributes and dual use as detector or embedder addresses practical constraints in payment networks. However, the proprietary datasets and absence of reproducible experimental protocols substantially limit current assessment of generalizability and impact.

major comments (2)
  1. [Abstract and evaluation sections] The claims of 111% and 104% relative improvements are presented without any description of the underlying metrics, production baselines, dataset sampling procedure, label acquisition/validation process, train/validation/test splits, or statistical testing. These omissions are load-bearing because the central contribution rests on the magnitude and reliability of these gains on 'industry-grade' data.
  2. [Training paradigm section] The model is trained to predict attributes drawn from the same class of industry transaction data later used for downstream evaluation, yet no mention is made of strictly held-out external benchmarks or independent validation sets. This creates a circularity risk that must be addressed to support the generalization claims.
minor comments (1)
  1. [Abstract] The acronym expansion contains inconsistent capitalization ('TRansformer Engine As Scalable Universal transaction Representation Encoder'); standardize for readability.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their constructive comments on our paper. We address each of the major comments in detail below. We agree that additional clarifications are needed in some areas and will make revisions accordingly. However, certain details regarding the proprietary Visa datasets cannot be fully disclosed due to privacy and confidentiality constraints.

point-by-point responses
  1. Referee: [Abstract and evaluation sections] The claims of 111% and 104% relative improvements are presented without any description of the underlying metrics, production baselines, dataset sampling procedure, label acquisition/validation process, train/validation/test splits, or statistical testing. These omissions are load-bearing because the central contribution rests on the magnitude and reliability of these gains on 'industry-grade' data.

    Authors: We appreciate this observation and agree that more transparency would benefit readers. Due to the proprietary and sensitive nature of the Visa transaction datasets, we are unable to provide exhaustive details on dataset sampling procedures, label acquisition and validation processes, or exact train/validation/test splits, as these could compromise data privacy and reveal proprietary business practices. We will revise the manuscript to include descriptions of the underlying metrics used for the reported improvements (such as the specific performance measures for abnormal behavior detection and recommendation tasks), general characteristics of the production baselines, and any statistical testing performed where possible without violating confidentiality. We believe these additions will address the core concern while respecting data constraints. The reported gains were validated through extensive internal benchmarks on industry-grade data. revision: partial

  2. Referee: [Training paradigm section] The model is trained to predict attributes drawn from the same class of industry transaction data later used for downstream evaluation, yet no mention is made of strictly held-out external benchmarks or independent validation sets. This creates a circularity risk that must be addressed to support the generalization claims.

    Authors: We acknowledge the potential for perceived circularity. The pretraining objective involves predicting attributes from a broad corpus of transaction data to learn universal representations. The downstream tasks, including abnormal behavior detection and recommendation, utilize separate evaluation datasets with task-specific labels that are not part of the pretraining attribute prediction. To mitigate concerns, we will update the training paradigm section to explicitly state that evaluation sets are held-out and temporally separated from the pretraining data to prevent information leakage. While we do not have access to fully independent external public benchmarks due to the domain-specific nature of payment data, the internal validations use rigorous splits. We will add this clarification in the revision. revision: partial
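
The temporal separation the authors describe can be sketched as a cutoff-based split: everything before a cutoff date is available for pretraining, and everything on or after it is reserved for evaluation. The record layout, the `timestamp` key, and the cutoff date below are hypothetical; the actual Visa split procedure is not disclosed.

```python
from datetime import datetime


def temporal_split(records, cutoff):
    """Split records into (before cutoff, on or after cutoff) by timestamp."""
    pretrain = [r for r in records if r["timestamp"] < cutoff]
    held_out = [r for r in records if r["timestamp"] >= cutoff]
    return pretrain, held_out


records = [
    {"id": 1, "timestamp": datetime(2024, 1, 15)},
    {"id": 2, "timestamp": datetime(2024, 3, 2)},
    {"id": 3, "timestamp": datetime(2024, 6, 20)},
]
pretrain, held_out = temporal_split(records, cutoff=datetime(2024, 4, 1))

# No evaluation record may predate the newest pretraining record.
assert max(r["timestamp"] for r in pretrain) < min(r["timestamp"] for r in held_out)
print([r["id"] for r in pretrain], [r["id"] for r in held_out])  # [1, 2] [3]
```

A split by time alone does not stop the same card from appearing on both sides of the cutoff, so it addresses leakage from future transactions but not overlap between entities.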

standing simulated objections not resolved
  • Full disclosure of dataset details, sampling procedures, and experimental protocols remains unavailable due to the proprietary nature of the Visa payment transaction data.

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The provided abstract and context describe a standard transformer foundation model with dedicated input sub-modules, a training paradigm for high-cardinality attribute prediction, and downstream empirical evaluations on abnormal behavior detection and recommendation tasks using industry-grade Visa datasets. No equations, self-citations, or load-bearing steps are exhibited that reduce any claimed prediction or result to its own inputs by construction. The performance numbers (111% and 104%) are presented as outcomes of comparisons against external production baselines rather than fitted parameters renamed as predictions or self-definitional constructs. The derivation is therefore self-contained as an empirical ML development process without the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claims rest on standard transformer assumptions plus domain-specific design choices for transaction data; no new physical entities are introduced, but many hyperparameters and data-handling decisions are implicit.

free parameters (2)
  • transformer hyperparameters
    Number of layers, attention heads, and embedding dimensions are chosen to enable efficient training on high-volume data but not enumerated in the abstract.
  • training objective weights
    Balancing the prediction of multiple high-cardinality categorical attributes requires weighting choices that affect the learned representations.
axioms (2)
  • domain assumption Payment transaction records can be usefully decomposed into static customer attributes and dynamic sequence attributes.
    This decomposition underpins the dedicated input sub-modules described in the abstract.
  • domain assumption Predicting high-cardinality categorical fields during pretraining yields representations that transfer to detection and recommendation tasks.
    This is the core training paradigm claimed to be efficient and effective.

pith-pipeline@v0.9.0 · 5563 in / 1492 out tokens · 53651 ms · 2026-05-17T05:24:19.738195+00:00 · methodology


