TREASURE: The Visa Payment Foundation Model for High-Volume Transaction Understanding
Pith reviewed 2026-05-17 05:24 UTC · model grok-4.3
The pith
A transformer model for payment transactions captures both consumer patterns and network signals to improve fraud detection and recommendations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TREASURE is a multipurpose transformer-based foundation model for transaction data that simultaneously captures consumer behavior and payment network signals. It features an input module with dedicated sub-modules for static and dynamic attributes and an efficient training paradigm for predicting high-cardinality categorical attributes. The paper reports effectiveness in two roles: as a standalone model it increases abnormal behavior detection performance by 111% over production systems, and as an embedding provider it enhances recommendation models by 104%.
What carries the argument
The TREASURE transformer model with dedicated sub-modules for static and dynamic transaction attributes and an efficient training paradigm for high-cardinality categorical attributes.
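One plausible reading of the split input module, as a minimal sketch: attributes that stay fixed across a sequence are encoded once and reused, while per-transaction attributes are encoded at every step. The field names, the toy hash-based `embed` function, and the additive fusion are illustrative assumptions, not details taken from the paper.

```python
import hashlib

DIM = 8  # toy embedding width (assumption)

def embed(field, value, dim=DIM):
    """Deterministic toy embedding: hash (field, value) into dim floats in [-1, 1)."""
    digest = hashlib.sha256(f"{field}={value}".encode()).digest()
    return [digest[i] / 128.0 - 1.0 for i in range(dim)]

def encode_fields(fields):
    """Sum the toy embeddings of every field in one record."""
    total = [0.0] * DIM
    for name, value in fields.items():
        total = [t + e for t, e in zip(total, embed(name, value))]
    return total

def encode_sequence(static, transactions):
    """Encode static attributes once, then fuse them with each transaction."""
    static_vec = encode_fields(static)  # computed once per sequence, not per step
    return [
        [s + d for s, d in zip(static_vec, encode_fields(txn))]
        for txn in transactions
    ]

# Hypothetical fields, for illustration only.
static = {"card_type": "credit"}
transactions = [{"mcc": "5411"}, {"mcc": "5812"}]
tokens = encode_sequence(static, transactions)
```

The efficiency argument in this reading is that the static encoder runs once per sequence rather than once per transaction; whether TREASURE fuses the two parts by addition, concatenation, or something else is not stated on this page.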
If this is right
- Abnormal behavior detection performance increases substantially over existing production systems.
- Recommendation systems gain accuracy when using embeddings generated by the model.
- Training and inference become more efficient due to the specialized input module and training paradigm.
- A single model representation combines consumer behavior signals with payment network details such as response codes.
Where Pith is reading between the lines
- The same architecture could be retrained on transaction data from other payment networks to test transferability.
- Similar input and training designs might apply to other high-volume sequential records such as user activity logs.
- Real-time versions of the model could support immediate monitoring of incoming transactions.
- Public benchmarks on open datasets would clarify how much the gains depend on the original Visa data characteristics.
Load-bearing premise
The performance gains depend on proprietary industry-grade datasets whose selection, labeling, and train-test splits are not described in detail.
What would settle it
Evaluating TREASURE on an independent public transaction dataset and finding no gain over standard production baselines would show the improvements do not hold more generally.
Original abstract
Payment networks form the backbone of modern commerce, generating high volumes of transaction records from daily activities. Properly modeling this data can enable applications such as abnormal behavior detection and consumer-level insights for hyper-personalized experiences, ultimately improving people's lives. In this paper, we present TREASURE, TRansformer Engine As Scalable Universal transaction Representation Encoder, a multipurpose transformer-based foundation model specifically designed for transaction data. The model simultaneously captures both consumer behavior and payment network signals (such as response codes and system flags), providing comprehensive information necessary for applications like accurate recommendation systems and abnormal behavior detection. Verified with industry-grade datasets, TREASURE features three key capabilities: 1) an input module with dedicated sub-modules for static and dynamic attributes, enabling more efficient training and inference; 2) an efficient and effective training paradigm for predicting high-cardinality categorical attributes; and 3) demonstrated effectiveness as both a standalone model that increases abnormal behavior detection performance by 111% over production systems and an embedding provider that enhances recommendation models by 104%. We present key insights from extensive ablation studies, benchmarks against production models, and case studies, highlighting valuable knowledge gained from developing TREASURE.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces TREASURE, a transformer-based foundation model for high-volume payment transaction data. It proposes a specialized input module with sub-modules for static and dynamic attributes, an efficient training objective for high-cardinality categorical attributes, and reports that the model improves abnormal behavior detection by 111% over production systems when used standalone and boosts recommendation performance by 104% when used to provide embeddings. These results are supported by ablation studies, benchmarks against production models, and case studies on industry-grade Visa datasets.
Significance. If the reported gains prove robust under detailed scrutiny, the work could meaningfully advance foundation-model approaches in financial transaction modeling, particularly for fraud detection and personalization tasks that rely on mixed static/dynamic categorical features. The emphasis on scalable handling of high-cardinality attributes and dual use as detector or embedder addresses practical constraints in payment networks. However, the proprietary datasets and absence of reproducible experimental protocols substantially limit current assessment of generalizability and impact.
major comments (2)
- [Abstract and evaluation sections] The claims of 111% and 104% relative improvements are presented without any description of the underlying metrics, production baselines, dataset sampling procedure, label acquisition/validation process, train/validation/test splits, or statistical testing. These omissions are load-bearing because the central contribution rests on the magnitude and reliability of these gains on 'industry-grade' data.
- [Training paradigm section] The model is trained to predict attributes drawn from the same class of industry transaction data later used for downstream evaluation, yet no mention is made of strictly held-out external benchmarks or independent validation sets. This creates a circularity risk that must be addressed to support the generalization claims.
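A little arithmetic shows why the missing metric definitions matter. The sketch below uses made-up numbers (not results from the paper) to show that a relative improvement of 111% fixes only the ratio between model and baseline, not the absolute level of either.

```python
def relative_improvement(baseline, model):
    """Relative gain of model over baseline, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (model - baseline) / baseline * 100.0

# Hypothetical metric values, chosen only for illustration.
weak_baseline = relative_improvement(baseline=0.10, model=0.211)
strong_baseline = relative_improvement(baseline=0.40, model=0.844)

# Both gains come out at roughly 111%, yet the absolute scores differ fourfold.
print(round(weak_baseline, 2), round(strong_baseline, 2))
```

Without knowing the metric and the baseline's absolute score, the two scenarios are indistinguishable from the headline number alone.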
minor comments (1)
- [Abstract] The acronym expansion contains inconsistent capitalization ('TRansformer Engine As Scalable Universal transaction Representation Encoder'); standardize for readability.
Simulated Author's Rebuttal
We thank the referee for their constructive comments on our paper. We address each of the major comments in detail below. We agree that additional clarifications are needed in some areas and will make revisions accordingly. However, certain details regarding the proprietary Visa datasets cannot be fully disclosed due to privacy and confidentiality constraints.
Point-by-point responses
- Referee: [Abstract and evaluation sections] The claims of 111% and 104% relative improvements are presented without any description of the underlying metrics, production baselines, dataset sampling procedure, label acquisition/validation process, train/validation/test splits, or statistical testing. These omissions are load-bearing because the central contribution rests on the magnitude and reliability of these gains on 'industry-grade' data.
  Authors: We appreciate this observation and agree that more transparency would benefit readers. Due to the proprietary and sensitive nature of the Visa transaction datasets, we are unable to provide exhaustive details on dataset sampling procedures, label acquisition and validation processes, or exact train/validation/test splits, as these could compromise data privacy and reveal proprietary business practices. We will revise the manuscript to include descriptions of the underlying metrics used for the reported improvements (such as the specific performance measures for abnormal behavior detection and recommendation tasks), general characteristics of the production baselines, and any statistical testing performed where possible without violating confidentiality. We believe these additions will address the core concern while respecting data constraints. The reported gains were validated through extensive internal benchmarks on industry-grade data.
  Revision: partial
- Referee: [Training paradigm section] The model is trained to predict attributes drawn from the same class of industry transaction data later used for downstream evaluation, yet no mention is made of strictly held-out external benchmarks or independent validation sets. This creates a circularity risk that must be addressed to support the generalization claims.
  Authors: We acknowledge the potential for perceived circularity. The pretraining objective involves predicting attributes from a broad corpus of transaction data to learn universal representations. The downstream tasks, including abnormal behavior detection and recommendation, utilize separate evaluation datasets with task-specific labels that are not part of the pretraining attribute prediction. To mitigate concerns, we will update the training paradigm section to explicitly state that evaluation sets are held-out and temporally separated from the pretraining data to prevent information leakage. While we do not have access to fully independent external public benchmarks due to the domain-specific nature of payment data, the internal validations use rigorous splits. We will add this clarification in the revision.
  Revision: partial
- Full disclosure of dataset details, sampling procedures, and experimental protocols due to the proprietary nature of the Visa payment transaction data.
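The held-out, temporally separated evaluation that the authors describe in their response can be sketched as follows. The record layout, the cutoff date, and the `temporal_split` and `has_leakage` helpers are hypothetical; the paper's actual split procedure is not disclosed.

```python
from datetime import datetime

def temporal_split(records, cutoff):
    """Records strictly before cutoff go to pretraining; the rest are held out."""
    pretrain = [r for r in records if r["timestamp"] < cutoff]
    held_out = [r for r in records if r["timestamp"] >= cutoff]
    return pretrain, held_out

def has_leakage(pretrain, held_out):
    """True if any pretraining record is at or after the earliest held-out record."""
    if not pretrain or not held_out:
        return False
    latest_pretrain = max(r["timestamp"] for r in pretrain)
    earliest_held_out = min(r["timestamp"] for r in held_out)
    return latest_pretrain >= earliest_held_out

# Toy records, for illustration only.
records = [
    {"id": 1, "timestamp": datetime(2024, 1, 5)},
    {"id": 2, "timestamp": datetime(2024, 3, 9)},
    {"id": 3, "timestamp": datetime(2024, 6, 1)},
    {"id": 4, "timestamp": datetime(2024, 8, 20)},
]
pretrain, held_out = temporal_split(records, cutoff=datetime(2024, 6, 1))
```

A split like this guards against temporal leakage only. It does not address the referee's separate point about independent external benchmarks, since both sides still come from the same data source.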
Circularity Check
No significant circularity in derivation chain
The provided abstract and context describe a standard transformer foundation model with dedicated input sub-modules, a training paradigm for high-cardinality attribute prediction, and downstream empirical evaluations on abnormal behavior detection and recommendation tasks using industry-grade Visa datasets. No equations, self-citations, or load-bearing steps are exhibited that reduce any claimed prediction or result to its own inputs by construction. The performance numbers (111% and 104%) are presented as outcomes of comparisons against external production baselines rather than fitted parameters renamed as predictions or self-definitional constructs. The derivation is therefore self-contained as an empirical ML development process without the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
free parameters (2)
- transformer hyperparameters
- training objective weights
axioms (2)
- domain assumption: Payment transaction records can be usefully decomposed into static customer attributes and dynamic sequence attributes.
- domain assumption: Predicting high-cardinality categorical fields during pretraining yields representations that transfer to detection and recommendation tasks.
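To make the second assumption concrete: a full softmax over a field with a very large number of distinct values is expensive, so contrastive objectives score the true value against a small set of sampled negatives. A paper passage quoted elsewhere on this page mentions an InfoNCE loss for high-cardinality attributes; the sketch below is a generic InfoNCE in plain Python, with the temperature, the vectors, and the choice of negatives as assumptions rather than the authors' configuration.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def info_nce(query, positive, negatives, temperature=0.1):
    """InfoNCE: cross-entropy of picking the positive among 1 + len(negatives) candidates."""
    logits = [dot(query, positive) / temperature]
    logits += [dot(query, n) / temperature for n in negatives]
    m = max(logits)  # subtract the max for numerical stability
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]

# Toy 2-d vectors: the query matches the positive in the first case only.
aligned = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
misaligned = info_nce([1.0, 0.0], [-1.0, 0.0], [[1.0, 0.0]])
```

The cost of one loss evaluation scales with the number of sampled negatives rather than with the size of the vocabulary, which is the usual motivation for this family of objectives.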
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: Numerical attributes are first transformed to a logarithmic scale... log-normal distributions... InfoNCE loss for high-cardinality... L = L_abnormal + scaled sum of auxiliary losses
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: Transformer decoder block with causal masked self-attention... 3-layer, 4 heads, hidden dim 256
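The architecture excerpt above (causal masked self-attention, 4 heads, hidden dimension 256) implies 64 dimensions per head under the usual even split, and a mask that lets each transaction attend only to itself and to earlier transactions. A minimal sketch, with the even split and the helper names as assumptions:

```python
import math

HIDDEN_DIM, NUM_HEADS = 256, 4      # values quoted in the excerpt
HEAD_DIM = HIDDEN_DIM // NUM_HEADS  # even split across heads (assumption)

def causal_mask(seq_len):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

def masked_softmax(scores, mask_row):
    """Softmax over allowed positions; masked positions get weight 0."""
    allowed = [s for s, ok in zip(scores, mask_row) if ok]
    m = max(allowed)  # subtract the max for numerical stability
    exps = [math.exp(s - m) if ok else 0.0 for s, ok in zip(scores, mask_row)]
    total = sum(exps)
    return [e / total for e in exps]

mask = causal_mask(3)
# Attention weights for the second transaction: it cannot see the third one.
weights = masked_softmax([1.0, 2.0, 3.0], mask[1])
```

Because the diagonal of the mask is always allowed, every row has at least one visible position and the softmax is well defined.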
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.