arxiv: 2605.05572 · v1 · submitted 2026-05-07 · 💻 cs.CV

Recognition: unknown

Text-to-CAD Retrieval: a Strong Baseline

Honghu Pan , Zibo Du , Daxiang Liu , Chengliang Liu , Xiaoling Luo

Authors on Pith no claims yet

Pith reviewed 2026-05-08 15:04 UTC · model grok-4.3

classification 💻 cs.CV

keywords text-to-CAD retrievalcross-modal retrievalCAD embeddingsfeature decodermulti-modal alignmentprocedural sequencespoint cloudText2CAD dataset

0 comments

The pith

A multi-modal framework aligns text queries with CAD procedural sequences and point clouds to retrieve relevant models from large databases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces text-to-CAD retrieval as a new task and creates a benchmark using paired data from the Text2CAD dataset. It trains encoders on the construction sequence of each model, its geometric point cloud, and the accompanying text description. A feature decoder aligns these representations during training by reconstructing masked sequence features through cross-attention with the other inputs. The decoder is removed at inference to allow fast retrieval from the combined sequence and point features. A reader would care because existing CAD repositories are searched only by filenames or folders, which makes finding reusable industrial designs inefficient and limits design reuse in engineering workflows.

Core claim

The central claim is that a unified framework learns multi-modal CAD embeddings from procedural sequences and geometric point clouds, with a novel feature decoder that reconstructs masked sequence features via cross-attention with text and point features to encourage implicit alignment, so that at inference time the concatenated sequence-point features support accurate text-based retrieval.

What carries the argument

The feature decoder that reconstructs masked sequence features via cross-attention with text and point features to produce aligned multi-modal embeddings.

If this is right

The approach supplies a concrete benchmark and baseline for measuring progress on text-to-CAD retrieval.
Efficient inference becomes possible by dropping the decoder and using only the concatenated sequence and point features.
The framework can serve as a starting point for retrieval-augmented generation of new CAD models.
Combining construction logic from sequences with explicit geometry from points improves matching over using either alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Design software could embed this retrieval step to suggest similar legacy models while a user is building a new part.
The same alignment pattern might transfer to text-based search in other procedural modeling domains such as architecture or manufacturing.
Scaling the encoders to larger CAD collections could test whether the alignment remains stable as database size grows.

Load-bearing premise

The paired text descriptions in the dataset accurately reflect the semantics of the CAD models so that the alignment produces embeddings that work on new queries outside the training set.

What would settle it

If the method's retrieval accuracy on the Text2CAD test split falls below that of a simple keyword-matching baseline on the same data, the multi-modal alignment would not deliver the claimed advantage.

read the original abstract

Text-based retrieval of Computer-Aided Design (CAD) models is a critical yet underexplored task for the reuse of legacy industrial designs. Existing CAD repositories are typically searched using filenames or directories, which limits the efficiency, scalability, and accuracy of design retrieval. In this paper, we formally introduce text-to-CAD retrieval as a new cross-modal retrieval task, aiming to retrieve semantically relevant CAD models from large-scale databases given natural language queries. Leveraging paired text-CAD annotations from the Text2CAD dataset, we establish a practical benchmark for this task. To achieve text-based retrieval, we propose a unified framework that learns multi-modal CAD embeddings from both procedural sequences and geometric point clouds. Specifically, a sequence encoder captures the construction logic of CAD models, while a point encoder extracts explicit geometric features. A text encoder is used to learn semantic representations of textual queries. During training, we introduce a novel feature decoder that reconstructs masked sequence features via cross-attention with text and point features, encouraging implicit multi-modal alignment. At inference time, we remove this auxiliary decoder to enable efficient retrieval using concatenated sequence-point features. Our framework serves as a strong baseline for text-to-CAD retrieval and lays the foundation for downstream CAD generation paradigms, such as retrieval-augmented generation. The source code will be released.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper defines text-to-CAD retrieval as a new task and describes a multi-modal architecture with indirect alignment via reconstruction, but supplies no metrics so the strong-baseline claim stays untested.

read the letter

This paper's core move is to treat text-to-CAD retrieval as its own cross-modal problem rather than a side application of text-to-3D work. It pulls in the Text2CAD dataset, encodes CAD models both as procedural sequences and point clouds, and adds a text encoder. Training runs a decoder that reconstructs masked sequence features through cross-attention on the text and point features; at inference the decoder is removed and the concatenated sequence-plus-point vector is used for retrieval. That combination of modalities and the auxiliary decoder is not a routine copy of existing retrieval setups, and the motivation from industrial CAD reuse is stated plainly. The plan to release code is also a plus for anyone who wants to build on it. The architecture itself is described clearly enough that the training objective and inference path can be followed without guesswork. The main weakness is the complete absence of results. No recall numbers, no comparisons to simple baselines, no ablations on the decoder, and no check on whether the reconstruction loss actually produces embeddings that rank relevant CAD models well. The stress-test concern holds: nothing in the loss directly pulls text and CAD embeddings together in a metric space, so it is possible to minimize reconstruction error while still learning features that are weak for nearest-neighbor search. The quality of the Text2CAD text annotations is also taken as given with no discussion of noise or coverage. This is aimed at people working on CAD search tools or multi-modal 3D retrieval who need a starting benchmark and architecture. A reader who wants to extend the task or add contrastive losses would get concrete material to work from. It deserves peer review because the task definition and setup are solid enough to be worth referee time, even though the current draft needs a full experimental section before it can be judged on its own terms. I would send it out with the expectation that the authors add quantitative retrieval results and test whether the indirect alignment actually competes with direct contrastive baselines.

Referee Report

3 major / 1 minor

Summary. The paper introduces text-to-CAD retrieval as a new cross-modal task for searching CAD models using natural language queries. It proposes a framework with a sequence encoder for procedural construction logic, a point encoder for geometric features, and a text encoder, trained via a feature decoder that reconstructs masked sequence features through cross-attention with text and point features to achieve implicit multi-modal alignment. At inference, the decoder is removed and concatenated sequence-point embeddings are used for retrieval. The work claims this serves as a strong baseline and foundation for downstream tasks like retrieval-augmented CAD generation, using the Text2CAD dataset.

Significance. If the empirical claims hold, the work would establish an initial benchmark and practical method for text-based retrieval in CAD repositories, addressing limitations of filename-based search in industrial design reuse. The dual use of procedural sequences and point clouds is a domain-appropriate choice for CAD representations that could support future retrieval-augmented generation pipelines.

major comments (3)

[Abstract] Abstract: The central claim that the framework 'serves as a strong baseline' for text-to-CAD retrieval is unsupported because the manuscript provides no quantitative retrieval metrics (e.g., Recall@K, mAP), ablation studies, baseline comparisons, or error analysis on the Text2CAD dataset.
[Training description] Training description (feature decoder): The masked sequence reconstruction objective via cross-attention with text and point features is an indirect alignment signal that does not explicitly enforce metric similarity between text embeddings and the final concatenated sequence-point vectors used at inference. This leaves open the possibility that reconstruction can be achieved without learning retrieval-competitive, semantically discriminative features.
[Method overview] Method overview: No details are given on how the concatenated sequence-point features are normalized or compared to text embeddings during retrieval (e.g., cosine similarity, learned projection), nor on any contrastive or ranking loss that would directly optimize the inference-time metric.

minor comments (1)

[Abstract] The abstract states that source code will be released but provides no link or repository information in the manuscript.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their detailed review and valuable feedback on our manuscript. We address each of the major comments below and outline the revisions we plan to make.

read point-by-point responses

Referee: [Abstract] Abstract: The central claim that the framework 'serves as a strong baseline' for text-to-CAD retrieval is unsupported because the manuscript provides no quantitative retrieval metrics (e.g., Recall@K, mAP), ablation studies, baseline comparisons, or error analysis on the Text2CAD dataset.

Authors: We acknowledge that the claim of serving as a 'strong baseline' requires empirical validation through quantitative metrics. The current manuscript focuses on introducing the task and the framework, but to substantiate this claim, we will add comprehensive experiments including Recall@K and mAP scores, ablation studies on the components (sequence encoder, point encoder, decoder), comparisons to simple baselines such as text-only or point-only retrieval, and error analysis on the Text2CAD dataset in the revised version. revision: yes
Referee: [Training description] Training description (feature decoder): The masked sequence reconstruction objective via cross-attention with text and point features is an indirect alignment signal that does not explicitly enforce metric similarity between text embeddings and the final concatenated sequence-point vectors used at inference. This leaves open the possibility that reconstruction can be achieved without learning retrieval-competitive, semantically discriminative features.

Authors: The referee correctly points out that the alignment is implicit through the reconstruction task. While the cross-attention mechanism requires the text and point features to be informative for reconstructing the masked sequence features, thereby encouraging semantic alignment, we agree that this is indirect. To strengthen the approach, we will incorporate a direct contrastive loss between the text embeddings and the concatenated sequence-point embeddings during training to explicitly optimize for the retrieval metric. We will also provide analysis demonstrating that the reconstruction objective leads to discriminative features suitable for retrieval. revision: yes
Referee: [Method overview] Method overview: No details are given on how the concatenated sequence-point features are normalized or compared to text embeddings during retrieval (e.g., cosine similarity, learned projection), nor on any contrastive or ranking loss that would directly optimize the inference-time metric.

Authors: We apologize for the omission of these implementation details. In the revised manuscript, we will specify that the concatenated sequence-point features and text embeddings are L2-normalized and compared using cosine similarity for retrieval. Additionally, we will introduce a contrastive loss to directly optimize this similarity metric during training, as noted in our response to the training description comment. revision: yes

Circularity Check

0 steps flagged

No circularity detected: reconstruction objective independent of retrieval metric

full rationale

The paper defines a new cross-modal retrieval task and proposes an architecture with separate sequence/point/text encoders plus an auxiliary feature decoder trained on masked sequence reconstruction via cross-attention. This training signal is distinct from the inference-time retrieval procedure that simply concatenates sequence and point features for similarity search against text embeddings. No equations, fitted parameters, or self-citations are shown that would make the claimed retrieval performance equivalent to the inputs by construction. The derivation chain remains self-contained with independent training and evaluation components.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available, so no concrete free parameters, axioms, or invented entities can be extracted. The approach implicitly assumes that standard transformer-style encoders can learn aligned representations from the given modalities and that the Text2CAD pairs are semantically meaningful.

pith-pipeline@v0.9.0 · 5537 in / 1126 out tokens · 49362 ms · 2026-05-08T15:04:05.760543+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

44 extracted references · 5 canonical work pages · 5 internal anchors

[1]

Unified conditional image genera- tion for visible-infrared person re-identification,

H. Pan, W. Pei, X. Li, and Z. He, “Unified conditional image genera- tion for visible-infrared person re-identification,”IEEE Transactions on Information Forensics and Security, vol. 19, pp. 9026–9038, 2024. AUTHORet al.: TITLE 9

2024
[2]

Towards unified bi- jective image-text generation for text-to-image person re-identification,

Q. Wang, X. Ma, X. Jiang, J. Ji, and H. Pan, “Towards unified bi- jective image-text generation for text-to-image person re-identification,” Knowledge-Based Systems, p. 114014, 2025

2025
[3]

View-based 3-d cad model retrieval with deep residual networks,

C. Zhang, G. Zhou, H. Yang, Z. Xiao, and X. Yang, “View-based 3-d cad model retrieval with deep residual networks,”IEEE Transactions on Industrial Informatics, vol. 16, no. 4, pp. 2335–2345, 2019

2019
[4]

Complexgen: Cad reconstruction by b-rep chain complex generation,

H. Guo, S. Liu, H. Pan, Y . Liu, X. Tong, and B. Guo, “Complexgen: Cad reconstruction by b-rep chain complex generation,”ACM Transactions on Graphics, vol. 41, no. 4, pp. 1–18, 2022

2022
[5]

Brepgen: A b-rep generative diffusion model with structured latent geometry,

X. Xu, J. Lambourne, P. Jayaraman, Z. Wang, K. Willis, and Y . Fu- rukawa, “Brepgen: A b-rep generative diffusion model with structured latent geometry,”ACM Transactions on Graphics, vol. 43, no. 4, pp. 1–14, 2024

2024
[6]

Cmt: A cascade mar with topology predictor for multimodal conditional cad generation,

J. Wu, Y . Wang, X. Yue, X. Ma, J. Guo, D. Zhou, W. Ouyang, and S. Tang, “Cmt: A cascade mar with topology predictor for multimodal conditional cad generation,” inIEEE International Conference on Com- puter Vision, 2025, pp. 7014–7024

2025
[7]

Deepcad: A deep generative network for computer-aided design models,

R. Wu, C. Xiao, and C. Zheng, “Deepcad: A deep generative network for computer-aided design models,” inIEEE International Conference on Computer Vision, 2021, pp. 6772–6782

2021
[8]

Skexgen: Autoregressive generation of cad construction sequences with disentangled codebooks,

X. Xu, K. D. Willis, J. G. Lambourne, C.-Y . Cheng, P. K. Jayaraman, and Y . Furukawa, “Skexgen: Autoregressive generation of cad construction sequences with disentangled codebooks,” inInternational Conference on Machine Learning, 2022, pp. 24 698–24 724

2022
[9]

Hierarchical neural coding for controllable cad model generation,

X. Xu, P. K. Jayaraman, J. G. Lambourne, K. D. Willis, and Y . Furukawa, “Hierarchical neural coding for controllable cad model generation,” in International Conference on Machine Learning. PMLR, 2023, pp. 38 443–38 461

2023
[10]

Diffusion- cad: Controllable diffusion model for generating computer-aided design models,

A. Zhang, W. Jia, Q. Zou, Y . Feng, X. Wei, and Y . Zhang, “Diffusion- cad: Controllable diffusion model for generating computer-aided design models,”IEEE Transactions on Visualization and Computer Graphics, 2025

2025
[11]

Img2cad: Conditioned 3-d cad model generation from single image with structured visual geometry,

T. Chen, C. Yu, Y . Hu, J. Li, T. Xu, R. Cao, L. Zhu, Y . Zang, Y . Zhang, Z. Liet al., “Img2cad: Conditioned 3-d cad model generation from single image with structured visual geometry,”IEEE Transactions on Industrial Informatics, 2025

2025
[12]

Parametric primitive analysis of cad sketches with vision transformer,

X. Wang, L. Wang, H. Wu, G. Xiao, and K. Xu, “Parametric primitive analysis of cad sketches with vision transformer,”IEEE Transactions on Industrial Informatics, vol. 20, no. 10, pp. 12 041–12 050, 2024

2024
[13]

Text2cad: Generating sequential cad designs from beginner-to- expert level text prompts,

M. S. Khan, S. Sinha, T. U. Sheikh, D. Stricker, S. A. Ali, and M. Z. Afzal, “Text2cad: Generating sequential cad designs from beginner-to- expert level text prompts,”Advances in Neural Information Processing Systems, vol. 37, pp. 7552–7579, 2024

2024
[14]

Cad translator: An effective drive for text to 3d parametric computer-aided design generative modeling,

X. Li, Y . Song, Y . Lou, and X. Zhou, “Cad translator: An effective drive for text to 3d parametric computer-aided design generative modeling,” in Proceedings of the 32nd ACM International Conference on Multimedia, 2024, pp. 8461–8470

2024
[15]

Learning transferable visual models from natural language supervision,

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clarket al., “Learning transferable visual models from natural language supervision,” inInternational Conference on Machine Learning. PMLR, 2021, pp. 8748–8763

2021
[16]

Bert: Pre-training of deep bidirectional transformers for language understanding,

J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” inPro- ceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), 2019, pp. 4171–4186

2019
[17]

Pointnet: Deep learning on point sets for 3d classification and segmentation,

C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “Pointnet: Deep learning on point sets for 3d classification and segmentation,” inIEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 652–660

2017
[18]

Point transformer,

H. Zhao, L. Jiang, J. Jia, P. H. Torr, and V . Koltun, “Point transformer,” inProceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 16 259–16 268

2021
[19]

Point transformer v2: Grouped vector attention and partition-based pooling,

X. Wu, Y . Lao, L. Jiang, X. Liu, and H. Zhao, “Point transformer v2: Grouped vector attention and partition-based pooling,”Advances in Neural Information Processing Systems, vol. 35, pp. 33 330–33 342, 2022

2022
[20]

Evaluating retrieval quality in retrieval- augmented generation,

A. Salemi and H. Zamani, “Evaluating retrieval quality in retrieval- augmented generation,” inProceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Re- trieval, 2024, pp. 2395–2400

2024
[21]

Retrieval- augmented generation for knowledge-intensive nlp tasks,

P. Lewis, E. Perez, A. Piktus, F. Petroni, V . Karpukhin, N. Goyal, H. K ¨uttler, M. Lewis, W.-t. Yih, T. Rockt ¨aschelet al., “Retrieval- augmented generation for knowledge-intensive nlp tasks,”Advances in neural information processing systems, vol. 33, pp. 9459–9474, 2020

2020
[22]

Retrieval-Augmented Generation for Large Language Models: A Survey

Y . Gao, Y . Xiong, X. Gao, K. Jia, J. Pan, Y . Bi, Y . Dai, J. Sun, H. Wang, and H. Wang, “Retrieval-augmented generation for large language models: A survey,”arXiv preprint arXiv:2312.10997, vol. 2, no. 1, 2023

work page internal anchor Pith review arXiv 2023
[23]

Revisiting cad model generation by learning raster sketch,

P. Li, W. Zhang, J. Guo, J. Chen, and D.-M. Yan, “Revisiting cad model generation by learning raster sketch,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 5, 2025, pp. 4869– 4877

2025
[24]

Cad-llama: leveraging large language models for computer-aided design parametric 3d model generation,

J. Li, W. Ma, X. Li, Y . Lou, G. Zhou, and X. Zhou, “Cad-llama: leveraging large language models for computer-aided design parametric 3d model generation,” inIEEE Conference on Computer Vision and Pattern Recognition, 2025, pp. 18 563–18 573

2025
[25]

Flexcad: Unified and versatile controllable cad generation with fine-tuned large language models,

Z. Zhang, S. Sun, W. Wang, D. Cai, and J. Bian, “Flexcad: Unified and versatile controllable cad generation with fine-tuned large language models,” inInternational Conference on Learning Representations, 2025

2025
[26]

The llama 3 herd of models,

A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fanet al., “The llama 3 herd of models,”arXiv e-prints, pp. arXiv–2407, 2024

2024
[27]

GPT-4 Technical Report

J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkatet al., “Gpt-4 technical report,”arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review arXiv 2023
[28]

Neural discrete representa- tion learning,

A. Van Den Oord, O. Vinyalset al., “Neural discrete representa- tion learning,”Conference on Neural Information Processing Systems, vol. 30, 2017

2017
[29]

Coca: Contrastive captioners are image-text foundation models,

J. Yu, Z. Wang, V . Vasudevan, L. Yeung, M. Seyedhosseini, and Y . Wu, “Coca: Contrastive captioners are image-text foundation models,” Transactions on Machine Learning Research
[30]

Llavanext: Improved reasoning, ocr, and world knowledge,

H. Liu, C. Li, Y . Li, B. Li, Y . Zhang, S. Shen, and Y . J. Lee, “Llavanext: Improved reasoning, ocr, and world knowledge,” 2024

2024
[31]

Mixtral of Experts

A. Q. Jiang, A. Sablayrolles, A. Roux, A. Mensch, B. Savary, C. Bam- ford, D. S. Chaplot, D. d. l. Casas, E. B. Hanna, F. Bressandet al., “Mixtral of experts,”arXiv preprint arXiv:2401.04088, 2024

work page internal anchor Pith review arXiv 2024
[32]

Text2shape: Generating shapes from natural language by learning joint embeddings,

K. Chen, C. B. Choy, M. Savva, A. X. Chang, T. Funkhouser, and S. Savarese, “Text2shape: Generating shapes from natural language by learning joint embeddings,” inAsian Conference on Computer Vision. Springer, 2018, pp. 100–116

2018
[33]

Y2seq2seq: Cross-modal representation learning for 3d shape and text by joint recon- struction and prediction of view and word sequences,

Z. Han, M. Shang, X. Wang, Y .-S. Liu, and M. Zwicker, “Y2seq2seq: Cross-modal representation learning for 3d shape and text by joint recon- struction and prediction of view and word sequences,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 33, no. 01, 2019, pp. 126–133

2019
[34]

Parts2words: Learning joint embedding of point clouds and texts by bidirectional matching between parts and words,

C. Tang, X. Yang, B. Wu, Z. Han, and Y . Chang, “Parts2words: Learning joint embedding of point clouds and texts by bidirectional matching between parts and words,” inIEEE Conference on Computer Vision and Pattern Recognition, 2023, pp. 6884–6893

2023
[35]

Tricolo: Trimodal contrastive loss for text to shape retrieval,

Y . Ruan, H.-H. Lee, Y . Zhang, K. Zhang, and A. X. Chang, “Tricolo: Trimodal contrastive loss for text to shape retrieval,” inProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024, pp. 5815–5825

2024
[36]

Sca3d: Enhancing cross-modal 3d retrieval via 3d shape and caption paired data augmentation,

J. Ren, H. Wu, H. Xiong, and H. Wang, “Sca3d: Enhancing cross-modal 3d retrieval via 3d shape and caption paired data augmentation,” in2025 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2025, pp. 9550–9557

2025
[37]

Pointcloud- text matching: Benchmark dataset and baseline,

Y . Feng, Y . Qin, D. Peng, H. Zhu, X. Peng, and P. Hu, “Pointcloud- text matching: Benchmark dataset and baseline,”IEEE Transactions on Multimedia, 2025

2025
[38]

Audio-enhanced text-to-video retrieval using text-conditioned feature alignment,

S. Ibrahimi, X. Sun, P. Wang, A. Garg, A. Sanan, and M. Omar, “Audio-enhanced text-to-video retrieval using text-conditioned feature alignment,” inIEEE International Conference on Computer Vision, 2023, pp. 12 020–12 030

2023
[39]

Eclipse: Efficient long- range video retrieval using sight and sound,

Y .-B. Lin, J. Lei, M. Bansal, and G. Bertasius, “Eclipse: Efficient long- range video retrieval using sight and sound,” inIn Proceedings of the European Conference on Computer Vision, 2022, pp. 413–430

2022
[40]

Tri-modal motion retrieval by learning a joint embedding space,

K. Yin, S. Zou, Y . Ge, and Z. Tian, “Tri-modal motion retrieval by learning a joint embedding space,” inIEEE Conference on Computer Vision and Pattern Recognition, 2024, pp. 1596–1605

2024
[41]

Enhanced cross-modal 3d retrieval via tri-modal reconstruction,

J. Ren and H. Wang, “Enhanced cross-modal 3d retrieval via tri-modal reconstruction,” in2025 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 2025, pp. 1–6

2025
[42]

Attention is all you need,

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,”Advances in neural information processing systems, vol. 30, 2017

2017
[43]

Representation Learning with Contrastive Predictive Coding

A. v. d. Oord, Y . Li, and O. Vinyals, “Representation learning with contrastive predictive coding,”arXiv preprint arXiv:1807.03748, 2018

work page internal anchor Pith review arXiv 2018
[44]

Adam: A Method for Stochastic Optimization

D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014

work page internal anchor Pith review arXiv 2014