pith. sign in

arxiv: 2607.00752 · v1 · pith:DKBDUMUMnew · submitted 2026-07-01 · 💻 cs.CV

GKDT: General Keypoint Detection Transformer

Pith reviewed 2026-07-02 14:18 UTC · model grok-4.3

classification 💻 cs.CV
keywords general keypoint detectiontransformerunified datasetvisual promptstext promptsMegaKPTPCK accuracyDINOv3
0
0 comments X

The pith

A single transformer model trained on a merged keypoint dataset detects keypoints across dozens of categories with over 90 percent accuracy on most test sets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper first assembles MegaKPT by combining more than 1.3 million object instances and their annotations from twenty-nine existing datasets, adding consistent keypoint text descriptions. It then trains GKDT, a DINOv3-based Transformer that accepts visual prompts, text prompts, or both, along with mix-modal prompted training and dynamic importance sampling. The resulting single model is evaluated on twenty-two test sets covering both seen and unseen objects. A sympathetic reader would care because this replaces many narrow, category-specific detectors with one flexible system that could be dropped into varied applications without retraining per task.

Core claim

The paper constructs MegaKPT, a large-scale unified keypoint dataset drawn from twenty-nine sources with high-quality annotations and keypoint text descriptions, and introduces GKDT, a simple flexible Transformer built on DINOv3 that supports visual prompts, text prompts, or both. Using mix-modal prompted training and dynamic importance sampling, the single GKDT model achieves strong performance and generality when tested on twenty-two sets with seen or unseen objects, exceeding 90 percent PCK@0.1 accuracy on most categories.

What carries the argument

GKDT, a DINOv3-based Transformer that accepts visual prompts, text prompts, or both to locate keypoints on objects.

If this is right

  • One model replaces many specialized keypoint detectors across different object types.
  • The same weights handle both objects seen during training and entirely new categories.
  • Prompt inputs let users specify which keypoints to detect without changing the model.
  • High accuracy on broad categories supports direct use in real-world pipelines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Merging strategies like MegaKPT could be repeated for related tasks such as instance segmentation if annotation consistency can be maintained.
  • Text prompt support opens the possibility of describing new keypoint sets in natural language and detecting them without additional labeled examples.
  • A deployed version might reduce the need for separate networks in multi-object environments like robotics or scene understanding.

Load-bearing premise

The annotations from the twenty-nine source datasets can be unified with consistent keypoint text descriptions in a way that supplies enough training signal for the model to generalize to unseen objects and categories.

What would settle it

Running the released GKDT model on a fresh collection of object categories absent from the original twenty-nine datasets and measuring PCK@0.1 well below 90 percent on most of them would show the claimed generality does not hold.

Figures

Figures reproduced from arXiv: 2607.00752 by Anton van den Hengel, Changsheng Lu, Haokun Gui, Harry Yang, Jiaya Jia, Jie Yang, Rong Wang, Yuxin Chen.

Figure 1
Figure 1. Figure 1: A glance of MegaKPT. All images have both keypoint and text annotations. (top) Visualize keypoints only; (bottom) Visualize both keypoints and texts. The dataset covers large diversity of objects from various scenes (best viewed in zoom). a series of successful convolutional neural network (CNN) or Transformer [56] based keypoint detection models, expert for various domain-specific categories, such as huma… view at source ↗
Figure 2
Figure 2. Figure 2: Instance distribution against ten super-categories in our unified dataset. sion [17,28,70,71]. Following this trend, we proactively explore the self-supervised DINOv3 for general keypoint detection. 3 MegaKPT: A Large-Scale High-Quality GKD Dataset Advancing capacity and diversity of the unified keypoint dataset is crucial for general keypoint detection (GKD). Prior major datasets include MP-100 [64] and U… view at source ↗
Figure 3
Figure 3. Figure 3: Illustration of the pipeline of our GKDT model for general keypoint detection. Our model takes as input the text prompt, visual prompt, or both to detect the cor￾responding keypoints on a query image. Our GKDT uses a fully tuned DINOv3 vision transformer as visual backbone to extract semantically rich image features, and a kernel generation (KG) transformer to translate the textual and visual keypoint prot… view at source ↗
Figure 4
Figure 4. Figure 4: Visualization of visual and text prompted keypoint detection using our GKDT model. The white shadow in predictions signifies PCK@0.1, which means a predicted keypoint (i.e., circle) is correct if falling on this area. GT are marked as tilted crosses. Compared Methods: The compared methods include those that can perform general keypoint detection (GKD) such as the training-free vanilla DINOv3 [49], OpenKD [… view at source ↗
Figure 5
Figure 5. Figure 5: Visualization of general keypoint detection in multi-object scenario via GKDT. and GKDT-H with G-DINO obtain scores of 70.0% and 71.3%, while with GT boxes have 78.1% and 79.9%, showing significant room to improve. The G-DINO performs modest in HumanArt. After using the identical boxes from X-Pose, our GKDT improves the results compared to X-Pose (73.4% vs. 70.7%). Results on animals and vehicles: We evalu… view at source ↗
Figure 6
Figure 6. Figure 6: Study on pre-trained VFMs (a), mix-modal prompted training (b), and amounts of training data (c). Animal pose dataset is tested in (a) and (b). PCK@0.1 is used [PITH_FULL_IMAGE:figures/full_fig_p014_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Simulation of three sampling strategies. The three rows show the distributions of uniform sampling, importance sampling, and our dynamic importance sampling at iteration progresses 0%, 50%, 90% and 98%, respectively. Our dynamic importance sampling can adaptively adjust the distribution and mitigate the head-tail data imbal￾ance. data removal from heads and the convergence of data balance. The step-by-step… view at source ↗
Figure 8
Figure 8. Figure 8: Visualization on more diverse categories, such as human and animal poses, faces, hands, insect poses, vehicles, and furniture. Our GKDT model can recognize keypoints on diverse objects, showing the excellent generality and performance. monkey reptile giraffe bear bird sleeve bed [PITH_FULL_IMAGE:figures/full_fig_p025_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Visualization of zero-shot keypoint detection on in-the-wild videos. We supply the keypoint texts for each video clip to detect the corresponding keypoints. All results are obtained via a single GKDT model [PITH_FULL_IMAGE:figures/full_fig_p025_9.png] view at source ↗
read the original abstract

With the emergence of various pre-trained vision and language models, computer vision is shifting from narrow-domain to open-domain recognition. The construction of a more powerful yet general keypoint detection (GKD) model to support diverse tasks has become increasingly important in the field. To this end, we firstly present a large-scale unified keypoint dataset called MegaKPT. The dataset is composed of over 1.3 million diverse object instances from twenty-nine existing datasets, and enjoys high-quality unified annotations with keypoint text descriptions. Based on MegaKPT, we develop GKDT, a simple, flexible and powerful DINOv3 based Transformer model for General Keypoint Detection. Our GKDT supports visual prompts, text prompts, or both. To enhance model training, we also propose a suite of useful strategies such as mix-modal prompted training and dynamic importance sampling. By testing over 22 test sets with seen or unseen objects, our single GKDT model shows strong performance and generality in detecting keypoints on broad categories, with most categories over 90\% PCK@0.1 accuracy, offering high practical applicability to real-world problems. The dataset, models, and codes will be released at https://github.com/AlanLuSun/General-Keypoint-Detection.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces MegaKPT, a unified keypoint dataset aggregating over 1.3 million instances from 29 existing datasets with high-quality annotations and keypoint text descriptions, and proposes GKDT, a DINOv3-based Transformer model for general keypoint detection that accepts visual prompts, text prompts, or both. It further describes training strategies including mix-modal prompted training and dynamic importance sampling. The central claim is that a single GKDT model achieves strong generality, with most categories exceeding 90% PCK@0.1 accuracy across 22 test sets containing both seen and unseen objects.

Significance. If the unification process and empirical results are shown to be robust, the work would advance open-domain keypoint detection by supplying a large-scale training resource and a flexible promptable model with broad category coverage and real-world applicability. The planned public release of the dataset, models, and code would further strengthen reproducibility.

major comments (2)
  1. Abstract: the central claim that a single model generalizes to unseen objects and categories with >90% PCK@0.1 on 22 test sets rests on the assumption that the 'high-quality unified annotations' in MegaKPT form a semantically coherent training distribution, yet the abstract supplies no description of the mapping rules, conflict-resolution procedure, or consistency audit used to reconcile incompatible keypoint taxonomies and coordinate conventions across the 29 source datasets.
  2. Abstract: performance numbers are reported without any baselines, ablation studies, error analysis, or details on dataset construction and split construction, leaving the contribution of the proposed mix-modal training and dynamic importance sampling unsupported and preventing assessment of whether the reported generality exceeds what could be obtained from simpler fine-tuning.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments on the abstract. We agree that the abstract would benefit from additional context on dataset unification and experimental support. We will revise the abstract accordingly while preserving its conciseness.

read point-by-point responses
  1. Referee: Abstract: the central claim that a single model generalizes to unseen objects and categories with >90% PCK@0.1 on 22 test sets rests on the assumption that the 'high-quality unified annotations' in MegaKPT form a semantically coherent training distribution, yet the abstract supplies no description of the mapping rules, conflict-resolution procedure, or consistency audit used to reconcile incompatible keypoint taxonomies and coordinate conventions across the 29 source datasets.

    Authors: We acknowledge that the abstract does not describe the unification process. Section 3 of the manuscript details the aggregation of the 29 source datasets, including taxonomy mapping rules, conflict resolution for incompatible keypoints, coordinate normalization, and consistency audits performed to ensure semantic coherence. In the revision we will add one or two sentences to the abstract summarizing these steps so that the claim of high-quality unified annotations is better supported at first reading. revision: yes

  2. Referee: Abstract: performance numbers are reported without any baselines, ablation studies, error analysis, or details on dataset construction and split construction, leaving the contribution of the proposed mix-modal training and dynamic importance sampling unsupported and preventing assessment of whether the reported generality exceeds what could be obtained from simpler fine-tuning.

    Authors: The abstract is intentionally brief. The full manuscript contains extensive experiments (Section 4) that include baselines against prior keypoint detectors, ablations isolating mix-modal prompted training and dynamic importance sampling, error analysis, and explicit descriptions of the train/test splits across the 22 evaluation sets. We will revise the abstract to include a short clause noting that these controlled experiments demonstrate the added value of the proposed training strategies over simpler fine-tuning baselines. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical dataset unification and model evaluation are self-contained.

full rationale

The paper constructs MegaKPT by aggregating and annotating instances from 29 prior datasets, then trains and evaluates GKDT directly on held-out test sets across 22 benchmarks. No equations, predictions, or central claims reduce by construction to fitted inputs, self-citations, or renamed prior results. Performance metrics (PCK@0.1) are reported from explicit testing rather than derived from the unification step itself. The derivation chain is therefore independent of the patterns that would trigger circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities beyond standard assumptions of transformer pre-training and dataset unification.

pith-pipeline@v0.9.1-grok · 5769 in / 1110 out tokens · 40411 ms · 2026-07-02T14:18:46.237010+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

76 extracted references · 10 canonical work pages · 4 internal anchors

  1. [1]

    GPT-4 Technical Report

    Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023) 5

  2. [2]

    arXiv preprint arXiv:2108.13958 (2021) 4, 9, 22 16 C

    Banik, P., Li, L., Dong, X.: A novel dataset for keypoint detection of quadruped animals from images. arXiv preprint arXiv:2108.13958 (2021) 4, 9, 22 16 C. Lu et al

  3. [3]

    IEEE transactions on pattern analysis and machine intelligence 35(12), 2930–2940 (2013) 4

    Belhumeur, P.N., Jacobs, D.W., Kriegman, D.J., Kumar, N.: Localizing parts of faces using a consensus of exemplars. IEEE transactions on pattern analysis and machine intelligence 35(12), 2930–2940 (2013) 4

  4. [4]

    Advances in neural information processing systems33, 1877–1901 (2020) 13

    Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Nee- lakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems33, 1877–1901 (2020) 13

  5. [5]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Cao, J., Tang, H., Fang, H.S., Shen, X., Lu, C., Tai, Y.W.: Cross-domain adapta- tion for animal pose estimation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 9498–9507 (2019) 4, 10, 22

  6. [6]

    IEEE transactions on pattern analysis and machine intelligence43(1), 172–186 (2019) 1, 2, 3

    Cao, Z., Hidalgo, G., Simon, T., Wei, S.E., Sheikh, Y.: Openpose: Realtime multi- person 2d pose estimation using part affinity fields. IEEE transactions on pattern analysis and machine intelligence43(1), 172–186 (2019) 1, 2, 3

  7. [7]

    In: Proceedings of the IEEE conference on computer vision and pattern recognition

    Carreira, J., Agrawal, P., Fragkiadaki, K., Malik, J.: Human pose estimation with iterative error feedback. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 4733–4742 (2016) 3

  8. [8]

    In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025) 3

    Chen, J., Luo, Z., Liu, Z., Jiang, W., Niu, L., Fang, Y.: Weak-shot keypoint es- timation via keyness and correspondence transfer. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025) 3

  9. [9]

    In: Proceed- ings of the IEEE/CVF conference on computer vision and pattern recognition

    Cheng, B., Xiao, B., Wang, J., Shi, H., Huang, T.S., Zhang, L.: Higherhrnet: Scale- aware representation learning for bottom-up human pose estimation. In: Proceed- ings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 5386–5395 (2020) 1, 3

  10. [10]

    In: Proceedings of the IEEE international conference on computer vision

    Fang, H.S., Xie, S., Tai, Y.W., Lu, C.: Rmpe: Regional multi-person pose estima- tion. In: Proceedings of the IEEE international conference on computer vision. pp. 2334–2343 (2017) 3

  11. [11]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Ge, Y., Zhang, R., Wang, X., Tang, X., Luo, P.: Deepfashion2: A versatile bench- mark for detection, pose estimation, segmentation and re-identification of clothing images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5337–5345 (2019) 4, 24

  12. [12]

    elife8, e47994 (2019) 4, 23

    Graving, J.M., Chae, D., Naik, H., Li, L., Koger, B., Costelloe, B.R., Couzin, I.D.: Deepposekit, a software toolkit for fast and robust animal pose estimation using deep learning. elife8, e47994 (2019) 4, 23

  13. [13]

    In: Alvey vision conference

    Harris, C., Stephens, M., et al.: A combined corner and edge detector. In: Alvey vision conference. vol. 15, pp. 10–5244. Manchester, UK (1988) 1, 3

  14. [14]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 16000–16009 (2022) 7, 13

  15. [15]

    In: European Conference on Computer Vision

    Hirschorn, O., Avidan, S.: A graph-based approach for category-agnostic pose es- timation. In: European Conference on Computer Vision. pp. 469–485. Springer (2024) 2

  16. [16]

    In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    Honari, S., Molchanov, P., Tyree, S., Vincent, P., Pal, C., Kautz, J.: Improving landmark localization with semi-supervised learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1546–1555 (2018) 3

  17. [17]

    Advances in Neural Information Processing Systems36 (2024) 5

    Jiao, B., Liu, L., Gao, L., Wu, R., Lin, G., Wang, P., Zhang, Y.: Toward re- identifying any animal. Advances in Neural Information Processing Systems36 (2024) 5

  18. [18]

    Bioengineering11(9), 932 (2024) 3, 4, 5, 6, 10, 24 GKDT: General Keypoint Detection Transformer 17

    Joham, S.J., Hadzic, A., Urschler, M.: Implicit is not enough: Explicitly enforcing anatomical priors inside landmark localization models. Bioengineering11(9), 932 (2024) 3, 4, 5, 6, 10, 24 GKDT: General Keypoint Detection Transformer 17

  19. [19]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Jose, C., Moutakanni, T., Kang, D., Baldassarre, F., Darcet, T., Xu, H., Li, D., Szafraniec, M., Ramamonjisoa, M., Oquab, M., et al.: Dinov2 meets text: A unified framework for image-and pixel-level vision-language alignment. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 24905–24916 (2025) 7

  20. [20]

    In: 2021 IEEE international conference on robotics and au- tomation (ICRA)

    Joska, D., Clark, L., Muramatsu, N., Jericevich, R., Nicolls, F., Mathis, A., Mathis, M.W., Patel, A.: Acinoset: a 3d pose estimation dataset and baseline models for cheetahs in the wild. In: 2021 IEEE international conference on robotics and au- tomation (ICRA). pp. 13901–13908. IEEE (2021) 4, 24

  21. [21]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Ju, X., Zeng, A., Wang, J., Xu, Q., Zhang, L.: Human-art: A versatile human- centric dataset bridging natural and artificial scenes. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 618–629 (2023) 4, 9, 12, 22

  22. [22]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Khan, M.H., McDonagh, J., Khan, S., Shahabuddin, M., Arora, A., Khan, F.S., Shao, L., Tzimiropoulos, G.: Animalweb: A large-scale hierarchical dataset of an- notated animal faces. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 6939–6948 (2020) 4, 10, 23

  23. [23]

    In: 2011 IEEE international conference on computer vision workshops (ICCV work- shops)

    Koestinger, M., Wohlhart, P., Roth, P.M., Bischof, H.: Annotated facial landmarks in the wild: A large-scale, real-world database for facial landmark localization. In: 2011 IEEE international conference on computer vision workshops (ICCV work- shops). pp. 2144–2151. IEEE (2011) 4, 24

  24. [24]

    in the wild

    Labuguen, R., Matsumoto, J., Negrete, S.B., Nishimaru, H., Nishijo, H., Takada, M., Go, Y., Inoue, K.i., Shibata, T.: Macaquepose: a novel “in the wild” macaque monkey pose dataset for markerless motion capture. Frontiers in behavioral neu- roscience 14, 581154 (2021) 4, 23

  25. [25]

    In: European conference on computer vision

    Le, V., Brandt, J., Lin, Z., Bourdev, L., Huang, T.S.: Interactive facial feature localization. In: European conference on computer vision. pp. 679–692. Springer (2012) 4

  26. [26]

    arXiv preprint arXiv:1906.05586 (2019) 4, 23

    Li, S., Li, J., Tang, H., Qian, R., Lin, W.: Atrw: a benchmark for amur tiger re-identification in the wild. arXiv preprint arXiv:1906.05586 (2019) 4, 23

  27. [27]

    In: European conference on computer vision

    Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: European conference on computer vision. pp. 740–755. Springer (2014) 4, 5, 7, 9, 10, 11, 12, 22

  28. [28]

    Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

    Liu, S., Zeng, Z., Ren, T., Li, F., Zhang, H., Yang, J., Li, C., Yang, J., Su, H., Zhu, J., et al.: Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499 (2023) 5, 12

  29. [29]

    Interna- tional journal of computer vision60(2), 91–110 (2004) 1, 3

    Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Interna- tional journal of computer vision60(2), 91–110 (2004) 1, 3

  30. [30]

    Lu, C.: General Keypoint Detection: Few-Shot and Zero-Shot. Ph.D. thesis, The Australian National University (Australia) (2024) 2

  31. [31]

    Neurocomputing409, 60–73 (2020) 4

    Lu, C., Gu, C., Wu, K., Xia, S., Wang, H., Guan, X.: Deep transfer neural network using hybrid representations of domain discrepancy. Neurocomputing409, 60–73 (2020) 4

  32. [32]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Lu, C., Koniusz, P.: Few-shot keypoint detection with uncertainty learning for unseen species. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 19416–19426 (2022) 2, 3, 10

  33. [33]

    In: Proceedings of the AAAI Conference on Artificial Intelligence

    Lu, C., Koniusz, P.: Detect any keypoints: An efficient light-weight few-shot key- point detector. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 38, pp. 3882–3890 (2024) 2, 3 18 C. Lu et al

  34. [34]

    In: European Conference on Computer Vision

    Lu, C., Liu, Z., Koniusz, P.: Openkd: Opening prompt diversity for zero-and few- shot keypoint detection. In: European Conference on Computer Vision. pp. 148–

  35. [35]

    Springer (2024) 2, 4, 6, 9, 10, 11, 22, 23

  36. [36]

    Lu, C., Wang, H., Gu, C., Wu, K., Guan, X.: Viewpoint estimation for workpieces withdeeptransferlearningfromcoldtohot.In:InternationalConferenceonNeural Information Processing. pp. 21–32. Springer (2018) 4

  37. [37]

    International Journal of Computer Vision134(2), 63 (2026) 2, 3

    Lu, C., Zhu, H., Koniusz, P.: Exploiting class-agnostic visual prior for few-shot keypoint detection. International Journal of Computer Vision134(2), 63 (2026) 2, 3

  38. [38]

    Stanford University (1980) 1

    Moravec, H.P.: Obstacle avoidance and navigation in the real world by a seeing robot rover. Stanford University (1980) 1

  39. [39]

    arXiv preprint arXiv:2101.07988 (2021) 3

    Moskvyak, O., Maire, F., Dayoub, F., Baktashmotlagh, M.: Semi-supervised key- point localization. arXiv preprint arXiv:2101.07988 (2021) 3

  40. [40]

    In: European conference on computer vision

    Newell, A., Yang, K., Deng, J.: Stacked hourglass networks for human pose esti- mation. In: European conference on computer vision. pp. 483–499. Springer (2016) 1, 3

  41. [41]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Ng, X.L., Ong, K.E., Zheng, Q., Ni, Y., Yeo, S.Y., Liu, J.: Animal kingdom: A large and diverse dataset for animal behavior understanding. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 19023– 19034 (2022) 4, 23

  42. [42]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Pavlakos, G., Shan, D., Radosavovic, I., Kanazawa, A., Fouhey, D., Malik, J.: Reconstructing hands in 3d with transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9826–9836 (2024) 4, 22

  43. [43]

    Na- ture methods 16(1), 117–125 (2019) 4, 23

    Pereira, T.D., Aldarondo, D.E., Willmore, L., Kislin, M., Wang, S.S.H., Murthy, M., Shaevitz, J.W.: Fast animal pose estimation using deep neural networks. Na- ture methods 16(1), 117–125 (2019) 4, 23

  44. [44]

    In: International conference on machine learning

    Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PMLR (2021) 4

  45. [45]

    In: Proceedings of the IEEE conference on computer vision and pattern recognition

    Reddy, N.D., Vo, M., Narasimhan, S.G.: Carfusion: Combining point tracking and part detection for dynamic 3d reconstruction of vehicles. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 1906–1915 (2018) 4, 23

  46. [46]

    Image and vision computing47, 3–18 (2016) 4, 22

    Sagonas, C., Antonakos, E., Tzimiropoulos, G., Zafeiriou, S., Pantic, M.: 300 faces in-the-wild challenge: Database and results. Image and vision computing47, 3–18 (2016) 4, 22

  47. [47]

    In: Proceedings of the IEEE international conference on computer vision workshops

    Sagonas, C., Tzimiropoulos, G., Zafeiriou, S., Pantic, M.: 300 faces in-the-wild challenge: The first facial landmark localization challenge. In: Proceedings of the IEEE international conference on computer vision workshops. pp. 397–403 (2013) 4

  48. [48]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Shi, M., Huang, Z., Ma, X., Hu, X., Cao, Z.: Matching is not enough: A two- stage framework for category-agnostic pose estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7308– 7317 (2023) 10, 11

  49. [49]

    In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision

    Shi, W., Lu, C., Shao, M., Zhang, Y., Xia, S., Koniusz, P.: Few-shot shape recog- nition by learning deep shape-aware features. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 1848–1859 (2024) 3 GKDT: General Keypoint Detection Transformer 19

  50. [50]

    DINOv3

    Siméoni, O., Vo, H.V., Seitzer, M., Baldassarre, F., Oquab, M., Jose, C., Khali- dov, V., Szafraniec, M., Yi, S., Ramamonjisoa, M., et al.: Dinov3. arXiv preprint arXiv:2508.10104 (2025) 3, 4, 7, 10, 11

  51. [51]

    Prototypical Networks for Few-shot Learning

    Snell, J., Swersky, K., Zemel, R.S.: Prototypical networks for few-shot learning. arXiv preprint arXiv:1703.05175 (2017) 3

  52. [52]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Sun, K., Xiao, B., Liu, D., Wang, J.: Deep high-resolution representation learn- ing for human pose estimation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 5693–5703 (2019) 1, 2, 3

  53. [53]

    Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare:Relationnetworkforfew-shotlearning.In:ProceedingsoftheIEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1199–1208 (2018) 3

  54. [54]

    Advances in neural information processing systems27 (2014) 3

    Tompson, J.J., Jain, A., LeCun, Y., Bregler, C.: Joint training of a convolutional network and a graphical model for human pose estimation. Advances in neural information processing systems27 (2014) 3

  55. [55]

    In: Proceedings of the IEEE conference on computer vision and pattern recognition

    Toshev, A., Szegedy, C.: Deeppose: Human pose estimation via deep neural net- works. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 1653–1660 (2014) 3

  56. [56]

    In: Proceedings of the IEEEConferenceonComputerVisionandPatternRecognition.pp.595–604(2015) 4, 10, 23

    Van Horn, G., Branson, S., Farrell, R., Haber, S., Barry, J., Ipeirotis, P., Perona, P., Belongie, S.: Building a bird recognition app and large scale dataset with citizen scientists: The fine print in fine-grained dataset collection. In: Proceedings of the IEEEConferenceonComputerVisionandPatternRecognition.pp.595–604(2015) 4, 10, 23

  57. [57]

    Advances in neural information pro- cessing systems 30 (2017) 2

    Vaswani,A.,Shazeer,N.,Parmar,N.,Uszkoreit,J.,Jones,L.,Gomez,A.N.,Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information pro- cessing systems 30 (2017) 2

  58. [58]

    Wah, C., Branson, S., Welinder, P., Perona, P., Belongie, S.: The Caltech-UCSD Birds-200-2011 Dataset. Tech. Rep. CNS-TR-2011-001, California Institute of Technology (2011) 4, 10, 22

  59. [59]

    arXiv preprint arXiv:2201.08613 (2022) 3

    Wang, C., Jin, S., Guan, Y., Liu, W., Qian, C., Luo, P., Ouyang, W.: Pseudo- labeled auto-curriculum learning for semi-supervised keypoint localization. arXiv preprint arXiv:2201.08613 (2022) 3

  60. [60]

    Medical image analysis31, 63–76 (2016) 3, 4, 6, 24

    Wang, C.W., Huang, C.T., Lee, J.H., Li, C.H., Chang, S.W., Siao, M.J., Lai, T.M., Ibragimov, B., Vrtovec, T., Ronneberger, O., et al.: A benchmark for comparison of dental radiography analysis algorithms. Medical image analysis31, 63–76 (2016) 3, 4, 6, 24

  61. [61]

    IEEE Transactions on Circuits and Systems for Video Technology 29(11), 3258–3268 (2018) 4, 22

    Wang, Y., Peng, C., Liu, Y.: Mask-pose cascaded cnn for 2d hand pose estimation from single color image. IEEE Transactions on Circuits and Systems for Video Technology 29(11), 3258–3268 (2018) 4, 22

  62. [62]

    In: European Conference on Computer Vision

    Wu, J., Xue, T., Lim, J.J., Tian, Y., Tenenbaum, J.B., Torralba, A., Freeman, W.T.: Single image 3d interpreter network. In: European Conference on Computer Vision. pp. 365–382. Springer (2016) 4, 23

  63. [63]

    In: 2021 International Conference on control, au- tomation and information sciences (ICCAIS)

    Wu, X., Lu, C., Gu, C., Wu, K., Zhu, S.: Domain adaptation for viewpoint esti- mation with image generation. In: 2021 International Conference on control, au- tomation and information sciences (ICCAIS). pp. 341–346. IEEE (2021) 4

  64. [64]

    In:ProceedingsoftheIEEEconferenceoncomputervisionandpatternrecognition

    Xian, Y., Schiele, B., Akata, Z.: Zero-shot learning-the good, the bad and the ugly. In:ProceedingsoftheIEEEconferenceoncomputervisionandpatternrecognition. pp. 4582–4591 (2017) 10

  65. [65]

    In: European confer- ence on computer vision

    Xu, L., Jin, S., Zeng, W., Liu, W., Qian, C., Ouyang, W., Luo, P., Wang, X.: Pose for everything: Towards category-agnostic pose estimation. In: European confer- ence on computer vision. pp. 398–416. Springer (2022) 2, 5 20 C. Lu et al

  66. [66]

    IEEE Transactions on Pattern Analysis and Machine Intel- ligence 46(2), 1212–1230 (2023) 1, 2, 3, 7, 11, 12

    Xu, Y., Zhang, J., Zhang, Q., Tao, D.: Vitpose++: Vision transformer for generic body pose estimation. IEEE Transactions on Pattern Analysis and Machine Intel- ligence 46(2), 1212–1230 (2023) 1, 2, 3, 7, 11, 12

  67. [67]

    In: European Conference on Computer Vision

    Yang, J., Zeng, A., Zhang, R., Zhang, L.: X-pose: Detecting any keypoints. In: European Conference on Computer Vision. pp. 249–268. Springer (2024) 2, 4, 5, 6, 10, 11, 12, 13, 24

  68. [68]

    Advances in Neural Informa- tion Processing Systems35, 17301–17313 (2022) 4, 24

    Yang, Y., Yang, J., Xu, Y., Zhang, J., Lan, L., Tao, D.: Apt-36k: A large-scale benchmark for animal pose estimation and tracking. Advances in Neural Informa- tion Processing Systems35, 17301–17313 (2022) 4, 24

  69. [69]

    arXiv preprint arXiv:2203.074364(5) (2022) 4, 23

    Ye, S., Filippova, A., Lauer, J., Vidal, M., Schneider, S., Qiu, T., Mathis, A., Mathis, M.W.: Superanimal models pretrained for plug-and-play analysis of animal behavior. arXiv preprint arXiv:2203.074364(5) (2022) 4, 23

  70. [70]

    arXiv preprint arXiv:2108.12617 (2021)

    Yu, H., Xu, Y., Zhang, J., Zhao, W., Guan, Z., Tao, D.: Ap-10k: A benchmark for animal pose estimation in the wild. arXiv preprint arXiv:2108.12617 (2021) 2, 4, 10, 23

  71. [71]

    In- ternational Journal of Computer Vision132(12), 5741–5758 (2024) 2, 4, 5

    Zhang, H., Xu, L., Lai, S., Shao, W., Zheng, N., Luo, P., Qiao, Y., Zhang, K.: Open-vocabulary animal keypoint detection with semantic-feature matching. In- ternational Journal of Computer Vision132(12), 5741–5758 (2024) 2, 4, 5

  72. [72]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Zhang, X., Wang, W., Chen, Z., Xu, Y., Zhang, J., Tao, D.: Clamp: Prompt-based contrastive learning for connecting language and animal pose. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 23272– 23281 (2023) 2, 4, 5

  73. [73]

    International Jour- nal of Computer Vision pp

    Zhao, S., Gong, M., Zhao, H., Zhang, J., Tao, D.: Deep corner. International Jour- nal of Computer Vision pp. 1–25 (2023) 3

  74. [74]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Zhou, B., Zhou, H., Liang, T., Yu, Q., Zhao, S., Zeng, Y., Lv, J., Luo, S., Wang, Q., Yu, X., et al.: Clothesnet: An information-rich 3d garment model repository with simulated clothes environment. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 20428–20438 (2023) 1

  75. [75]

    In: 2012 IEEE conference on computer vision and pattern recognition

    Zhu, X., Ramanan, D.: Face detection, pose estimation, and landmark localization in the wild. In: 2012 IEEE conference on computer vision and pattern recognition. pp. 2879–2886. IEEE (2012) 4 GKDT: General Keypoint Detection Transformer (Supplementary Material) Changsheng Lu1, Yuxin Chen1, Haokun Gui1, Rong Wang2, Jie Yang3, Harry Yang1, Anton van den Hen...

  76. [76]

    To ease the usage of this dataset, we release above splits at https://github.com/AlanLuSun/General-Keypoint-Detection

    are only used for training. To ease the usage of this dataset, we release above splits at https://github.com/AlanLuSun/General-Keypoint-Detection. C Additional Implementation Details In our detection head, the upsamplerU can be either bilinear upsampler or two deconv blocks, which are for upscaling the resolution of query image feature maps. We find that ...