pith. machine review for the scientific record.

arxiv: 2604.03322 · v1 · submitted 2026-04-02 · 💻 cs.CV · cs.AI · cs.RO

Recognition: no theorem link

VitaTouch: Property-Aware Vision-Tactile-Language Model for Robotic Quality Inspection in Manufacturing

Authors on Pith no claims yet

Pith reviewed 2026-05-13 21:13 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.RO
keywords vision-tactile fusion · material property inference · robotic quality inspection · multimodal language model · contrastive alignment · defect recognition · VitaSet dataset

The pith

A vision-tactile-language model infers material hardness and roughness for robotic inspection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces VitaTouch to overcome vision-only limits in manufacturing quality checks, where occlusion and reflections hide intrinsic properties like hardness or roughness. It processes images and tactile readings through separate encoders and a dual Q-Former, compresses the results into prefix tokens for a language model, and aligns the two senses with text via contrastive learning. A new dataset, VitaSet, supplies 186 objects, 52k images, and 5.1k human-verified instruction-answer pairs for training and testing. The model leads on the HCT and overall TVL benchmarks, stays competitive on SSVTP, posts strong property-inference accuracy on VitaSet, and after LoRA fine-tuning reaches 94 percent success in closed-loop robotic recognition and sorting trials.
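
As a concrete reading of that pipeline, the sketch below is a minimal reconstruction under stated assumptions: toy dimensions, random tensors standing in for the vision and tactile encoders' outputs, a simplified single-layer Q-Former, and a generic projection into the language model's embedding space. It illustrates the prefix-token mechanism only; it is not the authors' implementation.

    # Minimal sketch of a vision-tactile prefix-token pipeline (PyTorch).
    # Toy dimensions and placeholder modules; not the authors' code.
    import torch
    import torch.nn as nn

    class QFormer(nn.Module):
        """Compress encoder features into a fixed number of query tokens via
        cross-attention, a simplified stand-in for a BLIP-2-style Q-Former."""
        def __init__(self, dim: int, num_queries: int = 32, num_heads: int = 8):
            super().__init__()
            self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
            self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

        def forward(self, feats: torch.Tensor) -> torch.Tensor:  # (B, N, D) -> (B, Q, D)
            q = self.queries.unsqueeze(0).expand(feats.size(0), -1, -1)
            out, _ = self.cross_attn(q, feats, feats)
            return out

    class PrefixTokenModel(nn.Module):
        def __init__(self, enc_dim: int = 256, llm_dim: int = 512):
            super().__init__()
            self.vision_qformer = QFormer(enc_dim)
            self.tactile_qformer = QFormer(enc_dim)
            self.to_llm = nn.Linear(enc_dim, llm_dim)  # project queries to LLM width

        def forward(self, vision_feats, tactile_feats, text_embeds):
            v = self.to_llm(self.vision_qformer(vision_feats))    # (B, 32, D_llm)
            t = self.to_llm(self.tactile_qformer(tactile_feats))  # (B, 32, D_llm)
            # Prefix tokens are prepended to the text embeddings before the
            # (frozen) language-model decoder consumes the sequence.
            return torch.cat([v, t, text_embeds], dim=1)

    # Toy usage with random tensors standing in for real encoder outputs.
    model = PrefixTokenModel()
    vision_feats = torch.randn(2, 196, 256)   # e.g. ViT patch features
    tactile_feats = torch.randn(2, 196, 256)  # e.g. GelSight image features
    text_embeds = torch.randn(2, 16, 512)     # embedded instruction tokens
    print(model(vision_feats, tactile_feats, text_embeds).shape)  # [2, 80, 512]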

Core claim

VitaTouch couples vision and tactile inputs through contrastive learning, feeds the aligned features as prefix tokens to a large language model, and thereby infers material properties and generates natural-language descriptions. On VitaSet it records 88.89 percent hardness accuracy, 75.13 percent roughness accuracy, and 54.81 percent descriptor recall, with material-description semantic similarity reaching 0.9009. LoRA fine-tuning lifts 2-, 3-, and 5-category defect recognition to 100, 96, and 92 percent, while 100 laboratory trials show 94 percent closed-loop accuracy and 94 percent end-to-end sorting success.
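
Because the defect-recognition figures come only after LoRA fine-tuning, it is worth fixing what that adaptation actually trains. The sketch below is the standard low-rank formulation of Hu et al. (2022), wrapped around a single frozen linear layer; the rank and scaling values are illustrative, not the paper's configuration.

    # Standard LoRA-style adapter around a frozen linear layer. Rank and
    # scaling are illustrative; this is not the paper's configuration.
    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
            super().__init__()
            self.base = base
            for p in self.base.parameters():  # the pretrained weight stays frozen
                p.requires_grad = False
            self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
            self.B = nn.Parameter(torch.zeros(base.out_features, rank))
            self.scale = alpha / rank

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # y = W x + (alpha / r) * B A x; only A and B receive gradients.
            return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

    layer = LoRALinear(nn.Linear(512, 512))
    trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
    print(trainable)  # 8*512 + 512*8 = 8192 trainable parameters vs. ~262k frozen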

What carries the argument

Dual Q-Former that extracts language-relevant features from vision and tactile encoders and couples the modalities through contrastive alignment before prefix-token compression for the language model.
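
Figure 1 indicates that Stage 1 uses an InfoNCE loss for this alignment. In its standard symmetric form (the abstract does not spell out the exact variant), coupling vision embeddings v_i with tactile embeddings t_i of the same sample would read:

    % Symmetric InfoNCE over a batch of N paired vision/tactile embeddings;
    % tau is a temperature and sim(.,.) is cosine similarity.
    \mathcal{L}_{V\text{-}T} = -\frac{1}{2N}\sum_{i=1}^{N}\left[
        \log\frac{\exp\big(\mathrm{sim}(v_i,t_i)/\tau\big)}
                 {\sum_{j=1}^{N}\exp\big(\mathrm{sim}(v_i,t_j)/\tau\big)}
      + \log\frac{\exp\big(\mathrm{sim}(t_i,v_i)/\tau\big)}
                 {\sum_{j=1}^{N}\exp\big(\mathrm{sim}(t_i,v_j)/\tau\big)}
    \right],
    \qquad \mathrm{sim}(a,b) = \frac{a^{\top} b}{\lVert a\rVert\,\lVert b\rVert}.

Analogous terms aligning each modality with text would complete the Stage 1 objective, alongside the PTM loss named in Figure 1.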

If this is right

  • Vision-tactile contrastive coupling produces reliable natural-language material descriptions that vision alone cannot supply.
  • LoRA fine-tuning on the same architecture raises multi-category defect recognition above 90 percent.
  • Closed-loop robotic trials reach 94 percent recognition accuracy and 94 percent end-to-end sorting success.
  • The same prefix-token mechanism works across HCT, TVL, and SSVTP benchmarks without task-specific redesign.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The contrastive coupling may transfer to other paired sensor types such as force-torque and audio for broader robotic perception.
  • Success in laboratory sorting suggests the model could support real-time inspection lines if latency and sensor calibration remain stable.
  • Descriptor recall at 54.81 percent indicates that language generation still lags behind numerical property prediction and may need richer text supervision.
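
The 54.81 percent figure depends on how descriptor recall is scored, which the abstract does not define. One plausible reading, with exact substring matching as an explicit assumption, is the fraction of human-verified reference descriptors that reappear in the generated description:

    # One plausible reading of "descriptor recall": the fraction of reference
    # descriptors recovered in the generated description. The exact-match rule
    # is an assumption; the paper may use a softer matching criterion.
    def descriptor_recall(generated: str, reference_descriptors: list[str]) -> float:
        text = generated.lower()
        hits = sum(1 for d in reference_descriptors if d.lower() in text)
        return hits / len(reference_descriptors) if reference_descriptors else 0.0

    print(descriptor_recall(
        "a hard, smooth metallic surface with slight glossiness",
        ["hard", "smooth", "metallic", "cold"],
    ))  # 0.75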

Load-bearing premise

The VitaSet collection of 186 objects and its human-verified pairs are assumed to represent the range of real manufacturing surfaces without the model overfitting to the specific sensors or object set used.

What would settle it

Measure the same hardness, roughness, and defect accuracies on a fresh set of objects and surfaces collected outside VitaSet, under factory lighting and handling conditions rather than laboratory ones.

Figures

Figures reproduced from arXiv: 2604.03322 by Fang Deng, Gang Chen, Jiayuan Li, Junyi Zong, Meixian Shi, Qingxuan Jia, Tong Li, Zihang Lv.

Figure 1. Overview of VitaTouch. Left: three-stage training pipeline. Stage 1 performs cross-modal alignment via dual Q-Formers with InfoNCE and PTM losses. Stage 2 builds a property-reasoning multimodal model with fused V–T tokens in frozen Vicuna-7B. LoRA-based defect adaptation is then conducted over progressively finer-grained defect label spaces using few-shot labeled samples per category. Right: tactile sensin… view at source ↗
Figure 2. VitaSet overview (Ours + AnyTouch GelSight-only). Aligned RGB observations and paired GelSight tactile readings across objects, with controlled-vocabulary annotations and dataset statistics under a unified schema. … view at source ↗
Figure 3. VitaTouch model architecture. VitaTouch employs a dual-branch vision-tactile design with modality-specific encoders and Q-Formers. The Q-Formers distill learnable queries into vision-tactile prefix tokens, which are prepended to text embeddings and fed into a frozen Vicuna-7B decoder. Training proceeds in three stages: Stage 1 aligns cross-modal embeddings via frozen encoders; Stage 2 establishes the perce… view at source ↗
Figure 4. VitaSet validation performance of VitaTouch across training epochs. (a) Multi-task validation trends for hardness accuracy, roughness accuracy, … view at source ↗
Figure 5. Ablation results on the VitaSet dataset across tasks. Each variant removes one key stage from the full model, demonstrating the necessity of explicit … view at source ↗
Figure 6. Closed-loop robotic inspection and sorting demonstration. A Franka … view at source ↗
read the original abstract

Quality inspection in smart manufacturing requires identifying intrinsic material and surface properties beyond visible geometry, yet vision-only methods remain vulnerable to occlusion and reflection. We propose VitaTouch, a property-aware vision-tactile-language model for material-property inference and natural-language attribute description. VitaTouch uses modality-specific encoders and a dual Q-Former to extract language-relevant visual and tactile features, which are compressed into prefix tokens for a large language model. We align each modality with text and explicitly couple vision and touch through contrastive learning. We also construct VitaSet, a multimodal dataset with 186 objects, 52k images, and 5.1k human-verified instruction-answer pairs. VitaTouch achieves the best performance on HCT and the overall TVL benchmark, while remaining competitive on SSVTP. On VitaSet, it reaches 88.89% hardness accuracy, 75.13% roughness accuracy, and 54.81% descriptor recall; the material-description task further achieves a peak semantic similarity of 0.9009. With LoRA-based fine-tuning, VitaTouch attains 100.0%, 96.0%, and 92.0% accuracy for 2-, 3-, and 5-category defect recognition, respectively, and delivers 94.0% closed-loop recognition accuracy and 94.0% end-to-end sorting success in 100 laboratory robotic trials. More details are available at the project page: https://vitatouch.github.io/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces VitaTouch, a property-aware vision-tactile-language model for robotic quality inspection. It uses modality-specific encoders and a dual Q-Former to extract and compress visual and tactile features into prefix tokens for an LLM, aligns each modality with text, and couples vision and touch via contrastive learning. The authors construct VitaSet (186 objects, 52k images, 5.1k human-verified pairs) and report state-of-the-art results on HCT and the TVL benchmark, competitive performance on SSVTP, plus specific accuracies on VitaSet (88.89% hardness, 75.13% roughness, 54.81% descriptor recall, peak semantic similarity 0.9009) and, after LoRA fine-tuning, 100%/96%/92% defect recognition for 2/3/5 categories plus 94% closed-loop accuracy and 94% end-to-end sorting success in 100 robotic trials.

Significance. If the generalization claims hold, the work could meaningfully advance multimodal sensing for manufacturing inspection by moving beyond vision-only limitations such as occlusion and reflection. The release of VitaSet as a multimodal dataset with human-verified instruction-answer pairs would be a concrete community resource for property-aware robotics research.

major comments (2)
  1. [Dataset and Experiments] VitaSet construction and evaluation: with training and testing performed across only 186 objects, the contrastive vision-tactile alignment and subsequent LoRA fine-tuning can exploit object-specific geometry and sensor signatures rather than intrinsic material properties; the 88.89% hardness, 75.13% roughness, and 100%/96%/92% defect accuracies therefore require explicit cross-object or leave-one-object-out validation to support the generalization claim (a sketch of such a split follows these comments).
  2. [Robotic Experiments] Robotic trials: the 94% closed-loop recognition accuracy and 94% end-to-end sorting success are obtained in 100 laboratory trials that reuse the same 186 objects and hardware; this provides no evidence that performance survives new surfaces or different tactile sensors, which is load-bearing for the central claim of applicability to real manufacturing quality inspection.
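
The cross-object protocol asked for in major comment 1 is simple to state. The sketch below groups samples by object identity so that no object contributes to both train and test; the field name object_id is hypothetical, not the VitaSet schema.

    # Object-level (leave-one-object-out) split: every sample of a given object
    # lands entirely in train or entirely in test, so the model cannot lean on
    # memorised object-specific appearance or sensor signatures.
    # The field name "object_id" is hypothetical, not the VitaSet schema.
    from collections import defaultdict

    def leave_one_object_out(samples):
        """samples: list of dicts with an 'object_id' key.
        Yields (held_out_object, train, test) triples, one fold per object."""
        by_object = defaultdict(list)
        for s in samples:
            by_object[s["object_id"]].append(s)
        for held_out in sorted(by_object):
            test = by_object[held_out]
            train = [s for oid, group in by_object.items() if oid != held_out
                     for s in group]
            yield held_out, train, test

    samples = [{"object_id": i % 3, "x": i} for i in range(9)]
    for oid, train, test in leave_one_object_out(samples):
        print(oid, len(train), len(test))  # each object held out in turn: 6 / 3
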
minor comments (2)
  1. [Abstract and Results] The abstract and results tables report point accuracies without error bars, standard deviations, or details on train/test splits and ablation studies; adding these would allow readers to assess the stability of the reported benchmark wins (a bootstrap sketch follows these comments).
  2. [Model Architecture] The description of how the dual Q-Former compresses features into LLM prefix tokens would benefit from a diagram or explicit token-dimension equations for reproducibility.
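
A low-cost way to meet minor comment 1 would be per-metric bootstrap confidence intervals over the test set; a generic sketch, not tied to the paper's evaluation code:

    # Generic bootstrap confidence interval for an accuracy metric: a cheap
    # way to attach error bars to the point accuracies flagged above.
    import random

    def bootstrap_accuracy_ci(correct, n_boot=2000, alpha=0.05, seed=0):
        """correct: list of 0/1 outcomes per test sample.
        Returns (mean accuracy, lower bound, upper bound)."""
        rng = random.Random(seed)
        n = len(correct)
        means = sorted(
            sum(rng.choice(correct) for _ in range(n)) / n for _ in range(n_boot)
        )
        return (sum(correct) / n,
                means[int((alpha / 2) * n_boot)],
                means[int((1 - alpha / 2) * n_boot) - 1])

    outcomes = [1] * 89 + [0] * 11          # e.g. 89% accuracy on 100 samples
    print(bootstrap_accuracy_ci(outcomes))  # roughly (0.89, 0.83, 0.95)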

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the importance of generalization in our evaluations. We address each major comment below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Dataset and Experiments] VitaSet construction and evaluation: with training and testing performed across only 186 objects, the contrastive vision-tactile alignment and subsequent LoRA fine-tuning can exploit object-specific geometry and sensor signatures rather than intrinsic material properties; the 88.89% hardness, 75.13% roughness, and 100%/96%/92% defect accuracies therefore require explicit cross-object or leave-one-object-out validation to support the generalization claim.

    Authors: We agree that leave-one-object-out (LOO) validation would provide stronger evidence against object-specific overfitting. Our current protocol uses a random 80/20 split across the 186 objects, but in the revised manuscript we will add full LOO results (training on 185 objects, testing on the held-out object, averaged across folds). We will report updated accuracies for hardness, roughness, descriptor recall, and defect recognition under this protocol, along with comparisons to baselines. This revision will directly address the concern and support the property-aware claims. revision: yes

  2. Referee: [Robotic Experiments] Robotic trials: the 94% closed-loop recognition accuracy and 94% end-to-end sorting success are obtained in 100 laboratory trials that reuse the same 186 objects and hardware; this provides no evidence that performance survives new surfaces or different tactile sensors, which is load-bearing for the central claim of applicability to real manufacturing quality inspection.

    Authors: We acknowledge that the robotic trials reuse the VitaSet objects and hardware, limiting direct evidence for new surfaces or sensors. In the revision we will add an explicit limitations subsection discussing this scope and outlining future work on cross-sensor and in-factory validation. We will also highlight the material diversity within the 186 objects (covering metals, plastics, woods, etc.) as partial mitigation, while clarifying that the 94% figures demonstrate system integration rather than broad generalization. revision: partial

Circularity Check

0 steps flagged

No significant circularity; results are measured empirical outcomes on external benchmarks and new data

full rationale

The paper presents a multimodal model architecture (modality encoders, dual Q-Former, contrastive alignment, LoRA fine-tuning) and reports measured accuracies on HCT, TVL, SSVTP benchmarks plus the newly collected VitaSet (186 objects, human-verified pairs). No equations, predictions, or first-principles derivations are claimed that reduce by construction to fitted parameters, self-definitions, or self-citation chains. The reported numbers (e.g., 88.89% hardness accuracy) are evaluation metrics on held-out or external data, not tautological renamings or forced outputs. Minor self-citations, if present, are not load-bearing for the central claims.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on standard multimodal learning assumptions and a newly collected dataset whose representativeness is not independently verified; no new physical entities or ad-hoc constants are introduced.

axioms (2)
  • domain assumption Contrastive learning between vision, tactile, and text modalities produces aligned representations that support downstream property inference.
    Invoked in the alignment and coupling steps described in the abstract.
  • domain assumption Human-verified instruction-answer pairs in VitaSet accurately reflect intrinsic material properties.
    Required for the reported accuracy and recall metrics to be meaningful.

pith-pipeline@v0.9.0 · 5595 in / 1548 out tokens · 60297 ms · 2026-05-13T21:13:38.999702+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages · 1 internal anchor

  1. [1]

    Application of automation for in-line quality inspection, a zero-defect manufacturing approach,

    V. Azamfirei, F. Psarommatis, and Y. Lagrosen, “Application of automation for in-line quality inspection, a zero-defect manufacturing approach,” J. Manuf. Syst., vol. 67, pp. 1–22, 2023

  2. [2]

    Deep learning-assisted real-time defect detection and closed-loop adjustment for additive manufacturing of continuous fiber-reinforced polymer composites,

    L. Lu, J. Hou, S. Yuan, X. Yao, Y. Li, and J. Zhu, “Deep learning-assisted real-time defect detection and closed-loop adjustment for additive manufacturing of continuous fiber-reinforced polymer composites,” Robot. Comput.-Integr. Manuf., vol. 79, Art. no. 102431, 2023

  3. [3]

    Quality costs and Industry 4.0: inspection strategy modelling and reviewing,

    A. M. Reis, A. Dall-Orsoletta, E. Nunes, L. Costa, and S. Sousa, “Quality costs and Industry 4.0: inspection strategy modelling and reviewing,” Int. J. Adv. Manuf. Technol., vol. 136, no. 9, pp. 3883–3897, 2025

  4. [4]

    Deep industrial image anomaly detection: A survey,

    J. Liu, G. Xie, J. Wang, S. Li, C. Wang, F. Zheng, and Y. Jin, “Deep industrial image anomaly detection: A survey,” Mach. Intell. Res., vol. 21, no. 1, pp. 104–135, 2024

  5. [5]

    MVTec AD—A comprehensive real-world dataset for unsupervised anomaly detection,

    P. Bergmann, M. Fauser, D. Sattlegger, and C. Steger, “MVTec AD—A comprehensive real-world dataset for unsupervised anomaly detection,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2019, pp. 9592–9600

  6. [6]

    Surface defect inspection of industrial products with object detection deep networks: A systematic review,

    Y. Ma, J. Yin, F. Huang, and Q. Li, “Surface defect inspection of industrial products with object detection deep networks: A systematic review,” Artif. Intell. Rev., vol. 57, no. 12, Art. no. 333, 2024

  7. [7]

    A comprehensive review of robot intelligent grasping based on tactile perception,

    T. Li, Y. Yan, C. Yu, J. An, Y. Wang, and G. Chen, “A comprehensive review of robot intelligent grasping based on tactile perception,” Robot. Comput.-Integr. Manuf., vol. 90, Art. no. 102792, 2024

  8. [8]

    Vision-based tactile sensing: From performance parameters to device design,

    Y.-H. Xin, K.-M. Hu, R.-J. Xiang, Y.-L. Gao, J.-F. Zhou, G. Meng, and W.-M. Zhang, “Vision-based tactile sensing: From performance parameters to device design,” Appl. Phys. Rev., vol. 12, no. 2, Art. no. 021312, 2025

  9. [9]

    AnyTouch: Learning unified static-dynamic representation across multiple visuo-tactile sensors,

    R. Feng, J. Hu, W. Xia, T. Gao, A. Shen, Y. Sun, B. Fang, and D. Hu, “AnyTouch: Learning unified static-dynamic representation across multiple visuo-tactile sensors,” in Proc. Int. Conf. Learn. Represent. (ICLR), 2025

  10. [10]

    Learning transferable visual models from natural language supervision,

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervision,” in Proc. 38th Int. Conf. Mach. Learn. (ICML), 2021, pp. 8748–8763

  11. [11]

    BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models,

    J. Li, D. Li, S. Savarese, and S. Hoi, “BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models,” in Proc. 40th Int. Conf. Mach. Learn. (ICML), 2023, pp. 19730–19742

  12. [12]

    InstructBLIP: Towards general-purpose vision-language models with instruction tuning,

    W. Dai, J. Li, D. Li, A. M. H. Tiong, J. Zhao, W. Wang, B. Li, P. Fung, and S. Hoi, “InstructBLIP: Towards general-purpose vision-language models with instruction tuning,” in Adv. Neural Inf. Process. Syst., vol. 36, 2023, pp. 49250–49267

  13. [13]

    Visual instruction tuning,

    H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual instruction tuning,” in Adv. Neural Inf. Process. Syst., vol. 36, 2023, pp. 34892–34916

  14. [14]

    A touch, vision, and language dataset for multimodal alignment,

    L. Fu, G. Datta, H. Huang, W. C.-H. Panitch, J. Drake, J. Ortiz, M. Mukadam, M. Lambeta, R. Calandra, and K. Goldberg, “A touch, vision, and language dataset for multimodal alignment,” in Proc. 41st Int. Conf. Mach. Learn. (ICML), 2024, pp. 14080–14101

  15. [15]

    Octopi: Object property reasoning with large tactile-language models,

    S. Yu, K. Lin, A. Xiao, J. Duan, and H. Soh, “Octopi: Object property reasoning with large tactile-language models,” in Robotics: Science and Systems (RSS), 2024

  16. [16]

    MMAD: A comprehensive benchmark for multimodal large language models in industrial anomaly detection,

    X. Jiang, J. Li, H. Deng, Y. Liu, B.-B. Gao, Y. Zhou, J. Li, C. Wang, and F. Zheng, “MMAD: A comprehensive benchmark for multimodal large language models in industrial anomaly detection,” in Proc. Int. Conf. Learn. Represent. (ICLR), 2025

  17. [17]

    PaDiM: A patch distribution modeling framework for anomaly detection and localization,

    T. Defard, A. Setkov, A. Loesch, and R. Audigier, “PaDiM: A patch distribution modeling framework for anomaly detection and localization,” in Comput. Anal. Images Patterns, 2021, pp. 475–489

  18. [18]

    DRAEM—A discriminatively trained reconstruction embedding for surface anomaly detection,

    V. Zavrtanik, M. Kristan, and D. Skočaj, “DRAEM—A discriminatively trained reconstruction embedding for surface anomaly detection,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), 2021, pp. 8330–8339

  19. [19]

    Towards total recall in industrial anomaly detection,

    K. Roth, L. Pemula, J. Zepeda, B. Schölkopf, T. Brox, and P. Gehler, “Towards total recall in industrial anomaly detection,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2022, pp. 14318–14328

  20. [20]

    SPot-the-Difference self-supervised pre-training for anomaly detection and segmentation,

    Y. Zou, J. Jeong, L. Pemula, D. Zhang, and O. Dabeer, “SPot-the-Difference self-supervised pre-training for anomaly detection and segmentation,” in Comput. Vis.–ECCV 2022, 2022, pp. 392–408

  21. [21]

    Multi-view attention guided feature learning for unsupervised surface defect detection,

    J. Zhou, M. Liu, Y. Ma, S. Jiang, and Y. Wang, “Multi-view attention guided feature learning for unsupervised surface defect detection,” IEEE/ASME Trans. Mechatronics, vol. 30, no. 4, pp. 2844–2852, Aug. 2025, doi: 10.1109/TMECH.2025.3566311

  22. [22]

    WinCLIP: Zero-/few-shot anomaly classification and segmentation,

    J. Jeong, Y. Zou, T. Kim, D. Zhang, A. Ravichandran, and O. Dabeer, “WinCLIP: Zero-/few-shot anomaly classification and segmentation,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2023, pp. 19606–19616

  23. [23]

    PromptAD: Learning prompts with only normal samples for few-shot anomaly detection,

    X. Li, Z. Zhang, X. Tan, C. Chen, Y. Qu, Y. Xie, and L. Ma, “PromptAD: Learning prompts with only normal samples for few-shot anomaly detection,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2024, pp. 16838–16848

  24. [24]

    AnomalyCLIP: Object-agnostic prompt learning for zero-shot anomaly detection,

    Q. Zhou, G. Pang, Y. Tian, S. He, and J. Chen, “AnomalyCLIP: Object-agnostic prompt learning for zero-shot anomaly detection,” in Proc. Int. Conf. Learn. Represent. (ICLR), 2024

  25. [25]

    VCP-CLIP: A visual context prompting model for zero-shot anomaly segmentation,

    Z. Qu, X. Tao, M. Prasad, F. Shen, Z. Zhang, X. Gong, and G. Ding, “VCP-CLIP: A visual context prompting model for zero-shot anomaly segmentation,” in Comput. Vis.–ECCV 2024, 2024, pp. 301–317

  26. [26]

    Resilient multimodal industrial surface defect detection with uncertain sensors availability,

    S. Jiang, Y. Ma, J. Zhou, Y. Bian, Y. Wang, and M. Liu, “Resilient multimodal industrial surface defect detection with uncertain sensors availability,” IEEE/ASME Trans. Mechatronics, vol. 30, no. 6, pp. 4261–4271, Dec. 2025, doi: 10.1109/TMECH.2025.3607147

  27. [27]

    ClipSAM: CLIP and SAM collaboration for zero-shot anomaly segmentation,

    S. Li, J. Cao, P. Ye, Y. Ding, C. Tu, and T. Chen, “ClipSAM: CLIP and SAM collaboration for zero-shot anomaly segmentation,” Neurocomputing, vol. 618, Art. no. 129122, 2025

  28. [28]

    AA-CLIP: Enhancing zero-shot anomaly detection via anomaly-aware CLIP,

    W. Ma, X. Zhang, Q. Yao, F. Tang, C. Wu, Y. Li, R. Yan, Z. Jiang, and S. K. Zhou, “AA-CLIP: Enhancing zero-shot anomaly detection via anomaly-aware CLIP,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2025, pp. 4744–4754

  29. [29]

    FE-CLIP: Frequency enhanced CLIP model for zero-shot anomaly detection and segmentation,

    T. Gong, Q. Chu, B. Liu, W. Zhou, and N. Yu, “FE-CLIP: Frequency enhanced CLIP model for zero-shot anomaly detection and segmentation,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), 2025, pp. 21220–21230

  30. [30]

    AnomalyGPT: Detecting industrial anomalies using large vision-language models,

    Z. Gu, B. Zhu, G. Zhu, Y. Chen, M. Tang, and J. Wang, “AnomalyGPT: Detecting industrial anomalies using large vision-language models,” in Proc. AAAI Conf. Artif. Intell., vol. 38, 2024, pp. 1932–1940

  31. [31]

    GelSight: High-resolution robot tactile sensors for estimating geometry and force,

    W. Yuan, S. Dong, and E. H. Adelson, “GelSight: High-resolution robot tactile sensors for estimating geometry and force,” Sensors, vol. 17, no. 12, Art. no. 2762, 2017

  32. [32]

    DIGIT: A novel design for a low-cost compact high-resolution tactile sensor with application to in-hand manipulation,

    M. Lambeta, P.-W. Chou, S. Tian, B. Yang, B. Maloon, V. R. Most, D. Stroud, R. Santos, A. Byagowi, D. Jayaraman, and R. Calandra, “DIGIT: A novel design for a low-cost compact high-resolution tactile sensor with application to in-hand manipulation,” IEEE Robot. Autom. Lett., vol. 5, no. 3, pp. 3838–3845, 2020

  33. [33]

    Robotic defect inspection with visual and tactile perception for large-scale components,

    A. Agarwal, A. Ajith, C. Wen, V. Stryzheus, B. Miller, M. Chen, M. K. Johnson, J. L. Susa Rincon, J. Rosca, and W. Yuan, “Robotic defect inspection with visual and tactile perception for large-scale components,” in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), 2023, pp. 10110–10116

  34. [34]

    Model-agnostic meta-learning for fast adaptation of deep networks,

    C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-learning for fast adaptation of deep networks,” in Proc. 34th Int. Conf. Mach. Learn. (ICML), vol. 70, 2017, pp. 1126–1135

  35. [35]

    Prototypical networks for few-shot learning,

    J. Snell, K. Swersky, and R. Zemel, “Prototypical networks for few-shot learning,” in Adv. Neural Inf. Process. Syst., vol. 30, 2017

  36. [36]

    LoRA: Low-rank adaptation of large language models,

    E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, “LoRA: Low-rank adaptation of large language models,” in Proc. Int. Conf. Learn. Represent. (ICLR), 2022

  37. [37]

    Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models

    Y. Zhang, M. Li, D. Long, X. Zhang, H. Lin, B. Yang, P. Xie, A. Yang, D. Liu, J. Lin, F. Huang, and J. Zhou, “Qwen3 Embedding: Advancing text embedding and reranking through foundation models,” arXiv preprint arXiv:2506.05176, 2025
